Carl de Marcken, Unsupervised Language Acquisition (MIT, 1996)
1.
Bayesian inference:
We
take G to represent a grammar, and U to represent the evidence through which we
hope to learn the right grammar.

We can interpret this as a shift in belief from a prior belief in the grammar p(G) to a belief p(G|U), if we know p(U|G) and also p(U). p(U), the probability of the evidence, is hard to know, but since it is fixed, regardless of what probability we assign to it, we can convince ourselves that the best grammar is

We want to specify G, the class of grammars; we need to specify how each grammar g in G assigns a probability to a set of data U; and we want to specify how we assign a prior probability p(g) to any grammar g.
2. Stochastic language models
Much of this is due to Ray Solomonoff's work in the mid 1950s.
We would like to find a grammar which views out training data as typical:
aba
abba
abbba
abbbba
Grammar 1 S → a B a , B→ B b | b
|
S → a B a |
1.0 |
|
B → B b |
0.5 |
|
B → b |
0.5 |
Grammar 2 S→ a | b | S S
|
S →a |
0.5 |
|
S →b |
0.25 |
|
S →S S |
0.25 |
Consider one sentence: aba. Its probability by Grammar 1 is the probability of its derivation
S→ a B a and B → b
prob( S→ a B a ) = 1
prob (B→ b) = 0.5;
so prob (aba) = 0.5
Grammar 2:
|
[ S S] |
0.5 |
|
[ [ S S ] S ] |
0.5 |
|
[ [ a S ] S ] |
0.25 |
|
[ [ a b ] S ] |
0.25 |
|
[ [ a b ] a ] |
0.25 |
|
|
= 1/256 |
There is another similar derivation; so prob(aba) = 2/256 = 1/128.
How about Grammar 3: S → aba | abba | abbba | abbbba ? What would the probability be of our training corpus be, under Grammar 1 and under this finite language grammar?
Grammar 1 assigns a probability 2-k to a string of the form a bk a, so the total probability of the training corpus is ½ * ¼ * 1/8 * 1/16 = 2 ^^(-1-2-3-4) = 2^^(-10) = 1/1024.
Grammar
3 assigns what probability?
(Sometimes
linguists are surprised to see that probabilities go way down when sentences or
corpora get long: this is just the quantitative reflection of the fact that
when you have a lot of time, there are many different things you could say!)
Now we need to think about the prior probabilities that should be assigned to our grammars.
Minimum Description Length (MDL) analysis says that we should choose the grammar G that minimizes the sum of the description length of the grammar and the evidence U:
and we take |U|g to be –log2 probg (U) – the probability of the evidence under the model.
One
way to view this is to say that we wish to maximize the probability of the
![]()
but
we assume that only one grammar is right, and approximate this as:
![]()
and
we select g* = 
Restating
this in terms of compression, we select
which
is to say, we want to find the grammar such that the sum of the compressed
length it assigns the data, plus its own compressed length, is a minimum. We say
we are thus minimizing its description length.
The challenge now is to find a reasonable prior
probability distribution over the class of possible grammars. Following de
Marcken, we consider a class of grammars (which are lexicons, lists of
words) for which the bulk of the content is in lists¸ and we compress
the grammar G with G itself.
We adopt the assumption that we treat the
probability of such a grammar G as 2-length(G), where the length(G)
is the compressed length of G using a prefix-free encoding.
Example 1:
|
Word |
c |
a |
i |
e |
r |
m |
t |
ice |
|
codeword |
00 |
010 |
011 |
100 |
101 |
110 |
1110 |
1111 |
|
Grammar |
|
|
|
|
|
|
|
011 00 100 |
iateicecream is encoded in 30 bits as 011 010 1110 100 1111 00 101 100 010 110
This
assumes (as we do throughout) that the letters do not need to be encoded any
further: they are part of the universal computer.
Let us
consider three grammar’s treatments of themanonthemoon
Example
2:
|
Word |
o |
n |
t |
h |
e |
m |
a |
|
codeword |
00 |
01 |
100 |
101 |
110 |
1110 |
1111 |
Example
3:
|
word |
o |
n |
the |
m |
t |
h |
e |
a |
|
codeword |
00 |
01 |
100 |
101 |
1100 |
1101 |
1110 |
1111 |
|
grammar |
|
|
110011011110 |
|
|
|
|
|
Example
4:
|
word |
o |
n |
t |
h |
e |
m |
a |
themanonthemoon |
|
codeword |
00 |
01 |
100 |
101 |
1100 |
1101 |
1110 |
1111 |
|
grammar |
|
|
|
|
|
|
|
100101110111011110 10001100101 1101110000001 |
Now we
can compare three analysis:
|
|
Example
2 |
Example
3 ** winner |
Example
4 |
|
Evidence |
100101110111011110 10001100101 1101110000001 |
10010111100100
0100101000001 |
1111 |
|
Evidence
length |
42 |
28 |
4 |
|
Grammar
length |
0 |
12 |
42 |
|
Total
length |
42 |
40 |
46
[sic] |