Carl de Marcken, Unsupervised Language Acquisition (MIT, 1996)

 

1. Bayesian inference:  

We take G to represent a grammar, and U to represent the evidence through which we hope to learn the right grammar.

 

 

We can interpret this as a shift in belief from a prior belief in the grammar   p(G)   to a belief p(G|U), if we know p(U|G) and also p(U). p(U), the probability of the evidence, is hard to know, but since it is fixed, regardless of what probability we assign to it, we can convince ourselves that the best grammar is

 

 

We want to specify G, the class of grammars; we need to specify how each grammar g in G assigns a probability to a set of data U; and we want to specify how we assign a prior probability p(g) to any grammar g.

 

 

 

2. Stochastic language models

Much of this is due to Ray Solomonoff's work in the mid 1950s.

 

We would like to find a grammar which views out training data as typical:

 

aba

abba

abbba

abbbba

 

 

Grammar 1 S a B a , B B b | b

S   a B a

1.0

B B b

0.5

B b

0.5

 

 

 

Grammar 2 S a | b | S S

S

0.5

S b

0.25

S S S 

0.25

 

 

 

Consider one sentence: aba. Its probability by Grammar 1 is the probability of its derivation

S a B a and B b

 

prob( S a B a ) = 1

prob (B b) = 0.5;

 

so prob (aba) = 0.5

 

Grammar 2:

[ S S]

0.5

[ [ S S ] S ]

0.5

[ [ a S ] S ]

0.25

[ [ a b ] S ]

0.25

[ [ a b ] a ]

0.25

 

= 1/256

 

There is another similar derivation; so prob(aba) = 2/256 = 1/128.

 

How about Grammar 3: S aba | abba | abbba | abbbba ? What would the probability be of our training corpus be, under Grammar 1 and under this finite language grammar?

 

Grammar 1 assigns a probability 2-k to a string of the form  a bk a, so the total probability of the training corpus is  ½ * ¼ * 1/8 * 1/16 = 2 ^^(-1-2-3-4) = 2^^(-10) = 1/1024.

 

Grammar 3 assigns what probability?

 

(Sometimes linguists are surprised to see that probabilities go way down when sentences or corpora get long: this is just the quantitative reflection of the fact that when you have a lot of time, there are many different things you could say!)

 

 

 

Now we need to think about the prior probabilities that should be assigned to our grammars.

 

Minimum Description Length (MDL) analysis says that we should choose the grammar G that minimizes the sum of the description length of the grammar and the evidence U:

 

  

and we take |U|g to be –log2 probg (U) – the probability of the evidence under the model.

 

One way to view this is to say that we wish to maximize the probability of the evidence U. From a purely formal point of view, the probability of data could be said to be the sum of these repeated applications of Bayes’ rule:

 

 

 

 

but we assume that only one grammar is right, and approximate this as:

 

 

 

 

and we select g* =

 

 

Restating this in terms of compression, we select 

 

 

 

which is to say, we want to find the grammar such that the sum of the compressed length it assigns the data, plus its own compressed length, is a minimum. We say we are thus minimizing its description length.

 

The challenge now is to find a reasonable prior probability distribution over the class of possible grammars. Following de Marcken, we consider a class of grammars (which are lexicons, lists of words) for which the bulk of the content is in lists¸ and we compress the grammar G with G itself.

 

We adopt the assumption that we treat the probability of such a grammar G as 2-length(G), where the length(G) is the compressed length of G using a prefix-free encoding.

 

Example 1:

Word

c

a

i

e

r

m

t

ice

codeword

00

010

011

100

101

110

1110

1111

Grammar

 

 

 

 

 

 

 

011 00 100

 

iateicecream     is encoded in 30 bits as    011 010 1110 100 1111 00 101 100 010 110

 

This assumes (as we do throughout) that the letters do not need to be encoded any further: they are part of the universal computer.

 

Let us consider three grammar’s treatments of themanonthemoon

Example 2:

Word

o

n

t

h

e

m

a

codeword

00

01

100

101

110

1110

1111

 

Example 3:

word

o

n

 the

m

t

h

e

a

codeword

00

01

100

101

1100

1101

1110

1111

grammar

 

 

110011011110

 

 

 

 

 

 

 

Example 4:

word

o

n

t

h

e

m

a

themanonthemoon

codeword

00

01

100

101

1100

1101

1110

1111

grammar

 

 

 

 

 

 

 

 100101110111011110

10001100101

1101110000001

 

 

Now we can compare three analysis:

 

Example 2

Example 3 ** winner

Example 4

Evidence

100101110111011110

10001100101

1101110000001

10010111100100 0100101000001

1111

Evidence length

42

28

4

Grammar length

0

12

42

Total length

42

40

46 [sic]