Chapter 4: The representational system: details

 

We assume that the fundamental operation is concatenation, but there are hints that not all concatenation works the same way (phonology is different from morphology, morphology from syntax). But this system does not take that into consideration.

 

p. 46: With a single composition operator, the framework offers no internal means of distinguishing between “words” and other parameters – all are treated alike, and any test of “word-dom  must be applied externally. This agrees with the fact that it is extremely difficult to find language-independent definitions that agree without intuition of what a word is. In contrast, if multiple composition operators are used, then parameters can be classified according to the compositional process that they are built with.

 

The idea that the grammar can compress itself is important: the grammar is just a list of lexical items, with their own internal analysis. This changes our guiding expression from (remember, U is the evidence)

 

 

to

 

 

 

Now a longer example:

code

w

representation

encoding

count

|w|

000

the

t h e

01001101011

2

11

001

at

a t

1100010

2

7

010

t

 

 

2

 

0110

h

 

 

2

 

0111

cat

c  at

1101001

1

7

1000

hat

h at

0110001

1

7

1001

thecat

the cat

0000111

1

7

1010

thehat

the hat

0001000

1

7

1011

e

 

 

1

 

1100

a

 

 

1

 

1101

c

 

 

1

 

1110

i

 

 

1

 

1111

n

 

 

1

 

 

thecatinthehat

thecat i n thehat

10011110 1111 1010

 

16

 

 

Length of data: 16
Length of grammar: 46

Total description length of this evidence under this model: 62 bits.

 

This model has been called a multigram model (to distinguish it from unigram, bigram, and trigram models: zeroth-, 1st-, and 2-nd order Markov models).

The way we capture the mutual information between two successive chunks is by adding them to the lexicon, and assigning an observed frequency of the new chunk to the parameters of the system.

 

Learning under this model

Learning under this model consists of adding to (and occasionally deleting from) the lexicon. By adding an element to the lexicon, we permit a more accurate encoding of the form, by using a probability that corresponds to its actual frequency; without the encoding, if the piece in question were interface and we had only inter and face, we would in effect be using an encoding that assumed that interface’s frequency was the product of the frequency of inter and that of face.

 

But by adding an element to the lexicon, we also make the lexicon larger, hence longer. We add an element (what de Marcken calls a parameter) only if the increase in description length of the lexicon is less than the decrease in the log probability of the evidence U.

 

General strategy:

Start with the simplest lexicon.

Iterate until convergence condition is satisfied:

      Refine the parameters of the lexicon to reduce the overall description length.

 

But we can be more specific: we separate our refinement of the model into 2 parts:

      a. the determination of the structure, on the basis of a fixed set of stochastic parameters (these “fixed” parameters are the probabilities associated with the chunks); the structure is what the chunks are in each line of the corpus;

      b. the determination of the optimal set of stochastic parameters, given the structure: this is essentially a maximum-likelihood set of parameters.

 

Stochastic optimization: find the best analysis of the data, given the grammar as it is currently.

 

This is an Expectation operation, in the sense of Expectation/Maximization (EM) (Dempster et al.)
That is, for each chunk w in the lexicon (on a given iteration), determine the number of times it occurs in the corpus. However, this is a so-called soft count, in that it need not be integral.


In particular, we consider all possible parses Pi of a string (or substring which we know has a fixed endpoint, like a sentence), and we weigh them according to their individual probability.

 

We want to consider all possible parses r (“representations”) of all members of u of U’ [these u’s are the “lines” of the corpus or “entries” in the lexicon], and to calculate

 

.

Each representation r contains a discrete number of occurrences of lexical chunks. We can think of the representation as a string of pointers to chunks (or labels of chunks) in the lexicon. Let’s use the symbol “#” to represent the function that takes a chunk and a representation, and tells us how many times the chunk occurs in the representation, like this: #(u,r). Then the result of the stochastic optimization is a set of counts for each chunk in the lexicon equal to

 

Maximization: this step is relatively simple: we use the counts for each chunk, which we have just determined, and divide by the total number of chunks in the entire U’, to give us the new set of parameters (probabilities for each chunk).

 

 

 

Structural refinement

How many changes should we consider making at the same time? Sometimes trying to make one change at a time is counterproductive (we get stuck in a local minimum), because we need to make several changes at the same time in order for the changes to jointly constitute an improvement.

 

De Marcken’s strategy is to consider sequentially a set of additions to the lexicon, and then to consider removing each of the items in the lexicon. The candidates for addition to the lexicon are all those pairs of chunks in the current best parsing that occur next to each other. To do this, we must understand how to do a Viterbi (“single best”) parse.

 

Viterbi parse

 

To compute the single best parse of a string U (where “best” means that the sum of the inverse log probabilities are the smallest, given the lexicon), we scan through the string from left to right, calculating the best parse up to that point. A parse is a division of the string into pieces, such that each piece is in the lexicon. That is, a parse of a string U is a set of increasing integers {li}i=0,n such that l1 = 1 and ln = |U|, and U[li,li+1] is a member of the lexicon. For example, if U = thereisone, a parse of it is {0,5,7,10} and its chunks are there, is, one.

 

At point i in the string (think of this as the point immediately following the ith letter), we consider all parses that consist of a best parse up to a point j < i + a single chunk that goes from U[j+1] toU[i].

 

i = 1;

while ( i <= length U )

      j = 1;

      BestScore = undefined;

      while ( U [ i – j, i ] is a suffix of some  member of the lexicon )

            if  U[ i – j, I ] is a member of the lexicon, call that member N;

                  Look up the best parse of U[0, i-j-1], call that parse M;

                  Let NewScore =  inverse log prob(N) + inverse log prob (M);

                  if   BestScore is undefined or is greater than NewScore,

                        BestScore = NewScore; and

                        BestParse[i] = M concatenated with N (i.e., M + N ).

                  end if

            end if

      end while

end while