Chapter
4: The representational system: details
We
assume that the fundamental operation is concatenation, but there are hints
that not all concatenation works the same way (phonology is different from
morphology, morphology from syntax). But this system does not take that into
consideration.
p.
46: With a single composition operator,
the framework offers no internal means of distinguishing between “words” and
other parameters – all are treated alike, and any test of “word-dom” must be applied
externally. This agrees with the fact that it is extremely difficult to find
language-independent definitions that agree without intuition of what a word
is. In contrast, if multiple composition operators are used, then parameters
can be classified according to the compositional process that they are built
with.
The
idea that the grammar can compress itself
is important: the grammar is just a list of lexical items, with their own
internal analysis. This changes our guiding expression from (remember, U is the
evidence)

to

Now
a longer example:
|
code |
w |
representation |
encoding |
count |
|w| |
|
000 |
the |
t
h e |
01001101011 |
2 |
11 |
|
001 |
at |
a
t |
1100010 |
2 |
7 |
|
010 |
t |
|
|
2 |
|
|
0110 |
h |
|
|
2 |
|
|
0111 |
cat |
c at |
1101001 |
1 |
7 |
|
1000 |
hat |
h
at |
0110001 |
1 |
7 |
|
1001 |
thecat |
the
cat |
0000111 |
1 |
7 |
|
1010 |
thehat |
the
hat |
0001000 |
1 |
7 |
|
1011 |
e |
|
|
1 |
|
|
1100 |
a |
|
|
1 |
|
|
1101 |
c |
|
|
1 |
|
|
1110 |
i |
|
|
1 |
|
|
1111 |
n |
|
|
1 |
|
|
|
thecatinthehat |
thecat
i n thehat |
10011110
1111 1010 |
|
16 |
Length
of data: 16
Length of grammar: 46
Total
description length of this evidence under this model: 62 bits.
This
model has been called a multigram model (to distinguish it from unigram, bigram,
and trigram models: zeroth-, 1st-, and 2-nd order Markov models).
The
way we capture the mutual information between two successive chunks is by adding
them to the lexicon, and assigning an observed frequency of the new chunk to
the parameters of the system.
Learning
under this model
Learning
under this model consists of adding to (and occasionally deleting from) the
lexicon. By adding an element to the lexicon, we permit a more accurate
encoding of the form, by using a probability that corresponds to its actual
frequency; without the encoding, if the piece in question were interface and we
had only inter and face, we would in effect be using an encoding that assumed
that interface’s frequency was the product of the frequency of inter and that
of face.
But
by adding an element to the lexicon, we also make the lexicon larger, hence
longer. We add an element (what de Marcken calls a parameter) only if the
increase in description length of the lexicon is less than the decrease in the
log probability of the
General
strategy:
Start
with the simplest lexicon.
Iterate
until convergence condition is satisfied:
Refine the parameters of the lexicon to
reduce the overall description length.
But
we can be more specific: we separate our refinement of the model into 2 parts:
a. the
determination of the structure, on
the basis of a fixed set of stochastic parameters (these “fixed” parameters are
the probabilities associated with the chunks); the structure is what the chunks
are in each line of the corpus;
b. the
determination of the optimal set of stochastic parameters, given the structure:
this is essentially a maximum-likelihood set of parameters.
Stochastic
optimization: find the best analysis of the data, given the grammar as it is
currently.
This
is an Expectation operation, in the
sense of Expectation/Maximization (EM) (Dempster et al.)
That is, for each chunk w in the
lexicon (on a given iteration), determine the number of times it occurs in the
corpus. However, this is a so-called soft
count, in that it need not be integral.
In particular, we consider all
possible parses Pi of a string (or substring which we know has a
fixed endpoint, like a sentence), and we weigh them according to their
individual probability.
We
want to consider all possible parses r (“representations”) of all members of u
of U’ [these u’s are the “lines” of the corpus or “entries” in the lexicon],
and to calculate
.
Each
representation r contains a discrete
number of occurrences of lexical chunks. We can think of the representation as
a string of pointers to chunks (or labels of chunks) in the lexicon. Let’s use
the symbol “#” to represent the function that takes a chunk and a
representation, and tells us how many times the chunk occurs in the
representation, like this: #(u,r). Then the
result of the stochastic optimization is a set of counts for each chunk in the lexicon equal to

Maximization: this step is
relatively simple: we use the counts for each chunk, which we have just
determined, and divide by the total number of chunks in the entire U’, to give
us the new set of parameters (probabilities for each chunk).
Structural
refinement
How
many changes should we consider making at the same time? Sometimes trying to
make one change at a time is
counterproductive (we get stuck in a local minimum), because we need to make several
changes at the same time in order for the changes to jointly constitute an
improvement.
De
Marcken’s strategy is to consider sequentially a set of additions to the
lexicon, and then to consider removing each of the items in the lexicon. The
candidates for addition to the lexicon are all those pairs of chunks in the
current best parsing that occur next to each other. To do this, we must
understand how to do a Viterbi (“single best”) parse.
Viterbi
parse
To
compute the single best parse of a string U (where “best” means that the sum of
the inverse log probabilities are the smallest, given the lexicon), we scan
through the string from left to right, calculating the best parse up to that
point. A parse is a division of the
string into pieces, such that each piece is in the lexicon. That is, a parse of
a string U is a set of increasing integers {li}i=0,n
such that l1 = 1 and ln = |U|, and U[li,li+1]
is a member of the lexicon. For example, if U = thereisone, a parse of it is {0,5,7,10} and its chunks are there, is, one.
At
point i in the string (think of this
as the point immediately following the ith
letter), we consider all parses that consist of a best parse up to a point j
< i + a single chunk that goes from U[j+1] toU[i].
i
= 1;
while ( i <= length U )
j = 1;
BestScore = undefined;
while ( U [ i –
j, i ] is a suffix of some member of the lexicon )
if U[ i – j, I ] is a member of the lexicon, call that member N;
Look up the best parse of U[0, i-j-1], call that parse M;
Let NewScore = inverse log prob(N)
+ inverse log prob (M);
if BestScore
is undefined or is greater than NewScore,
BestScore = NewScore;
and
BestParse[i]
= M concatenated with N (i.e., M + N ).
end if
end if
end while
end while