Lecture 2
Probabilistic Models of Grammar:
Morphology from a Machine Learning perspective
Japanese Phonological Society
August 2001
I want minimum information given with maximum politeness.
I think it was in September 1923 that a friend of Mayakovsky arrived in
With all this confounding trafficking in hypotheses about invisible connections with all manner of inconceivable properties, which have checked progress for so many years, I believe it to be most important to open people's eyes to the number of superfluous hypotheses they are making, and would rather exaggerate the opposite view, if need be, than proceed along these false lines.
1 Introduction
I would like to discuss with you material that I have been working on for several years.[1] This is the development of an algorithm (which is embodied in a computer program) whose purpose is to accept as its input a raw text from an unknown language and to produce a morphological analysis of the words. When I say that the language of the text is “unknown”, I mean, of course, that it is unknown to the algorithm – the text may be well-known to the human linguist, but that knowledge of the language is not embedded in the algorithm.
The first problem of the learning of morphology is the problem of segmentation – figuring out where to break a word into its component pieces (traditionally called morphs). If the text that we give to the computer is English, then we expect to find that the words read, reads, and reading (distributed throughout the corpus, of course – not located next to each other) are divided in such a way that read is a single morph, while s and ing are suffixes to that stem. If the text we give it is French, then it will not draw those conclusions, but will discover a different set of suffixes and prefixes, including the suffixes er, a, é, and so on. (By the way, I consider the problem to be essentially the same regardless of whether we are looking at a text in standard orthography or in something like a phonological transcription. I also assume that the text is segmented into words.)

Is this an easy problem? Linguists in recent decades have not devoted much attention to this problem, mainly because (I suppose) it seems so easy – anybody who speaks the language can do it, or at least we assume that it is that easy. In earlier days, beginning students in linguistics were taught techniques for accomplishing this (e.g., Nida 1949), but very few people do that anymore.
Again, is this an easy problem? The fact that we can easily find the suffixes in a language that we know already is obviously irrelevant. I know English well, and it is not difficult for me to know how Subject-Auxiliary Inversion should apply to any sentence that I may be given. But my ability to do that tells us nothing about how easy or difficult it would be to write a completely formal description of that ability. The only thing to do is to sit down and to try to write an algorithm that will accomplish the task at hand, which is the segmentation of words into component morphs, with no prior knowledge of the language. (I wish to emphasize this point, because I often find that when I explain this to people, they assume that I cannot really mean what I am saying – the idea is to design the algorithm to do the work of the linguist or language-learner.)
And that is what I have been working on, and that is the problem that I wish to discuss with you today. The program that I will discuss with you is reasonably successful within a certain range of languages, but much work remains to be done, as you will see.
2 The question
Let us look at a particular word – let us say, reading. What clues do we have regarding its morphological analysis? We know, of course, that its stem is read and its suffix is ing. But what evidence is there of this in the data? It is true that there are a large number of words ending in ing in a corpus of English – but there are even more words that end in ng (which include all the words ending in -ing, -ong, etc.), and -ng is not a suffix. Raw frequency is not what we care about.
There are two ways of thinking of an analysis such as read + ing and how to find the correct segmentation into morphs. The first focuses on where the morpheme break is (it is between d and i); the second focuses on the pieces that are created (read is a morph and ing is a morph). It may not be obvious right from the start how different these two approaches are. Let us discuss each in turn.
3
Suppose we have a list of the words in our corpus, and suppose that for each one, we scan it from left to right. After moving along N times, we are looking at the first N letters of our word. For example, if the word is government, and N=5, then we are looking at gover.
We might ask, how many different words are there that begin with gover? But

(1) g o v e r^1 n^5 m^1 e n t (successor frequencies shown as superscripts)
So
When I say, “some version of this procedure”, I mean that
b^9 a^14 l^9 l
But
c^9 o^14 m^9 p e t i t i o n
This effect is even clearer in the case of a word like competition, where co- is not a prefix. After c, there are 12 successors; after co, there are 23 successors, and after com, there are only 8, and therefore there is a peak of successor frequency after the letters co – but that does not mean that co is a prefix in this word.
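Successor frequency counting can be sketched in a few lines. The word list below is a hypothetical toy vocabulary chosen only for illustration, not data from the Brown Corpus, and the end of a word is counted as a successor, written "#".

```python
def successor_frequency(prefix, words):
    """Count the distinct symbols that follow `prefix` in the word list.

    The end of a word counts as a successor, written '#'.
    """
    successors = set()
    for w in words:
        if w.startswith(prefix):
            successors.add(w[len(prefix)] if len(w) > len(prefix) else "#")
    return len(successors)


# Hypothetical toy vocabulary (an assumption for illustration only).
words = ["govern", "governs", "governed", "governing", "government"]

word = "government"
profile = [(word[:n], successor_frequency(word[:n], words))
           for n in range(1, len(word) + 1)]
for prefix, count in profile:
    print(f"{prefix:<12}{count}")
```

With this toy list the count is 1 after gover, rises to 5 after govern, and falls back to 1 after governm, which is the peak pattern displayed for government. On a real corpus the counts depend entirely on which words happen to be present.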
The Harris approach of successor frequency counting also fails when a given stem is present in a corpus with two or more suffixes that happen to begin with the same letter (e.g., and so on.)
I will not pursue this point any further. There is a good deal of insight in
4 Naïve Description Length
Let us return to the question we are considering. We are looking at the word reading, and trying to determine if it is composed of two morphs, and if so, what are they: is it
etc?
One simple and natural way to think about this is to think about a morphology as a way of creating a shortened wordlist for the language. If we construct a wordlist for all the words in any large corpus, we will have many similar (but not identical) related forms of words, such as read, reads, and reading. But if we construct a morphological analysis, we can enter just one time the stem read, and then specify a general pattern that is shared by a large number of stems. The pattern may be that the stem may appear followed by
(i) no suffix (call that “NULL”);
(ii) the suffix s; and
(iii) the suffix ing.
If we do this, then we save on the length of the entire wordlist.
The idea of a pattern of suffixes that appear on several stems is so important that we will give a name to that: we will call it a signature, and we will say that a given stem appears with a (unique) signature in any given corpus. The signature is the alphabetized list of suffixes that appear on a given stem in a corpus.
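As a minimal sketch of this definition, suppose the words have already been split into (stem, suffix) pairs. The function name, the input format, and the toy analyses are assumptions made for illustration; the dot-separated spelling follows the NULL.ly notation used later in this lecture.

```python
from collections import defaultdict


def signatures(analyses):
    """Group stems by signature: the alphabetized list of their suffixes.

    `analyses` is an iterable of (stem, suffix) pairs; an empty suffix
    stands for NULL.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_of[stem].add(suffix or "NULL")

    stems_of = defaultdict(list)
    for stem, suffixes in suffixes_of.items():
        stems_of[".".join(sorted(suffixes))].append(stem)
    return {sig: sorted(stems) for sig, stems in stems_of.items()}


# Hypothetical pre-segmented words (an assumption for illustration only).
analyses = [
    ("jump", ""), ("jump", "s"), ("jump", "ing"),
    ("read", ""), ("read", "s"), ("read", "ing"),
    ("walk", ""), ("walk", "s"), ("walk", "ing"),
    ("ultimate", ""), ("ultimate", "ly"),
]
print(signatures(analyses))
```

Each stem lands in exactly one signature, so the result can be read either as a set of suffixes or as the set of stems that share them.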
(1) before morphology:
jump jumps jumping
read reads reading
walk walks walking
Total count of letters: 48
(2) after morphology:
read walk jump:
plus the pattern:
__ NULL (or) s (or) ing
Total count of letters: approximately 16
We can get a little more explicit and actually count the number of letters (or phonemes) in each case. In case (1), we have a total of 48 letters, while in case (2) we have 12 letters in the first row (specifying the stems) plus some number for the suffix pattern: the number is 4 if we count only the actual letters (and take “NULL” to be no letters); we also have to inquire as to what the “cost” in letters is of the words “or” in parentheses! We will eventually answer that question, but for present purposes suppose we just count the 4 letters in the suffixes – this will illustrate the basic idea that if we keep track of our data in terms of morphs rather than words, the total length of our list will be considerably shorter. And the more stems we add (such as “proceed, halt, fasten, maintain…”) the more savings of letters there will be if we choose to write the list in morphs rather than words.
An incorrect morphological analysis will usually lead to more letters in the list. If, for example, we incorrectly parse jumping and reading as jumpi + ng and readi + ng (but correctly parse read and read + s), then we will have only a partial saving. In other words, if we have a set of word forms that really are related and the algorithm fails to cut them in the same way – so that there are two different stems set up in the analysis – then we will have more letters in the underlying list of stems than is really necessary. And if the goal is to have the shortest and most compact list, then this error will be a bad move.
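The letter counts can be checked with a short sketch. It makes the same simplification as the text: NULL costs no letters, each distinct suffix is counted once, and the cost of the "or" connectives is ignored. The variable names are illustrative assumptions.

```python
def letters(morphs):
    """Total number of letters in a list of morphs ("" stands for NULL)."""
    return sum(len(m) for m in morphs)


words = ["jump", "jumps", "jumping",
         "read", "reads", "reading",
         "walk", "walks", "walking"]

# Analysis (2): three stems sharing the pattern NULL / s / ing.
good_stems, good_suffixes = ["jump", "read", "walk"], ["", "s", "ing"]

# The faulty analysis: jumpi + ng and readi + ng add two stems and one suffix.
bad_stems = ["jump", "jumpi", "read", "readi", "walk"]
bad_suffixes = ["", "s", "ing", "ng"]

as_words = letters(words)                                   # 48
as_morphs = letters(good_stems) + letters(good_suffixes)    # 12 + 4 = 16
as_bad_morphs = letters(bad_stems) + letters(bad_suffixes)  # 22 + 6 = 28
print(as_words, as_morphs, as_bad_morphs)
```

The faulty analysis still beats the plain word list (28 against 48 letters), but it is well short of the 16 letters reached when related forms are cut in the same way.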
So this second approach to the problem of finding a morphology focuses on how useful a morph is in terms of compressing a word list. I will refer to this as the Naïve Description Length approach.
5 Comparing Successor Frequency (
We have already noted that these two approaches differ in their focus:
b. The Naïve Description Length approach gives us absolutely no idea how to find a good analysis for a given set of words. The Naïve Description Length approach is very good at evaluating alternative morphological analyses, but it is no good at all at discovering alternative analyses. The
(By the way, this is an interesting contrast, one which we have encountered before in linguistics more than once. We have seen it in the difference between the linguistic theories of Zellig Harris and his student, Noam Chomsky; Harris’s theory throughout was of the first sort; Chomsky’s theory, from The Logical Structure of Linguistic Theory (1955) up until Lectures on Government and Binding (1981) (when he shifted to “principles and parameters”, a very different foundational theory) was of the second sort, as he famously argued in Syntactic Structures (1957, p. 55). Similarly, the contrast is found in comparing generative phonology and optimality theory.)
There is something that both approaches fail to speak to, and that is the overall coherence of the system produced. Clearly, the Naïve Description Length approach comes closer to making general statements about the language, but it does not take the problem quite far enough.
Here is an example. In the first 500,000 words of the Brown Corpus of written English, we find seven stems which occur with the signature
(3) NULL.ed.ing.ion.ive.s
Remember – this is important in everything that follows – that when we include “NULL” in a signature, that means that the stem can appear as a free-standing word in the corpus, without any suffix at all. The stems are:
(4) disrupt project connect protect prevent suggest predict
(e.g., disrupt, disrupted, disrupting, disruption, disruptive, disrupts, and the same for the other stems).
This information is extremely helpful in establishing that ion and ive are distinct suffixes in English. What we need is a method that will allow that information to have a bearing on how we analyze the words constructive and construction. Again, we need a method that says, if we have analyzed one set of words in one way – such as by means of the signature in (5), then we should analyze constructive and construction with the sub-signature ive - ion, even if that means having two extra letters in the signature (ive - ion versus ve - on) and one fewer letter in the stem (construct- versus constructi-).
(5) NULL
To put that another way, a smart morphological analyzer will look at the words constructive and construction and say to itself, Yes, I could analyze this as constructi + on, but I’m sure that the suffix is really -ion, not -on, from seven other words, and so I will analyze construction in the parallel way. In short, we need to make sure that the overall system is as coherent and self-consistent as possible, not just that local savings are made.
Unfortunately, there is a related problem in English morphology which is worse than the constructive/construction problem. In fact, when you think about it, it is a problem that every language will have, if we have a large enough sample corpus from the language. Consider the following situation in English. It so happens that t is the most common letter (or phoneme) to end a stem in English (e.g., halt, seat, interest, attempt, resort, test, blast, alert, accent, request). How does the Naïve Description Length approach deal with these stems, if we have a corpus in which all of the inflected forms of these stems appear? That is, if we have the forms in (6) for a large number of stems that end in t, then what does the Naïve Description Length approach tell us to do?
(6) halts, halted, halting, halt
(7) t, ted, ting, ts
The answer is: the Naïve Description Length approach tells us that if we have 11 or more words like those in (6) – which you and I would call “stems that end in t” – then we should set up a new signature of the form in (7), because even if this means creating four new suffixes (which “costs” us ten letters to build), we will save one letter in each stem (because the stem will be hal-, not halt-; or sea- instead of seat-; etc.). In fact, this is true not only for stems ending with t, but for stems ending with any other letter, if we have eleven or more of them: whenever there are 11 or more such stems, Naïve Description Length tells us to set up a new signature.
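The break-even point can be verified with a small calculation. This is only a sketch of the letter-counting argument: it assumes that the four suffixes in (7) are all new, and that each affected stem becomes exactly one letter shorter.

```python
# Suffixes of the spurious signature in (7).
new_suffixes = ["t", "ted", "ting", "ts"]
cost_of_new_suffixes = sum(len(s) for s in new_suffixes)  # 1 + 3 + 4 + 2 = 10


def net_saving(n_stems):
    """Letters saved (one per shortened stem) minus letters spent on the suffixes."""
    return n_stems - cost_of_new_suffixes


# Smallest number of t-final stems for which the spurious analysis is shorter.
break_even = next(n for n in range(1, 100) if net_saving(n) > 0)
print(cost_of_new_suffixes, break_even)
```

With ten stems the two analyses cost the same number of letters; from the eleventh stem on, the letter count favours the analysis that a linguist would reject.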
So there are some problems looming if we depend on the Naïve Description Length method, in addition to the other problems for the
6 Summary so far
I have looked at this in some detail because the problem of segmenting words into pieces seems very easy until you look at it carefully – and especially until you try to write an explicit computer program to accomplish the task. I remember my very first encounter with generative syntax; it involved the analysis of the English verbal auxiliary that I alluded to above. Obviously I knew how to form negated and inverted sentences in English (I’ve known that almost all my life), but I had never thought about how difficult a task it was to propose a formal analysis that actually works, even for the “obvious” cases. And that is the essence of a linguistic problem: finding the best explicit analysis for a set of data which is obvious to the native speaker.
What we need in order to continue is to combine the strong points of the
7 Signatures: aim for maximal signature
We have encountered earlier the notion of signature: the set of affixes that a given stem occurs with in a corpus. We assign every stem to exactly one signature, and so we can speak more broadly of a signature, in a particular corpus, as being not only a set of affixes, but also the set of stems that appear with precisely these affixes.
The signature is the most useful and important tool in understanding how broad and significant a set of affixes is. A signature which consists of several suffixes (three or more) of length greater than one letter (phoneme) and a good number of stems (fifteen or more) is a very strong candidate for being an important morphological pattern in a language. Conversely, a signature that occurs with only one stem is suspicious, all other things being equal, even if it contains a large number of suffixes. For example, the “signature” in (8), in a particular corpus, has only one stem – mat (match, mate, material, materials, matrimony, matrons, maturing), and the presence of just one stem tells us (even if we know nothing about the language) that we have no reason to believe that this is a real pattern.
(8) ch e erial erials rimony rons uring [Spurious!]
as in: match, mate, material, materials, matrimony, matrons, maturing
Remark: There is a special case that works differently: if we have a large signature with a significant set of stems associated with it, then it is perfectly reasonable for only a single stem to appear with a proper subset of the affixes in the larger signature. Example
Likewise, stems which appear with only a single suffix may not be true stems at all: we could gather together all the words in a language that end in m, but declaring that m to be a suffix would not make it one! (On the other hand, if the candidate suffix consists of several phonemes, there is a good chance it is indeed a suffix; we return to this later.)
So we will set ourselves the first task, that of finding as many signatures as possible in a language that contain several stems and several suffixes. How are we to begin?
There are a number of ways of finding good candidate suffixes; we can, for example, consider all word-final sequences of letters (phonemes) up to some reasonable maximum (6, typically). This is rather slow and it takes a very large amount of computer memory, so let us take advantage of
I have brought with me a demonstration (demo) version of Linguistica 2001, which allows us to perform this operation as we speak.
Let’s take the first 100,000 words of the Brown Corpus of written English – that is roughly the size of a novel, and it contains approximately 12,200 distinct words. We obtain 346 signatures, and 175 suffixes. Here is the top set of the signatures:

(9)
Here is the result of the same operation on the Breen word-list of Japanese:

(10)
The left part of the screen shows this; after reading in the corpus, we click on “Successor Freq 1”, as shown below:

(11)
And by clicking on a signature, we obtain a complete list of the stems that appear associated with it. The top signature in this word-list, for example, is this:

(12)
Returning to the analysis of English, we see excellent results from combining
In general, stems that appear with only one suffix – that is, stems associated with a signature containing only one suffix – are not trustworthy, for reasons we have already touched on.
(13) u l t i m a^1 t^2 u^1 m      ultimatum
     u l t i m a^1 t^2 e^2 #      ultimate
     u l t i m a^1 t^2 e^2 l y    ultimately
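The successor frequencies in (13) can be recomputed with a short sketch, which also applies a conservative cut rule: cut only after a prefix whose count is above 1 while the counts immediately before and after it are both 1. The three-word corpus and the function names are assumptions for illustration; this is not the Linguistica implementation.

```python
def successor_frequency(prefix, words):
    """Number of distinct symbols that follow `prefix` ('#' marks end of word)."""
    successors = set()
    for w in words:
        if w.startswith(prefix):
            successors.add(w[len(prefix)] if len(w) > len(prefix) else "#")
    return len(successors)


def conservative_split(word, words):
    """Cut after the first prefix whose count exceeds 1 while both neighbours count 1.

    Returns (stem, suffix), or None if no position qualifies.
    """
    for k in range(2, len(word)):
        here = successor_frequency(word[:k], words)
        before = successor_frequency(word[:k - 1], words)
        after = successor_frequency(word[:k + 1], words)
        if here > 1 and before == 1 and after == 1:
            return word[:k], word[k:]
    return None


# Toy corpus containing only the three words of example (13).
corpus = ["ultimate", "ultimately", "ultimatum"]
for w in corpus:
    print(w, conservative_split(w, corpus))
```

On these three words the rule cuts ultimatum into ultimat + um and leaves ultimate and ultimately whole, because the counts after ultimat and after ultimate are both 2 and so neither is flanked by 1 on both sides.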
The process so far has been designed in order to be conservative, to make as few mistakes as possible, rather than to apply to all words. For example, in this corpus, the words ultimate, ultimately, and ultimatum are found. After ultimat-, two letters are found: e and u; after ultimate-, two letters are found (l and #); and after ultimatu-, one letter is found. So if we look for peaks of successor frequency, surrounded by successor frequency of 1, the algorithm will split ultimat-um, but fail to split ultimately. In a sense, it is the existence of ultimatum that causes the failure of ultimately to split by
Clearly, at this point, we need to go back over the whole corpus, and ask further questions, based on our growing knowledge of the word-structure of the language. Minimally, we want to ask all of our words: how many of you can be analyzed in terms of the signatures we have just discovered? We want to analyze ultimate and ultimately in terms of the signature NULL.ly because we know that that signature occurs very often in the language (there were 76 occurrences in the first pass, illustrated above – that is a very large number of occurrences); we no longer care whether Harris’ successor frequency algorithm identifies ly as a suffix: we are already almost completely certain that it is a suffix.
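Asking the words whether they fit a known signature can be sketched as follows. The function name and the small vocabulary are hypothetical, and the check is the simplest possible one: a stem fits if the stem followed by every suffix of the signature is itself a word in the vocabulary.

```python
def stems_fitting(signature, vocabulary):
    """Return the stems that occur with every suffix of a dot-separated signature.

    "NULL" in the signature means the bare stem must itself be a word.
    """
    suffixes = ["" if s == "NULL" else s for s in signature.split(".")]
    vocab = set(vocabulary)
    found = set()
    for word in vocab:
        for suffix in suffixes:
            if word.endswith(suffix):
                stem = word[:len(word) - len(suffix)]
                if stem and all(stem + s in vocab for s in suffixes):
                    found.add(stem)
    return sorted(found)


# Hypothetical vocabulary (an assumption for illustration only).
vocabulary = ["ultimate", "ultimately", "ultimatum", "quick", "quickly", "fly"]
print(stems_fitting("NULL.ly", vocabulary))
```

Here ultimate and quick fit NULL.ly even though successor frequency alone fails to split ultimately, while fly is left alone because f is not a word in the vocabulary.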
8 Looping and heuristics
Let us take a step back, and discuss the overall strategy that we employ. First of all, we need, and we have employed, a bootstrap heuristic—a simple method that gives us an imperfect but reasonably good sketch of the morphological structure of the language. This came from a modification of
Following this bootstrap procedure, there are a number of steps to take next, as we noted a moment ago, when we discussed the words ultimate and ultimately. We establish thresholds for signatures, based on the number of stems and suffixes that are present in the signatures, and then gather together all of the suffixes in these signatures. We reconsider all the words, and ask which ones can be split using one or more of the suffixes we have just determined. That, in turn, gives us a new set of candidate stems. These stems become certain, if they participate in known signatures; they become merely candidate stems otherwise. And so on.
Implicit in what we have just described is a global strategy that might be described informally this way: Find the smallest number of signatures that will cover the largest number of words in the corpus.
But what does this really mean? And what is its relationship to Naïve Description Length? Its connection to Naïve Description Length is direct: the Naïve Description Length approach tells us to find the smallest number of letters in the list of morphs that make up the corpus, and it is the role of the signatures to do that. Remember that each time we apply a signature to a group of related words that share a stem, we save a large number of letters because we do not have to repeat the stem for each one of its suffixed forms. To repeat, then, the Naïve Description Length is the formalization of the strategy to “cover the largest number of words in the corpus.” But what about the idea of using the smallest number of signatures? Well, some of that follows from the Naïve Description Length: we save on letters if we have fewer signatures, all other things being equal. But, as we have seen when we look at the analysis closely, Naïve Description Length is too willing to permit new signatures.
9 The theory behind all of this, and the connection to “Maximize the probability of the data”
We will make a large transition now, and make a connection between this work on the learning of morphology, and the work on phonology that I discussed in the first lecture. In that lecture, I made the case that regardless of what particular theory of phonology one would like to believe in, there is a larger principle in science (or in cognition in general), which is this:
Whatever the data or evidence is that we have, we must find the hypothesis that maximizes the probability of that data.
Up to this point, I have discussed this principle as if the phrase “the probability of the data” meant the same thing as “the probability of the data as it is computed by some particular model M”. But those two phrases are not at all the same thing:
(14) (i) The probability of the data D;
(ii) The probability of the data D, given some particular model M.
We have not discussed the notion of conditional probability, but I trust that what I have said will seem sufficiently natural. In order to connect these two different concepts in (i) and (ii), we may note the relationship in (iii):
(14) (iii) the probability of data D = (the probability of data D, given model M) × (the probability of model M)
And what is the probability of a given model? What could that question even mean? In influential work,
(14) (iv) the log probability of data D = log probability of data D (given model M) + log probability of model M
(v) The grammarian’s restatement of (iv): the log probability of data D = log probability of data D (given grammar M) + length of grammar M
Rissanen calls this perspective Minimum Description Length. Since we have taken logs in (iv) and (v) (and, following what I said in the first lecture, we will take the positive log, which is the absolute value of the log of a probability), we may now translate Rissanen’s Minimum Description Length principle as follows:
Minimize the log probability of the data as calculated by your grammar, and minimize the length of your grammar – and minimize both in such a way that the total sum of the two is a minimum.
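For readers who prefer symbols, statements (iv) and (v) and the principle just given can be written compactly. The notation is an assumption introduced here for illustration: D is the data, M the grammar, |M| the length of the grammar, and all logs are positive logs (the negative of the log of a probability).

```latex
\begin{align*}
-\log P(D) &= -\log P(D \mid M) + \bigl(-\log P(M)\bigr)
            && \text{(iv), with positive logs} \\
           &= -\log P(D \mid M) + |M|
            && \text{(v), reading } -\log P(M) \text{ as the length of } M \\
M^{*}      &= \operatorname*{arg\,min}_{M}
              \bigl[\, -\log P(D \mid M) + |M| \,\bigr]
            && \text{the grammar chosen by Minimum Description Length}
\end{align*}
```

Neither term is minimized on its own: a longer grammar is acceptable only if it lowers the positive log probability of the data by more than the extra length it costs.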
There are quite a number of consequences of this formulation for linguistic theory. We see, for example, that it contains as a special case the original concept of generative grammar (
Notice that the notion of prediction of data, and the notion of language as a cognitive process play no role here at all; they are unnecessary.
Let us return to the specific problem of learning morphology.
10 Minimum Description Length of the words of a language
We may now replace the Naïve Description Length criterion of a morphology with the more complex notion of a Minimum Description Length. This shift, however, requires a good deal of mathematical work which I will not discuss here; most of the details are found in Goldsmith 2001.
But let us look at some of the structure, even if we do not look at the details of the mathematics. We want to look at the log probability of the data (that is, of the corpus) and the log probability of the morphological grammar which is, as we have said, its length.
What is the morphological log probability of the data? We have not spoken of the morphology as a probabilistic model, but I pointed out in the first lecture that a probabilistic model is a structural model to which we have associated a distribution, which is a set of numbers that add up to 1.0. We assign a probability to every word on the basis of the unique signature which contains it. In particular, the log probability of a word is the sum of (i) the log probability of its signature; (ii) the log probability of the stem, given the signature; and (iii) the log probability of the suffix, given the signature; each of these corresponds to frequencies which are simple to observe in the training corpus.
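The three-part sum can be sketched with relative frequencies. The token counts below are invented for illustration, and estimating each probability by a simple count ratio is an assumption of this sketch; the full treatment is in Goldsmith 2001.

```python
import math
from collections import Counter

# Hypothetical token counts, keyed by (signature, stem, suffix); "" is NULL.
counts = {
    ("NULL.ing.s", "read", ""): 10,
    ("NULL.ing.s", "read", "s"): 3,
    ("NULL.ing.s", "read", "ing"): 7,
    ("NULL.ing.s", "walk", ""): 6,
    ("NULL.ing.s", "walk", "s"): 2,
    ("NULL.ing.s", "walk", "ing"): 2,
    ("NULL.ly", "ultimate", ""): 4,
    ("NULL.ly", "ultimate", "ly"): 6,
}

total = sum(counts.values())
sig_count, stem_count, suffix_count = Counter(), Counter(), Counter()
for (sig, stem, suffix), n in counts.items():
    sig_count[sig] += n
    stem_count[sig, stem] += n
    suffix_count[sig, suffix] += n


def word_cost(sig, stem, suffix):
    """Positive log probability (in bits) of stem + suffix under its signature."""
    p_sig = sig_count[sig] / total                         # (i)   signature
    p_stem = stem_count[sig, stem] / sig_count[sig]        # (ii)  stem given signature
    p_suffix = suffix_count[sig, suffix] / sig_count[sig]  # (iii) suffix given signature
    return -(math.log2(p_sig) + math.log2(p_stem) + math.log2(p_suffix))


print(round(word_cost("NULL.ly", "ultimate", "ly"), 3))
```

Because each factor is a proper distribution, the probabilities assigned to all stem and suffix combinations across the signatures add up to 1.0, as a probabilistic model requires.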
So, the job of the morphology is to assign as low a log probability of the data as possible. The morphology itself contains some complexity, which is measured by its length. And here is an important key point: the morphology contains the stems and the affixes, and it must “pay the cost” of maintaining lists of the phonemes or letters in each morph. And what is the cost that it pays? It is the phonological log probability of each morph. That was what we computed in the first lecture; and it must necessarily be computed once in each grammar; but then the presence of higher level structure makes it unnecessary to make that computation more than once.
11 Conclusion
Let us draw this discussion to a close. There are two points of significance that I hope to have shared with you in this presentation. First of all, this is a linguistics of doing: of producing computer programs that actually do work – such as language identification or the discovery of morphology. We test our hypotheses by seeing if we can in fact turn them into working software. I think this is a crucial and essential step. (When I say that it is a linguistics of doing, I mean to contrast this with a linguistics of knowing. Of course I have nothing against a linguistics of knowing; but I think we have gone too far in that direction. One of the hallmarks of modern science is its capacity to enable technology. That is not its only goal, and probably is not its primary goal, but it is a significant element of the modern scientific world, one which we linguists should not ignore; we do so at our own risk.)
Second of all, we arrive at what is both a new and an old conception of linguistics. Of course it is very much consistent with the computational methodology that I have just spoken of. It is a conception of a linguistics which is justified purely by virtue of its good analysis of complex linguistic data. It does not need to make promises that someday it will find correspondences between its models and what goes on in the brain. It is not a cognitive theory of linguistics; it is an empirical theory of language.
Notes: Lecture 2
[1] I am grateful to a large number of people for discussions of these issues, especially
[2] This is a sample of over a million words in computer-readable format, collected at
[3] You may notice that this will solve the problem of putting stem-final consonants in the suffixes.