Lecture 2

Probabilistic Models of Grammar:

Morphology from a Machine Learning perspective

 

John Goldsmith

Japanese Phonological Society

Nihon Oninron Gakkai

August 2001

 

 

 

 

I want minimum information given with maximum politeness.

Jacqueline Kennedy, instructions to press secretary Pamela Turnure

 

---

 

 

I think it was in September 1923 that a friend of Mayakovsky arrived in Berlin from Prague. This was red-haired Romka --  the linguist Roman Osipovich Yakobson, who worked at the Soviet Representation. Roman was pink faced and blue-eyed, with a squint in one eye; he drank a great deal but his head remained clear, and only after the tenth glass would he button his coat the wrong way. What struck me was that he knew everything: the structure of Khlebnikov's verse, old Czech literature, Rimbaud, the machinations of Curzon and Ramsay MacDonald. Occasionally he made things up, but when anyone tried to catch him out in an inaccuracy, he replied with a grin: "That was just a working hypothesis of mine"    

Ilya Ehrenburg, Memoirs, 1921-41, p. 60

 

---

 

 

With all this confounding trafficking in hypotheses about invisible connections with all manner of inconceivable properties, which have checked progress for so many years, I believe it to be most important to open people's eyes to the number of superfluous hypotheses they are making, and would rather exaggerate the opposite view, if need be, than proceed along these false lines.

H. von Helmholtz 1868.

 

 

 

 

 

 

1 Introduction

 

I would like to discuss with you material that I have been working on for several years.[1] This is the development of an algorithm (embodied in a computer program) whose purpose is to accept as its input a raw text from an unknown language and to produce a morphological analysis of the words. When I say that the language of the text is “unknown”, I mean, of course, that it is unknown to the algorithm – the text may be well-known to the human linguist, but that knowledge of the language is not embedded in the algorithm.

 

The first problem of the learning of morphology is the problem of segmentation – figuring out where to break a word into its component pieces (traditionally called morphs). If the text that we give to the computer is English, then we expect to find that the words read, reads, and reading (distributed throughout the corpus, of course – not located next to each other) are divided in such a way that read is a single morph, while s and ing are suffixes to that stem. If the text we give it is French, then it will not draw those conclusions, but will discover a different set of suffixes and prefixes, including the suffixes er, a, é, and so on. (By the way, I consider the problem to be essentially the same regardless of whether we are looking at a text in standard orthography or in something like a phonological transcription. I also assume that the text is segmented into words.)

Is this an easy problem? Linguists in recent decades have not devoted much attention to this problem, mainly because (I suppose) it seems so easy – anybody who speaks the language can do it, or at least we assume that it is that easy. In earlier days, beginning students in linguistics were taught techniques for accomplishing this (Nida 1949, e.g.), but very few people do that anymore.

 

Again, is this an easy problem? The fact that we can easily find the suffixes in a language that we know already is obviously irrelevant. I know English well, and it is not difficult for me to know how Subject-Auxiliary Inversion should apply to any sentence that I may be given. But my ability to do that tells us nothing about how easy or difficult it would be to write a completely formal description of that ability. The only thing to do is to sit down and to try to write an algorithm that will accomplish the task at hand, which is the segmentation of words into component morphs, with no prior knowledge of the language. (I wish to emphasize this point, because I often find that when I explain this to people, they assume that I cannot really mean what I am saying – the idea is to design the algorithm to do the work of the linguist or language-learner.)

 

And that is what I have been working on, and that is the problem that I wish to discuss with you today. The program that I will discuss with you is reasonably successful within a certain range of languages, but much work remains to be done, as you will see.

 

2. The question

 

Let us look at a particular word – let us say, reading. What clues do we have regarding its morphological analysis? We know, of course, that its stem is read and its suffix is ing. But what evidence is there of this in the data? It is true that there are a large number of words ending in ing in a corpus of English – but there are even more words that end in ng (which include all the words ending in ing, -ong, etc.), and –ng is not a suffix. Raw frequency is not what we care about.

 

There are two ways of thinking of an analysis such as read + ing and how to find the correct segmentation into morphs. The first focuses on where the morpheme break is (it is between d and i); the second focuses on the pieces that are created (read is a morph and ing is a morph). It may not be obvious right from the start how different these two approaches are. Let us discuss each in turn.

 

3 Zellig Harris and successor frequencies

 

Zellig Harris was the first linguist to consider this problem seriously; he published two papers on the subject (Harris 1955, 1967); the second of these was explored in detail in Hafer and Weiss (1974). Harris was especially concerned with developing an explicit algorithm that would allow us to accept a phonemic text as input and derive a set of morphemes for the language. He had the following idea.

 

Suppose we have a list of the words in our corpus, and suppose that for each one, we scan it from left to right.  After moving along N times, we are looking at the first N letters of our word. For example, if the word is government, and N=5, then we are looking at gover. We might ask, how many different words are there that begin with gover? But Harris said something slightly different. He said, If we look at all the words that begin gover, how many different letters are there that will immediately follow gover? If we look at the first 100,000 words of the Brown corpus,[2] we can find out the answer. There is exactly one letter that follows gover: it is n. And how many can follow govern? Four: i, m, o, and s, plus a fifth, since we need to count word-boundary (#): governing, government, governor, governs and govern. And how many letters can follow after governm? Just one: e, as in government (as well as governments, governmental, etc.).

 

 

g o v e r¹ n⁵ m¹ e n t   (successor frequencies shown as superscripts)   (1)

 

 

So Harris gave a name to this counting procedure, which was “successor frequency”, and he proposed that the “goodness” of a break between letters (or phonemes) could be measured by the successor frequency there, compared to the successor frequency of the letters on either side. In the case of government, it is clear that the correct parse (govern + ment) can be correctly identified in some version of this procedure.

 

When I say, “some version of this procedure”, I mean that Harris’ general idea can be turned into an algorithm in a variety of ways, and this was one of the contributions made by Hafer and Weiss: they tried out various versions and tested them to see how well they work in practice.  One version of Harris’ algorithm is this: we propose a morphological break when and only when the successor frequency of letter N is larger than the successor frequency of letter N-1 and letter N+1. This, we have seen, works for government.
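The counting procedure and this particular peak rule can be sketched in a few lines of Python. This is only an illustration on a toy word list, not Harris' or Hafer and Weiss's own formulation; the function names successor_frequencies and peak_breaks are hypothetical, and "#" stands for word-boundary as above.

```python
from collections import defaultdict

def successor_frequencies(words):
    """Map each word-initial string to the set of symbols that can follow it.

    '#' marks the word boundary, so a prefix that is itself a word
    counts '#' among its successors.
    """
    successors = defaultdict(set)
    for word in set(words):
        for i in range(1, len(word) + 1):
            following = word[i] if i < len(word) else "#"
            successors[word[:i]].add(following)
    return successors

def peak_breaks(word, successors):
    """Return cut positions where the successor frequency is a strict local peak."""
    # sf[k] is the successor frequency after the first k + 1 letters of the word
    sf = [len(successors[word[:i]]) for i in range(1, len(word) + 1)]
    breaks = []
    for k in range(1, len(sf) - 1):
        if sf[k] > sf[k - 1] and sf[k] > sf[k + 1]:
            breaks.append(k + 1)  # cut after k + 1 letters
    return breaks

# Toy word list (an assumption for illustration, not the Brown corpus)
words = ["govern", "governing", "government", "governor", "governs"]
succ = successor_frequencies(words)
cuts = peak_breaks("government", succ)
print([("government"[:c], "government"[c:]) for c in cuts])
```

On this toy list the only peak in government falls after govern, which is the parse discussed above.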

 

b⁹ a¹⁴ l⁹ l

 

But Harris’ method fails to give the right answer in many other cases. It is often too “liberal”, creating too many morpheme boundaries, but it is also too conservative, failing to find boundaries that exist.  I will mention just a few examples. It very often fails near the beginning of the word, because syllable structure interferes. After b, there are 9 successors; after ba there are 14; and after bal there are 9. But unfortunately ba is not a prefix in the word ball, as this predicts; the reason there are more successors after ba is that ba ends in a vowel, and more phonemes (or letters) can follow a vowel than a consonant, in general.

 

c¹² o²³ m⁸ p e t i t i o n

This effect is even clearer in the case of a word like competition, where co- is not a prefix. After c, there are 12 successors; after co, there are 23 successors; and after com, there are only 8. There is therefore a peak of successor frequency after the letters co – but that does not mean that co is a prefix in this word.

 

The Harris approach of successor frequency counting also fails when a given stem is present in a corpus with two or more suffixes that happen to begin with the same letter. For example:

  • reception/receptive will be assigned the stem recepti-;
  • journalism/journalist the stem journalis-;
  • earlier/earliest the stem earlie-;
  • craftsman/craftsmen the stem craftsm-;
  • headlines/headlights the stem headli-;
  • somewhat/somewhere the stem somewh-;

and so on.

 

I will not pursue this point any further.  There is a good deal of insight in Harris’ approach, and we shall be able to use it later on. But it is not a solution to the problem, even with minor adjustments.  The point to bear in mind is that Harris’ approach focuses on the break between phones (letters). We will turn next to the alternative, which is to look at the morphs themselves.

 

4 Naïve Description Length

Let us return to the question we are considering. We are looking at the word reading, and trying to determine if it is composed of two morphs, and if so, what are they: is it

  • rea-ding, or
  • read-ing, or
  • readi-ng,

etc?

 

One simple and natural way to think about this is to think about a morphology as a way of creating a shortened wordlist for the language. If we construct a wordlist for all the words in any large corpus, we will have many similar (but not identical) related forms of words, such as read, reads, and reading. But if we construct a morphological analysis, we can enter just one time the stem read, and then specify a general pattern that is shared by a large number of stems. The pattern may be that the stem may appear followed by

(i) no suffix (call that “NULL”);

(ii) the suffix s; and

(iii) the suffix ing.

If we do this, then we save on the length of the entire wordlist.

 

The idea of a pattern of suffixes that appear on several stems is so important that we will give a name to that: we will call it a signature, and we will say that a given stem appears with a (unique) signature in any given corpus. The signature is the alphabetized list of suffixes that appear on a given stem in a corpus.


 

(1)   before morphology:

jump jumps jumping

read reads reading

walk walks walking

Total count of letters: 48

 

 

(2) after morphology:

read walk jump:

plus the pattern:

__ NULL (or) s (or) ing

Total count of letters: approximately 16


We can get a little more explicit and actually count the number of letters (or phonemes) in each case. In case (1), we have a total of 48 letters, while in case (2) we have 12 letters in the first row (specifying the stems) plus some number for the suffix pattern: the number is 4 if we count only the actual letters (and take “NULL” to be no letters); we also have to inquire as to what the “cost” in letters is of the words “or” in parentheses! We will eventually answer that question, but for present purposes suppose we just count the 4 letters in the suffixes – this will illustrate the basic idea that if we keep track of our data in terms of morphs rather than words, the total length of our list will be considerably shorter. And the more stems we add (such as “proceed, halt, fasten, maintain…”) the more savings of letters there will be if we choose to write the list in morphs rather than words.

 

An incorrect morphological analysis will usually lead to more letters in the list. If, for example, we incorrectly parse jumping and reading as jumpi + ng and readi + ng (but correctly parse read and read + s), then we will have only partial savings. In other words, if we have a set of word forms that really are related and the algorithm fails to cut them in the same way – so that there are two different stems set up in the analysis – then we will have more letters in the underlying list of stems than is really necessary. And if the goal is to have the shortest and most compact list, then this error will be a bad move.

 

So this second approach to the problem of finding a morphology focuses on how useful a morph is in terms of compressing a word list. I will refer to this as the Naïve Description Length approach.
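The letter counts in (1) and (2) are easy to check mechanically. The following is a minimal sketch under the simplifying assumption made above: NULL costs no letters and the cost of “or” is ignored. The function name description_length and the representation of an analysis as a word-to-(stem, suffix) mapping are hypothetical choices for this illustration.

```python
def description_length(analysis):
    """Letters needed to list each distinct stem and each distinct suffix once.

    `analysis` maps every word to a (stem, suffix) pair; the empty string
    stands for NULL and therefore costs no letters.
    """
    stems = {stem for stem, _ in analysis.values()}
    suffixes = {suffix for _, suffix in analysis.values()}
    return sum(len(s) for s in stems) + sum(len(x) for x in suffixes)

# Correct analysis: three stems sharing the pattern NULL / s / ing
good = {stem + suffix: (stem, suffix)
        for stem in ("jump", "read", "walk")
        for suffix in ("", "s", "ing")}

# Faulty analysis: jumping and reading are cut one letter too late
bad = dict(good)
bad["jumping"] = ("jumpi", "ng")
bad["reading"] = ("readi", "ng")

unanalyzed = sum(len(word) for word in good)  # plain word list, as in (1)
print(unanalyzed, description_length(good), description_length(bad))
```

The faulty analysis lands between the two extremes: it is shorter than the raw word list but longer than the correct analysis, because it needs two extra stems and one extra suffix.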

 

 

5 Comparing Successor Frequency (Z. Harris) and Naïve Description Length

 

We have already noted that these two approaches differ in their focus: Zellig Harris’ approach focuses on where the breaks are, and the Naïve Description Length focuses on the character of the pieces that we discover.  But there are two other important differences that we should be aware of:

 

  1. Z. Harris’ approach is algorithmic, in the sense that it is trivial to implement it as a computer algorithm. Once we have alphabetized our word-list, we immediately discover where the breaks between morphs are.

  2. The Naïve Description Length approach gives us absolutely no idea how to find a good analysis for a given set of words. The Naïve Description Length approach is very good at evaluating alternative morphological analyses, but it is no good at all at discovering alternative analyses. The Harris approach is no good at comparing alternative analyses, but it is good at finding one analysis.

 

 

(By the way, this is an interesting contrast, one which we have encountered before in linguistics more than once. We have seen it in the difference of linguistic theory of Zellig Harris and his student, Noam Chomsky; Harris’s theory throughout was of the first sort; Chomsky’s theory, from The Logical Structure of Linguistic Theory (1955) up until Lectures on Government and Binding (1981) (when he shifted to “principles and parameters”, a very different foundational theory) was of the second sort, as he famously argued in Syntactic Structures (1957, p. 55). Similarly, the contrast is found in comparing generative phonology and optimality theory.)

 

There is something that both approaches fail to speak to, and that is the overall coherence of the system produced. Clearly, the Naïve Description Length approach comes closer to making general statements about the language, but it does not take the problem quite far enough.  Harris’s approach does nothing of the sort.

 

Here is an example. In the first 500,000 words of the Brown Corpus of written English, we find seven stems which occur with the signature

 

(3)  NULL  ed  ing  ion  ive  s 

           

Remember – this is important in everything that follows – that when we include “NULL” in a signature, that means that the stem can appear as a free-standing word in the corpus, without any suffix at all.  The stems are:

 

(4) disrupt   project   connect   protect   prevent   suggest   predict

(e.g., disrupt, disrupted, disrupting, disruption, disruptive, disrupts, and the same for the other stems).

 

This information is extremely helpful in establishing that ion and ive are distinct suffixes in English. What we need is a method that will allow that information to have a bearing on how we analyze the words constructive and construction. Again, we need a method that says, if we have analyzed one set of words in one way – such as by means of the signature in (5), then we should analyze

(5)  NULL   ed   ing   ion  ive  s

constructive and construction with the sub-signature ive - ion, even if that means having two extra letters in the signature (ive - ion versus ve - on) and one fewer letter in the stem (construct- versus constructi-). To put that another way, a smart morphological analyzer will look at the words constructive and construction and say to itself, Yes, I could analyze this as constructi + on, but I’m sure that the suffix is really -ion, not -on, from seven other words, and so I will analyze construction in the parallel way. In short, we need to make sure that the overall system is as coherent and self-consistent as possible, not just that local savings are made.

Harris’ approach doesn’t do that. On the other hand, the Naïve Description Length will analyze construction/constructive wrongly if there is only one stem in this pattern, but if there are three or more stems of this sort, then it will prefer the correct solution, with the vowel i in the suffixes rather than on the stems. Good! But that’s only the beginning of the story.

Unfortunately, there is a related problem in English morphology which is worse than the constructive/construction problem. In fact, when you think about it, it is a problem that every language will have, if we have a large enough sample corpus from the language. Consider the following situation in English. It so happens that t is the most common letter (or phoneme) to end a stem in English (e.g., halt, seat, interest, attempt, resort, test, blast, alert, accent, request). How does the Naïve Description Length approach deal with these stems, if we have a corpus in which all of the inflected forms of these stems appear? That is, if we have the forms in (6) for a large number of stems that end in t, then what does the Naïve Description Length approach tell us to do?

(6) halts, halted, halting, halt

(7) t, ted, ting, ts

The answer is: the Naïve Description Length approach tells us that if we have 11 or more stems like the one in (6) – which you and I would call “stems that end in t” – then we should set up a new signature of the form in (7), because even if this means creating four new suffixes (which “costs” us ten letters to build), we will save one letter in each stem (because the stem will be hal- instead of halt-, or sea- instead of seat-, etc.). With ten such stems the two analyses tie; from the eleventh stem on, the new signature is strictly shorter. In fact, this is true not only for stems ending in t but for stems ending in any other letter: whenever there are 11 or more such stems, Naïve Description Length tells us to set up a new signature.
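The break-even arithmetic behind both this case and the construction/constructive case can be written out explicitly. This is a small sketch of the letter counting only, assuming that each stem loses exactly one letter and that the suffixes s, ed, and ing remain in the list for other stems; the function names letters and net_saving are hypothetical.

```python
def letters(morphs):
    """Total number of letters in a list of morphs."""
    return sum(len(m) for m in morphs)

def net_saving(n_stems, old_suffixes, new_suffixes):
    """Letters saved by moving one stem-final letter into the suffixes.

    Each of the n_stems stems becomes one letter shorter, while the suffix
    list grows by letters(new_suffixes) - letters(old_suffixes).
    A positive result means Naive Description Length prefers the new analysis.
    """
    return n_stems - (letters(new_suffixes) - letters(old_suffixes))

# Stems ending in t: the suffixes in (7) are all brand new (ten letters)
t_suffixes = ["t", "ted", "ting", "ts"]
for n in (10, 11):
    print("t-stems:", n, "net saving:", net_saving(n, [], t_suffixes))

# construct-: replacing on/ve by ion/ive adds only two suffix letters
for n in (1, 2, 3):
    print("construct-type stems:", n, "net saving:",
          net_saving(n, ["on", "ve"], ["ion", "ive"]))
```

The same counting that rescues construct + ion once three stems share the pattern is what licenses the unwanted signature in (7) once eleven stems end in t.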

So there are some problems looming if we depend on the Naïve Description Length method, in addition to the other problems for the Zellig Harris approach that we have looked at.

 

6 Summary so far

I have looked at this in some detail because the problem of segmenting words into pieces seems very easy until you look at it carefully – and especially until you try to write an explicit computer program to accomplish the task.  I remember my very first encounter with generative syntax; it involved the analysis of the English verbal auxiliary that I alluded to above. Obviously I knew how to form negated and inverted sentences in English (I’ve known that almost all my life), but I had never thought about how difficult a task it was to propose a formal analysis that actually works, even for the “obvious” cases. And that is the essence of a linguistic problem: finding the best explicit analysis for a set of data which is obvious to the native speaker.

What we need in order to continue is to combine the strong points of the Harris approach and the Naïve Description Length approach. We need the algorithmic character of the Harrisian approach, and some of the fine judgment of the Naïve Description Length approach. The next step we take will be one of the most important: this is to establish the discovery of signatures as the most important object of our search. We will begin by employing the algorithmic form of Harris’ idea in order to discover the basic signatures of the corpus.

 

7 Signatures: aim for maximal signature

We have encountered earlier the notion of signature: the set of affixes that a given stem occurs with in a corpus. We assign every stem to exactly one signature, and so we can speak more broadly of a signature, in a particular corpus, as being not only a set of affixes, but also the set of stems that appear with precisely these affixes.

The signature is the most useful and important tool in understanding how broad and significant a set of affixes is. A signature which consists of several suffixes (three or more) of length greater than one letter (phoneme) and a good number of stems (fifteen or more) is a very strong candidate for being an important morphological pattern in a language. Conversely, a signature that occurs with only one stem is suspicious, all other things being equal, even if it contains a large number of suffixes. For example, the “signature” in (8) in a particular corpus, has only one stem – mat (match, mate, material, materials, matrimony, matrons, maturing), and the presence of just one stem tells us (even if we know nothing about the language) that we have no reason to believe that this is a real pattern.

(8) ch  e  erial  erials  rimony  rons  uring  [ Spurious! ]

as in: match, mate, material, materials, matrimony, matrons, maturing

Remark: There is a special case that works differently: if we have a large signature with a significant set of stems associated with it, then it is perfectly reasonable for only a single stem to appear with a proper subset of the affixes in the larger signature.

Likewise, stems which appear with only a single suffix may not be true stems at all: we could gather together all the words in a language that end in m, but declaring that m to be a suffix would not make it one! (On the other hand, if the candidate suffix consists of several phonemes, there is a good chance it is indeed a suffix; we return to this later.)

So we will set ourselves the first task, that of finding as many signatures as possible that contain several stems and several suffixes. How are we to begin?

There are a number of ways of finding good candidate suffixes; we can, for example, consider all word-final sequences of letters (phonemes) up to some reasonable maximum (6, typically). This is rather slow and it takes a very large amount of computer memory, so let us take advantage of Zellig Harris’s proposal to employ successor frequency. In particular, let us compute the successor frequency for each word, starting after the 4th letter (since earlier in the word, there is too much “noise”  from phonological effects), and look for places where a successor frequency is greater than 1, and where the preceding and the following letters have a successor frequency of 1. These various conditions guarantee that the algorithm will fail to apply to a large proportion of words; but if our goal is to find some very good suffix candidates and to build candidate signatures from them, this should work well for us.[3]
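The conservative bootstrap just described can be sketched as follows. This is an illustrative reading of the conditions in the text, not the code of Linguistica: the function names, the min_stem parameter, and the choice to write signatures as dot-separated, alphabetized suffix lists with NULL added whenever the stem is itself a word are assumptions made for this example.

```python
from collections import defaultdict

def successor_counts(words):
    """Successor frequency of every word-initial string ('#' = word boundary)."""
    successors = defaultdict(set)
    for word in words:
        for i in range(1, len(word) + 1):
            successors[word[:i]].add(word[i] if i < len(word) else "#")
    return {prefix: len(s) for prefix, s in successors.items()}

def conservative_split(word, sf, min_stem=4):
    """Cut where the successor frequency exceeds 1 and both neighbours equal 1."""
    for i in range(min_stem, len(word)):
        if sf[word[:i]] > 1 and sf[word[:i - 1]] == 1 and sf[word[:i + 1]] == 1:
            return word[:i], word[i:]
    return None  # the rule deliberately declines to analyze many words

def bootstrap_signatures(words):
    """Group stems by the alphabetized set of suffixes found for them."""
    vocab = set(words)
    sf = successor_counts(vocab)
    suffixes_of = defaultdict(set)
    for word in vocab:
        split = conservative_split(word, sf)
        if split:
            stem, suffix = split
            suffixes_of[stem].add(suffix)
    signatures = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        if stem in vocab:  # the stem also occurs as a free-standing word
            suffixes = suffixes | {"NULL"}
        signatures[".".join(sorted(suffixes))].add(stem)
    return dict(signatures)

# Toy vocabulary (an assumption for illustration)
words = ["jump", "jumps", "jumping", "walk", "walks", "walking",
         "ultimate", "ultimately", "ultimatum"]
print(bootstrap_signatures(words))
```

On this toy vocabulary the sketch finds one two-stem signature and one single-stem signature of the untrustworthy kind; ultimate and ultimately are left unanalyzed, which is the price of being conservative.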

I have brought with me a demonstration (demo) version of Linguistica 2001, which allows us to perform this operation as we speak.

Let’s take the first 100,000 words of the Brown Corpus of written English – that is roughly the size of a novel, and it contains approximately 12,200 distinct words. We obtain 346 signatures, and 175 suffixes. Here is the top set of the signatures:

(9)

Here is the result of the same operation on the Breen word-list of Japanese:

(10)

The left part of the screen shows this; after reading in the corpus, we click on “Successor Freq 1”, as shown below:

(11)

And by clicking on a signature, we obtain a complete list of the stems that appear associated with it. The top signature in this word-list, for example, is this:

Ippen shônin: Tabi no shisaku-sha

(12)

Returning to the analysis of English, we see excellent results from combining Harris’ much too liberal algorithm for finding candidate morphs along with candidate signatures, and focusing on the signature in particular. As we scan down the list of signatures from the top, we do not find any errors until we get to number 32, which is the (spurious) signature m.t, for words like journalism/journalist. Here we have been pulled into the trap of this wrong analysis by Harris’ algorithm, and it is not the only case of this sort (earlier and earliest appear lower on the list of signatures).

In general, stems that appear with only one suffix – that is, stems associated with a signature containing only one suffix – are not trustworthy, for reasons we have already touched on.

(13)

u l t i m a¹ t² u¹ m          ultimatum

u l t i m a¹ t² e² #          ultimate

u l t i m a¹ t² e² l y        ultimately

The process so far has been designed in order to be conservative, to make as few mistakes as possible, rather than to apply to all words. For example, in this corpus, the words ultimate, ultimately, and ultimatum are found. After ultimat-, two letters are found: e and u; after ultimate-, two letters are found (l and #); and after ultimatu-, one letter is found. So if we look for peaks of successor frequency, surrounded by successor frequency of 1, the algorithm will split ultimat-um, but fail to split ultimately. In a sense, it is the existence of ultimatum that causes the failure of ultimately to split by Harris’ algorithm. In any event, since our goal at this point is to find the signatures of English (and not to find every place where these signatures actually occur), this is not a problem; but it would be a problem if our algorithm stopped here.

Clearly, at this point, we need to go back over the whole corpus, and ask further questions, based on our growing knowledge of the word-structure of the language. Minimally, we want to ask all of our words: how many of you can be analyzed in terms of the signatures we have just discovered? We want to analyze ultimate and ultimately in terms of the signature NULL.ly because we know that that signature occurs very often in the language (there were 76 occurrences in the first pass, illustrated above – that is a very large number of occurrences); we no longer care whether Harris’ successor frequency algorithm identifies ly as a suffix: we are already almost completely certain that it is a suffix.

 

8 Looping and heuristics

Let us take a step back, and discuss the overall strategy that we employ. First of all, we need, and we have employed, a bootstrap heuristic—a simple method that gives us an imperfect but reasonably good sketch of the morphological structure of the language. This came from a modification of Harris’s method (the modification making it much more conservative in what it picks out) plus the use of signatures.

Following this bootstrap procedure, there are a number of steps to take next, as we noted a moment ago, when we discussed the words ultimate and ultimately. We establish thresholds for signatures, based on the number of stems and suffixes that are present in the signatures, and then gather together all of the suffixes in these signatures. We reconsider all the words, and ask which ones can be split using one or more of the suffixes we have just determined. That, in turn, gives us a new set of candidate stems. These stems become certain, if they participate in known signatures; they become merely candidate stems otherwise. And so on.
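One round of this reconsideration step might look like the sketch below. It assumes that the trusted suffixes and trusted signatures have already been selected by some threshold; the function name second_pass, the min_stem parameter, and the dot-separated signature notation are hypothetical choices for this illustration rather than a description of the actual program.

```python
from collections import defaultdict

def second_pass(words, known_suffixes, known_signatures, min_stem=3):
    """Re-split every word using suffixes from trusted signatures.

    Returns two dicts mapping stems to signatures: stems whose signature
    is already known ("certain") and all other stems ("candidate").
    """
    vocab = set(words)
    suffixes_of = defaultdict(set)
    for word in vocab:
        for suffix in known_suffixes:  # non-NULL suffixes only
            stem = word[: len(word) - len(suffix)]
            if word.endswith(suffix) and len(stem) >= min_stem:
                suffixes_of[stem].add(suffix)
    certain, candidate = {}, {}
    for stem, suffixes in suffixes_of.items():
        if stem in vocab:  # the stem occurs on its own: add NULL
            suffixes = suffixes | {"NULL"}
        signature = ".".join(sorted(suffixes))
        target = certain if signature in known_signatures else candidate
        target[stem] = signature
    return certain, candidate

# Toy data (assumptions for illustration)
words = ["ultimate", "ultimately", "ultimatum", "quick", "quickly", "this"]
certain, candidate = second_pass(words, {"ly", "s"}, {"NULL.ly"})
print(certain)
print(candidate)
```

Here ultimate is recovered as a stem with the signature NULL.ly even though the successor-frequency pass could not split ultimately, while thi- (from this) remains a mere candidate because its one-suffix signature is not among the trusted ones.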

Implicit in what we have just described is a global strategy that might be described informally this way: Find the smallest number of signatures that will cover the largest number of words in the corpus.

But what does this really mean? And what is its relationship to Naïve Description Length? Its connection to Naïve Description Length is direct: the Naïve Description Length approach tells us to find the smallest number of letters in the list of morphs that make up the corpus, and it is the role of the signatures to do that. Remember that each time we apply a signature to a group of related words that share a stem, we save a large number of letters because we do not have to repeat the stem for each one of its suffixed forms. To repeat, then, the Naïve Description Length is the formalization of the strategy to “cover the largest number of words in the corpus.” But what about the idea of using the smallest number of signatures? Well, some of that follows from the Naïve Description Length: we save on letters if we have fewer signatures, all other things being equal. But, as we have seen when we look at the analysis closely, Naïve Description Length is too willing to permit new signatures.

 

9 The theory behind all of this, and the connection to “Maximize the probability of the data”

We will make a large transition now, and make a connection between this work on the learning of morphology, and the work on phonology that I discussed in the first lecture. In that lecture, I made the case that regardless of what particular theory of phonology one would like to believe in, there is a larger principle in science (or in cognition in general), which is this:

Whatever the data or evidence is that we have, we must find the hypothesis that maximizes the probability of that data.

Up to this point, I have discussed this principle as if the phrase “the probability of the data” meant the same thing as “the probability of the data as it is computed by some particular model M”. But those two phrases are not at all the same thing:

(14)      (i) The probability of the data D;

(ii) The probability of the data D, given some particular model M.

 

We have not discussed the notion of conditional probability, but I trust that what I have said will seem sufficiently natural. In order to connect these two different concepts in (i) and (ii), we may note the relationship in (iii):

(14)      (iii) the probability of data D =
(the probability of data D, given model M) x
(the probability of model M)

And what is the probability of a given model? What could that question even mean? In influential work, Jorma Rissanen (1989) has argued that we can make sense of this, and he has proposed that the (positive) log probability of an empirical model is the length (in bits) of the shortest description of that model.  He called this framework the Minimum Description Length framework. So we can reformulate (iii) as in (iv), which becomes (v):

(14)     (iv) the log probability of data D =
log probability of data D (given model M)  +
log probability of model M

 (v) The grammarian’s restatement of (iv):
 the log probability of data D =
log probability of data D (given grammar M)  +
length of grammar M

Rissanen calls this perspective Minimum Description Length. Since we have taken logs in (iv) and (v) (and, following what I said in the first lecture, we will take the positive log, which is the absolute value of the log of a probability), we may now translate Rissanen’s Minimum Description Length principle as follows:

Minimize the log probability of the data as calculated by your grammar, and minimize the length of your grammar – and minimize both in such a way that the total sum of the two is a minimum.
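For readers who prefer symbols, the statements in (14) and the principle just given can be written compactly as follows. This is only a transcription of what is said above; the notation (Pr for probability, |M| for the length of the grammar in bits, and the hat on the selected grammar) is an assumption introduced here, and "log" is the positive log, taken base 2 so that lengths come out in bits.

```latex
\begin{align*}
\text{(iii)}\quad & \Pr(D) = \Pr(D \mid M)\,\Pr(M) \\
\text{(iv)}\quad  & -\log_2 \Pr(D) = -\log_2 \Pr(D \mid M) \;-\; \log_2 \Pr(M) \\
\text{(v)}\quad   & -\log_2 \Pr(D) = -\log_2 \Pr(D \mid M) \;+\; |M|,
                    \qquad\text{where } |M| = -\log_2 \Pr(M) \\
\text{MDL:}\quad  & \hat{M} = \arg\min_{M}\,\bigl[\, -\log_2 \Pr(D \mid M) + |M| \,\bigr]
\end{align*}
```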

There are quite a number of consequences of this formulation for linguistic theory. We see, for example, that it contains as a special case the original concept of generative grammar (Chomsky 1955): that the correct grammar for a given language is the grammar whose description length is the shortest. We also see that the phonological goal of maximizing the probability of the sequences of phonemes, as discussed in the first lecture, is a special case. Indeed, pushing that a bit further, we see that this principle tells us to continue enriching the grammar – that is, making it longer – until the additions to the grammar no longer are worthwhile, where we measure the worth by seeing how much the log probability of the corpus decreases by virtue of the additional grammatical description.

Notice that the notion of prediction of data and the notion of language as a cognitive process play no role here at all; they are unnecessary.

Let us return to the specific problem of learning morphology.

 

10 Minimum Description Length of the words of a language

We may now replace the Naïve Description Length criterion of a morphology with the more complex notion of a Minimum Description Length. This shift, however, requires a good deal of mathematical work which I will not discuss here; most of the details are found in Goldsmith 2001.

But let us look at some of the structure, even if we do not look at the details of the mathematics. We want to look at the log probability of the data (that is, of the corpus) and at the log probability of the morphological grammar, which is, as we have said, its length.

What is the morphological log probability of the data? We have not yet spoken of the morphology as a probabilistic model, but I pointed out in the first lecture that a probabilistic model is a structural model to which we have associated a distribution, which is a set of non-negative numbers that add up to 1.0. We assign a probability to every word on the basis of the unique signature which contains it. In particular, the log probability of a word is the sum of (i) the log probability of its signature; (ii) the log probability of the stem, given the signature; and (iii) the log probability of the suffix, given the signature. Each of these corresponds to a frequency which is simple to observe in the training corpus.
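The three-part sum can be sketched as follows. This is only an illustration under stated assumptions: the toy token list, the signature labels such as "NULL.ed.ing", and the function names are all hypothetical, and the probabilities are plain relative frequencies, which is one simple way of reading "frequencies which are simple to observe".

```python
import math
from collections import Counter

# Hypothetical analyzed corpus: each token is (signature, stem, suffix).
# "NULL" stands for the empty suffix.
tokens = [
    ("NULL.ed.ing", "jump", "NULL"),
    ("NULL.ed.ing", "jump", "ed"),
    ("NULL.ed.ing", "walk", "ing"),
    ("NULL.ed.ing", "walk", "NULL"),
    ("NULL.s", "dog", "s"),
    ("NULL.s", "dog", "NULL"),
    ("NULL.s", "cat", "NULL"),
    ("NULL.s", "dog", "s"),
]

total = len(tokens)
sig_count = Counter(sig for sig, _, _ in tokens)
stem_count = Counter((sig, stem) for sig, stem, _ in tokens)
suffix_count = Counter((sig, suf) for sig, _, suf in tokens)

def plog(p):
    """Positive log probability, in bits."""
    return -math.log2(p)

def word_cost_bits(sig, stem, suffix):
    """Sum of (i) signature, (ii) stem given signature, (iii) suffix given signature."""
    return (
        plog(sig_count[sig] / total)
        + plog(stem_count[(sig, stem)] / sig_count[sig])
        + plog(suffix_count[(sig, suffix)] / sig_count[sig])
    )

print(word_cost_bits("NULL.s", "dog", "s"))
```

Because each of the three factors is a proper distribution, the probabilities assigned to all signature, stem, and suffix combinations add up to 1.0, as a probabilistic model requires.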

So, the job of the morphology is to assign as low a log probability to the data as possible. The morphology itself contains some complexity, which is measured by its length. And here is a key point: the morphology contains the stems and the affixes, and it must “pay the cost” of maintaining lists of the phonemes or letters in each morph. And what is the cost that it pays? It is the phonological log probability of each morph. That was what we computed in the first lecture; it must necessarily be computed once in each grammar, but the presence of higher-level structure makes it unnecessary to make that computation more than once.
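A rough sketch of this "pay once" effect follows. It rests on several assumptions of my own for the sake of illustration: the word list is a toy example, the phonological model is reduced to a unigram model over letters, and the costs of morph boundaries and of pointers from words to their stems and suffixes are ignored. The point is only that spelling each morph once costs fewer bits than spelling every word in full.

```python
import math
from collections import Counter

# Hypothetical toy word list and its analysis into stems and suffixes.
words = ["jump", "jumped", "jumping", "walk", "walked", "walking"]
stems = ["jump", "walk"]
suffixes = ["ed", "ing"]

# Assumption: a unigram letter model estimated from the word list itself.
letter_count = Counter("".join(words))
n_letters = sum(letter_count.values())

def spelling_cost_bits(s):
    """Positive log probability (in bits) of a string under the letter model."""
    return sum(-math.log2(letter_count[c] / n_letters) for c in s)

# Without morphology: every word is spelled out in full.
unanalyzed = sum(spelling_cost_bits(w) for w in words)

# With morphology: each stem and each suffix is spelled out exactly once.
analyzed = sum(spelling_cost_bits(m) for m in stems + suffixes)

print(round(unanalyzed, 2), round(analyzed, 2))
```

A full accounting would add back the cost of the signatures and of the pointers that link each word to its stem and suffix; those details are left out of this sketch.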

 

11 Conclusion

Let us draw this discussion to a close. There are two points of significance that I hope to have shared with you in this presentation. First of all, this is a linguistics of doing: of producing computer programs that actually do work – such as language identification or the discovery of morphology. We test our hypotheses by seeing if we can in fact turn them into working software. I think this is a crucial and essential step. (When I say that it is a linguistics of doing, I mean to contrast this with a linguistics of knowing. Of course I have nothing against a linguistics of knowing; but I think we have gone too far in that direction. One of the hallmarks of modern science is its capacity to enable technology. That is not its only goal, and probably is not its primary goal, but it is a significant element of the modern scientific world, one which we linguists should not ignore; we do so at our own risk.)

Second of all, we arrive at what is both a new and an old conception of linguistics. Of course it is very much consistent with the computational methodology that I have just spoken of. It is a conception of a linguistics which is justified purely by virtue of its good analysis of complex linguistic data. It does not need to make promises that someday it will find correspondences between its models and what goes on in the brain. It is not a cognitive theory of linguistics; it is an empirical theory of language.

 


Notes: Lecture 2



[1] I am grateful to a large number of people for discussions of these issues, especially Carl de Marcken, Partha Niyogi, Svetlana Soglasnova, and Derrick Higgins.

[2] This is a sample of over a million words in computer-readable format, collected at Brown University nearly 40 years ago.

[3] You may notice that this will solve the problem of putting stem-final consonants in the suffixes.