Topics in Computational Morphology and Phonology
John Goldsmith
Summer 2003 LSA Institute: Michigan State University

My goal in this series of classes is to discuss and explore a way to think of linguistic theory as being closely related to (some parts of) machine learning and information theory. By exploring those somewhat mathematical topics, but always keeping our eye on linguistic applications, we come to better understand what it is that drives linguists and linguistics. If we are successful in formulating questions of linguistic theory in such terms, then it should be possible to develop algorithms that embody the linguistic criteria for analysis that linguists use implicitly, and, in effect, to create grammar-discovery programs.We illustrate this, focusing on the automatic discovery of morphological structure, but also discussing the role of word discovery from an unbroken corpus, and the problem of phonology discovery.

Web links: There's a lot of information that is relevant to this course on my website -- please take a look.

Note on software: all of the programs of mine listed below run under Windows; all are in MS C++ (except the last, which is in Visual Basic).


Class topics:

1. An overview of what we will accomplish, or, Why linguists should care about machine learning. Powerpoint slides.

1. First, situating what we'll be doing within linguistics more generally. (a) On the mediationalist-distributionalist dimension: this is distributionalist. (b) Schools of explanation: historical; psychological; social; algorithmic: this is algorithmic. Computational linguistics is the embodiment of algorithmic explanation. 2. Strengths of computational approaches to linguistics: (a) They allow for testing of ideas against data. (b) They allow for better exploration of data: you get your hands dirty (and your mind clean). (c) They suggest models to linguists. Contrasts between the automatic learning and the hand-crafted rule-based computational linguists. 3. Machine learning: (a) Compare Syntactic Structures p. 51ff. What role the automatic learning of grammars in generative grammar? (b) paradigms of machine learning (ML) today: Minimum Description Length, neural networks, support vector machines, etc. (c) Machine learning provides a way (perhaps the only way) to test the claim that an information-rich UG (Universal Grammar) is necessary to account for the acquisition of human language. (Such a view carries with it the claim that UG is learned on an evolutionary scale by an uncharacterized species-level learning mechanism: not a very plausible idea, but not impossible: something like that has indeed happened for, e.g., vision, but then again languages vary more than vision systems do, and mammals had longer than homo sapiens.) 4. Probabilistic approaches: (a) a better name would be QTE: the quantitative theory of evidence. (b) probability theory is not fuzzy.(c) More importantly, it is not in any fashion inimicable to struturally-based theories (a wrong-headed notion that many linguists have picked up). (Probabilistic models sometimes appear to be skeptical about structure, but only because probabilistic modelers want to wring every last bit of quantitative result out of a simpler model before investing themselves in a more complex model, one which may require a great deal of mathematical work to get up and running.) (c) information theory's usefulness, and its central tool, the notion of information content, positive log probability, or average positive log probability. (d) A general goal of quantitative modeling, not just for linguistics, but for intelligence in the universe: Minimize the complexity of the perceived universe. This maxim leads to the framework of Minimum Description Length (MDL). MDL has many interesting connections to early generative grammar.

2. A brief overview of the next five classes, and how to download the software that we will be talking about.

2. Computational learning of morphology: first pass. How to design a computer program that will learn the morphology of a language when presented with a large chunk of text from it. Powerpoint slides.

The primary reading is Unsupervised learning of the morphology of a natural language.Computational Linguistics 27:2 (2001) pp. 153-198. The software described there is available at this website.
Recommended reading: A probabilistic model for learning concatenative morphogy. Matthew G. Snover and Michael R. Brent, Proceedings of NIPS 2002.

Please download Linguistica, a program that runs under Windows. You will need a text file in a language that uses an alphabet: there are many links to resources in different languages that you can find here. Learn how to do an analysis of the suffixal and the prefixal system of the language; we will discuss this in class.

Topics: we wish to formulate the problem of language learning as a problem in optimization. We won't have the technical tools to do this until next week, but we can get a sense of what this means by formulating the problem of morphology-acquisition as the problem of identifying, and eliminating, redundancy in the lexicon: a certain kind of redundancy -- the redundancy of substrings of letters [phonemes]. Even more important, we integrate into the model the notion that the lexicon is not fundamental a list (or set) of words, but rather a set of patterns, patterns which describe how pieces (i.e., morphemes) are put together in the language. For example, we can identify the pattern of a suffixal signature: a set of suffixes all of which appear with a specific set of stems. The signature serves both as a heuristic for uncovering morphological structure, and (with the mathematical tools we develop in Class 4) a means to automate the justification of the morphological analysis of a language.

 

3. Probability for linguists The basics: a little math, a little probability. Powerpoint slides
The reading is Probability for linguists (htm). Please read this ahead of time, and we'll go through it in detail. Recommended: Probabilistic models of grammar: phonology as information minimization, which covers some similar topics in less detail, but with more focus on phonology. Here is link to a webpage with more information about obtaining the software discussed in the paper.

Topics: 1. Sample space, or universe of probabilities. A distribution is a non-negative function over the sample space whose values sum to 1.0 (for our simple, discrete interests). 2. Probability of events which are combinations of independent choices: multiply the independent choices. How we know this makes a distribution. 3. We use this information to compare different models' treatment of the same observation set, and chose the model (or parameters) which assigns the highest probability to the observations. Example: an invisible corpus from an unknown language: all we know are the frequencies of the various letters. Can we determine what language it is written in? 4. Logarithms: much better way to think of things. In fact, we prefer inverse (or positive) log probabilities: they are additive, and intuitive, too. The log probability of the word is the sum of the log probabilities of the letters (or phonemes): we call that the unigram model. 5. The average log probability of the word is a reasonable measure of its phonological well-formedness within its language. 6. Conditional probability: conditioning a phoneme on its predecessor. This gives us a more refined description of the sound pattern of the language. 7. If we take logs of the conditional probability, that probability turns out to be the sum of the unigram log probability (an extensive quantity for each phoneme) plus the mutual information between successive phonemes, which is a quantitative measure of how much those two phonemes like to be next to each other in that language. (Henceforth, "log prob" means positive or inverse log probability.)

Study on Russian hypocoristics by Svetlana Soglasnova employing this framework.

4. The word segmentation problem, and the Minimum Description Length approach. A mathematical approach that makes sense to linguists. Powerpoint slides
The word segmentation problem is this: how can we take a long string of phonemes (or letters) which compose a long utterance in a language but with no indication of word boundaries, and from it infer a list of words in the language? This challenge involves an interesting characterization of what a child does in learning a language, though it abstracts away from meaning and personal interaction. How hard is it? What kind of conceptual tools could be useful in tackling it? We will discuss ideas first developed by Michael Brent and Carl de Marcken in the mid-1990s on unsupervised learning of vocabulary.

Reading: Composition and perturbation, Carl de Marcken, ACL 1996. You may wish to read the more detailed description found in Unsupervised Language Acquisition, at the same site.

Software: Here is some software that I wrote that should implement the ideas de Marcken discusses in his dissertation; I had to make a lot of assumptions based on common sense, and many of them may be less than optimal. It's pretty rough in some respects, but is useful. I'd be happy to share the C++ code with anyone who is interested.

Topics: Introduction. The fundamental goal of intelligence in the universe is to maximize the probability of the observed data. That is possible only if we assume some particular model that generates the probability for the observations, and then the probability of the observed data is equal to the probability of the data, under that model, times the probability of the model: Pr(Observations) = Pr(Observations | model ) * Pr (model). We take positive log probabilities, and exhort ourselves to find the minimum value for the quantity: positive log prob ( Observations ) + positive log prob ( model that generated those probabilities ). This is the essence of Minimum Description Length analysis (MDL). The positive log probability ("plog prob") of the model is an explicit measure of the complexity of the model, while the positive log probability of the data is a measure of how poorly (or well) the model fits the observations. The first term corresponds rather well to Chomsky's early notion of the evaluation metric of a grammar.

1. If we assume a unigram model of phonemes (letters), we know how to compute the probability of a corpus. If we take the corpus to be the concatenation of chunks larger than the phonemes, the log prob of the corpus is the sum of the log prob's of the individual chunks. 2. We will calculate explicitly how much better the log probability of a corpus of English gets if we assume that "th" is a chunk with its own statistics. 3. The more chunks we add to our inventory of chunks ( = vocabulary), the smaller the log probability for the corpus gets (remember, smaller is better). (Increase in probability of the corpus, given the model, is the same as decrease in log probability of the corpus, given the model) 4. When do we stop? Will the whole corpus become one large lexical entry? No, it won't, because the probability of the model decreases as the vocabulary increases. We figure out a way to measure the a priori probability of a vocabulary: in essence, it is the product of two factors, the probability of having a vocabulary of that particular size (a probability that decreases as the size gets bigger) times the probability of choosing that particular vocabulary, given the chosen size. 5. Based on this idea, and the assumption that language has chunks larger than phonemes, we arrive at an algorithm that learns linguistic chunks. Are they words? Are they morphemes?

 

5. Computational learning of morphology: second pass. How to learn harder linguistic structure from data.
We will move on to two harder problems. First problem: how can we recognize two distinct morphemes that happen to be homophones (or homographs), like the two suffixes -s in English (one marks plural nouns, the other 3rd person singular present tense verbs)? To do this, we need to draw inferences about the syntactic distribution of the words, even in the absence of any prior knowledge of the syntax of the language. Second problem: how do we recognize morphemes in a language with a morphology richer than those in European languages? A discussion of string edit distance. Application of these techniques to Swahili, Tagalog.


Recommended reading:Using eigenvectors of the bigram graph to infer morpheme identity. Proceedings of the Morphology/Phonology Learning Workshop of ACL-02 Powerpoint of that presentation.

6. A computational approach to rhythmic accent. And now for something completely different.

We will look at a model of rhythmic accent which consists of a simple network, with a small number of continuous (non-discrete) parameters -- 4 parameters, in the basic model. A language (or at least, its accentual system) is located as a point in this 4-dimensional space, and learning consists of moving around in that space until the right location (the setting of these parameters) is found. The system is based on continuous variables inside, and calculates discrete observables, which are the stress patterns.

Based on work done with Gary Larson.

Please read this paper A Dynamic Computational Theory of Accent Systems. That's a MS Word document. If you can't read it, there is an html version, but some of the diagrams are distorted, so read this only as a last resort. Reference: In Perspectives in Phonology, edited by Jennifer Cole and Charles Kisseberth, pp. 1-28. Stanford: Center for the Study of Language and Information, 1994.

Some Powerpoint slides can be found here.

Software running under Windows can be found here. (This is a Visual Basic program going back a few years, so I'm not sure what other files it expects to find; don't be too surprised if this won't run on every computer running a current version of Windows; it probably requires a file called VB6RUN.DLL, which is easily found on the Internet.)

 

 

 

References

General work on grammar induction

Chater, Nick. Neural networks: the new statistical models of mind.

Clark, Alexander. Unsupervised inducation of stochastic context-free grammars using distributional clustering. Clark's homepage is http://www.issco.unige.ch/staff/clark/

Finch, Steven and Nick Chater. 1991. A hybrid approach to the automatic learning of linguistic categories. Artificial Intelligence and Simulated Behaviour Quarterly, 78:16-24.

Grünwald, Peter. A Minimum Description Length approach to grammar inference. In 'Symbolic, Connectionist and Statistical Approaches to Learning for Natural Language Processing' (editors S. Wermter, E. Riloff, G. Scheler), pp 203-216; Lecture Notes in Artificial Intelligence no. 1040. Springer Verlag, Berlin, Germany, 1996. Grünwald's homepage is http://www.cwi.nl/~pdg/publicationpage.html.

Stolz, Walter. 1965. A probabilistic procedure for grouping words into phrases. Language and Speech 8. 219-235. Included here mostly for its historical interest.

Phonology

Albright, Adam, and Bruce Hayes. Rules vs. analogy in English past tenses: a computational/experimental study. See Bruce Hayes' webpage below.

Gildea, Daniel and Daniel Jurafsky. 1996. Learning bias and phonological-rule induction. Computational Linguistics 22:4.

Hayes, Bruce. Go to his publications page for several papers on the acquisition of phonology, several with Adam Albright.

Pierrehumbert, Janet. 2001. Stochastic phonology. GLOT 5 No.6, 1-13. Find this, and other relevant papers, on Pierrehumbert's publications page: http://www.ling.nwu.edu/~jbp/publications.html

Morphology

Jochen Trommer maintains a very good webpage on automatic morphological learning.

Harris, Zelling. 1955. From phoneme to morpheme. Language 31: 190-222.

Harris, Zelling. 1967. Morpheme boundaries within words: report on a computer test..Transformations and Discourse Analysis Papers 73.

Other

Benedetto, Dario, et al. Language trees and zipping. Physical Review Letters 88:4, 28 January 2002.

 


Assignments: Choose 2 -- answering all of them would be too much for a brief three weeks.

1. Linguistica exploration.

1. Download Linguistica onto a PC running Windows, and read the website describing its use. Choose an English text to analyze first. There are links on the Linguistica page to some texts in English and French, and you can go to my page on Internet resources for links to more languages.

2. Decide how many words of running text you wish to analyze: try 20,000, for example. The Tom Sawyer text is quite a bit larger than that. Set this by clicking "Words requested" in the left window (the Tree). Then read the corpus in, by going to the Menu: Reading/Read corpus. You will get a window that allows you to find the corpus you want to analyze. The next time you run the program, you do not need to choose this text again; you can select "Reread", or just type Control-D. How many distinct words are there? (You find the answer to that in the Upper Tree on the left side.)

3. In the upper Tree (on the left), click on Words to see an alphabetized list of the words in the corpus. Click on the column header "corpus count" to sort the words by their frequency in this corpus.

4. Click on "Forward trie" in the Upper Tree, and observe how the words are organized into a tree with common initial-strings (what computational linguists call "prefixes").

5. Click on "Find suffix system", then the first option: "Run all" (or type Control-S). Look at the status bar on the bottom; it will show you what it is doing, and when it is done. If the corpus is less than a few thousand words, it will be done before you look, probably. Click on "suffixes" in the Upper Tree, and observe the suffixes that have been discovered. Note that below "suffixes", and indented, are "signatures" and "stems (suffixed)". Click each of them. What are the 3 most common signatures in this corpus? While you are displaying signatures, click on one, and note that you find the stems in the bottom right window. If you don't see the bottom right window, below the list, you must pull up the bar separating the windows, or else set the screen to a higher resolution on your computer. (Note, too, that you can change font size with the Menu item "View/Fonts").

6. Click on the suffix signature ranked the highest, and see what stems are associated with it. Do they all below to the same lexical category? What category or categories do they belong to? Explore this for the top 10 signatures.

7. Read the top 100 signatures. How many of them are correct / incorrect / difficult-to-say? What kinds of errors are made?

8. Look at the top 20 signatures. How many of them should be viewed as subsignatures of a more common signature? Draw a diagram of the signatures, where one signature (S1) is above another signature (S2), with a line drawn between them, if S2 should be understood as a subsignature of S1. (X is a subsignature of Y if all of the affixes in X are affixes in Y). Could this be done automatically? If so, how?

9. Click on "words" (or type Shift W), and look through the list. Find some words that have not been morphologically analyzed that should have been, and some that have been morphologically analyzed that should not have been.

10. Click on "suffixes" (or type Shift F). Click on a suffix. Note that the signatures that contain that suffix are given in the "Text" box (lower right). What suffix did you choose? What signatures does it appear in? Click one of the signatures in the Text box; note that the signatures now appear in the upper Collection window, and if you click on a signature, its stems appear in the Text box.

11. Click on "Find automatic allomorphy" (or go to the menu item "Allomorphy"), then click on signatures. Click on the top of the first column (where it says "signatures") to sort the signatures alphabetically. What signatures have been created by "find automatic allomorphy"? What do they represent? Remember that you can click on a signature to find the stems that it is associated with.

12. Find the prefixes of words. Explain the difference between "prefixes of stems" and "prefixes of words". What differences do you find between the prefixes that Linguistica has found in these two cases?

13. Click on compounds. Can you evaluate how good a job Linguistica has done here? Try to determine what percentage of true analyses, false positives and false negatives the program has accomplished.

14. Save the results to a subdirectory on your (floppy or hard) drive.

2. Phonological complexity exploration. Explore one of the languages you have an appropriate file for (English, French, Japanese, or a language for which you create the dictionary (dx1 ) file (see next question). Most of the following questions are intended to be thought questions -- they are questions that you can think about for hours, or for years, or for any length inbetween. Please note that you can calculate the complexity for words that are not present in the word-list by modifying the calculations of word that is there. In the case of the unigram model without mutual information, you just need to change the values of the log probabilities of the segments that you have changed; in the case of the unigram model with mutual information (equivalent to the bigram model), you must change the mutual information on each side of any segment you modify, and you can get the mutual information from the left-hand window.

2a. How does the ranking of the segments by frequency correspond to markedness considerations in the language, as described by phonologists?

2b. Consider the words at the high complexity end of the word-list. How does the set of words you find here vary depending on whether you sort using a unigram model or a bigram model? Why do you find that difference? Is it phonologically important or interesting?

2c. Consider several words that have more than one pronunciation; you may choose borrowings for this purpose. How does the average bigram complexity vary across the different pronunciations? Does the least complex form correspond to the form which has been best integrated into the sound pattern of the language, in your opinion?

2d. What values for segment log probability or for mutual information between two segments surprises you the most? Why?

2e. Choose a morphophonemic alternation in the language -- the sort of phenomenon that would be handled by a phonological rule in most theories, such as vowel reduction in English. Does the output of the rule have lower complexity than the input in every case? Explore this question in detail. Why is this significant?

3. Phonological complexity: Develop a phonology complexity file for a new language (one other than English, French, or Japanese). You will probably need some basic programming ability -- knowledge of Perl is ideal. The file you create must have the extension ".dx1". It will consist of a set of lines, each of which has the form:

1. A word (consisting of a consecutive string of alphanumeric characters), followed by a space; then,
2. A number, which is the number of occurrences of the word in the text. If you do not have counts, then put "1" here for every word, followed by a space; then,
3. A sequence of phonemes, separated from each other by a space. A phoneme is just an alphanumeric string: it may be a single character (e.g., B, or b) or it may be a string (e.g., EH1).

That's all.

I'd be delighted if you created such a file for a new language. Note that for a lot of languages, standard orthography is almost good enough for phonemic representation. For example, if you took a word list from Spanish, and kept the various digraphs together (not separated by space), such as ch and ll, and separated the other letters with spaces, you'd almost be done.

 

4. Calculation of the probabilistic benefits of taking th as a basic "chunk" of English text. This is the hardest question, but probably the most interesting. It requires nothing by high school math, but perhaps some mathematical experience, especially in handling notation.

1. Notation: function L maps from positive integers to letters of our alphabet. L(1) = a; L(2) = b; L(26) = z; etc.

2. We will describe a string S as parsed when it is entirely broken up into contiguous pieces of 1 or more characters. (If you like, you may say that a parse is a function from the first N positive integers to substrings of string S - if you find that helpful.) Every character in string S occurs in exactly one piece of the parse. A string can thus have many parses associated with it; a string associated with a particular parse is a parsed string. Such a parsed string can be indexed (which is just a way of saying that, we can refer to the nth piece of it -- the first, the second, and so on). The length of the parse is the number of pieces in the parse. We can define the count of a string(-type) t in a parsed string as the number of times t occurs as a complete piece of the parsed string. We write [t] for the count of t in parse P, and hopefully context will always make it it clear what the parsed string is that we're talking about. So [a] is the number of a's in the parse, and [th] is the number of th's (pieces, that is) in the parse, etc.

3. A parsed string defines a vocabulary in a natural way: the vocabulary is the set of all of the pieces of the parse. Each has a count, and this defines a distribution via the pieces' frequency. Assume that even if it is not marked, a string ends with a special end-of-string symbol # (we use # for this symbol, but the choice is arbitrary), and do not forget to count its occurrence (both type and token).Write an explicit formula for that distribution. The parse of a string in which each character is a piece is called the complete parse. Its length is the same as the number of character-tokens in the string.

4. Question: when is the frequency of the pieces identical to the frequency of the letters comprising the string?

5. Consider the string that consists of a document D. Using Word or some other program, determine how many characters are in it. Determine how many times the sequence "th" occurs in it.

6. Define a parse T for document D in such a way that each substring "th" is a piece, and all other pieces are one character in length. What is the length of this parse? What is the count [th] in this corpus? What is [t], and [h]?

7. Using the capital P (pi) notation, write the probability of corpus D with the complete parse C, using the definition you gave in question 3. Don't do the arithmetic; you can sum over [l-i] (that's l sub i), the count of the i-th letter, where i ranges over 1 to 26. Use the variable N to indicate the total number of characters in D, which you determined in 5 above. Hints: you will be multiplying N different numbers, each of which has a numerator and a denominator. The denominators are all the same, so summarize that in one number with an exponent. Write the product of the numerators in a simple way, using exponents as you need to.

8. Using the same notation, write the probability of parse T (defined in paragraph 6); don't do the arithmetic, leave the answer in symbols. Hints: This is very similar to question 7, but several things have changed. First, the total number of pieces in the string has changed -- it's no longer N; and there is a new "symbol" in the alphabet ("th"); and the counts of the t's and h's has decreased.

9. Find the ratio of the the probability of the complete parse C to that of the parse T. Most of the factors cancel, and you can get a reasonably simple formula.

10. Take base-2 logs of the ratios, and come up with a final answer; plug in the real counts, and get an answer in bits.

Using a log function: remember that we care about base-2 logs. If you have a calculator (like Excel) that gives base 10 logs, divide that answer by log-base-10 of 2 to get the base-2 log. Likewise, if you want to convert a natural log (base e) to log-base-2, divide it by the natural log of 2.

The log ratio of these numbers is the improvement, in bits, of the increase in log probability of the data under the two models. This is the improvement we get in understanding the language by taking th to be a chunk.

If your answer isn't somewhere between 500 and 2000, you've probably made a mistake.

Hint: your answer should be composed of the answer to four questions: how does the treatment of h, of t, and of th change in the two analyses? How does the change in the total number of pieces in the two parses affect the log probabilities of all the other pieces?