Statistical Natural Language Processing
John Goldsmith
Spring 2003
Text: Foundations of Statistical Natural Language Processing, Christopher D. Manning and Hinrich Schuetze
This course will cover the basic ideas of contemporary statistical natural language processing in the first half, and then focus on two particular areas: automatic learning of morphology, and statistically-based machine translation.
Course requirements: to be specified.
Weeks 1 - 2. Basics of probability theory and information theory: Probability for linguists (Goldsmith) and Chapter 2 of Manning and Schutze. Link to discussion of how the log probability of a string changes when you add 10 e's to it: preparation for Assignment 1 below.
Week 3. Collocations: M and S, Chapter 5.
Week 4. Biword and trigram distribution; sparsity and back-off: M and S, Chapter 6
Week 5. Alignment 1: String Edit Distance. Powerpoint.
Week 6. Word
segmentation: de Marcken, Brent
Reading: Composition and perturbation, Carl de
Marcken, ACL 1996
Week 7. Automatic learning of morphology 1: European languages.
Reading: Unsupervised
learning of the morphology of a natural language.Computational
Linguistics 27:2 (2001) pp. 153-198
Week 8. Automatic learning of morphology 2: rich morphological systems; automatic learning of syntactic behavior using Week neighbor-word vectors, and spectral analysis of matrices
Weeks 9.-10. Chapter 13 Statistical Alignment and Machine Translation
Assignments
1. Notation: function L maps from positive integers to letters of our
alphabet. L(1) = a; L(2) = b; L(26) = z; etc.
2. We will describe a
string S as parsed if it is entirely broken up into contiguous pieces of
1 or more characters. A parse is a function from the first N positive integers
to substrings of string S, and every character in string S occurs in exactly one
piece of the parse. A string can thus have many parses associated with it; a
string associated with a particular parse is a parsed string. Such a
parsed string can be indexed (that is, we can refer to the nth piece of
it), and we can define the count of a string(-type) S in a parsed string
as the number of times S occurs as a complete piece of the parsed string. We
write Count(P, S) for the count of S in parse P when we want to be very
explicit, or simply [S] when context makes it clear. So [a] is the number of
a's in the parse, and [th] is the number of th's (pieces, that is)
in the parse, etc.
3. A parse defines a vocabulary in a natural way: the
vocabulary is the set of all of the pieces of the parse. Each has a count, and
this defines a distribution via the pieces' frequency. Write an explicit formula
for that distribution.
4. Question: when is the frequency of the
pieces identical to the frequency of the letters comprising the
string?
5. We can define a length function as well: Length
(P,S, 1) is the number of pieces of length 1, Length (P, S,2) is the number
of pieces of length 2, and in general Length (P, S, n) is the number of
pieces of length n in parse P. Write an equation that says: the length of
the entire string S, described in terms of the number of characters, is, for a
given parse P of S, equal to the sum (over all lengths) of the length times the
number of pieces of that length in the string.
6. The parse of a string
whose length is the character-length of the string is called the complete
parse: every character is a piece.
7. Consider the string that
consists of a document D.
Using Word or some other program, determine how many characters
are in it. Determine how many times the sequence "th" occurs in it.
8. Define a parse T for document D in such a way that each "th" is a
piece, and all other pieces are one character in length. How long is this parse?
What is the Count (T, "th" )? What is Count (T,"t"), Count (T,"h")? Notation: we
can refer to these (respectively) as [th], [t], and [h]. When we write
complicated equations, that's a big help --
9. Using the capital P (pi) notation, write
the probability of corpus D with the complete parse C, using the definition
you gave in question 3. Don't do the arithmetic; you can sum over [l-i] (that's
l sub i), which is short for "Count(C, L(i))", where i ranges over 1 to 26. Use
the variable N to indicate the total number of characters in D, which you
determined in 7 above. Hints: you will be multiplying N different numbers, each
of which has a numerator and a denominator. The denominators are all the same,
so summarize that in one number with an exponent. Write the product of the
numerators in a simple way, using exponents as you need to.
10. Using the
same notation, write the probability of parse T (defined two paragraphs above);
don't do the arithmetic, leave the answer in symbols. Hints: This is very
similar to question 9, but several things have changed. First, the total number
of pieces in the string has changed -- it's no longer N; and there is a new
"symbol" in the alphabet ("th"); and the counts of the t's and h's has
decreased.
11. Find the ratio of the the probability of the complete
parse C to that of the parse T. Most of the factors cancel, and you can get a
reasonably simple formula. Depending on how you simplify, you will get about 6
different factors raised to various powers.
12. To get a real numerical answer, it's best to take logs. So take logarithms of your formula in 11, and calculate the answer numerically.
Assignment 2 Due April 21 (Monday)
1. Download the letter counts from texts in English, French, German, Swahili, and a mystery passage from this link (or this one for a tab-deliminted text file).
2. Using a spreadsheet, calculate the frequencies of each of the letters, the log frequencies, and the entropy of each text. Do it like this: Make a series of columns with the following data: Counts, frequencies, positive log frequencies, frequency times positive log frequency. Create 5 copies of this in 5 separate files, one for each language; after you've done the first, the next ones are very easy -- you just cut and paste the new data into the first column.
3. Using a spreadsheet, compute the crossentropies of all the pairs among the known languages. This should be a lot easier than it might sound. Your strategy should be to take each of the five files you created in (2), and add a series of blocks of columns, one block for each of the other languages. Each block of columns will be like the block of columns you made in (2), except that the positive log frequencies will come from the "base" language whose log frequencies are at the left.
4. Which languages are most similar? Which languages are most different? Use cross-entropy to answer that question.
5. What language is the mystery language written in?
Assignment 3 Due May 2 (Friday)
Click on this link: assignment 3.
Click here for solution to exercise 1.
Assignment 4 Due May 12 (Monday)
Build a tableau in which you compute the string edit distance between the words LINGUISTICS and LANGUAGE, given that:
1. vowel for vowel substitutions cost 0.5;
2. consonant for consonant substitutions cost 1.0;
3. other substitutions cost 3;
4. deletions and insertions each cost 2;
Look at the xeroxed handout and the powerpoint presentation on string edit distance.