Survey of Industrial Linguistics
CSPP 56555
Spring 2004
John Goldsmith
Departments of Computer Science and Linguistics
University of Chicago
This website is certain to be modified up until the start of the spring quarter 2004. If you're thinking about taking this course, this syllabus should give you a detailed picture of what we will cover this year.
Goals of this course
This course focuses on how natural language is being used in software today, and how it will be used in the near future. The goal is not to make you a researcher at the leading edge, but rather:
A few ways in which natural language is used and will be used in the near future in computational applications:
The uses of natural language (which is a synonym for human language) in computation can be divided up into these areas (and this list is not exhaustive):
In order to understand the technologies that lie behind these applications, we will need to study:
Required textbook: Daniel Jurafsky and James H. Martin: Speech and Language Processing (2000), Prentice-Hall. You should also visit their website.
There is another excellent textbook on much the same subject, by Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, which also has a website. The book is a bit harder, and covers fewer topics; on the whole, it's a very good book.
Organization of the course
Since each class is long (almost three hours, as you well know), I will do my best to divide each class into two topics, so that when we resume after the mid-class break we can feel like we're getting into something new and different. The reading assignments given below indicate the reading you should do before coming to that class.
| Week | Before the break... | ...after the break | Assignment based on this week's material |
| 1 | A history of NLP. Reading assignment: Chapter 1. | Introduction to the heart of linguistics: phonetics; phonology; morphology; syntax. Reading assignment: 3.1, 4.1 and 4.2. | |
| 2 | An introduction to Perl. See the Perl box below for assignment and resources. | Regular expressions. Examples. Dividing a corpus into sentences; finding abbreviations. Reading assignment: 2.1. | |
| 3 | N-gram models of language. Unsmoothed models; smoothing, and backoff. Probability, distributions, and conditional probability. Reading assignment: | More about probability. Entropy as average log probability. The effect of calculating the probability of a set of data (a corpus) according to different probabilistic models: cross-entropy. KL-divergence as the method of choice for comparing two distributions. Author identification (word-based language model); language identification (letter-based language model). | |
| 4 | Morphology. Reading: Section 3.1, and skim the rest of the chapter. | String edit distance (minimum edit distance, Levenshtein distance). Applications in NLP and bioinformatics. Dynamic programming algorithm: in the simplest case, finding the best alignment of two words M and N by keeping track of 3 numbers for each of m x n cases, where the two words are of length m and n. Reading assignment: Section 5.6. The example is jam-full of errors. Find them all and correct them in your book. | String edit distance problem. |
| 5 | Word classes and part-of-speech tagging. Reading: Chapter 8.1 to 8.5. | Transformation-based tagging. Reading: Section 8.6. | |
| 6 | Grammar: phrase-structure and context-free grammars. Reading: Chapter 9.1 - 9.8, 9.11. | Parsing: Earley parsing. Reading: Chapter 10, especially 10.4. | |
| 7 | Word-sense disambiguation. Reading: Chapter 17. | Information retrieval. | |
| 8 | Machine translation (MT). Reading: Chapter 21. | Rapid development of statistical MT systems. | |
| 9 | Speech recognition: basic architecture. Reading: Chapter 7. | Hidden Markov models (HMMs). | |
| 10 | Speech synthesis: generating intonation that sounds natural. Reading: Dealing with prosody in a Text to Speech system. John Goldsmith, International Journal of Speech Technology 3: 51-63. | Wrapping up the course: a summary and synthesis. | |
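The dynamic programming idea behind the week 4 topic (string edit distance) can be sketched as follows. This is a minimal illustration in Python, not the textbook's worked example and not one of the course programs. The function name and the `substitution_cost` parameter are assumptions introduced here. One plausible reading of "3 numbers" per cell is the three candidate costs (deletion, insertion, substitution) compared at each of the m x n positions.

```python
def edit_distance(source: str, target: str, substitution_cost: int = 1) -> int:
    """Minimum edit distance between two strings by dynamic programming.

    Insertions and deletions cost 1 each; the substitution cost is a
    parameter (an assumption of this sketch) so that both the unit-cost
    and the cost-2 conventions can be tried.
    """
    m, n = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i          # delete every character of source[:i]
    for j in range(1, n + 1):
        dist[0][j] = j          # insert every character of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            # The three candidate costs considered at each cell.
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            substitution = dist[i - 1][j - 1] + (0 if same else substitution_cost)
            dist[i][j] = min(deletion, insertion, substitution)
    return dist[m][n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("intention", "execution", substitution_cost=2))
```

Keeping the whole table, rather than only the previous row, uses more memory but makes it possible to trace back through the cells to recover the alignment itself.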
Additional suggested readings
| Week | | |
| 1 | Lillian Lee: "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa 2001. Steven Abney: Statistical methods and linguistics. In The Balancing Act, Judith Klavans and Philip Resnik, eds., MIT Press. Alan M. Turing, 1950: Computing machinery and intelligence. Mind LIX(236), pp. 433-460. Wilensky: An AI Approach to NLP. Early AI NLP Efforts. | |
| 2 | See the section of the syllabus below on Perl. | |
| 3 | Very nice PowerPoint presentation by Josh Goodman (Microsoft Research). | |
| 4 | Trost, Harald: Computational Morphology. John Goldsmith, 2001: Unsupervised learning of natural language morphology. Morphological learning (Jochen Trommer). | String edit distance. Ristad and Yianilos: Learning String Edit Distance. 1997. |
| 5 | | |
| 6 | | |
| 7 | Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods." In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, pp. 189-196, 1995. | |
| 8 | Bonnie Dorr, Pamela Jordan, and J. Benoit, 1999: A Survey of Current Paradigms in Machine Translation. In Advances in Computers, edited by Marvin V. Zelkowitz, Vol. 49, Academic Press. Kevin Knight, 1999: A Statistical MT Tutorial Workbook. | |
| 9 | | |
| 10 | A Short Introduction to Text-to-Speech Synthesis (Thierry Dutoit). | |
Other general resources:
Michael Barlow's Corpus Linguistics website.
Some links I have put together to relevant sites.
Perl
Perl programming basics: Perl is an ideal language to work in for many simple
language processing tasks, so we use it to demonstrate a few basic approaches
involving counting and sorting, and to introduce linguistically relevant programs.
We will learn the basics of Perl by working our way through the following program, which we will build iteratively through several stages.
WordFreq1.pl
WordFreq2.pl
WordFreq3.pl
WordFreq4.pl
WordFreq5.pl
WordFreq6.pl
WordFreq7.pl
WordFreq8.pl
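The WordFreq programs themselves are written in Perl and are not reproduced here. As a rough illustration of the counting-and-sorting approach they are meant to teach, here is a minimal sketch in Python. The function names and the tokenization rule (lowercase everything, treat runs of letters and apostrophes as words) are assumptions of this sketch, not details taken from the course programs.

```python
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count word tokens in a text.

    Tokenization is an assumption of this sketch: the text is lowercased
    and every run of letters or apostrophes counts as one word.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def sorted_by_count(counts: Counter) -> list:
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    sample = "The cat saw the dog. The dog ran."
    for word, count in sorted_by_count(word_counts(sample)):
        print(count, word)
```

A dictionary keyed by word plays the same role here that a hash plays in Perl: one pass over the corpus to count, then one sort to order the results.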
Regular expressions, and how they can be used. Here are (some of) the Perl programs I discuss in class.
Links for you on Perl:
1. http://www.activestate.com/Products/ActivePerl/Download.html is the best link to use to download Perl to your computer.
2. Check out http://www.perl.com/, from O'Reilly publishers, as a first link to information on Perl.
3. The Comprehensive Perl Archive Network (CPAN) is a good second link to information on Perl.
4. Nice introduction to Perl from the University of Missouri.
5. Another very nice introduction, this one from the University of Kansas!
6. Very good step-by-step explanation of getting Perl set up under Windows, by Selena Sol and Nikhil Kaul.
Regular expressions (also known as regexps):
Just search the Internet for "regular expressions"; there's a lot of extremely
useful material, such as
1. Steve Ramsay's Guide to regular expressions.
2. Dario Gomes's Learning to use regular expressions.
3. An introduction by Larry Mak (C#, Java, and Perl).
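One of the regular-expression tasks named in the schedule is dividing a corpus into sentences while coping with abbreviations. The following is a minimal sketch of that idea in Python, not one of the programs discussed in class. The abbreviation list, the boundary pattern, and the function name are all assumptions chosen for illustration; a real corpus would need a much longer list and more careful rules.

```python
import re

# A tiny, hypothetical abbreviation list; real text needs many more entries.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "e.g.", "i.e.", "etc."}

# Candidate boundary: ., ! or ? followed by whitespace and a capital letter.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list:
    """Split text at candidate boundaries, skipping known abbreviations."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)                       # just past the punctuation
        last_token = text[start:end].split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue                             # "Dr." does not end a sentence
        sentences.append(text[start:end].strip())
        start = match.end()                      # skip the whitespace too
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He sat down! Why? Nobody knows."):
        print(sentence)
```

The same pattern syntax carries over to Perl almost unchanged; the lookahead `(?=[A-Z])` checks for a following capital letter without consuming it.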
There are a large number of freeware (not to mention shareware) text editors
that incorporate capabilities to use regular expressions in searches and replacements.
I haven't explored any of them in depth (as I write these words), but Crimson
Editor looks good (http://www.crimsoneditor.com/) -- or just enter "freeware
text editor" in your favorite search engine to find a program that runs on the
platform you prefer.
Assignment:
1. Modify one of the Perl programs you have so that it produces the frequency
of each word (that is, the number of occurrences of the word divided by the
total number of words).
2. For each word, calculate the Zipf product: the frequency of the word times
its rank in the frequency list.
3. Find the average value of the Zipf product for the 1,000 most frequent words
in your corpus. Do this for three different corpora in English. Compare the
values. Is it reasonable to speak of a "Zipf constant"?
4. Optional: Redo exercise 3 above with at least one other corpus from a language
other than English. Compare the Zipf products, and try to explain the difference
or similarity that you find.
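To make the definitions in the assignment concrete, here is a minimal sketch in Python of the Zipf product and its average over the top-ranked words. It is an illustration of the arithmetic only, not a model solution, and the assignment itself asks for a modified Perl program. The function names, the `top_k` parameter, and the rule that ties in count are broken alphabetically (so every word gets a distinct rank) are assumptions of this sketch.

```python
from collections import Counter


def zipf_products(counts: Counter) -> list:
    """Return (word, rank, frequency, rank * frequency) for every word.

    Frequency is the word's count divided by the total number of tokens.
    Rank 1 is the most frequent word; ties are broken alphabetically,
    which is an assumption of this sketch.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot compute frequencies for an empty corpus")
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (word, rank, count / total, rank * count / total)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


def mean_zipf_product(counts: Counter, top_k: int = 1000) -> float:
    """Average the Zipf product over the top_k most frequent words."""
    products = [product for _, _, _, product in zipf_products(counts)[:top_k]]
    return sum(products) / len(products)


if __name__ == "__main__":
    toy = Counter({"the": 4, "of": 2, "cat": 2})
    for row in zipf_products(toy):
        print(row)
    print(mean_zipf_product(toy))
```

If a corpus has fewer than `top_k` distinct words, the average is simply taken over all of them.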