Survey of Industrial Linguistics

CSPP 56555

Spring 2004

John Goldsmith
Departments of Computer Science and Linguistics
University of Chicago

This website is certain to be modified up until the start of the spring quarter 2004. If you're thinking about taking this course, this syllabus should give you a detailed picture of what we will cover this year.

Goals of this course

This course focuses on how natural language is being used in software today, and how it will be used in the near future. The goal is not to make you a researcher at the leading edge, but rather:

A few ways in which natural language is used and will be used in the near future in computational applications:

 

The uses of natural language (which is a synonym for human language) in computation can be divided up into these areas (and this list is not exhaustive):

  1. Use of speech:
    automatic speech recognition, in which a user speaks and the computer can at least identify what the words are that the user said; and
    speech synthesis, in which the computer speaks to the user in an understandable and not too implausible manner of speaking
  2. Use of words and word-structure. Spell-checking, figuring out the correct pronunciation of a new word (proper name, neologism, borrowing from a foreign language, etc.)
  3. Grammar and syntax: understanding the meaning of sentences
  4. Use of knowledge of language in the organization of documents; document retrieval
  5. Machine translation from one human language to another

In order to understand the technologies that lie behind these applications, we will need to study:

Required textbook: Daniel Jurafsky and James H. Martin, Speech and Language Processing (2000), Prentice-Hall. You should also visit their website.

There is another excellent textbook on much the same subject, by Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, which also has a website. The book is a bit harder, and covers fewer topics; on the whole, it's a very good book.

Organization of the course

In view of how long each class is - almost three hours, as you well know - I will do my best to divide each class into two topics, so that when we resume after the mid-class break we can feel like we're getting into something new and different. The reading assignments given below are the reading you should do before coming to that class.

Week 1
  Before the break: A history of NLP. Reading assignment: Chapter 1.
  After the break: Introduction to the heart of linguistics: phonetics; phonology; morphology; syntax. Reading assignment: 3.1, 4.1 and 4.2.

Week 2
  Before the break: An introduction to Perl. See the Perl box below for assignment and resources.
  After the break: Regular expressions. Examples. Dividing a corpus into sentences; finding abbreviations. Reading assignment: 2.1.

Week 3
  Before the break: N-gram models of language. Unsmoothed models; smoothing, and backoff. Probability, distributions, and conditional probability. Reading assignment: Chapter 6; Probability for linguists.
  After the break: More about probability. Entropy as average log probability. The effect of calculating the probability of a set of data (a corpus) according to different probabilistic models: cross-entropy. KL-divergence as the method of choice for comparing two distributions. Author identification (word-based language model); language identification (letter-based language model).

Week 4
  Before the break: Morphology. Reading: Section 3.1, and skim the rest of the chapter.
  After the break: String edit distance (minimum edit distance, Levenshtein distance). Applications in NLP and bioinformatics. Dynamic programming algorithm: in the simplest case, finding the best alignment of two words M and N by keeping track of 3 numbers for each of m x n cases, where the two words are of length m and n. Reading assignment: Section 5.6. The example is full of errors. Find them all and correct them in your book.
  Assignment based on this week's material: String edit distance problem.

Week 5
  Before the break: Word classes and part-of-speech tagging. Reading: Chapter 8.1 to 8.5.
  After the break: Transformation-based tagging. Reading: Section 8.6.

Week 6
  Before the break: Grammar: Phrase-structure and context-free grammars. Reading: Chapter 9.1 - 9.8, 9.11.
  After the break: Parsing: Earley parsing. Reading: Chapter 10, especially 10.4.

Week 7
  Before the break: Word-sense disambiguation. Reading: Chapter 17.
  After the break: Information retrieval.

Week 8
  Before the break: Machine translation (MT). Reading: Chapter 21.
  After the break: Rapid development of statistical MT systems.

Week 9
  Before the break: Speech recognition: basic architecture. Reading: Chapter 7.
  After the break: Hidden Markov models (HMMs).

Week 10
  Before the break: Speech synthesis: generating intonation that sounds natural. Reading: Dealing with prosody in a Text to Speech system. John Goldsmith, International Journal of Speech Technology 3: 51-63.
  After the break: Wrapping up the course: a summary and synthesis.
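
The dynamic programming algorithm for string edit distance listed under Week 4 can be sketched as follows. This is only an illustration, written in Python rather than the Perl used in class; the function name edit_distance and the sub_cost parameter are assumptions of this sketch, not part of the course materials. Each cell of the table is filled by comparing three numbers: the cost of arriving by a deletion, by an insertion, or by a substitution (or match).

```python
def edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two strings by dynamic programming.

    Insertions and deletions cost 1; substituting one symbol for a
    different one costs sub_cost (1 gives plain Levenshtein distance).
    """
    m, n = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i          # delete every symbol of source[:i]
    for j in range(1, n + 1):
        dist[0][j] = j          # insert every symbol of target[:j]

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = source[i - 1] == target[j - 1]
            # The three numbers compared for each of the m x n cells:
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            substitution = dist[i - 1][j - 1] + (0 if same else sub_cost)
            dist[i][j] = min(deletion, insertion, substitution)
    return dist[m][n]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("intention", "execution", sub_cost=2))
```

With unit costs, edit_distance("kitten", "sitting") evaluates to 3. Setting sub_cost=2 makes a substitution as expensive as a deletion plus an insertion; under that weighting the pair "intention" / "execution" comes out at 8.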

 

Additional suggested readings

Week 1
  Lillian Lee: "I'm sorry Dave, I'm afraid I can't do that": Linguistics, statistics, and natural language processing circa 2001.
  Steven Abney. Statistical methods and linguistics. In The Balancing Act, Judith Klavans and Philip Resnik, eds., MIT Press.
  Alan M. Turing, 1950. Computing machinery and intelligence. Mind LIX(236), pp. 433-460.
  Eliza
  Wilensky: An AI Approach to NLP Early AI NLP Efforts

Week 2
  See the section of the syllabus below on Perl.

Week 3
  Very nice PowerPoint presentation by Josh Goodman (Microsoft Research).

Week 4
  Trost, Harald. Computational Morphology.
  John Goldsmith, 2001. Unsupervised learning of natural language morphology.
  Morphological learning (Jochen Trommer)
  String edit distance: Ristad and Yianilos, 1997. Learning String Edit Distance.

Week 5
  (none listed)

Week 6
  (none listed)

Week 7
  Yarowsky, D. "Unsupervised Word Sense Disambiguation Rivaling Supervised Methods." In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics. Cambridge, MA, pp. 189-196, 1995.

Week 8
  Bonnie Dorr, Pamela Jordan, and J. Benoit, 1999. A Survey of Current Paradigms in Machine Translation. Advances in Computers, edited by Marvin V. Zelkowitz, Vol. 49, Academic Press.
  Kevin Knight, 1999. A Statistical MT Tutorial Workbook.

Week 9
  (none listed)

Week 10
  A Short Introduction to Text-to-Speech Synthesis (Thierry Dutoit)

 

Other general resources:

Kenji Kita's web page

Michael Barlow's Corpus Linguistics website.

Some links I have put together to relevant sites.


Perl

Perl programming basics. Perl is an ideal language for many simple language-processing tasks, so we use it to demonstrate a few basic approaches involving counting and sorting, and to introduce linguistically relevant programs.

We will learn the basics of Perl by working our way through the following program, which we will build iteratively through several stages.

WordFreq1.pl        WordFreq2.pl        WordFreq3.pl        WordFreq4.pl
WordFreq5.pl        WordFreq6.pl        WordFreq7.pl        WordFreq8.pl

Regular expressions, and how they can be used. Here are the Perl programs, some of which I discuss in class.

Links for you on Perl:

    1. http://www.activestate.com/Products/ActivePerl/Download.html is the best link to use to download Perl to your computer.
    2. Check out http://www.perl.com/ , from O'Reilly publishers, as a first link to information on Perl.
    3. The Comprehensive Perl Archive Network (CPAN) is a good second link to information on Perl.
    4. Nice introduction to Perl from the University of Missouri.
    5. Another very nice introduction, this one from the University of Kansas!
    6. Very good step-by-step explanation of getting Perl set up under Windows by Selena Sol and Nikhil Kaul.

Regular expressions (also known as regexps). Just search the Internet for "regular expressions"; there is a lot of extremely useful material, such as:
    1. Steve Ramsay's Guide to regular expressions.
    2. Dario Gomes's Learning to use regular expressions.
    3. An introduction by Larry Mak -- C#, Java, and Perl.
There are a large number of freeware (not to mention shareware) text editors that incorporate capabilities to use regular expressions in searches and replacements. I haven't explored any of them in depth (as I write these words), but Crimson Editor looks good (http://www.crimsoneditor.com/) -- or just enter "freeware text editor" in your favorite search engine to find a program that runs on the platform you prefer.
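
As a small illustration of what regular expressions can do for the Week 2 topics (dividing a corpus into sentences; finding abbreviations), here is a minimal sketch. It is written in Python rather than Perl, and the function name split_sentences and the short ABBREVIATIONS list are assumptions made for this example only; a real abbreviation list would be much longer, or learned from the corpus.

```python
import re

# Hypothetical, deliberately tiny abbreviation list (an assumption of this sketch).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text at ., ! or ? when followed by whitespace and a capital
    letter, unless the word ending at that point is a known abbreviation."""
    sentences = []
    start = 0
    # Candidate boundary: end punctuation, whitespace, then an upper-case letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.start() + 1]
        last_word = candidate.split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue            # "Dr. Smith": not a sentence boundary
        sentences.append(candidate.strip())
        start = match.end()
    rest = text[start:].strip()
    if rest:
        sentences.append(rest)  # whatever follows the last boundary
    return sentences


if __name__ == "__main__":
    for s in split_sentences("Dr. Smith arrived. He sat down! Was it late? Yes."):
        print(s)
```

The heuristic fails in both directions (a sentence may really end in "etc.", and a sentence may begin with a lower-case word), which is part of why the problem is worth studying.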

Assignment:
1. Modify one of the Perl programs you have so that it produces the frequency of each word (that is, the number of occurrences of the word divided by the total number of words).
2. For each word, calculate the Zipf product: the frequency of the word times its rank in the frequency list.
3. Find the average value of the Zipf product for the 1,000 most frequent words in your corpus. Do this for three different corpora in English. Compare the values. Is it reasonable to speak of a "Zipf constant"?
4. Optional: Redo exercise 3 above with at least one other corpus from a language other than English. Compare the Zipf products, and try to explain the difference or similarity that you find.
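
To make the definitions in the assignment concrete, here is a rough sketch of the quantities involved, in Python rather than Perl (your submitted work should still modify one of the Perl programs). The function names, the simple tokenizer, and the rule that ties in count are ranked alphabetically are all assumptions of this sketch, not requirements of the assignment.

```python
import re
from collections import Counter


def zipf_products(text, top_n=1000):
    """Return (word, rank, frequency, rank * frequency) for the top_n words.

    Frequency is the word's count divided by the total number of words.
    """
    # Assumed tokenizer: lower-cased runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Most frequent first; ties broken alphabetically (an arbitrary choice).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    rows = []
    for rank, (word, count) in enumerate(ranked[:top_n], start=1):
        frequency = count / total
        rows.append((word, rank, frequency, rank * frequency))
    return rows


def average_zipf_product(text, top_n=1000):
    """Mean of rank * frequency over the top_n most frequent words."""
    rows = zipf_products(text, top_n)
    return sum(row[3] for row in rows) / len(rows) if rows else 0.0


if __name__ == "__main__":
    toy_corpus = "the cat the dog the cat"
    for row in zipf_products(toy_corpus):
        print(row)
    print(average_zipf_product(toy_corpus))
```

A six-word toy corpus says nothing about Zipf's law, of course; the point is only to show how rank, frequency, and their product are computed before you run the real experiment on full corpora.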