An Overview of Industrial Linguistics
CS 324 583 01; Ling 260 391 01
John Goldsmith and Sean Fulop
ja-goldsmith@uchicago.edu
(http://humanities.uchicago.edu/faculty/goldsmith)
sfulop@midway.uchicago.edu
(http://people.cs.uchicago.edu/~sfulop/index.htm)
Tuesdays 5:30 to 8:30 pm Spring quarter 2001
Teaching assistant: Derrick Higgins dchiggin@midway.uchicago.edu
General requirements: The grade will be based on homework assignments, listed below, and a term project, which will be a computational project selected from the list below (though we will be willing to consider a different project if you want to make the case). [List does not presently appear here.] You will notice that there is no homework assigned after Week 6; this is so that you can concentrate on your term project.
Programming skills: Perl is the language of choice for many of the projects involved in this course. If you don't already know Perl, but know C, we think it might very well be worth your time to spend a long evening and learn enough Perl to write code for these projects.
Readings and links
Two kinds of reading: Some of the assigned readings below are readings for background; others are straight readings. The difference is that reading for background is material that you should read to get the big picture and so that you can go back there later if you find you need to understand a concept in detail. You are expected to know the material in the reading-for-background sections. Straight reading is material that you are expected to study carefully and learn. Assignments that are not marked "reading for background" are intended as straight reading assignments.
Principal textbook: Daniel Jurafsky and James H. Martin (2000). Speech and Language Processing. Prentice Hall.
Home page for website: http://www.cs.colorado.edu/~martin/SLP/slp-web-resources.html
Other readings: Other assigned readings will be distributed gratis to registered students.
Assignments and suggested readings from:
Charniak, Eugene (1993). Statistical Language Learning. Cambridge, MA: MIT Press.
Jelinek, Frederick (1997). Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press.
Keller, Eric (1994). Fundamentals of Speech Synthesis and Speech Recognition. John Wiley & Sons.
Manning, Christopher D., and Schütze, Hinrich (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.
Osherson, D. N., Stob, M., and Weinstein, S. (1986). Systems that Learn. Cambridge, MA: MIT Press.
Sproat, Richard (1992). Morphology and Computation. Cambridge, MA: MIT Press.
An online (free) resource: Survey of the State of the Art in Human Language Technology (1996) http://cslu.cse.ogi.edu/HLTsurvey/
Week 1 Overview of the course Fulop and Goldsmith
Introduction: who we are and what our backgrounds are. Organization: readings, assignments, meetings, office hours and appointments. Programming expectations; a few important words about Perl. A quick spin through the whole syllabus.
Reading for next week: Keller, Chapter 1. Reading for background: Jurafsky and Martin: Chapters 1 and 2. Some of this you will need to understand in order to understand material in Chapter 3. Read for next week: Jurafsky and Martin, Chapter 4: Read 91-110; 120-130. Chapter 5: 141-184. Note that some of Chapter 5 requires some knowledge of probability, which some of you may not currently have (we will cover it in Week 6). Do the best you can. The Viterbi algorithm is extremely important, and is used widely in both computational linguistics and in other computational areas.
Read "Comparative Evaluation of Letter-to-Sound Conversion Techniques for English Text-to-Speech Synthesis," R. I. Damper et al.
Assignment for next week: Download the Brown corpus. Write a program that provides a frequency-sorted list of words in the corpus. Reach a decision as to how to treat punctuation and the distinction between capitalized and non-capitalized words. Submit the code and the output, plus any thoughts you have on significant decisions you needed to make in writing the program. If you are programming in Perl, you can use its hashes ("associative arrays"). If you're writing in C++, you will need to learn to use a "map" class (hash). Hopefully you won't have to write it yourself, but that option is always there.
On the Brown corpus: http://www.hit.uib.no/icame/brown/bcm.html#bc3 There are many places to download it; one is http://humanities.uchicago.edu/faculty/goldsmith/data/Browncorpus.txt
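To make the shape of the Week 1 assignment concrete, here is a minimal sketch in Python rather than Perl (the dictionary-based `Counter` plays the role of a Perl hash). The function name `word_frequencies`, the `fold_case` switch, and the tokenizing pattern are illustrative assumptions, not a prescribed solution. In particular, the pattern embodies one possible decision about punctuation (strip it, but keep word-internal apostrophes and hyphens), and you may well decide differently.

```python
import re
from collections import Counter


def word_frequencies(text, fold_case=True):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    if fold_case:
        # One possible decision: treat "The" and "the" as the same word.
        text = text.lower()
    # Assumed tokenization: runs of letters, optionally joined by internal
    # apostrophes or hyphens ("didn't", "well-known"); other punctuation is dropped.
    tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)
    counts = Counter(tokens)
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    sample = "The cat saw the dog. The dog didn't see the cat."
    for word, count in word_frequencies(sample):
        print(f"{count:6d}  {word}")
```

To run it on the corpus, read the downloaded file into a string and pass it to `word_frequencies`; calling it with `fold_case=False` shows how much the capitalization decision changes the list.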
Week 2 Phonetics and phonology Fulop and Goldsmith
Assignment: Letter to sound relationships in English. Download the Nettalk.data.gz labeled corpus at ftp://svr-ftp.eng.cam.ac.uk/pub/pub/pub/comp.speech/dictionaries. Using this as your data source, write a program that will determine the phonemic realizations of each letter in English, associating with each phoneme the proportion of the time that the letter is realized that way. E.g., the letter L is realized 91% (or .91) as the phoneme L, and 9% (.09) as NULL (e.g., in calm). Then write a program to do the inverse, that is, to show for each phoneme what letters can represent it, along with their frequencies.
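The counting involved can be sketched as follows, in Python rather than Perl. The sketch assumes that each corpus entry has already been parsed into a (spelling, phoneme string) pair of equal length, aligned symbol by symbol, with "-" standing for NULL; check that assumption against the actual file before relying on it. The function name `realization_table` and the toy entries (including their phoneme symbols) are made up for illustration and are not taken from the corpus.

```python
from collections import Counter, defaultdict


def realization_table(aligned_pairs):
    """Map each left-hand symbol to {right-hand symbol: proportion}."""
    counts = defaultdict(Counter)
    for left, right in aligned_pairs:
        if len(left) != len(right):
            continue  # skip entries that are not aligned one symbol to one symbol
        for a, b in zip(left, right):
            counts[a][b] += 1
    table = {}
    for a, counter in counts.items():
        total = sum(counter.values())
        table[a] = {b: n / total for b, n in counter.items()}
    return table


# Hypothetical toy data: (spelling, aligned phonemes), "-" meaning NULL.
toy = [("calm", "ka-m"), ("lamp", "l@mp"), ("ball", "bcl-")]

letter_to_phoneme = realization_table(toy)
# The inverse table only requires swapping the two members of each pair.
phoneme_to_letter = realization_table((phonemes, word) for word, phonemes in toy)

if __name__ == "__main__":
    print(letter_to_phoneme["l"])
    print(phoneme_to_letter["-"])
```

Because the same function serves both directions, the second half of the assignment reduces to reversing each pair before counting.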
Some on-line resources:
On punctuation: Say, Bilge and Akman, Varol (1997). Current Approaches to Punctuation in Computational Linguistics. Computers and the Humanities 30(6):457-469. http://cogprints.soton.ac.uk/documents/disk0/00/00/01/98/index.html
Letter to sound (LTS): Issues in Building General Letter to Sound Rules (Black et al)
Reading for next week: Jurafsky and Martin: From Chapter 3, read pp. 57-71. Read for background: pp. 71-82. Straight reading: pp. 82-88. Read about the Viterbi algorithm, which is very important, and which we'll encounter three times during the quarter. Jurafsky and Martin cover it on pp. 177ff and 244ff; read those passages.
Suggested: You might also want to look at Sproat, Morphology and Computation.
Week 3 English morphology Goldsmith
We will begin with a discussion of the Viterbi algorithm, in connection with Minimum String Edit and probabilistic letter to sound conversion. Slides: Powerpoint; any-browser-readable format.
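As a reminder of what Minimum String Edit computes, here is a minimal dynamic-programming sketch in Python. The function name and the default unit costs are assumptions made for illustration; some presentations charge 2 for a substitution, which the `sub_cost` parameter allows. The table-filling pattern, in which each cell is the best of a small number of predecessor cells, is the same idea that underlies the Viterbi algorithm.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Cheapest sequence of insertions, deletions and substitutions
    turning source into target, computed by dynamic programming."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of converting source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete from source
                dist[i][j - 1] + ins_cost,                       # insert into source
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitute
            )
    return dist[n][m]


if __name__ == "__main__":
    print(min_edit_distance("intention", "execution"))
    print(min_edit_distance("intention", "execution", sub_cost=2))
```

Keeping a backpointer in each cell, in addition to the cost, would let you recover the alignment itself rather than only its cost.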
Reading for next week: Jurafsky and Martin: Read Chapter 8 to p. 298 and all of Chapter 9, and start Chapter 10.
Assignment for next week: Write a program to determine compounds in a corpus of English. Run it on a large corpus (e.g., Brown corpus), and determine (by sampling, if necessary) how well it works. Submit the code, the output, and its score; explain your scoring method. Hint: One natural way to try to find compounds is to look for words which can be spelled as the concatenation of two independently existing words in the corpus. Hint 2: That strategy will include false compounds like "mean" (me + an) and "meat" (me + at). Be sure to deal with that problem.
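The strategy in the first hint can be sketched in a few lines of Python. The function name `find_compounds` and the `min_part_len` threshold are illustrative assumptions: requiring each part to have a minimum length is only one crude way of addressing the problem raised in Hint 2, and a frequency-based filter or a stop list might serve better.

```python
def find_compounds(vocabulary, min_part_len=3):
    """Return {word: [(left, right), ...]} for every word that can be split
    into two words that both occur independently in the vocabulary."""
    vocab = set(vocabulary)
    compounds = {}
    for word in vocab:
        splits = [
            (word[:i], word[i:])
            for i in range(min_part_len, len(word) - min_part_len + 1)
            if word[:i] in vocab and word[i:] in vocab
        ]
        if splits:
            compounds[word] = splits
    return compounds


if __name__ == "__main__":
    # Hypothetical toy vocabulary; in the assignment this comes from the corpus.
    words = ["black", "bird", "blackbird", "book", "case", "bookcase",
             "me", "an", "at", "mean", "meat"]
    print(sorted(find_compounds(words)))
    print(sorted(find_compounds(words, min_part_len=2)))
```

Lowering `min_part_len` to 2 lets "mean" and "meat" back in, which is a quick way to see how much the threshold is doing. Scoring still requires inspecting a sample of the output by hand.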
Week 4 Introduction to natural language syntax Fulop
Syntax is the arrangement of words in sentences; most current theories of natural language syntax specify the organization of a sentence as a hierarchy of subconstituents in a syntactic structure.
Reading for next week: Jurafsky and Martin: the part of Chapter 8 that you haven't yet read, and Chapter 10.
Assignment for next week: Do J&M Exercises 9.1, 9.2, 9.3
Week 5 Current approaches to syntax Fulop
This week we consider theoretically motivated ways of computing syntactic structures and recognizing the sentences that have them.
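As one illustration of what recognizing a sentence means computationally, here is a small CKY-style recognizer in Python for a grammar in Chomsky normal form. This is a sketch under stated assumptions: the choice of CKY, the function name `cky_recognize`, and the toy grammar are ours for illustration, and the approaches covered in class and in the readings may differ.

```python
from collections import defaultdict


def cky_recognize(words, lexical, binary, start="S"):
    """Return True if the grammar derives the word sequence from `start`.

    lexical maps a word to the set of categories that can rewrite as it;
    binary maps a pair (B, C) to the set of categories A with a rule A -> B C.
    """
    n = len(words)
    if n == 0:
        return False
    chart = defaultdict(set)  # chart[i, j] = categories spanning words[i:j]
    for i, word in enumerate(words):
        chart[i, i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two daughters
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        chart[i, j] |= binary.get((left, right), set())
    return start in chart[0, n]


# Hypothetical toy grammar, for illustration only.
LEXICAL = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "saw": {"V"}}
BINARY = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

if __name__ == "__main__":
    print(cky_recognize("the dog saw the cat".split(), LEXICAL, BINARY))
    print(cky_recognize("dog the saw".split(), LEXICAL, BINARY))
```

The recognizer only answers yes or no; storing, for each category in a cell, the split point and daughters that licensed it would turn it into a parser that returns the hierarchical structure.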
Reading for next week: Read: Introduction to probability for linguists (pdf format, in which the sigmas aren't visible; also in Word format and Html format).
Assignment for next week: Do J&M Exercise 10.2
Week 6 Basics of probability and information theory Goldsmith
Good additional resources:
Charniak, Eugene, Statistical Language Learning; Manning and Schütze.
Assignment: do the exercises in Introduction to probability for linguists (the reading for this week).
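For checking hand calculations in this week's material, a short Python sketch of two basic quantities may help: relative frequencies estimated from observed items, and the entropy (in bits) of a probability distribution. The function names are our own, and the sketch is not drawn from the assigned reading.

```python
import math
from collections import Counter


def relative_frequencies(items):
    """Estimate a probability distribution from observed items by counting."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def entropy(probabilities):
    """Entropy in bits: minus the sum of p * log2(p), skipping zero probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    dist = relative_frequencies("abracadabra")
    for letter, p in sorted(dist.items()):
        print(f"{letter}  {p:.3f}")
    print(f"entropy: {entropy(dist.values()):.3f} bits")
```

A uniform distribution over four outcomes has an entropy of 2 bits, and a distribution that puts all its mass on one outcome has an entropy of 0, which makes for two easy sanity checks.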
Reading for next week: Reading for background: Goldsmith, "Unsupervised learning of the morphology of a natural language" (to appear in Computational Linguistics). Read: Systems that Learn, Chapter 1.
Week 7 Learnability and some aspects of machine learning Fulop and Goldsmith
Reading for next week: Jurafsky and Martin, Chapter 6, and reread Chapter 8. A good additional resource is Manning and Schütze, Chapter 6, which we recommend.
Week 8 Ngram language models and the sparseness of data problem Goldsmith
Powerpoint slides
Reading for next week: Jurafsky and Martin, Chapter 5 and Chapter 7 (partial review).
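The sparseness problem named in this week's title can be seen in a very small example. The Python sketch below trains a bigram model and compares the unsmoothed (maximum likelihood) estimate with add-one smoothing, which is only the simplest of the available remedies. The function names, the `<s>` and `</s>` boundary markers, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    contexts = Counter()
    bigrams = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + list(sentence) + ["</s>"]
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    # Words that can be predicted: all observed words plus the end marker.
    vocab = {word for sentence in sentences for word in sentence} | {"</s>"}
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab, add_one=True):
    """P(word | prev), with add-one smoothing or as a raw relative frequency."""
    k = 1 if add_one else 0
    denominator = contexts[prev] + k * len(vocab)
    if denominator == 0:
        return 0.0  # unseen context and no smoothing: nothing to estimate from
    return (bigrams[prev, word] + k) / denominator


if __name__ == "__main__":
    data = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
    contexts, bigrams, vocab = train_bigrams(data)
    for add_one in (False, True):
        p_seen = bigram_prob("the", "dog", contexts, bigrams, vocab, add_one)
        p_unseen = bigram_prob("the", "barks", contexts, bigrams, vocab, add_one)
        print(f"add_one={add_one}: P(dog|the)={p_seen:.3f}  P(barks|the)={p_unseen:.3f}")
```

Without smoothing, any sentence containing an unseen bigram such as "the barks" receives probability zero. Add-one smoothing avoids that, but only by taking a large share of the probability mass away from the bigrams that were actually observed.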
Week 9 Speech recognition; Hidden Markov models. Fulop
Reading for next week: Jurafsky and Martin, pp. 130-133; Keller, Ch. 6, “Formant synthesis”.
Also, Goldsmith, John. 1999. Dealing with prosody in a Text to Speech system. International Journal of Speech Technology 3: 51-63.
Good additional resources:
Charniak; Manning and Schütze; Jelinek.
Week 10 Speech synthesis and intonation Fulop and Goldsmith
Some on-line resources:
A Short Introduction to Text-to-Speech Synthesis (Thierry Dutoit)
Term Projects
1. Read Jurafsky & Martin Chapter 11; implement the modified Earley algorithm for unification parsing on p. 431, and test it on a toy example.
2. Do Jurafsky & Martin Exercise 7.3, and implement the resulting version of the Viterbi algorithm. Show that it works by providing some toy inputs.
3. Develop a letter-to-phoneme conversion system, and a method for testing how well it works. This could be done for English, or for another language.
4. Develop a finite-state morphology along the lines described in Jurafsky and Martin.