Resources and background notions
Introduction to computational linguistics
Week 1 Class 3
John Goldsmith
Resources: To do work on natural language, you need data. Our data is usually a corpus (plural: corpora). For our purposes, this will generally be text, which consists of strings of symbols (8-bit or 16-bit representations, ASCII or Unicode), generally with white space between words (but not in CJK-like languages: Chinese/Japanese/Korean). Text in many languages is getting relatively easy to find on the internet, but it's not totally trivial, either. Still, there are a lot of good websites that should not be ignored; for example, Bibles are available in lots of languages.
Resources.
Brown corpus at humfs1.uchicago.edu/~jagoldsm. Look in Data directory.
Ambiguity of natural language: especially in syntax. "Hitchhikers can be escaping convicts" (Oklahoma roadsign).
Hitchhikers can be [convicts who are escaping]
Hitchhikers can be escaping [from] convicts: present progressive form of Hitchhikers can escape [from] convicts.

Syntax requires a formalism at least as complex as (context-free) phrase-structure grammar: A verb in a language like English depends (grammatically, semantically) on the head (roughly, the most important word) of the preceding Noun Phrase, not the preceding word; and that preceding head-of-a-Noun-Phrase can be indefinitely far back:
The hinge is squeaking.
The hinge of my door is squeaking.
The hinge of the front door of the house I rented in Nantucket squeaked a lot.
Sentence -> Noun Phrase + Verb Phrase
Noun Phrase -> NP* + Relative Clause ("*" marks the head)
Noun Phrase -> Article + Noun* + Prepositional Phrase
or Microsoft's grammar's rules:
DECL-> NP + Verb + AVP + Char
NP -> DetP + Noun + PP
DetP -> Adj ("the")
PP -> PP
PP -> Prep ("of") + DetP + Noun + PP etc etc.
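As a rough illustration of why the head matters, here is a minimal Python sketch that walks a hand-built parse tree and returns the head word of a Noun Phrase. The tuple representation, the `head_word` function, and the choice of the preposition as the head of the PP are assumptions made for this example; they are not taken from either grammar above.

```python
# A node is (label, children); each child is a (subtree_or_word, is_head) pair.
# This representation is hypothetical, chosen only to keep the sketch short.

def head_word(node):
    """Follow head-marked children down to a single word."""
    if isinstance(node, str):      # a leaf is just a word
        return node
    _label, children = node
    for child, is_head in children:
        if is_head:
            return head_word(child)
    raise ValueError("no child marked as head")

# "The hinge of my door", using Noun Phrase -> Article + Noun* + Prepositional Phrase
inner_np = ("NP", [("my", False), ("door", True)])
pp = ("PP", [("of", True), (inner_np, False)])   # head of PP is an assumption
subject_np = ("NP", [("The", False), ("hinge", True), (pp, False)])

print(head_word(subject_np))   # the verb agrees with this word, not with "door"
```

The verb "is squeaking" depends on "hinge", however much material the Prepositional Phrase puts between the two.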

Correctly parsing natural language sentences is a major challenge.
In the meantime, a great deal can be done with words, and with (frankly) simplistic models of language, such as "bag of words" models, Markov models (which ignore constituent structure), and shallow parsing (finding Noun Phrases, or more accurately, the non-recursive core of a sentence).
Bag of words model: keep track of which words appear in a document (generally with their counts as well: multiset ("bag") rather than set).
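A bag of words can be sketched in a few lines of Python with the standard library's `Counter`. The function name `bag_of_words` and the decision to lowercase and split at white space are assumptions for illustration; real tokenization is harder, as discussed under Tokenization.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count), splitting naively at white space."""
    return Counter(text.lower().split())

doc = "the hinge of the front door of the house"
bag = bag_of_words(doc)

print(bag["the"])        # how many times "the" occurs
print(sorted(set(bag)))  # the plain set of words, with counts thrown away
```

Word order is discarded entirely: any reordering of the document gives the same bag.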
Markov models: based on the notion that for many purposes, the best single predictor of a word is the previous word. (I don't give a __ ; but cf. I saw a ___). Of course, knowledge of the preceding word is a lot of information (as measured in bits). [Minor digression: Can we give a rough measure of how many bits? The expected inverse log probability of words in English, that is, the entropy of the word distribution.]
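Here is a minimal sketch of both ideas, assuming a toy corpus and plain relative-frequency estimates (no smoothing). The function names are made up for this example. `most_likely_next` predicts a word from the previous word, and `unigram_entropy_bits` computes the expected inverse log probability of the words, in bits.

```python
import math
from collections import Counter, defaultdict

def bigram_counts(words):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def most_likely_next(follows, prev):
    """Best single guess for the next word, given only the previous word."""
    if prev not in follows:
        return None
    return follows[prev].most_common(1)[0][0]

def unigram_entropy_bits(words):
    """Expected value of log2(1/p(word)), with p estimated by relative frequency."""
    counts = Counter(words)
    total = len(words)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

corpus = "i saw a dog . i saw a cat . i saw a dog".split()  # toy data
follows = bigram_counts(corpus)
print(most_likely_next(follows, "a"))
print(unigram_entropy_bits(corpus))
```

On a real corpus the counts would need smoothing, since most possible word pairs never occur in the data.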
Shallow parsing: [ We ]NP [ expect ]Verb [ the present crisis ]NP [ will continue ]VP [ into ]PP [ the next campaign season ]NP.
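A very crude Noun Phrase chunker can be sketched as follows, assuming the words have already been tagged for part of speech. The tag names and the rule (a Noun Phrase is a maximal run of pronoun, determiner, adjective, and noun tags) are assumptions for illustration, not the method of any particular shallow parser.

```python
from itertools import groupby

NP_TAGS = {"PRON", "DET", "ADJ", "NOUN"}   # hypothetical tag set

def np_chunks(tagged):
    """Return maximal runs of NP-type tags as strings; no recursion, no PPs."""
    chunks = []
    for is_np, group in groupby(tagged, key=lambda pair: pair[1] in NP_TAGS):
        if is_np:
            chunks.append(" ".join(word for word, _tag in group))
    return chunks

# Hand-tagged version of the example sentence.
tagged = [("We", "PRON"), ("expect", "VERB"), ("the", "DET"), ("present", "ADJ"),
          ("crisis", "NOUN"), ("will", "AUX"), ("continue", "VERB"),
          ("into", "PREP"), ("the", "DET"), ("next", "ADJ"),
          ("campaign", "NOUN"), ("season", "NOUN")]

print(np_chunks(tagged))
```

One known weakness of this rule: two Noun Phrases that happen to be adjacent would be merged into a single chunk.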
Encoding issues: the upper 128 values (128-255) of 8-bit character encodings. Unicode.
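To make the problem concrete, this short Python sketch encodes the same string in two ways. The example word is arbitrary; the codec names `latin-1`, `utf-8`, and `ascii` are standard Python codecs.

```python
text = "café"

latin1_bytes = text.encode("latin-1")  # one byte per character; "é" is byte 0xE9
utf8_bytes = text.encode("utf-8")      # "é" becomes the two bytes 0xC3 0xA9

print(len(latin1_bytes), len(utf8_bytes))

# Reading UTF-8 bytes as if they were Latin-1 gives the familiar garbage.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Pure ASCII text (values 0-127) is byte-for-byte the same in all three codecs.
same = "abc".encode("ascii") == "abc".encode("latin-1") == "abc".encode("utf-8")
print(same)
```

The practical moral is to find out a corpus's encoding before counting anything: the same bytes give different "words" under different assumptions.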
Tokenization: knowing how to break up a text into the appropriate pieces. Four major challenges:
a. Breaking a large text into coherent subpieces: roughly, tracking topics in a large corpus.
b. Breaking a large text into sentences: when is a period a sentence delimiter?
c. Words and lexemes: We looked the word up. "look...up" is a single "lexical item": so just cutting at white space isn't the whole story.
d. Other "details" to worry about, in finding words: When do we break at apostrophes?
John's book: best as 3 words: John, 's, book
It's true: best as 3 words: It, 's [variant of is], true.
I'm goin' home: best as 4 words: I, 'm, goin', home (goin' is a variant of going)
Breaking at hyphens.
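The apostrophe cases can be sketched with a regular expression. This is a minimal sketch that assumes a small, hand-picked list of English clitics and ignores punctuation and hyphens altogether; the `tokenize` function and the clitic list are illustrative, not a complete tokenizer.

```python
import re

# A word followed by one of a few English clitics (the list is an assumption).
CLITIC = re.compile(r"^(\w+)('s|'m|'re|'ve|'ll|'d)$", re.IGNORECASE)

def tokenize(text):
    """Split at white space, then peel a clitic off the end of each piece."""
    tokens = []
    for piece in text.split():
        match = CLITIC.match(piece)
        if match:
            tokens.extend(match.groups())   # e.g. "John's" -> "John", "'s"
        else:
            tokens.append(piece)            # e.g. "goin'" stays whole
    return tokens

for example in ["John's book", "It's true", "I'm goin' home"]:
    print(tokenize(example))
```

Note that the sketch cannot tell possessive 's from the 's that is a variant of "is"; that distinction needs more than string matching.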
Speech corpora: some in .wav file format (we won't be looking at that) and some in text format. Arpabet. http://www.telecom.tuc.gr/~ntsourak/tutorial_arpabet.htm
Bioinformatics: many of the techniques we are exploring also work, and produce interesting results, when applied to novel symbol sequences: genomic sequences (DNA, RNA, protein sequences of amino acids). How about dolphin speech?
Programming issues: Dealing with strings, and having good string collection classes. Get familiar with built-in hash tables (which Perl is good for), or else roll your own: build a trie-structure in C++ (or find someone who has already built one).
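For comparison with a built-in hash table, here is a minimal trie sketch in Python (rather than C++, to keep it short). The class and method names are made up for this example. Each node stores its children in a dictionary, plus a count of how many inserted words end at that node.

```python
class Trie:
    """A bare-bones trie storing word counts; one node per character."""

    def __init__(self):
        self.children = {}   # character -> Trie
        self.count = 0       # how many inserted words end exactly here

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.count += 1

    def _find(self, string):
        node = self
        for ch in string:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def count_of(self, word):
        node = self._find(word)
        return node.count if node is not None else 0

    def has_prefix(self, prefix):
        return self._find(prefix) is not None

trie = Trie()
for w in ["look", "looked", "look", "up"]:
    trie.insert(w)
print(trie.count_of("look"), trie.has_prefix("loo"))
```

A plain dictionary answers "how often did this word occur?" just as well; the trie earns its keep when you need prefix queries, as in morphology.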
[Figure: created by Derrick Coetzee in Illustrator.]
Part of speech (POS): Be sure to read the discussion in the text, section 4.3.2. It brings out clearly (I think) what might not be obvious: that there is a considerable degree of arbitrariness in the selection of PoS labels we assign to words. Why do we need to set up a special category for "Verb, auxiliary be, present participle", or for the word "to", and so forth?
Morphology: there's a brief discussion in text 131-134. This will play a bigger role for us.
Precision and recall: All work in this area is heavily dependent on maintaining good quantitative measures of the quality of the results.
Precision is the proportion of the results that are right; recall is the proportion of the true results that were found.
That is, if we are searching for diamonds D, and we pull out a set of stones S:
Precision is the proportion of the S's that are true diamonds: # of S's that are diamonds / # of S's altogether
Recall is the proportion of the diamonds that our method found: # of S's that are diamonds / # of D's altogether (in the field where we were looking)
(How do you know how many diamonds there really were? You have to know, or else you can't really compute the recall.)
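The two ratios can be computed directly from the set of stones pulled out and the set of true diamonds. This is a minimal sketch; the function name and the toy data are assumptions, and returning 0.0 for an empty set is just one convention.

```python
def precision_recall(selected, relevant):
    """selected: the stones S we pulled out; relevant: the true diamonds D."""
    selected, relevant = set(selected), set(relevant)
    hits = len(selected & relevant)                  # S's that are diamonds
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

stones = {"a", "b", "c", "d"}          # what our method pulled out
diamonds = {"a", "b", "e", "f", "g"}   # what was really in the field
precision, recall = precision_recall(stones, diamonds)
print(precision, recall)
```

Here two of the four stones are diamonds, and two of the five diamonds were found. As noted above, the recall figure is only as good as our knowledge of the full set D.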