On information theory, entropy, and phonology

in the 20th century

John Goldsmith

University of Chicago

1. Introduction

In the phonological tradition that has dominated the United States since the 1960s, and many places elsewhere for nearly as long, the notions of information theory and entropy have played an uncertain role over the years.[1] In the 1950s, both Roman Jakobson and Charles Hockett spoke glowingly about the usefulness and the importance of these notions for phonological theory, while Noam Chomsky (1957a), in his review of Hockett’s Manual of Phonology 1955, wrote disparagingly of it. In later interviews, Chomsky has offered a view of the intellectual climate in Cambridge (Mass.) in the 1950s in which a kind of scientism existed, to the point where the mathematical elegance of information theory might have led (or did lead, in Chomsky's view) to the belief that the theory must be grand just because it looked so science-like, regardless of whether it had anything of any significance to say about language. For example, in an interview published in a history of cognitive psychology, Chomsky made the following observation:

Finite-State Markov sources and similar models were very highly regarded at the time. There was a lot of euphoria about such approaches to language. In part, it came from the prestige and achievements of information theory, which involved similar notions; in part, the statistical approaches to linguistics; and, in part, it had a kind of technological air to it. There was a lot of euphoria at that time in the area of linguistics in general, about the potentially great achievements that lay ahead along these lines. It was thought that they were already partly real.[2]

More recently, in his 1985 overview of the history of phonological theory in the 20th century -- a history that is with good reason widely cited -- Stephen Anderson has pursued the notion that interest in information theory was largely misguided, based in part, Anderson suggests, on a unity of purpose that was only apparent and not real behind the information theorist's desire to see everything translated into 1's and 0's, on the one hand, and the Jakobsonian drive to analyze all features as binary  (the reader will recall that this was a bone of contention that Jakobson had with Troubetzkoy). 

In this paper, I would like to suggest a different view on these issues. It is this: the notions of information theory (probability, entropy, and the notions derived therefrom) are the natural quantitative measures of many of the concepts used by phonologists, and by linguists more generally. Phonologists such as Jakobson and Hockett saw this, though at the present point in time it is difficult to determine with just what clarity. I think that Anderson is wrong in his characterization of information theory, and hence his effort to split information theory apart from phonology is mistaken. However, Anderson is correct in pinpointing one of the important phonological issues at play here as the issue of (what is these days called) underspecification, but other equally important issues are involved, not the least being a fully explicit way of comparing the relative validity of two competing models of the same language or corpus. I think, or suspect, that we phonologists have not paid enough attention to two problems that really are ours to deal with – the first being the problem of the acquisition of the phonology of a language,[3] and the other being the question of continuous speech recognition. As Andras Kornai has argued (1996), we phonologists ignore at our peril the successes achieved by the speech community in continuous speech recognition using hidden Markov models. But even ignoring questions of technology and sticking to more traditional notions of language acquisition, information theory potentially has much to say.

I am mindful of an objection that might be raised at several points along the way that I shall be taking; to wit: we can make an argument for information theory today as I do, but is it anything but an anachronism to consider these arguments as relevant to what Hockett or Jakobson might have thought 40 years ago -- let alone what others said 60 years ago? This is a question more for philosophers than linguists or historians, but I will return to this point at the end, when I will remind us that in science, as in the stock market, it is expected that scientists will make judgments about the eventual fruitfulness of an approach long before the approach has even begun to pay research dividends.

2. Some basic notions of information theory

The central notion in information theory is that of entropy. Briefly, entropy is a measure of the unpredictable character of a set of objects. The more variation and difference there is, the higher the entropy, while the less variation there is, the less entropy there is. The simplest and most general way to think about entropy is to see how it measures a set that has been broken up into one or more subsets, where we consider each subset to be essentially uniform for some particular purpose. To take a phonological example, suppose we are interested in coronal consonants in a given language, and we look at the set of coronals in a particular set of words. If the coronals are all the same (all aspirated [t]s, perhaps), then the entropy of the set is 0.0, which is the lowest possible entropy. If, on the other hand, there are 8 different coronal consonants, and each of them has the same frequency (i.e., 12.5%), then the entropy is 3 (because, using the base-2  logarithm, we see that log2 (8) = 3). If we take that set with 8 different coronals, and divide it into subgroups on the basis of the environments of these coronals, so that the distribution in each new set is not even, but rather favors some of the allophones more than the others, then the entropy of each subset goes down, and that decrease in entropy is the mathematical sign that this division is inching towards a correct statement of the environments in which the various phones occur. If we can reduce the entropy in each subset to 0 by considerations only of the environment, then we have found a set of phonological environments in which only one of the phones is found, and hence we have proven that the distribution is allophonic.
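The two boundary cases just described can be checked directly. The following is a minimal sketch, assuming that a "set of coronals" is represented simply as a list of phone labels; the labels themselves (such as "th" for aspirated [t]) and the token counts are hypothetical choices made only for illustration.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Case 1: every coronal in the sample is the same phone (hypothetical label "th").
uniform_set = ["th"] * 40

# Case 2: eight different coronals, each with a frequency of 12.5%.
eight_way = ["t", "th", "d", "s", "z", "n", "l", "r"] * 5

print(entropy(uniform_set))  # lowest possible entropy
print(entropy(eight_way))    # log2(8)
```

Any distribution over the same eight phones that is not perfectly even yields a value strictly between these two extremes.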

There is no particular magic to this operation, but the point is that what the phonologist already knows how to do -- to seek out the essence of the conditioning environments which account for which phone is used in any given context -- is measurable in terms of the entropies of the subsets.

On the other hand, if we cannot reduce the entropy of all of the sets to 0, then some  opposition remains, and the relationship is not obviously one of allophony-by-virtue-of-complementary-distribution (it is, then, a matter either of contrast or of free variation; distribution alone at this level will not tell us whether the difference is contrast or free variation). The notion of entropy can be applied in a succession of complex cases, and can be used to measure in a sense the complexity of an analysis, and can serve therefore in a complete formalization (or automation) of a procedure. Using the terms of information theory, we can say that the distribution of a set of allophones is better understood if we succeed in decreasing the conditional entropy, using some independently specifiable criterion like phonetic environment.[4]
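The notion of conditional entropy invoked here can be made concrete with a small computation. What follows is only a sketch under stated assumptions: the data are invented (environment, phone) pairs, the environment names are hypothetical, and conditional entropy is computed as the average of the per-environment entropies, weighted by the size of each environment.

```python
import math
from collections import Counter, defaultdict

def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(observations):
    """H(phone | environment) for a list of (environment, phone) pairs."""
    by_env = defaultdict(list)
    for env, phone in observations:
        by_env[env].append(phone)
    total = len(observations)
    return sum(len(phones) / total * entropy(phones) for phones in by_env.values())

# Invented data: complementary distribution (each environment has one phone).
complementary = [("word-initial", "th")] * 30 + [("after-s", "t")] * 30

# Invented data: the environments overlap, so some opposition remains.
overlapping = ([("word-initial", "th")] * 30
               + [("after-s", "t")] * 20
               + [("after-s", "th")] * 10)

for name, data in [("complementary", complementary), ("overlapping", overlapping)]:
    phones = [phone for _, phone in data]
    print(name, round(entropy(phones), 3), round(conditional_entropy(data), 3))
```

In the first data set the environment removes all uncertainty about the phone; in the second it removes only part of it, which is the situation in which distribution alone cannot decide between contrast and free variation.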

Another issue that plays a role throughout this discussion is the nature of probability, so fundamental to entropy and information theory. In this day and age, it is generally agreed that there are three different, though related, conceptions of probability theory. It can be viewed, in the first place, as a purely mathematical theory as formalized by Kolmogorov in this century, and independent of any applications to the real world. A second view is the frequentist interpretation, according to which probabilities are all in one fashion or another statements about the frequency of sets of events; during much of the twentieth century, this view has been dominant, though it was not the view of the founding fathers of probability theory, nor is it the dominant view currently (see e.g., Gigerenzer (1989), Daston (1988)). The third interpretation, labeled variously as subjectivist or Bayesian, interprets statements of probability as principles of rational beliefs; on this account, the theory of probability could be considered an extension -- a massive extension, to be sure -- of logic, in the sense that it codifies not what any actual person thinks or infers, but rather is a systematic account of the relationship between strength of belief and evidence.  It is this view that I implicitly adopt; it has been  the dominant view in the second half of this century.[5]

3. Troubetzkoy

We should note, first of all, that Troubetzkoy's work all predated information theory as it is normally understood, that is, predated Shannon's work on entropy (Shannon and Weaver 1949). In his Grundzüge, however, Troubetzkoy devoted an entire chapter --a chapter of barely 10 pages, it is true -- to statistical studies of phonological patterns. His primary point is this: it is frequently possible to make a prediction regarding the relative frequency of pairs of sounds that enter into an opposition. His example is roughly the following, taken from Chechen. If in a given language, geminates occur only inter-syllabically, and the non-geminate version appears in those intersyllabic positions plus word-initially and word-finally, then if we know the average number of syllables per word, we can make a prediction of the relative frequency of the single and geminate versions of a consonant. If our prediction does not match the reality, then we can infer there is something that remains to be accounted for.

 "Le chiffre absolu de la fréquence réelle d'un phonème n'a qu'une importance accessoire. Seul le rapport entre ce chiffre et le chiffre de fréquence attendu théoriquement possède une valeur véritable. (284)…Le calcul des probabilités théoriques n’est pas toujours aussi simple que dans les exemples ci-dessus. Mais on ne doit pas se laisser rebuter par les difficultés d'un tel calcul, car c'est seulement par comparaison avec les chiffres de fréquence possible obtenus au moyen de ces calculs que les chiffres de fréquence effective acquièrent une valeur, en montrant si un phonème, dans la langue en question, est beaucoup ou peu utilisé. (285)."[6]

This is a powerful notion that remains to be fully explored.  What Troubetzkoy (and others since) have seen is that a study of frequency can often be tantamount to a search for lurking generalizations. If we toss a die 1000 times, and it comes up a 6 on 240 occasions (rather than on 166 occasions), we have reason to suspect that something is responsible for the inordinate number of 6's -- and we would not have noticed that if we hadn't taken the time and effort to do the counting.

The linguistic equivalent might be something like this. Just as we know the expected frequencies of the 6 sides of a die (1/6 for each side), we know the expected frequencies for each of the phonemes of English, if we count in a large text; here are the top 14, taken not from a running text but from a dictionary of English. (Hence these frequencies are what Jakobson will call "frequencies in the code", as opposed to "frequencies in the use" (1971:578); Troubetzkoy also reflects on the difference between these kinds of statistics (1967:277).)



(1)
Phoneme          Frequency
ə                0.0788
n                0.0694
t                0.0632
s                0.0606
l                0.0545
r                0.0542
k                0.0473
d                0.0415
I                0.0406
z                0.0340
m                0.0329
e                0.0281
p                0.0266
ɚ (syllabic r)   0.0259


Given the frequencies of two phonemes -- to fix ideas, let us say a lax I followed by N -- we can predict the frequency of the pair I-N if English were put together with no regard for what phonemes follow what phonemes: we would simply multiply the frequency of I by the frequency of N. The difference between the actual observed frequency and the predicted frequency (or really, the difference between the logarithms of these values) is a direct measure of how much these two phonemes attract or repel each other in the entire vocabulary of the language. The term mutual information is used to refer to this measure, which for the pair I-N is $\log_2 \frac{p(IN)}{p(I)\,p(N)}$. One could think of this as the measure of the excess frequency over what was expected, given knowledge of the frequency of the component parts. We can multiply this value (for each pair) by the frequency of that pair, giving us a measurement, the weighted mutual information, of the importance of a string in a corpus. Pairs that rank high on such a list will be pairs of sounds that play a statistically significant role in the language. If we compute all such pairs in the language and look at a list with the highest weighted mutual information at the top, we find an interesting result. The top 8 from the dictionary are (where I have treated word boundary # as a phoneme):
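To make the computation concrete, here is a minimal sketch of the ranking procedure. It takes the mutual information of a pair to be the base-2 logarithm of the observed pair frequency divided by the product of the two individual frequencies, and the weighted version to be that value multiplied by the pair frequency. The six-word lexicon is invented, each character stands in for one phoneme, and the choice to normalize single phonemes and pairs by their own respective totals is an assumption of this sketch, not a detail taken from the dictionary study.

```python
import math
from collections import Counter

def weighted_mutual_information(words):
    """Return {(x, y): (mi, weighted_mi)} for every adjacent pair of symbols."""
    unigrams, bigrams = Counter(), Counter()
    for word in words:
        symbols = list(word) + ["#"]  # treat the word boundary as a phoneme
        unigrams.update(symbols)
        bigrams.update(zip(symbols, symbols[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi                                       # observed
        expected = (unigrams[x] / n_uni) * (unigrams[y] / n_uni)  # predicted
        mi = math.log2(p_xy / expected)
        scores[(x, y)] = (mi, p_xy * mi)
    return scores

# Hypothetical toy lexicon; "I" is lax I and "N" is the velar nasal.
toy_lexicon = ["sIN", "rIN", "kIN", "tIp", "sat", "rat"]

scores = weighted_mutual_information(toy_lexicon)
ranked = sorted(scores.items(), key=lambda item: item[1][1], reverse=True)
for pair, (mi, weighted) in ranked[:5]:
    print(pair, round(mi, 3), round(weighted, 3))
```

Even in so small a lexicon, the pair I N rises to the top of the weighted ranking, because it is both frequent and far more frequent than its parts would predict.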

 

(2)   z #     ə n     I N     ə l     y #[7]     r[8]     s t     N #     d #.

It is not hard to see that 4 of the 8 explicitly contain a word-boundary, and at least two more (I N and r) occur primarily at word-end.  If we perform the same operation on sequences of three phonemes, and compute the measure of excess frequency over expected frequency given prior knowledge of frequency of pairs of phonemes, the top 13 are:

(3)   t s #     ə l #     ə n #     sh ə n     b ə l     # I n     t ə s     ə d #     # d I     I N #     m ə n     I k #

Again, word-periphery plays an overwhelming role in these statistics. It should be clear what is popping up: it is the high frequency morphemes and their component parts (in (2), we see the most common, the “-s” marking plural in nouns and 3rd sg in verbs; the suffixes -ing, -al, -y, -or, and the verbal past suffix -d; in (3) we see in addition the suffixes -tion, -ed, -ic, part of -able, and the prefix in-).

This teaches us two things, at the very least, I think: first of all, it illustrates well the way in which the patterns of letters (or phonemes) are heavily governed by the patterns of morphemes in the language.  The situation would have been much different if we had searched the lexicon of English for patterns of sequences of distinctive feature values: we would then have found real phonological patterns, not morphological patterns, and indeed Troubetzkoy rightly argues that a phonological account should focus on the distribution of featural oppositions rather than segmental distributions.  Second of all, this simple experiment suggests a plausible route for early language acquisition, in the following sense: psycholinguists who study the way in which young children acquire words early on in the language acquisition process often look for ways in which the children might have figured out where the word-boundaries are (easy if the language has regular penultimate stress, perhaps); they implicitly assume that the child learns the words by first finding the boundaries (see many of the papers in Morgan and Demuth 1998, for example). But the probabilistic approach suggests that they could more profitably start from the bottom up, looking for those chunks whose frequency is disproportionately high given the frequencies of their parts.[9]

4. Jakobson -- and Anderson on Jakobson

It seems to me that for phonologists working throughout the middle of this century, the two absolute and primordial facts of phonology were the following: first, the discovery (or invention) of the phoneme during the first half of the century, conceived of as a means for reinstating order in a chaotic world where ever more narrow description of speech created descriptions of languages with ever more phones,[10] and second, the potential re-emergence of chaos with the technological developments of acoustic phonetics after World War II. The development of sophisticated electronic equipment permitting close and careful scrutiny of the phonetic signal did not make it easy to map traditional -- or even not so traditional -- conceptualizations of phone and phoneme onto the acoustic signal; the deeper understanding of the acoustics of the speech signal afforded by the technological advances of the post-war years raised far more questions than it answered.

Putting these together: the objective world of linguistic sounds was not becoming simpler and more tame with increase in our knowledge, but if anything it was getting more complex. At the same time, it was necessary to be able to get a linguistically sophisticated handle on that complexity, and that handle was provided by the theory of phonemics.

To some extent, we still live in the wake of these primordial facts, but (especially in the United States, I think) scholars in phonology have made considerable progress in forgetting them. (I do not mean to put myself outside the range of that generalization; my comments apply to me when I am acting as a phonologist as well as they do to others.) We have done this by a very simple expedient: we have focused in phonology on problems where the data are presented in simple form, in more or less orthographic form. We do not even use our extended notational devices of the sort introduced by autosegmental and metrical phonology to present a more complex representation of the data than we might have done forty years ago: instead, we keep those rich notational devices for the analysis, even though there is nothing within anybody's theory that justifies using the notations for "deeper" phonological representations more than for surface representations (if anything, one might expect the opposite on purely theoretical grounds).

At risk of repetition, let us underscore that the discovery of the phoneme was the great organizing principle of 20th century phonology, and we modern phonologists continue to take it for granted, as an unproblematic system. Much of the time we ignore the principles that separate allophones; on some occasions we delve into the matter of allophonic relations -- such as Kahn's treatment (1976) of the flap in American English, or the treatment of the relationship of the features of voice, vowel length, vowel quality, and flapping in the rider/writer contrast. We delve into the matter when it bears on representational issues, but we do not feel obliged to deal with troublesome issues, such as the duration assigned to the vowel or the rhyme and how that relates to the phonological feature of voicing, from a phonologist's point of view.

The important point to bear in mind is that whether or not we accept the details of the structuralist view of phonemics (and we do not), there is still a very important core of that program that is relevant: the phonologist must use tools like the discovery of complementary distribution and free variation to reduce the inventory of independent segments in the language. We generativists say that we do not believe in a qualitative difference between rules of phonology that structuralists would have called morphophonemics and rules of phonology that they would have called allophonics, but there is a deep irony here, because generativists for the most part simply do not work very hard on those rules of allophony -- unlike the structuralists, who worked very hard on them. The irony is that we say they are all the same, but we do not study allophony; the structuralists distinguished between morphophonology and subphonemic phonology, but studied both in detail. But because they studied both in detail, they were aware how difficult it was to come up with a scientifically valid methodology to deal with rules of allophony, and they were aware of the methodological continuity between the search for rules of allophony and the rules of morphophonology: a process was a candidate for being part of the morphophonology only when it failed the tests for allophony. 

Let’s not get too abstract about all this. As we know,[11] rules of allomorphy are more often than not very similar to (or even identical to) rules that specify the patterns of unmarkedness in a language: if there is a rule in English of Trisyllabic Laxing (and do bear in mind that I write 'if'!), it goes hand in hand with the observation that the frequency of lax vowels is much higher than the frequency of tense vowels in the environment in question (which is: before two syllables, the first of which is unstressed). How does the language learner discover the correct environment for Trisyllabic Laxing, a rule of morphophonology? Most of the “discovery procedure” is identical to the process of learning a rule of allophony: the cases must be sifted through to determine what the context is in which a lax vowel predominates. We determine that the rule is not allophonic because we find that there is no subcontext in which only lax vowels appear (that is, there is no way to specify a phonological environment such that words like Oberon, nightingale, and obesity are excluded), and no context in which only tense vowels appear. If there were such environments and only such environments, the entropy of those subenvironments would be zero, and the rule would be a rule of allophony. (I leave aside the possibility of free variation from this brief discussion, though it is entirely relevant.)

Contemporary phonological theory is no better equipped to model the discovery of which aspects of a speech signal are phonemically redundant than phonological theory was 50 years ago. But somehow language learners do figure out that various aspects of the signal are indeed redundant, and thus are able to come to understand their language as built up of a number of smaller building-blocks.

I have tried to sketch the outlines of two closely related points: first, notions of information theory are most likely (I would like to say, “almost certainly”) behind the way in which we sort sounds into groups corresponding to underlying segments;[12] and second, we phonologists, when we get around to developing a complete and comprehensive theory of phonology and its acquisition, will need this theory as well, regardless of whether we include in our theory something like a level of phonemes.

The language-acquisition device, the LAD which we as linguists mean to model, must perform a difficult task: it must determine what contexts play a role in determining the variations in the sounds that we as phonologists study. We as phonologists have not significantly advanced beyond the rules of thumb of the sort that we teach students in introductory phonology, and that (for example) Kenneth Pike summarized in his Phonemics. Researchers working in the area of machine learning have tackled problems of this sort, and one approach that they have developed (and the one which is most similar to what linguists do) is the use of what are called classification (or CART) trees (Breiman et al. 1984). Put most simply, classification trees are assigned the task of taking in a large amount of data (for example, a set of phones and the contexts in which they appear), and from that discovering the optimal way of specifying logical or statistical dependencies among the variables encoded in the data. For example, if we fed in the data regarding the contexts in which a large set of allophones of /t/ of American English occurred (including flap, aspirated t, and glottalized t), we might use a CART tree to determine that word-initial ts are always aspirated, that word-final ts before word-initial vowels are flapped, and so forth. How does a CART tree do this?
Put simply, it is a computer algorithm, and it analyzes a large (often a very large) number of alternative ways of splitting the data up into two groups, and it looks for the splitting that maximizes the “purity” of each of the subparts, where “purity” can be defined in several ways, of which the most common is the negative entropy:[13] we wish to find an arrangement of the data so that in each subgroup there is more and more a predominance of one of the allophones; we (as phonologists) know that we have completed our task when we have only one allophone left in a subcategory (such as when the CART algorithm selects the category “word-initial ts”  and discovers only aspirated ts, and no flaps or glottalized ts). Such a subset, with only one allophone, necessarily has an entropy of 0.0, which is the best purity that can be reached. And one of the reasons that CART is particularly useful for this problem is that finding the right way to analyze a phonological problem often takes a succession of only partly successful steps; yet a quantitative measure, like negative entropy ( -1 * entropy), can be used to show that a tentative division of the sounds into two subgroups is a partial step towards discovering the correct phonological conditioning behind a set of allophones.

For example, if stops are aspirated after (and only after) homorganic nasals in a language, then a division of the data into two groups, based on whether the preceding segment is [+sonorant] or [-sonorant], will increase the purity of the system, because the set of aspirated stops after [-sonorant] segments will be null, while the proportion of aspirated stops after [+sonorant] segments will be higher than was the proportion of aspirated stops in the whole corpus. Hence a system seeking the right way to divide the phones into two sets will be reasonably certain to have taken a step in the right direction when it divides the context up in that way.
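The aspiration example can be worked through numerically. The sketch below is not the CART algorithm of Breiman et al., which searches recursively over a very large number of candidate splits; it only evaluates two hand-picked binary splits on invented token counts, using the size-weighted entropy of the resulting subgroups (so that purity is simply the negative of the value printed). The feature names and the counts are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of the empirical distribution over labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def split_entropy(tokens, feature):
    """Size-weighted entropy of the phone labels after a binary split on feature."""
    groups = {True: [], False: []}
    for features, phone in tokens:
        groups[features[feature]].append(phone)
    total = len(tokens)
    return sum(len(g) / total * entropy(g) for g in groups.values() if g)

# Invented corpus: stops are aspirated after, and only after, homorganic nasals.
tokens = (
    [({"sonorant": True, "homorganic_nasal": True}, "aspirated")] * 20
    + [({"sonorant": True, "homorganic_nasal": False}, "plain")] * 30
    + [({"sonorant": False, "homorganic_nasal": False}, "plain")] * 50
)

baseline = entropy([phone for _, phone in tokens])
by_sonorant = split_entropy(tokens, "sonorant")
by_nasal = split_entropy(tokens, "homorganic_nasal")

print(round(baseline, 3), round(by_sonorant, 3), round(by_nasal, 3))
```

The split on [sonorant] lowers the entropy without eliminating it, which marks it as a partial step; the split on the homorganic nasal context brings it to zero, which marks the conditioning environment as fully identified.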

Whether or not this brief description of CART is clear enough for the reader to fully understand the details, the point that remains is this: the task that the LAD faces is in no significant way modified by the rejection of the phoneme, nor by any other decision in recent phonological theory, and CART-style analysis remains a natural way to figure out what the relationship is between phones.

Jakobson seems to have understood this point. And he understood that information theory proposed a way (the way, in fact, we can say) to quantify and explicate this notion, that one distribution of basic observations is predictable from another distribution of observations.[14] Here is Jakobson:

A phonemic analysis, when consistently proposing the elimination of redundancies, necessarily provides an optimal and unambiguous solution. The superstitious belief of some theoreticians unconversant with linguistics that "there remain no good reasons for the distinction between distinctive and redundant among the features"[15] is patently contradicted by innumerable linguistic data. If, for example, in Russian the difference between advanced vowels and their retracted counterparts is always accompanied by the difference between preceding consonants, which are palatalized before the advanced vowels and devoid of palatalization before the retracted vowels, and if on the other hand the difference between palatalized and non-palatalized consonants is not confined to a vocalic neighborhood, the linguist is obliged to conclude that in Russian the difference between the presence and lack of consonant palatalization is a distinctive feature, while the difference between the advanced and retracted vowels appears as merely redundant. Distinctiveness and redundancy, far from being arbitrary assumptions of the investigator, are objectively present and delimited in language.

"The prejudice treating the redundant features as irrelevant and distinctive features as the only relevant ones is vanishing from linguistics, and it is again communication theory, particularly its treatment of transitional probabilities, which helps linguistics to overcome their biased attitude toward redundant and distinctive features as irrelevant and relevant respectively [1971:572]."

Anderson 1985 sketches (134ff) a picture of how Jakobson's views were influenced by information theory that seems like an inaccurate caricature in the light of Jakobson's own remarks. Here is Anderson:

 [F]rom his earliest writings about phonological structure, a phonemic representation …was seen as expressing exactly what distinguishes one linguistic form from another: a logically 'pure' distillation of the contrastive relation between forms, purged of all redundant and accidental properties. This picture came to be reinforced by considerations outside the field of linguistics proper…Jakobson seized on the connection between information theory and his view of phonology…and expressed the view in a number of papers (e.g., Jakobson, Cherry and Halle 1952; Jakobson 1961) that the generalized mathematical theory of communication would provide a rigorous scientific basis for the interpretation and analysis of phonological systems. It is hard not to see a certain amount of fascination with the impressive mathematical apparatus of this theory in Jakobson's espousal of it. (135).

Jakobson elsewhere rejects the position that Anderson appears to impute to him, as in the following passage:

L'interrelation des traits distinctifs, configuratifs (surtout démarcatifs), expressifs et redondants (49) requiert un examen comparatif précis. Un tel examen doit en particulier éviter toute confusion entre ces ensembles de traits essentiellement hétérogènes et tout effacement des limites effectives entre leurs fonctions divergentes. Le préjugé qui consiste à confiner la recherche phonologique aux seuls traits distinctifs et à les désigner totalement arbitrairement comme les seuls qui soient utiles et pertinents déforme tout autant la réalité. Leur caractère discret, qui les distingue spécifiquement de la gamme graduée des traits expressifs, ne donne pas au linguiste le droit d'écarter ces derniers. (Jakobson 1973, p. 152).

5. Hockett and Chomsky

5.1 Hockett

Charles Hockett, in his Manual of Phonology 1955, also expressed enthusiasm for the usefulness of information theory in understanding some properties of human language. He gives a sketch of a mechanism for modeling syntax inspired by information theory which we will consider in a moment. Following that, we will look at Chomsky’s 1957a criticism of this in a review of Hockett published in IJAL, and try to judge each from an end-of-the-20th-century perspective.

In his Manual of Phonology, Hockett proposed that the linguistic faculty -- which he called the "Grammatic Headquarters", or GHQ -- be modeled as a finite state device, and he noted that such a device, with n states, could be modeled with an n x n array of transition probabilities; in addition, each transition would be associated with a set of words along with a probability for each; the set of these probabilities must sum to 1.0. If the device is in a particular state S0, then for any given word W, there can be only one transition from state S0 associated with word W: put another way, knowledge of the current state, plus knowledge of the word about to be generated/parsed, yields deterministic (or rather, certain) knowledge of the next state.
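A device of this kind is easy to write down. The following is a minimal sketch rather than a reconstruction of Hockett's own formalism: the three states, the words, and the probabilities are all invented, and the n x n transition array and the per-transition word probabilities are collapsed into a single table giving, for each state and word, a probability and a next state. Keying that table by word is what enforces the requirement that a state and a word together determine the next state.

```python
# transitions[state][word] = (probability of the word in that state, next state)
# All states, words, and numbers below are hypothetical.
transitions = {
    "S0": {"the": (0.6, "S1"), "a": (0.4, "S1")},
    "S1": {"dog": (0.5, "S2"), "cat": (0.5, "S2")},
    "S2": {"barks": (0.7, "S0"), "sleeps": (0.3, "S0")},
}

def is_normalized(transitions, tol=1e-9):
    """Check that the word probabilities leaving each state sum to 1.0."""
    return all(
        abs(sum(p for p, _ in arcs.values()) - 1.0) < tol
        for arcs in transitions.values()
    )

def sequence_probability(transitions, words, start="S0"):
    """Probability of a word sequence, and the state reached at its end."""
    state, prob = start, 1.0
    for word in words:
        if word not in transitions[state]:
            # This bare sketch makes no provision for words unseen in a state.
            return 0.0, state
        p, state = transitions[state][word]
        prob *= p
    return prob, state

print(is_normalized(transitions))
print(sequence_probability(transitions, ["the", "dog", "barks"]))
```

Because each (state, word) pair has exactly one entry, the path through the device is fully recoverable from the words alone.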

From our perspective today, this means that Hockett does not consider the possibility of a Hidden Markov Model (HMM), which would allow precisely what his device rules out: more than one transition from a given state associated with the same word. HMMs have become an extremely important engineering tool in the past 20 years, most notably in the area of speech recognition (see e.g. Charniak 1993 or Jelinek 1998). Why should this be? I will return to this question shortly in connection with Chomsky’s criticism of this model.

It is important to remember that the number of states (n, in the example above) is intended to be considerably greater than the number of words in the lexicon being modeled. In general, a given word could be emitted (or "uttered" or "parsed") when the system is in many different states, and the transitions will in those cases all be to different states (in the general case, again). This is the picture suggested by information theory, and with the (very important) difference (a loosening) that today the interest is in HMMs rather than deterministic Markov models, this is the picture that is used today in much current speech technology. In general, many HMMs for natural language define their states on the basis of the categories of the two most recent words of the utterance, though the "categories" that are used are typically much richer (that is, smaller and more syntactically homogeneous) than the familiar linguistic categories such as noun and verb.

5.2 Chomsky

Chomsky reviewed Hockett's Manual in IJAL in 1957, and he begins with a repudiation of Hockett's model, citing both his Syntactic Structures (1957b) and "Three Models" (1956). “Rather straightforward considerations indicate that this view is untenable” (224), Chomsky suggests, arguing on the basis of what has come to be known as the n-gram sparseness problem. Recast in contemporary terminology, the problem is this: suppose we treat each word as completely different from every other word, and suppose we predict the likelihood of the next word on the basis of the last one or two words. The number of words in a natural language is so large that in practical terms we will continue to encounter, as much as 25% of the time, sequences that have never occurred in the data before.

Interestingly, Chomsky makes an error at this point, suggesting that a probabilistic model of this sort would predict 0.0 probability for sequences it had not encountered in the training data. Both in theory and in practice Chomsky’s suggestion is mistaken; the probability that such a model assigns will be anything but zero. A probabilistic model must “reserve” some of its probability for sequences not directly observed in the past; in some contexts this is called a “back-off” probability or strategy. The question is how the device will distribute the probability among the words that it has not encountered in a specific context. The simplest model (and the sort that in retrospect we might well attribute to Hockett) would be to reserve some probability (say, 0.25) for words not encountered in this state in the past, and to distribute the 0.25 probability among the rest of the vocabulary in a way strictly proportional to the frequency of the individual words in the corpus. This is not a very good solution, but it is feasible.
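
The simplest back-off scheme just described can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the word counts, and the context are invented, and the figure 0.25 is the one used above.

```python
import math
from collections import Counter


def backoff_distribution(context_counts, corpus_counts, reserved=0.25):
    """Return P(word | context) over the whole vocabulary.

    context_counts: words observed in this context (state), with counts
    corpus_counts:  words over the whole corpus, with counts
    Words seen in the context share 1 - reserved in proportion to their
    counts there; unseen words share `reserved` in proportion to their
    overall corpus frequency.
    """
    seen_total = sum(context_counts.values())
    if seen_total == 0:
        raise ValueError("context has no observations")
    unseen = {w: c for w, c in corpus_counts.items() if w not in context_counts}
    unseen_total = sum(unseen.values())
    if unseen_total == 0:
        reserved = 0.0  # nothing unseen, so nothing to hold back
    probs = {w: (1.0 - reserved) * c / seen_total
             for w, c in context_counts.items()}
    for w, c in unseen.items():
        probs[w] = reserved * c / unseen_total
    return probs


# Invented counts, for illustration only.
CORPUS_COUNTS = Counter({"of": 900, "cat": 60, "bear": 30,
                         "elephant": 5, "mole": 5})
AFTER_CROSS_EYED = Counter({"bear": 3, "cat": 1})
PROBS = backoff_distribution(AFTER_CROSS_EYED, CORPUS_COUNTS)
```

Every word in the vocabulary receives a non-zero probability, and the distribution still sums to 1.0. With these invented counts the unseen word of receives far more of the reserved mass than elephant or mole, simply because it is more frequent in the corpus as a whole.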

Chomsky asks rhetorically how Hockett’s model could deal with the difference between the probability of the phrases look at the cross-eyed elephant, look at the cross-eyed mole, and look at the cross-eyed of. If we chose the simple model that I mentioned in the previous paragraph, all three would have non-zero probability, but the third one (with cross-eyed of) would have the highest probability, since of’s frequency is higher than that of elephant or mole. Clearly, what is needed – and Hockett points this out, to be sure – is a notion of grammatical category, so that more probability could be distributed to those words whose distribution is most like the words found in the context the cross-eyed…. In English, we call that category the category of nouns, of course, and so Hockett’s probabilistic model would distribute most of the 0.25 probability among words in the category noun – and it would distribute the probability in a fashion proportional to the relative frequency of the nouns. Since the frequency of of as a noun is extremely low (but not zero)[16], the predicted probability of the cross-eyed elephant would be much greater than the probability of the cross-eyed of.
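
The effect of adding a grammatical category can be sketched as follows. The counts are invented, and the sketch simplifies in one respect that should be stated plainly: it gives all of the reserved 0.25 to the noun category, where the text says only that most of it would go there.

```python
import math


def category_backoff(noun_counts, seen_words, reserved=0.25):
    """Spread `reserved` over unseen words by their frequency *as nouns*.

    noun_counts: how often each word occurs as a noun in the corpus
    seen_words:  words already observed in this context
    """
    unseen = {w: c for w, c in noun_counts.items()
              if w not in seen_words and c > 0}
    total = sum(unseen.values())
    if total == 0:
        raise ValueError("no unseen nouns to back off to")
    return {w: reserved * c / total for w, c in unseen.items()}


# Invented counts: "of" occurs very often overall, but almost never as a noun.
NOUN_COUNTS = {"cat": 60, "bear": 30, "elephant": 5, "mole": 5, "of": 1}
SEEN_AFTER_CROSS_EYED = {"bear", "cat"}
NOUN_PROBS = category_backoff(NOUN_COUNTS, SEEN_AFTER_CROSS_EYED)
```

Under these assumptions elephant and mole each receive five times the probability assigned to of, and of still receives a small non-zero share.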

A probabilistic model will typically group words into categories, and use frequencies from all the words in the category in its back-off strategy. Models differ with regard to how these categories are determined, and they vary a great deal with regard to how big the categories are: are they as big a category as noun, or are they as small a category as names of days of the week? Chomsky inexplicably says that invoking such categories “is no minor revision; it amounts to a total rejection of the statistical theory of  grammaticalness sketched above.” (224) The reason Chomsky gives for this claim is that “evidently very different sets of transitional probabilities will be associated with different contexts of a single form class.” In fact, probabilistic grammars select their categories in order to minimize the discrepancies created by treating all the words in a single category the same way in all contexts. Sometimes this is very much the right thing to do: if only “Monday” and “Tuesday” are found in some syntactic context, it is likely that the other days of the week should be allowed to appear there at the same probability, and a discrepancy between the probability generated by the model and that found in the training corpus is fine: it is the training corpus which fails to include “Wednesday”  through “Sunday”, for some accidental reasons. (This is no different from what the ordinary working linguist does when s/he works with a category such as “noun”  or “common noun”: s/he generalizes past the observations, of course.)

What is of interest to us today is why there should have been such a lack of communication between Hockett and Chomsky -- between the proponent of what was then mainstream linguistics, attempting to integrate information theory, and the proponent of the revolutionary generative grammar. Several reasons may be the cause, but one of the most important is likely to be a radical difference in what Hockett and Chomsky took their goals to be. As Stephen Anderson discusses in detail in his contribution to this volume, Chomsky and Halle were arguing for a conception of linguistics whose goal was to model the knowledge in a speaker's head, while for most of their colleagues in the early 1950s, such a goal was "mentalism," and as such, inappropriate for linguistics. What was appropriate was an analysis of texts and corpora. Information theory was (and still is) particularly appropriate for that latter task, though the precise argument does not seem to have been spelled out by linguists at the time. As we have noted, the entropy of a particular corpus is computed on the basis of a particular model of the language that is assumed to have generated the corpus in question. By comparing different language models, one can compare the success of various models with regard to how well they succeed in modeling a corpus with low entropy, for assigning a low entropy to a corpus is exactly the same thing as assigning a high probability to the corpus. What information theory could offer structuralist linguistics at that moment was precisely the corpus-based equivalent of Chomsky's competence-based evaluation metric. That may sound odd, but it is indeed correct. Information theory offered an operationally explicit means of comparing and evaluating two competing analyses of the same data, which is precisely what Chomsky's original generative enterprise sought to establish (and as he made explicit in Syntactic Structures 1957b). 
It is only now, in the 1990s, that the relationship between these two conceptions is coming to light; in recent years, various computationally sophisticated ways have emerged for both computing the complexity of an analysis, and determining the degree to which an analysis correctly models a corpus.
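
The equivalence between low entropy and high probability can be illustrated with a small sketch. The corpus and the two word-probability models below are invented, and the models are bare unigram models chosen only to keep the arithmetic visible; nothing here is specific to any published system.

```python
import math


def corpus_log2_probability(corpus, model):
    """Log (base 2) of the probability the model assigns to the corpus."""
    return sum(math.log2(model[word]) for word in corpus)


def bits_per_word(corpus, model):
    """Average number of bits per word: the negative log probability
    of the corpus under the model, divided by the corpus length."""
    return -corpus_log2_probability(corpus, model) / len(corpus)


# Invented toy corpus and two competing models of it.
CORPUS = "the dog the cat the dog".split()
UNIFORM_MODEL = {"the": 1 / 3, "dog": 1 / 3, "cat": 1 / 3}
FITTED_MODEL = {"the": 1 / 2, "dog": 1 / 3, "cat": 1 / 6}

BITS_UNIFORM = bits_per_word(CORPUS, UNIFORM_MODEL)
BITS_FITTED = bits_per_word(CORPUS, FITTED_MODEL)
```

The model that assigns the corpus the higher probability is, by construction, the one with the lower figure in bits per word, so ranking competing analyses by one measure is the same as ranking them by the other.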

6. Conclusion

As I indicated at the beginning, I am sensitive to the rejoinder that much of what I have said in defense of the relevance of information theory to phonology could be said to be irrelevant or anachronistic, in the following sense: discussions of the success of hidden Markov models in the context of Hockett's writings in 1955 do not mix. The success of HMMs came about by virtue (among other things) of dropping the deterministic character of the finite state model Hockett proposed to employ. Maybe, one might reply, linguists should be interested in HMMs today, but that is no reason to score one up to Hockett in 1955.

While I am sensitive to that counter-argument, I think it is off the mark, for this peculiar reason: anyone can judge the worth and relevance of a theoretical approach once it has (so to speak) produced the goods. Once HMMs made it possible to develop functioning continuous speech recognition, it does not take much insight and intelligence to allocate credit. The hard part is deciding ahead of time. You can call that vision, or whatever you want, but it is the ability to answer questions forty years in advance that in some sense wins the prize.[17]

In conclusion: I think that linguists have a great deal to learn from the technology that information theory can give us, and I have sketched the outlines of some of the ways that this can happen. I think that earlier generations of phonologists were more receptive to that message, in some measure because they recognized the importance, the centrality, and the difficulty of the question of how phonemic analysis is to be performed on the basis of phonetic data. I have focused on issues of data analysis, which I take to be essentially issues of learnability, rather than on issues of representation or derivations, etc. I think we phonologists should retain an awareness of the difficulty of this problem, and that if we develop a healthy interest in projects like continuous speech recognition, we cannot overlook it.


 

References

 

Anderson, Stephen R. 1985. Phonology in the Twentieth Century. Chicago: University of Chicago Press.

Baars, Bernard J. 1986. The Cognitive Revolution in Psychology. New York: Guilford Press.

Bar-Hillel, Yehoshua. 1958. Three methodological remarks on "Fundamentals of Language." Word 13: 323-335.

Boole, George. 1854. An investigation into the Laws of Thought, on Which are founded the Mathematical Theories of Logic and Probabilities.  London:  Walton and Maberley.

Breiman, Leo et al. 1984. Classification and Regression Trees. Belmont, Calif.: Wadsworth International Group.

Charniak, Eugene. 1993. Statistical Language Learning. Cambridge: MIT Press.

Chomsky, Noam. 1956. Three Models for the Description of Language. Trans IRE, PGIT, vol. IT-2, no. 3, pp. 113-24.

Chomsky, Noam. 1957a. Review of A Manual of Phonology by Charles Hockett. IJAL XXIII. pp. 223-234.

Chomsky, Noam. 1957b. Syntactic Structures. 's-Gravenhage: Mouton.

Daston, Lorraine. 1988. Classical Probability in the Enlightenment. Princeton: Princeton University Press.

Durand, Jacques and Bernard Laks. 1995. Introductory essay to Current Trends in Phonology: Models and Methods, vol. 2, pp. 395-418. Paris: CNRS, ESRI, Paris X.

Gigerenzer, G., Z. Swijtink, T. Porter, L. J. Daston, J. Beatty, and L. Krueger. 1989. The Empire of Chance: How Probability Changed Science and Everyday Life. Cambridge: Cambridge University Press.

Goldsmith, John. 1990. Autosegmental and Metrical Phonology. Oxford: Basil Blackwell.

Goldsmith, John. 1998. Unsupervised Learning of the Morphology of a Natural Language.  University of Chicago ms, also available at humanities.uchicago.edu/faculty/goldsmith.

Hockett, Charles. 1955. A Manual of Phonology. Indiana University Publications in Anthropology and Linguistics, Memoir 11 (IJAL).

Jakobson, Roman. 1971 [1961]. Linguistics and Communication Theory. Presented  in the Symposium on Structure of Language and Its Mathematical Aspects, New York, 15 April 1960, published in Proceedings of Symposia in Applied Mathematics XII (1961). Reprinted in Roman Jakobson: Selected Writings II The Hague: Mouton.

Jakobson, Roman. 1973. Essais de linguistique générale: Rapports internes et externes du langage. Paris: Les Éditions de Minuit.

Jelinek, Frederick. 1998. Statistical Methods for Speech Recognition. Cambridge: MIT Press.

Kahn, Daniel. 1976 [1975]. Ph.D. dissertation, Massachusetts Institute of Technology.

Kornai, Andras. 1996. Analytic models in phonology. In J. Durand and B. Laks (eds.) Current Trends in Phonology: Models and Methods, vol. 2, pp. 395-418. Paris: CNRS, ESRI, Paris X.

Morgan, J. and K. Demuth (eds.). 1998. From Signal to Syntax: Bootstrapping from Speech to Grammar in Early Acquisition. Hillsdale, N.J.: Lawrence Erlbaum Associates.

Pike, Kenneth. 1943. Phonemics: A technique for reducing languages to writing. Glendale, CA: Summer Institute of Linguistics.

Shannon, C.E. and W. Weaver. 1949. The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Troubetzkoy, N.S. 1967 [1939]. Principes de Phonologie. Translated by J. Cantineau. Paris: Klincksieck.

Notes



[1] An early draft of this paper was presented at the Royaumont CTIP II Round Table on Phonology in the 20th Century, June 26, 1998.

[2] Interview with Noam Chomsky, p. 342, in Baars 1986.

[3] And here I am not focusing on the acquisition of morphophonemic alternations, but the broader question of the acquisition of the distribution of all the phones.

[4] The reader may wonder why the entropy of the subsets could not be reduced to zero by organizing the examples into different subsets based simply on which coronal sound is present; would that not mean that the entropy could always be reduced to zero? This question misunderstands the problem; what we really are interested in doing is dividing up the set of contexts or environments into different sets, and seeing what the entropy in each is with respect to the choice of coronal, and the sound in question is not part of the context (of its own context, so to speak).

[5] The adoption of the Bayesian point of view, and the abandonment of the frequentist point of view, means that there is no reason at all to associate probabilistic views of language with any assumption that language is in any sense random. Probabilistic statements based on evidence are rather claims about the degree to which generalizations are empirically supported by a given corpus. One can find clear statements of this point of view in well-known writings before the 20th century, such as in Boole's classic Laws of Thought (1854).

[6] “The absolute figure of real frequency of a phoneme is of only secondary importance. Only the relationship between this figure and the theoretically expected frequency is of real significance…Calculating theoretical probabilities is not always as simple as in the preceding examples. But one should not be daunted by the difficulties of such a calculation, because it is only by comparison with possible frequency obtained by such calculations that actual frequency takes on a value, by showing if a phoneme in the language in question is used a great deal or a little.”

[7] where "y" is the unstressed high front glide in English, spelled with a "y" normally.

[8] O is the vowel in "caught" (open o).

[9] I have developed a system of unsupervised learning of morphology based on this proposal; see Goldsmith 1998 for discussion. In addition, I ran a similar count on running English text with all white spaces removed, and found the following top 17 sequences of 3 letters (bear in mind that these are not the highest-frequency trigrams (= sequences of 3 letters); they are the trigrams whose frequency most exceeds the expectations we would have based on bigram frequency, weighted by their frequency in the running text, i.e., in the language):

and, for, boy, ing, hat, you, was, ook, his, ght, not, tom, new, loo, ver, now, wit

Obviously, these are either words, or parts of high frequency words or morphemes. The text was Tom Sawyer, which is why Tom is there; hat is part of what, just as ook is part of look and took (as loo is part of look). The most striking fact is that all of these sequences are word-internal, even though the exercise was run on a text in which all of the white spaces between words had been eliminated.

[10] See Durand and Laks 1995 for recent discussion.

[11] This was a large part of the message of lexical phonology in the 1980s (Goldsmith 1990, chapter 5).

[12] The only alternative to admitting that language-learners use information theoretic notions in the task of learning a language is to imagine that learning language amounts to establishing the language-particular settings along a small number of dimensions, a notion that I find difficult to take seriously, in view of the enormous range of variation across languages.

[13] Or the reciprocal of 1 + entropy, or any other familiar function of x which grows smoothly while x decreases.

[14] I have put that so broadly that it may seem presumptuous to suggest that information theory has priority in that task, which sounds like it is the domain of many other fields of study, from mathematics to criminology. What information theory claims for its own is the quantitative study of sets and signals when separated from other substance or content.

[15] There is here in the original a reference to a review by Bar-Hillel in Word (Bar-Hillel 1958, p. 328).

[16] But its frequency as a noun is not 0.0; this very sentence contains an example of of being used as a noun.

[17] I do not wish to oversimplify this point; it is a complex one, and after all, anyone can answer any question they please, regarding all sorts of questions whose answers will not be known for 40 years, and if they answer enough such questions, they are bound to be right (or to have been right) some of the time.