To appear in International Journal of Speech Technology (Kluwer).
Dealing with Prosody in a Text to Speech system
John Goldsmith
Department of Linguistics
University of Chicago
The task of assigning appropriate intonation to synthetic speech is one which requires knowledge of linguistic structure as well as computational possibilities. This paper surveys the basic challenges facing the designer of a text to speech system, and reviews some of the perspectives on these problems that have been developed in the linguistic literature.
Building a text to speech system – often called a "TTS system" – involves three major steps. In the first, text is converted to phonemes, the symbols representing in a rough way the categories of English speech sounds (or the speech of whatever language one is interested in); a second stage involves questions of prosody, i.e., the intonation and pausing; and the third stage is the backend, the component responsible for the production of the sounds from the specifications provided by the first two components. In this brief note, I will cover some of the issues that have arisen in the development of the second of these components for a TTS system at Microsoft during the course of 1996. This system, which I will call "SpeakEasy", was linked to two other Microsoft projects: NLPWin, a large-scale unrestricted-text syntactic parser, and Whistler, a concatenative speech synthesizer. My purpose here is to provide a brief overview of the issues and assumptions that a linguist would bring to bear on the problems faced in the development of a TTS system. Most of the examples given here are linked to .wav files accessible through the Internet.
Pitch is an unavoidable property of any speech at all – any speech that is not whispered. As is well-known, voiced speech is driven by the sound produced by the vibration of the vocal folds, which open and close at a rate that varies from about 50 Hz, at the low male end, to about 400 Hz, at the high female end; this is called the fundamental frequency, or F0. The fundamental frequency rarely remains constant for long, and the human ear is very sensitive to these variations. If we build an artificial voice which is based on a fixed fundamental frequency, or on a fundamental frequency which is constant in pitch for even as long as a second or two, the speech sounds stiff and artificial – in English, and in most other languages as well.
Native listeners reject more than just stretches of speech on a constant pitch. If we take utterances as small as a phoneme or so, and artificially concatenate sounds taken from widely different parts of a recording of a single speaker, the resulting utterance will not sound much better. It will not have a constant, fixed pitch, but neither will it have a pitch pattern that corresponds to a natural English intonation, and it will not sound natural. Our task is to provide a specification of pitch for a synthesized utterance that will sound natural, and that will, to the extent possible, sound appropriate for its context.
The term prosody is used to refer both to pitch and to the placement of pauses in speech, key elements for making synthetic speech sound natural and acceptable to native listeners. It is normal for a human speaker to pause at various places in his or her speech – to think, to find a word, to emphasize. While these are functions that a TTS device does not absolutely need to duplicate (though perhaps it should sometimes emphasize certain words and phrases), still the fact remains that human listeners expect pauses when they listen to speech, and a functional TTS system must give its listeners those expected pauses. Without them, listening to extended synthetic speech becomes a burdensome task, and the listener's attention will rebel.
So the prosodic component of a TTS system must provide sufficient information to make the pitch sound realistic, and it must supply appropriate pauses so that the speech is easy to attend to.
In a number of respects, the syllable is the most natural unit for analysis of prosody. The word is too large a unit: in even a two-syllable word (candy, say), it matters very much whether the pitch of the first syllable is higher than that of the second, or not. In the normal, neutral pronunciation, the first syllable's pitch is considerably higher than the second, whereas in a word like balloon, the beginning pitch of the second syllable can be as high as that of the first syllable. Thus we must make the level of analysis finer than the word.
The word is too large a unit, but in general the phoneme is too small a unit, in the sense that it is rarely the case that one would want to specify independently the pitch of successive phonemes. In most cases, successive phonemes are assigned pitch by virtue of what we might call a pitch trajectory – a line, perhaps curved, perhaps straight, connecting one pitch target to another pitch target some distance away (about three or four syllables away, which is to say perhaps ten phonemes). For most purposes, it is necessary to specify the pitch target for the vowel nucleus of certain syllables (those that are accented), and then compute the pitch for the phonemes in-between by some more local, often somewhat stochastic, process. We will return to this question below.
But while it is true that we can use the syllable as an appropriate unit of analysis for purposes of pitch assignment, it is not true that we need provide one pitch target per syllable. I have already alluded to the fact that the pitch targets of some syllables are determined in a more passive, indirect way, but more importantly, there are many syllables to which two or even three tones must be assigned. This is typically the case when the syllable in question is at or very close to the end of its intonational phrase, but since intonational phrases are relatively short in English, as in other languages, this occurs frequently.
As we will see shortly, the most efficient way to think about the relationship between tones and syllables in a given phrase is as a relation between equals: there are tones in each phrase which comprise a tonal melody, and those tones are associated with particular syllables; pitches of intermediate syllables are filled in by default.
Pauses, on the other hand, are periods of silence. That is not all that they are: they are also heralded by the way in which the immediately preceding syllable is pronounced, with a very noticeable pre-pausal lengthening. The pause itself can be measured in milliseconds; pauses on the order of 40 msec are appropriate inside a sentence.
We have used the term pitch several times, and we have referred to tones as well. The importance of distinguishing between these two notions will become clearer as we proceed, but it can only be helpful to mention some of the ways in which these terms are used by linguists. There is a spectrum of concepts that are closely linked, stretching from a set of observations of the openings of the vocal folds, at one extreme, going through fundamental frequency and then pitch, and ending with tone. None of these four concepts coincide precisely with any of the others. Observations of the opening and closing of the vocal folds can be idealized as a frequency – the frequency is the reciprocal of the period of the folds' vibration, the length of time from the onset of one opening to the onset of the next (or we could choose some other landmark in the cycle of the vocal folds' behavior). In reality, it is frequently difficult to decide the precise length of such a period, and this can be for several reasons, the most obvious being that if the vocal folds' period is speeding up or slowing down, then the precise period that we measure will depend on just where in the vocal folds' cycle we choose to measure the period. For the purposes of work on intonation, these niceties are of virtually no interest whatsoever, so we ignore them, and we take fundamental frequency to be a measurable quantity.
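To make the reciprocal relationship concrete (a worked illustration with an invented period, not a measurement from this study):

$$F_0 = \frac{1}{T}, \qquad T = 8\,\mathrm{ms} \;\Rightarrow\; F_0 = \frac{1}{0.008\,\mathrm{s}} = 125\,\mathrm{Hz},$$

a value in the middle of the male range.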
Pitch is closely aligned with fundamental frequency, but the term is used to describe a quantity which abstracts away certain characteristics of fundamental frequency which are not relevant to prosody. Some of the properties of vowels and consonants have an effect on fundamental frequency; voiceless consonants can raise it, and voiced consonants lower it. When we speak of the voice’s pitch, we will be abstracting away these effects.
Tone is yet more abstract than pitch – abstract enough so that there can be real disagreement as to just what it is. Tone is a linguistic concept; it is concerned with relative pitch, and with categorical (that is, discrete) distinctions rather than continuous differences. (Pitch, of course, is in the domain of continuous concepts.) If a man were to repeat exactly what a woman had just said, with the same intonation, chances are that the pitch would be much lower, but the tones analytically assigned to the sentence would be identical.
Developing a text to speech system amounts to developing a model of a person who is reading out loud, which means that part of the TTS system must be a model of the knowledge that we would call knowing how to read. As with any linguistic knowledge of this sort, the closer we look at it, the more implicit knowledge we realize is there – and this knowledge must be made explicit in our model. Among the most important functions early in the process are sentence-breaking and text normalization.
Sentence breaking means what it says: figuring out how to break a text into sentences – since sentences are transparently among the most important units to be considered in developing a TTS system. It would be nice if there were an unambiguous sentence terminator in English, but it's simply not to be found. Periods, of course, are used for many functions other than marking the end of a sentence; their most notorious other use is to mark abbreviations: Dr. King, Mrs. Smith. Nothing prevents them from playing a double role -- as in this very sentence, e.g. An abbreviation at the end of a sentence is followed by only one period, not two, as one might logically expect. Furthermore, because of the possibility of quotations, even exclamation points and question marks, which by and large play fewer roles, are not unambiguous marks of sentence breaks: Where are you going?, she said. Out!, he replied. (Some stylists reject the comma following the ? and the !, which only makes the point more sharply that sentence breaking is not trivial.)
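A minimal sketch of the sort of heuristic this implies is given below, in Python; the abbreviation list and the treatment of trailing quotation marks are illustrative assumptions, not the rules of any actual system.

# Heuristic sentence breaker: split after . ! ? unless the token carrying
# the period is a known abbreviation. Illustrative and incomplete: a real
# system needs a large abbreviation lexicon, and must still guess when an
# abbreviation itself ends the sentence.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "e.g.", "i.e.", "etc."}

def break_sentences(text):
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        bare = tok.rstrip('"\',)')   # so 'going?",' still shows its '?'
        if bare.endswith(('!', '?')) or (
            bare.endswith('.') and bare.lower() not in ABBREVIATIONS
        ):
            sentences.append(' '.join(current))
            current = []
    if current:
        sentences.append(' '.join(current))
    return sentences

print(break_sentences("Dr. King spoke. Where are you going? Out!"))
# ['Dr. King spoke.', 'Where are you going?', 'Out!']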
Once we have established sentence breaks in our text, we can consider the task of converting it to a string of phonemes. This job can be divided into two steps: first, ensuring that all text has been converted to what we may call orthographic normal form (which permits only alphabetic characters and punctuation, but no additional symbols: no numerals, no @s, no 3/4s), and second, conversion of orthographic normal form to phonemes.
(1) a. Mr. Haapenen's e-mail address is jnh@b-mass.com.
Conversion of (1) to orthographic normal form:
(2) Mister Haapenen’s e-mail address is jay enn aitch at bee hyphen mass dot com.
Conversion to phonemes:
(3) M IH S T RR H EH Y P AH N AH N Z IY M EY L AH D R EH S IH Z …
The problem of text normalization – of conversion to orthographic normal form – is an open-ended one, with new problems arising constantly as the language evolves and innovates. I will briefly summarize some of the issues that arise in this context. E-mail and the web have created all sorts of challenges for this task, not the least of which is deciding what to do with emoticons (like ☺ or :) ). Do they even have a normalized form? It's certainly not clear to this writer that they do. A good discussion of text normalization can be found in Liberman and Church 1992.
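To give a feel for the mechanics, the sketch below normalizes one narrow class, e-mail addresses, using invented mapping tables; deciding when a run of letters should be read as a word ("mass") rather than spelled out ("jay enn aitch") is exactly the kind of open-ended judgment described above, and the sketch does not attempt it.

# Toy text normalizer for e-mail addresses. The tables are illustrative
# assumptions, not an actual system's rules.
DIGITS = {'0': 'zero', '1': 'one', '2': 'two', '3': 'three', '4': 'four',
          '5': 'five', '6': 'six', '7': 'seven', '8': 'eight', '9': 'nine'}
SYMBOLS = {'@': 'at', '-': 'hyphen', '.': 'dot', '/': 'slash'}

def normalize_email(address):
    """Spell an address out character by character, reading punctuation
    aloud; 'jnh@b-mass.com' comes out roughly as
    'j n h at b hyphen m a s s dot c o m'."""
    out = []
    for ch in address:
        if ch in DIGITS:
            out.append(DIGITS[ch])
        elif ch in SYMBOLS:
            out.append(SYMBOLS[ch])
        else:
            out.append(ch)   # a fuller version would group letter runs
                             # into pronounceable words where possible
    return ' '.join(out)

print(normalize_email("jnh@b-mass.com"))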
The problem of turning normalized orthographic form into phonemes has two parts to it, for most systems: the first part involves the core vocabulary of English, some or all of whose pronunciations can be listed in a lexicon. Arguments can be made in either direction regarding the value of storing the pronunciation of a word whose pronunciation is perfectly predictable, such as "hat". In the system whose development I will discuss below, a large and complex syntactic analysis is performed which accesses much that is arbitrary about each lexical item, and picking up the phonemic information along the way is no trouble at all; there's no value gained in computing the phonemes H AE T from 'hat'.
If there is a core vocabulary of English, one which would be a component of a language analyzer for the TTS system, it is nonetheless true that the words in our orthographic normalized form will extend far beyond that core vocabulary. The list of proper names – given names, family names, and geographical names – and foreignisms that can appear in a perfectly common piece of writing is very long. A TTS system must include, we have said, the abilities that an adult reader has, and this includes, to some degree, the ability to pronounce new words and names that one encounters. How do we do this? The task is not without its pitfalls. If we focus on the names of persons (as opposed to geographical locations, trademarked names, etc.), we can in fact obtain a helpful list of surnames and given names from the U.S. Census Bureau at http://www.census.gov/genealogy/names/. But the list does not come with pronunciations, and of course some families may choose to pronounce a name in a creative way that one probably could not have guessed (Koch may be pronounced "Cook," for example). In general, the most effective strategy is likely to be to establish an automatic procedure to divide names up into the languages of their origins, and then to establish rules for pronunciations of names for each subsystem. Trying to set up one set of rules for all the names would be a bit like – indeed, would be very much like – trying to set up a single set of spelling-pronunciation rules that will work for all the European languages, and then worrying about what to do when rules that work well for Italian (say) work badly for English. If we compare the different stress pattern that we are likely to put on the names Connally (initial stress) and Connelli (penultimate stress) despite the near identity of the phonemes, we see that we are going well beyond the phonemes in establishing the stress pattern. This single example does not prove that two autonomous letter-to-sound correspondence systems are at work here (we might pursue the notion that noting the roles played by final /i/ and final /y/ will be sufficient for accounting for the stress patterns), but it illustrates the general point.
The task of converting even normalized orthography to phonemes in English is made more difficult by the fact that there are several hundred words in English whose pronunciation depends on their part of speech. The best-known cases are those of closely-related Latinate verb-noun pairs such as súbjèct/sùbjéct, óbjèct/òbjéct and noun/adjective pairs such as cóntènt/còntént. (Rarer but still troubling are other, less systematic pairs, such as polish/Polish.) There are various ways to resolve the problem of determining which pronunciation should be used in a given case, but the best way is to have a full syntactic analysis: in we'll subject the subject to mild discomfort, we would like to know that the first subject is a verb, the second a noun, and a full syntactic parse will tell us this, in addition to telling us where the subject noun is and many other things, some of them quite relevant to the intonation. The subject of grapheme-to-phoneme conversion continues to be studied in depth (see, e.g., Divay and Vitale 1997 for a recent discussion) for difficult languages like English and French.
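One natural way to store such homographs is a lexicon keyed by word and part of speech. The sketch below uses invented entries; the phoneme strings are my own guesses in the notation of the parse output shown later, not any actual system's lexicon.

# Hypothetical homograph lexicon: pronunciation depends on part of speech.
# 1 marks primary stress, 0 an unstressed vowel.
HOMOGRAPHS = {
    ("subject", "NOUN"): "S AH1 B JH EH0 K T",
    ("subject", "VERB"): "S AH0 B JH EH1 K T",
    ("object",  "NOUN"): "AA1 B JH EH0 K T",
    ("object",  "VERB"): "AH0 B JH EH1 K T",
}

def pronounce(word, pos):
    """Return the POS-appropriate pronunciation if the word is a homograph."""
    return HOMOGRAPHS.get((word.lower(), pos))

# "We'll subject the subject to mild discomfort":
print(pronounce("subject", "VERB"))   # S AH0 B JH EH1 K T
print(pronounce("subject", "NOUN"))   # S AH1 B JH EH0 K T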
Microsoft has been developing for several years a broad-coverage syntactic parser, NLPWin, and our prosody system was designed from the start to lean heavily on its functionality. The prosody component accepts a sentence typed in by the user and sends it, via a DLL call, to NLPWin, receiving in return a syntactic parse of the sentence, in traditional bracketed tree notation, which includes the appropriate phonemic representation, with stress, for each word. A range of additional syntactic information is provided as well. Consider a simplified example produced by NLPWin, a parse of the sentence I drink coffee, tea, and milk:
(4) Simplified output from NLPWin:
[ C:NP Subject Nom Pers1 Sing Humn Anim ProsFW
[ C:PRON O:I P:AY1 Nom Pers1 Sing Anim ProsFW ] C:PRON
[ C:VERB O:drink P:D R IH1 NG K ]C:VERB
[ C:NP B: Object Coordconj NPcoord Conjcomma ProsFW
[ C:NOUN O:coffee P:K AO1 $ F IY0 Mass ] C:NOUN
[ C:CONJP ProsFW
[ C:CONJ O:, Coordconj ProsFW ] C:CONJ
] C:CONJP Coordconj ProsFW
[ C:NOUN O:tea P:T IY1 Mass Food ] C:NOUN
[ C:CHAR O:, ] C:CHAR
] C:NP Food
[ C:CONJ O:and P:AH1 N D Coordconj ProsFW ] C:CONJ
[ C:NP Food
[ C:NOUN O:milk P:M IH1 L K Mass Food ] C:NOUN
] C:NP Food
[ C:CHAR O:. ] C:CHAR
Mnemonic tags indicate much of the information made available here, including indications of DECLarative sentence, PROSodic Function Word, and CONJunction-comma (i.e., a comma functioning as a conjunction). "O:" tags Orthography, while "P:" tags a phonological representation of the word.
With this information at our disposal, many problems are eliminated or avoided that might otherwise seem to be considerable challenges. For example, as the parse in (4) illustrates, SpeakEasy (fn. 1) needs to do no work to determine which commas play the role of conjunctions, as in the phrase coffee, tea, or milk (such constructions are often referred to as lists, and are associated with a particular list intonation). This is important for intonation, because list intonation typically involves a rising intonation on each conjunct but the last one, and English does not have a separate punctuation mark for marking this particular function. A parser can do this work of identification, however.
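A sketch of how the parse's coordination features might be put to work: once the conjuncts of a coordinated NP have been recovered (from the Coordconj and Conjcomma tags), every conjunct but the last receives the rising list melody. The flat list below is an invented simplification of the NLPWin output, not its actual interface.

# Conjuncts recovered from a coordinated NP, e.g. "coffee, tea, and milk".
conjuncts = ["coffee", "tea", "milk"]

def list_melodies(items):
    """Assign the rising melody L* H to every conjunct but the last,
    and the falling melody H* L to the last one."""
    return [(item, "L* H" if i < len(items) - 1 else "H* L")
            for i, item in enumerate(items)]

for word, melody in list_melodies(conjuncts):
    print(word, melody)
# coffee L* H / tea L* H / milk H* L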
A second important area for generating prosody that is facilitated by having a full-bodied syntactic parse involves the area of pauses (see, for example, Wang and Hirschberg 1992). In normal English text, the appearance of commas in most cases needs to be mapped onto the boundary of an intonational phrase. But there are cases where no punctuation appears and a boundary would be appropriate, and there are cases where no boundary is appropriate despite the presence of a comma. For example, a prepositional phrase (PP) which appears at the beginning of its clause is typically followed by a pause, which may or may not be graphically represented as a comma. A real example, from The Wizard of Oz: Soon after they had begun their journey again they came to a place where the trees grew so thick that the travelers could not pass. A pause after "again" would be natural, even though no comma is present in the text. More generally, a sentence-initial subordinate clause of more than six or seven words is well-served by being followed by an intonational boundary, regardless of whether this has been marked orthographically by a comma or not. (Of course, knowledge of where that subordinate clause ends involves sophisticated linguistic analysis.) Contrariwise, a comma that separates pre-adjectival adverbs is one that is best ignored by the TTS system, as in He was very, very wrong to do that. List intonation would be entirely inappropriate for that sentence, as would any attempt to use the comma to divide the sentence up into phrases (though in this case a full syntactic parse is not necessary to do the right job here).
Much of the work on English intonation done in the American research community over the past fifteen years has been heavily influenced by contemporary phonological theory. The case is worth reviewing with some care, if only because it is a clear case where the speech and theoretical linguistics communities have unambiguously evolved in a mutually beneficial way.
Through the decades of the 1950s and the 1960s, research on prosody within the mainstream linguistic community in the United States was not a high priority, and within the early generative paradigm, work was largely (though not entirely) restricted to Chomsky and Halle's (1968) analysis of stress in English. While that analysis made certain forays into stress above the level of the word, it was generally and correctly perceived as focusing on predicting word-internal stress patterns (though it dealt with many other segmental issues within words), which was where most of the attention had been directed in American linguistics. Issues of intonation in English and European languages, as well as issues of tone and pitch in non-Indo-European languages, were by and large treated only marginally in most, though not all, of the mainstream American tradition. Issues of tone and intonation continued to be the focus of considerably more interest and attention for European scholars during this period. One reason for the relative lack of interest in intonation among American phonologists was the dominance of the notion of the phoneme: since intonation could not be used in English to distinguish lexical items, it did not rise to the level of the phoneme, so to speak. This was also the reason that word-internal stress was a subject of interest for American linguists: there were many pairs of words whose only difference lay in the stress pattern that they possessed. Linguists with a broader, more practical or functional perspective, such as Kenneth Pike and Dwight Bolinger, paid more attention to intonation, and of course linguists who worked on languages in which intonation does distinguish lexical items, as it does in Norwegian, Swedish, and Serbo-Croatian, naturally maintained a serious interest in intonational matters.
In the second half of the 1970s, the interests and concerns of the American phonological community shifted considerably, largely under the influence of two frameworks for analysis which drew their original strengths from the study of prosodic phenomena: autosegmental analysis (Goldsmith 1976) and metrical analysis (Liberman 1975). Autosegmental analysis involves breaking down phonological systems into parallel interacting systems, and the system comprising tones, on the one hand, and syllables, on the other, is one of the most striking and interesting of interacting systems in language, though by no means the only one. Within the first two years after the definition of the framework, analyses of a wide range of languages and systems had been developed, including tone in several African languages, tone in Japanese dialects, vowel harmony, nasal harmony, and, in the first paper circulated in the framework (Goldsmith 1980), English intonation. Entitled "English as a tone language," this last paper argued that English intonation should be treated as being composed of parallel tiers of tones and of phonemes, each independent of the other. Certain tones were "accented," indicated with an asterisk sitting on top of the tone; we will return to their function in a moment. Other tones were not accented. These tones came together in packages, such as the pattern (L* H)ⁿ H* L that is characteristic of list questions, such as "Do you want coffee, tea, or milk?"; an intonational lexicon could be conceived of in this way. The accented tones and syllables on this account corresponded to the term used earlier by Dwight Bolinger, pitch accent, and it is Bolinger's term (Bolinger 1958), used again by Pierrehumbert (see below), that has become widespread. Thus the 1974 Goldsmith model had only two types of tones, (pitch-)accented and unaccented.
Let us consider the example Do you want coffee, tea, or milk? Accents are assigned to the words coffee, tea, and milk (and in particular to their primary-stressed syllable), and the tonal melody L* H is assigned to all but the last such phrase; H* L is assigned to the last; phrase boundaries are marked by "$":
(5)             *          *          *
    Do you want coffee  $  tea  $  or milk

    L* H    $   L* H    $   H* L
The corresponding asterisked (i.e., accented) elements in each phrase are autosegmentally "associated":
(6)             *          *          *
    Do you want coffee  $  tea  $  or milk
                |           |          |
                L* H    $   L* H   $   H* L
Finally, the unaccented tones are spread to the available following syllables, creating contour tones (i.e., rising and falling patterns) in several cases:
(7)             *          *          *
    Do you want coffee  $  tea  $  or milk
                |    \      | \        | \
                L*    H $   L* H   $   H* L
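The association procedure illustrated in (5)-(7) can be stated as a small algorithm. The sketch below assumes a deliberately flattened representation – one intonational phrase, one accented syllable, one starred tone in the melody – a simplification of the full theory rather than an implementation of it.

def associate(syllables, accented_index, melody):
    """Link one phrase's tonal melody to its syllables, autosegmental-style.

    syllables      : list of syllable strings for the phrase
    accented_index : index of the (single) accented syllable
    melody         : tones, exactly one marked '*', e.g. ['L*', 'H']

    The starred tone docks on the accented syllable; later tones spread
    rightward syllable by syllable, piling up on the final syllable when
    they run out of room, which is what creates a contour tone such as
    the rise on a final monosyllable like 'tea'."""
    star = next(i for i, t in enumerate(melody) if t.endswith('*'))
    links = {i: [] for i in range(len(syllables))}
    links[accented_index].append(melody[star])
    for offset, tone in enumerate(melody[star + 1:], start=1):
        links[min(accented_index + offset, len(syllables) - 1)].append(tone)
    for offset, tone in enumerate(reversed(melody[:star]), start=1):
        links[max(accented_index - offset, 0)].insert(0, tone)
    return links

# 'cof-fee' with melody L* H: L* docks on 'cof', H lands on 'fee'.
print(associate(['cof', 'fee'], 0, ['L*', 'H']))   # {0: ['L*'], 1: ['H']}
# 'tea' with the same melody: both tones on one syllable, a rising contour.
print(associate(['tea'], 0, ['L*', 'H']))          # {0: ['L*', 'H']}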
Mark Liberman's dissertation (1975) developed a complementary system, metrical phonology, but in that dissertation and in Liberman and Sag (1974), an addition was made to the strictly tonal analysis of English proposed in Goldsmith (1980). Liberman argued that there were not two (accented and unaccented) but three functionally distinct roles in which a High/Low contrast arises in English intonation. That is, in addition to the contrast between accented High and accented Low tones, and the contrast between unaccented High and unaccented Low tones, there was a third locus where a tone must be identified, and this position was at the edge (sometimes left, more often right) of a major intonational phrase; Liberman called the tone playing this role a "boundary tone," indicated by a % adjacent to the tone. The boundary tone is most striking in the case of a yes/no question, whose final syllable will normally be on a higher pitch, as in (8):
(8) Wasn't John in the hospital? (link)
    L      L          H       H%
This work in intonation retained the traditional linguistic point of view which puts considerable attention on the notion of contrast, which means, in particular, contrast in a given position in an utterance. It might well be true that an accented Low tone is lower in pitch than an unaccented Low tone, but that is no reason not to call them both Low tones. What is critical is to determine what options there are in the language for qualitatively different intonations at a particular place in an utterance: say, on an accented syllable. If the answer is that there are only two possibilities (one relatively high, and one relatively low), then that position motivates only a binary tonal distinction.
This statement of the problem glosses over a problem which has not disappeared, however. In the previous paragraph, I used the phrase "qualitatively different intonations," a phrase that hopes to get around the problem of whether intonations are really lexical items (parallel to "dog", "cat", or "I"), for if they are not, then formal means used to distinguish them are not suitably central to linguistic analysis – this, of course, is the modern translation of the notion that phonemic contrasts are justified only if they are actually used by a language to distinguish members of a minimal pair. From a purely pragmatic point of view, if we wish to develop an intonational system for synthetic speech, there is no question that we need at least this fine-grained an approach to intonational description, regardless of what linguistic theory may or may not permit!
Janet Pierrehumbert's dissertation (submitted in 1980, also at MIT, as Liberman (1975) and Goldsmith (1976) were) was the first detailed study of English intonation using the framework that I have described. As noted earlier, Pierrehumbert applied the term pitch accent, a notion developed by Dwight Bolinger in an influential series of publications (see Bolinger 1958, 1965), to the tone associated with the accented syllable; she proposed that in certain cases, it was not a single tone, but a tightly bound pair of tones (H* L, for example) that constituted the pitch accent, even though it was only a single one of those tones that ultimately would be associated to the accented syllable in question. (In Pierrehumbert 1980, a bitonal pitch accent is permitted, and the tone which is not associated with the accented syllable is marked with a superscripted minus sign; in addition, the two tones of the bitonal formula are linked by a plus sign +.) For the tone that stretches over the last plateau of an intonational phrase she used the term "phrasal accent" (p. 26 et passim), though phrasal tone is a better term, for this tone's association is primarily to a stretch of syllables which will not be accented, and it is that term that I shall use.
In sum, then, intonational structure should be understood as the joint product of a sequence of words (composed themselves of syllables) and a sequence of tones. The tones are not features or attributes of words or syllables (except indirectly); tones rather are units that form a linked list, much as words do, and while there is frequently a one-to-one relationship between tones and syllables, the best one can say regarding the general case is that there is a many-to-many relationship between tones and syllables. In the case of a one-syllable word pronounced with the neutral intonation in English, we have a relationship like that shown in (9), where the monosyllabic word "horse" is associated with the tones High (H) and Low (L), themselves linearly ordered, in the sense that the H precedes the L temporally. Both tones are associated with the same syllable, and so that syllable is realized with a sharp fall in pitch from High to Low.

(9)  horse
     /   \
    H     L
We may say as a first approximation that all words other than function words are assigned a pitch accent (indicated with an asterisk in autosegmental representation). Pitch accents are the critical points at which a change in the direction of pitch change occurs in spoken English, and in virtually all cases, pitch accent occurs on the syllable that the dictionary gives as the primary-accented syllable of the word. Most function words (also called grammatical words) are normally not assigned a pitch accent, but as the italics on "not" in this sentence illustrate, some grammatical words are more likely to be assigned pitch accent than others (not is more likely than the, the is more likely than a). The most distinguishing phonological characteristic of compound nouns is the fact that the second word in the compound does not bear a pitch accent at all: thus, the stress pattern of dogbones is primary stress followed by secondary stress, just like Revlon. If we encounter a compound noun which is not spelled in the familiar fashion – that is, if it is not spelled as a single word, with no space between its parts – then it is important to ascertain that it is a compound noun, so that we may put primary stress on the first part of the compound. In the sentence We sold dog bones to the pound, the word bones is the second half of a compound, and must therefore not receive pitch accent. Unfortunately, the ability to determine whether what is spelled as two separate words is really a single compound is beyond the ability of most TTS systems, and requires a significant amount of syntactic knowledge. Such a system must be capable of distinguishing between a sentence like We sold the dog bones to the pound, where dog bones is a compound, and a sentence like We sold the dog bones after we sold our cat the remaining catnip, where we sold the dog bones is a paraphrase of we sold bones to the dog.
Another extremely important consideration in assigning pitch accent is that words which refer to items recently mentioned in the preceding discourse are generally unstressed; this generalization on destressing is part of an area of research that has been studied in some detail recently. In the sentence John teaches linguistics, so his son decided to take a course in linguistics too (link), the second occurrence of linguistics would sound quite unnatural if it were to be accented for this reason. (In most cases where a noun could logically be repeated, English finds a way to avoid its repetition, either through use of a pronoun or through deletion. E.g., John works with Professor Fishman, and Mary works with him too, or Mary bought a BMW, so John bought one too. In these cases, a second accent is avoided. Only in cases of contrast does accent emerge on the repeated words, as in John respects Mary, and she respects him.). But it should be noted in addition that too is assigned a pitch-accent here; if it were not (if it were treated prosodically like other prosodic function words), we would have a very unnatural-sounding sentence (link).
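These two generalizations – content words receive accents, while function words and "given" (recently mentioned) words do not – can be sketched as a first-pass accent assigner. The function-word list below is a stand-in for the parser's ProsFW tagging, and the treatment is deliberately naive; note that it fails on exactly the case just mentioned, the accented final too.

# First-pass pitch-accent assignment (a sketch, not SpeakEasy's rules):
# content words are accented unless they repeat an earlier content word.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "so", "his", "her",
                  "and", "or", "is", "are", "was", "too"}

def assign_accents(words):
    result, seen = [], set()
    for w in words:
        lw = w.lower().strip('.,?!')
        is_content = lw not in FUNCTION_WORDS
        is_given = lw in seen            # already mentioned in the discourse
        result.append((w, is_content and not is_given))
        if is_content:
            seen.add(lw)
    return result

sent = ("John teaches linguistics so his son decided "
        "to take a course in linguistics too")
for word, accented in assign_accents(sent.split()):
    print(('*' if accented else ' '), word)
# The second 'linguistics' correctly comes out unaccented; the final 'too',
# which the text says must be accented here, does not -- a reminder that
# accent assignment ultimately needs discourse-level analysis.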
Not all pitch accents are equally important; in fact, by far the most important is the final pitch accent in an intonational phrase. A sharp change in pitch is typically found immediately after this final pitch accent, and the effect of this pitch change is so marked that a special name is given to the final pitch accent in a phrase: it is called the nuclear pitch accent. (We will return shortly to the "phrasal tone" which is responsible for the pitch following the nuclear pitch accent.) When correctly placed, the fall from the final high pitch accent to the low phrasal tone gives, better than anything else, the subjective impression that a comprehending agent lies behind the production of that sentence. But without a good grammatical analysis, it is easy to make mistakes (many – perhaps most – of these errors can strike the user's ear as quite inappropriate).
Consider an example. Suppose we enter Dickens' familiar sentence, It was the best of times, it was the worst of times (link). If we simply assign accent to all content words, and leave the matter at that, we will produce a sentence with a pitch accent on the final word, times, and this will sound very odd: if we had to write down that pronunciation, we might write it thus: it was the best of times, it was the worst of times (link). If we heard that, we would wonder: why was emphasis placed on the word "times"? To be sure, as I noted above, nouns that are repeated within a sentence are generally de-accented, and this example is an illustration of the cost to be paid for violating that maxim.
Another example: consider the sentence, I went to school today (link). If the final content word in the sentence – today – is assigned a pitch accent, it will be pronounced as if special emphasis, or contrast, were placed on that today: I went to school today (link) . In general, adverbs of time and space that appear at the end of a sentence are not accented, unless they are intended to be understood as being in contrast to some other phrase (I didn’t go to school yesterday, but I went to school today, for example). Nouns or noun phrases that are objects of the verb do not typically show this kind of accento-phobic behavior: if I say, I went to the university (link) (or I went to the university today), a pitch accent must appear on the word university, in the sense that if university were not accented, went would bear the nuclear accent rather than university, and the sentence would sound odd, because a hearer would search for a reason that university had lost its pitch accent (had the speaker just mentioned the word university in the preceding sentence, or is the word went being placed in sharp contrast to some other word – but if so, what might it be? -- that would be the implicit reasoning of the hearer of such a sentence).
We noted above that the term phrasal tone (equivalent to phrasal accent) is normally used to refer to the tone that associates with the sequence of syllables following the nuclear accent. The consequence of this usage is that several distinct properties converge on the nuclear accent: it is the rightmost pitch-accent of its intonational phrase, it is (or may be) slightly higher in pitch (or prominence in some other sense) than the other pitch accents in its phrase, it is followed immediately by a phrasal tone (which phrasal tone is always opposite in polarity from the nuclear tone: phrasal tone is Low after a High nuclear tone, and High after a Low nuclear tone), and the syntactic phrase it is part of typically has been endowed special discourse prominence in the sentence. While it may be difficult to apply all of these criteria in a real-time TTS system, it nonetheless is true that without a clear drop to a Low phrasal tone (in declaratives, or a rise to a High phrasal tone in normal yes/no questions), the intonation simply does not sound natural and life-like. It is better to have a few inappropriate falls to phrasal Low than to have none at all.
Another area which must be handled well, if we are to have a daring – and thus human-sounding – intonation, is that of short, low-pitched phrases. A clear example of this class is a vocative phrase, that is, a phrase used by the speaker to name and address the listener, such as John, in Good morning, John, how are you today? or Mr. Smith, in When you first moved to Seattle, Mr. Smith, did you own a car? A High pitch accent on a vocative such as "John" or "Mr. Smith" in these sentences sounds unnatural. More generally, vocatives which are preceded by material (that is, vocatives which are not initial) are best treated as clitics -- that is, as words not assigned a pitch accent, and which form part of the preceding phrase. (E.g., if you don't mind, John, I'd like to get my batteries recharged soon. (link.)) Treating such a vocative as a separate phrase leads to a somewhat unnatural intonation (link) if it has the neutral H*L pitch accent, and it is unnatural to group it with the preceding material if it has been assigned a pitch accent -- indeed, the hearer perceives a different syntactic structure than the one that was intended (link). However, such a vocative may stand as a free-standing phrase if it is assigned a L*H tonal melody (link). As man-machine interaction becomes more frequent, such vocatives will become increasingly frequent in the computer's production as it "addresses" its user, and the intonation must be appropriate.
It is often said (and tradition supports the view) that in neutral intonation in English, pitch accents are realized as a High tone, a peak in pitch compared to the immediately surrounding syllables. Low pitch accents generally appear in two important contexts: in yes/no questions, throughout the string of pitch accents, and in list intonations, where each item but the last is pronounced with a rising pitch, the accented syllable being the one bearing the lowest pitch of all. These two types are illustrated in (10):
(10) a. Yes/no question:
        Did Reagan know about all of the arms sales?
             L*                          L*       H

     b. List intonation:
             *              *          *                     *
        Massachusetts, Vermont, New Hampshire, and Pennsylvania were among the
        L* H           L* H        L* H             H* L
        original states in the Union.
In (10b), note that the primary-stressed syllables in Massachusetts, Vermont, and New Hampshire are relatively low in pitch, while the primary-stressed syllable in Pennsylvania is relatively high in pitch.
But this accepted wisdom is by no means the whole story, and it is not at all uncommon for speakers to assign Low tones to pitch-accented words in fluent speech, especially when reading a text. Now, that is an observation that cuts two ways: on the one hand, a TTS that endeavors to make its intonational patterns less routine and more varied can indeed use Low-tone pitch accents to that end. On the other hand, this should be done thoughtfully, and the fact that some speakers use this intonation especially when reading is a fact that designers of data-driven intonational systems should beware of. Listeners to Dick Astell may observe his habit of assigning a Low pitch accent to a large proportion of his non-nuclear (that is, non-phrase-final) pitch accents.
For example, the sentence "President Kennedy was traveling in a motorcade on his way into downtown Dallas," (link) may be produced with High pitch accents on each of the accented syllables. It sounds reasonably natural in that way, but if we assign Low pitch accents to "traveling" and "downtown"(link), it sounds more interesting and life-like.
The conversion of the symbolic tonal pattern to a pitch representation, in which each syllable is assigned one or more pitch targets at particular frequencies (in Hz), is the next step in the generation of prosody.
It is at this point that most automatic prosody-generation systems go astray, producing contours that strike naïve listeners as mechanical, inflexible, and annoying. It is critical to incorporate here a set of rules that permit the pitches associated with successive High tones not to be equal, but rather to form a sequence of downward drifting pitches. There has been much discussion of this phenomenon (called declination and downdrift in different contexts), some of it advancing the view that there is no such thing (refs.). I will sketch the point of view that I have incorporated into the present TTS system, one which produces a natural pitch range.
In this model, there are two distinct processes, downstep and declination. Declination refers to the gradual decline in pitch shared by all syllables regardless of their tone in a rather large domain – typically the grammatical sentence (the one that ends with a period, so to speak), but sometimes even longer (example). Downstep, on the other hand, refers to the more dramatic difference in pitch shown by successive High tones in a much smaller domain, and here – crucially – the pitch of the downstepped High tones is "reset" to a downstep-less pitch position at the beginning of each such phrase. In (11), in one common and natural intonation, successive Highs within a phrase are sharply lower than the immediately preceding High, except that the first High of each non-initial phrase (that on decision and great) is reset to a High that is only a bit lower than the first High of the preceding phrase (that is, comparing P4 to P1, or P9 to P4). (In the pattern in (11), in addition, the final or nuclear High tone of each phrase has its pitch boosted slightly.) This general pattern results from the superposition of two effects: one is the long-term declination that takes the entire sentence as its domain, while the other is the sharper effect of downstep on successive Highs within each phrase, an effect which starts over again with each intonational phrase.
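A numerical sketch of the superposition just described appears below; the starting pitch, decay rates, and boost factor are invented parameters for illustration, not the values used in SpeakEasy.

# Two superimposed effects on the pitch of High tones:
# sentence-wide declination, plus downstep reset at each phrase boundary.
BASE_HZ = 180.0        # hypothetical starting pitch
DECLINATION = 0.99     # per-syllable decay over the whole sentence
DOWNSTEP = 0.88        # per-High decay within an intonational phrase
NUCLEAR_BOOST = 1.05   # slight boost on the phrase-final (nuclear) High

def high_tone_pitches(phrases):
    """phrases: one list per intonational phrase, giving the syllable
    positions (counted from the start of the sentence) of its High
    pitch accents. Returns the pitch of each High, in Hz."""
    pitches = []
    for phrase in phrases:
        step = 1.0                          # downstep resets here
        for i, position in enumerate(phrase):
            hz = BASE_HZ * (DECLINATION ** position) * step
            if i == len(phrase) - 1:        # nuclear accent of the phrase
                hz *= NUCLEAR_BOOST
            pitches.append(round(hz, 1))
            step *= DOWNSTEP                # the next High steps down
    return pitches

# Three Highs in phrase 1, two in phrase 2: the first High of phrase 2
# resets to a value only slightly below the first High of phrase 1.
print(high_tone_pitches([[0, 3, 6], [9, 12]]))
# [180.0, 153.7, 137.8, 164.4, 147.4]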
Assignment of pitch to accented syllables (which we have just discussed) must precede the assignment of pitch to non-accented syllables, for the pitch of the latter syllables is computed on the basis of the pitch of the accented syllables in their neighborhood. Different speakers use different strategies for filling in the pitches of unaccented syllables in-between accented syllables, and a natural-sounding transition can be achieved with a smooth curve that descends to a minimum before rising again, at nearly the same rate, to reach the pitch of the following accented syllable.
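One simple realization of such a transition (a sketch only; real systems fit speaker-specific curves) is a linear interpolation between the two accent pitches with a superimposed dip toward a pitch floor:

import math

def sagging_transition(p_left, p_right, n_between, sag=0.9):
    """Pitches for n_between unaccented syllables lying between accented
    syllables at p_left and p_right Hz: a smooth curve that dips toward
    sag * min(p_left, p_right) at its midpoint and rises again at nearly
    the same rate. The sag factor is an invented parameter."""
    floor = sag * min(p_left, p_right)
    pitches = []
    for k in range(1, n_between + 1):
        t = k / (n_between + 1)                  # position across the gap, 0..1
        line = p_left + (p_right - p_left) * t   # straight-line component
        dip = (line - floor) * math.sin(math.pi * t)  # sinusoidal sag
        pitches.append(round(line - dip, 1))
    return pitches

# Three unaccented syllables between accents at 160 Hz and 150 Hz:
print(sagging_transition(160.0, 150.0, 3))   # [141.6, 135.0, 140.1]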
Natural human speech contains a good deal of pausing, generated (so to speak) as the speaker thinks, looks for words, and in some cases pauses for emphasis. Synthetic speech can easily fail to have these hallmarks of human origin, and by failing to have sentence-internal pauses, it may ultimately be difficult to attend to, and unappealing to the human user.
It is therefore imperative for a good TTS system to include a pause-generation system whose effects increase the naturalness of the speech and aid the listener to listen to the generated speech over an extended period of time. This is all the more important as we come to think of a TTS system being able to read or otherwise generate an extended stretch of material, and not just utter a few isolated phrases (as a telephone answering system needs to do, for example). In the development of the SpeakEasy program, pauses were inserted before major clausal syntactic phrases which occurred after at least six words from the previous pause, and after subjects of finite clauses that were at least six words in length; this gave rise to a comfortable and natural pattern of pauses in a wide range of cases.
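The six-word thresholds just described translate directly into rules of roughly the following shape. The phrase representation here is an invented simplification; in SpeakEasy the corresponding information came from the NLPWin parse.

# Sketch of SpeakEasy-style pause placement. Each item is (text, kind),
# where kind is a stand-in for the parser's syntactic labels.
def insert_pauses(phrases, min_words=6):
    """Insert '<pause>' before a major clausal phrase when at least
    min_words words have elapsed since the last pause, and after the
    subject of a finite clause when it is at least min_words long."""
    out, since_pause = [], 0
    for text, kind in phrases:
        n = len(text.split())
        if kind == "clause" and since_pause >= min_words:
            out.append("<pause>")
            since_pause = 0
        out.append(text)
        since_pause += n
        if kind == "subject" and n >= min_words:
            out.append("<pause>")
            since_pause = 0
    return ' '.join(out)

example = [
    ("Soon after they had begun their journey again", "adjunct"),
    ("they came to a place", "clause"),
    ("where the trees grew so thick", "clause"),
    ("that the travelers could not pass", "clause"),
]
print(insert_pauses(example))
# ...journey again <pause> they came to a place where the trees grew
# so thick <pause> that the travelers could not pass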
One of the most difficult decisions to make in developing a TTS system is the degree of risk-taking one wants to assume. The cost associated with a risky system is the cost attached to assigning an inappropriate intonation to any given sentence, and a conservative policy will avoid even low probability situations where a system produces an inappropriate prosody for a given sentence. Not surprisingly, low risk (or what I would prefer to call "low daring") systems tend to sound the least interesting and the least natural. In the development of SpeakEasy, the availability of a reliable syntactic analysis from NLPWin decreased the probability of inappropriate parsing, and hence inappropriate intonation, but for the present, there is no way to drive the proportion of inappropriate intonations down to zero.
I mentioned one example of this sort earlier, the case of vocative expressions.
Naïve subjects listening to computer-generated intonation are frequently sensitive to the repetitive character of the intonation after listening to a dozen or more sentences in a row. In order to prevent this from creating resistance by the listener, it is helpful to build some stochastic control parameters into the generation process. One such parameter was mentioned above: the inclusion of a Low pitch accent, placed randomly inside a sequence of three or more pitch accents in a declarative sentence. This simple effect adds remarkably to the vividness and naturalness of the sentence in ways that listeners perceive but cannot identify. Using a stochastic parameter to modify the realization of pitch from tone is helpful as well in this regard.
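One way to realize the parameter just mentioned is sketched below, with an invented probability setting:

import random

def vary_accents(accents, p=0.5, rng=random):
    """accents: the pitch-accent tones of one declarative sentence, e.g.
    ['H*', 'H*', 'H*', 'H*']. With probability p, replace one randomly
    chosen non-nuclear H* with L* to break up the monotony; the final
    (nuclear) accent is never touched."""
    accents = list(accents)
    if len(accents) >= 3 and rng.random() < p:
        i = rng.randrange(len(accents) - 1)   # excludes the nuclear accent
        if accents[i] == 'H*':
            accents[i] = 'L*'
    return accents

random.seed(1)
print(vary_accents(['H*', 'H*', 'H*', 'H*']))  # e.g. ['H*', 'L*', 'H*', 'H*']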
The treatment of the intonation of questions remains one of the greatest current challenges. It is often said that in American English, the intonation of yes/no questions and wh-content questions is different (cf. (12a) and (12b)):
(12) a. Are you flying to Chicago?
                L*         L*   H

     b. When are you flying to Chicago? (link)
        H*           H*        H*    L
In a yes/no question in American English (though not in British English), accented words, as we have seen, are assigned a Low pitch accent, and the sentence ends with a sharp rise. In wh-content questions, it is often said, pitch accents are High, the pitch realization of these accents is quite high, and the sentence ends with a fall to Low.
However, there are a number of intonations used in American English for questions, and they can rarely be used interchangeably: context determines which one is appropriate. For example, if I have just put away the groceries, my wife might ask (13), with a pitch that descends linearly from a high first syllable to a low point on the first syllable of coffee, followed by a significant jump up in pitch on the final syllable:

(13) Where did you put the coffee?
     H                     L*  H

This intonation is much more polite, in this context, than would be the intonation assigned by the traditional description, which would have a high pitch on all of the syllables up to, but not including, the last one. And yet this more polite intonation can be used in only a relatively limited set of contexts.
At present, synthetic speech is used primarily in text to speech applications and in cases where a relatively restricted set of syntactic patterns is going to be needed. But in the future, computers will interact with their users in more and more spontaneous ways; computers will ask their users for instructions regarding a wider and wider range of questions. It will be necessary to develop carefully the intonations and uses of a far broader range of question types.
Natural-sounding prosody in a TTS system will only become more important in the years to come, but the experience described in this paper suggests that having access to a robust natural language parser, to provide not only correct part of speech and phonemic representation but also correct higher-level syntactic analysis, will be seen more and more as a necessary prerequisite for prosodic systems. As continuous speech recognition becomes a reality over the next few years, more users will expect to be able to interact with their computer systems in the modality of speech, and lively, natural prosody will be seen as a necessary step for persuading the user that the computer functions as a linguistic entity with which he or she can enter into a dialog in his or her own language.
Bloomfield, Leonard. 1933. Language. University of Chicago Press.
Bolinger, D.L. 1958. "A theory of pitch accent in English". Word 14:109-149. Reprinted in Bolinger, D.L. 1965.
Bolinger, D.L. 1965. Forms of English: Accent, Morpheme, Order. Edited by I. Abe & T. Kanekiyo. Cambridge: Harvard University Press.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Monterey: Wadsworth and Brooks/Cole.
Chomsky, Noam and Morris Halle, 1968. The Sound Pattern of English. New York: Harper and Row.
Daelemans, W., and A. van den Bosch. 1996. Language-Independent Data-Oriented Grapheme-to-Phoneme Conversion. In Van Santen, J., R. Sproat, J. Olive, and J. Hirschberg (eds.) Progress in Speech Synthesis. New York: Springer Verlag, pp. 77-90.
Divay, Michael and Anthony J. Vitale. 1997. Algorithms for Grapheme-Phoneme Translation for English and French: Applications for Database Searches and Speech Synthesis. Computational Linguistics 23(4):495-523.
Goldsmith, John. 1976. Autosegmental Phonology. Ph.D. dissertation, MIT. Reprinted by Garland Press, New York, 1979.
Goldsmith, John. 1980. English as a Tone Language. In D. Goyvaerts (ed.), Phonology in the 1980s. Gent: Story-Scientia. Circulated as an unpublished paper, MIT, 1974.
Hirschberg, Julia. 1993. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence 63(1-2): 305-340.
Huang, Xuedong, Alex Acero, Jim Adcock, Hsiao-Wuen Hon, John Goldsmith, Jingsong Liu, and Mike Plumpe. 1996. Whistler: A Trainable Text-to-Speech System. In Proceedings of the Fourth International Conference on Spoken Language Processing.
Klatt, Dennis. 1987. Review of text-to-speech conversion for English. JASA 82(3): 737-793.
Ladd, D. Robert. 1992. An introduction to intonational phonology. In Gerald J. Docherty and D. Robert Ladd (eds.), Papers in Laboratory Phonology II: Gesture, Segment, Prosody. Cambridge: Cambridge University Press. Pp. 321-334.
Liberman, Mark. 1975. The Intonational System of English. Ph.D. dissertation, MIT.
Liberman, Mark and Kenneth Church. 1992. Text Analysis and Word Pronunciation in Text-to-Speech Synthesis. In S. Furui and M.M. Sondhi (eds.), Advances in Speech Technology. New York: Marcel Dekker. Pp. 791-831.
Liberman, Mark and Ivan Sag. 1974. Prosodic form and discourse function. In M.W. LaGaly, R.A. Fox, and A. Bruck (eds.), Papers from the Tenth Regional Meeting, Chicago Linguistic Society. Chicago: Chicago Linguistic Society. Pp. 416-427.
McCawley, James. 1994. Some graphotactic constraints. In W. C. Watt (ed.), Writing Systems and Cognition (Dordrecht: Kluwer), 115-27. Earlier version in University of Chicago Working Papers in Linguistics 5: 96-103.
Ostendorf, Mari and N. Veilleux. 1994. A Hierarchical Stochastic Model for Automatic Prediction of Prosodic Boundary Location. Computational Linguistics 20(1): 27-54.
Pierrehumbert, Janet. 1980. The Phonology and Phonetics of English Intonation. Ph.D. dissertation, MIT.
Pierrehumbert, Janet. 1981. Synthesizing Intonation. J. Acoust. Soc. Am. 70: 985-995.
Pike, Kenneth. 1945. The Intonation of American English. Ann Arbor: University of Michigan.
Wang, Michelle Q. and Julia Hirschberg. 1992. Automatic classification of intonational phrase boundaries. Computer Speech and Language 6: 175-196.
1010 East 59th Street
Chicago IL 60637