Exercise 3: Making sure we have a probability distribution for all words from an alphabet

 

Let’s suppose that we have set up a probability distribution for a set L of symbols that we call our alphabet. (I will use “26” as a shorthand for the number of symbols in L.) So far, in class, we have used the empirical frequency of each symbol as the probability we assign to it, and we can continue to do that now.

 

We refer to the set of all strings of symbols in L as L+. I mean for us to consider only non-null strings: the strings must not be of length 0. (If we allowed ourselves to consider strings of length 0, then we would use the notation L*; this is also called the Kleene star).

 

The set of all strings of length n is written Ln. If we assign to a string S the probability , as we’ve been doing, then this creates a distribution over all strings of length |S|. Let’s call this the empirical unigram model for Ln (it deserves a name, at least).

 

 

Q1. Show that this is true for all strings of length 1 (=L1) and all strings of length 2 (=L2).

Q2. So – as we discussed in class – we cannot use as a probability distribution over L+ or L*. In a few words – why is this, exactly?

 

 

 

We want to create a probability distribution over L+; we want to divide the total probability mass up, and spread it over all strings from L+. The easiest way to do this is find a way to a distribution  – let us call it { di }, with the properties that di > 0 and – and then d1 is the probability mass spread over all  of L1 (which has 26 members) , d2 is the probability to be spread over all of L2 (which has 262 members), and so on. We’ll call this a length distribution, and there are many different distributions which we could choose for our length distribution.

 

Then we’d be done, because we’d have a distribution for L+: for any string S in L+, we’d assign it a probability which is the product of (1) the probability assigned by its empirical unigram model for S’s particular length l, times (2) the value of dl from the length distribution.

 

What should the length distribution look like? In order for an infinite set (of non-negative numbers) to sum to 1.0, the values must go to zero as we go out to infinity, but we pretty much knew that would be so anyway. We discussed the fact in class that

 

(1)                    0.9 + 0.09 + 0.009  + … + 0.9 * (1/10)n

 

sums to 1.0, just by seeing that the sums look like: 0.9; then 0.99; then 0.999; and so on. In fact, we could take any number r, and find that rn, where n goes from 1 to infinity, sums to a finite number. Let’s remember how that works. Let’s call that sum  Z”.

            If  Z = r1 + r2 + r3 + …, then

                rZ =       r2 + r3 + …

 

Let’s subtract the second equation from the first, and so (1-r) Z = r; hence we can conclude

 

(2) Z = r/(1-r).

 

If r = 0.1, then Z = 0.1/0.9 = 1/9. So if we multiply each term by 9, as we did in (1), the infinite sum will be 9 times 1/9, or 1, which is just what we expected. If r is strictly less than 1, then this sum is well-defined and finite (you can see this wouldn’t be true if r =1, because what would the denominator be in (2)?). Once we pick our value for r, then we get a distribution by multiplying each value in r1, r2, r3, … by 1/Z, -- you see that this will give us a distribution, right? – and this is it:

 

(3) Length distribution formula: r/Z,   r2/Z,   r3/Z, …

 

We will make our length distribution take this form (which is called a geometric distribution), and so our only concern is to pick r in a reasonable way. The smaller the value of r, the more our length distribution will prefer short strings and disprefer long strings.

 

If we rewrite (3) without Z, we get:

 

(4) Length distribution formula: 1-r,   r(1-r),   r2(1-r), …

 

for length 1, 2, 3, etc. Plug Z into (3) to see that.

 

We would like to understand our length distribution formula in terms of the probability of a final word-boundary #. Imagine you select a length for a string with the following procedure: you have a coin which you can flip, and it will fall “heads” with probability r, and “tails” with probability (1-r). You will flip it until it falls “tails”. This might be on the first turn, or the second, or… the nth turn.

 

Q3: Can you show that if you apply the following procedure, the probability that you assign to each string will be a distribution over the entire set of strings L+ ?

            i. Select a letter l from L, with probability pr(l).

            ii. Flip the coin described above. If it falls tails, stop; otherwise, go to step 1.

 

What is the probability of generating the string “cab”? More generally, what is the probability for any string in L+?

 

Optional: can you figure out the average length of a word, if we build from this distribution, for a given probability r for the coin? (Hint: to put this another way, what’s the relationship between the probability of a word-boundary and the average length of a word?)

 

Q4: Here are the lengths of the words in the Brown Corpus: there are 39,419 occurrences of words that are 1 character in length, 173,343 that are 2 characters long, and so on. You might want to cut and paste this to a spreadsheet. Can you pick a good value for r for English? How would you choose a good value? There isn’t exactly one answer to this question. It’s more a matter of finding a reasonable value, given the data. Extra, optional thought-question: would you prefer to find a better distribution than the geometric distribution? Any ideas on what might work better? Why?

 

Counts of word-length, starting at 1: (I’ve left one number per line so you can easily cut and paste to a spreadsheet):

 

39419

173343

208178

148184

107822

88355

79816

61166

45067

31536

19133

11501

6444

3477

1494

721

350

250

136

66

56

28

14

7

11

3

7

1

1

1

2

4

1