Exercise 3: Making sure we have a probability
distribution for all words from an alphabet
Let’s suppose that we have set up a probability distribution for a set L of symbols that we call our alphabet. (I will use “26” as a shorthand for the number of symbols in L.) So far, in class, we have used the empirical frequency of each symbol as the probability we assign to it, and we can continue to do that now.
We refer to the set of all strings of symbols in L as L+. I mean for us to consider only non-null strings: the strings must not be of length 0. (If we allowed ourselves to consider strings of length 0, then we would use the notation L*; this is also called the Kleene star).
The set of all strings of length n is written Ln.
If we assign to a string S the probability
, as we’ve been doing, then this creates a distribution over
all strings of length |S|. Let’s call this the empirical unigram model
for Ln (it deserves a name, at least).
Q1. Show that this is true for all strings of length 1 (=L1) and all strings of length 2 (=L2).
Q2. So – as we discussed in class – we cannot use
as a probability distribution over L+ or L*. In a few words –
why is this, exactly?
We want to create a probability distribution over L+; we
want to divide the total probability mass up, and spread it over all strings
from L+. The easiest way to do this is find a way to a distribution – let us call it { di }, with the
properties that di > 0 and
– and then d1 is the probability mass spread over all
of L1 (which has 26
members) , d2 is the probability to be spread over all of L2
(which has 262 members), and so on. We’ll call this a length
distribution, and there are many different distributions which we could
choose for our length distribution.
Then we’d be done, because we’d have a distribution for L+:
for any string S in L+, we’d assign it a probability which is the product of
(1) the probability assigned by its empirical unigram model for S’s particular
length l, times (2) the value of dl from the length
distribution.
What should the length distribution look like? In order for an infinite set (of non-negative numbers) to sum to 1.0, the values must go to zero as we go out to infinity, but we pretty much knew that would be so anyway. We discussed the fact in class that
(1) 0.9 + 0.09 + 0.009 + … + 0.9 * (1/10)n …
sums to 1.0, just by seeing that the sums look like: 0.9; then 0.99; then 0.999; and so on. In fact, we could take any number r, and find that rn, where n goes from 1 to infinity, sums to a finite number. Let’s remember how that works. Let’s call that sum “Z”.
If Z = r1 + r2 + r3 + …, then
rZ = r2 + r3 + …
Let’s subtract the second equation from the first, and so (1-r) Z = r; hence we can conclude
(2) Z = r/(1-r).
If r = 0.1, then Z = 0.1/0.9 = 1/9. So if we multiply each term by 9, as we did in (1), the infinite sum will be 9 times 1/9, or 1, which is just what we expected. If r is strictly less than 1, then this sum is well-defined and finite (you can see this wouldn’t be true if r =1, because what would the denominator be in (2)?). Once we pick our value for r, then we get a distribution by multiplying each value in r1, r2, r3, … by 1/Z, -- you see that this will give us a distribution, right? – and this is it:
(3) Length distribution formula: r/Z, r2/Z, r3/Z, …
We will make our length distribution take this form (which is called a geometric distribution), and so our only concern is to pick r in a reasonable way. The smaller the value of r, the more our length distribution will prefer short strings and disprefer long strings.
If we rewrite (3) without Z, we get:
(4) Length distribution formula: 1-r, r(1-r), r2(1-r), …
for length 1, 2, 3, etc. Plug Z into (3) to see that.
We would like to understand our length distribution formula in terms of the probability of a final word-boundary #. Imagine you select a length for a string with the following procedure: you have a coin which you can flip, and it will fall “heads” with probability r, and “tails” with probability (1-r). You will flip it until it falls “tails”. This might be on the first turn, or the second, or… the nth turn.
Q3: Can you show that if you apply the following procedure, the probability that you assign to each string will be a distribution over the entire set of strings L+ ?
i. Select a letter l from L, with probability pr(l).
ii. Flip the coin described above. If it falls tails, stop; otherwise, go to step 1.
What is the probability of generating the string “cab”? More generally, what is the probability for any string in L+?
Optional: can you figure out the average length of a word, if we build from this distribution, for a given probability r for the coin? (Hint: to put this another way, what’s the relationship between the probability of a word-boundary and the average length of a word?)
Q4: Here are the lengths of the words in the Brown Corpus: there are 39,419 occurrences of words that are 1 character in length, 173,343 that are 2 characters long, and so on. You might want to cut and paste this to a spreadsheet. Can you pick a good value for r for English? How would you choose a good value? There isn’t exactly one answer to this question. It’s more a matter of finding a reasonable value, given the data. Extra, optional thought-question: would you prefer to find a better distribution than the geometric distribution? Any ideas on what might work better? Why?
Counts of word-length, starting at 1: (I’ve left one number per line so you can easily cut and paste to a spreadsheet):
39419
173343
208178
148184
107822
88355
79816
61166
45067
31536
19133
11501
6444
3477
1494
721
350
250
136
66
56
28
14
7
11
3
7
1
1
1
2
4
1