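Writing pr1 and pr2 for the probabilities under state 1 (no th) and state 2 (with th):

pr1(file) / pr2(file) = [pr1(t)/pr2(t)]^t2 * [pr1(h)/pr2(h)]^h2 * [pr1(t)*pr1(h)/pr2(th)]^[th] * [N2/N1]^(other letters)

Taking logs:

log pr1(file) - log pr2(file) = t2 * log[pr1(t)/pr2(t)] + h2 * log[pr1(h)/pr2(h)] + [th] * log[pr1(t)*pr1(h)/pr2(th)] + (other letters) * log[N2/N1]

or

log pr1(file) - log pr2(file) = t2 * log[pr1(t)/pr2(t)] + h2 * log[pr1(h)/pr2(h)] + [th] * log[pr1(t)*pr1(h)/pr2(th)] - (other letters) * log[N1/N2]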
In the file, there were 1,115 t’s, 640 h’s, and 458 th’s, out of a total of 14,062 characters (counting spaces), according to MS Word.
Once the 458 th’s are counted as units of their own, the remaining counts are t2 = 1,115 - 458 = 657 and h2 = 640 - 458 = 182.
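The bookkeeping behind these counts can be sketched in a few lines of Python. This is only an illustration: the function names `count_state1` and `derive_state2` are made up here, and lowercasing the text before counting is an assumption about how the original counts were taken.

```python
def count_state1(text: str) -> tuple:
    """Return (t's, h's, th's, total characters) for a piece of text.

    Assumption: counting is case-insensitive and spaces count as characters.
    """
    s = text.lower()
    return s.count("t"), s.count("h"), s.count("th"), len(s)


def derive_state2(t1: int, h1: int, th: int, n1: int) -> dict:
    """Derive the state-2 counts once each th is treated as a single unit."""
    return {
        "t2": t1 - th,          # t's that are not part of a th
        "h2": h1 - th,          # h's that are not part of a th
        "N2": n1 - th,          # each th now occupies one slot instead of two
        "other": n1 - t1 - h1,  # letters untouched by the change
    }


# The counts quoted in the text.
print(derive_state2(1115, 640, 458, 14062))
# {'t2': 657, 'h2': 182, 'N2': 13604, 'other': 12307}
```

Note that the four state-2 categories add up to N2 (657 + 182 + 458 + 12307 = 13604), which is a quick sanity check on the subtraction.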
By the table below, we save slightly more than 1,272 bits in
the second model, the one that includes th;
the negative log probability of the file (its cost in bits) decreases by that amount.
The treatment of the remaining t’s and h’s gets somewhat worse, but the treatment of the th’s
is much better, and the other letters are treated better too, because their
relative frequency goes up when the total count drops from N1 to N2.
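As a check on the arithmetic, the four contributions can be recomputed directly from the counts given above. This is a minimal sketch; the function name `log_ratio_terms` is hypothetical, and the sign convention (log base 2 of the state-1 probability over the state-2 probability, so that a negative total favors state 2) is chosen to match the table.

```python
from math import log2


def log_ratio_terms(t1: int, h1: int, th: int, n1: int) -> dict:
    """Per-category contributions to log2(pr1(file) / pr2(file))."""
    t2, h2, n2 = t1 - th, h1 - th, n1 - th
    other = n1 - t1 - h1
    pr1_t, pr1_h = t1 / n1, h1 / n1
    pr2_t, pr2_h, pr2_th = t2 / n2, h2 / n2, th / n2
    return {
        "t": t2 * log2(pr1_t / pr2_t),             # remaining t's: worse in state 2
        "h": h2 * log2(pr1_h / pr2_h),             # remaining h's: worse in state 2
        "th": th * log2(pr1_t * pr1_h / pr2_th),   # th's: much better in state 2
        "other": other * log2(n2 / n1),            # all other letters: slightly better
    }


terms = log_ratio_terms(1115, 640, 458, 14062)
for name, value in terms.items():
    print(f"{name:>5}: {value:10.2f}")
print(f"total: {sum(terms.values()):10.2f}")
```

The two positive terms are the price paid on the leftover t’s and h’s; the two negative terms are the gains on the th’s and on everything else.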
The last term, the one involving all the other letters, has
an interesting side to it. It will generally be of the form: the
number of unaffected letters times the log of a number a little bit more or
less than 1.0. But there is a good approximation for the natural log ln(1+x)
when the absolute value of x is small: it is approximately x. (Since we’re
using base 2 logs, we must divide this approximation by ln(2) to use the
formula*.) And x, here, is [th]
divided by N2, since N1/N2 = 1 + [th]/N2. So a good approximation for the size of the last term is: the
number of unaffected letters * (the number of th’s
/ N2),
divided by the natural log of 2. This gives us about 597.8, which is not far off from
the 587.9 in the table.
*The formula makes sense if you think about the shape of the curve y = ln(x) when it passes through the x-axis: its first derivative is 1 there (think of its slope there), so it is approximated by the line y = x – 1 close to that point.
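The quality of this approximation is easy to check numerically. The sketch below compares the exact size of the last term with the approximation from ln(1+x) ≈ x; the function names are hypothetical, and the counts are the ones quoted in the text.

```python
from math import log, log2


def last_term_exact(other: int, n1: int, n2: int) -> float:
    """Exact size of the last term: other letters times log2(N1/N2)."""
    return other * log2(n1 / n2)


def last_term_approx(other: int, th: int, n2: int) -> float:
    """Approximate size using ln(1 + x) ~ x with x = th / N2, converted to base 2."""
    x = th / n2
    return other * x / log(2)


other, th, n1, n2 = 12307, 458, 14062, 13604
exact = last_term_exact(other, n1, n2)
approx = last_term_approx(other, th, n2)
print(f"exact:  {exact:.1f} bits")
print(f"approx: {approx:.1f} bits")
print(f"relative error: {(approx - exact) / exact:.2%}")
```

The approximation overshoots slightly because the line y = x - 1 lies above the curve y = ln(x) everywhere except at the point of tangency, so replacing ln(1+x) by x always overestimates it a little.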
State 1

| [t] | [h] | N1 | Other letters | pr(t) | pr(h) | pr(t)*pr(h) |
|---|---|---|---|---|---|---|
| 1115 | 640 | 14062 | 12307 | 0.079292 | 0.045513 | 0.003609 |

State 2

| [t] | [h] | [th] | N2 | Other letters | pr(t) | pr(h) | pr(th) |
|---|---|---|---|---|---|---|---|
| 657 | 182 | 458 | 13604 | 12307 | 0.048295 | 0.013378 | 0.033667 |

Diff between states (N1 - N2): 458

| | t | h | pr(t)*pr(h) vs. pr(th) | Other letters |
|---|---|---|---|---|
| Ratio of probs between states (state 1 / state 2) | 1.641833 | 3.401951 | 0.107192 | 1.033667 (ratio of N's) |
| Log of ratio | 0.715308 | 1.766363 | -3.22173 | -0.04777 (log ratio times -1) |
| Weighted diff (weighted by t2, h2, [th], and other letters) | 469.9571 | 321.478 | -1475.55 | -587.916 |

Sum of these 4 numbers: -1272.03