=

 

 

 

 

Taking logs:

 

or

 

 

According to MS Word, the file contained 1,115 t’s, 640 h’s, and 458 th’s, out of a total of 14,062 characters (counting spaces).

 

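So t2 is 657 (that is, 1,115 – 458) and h2 is 182 (that is, 640 – 458). The total count shrinks in the same way: N2 is 14,062 – 458 = 13,604.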

 

By the table below, we save slightly more than 1272 bits in the second model, the one that includes th; the cost of the text in bits (its negative log probability) decreases by that amount. The treatment of the t’s and the h’s gets somewhat worse, but the treatment of the th’s is much better, and the other letters are treated better too, because their relative frequency goes up when the total count drops from N1 to N2.
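It may help to see the four pieces written out. The following is a reconstruction from the columns of the table, not a formula quoted from these notes. It assumes both models are simple unigram models, writes pr1 and pr2 for the probabilities in the first and second model, and uses M (a symbol introduced here) for the number of unaffected letters:

```latex
\log_2 \frac{pr_1(S)}{pr_2(S)}
  = t_2 \log_2 \frac{pr_1(t)}{pr_2(t)}
  + h_2 \log_2 \frac{pr_1(h)}{pr_2(h)}
  + [th] \log_2 \frac{pr_1(t)\, pr_1(h)}{pr_2(th)}
  - M \log_2 \frac{N_1}{N_2}
```

A negative total means that the second model assigns the text S the higher probability.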

 

The last term, the one involving all the other letters, has an interesting side to it. It will generally be of the form: the number of unaffected letters times the log of a number a little bit more or less than 1.0. But there is a good approximation for the natural log of (1+x) when the absolute value of x is small: it is approximately x. (Since we’re using base 2 logs, we must divide by ln(2) to use this formula: log2(1+x) is approximately x/ln(2).*) And x, here, is [th] divided by N2, because N1/N2 = 1 + [th]/N2. So a good approximation for the size of the last term is: the number of unaffected letters * (the number of th’s / N2) divided by the natural log of 2. This gives us about 597.8, which is not far off from the 587.9 in the table.
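Here is a small check of that approximation, as a sketch rather than part of the original notes. It uses only the counts given in the text; the variable names are chosen here for illustration.

```python
from math import log, log2

# Counts taken from the text.
th = 458         # number of th's
n2 = 13604       # total symbols in the second model (N2)
others = 12307   # letters unaffected by the change

x = th / n2                     # N1/N2 = 1 + x
exact = others * log2(1 + x)    # size of the last term, in bits
approx = others * x / log(2)    # ln(1 + x) ~ x, converted to base 2

print(f"exact  = {exact:.1f} bits")
print(f"approx = {approx:.1f} bits")
print(f"relative error = {(approx - exact) / exact:.2%}")
```

The approximation comes out a little too large, as one would expect from replacing ln(1+x) by x for a positive x.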

 

*The formula makes sense if you think about the shape of the curve y = ln(x) where it crosses the x-axis, at x = 1: its slope (its first derivative) there is 1, so close to that point the curve is approximated by the line y = x – 1.
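The same point can be put in series form. This is the standard expansion of the natural log, added here as a supplement to the footnote:

```latex
\ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \cdots \quad (|x| < 1),
\qquad
\log_2(1+x) = \frac{\ln(1+x)}{\ln 2} \approx \frac{x}{\ln 2}
```

Since the first term we drop is negative, the straight-line approximation runs slightly high for small positive x.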

 

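State 1

| [t] | [h] | N1 | Other letters | pr(t) | pr(h) | pr(t)*pr(h) |
|---|---|---|---|---|---|---|
| 1115 | 640 | 14062 | 12307 | 0.079292 | 0.045513 | 0.003609 |

State 2

| [t] | [h] | [th] | N2 | Other letters | pr(t) | pr(h) | pr(th) |
|---|---|---|---|---|---|---|---|
| 657 | 182 | 458 | 13604 | 12307 | 0.048295 | 0.013378 | 0.033667 |

Diff between states: 458

Comparison of the two states (State 1 over State 2; logs are base 2)

| | pr(t) | pr(h) | pr(t)*pr(h) over pr(th) | N's (other letters) |
|---|---|---|---|---|
| Ratio of probs between states (last column: ratio of N's) | 1.641833 | 3.401951 | 0.107192 | 1.033667 |
| Log of ratio (last column: log ratio times -1) | 0.715308 | 1.766363 | -3.22173 | -0.04777 |
| Weighted diff (weighted by t2, h2, [th], and other letters) | 469.9571 | 321.478 | -1475.55 | -587.916 |

Sum of these 4 numbers: -1272.03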
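The figures in the table can be recomputed from the four counts given in the text. What follows is a sketch under the assumption that both models are plain unigram models; the variable names are chosen here and do not come from the original spreadsheet.

```python
from math import log2

# Counts from the text: t's, h's, th's, and total characters.
t1, h1, th, n1 = 1115, 640, 458, 14062

# Second model: every th is a single symbol.
t2, h2, n2 = t1 - th, h1 - th, n1 - th
others = n1 - t1 - h1   # letters untouched by the change

# Unigram probabilities under each model.
p1_t, p1_h = t1 / n1, h1 / n1
p2_t, p2_h, p2_th = t2 / n2, h2 / n2, th / n2

# The four weighted log ratios (State 1 over State 2), in bits.
terms = {
    "t": t2 * log2(p1_t / p2_t),
    "h": h2 * log2(p1_h / p2_h),
    "th": th * log2(p1_t * p1_h / p2_th),
    "other": -others * log2(n1 / n2),
}
total = sum(terms.values())

for name, value in terms.items():
    print(f"{name:>5}: {value:10.2f}")
print(f"  sum: {total:10.2f}")
```

The positive entries are what the separate t's and h's lose in the second model; the negative entries are what the th's and the remaining letters gain.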