Hafer and Weiss 1974: Word segmentation by letter successor varieties

Information Storage and Retrieval 10: 371-385

 

Exploits Zellig Harris’s notions of successor frequency (SF) and predecessor frequency (PF) to find morpheme boundaries, and provides detailed testing of the results.

 

Typology of how SF and PF can be used:

1. Cutoff (i.e., threshold): make a cut when SF exceeds a threshold.

2. Peak and plateau: make a cut at point k when SF(k) is >=  SF(k-1) and also SF(k) >= SF(k+1).

3. Complete word: make a cut after a “prefix” that is identical to an existing word. E.g., cut “electricity” after “elect” because “elect” is a free-standing word.

4. Entropy: calculate the 1-letter entropy of alternatives after each position k (instead of the counts used by SF and PF).
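
The first three strategies can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy lexicon is invented, end-of-word is not counted as a successor, the cutoff test uses >=, and peak-and-plateau is only tested at positions that have a neighbour on both sides. All function names are hypothetical.

```python
def successor_counts(word, lexicon):
    """SF(k) for k = 1..len(word)-1: distinct letters that follow word[:k]."""
    counts = []
    for k in range(1, len(word)):
        prefix = word[:k]
        # Assumption: end-of-word is not counted as a possible successor.
        letters = {w[k] for w in lexicon if len(w) > k and w.startswith(prefix)}
        counts.append(len(letters))
    return counts


def predecessor_counts(word, lexicon):
    """PF(k): distinct letters that precede word[k:], computed on reversed strings."""
    reversed_lexicon = [w[::-1] for w in lexicon]
    return successor_counts(word[::-1], reversed_lexicon)[::-1]


def cutoff_cuts(values, threshold):
    """Strategy 1: cut at k when the count reaches the threshold."""
    return [k for k, v in enumerate(values, start=1) if v >= threshold]


def peak_plateau_cuts(values):
    """Strategy 2: cut at k when the value at k is >= both neighbours."""
    # values[k - 1] holds the value at position k; only interior k are tested.
    return [k for k in range(2, len(values))
            if values[k - 1] >= values[k - 2] and values[k - 1] >= values[k]]


def complete_word_cuts(word, lexicon):
    """Strategy 3: cut after any prefix that is itself a lexicon word."""
    return [k for k in range(1, len(word)) if word[:k] in lexicon]


# Invented toy lexicon, for illustration only.
lexicon = {"read", "reads", "reader", "reading", "real", "red", "rope",
           "going", "seeing"}
sf = successor_counts("reading", lexicon)    # [2, 2, 2, 3, 1, 1]
pf = predecessor_counts("reading", lexicon)  # [1, 1, 1, 3, 1, 1]
print(cutoff_cuts(sf, 3))                      # [4], i.e. read|ing
print(peak_plateau_cuts(sf))                   # [2, 4]
print(complete_word_cuts("reading", lexicon))  # [4]
```

On this toy data the literal peak-and-plateau test also cuts at k = 2, because a flat run of equal SF values satisfies both >= comparisons.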

 

Fifteen experiments

1. SF reaches cutoff: “completely unsatisfactory”

2. Both SF and PF reach “cutoff” (threshold). They don’t tell us what the thresholds used were! Other evidence suggests they were 5 and 17 respectively. Precision 0.894, recall 0.511.

3. Threshold exceeded by the sum of SF and PF. Precision 0.848, recall 0.565. They don’t give the threshold, again!

4. Make breaks only after a “completed word”. Precision 0.904, recall 0.318.

5. The mirror image of 4: Useless.

6. Make breaks after a completed word, OR PF reaches threshold. Precision 0.778, recall 0.711.

7. SF at “peak and plateau”. Precision 0.486, recall 0.734. This works very badly at the beginning of words.

8. Both SF and PF are at “peak and plateau”: Precision 0.787, recall 0.569.

9. Sum of SF and PF is at “peak and plateau”. Precision 0.441, recall 0.828. This makes 3 times as many cuts as method 8, and 80% of the new ones are wrong, because the sum has more peaks.

10. Make breaks after a complete word, and also where PF is at “peak or plateau”: works for FIND-ING, COMPUT-ER. Precision 0.484, recall 0.937.

11. Hybrid of methods 2 and 6: make a cut when either of the following conditions is met:

a. Left to right: completed word PF >= 5; OR

b. SF >= 2 and PF >= 17

Precision 0.91, recall 0.610.
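
As a rough sketch, the hybrid rule of method 11 can be written as a single pass over the cut positions. Condition (a) is read here as "the prefix is a complete word AND PF >= 5"; that conjunction is an assumption, since the notes leave the connective out. The function name and the input values in the example are invented for illustration and are not figures from the paper.

```python
def hybrid_cuts(sf, pf, prefix_is_word, word_pf_min=5, sf_min=2, pf_min=17):
    """Return cut positions k (1-based) that satisfy rule (a) or rule (b).

    sf, pf:          SF(k) and PF(k) for k = 1..n-1
    prefix_is_word:  True at k when the first k letters form a complete word
    """
    cuts = []
    for k, (s, p, is_word) in enumerate(zip(sf, pf, prefix_is_word), start=1):
        rule_a = is_word and p >= word_pf_min   # assumed reading of (a)
        rule_b = s >= sf_min and p >= pf_min    # condition (b)
        if rule_a or rule_b:
            cuts.append(k)
    return cuts


# Invented counts for a 5-letter word (4 possible cut positions).
sf = [3, 1, 2, 1]
pf = [2, 6, 20, 4]
prefix_is_word = [False, True, False, True]
print(hybrid_cuts(sf, pf, prefix_is_word))  # [2, 3]
```

Position 2 is cut by rule (a), position 3 by rule (b); position 4 is a complete word but its PF of 4 falls short of 5.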

     

Entropy-based techniques:

12. Left to right: completed word, PF-entropy >= 3. Precision 0.72, recall 0.728.

13. Sum of entropies greater than threshold = 4, and also make a break after a complete word (or before a complete word). Precision 0.609, recall 0.59.

14. Entropy version of 11: make a cut when:

a. Left to right completed word and predecessor entropy >= 0.8, OR

b. Right to left completed word and successor entropy >= 1.0.

Precision 0.874, recall 0.526.

15. Relaxation of 14: basically just a fudge, not interesting, I think. Cut as in 14, OR: if SF = 1 at point k, and EITHER SuccEntropy or PreEntropy >= 0.8 at k+1, cut at k+1.
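
A minimal sketch of the entropy measure and of rule 14, under several assumptions: each lexicon word counts once (the notes do not say whether entropies are computed over word types or corpus frequencies), entropy is in bits, and "right to left completed word" is read as "the remaining suffix is itself a word". The lexicon is invented and the function names are hypothetical.

```python
import math
from collections import Counter


def successor_entropy(word, k, lexicon):
    """Entropy in bits of the letter following word[:k], one vote per lexicon word."""
    prefix = word[:k]
    counts = Counter(w[k] for w in lexicon if len(w) > k and w.startswith(prefix))
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def predecessor_entropy(word, k, lexicon):
    """Entropy of the letter preceding word[k:], computed on reversed strings."""
    reversed_lexicon = [w[::-1] for w in lexicon]
    return successor_entropy(word[::-1], len(word) - k, reversed_lexicon)


def entropy_rule_cuts(word, lexicon, pred_min=0.8, succ_min=1.0):
    """Rule 14 as read here: (prefix is a word and predecessor entropy >= 0.8)
    or (suffix is a word and successor entropy >= 1.0)."""
    cuts = []
    for k in range(1, len(word)):
        rule_a = (word[:k] in lexicon
                  and predecessor_entropy(word, k, lexicon) >= pred_min)
        rule_b = (word[k:] in lexicon
                  and successor_entropy(word, k, lexicon) >= succ_min)
        if rule_a or rule_b:
            cuts.append(k)
    return cuts


# Invented toy lexicon, for illustration only.
lexicon = {"read", "reads", "reader", "reading", "real", "red", "rope",
           "going", "seeing"}
print(entropy_rule_cuts("reading", lexicon))  # [4], i.e. read|ing
```

Here "ing" is preceded by d, o and e once each, so the predecessor entropy at k = 4 is log2(3), about 1.58 bits, which clears the 0.8 threshold.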

 

 

Best: 11 and 15.
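
For reference, the precision and recall figures above can be read with the usual definitions over cut positions. The sketch below assumes that reading; the notes do not restate how the paper pooled its scores across words, and the example cuts are invented.

```python
def precision_recall(predicted, gold):
    """Score predicted cuts against gold cuts.

    Each cut is a (word, position) pair, so cuts from many words can be pooled.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented example: two proposed cuts in "reading", one of which is right.
predicted = [("reading", 2), ("reading", 4)]
gold = [("reading", 4)]
print(precision_recall(predicted, gold))  # (0.5, 1.0)
```

A method that cuts freely, like method 10, trades precision for recall in exactly this way.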