Hafer and Weiss 1974: Word segmentation by letter successor varieties. Information Storage and Retrieval 10: 371-385.
Exploits Zellig Harris’s notions of successor frequency (SF) and predecessor frequency (PF) to find morpheme boundaries, and provides detailed testing of the results.
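To make the two counts concrete, here is a minimal Python sketch. It assumes SF at position k is the number of distinct letters that follow the word's first k letters anywhere in the corpus, and PF is the number of distinct letters that precede the remainder of the word. The function names and the toy word list are illustrative and do not come from the paper. The sketch also assumes that the end of a word is not counted as a successor.

```python
from collections import defaultdict

def build_tables(words):
    """Map each prefix to the letters that follow it, and each suffix to the letters that precede it."""
    succ = defaultdict(set)
    pred = defaultdict(set)
    for w in set(words):                 # count over word types, not tokens (assumption)
        for k in range(1, len(w)):
            succ[w[:k]].add(w[k])        # letter following the prefix w[:k]
            pred[w[k:]].add(w[k - 1])    # letter preceding the suffix w[k:]
    return succ, pred

def sf_pf(word, succ, pred):
    """Return (k, SF(k), PF(k)) for every internal position k of the word."""
    return [
        (k, len(succ.get(word[:k], ())), len(pred.get(word[k:], ())))
        for k in range(1, len(word))
    ]

# Toy corpus (hypothetical): SF rises to 3 after "walk" in "walking".
corpus = ["walk", "walks", "walked", "walking", "talk", "talks", "talked", "talking"]
succ, pred = build_tables(corpus)
for k, sf, pf in sf_pf("walking", succ, pred):
    print(k, "walking"[:k], sf, pf)
```

On such a tiny corpus the counts are far smaller than the thresholds mentioned in these notes; a realistic word list is needed before any of the cut rules become meaningful.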
Typology of how SF and PF can be used:
1. Cutoff (i.e., threshold): make a cut when SF exceeds a threshold.
2. Peak and plateau: make a cut at point k when SF(k) >= SF(k-1) and also SF(k) >= SF(k+1).
3. Complete word: make a cut after a “prefix” that is identical to an existing word. E.g., cut “electricity” after “elect” because “elect” is a free-standing word.
4. Entropy: calculate the 1-letter entropy of the alternatives after each position k (instead of the counts used by SF and PF).
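The four strategies above can be sketched in Python as follows, reading the definitions literally. Several details are assumptions not given in these notes: SF values are passed as a dict from position k to count, positions that lack a left or right neighbour are skipped by the peak-and-plateau test, entropy uses log base 2, and continuations are counted over word types. All function names are hypothetical.

```python
import math
from collections import Counter

def cutoff_cuts(sf, threshold):
    """Strategy 1: cut at k when SF(k) exceeds the threshold."""
    return sorted(k for k, v in sf.items() if v > threshold)

def peak_plateau_cuts(sf):
    """Strategy 2: cut at k when SF(k) >= SF(k-1) and SF(k) >= SF(k+1)."""
    return sorted(
        k for k in sf
        if k - 1 in sf and k + 1 in sf          # skip edge positions (assumption)
        and sf[k] >= sf[k - 1] and sf[k] >= sf[k + 1]
    )

def complete_word_cuts(word, lexicon):
    """Strategy 3: cut after any proper prefix that is itself a word in the lexicon."""
    return [k for k in range(1, len(word)) if word[:k] in lexicon]

def successor_entropy(prefix, words):
    """Strategy 4: entropy (in bits) of the letter that follows the prefix."""
    counts = Counter(
        w[len(prefix)] for w in set(words)
        if w.startswith(prefix) and len(w) > len(prefix)
    )
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c / total * math.log2(total / c) for c in counts.values())

# Hypothetical SF profile for a 7-letter word with a clear peak at k = 4.
sf = {1: 1, 2: 1, 3: 1, 4: 3, 5: 1, 6: 1}
print(cutoff_cuts(sf, 2))
print(peak_plateau_cuts(sf))
print(complete_word_cuts("electricity", {"elect", "electric"}))
```

Note that under this literal reading a flat stretch of equal SF values also satisfies the peak-and-plateau condition, which is one way such a rule can over-segment.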
Fifteen experiments
1. SF reaches cutoff: “completely unsatisfactory”.
2. Both SF and PF reach “cutoff” (threshold). They don’t tell us what the thresholds were! Other evidence suggests they were 5 and 17 respectively. Precision 0.894, recall 0.511.
3. Threshold exceeded by the sum of SF and PF. Precision 0.848, recall 0.565. They don’t give the threshold, again!
4. Make breaks only after a “completed word”. Precision 0.904, recall 0.318.
5. The mirror image of 4: useless.
6. Make breaks after a completed word, OR where PF reaches threshold. Precision 0.778, recall 0.711.
7. SF at “peak and plateau”. Precision 0.486, recall 0.734. This works very badly at the beginning of words.
8. Both SF and PF are at “peak and plateau”. Precision 0.787, recall 0.569.
9. Sum of SF and PF is at “peak and plateau”. Precision 0.441, recall 0.828. This makes 3 times as many cuts as method 8, and 80% of the new ones are wrong, because the sum has more peaks.
10. Make breaks after a complete word, and also where PF is at “peak or plateau”: works for FIND-ING, COMPUT-ER. Precision 0.484, recall 0.937.
11. Hybrid of methods 2 and 6: make a cut when either of the following conditions is met:
a. Left to right: completed word, PF >= 5; OR
b. SF >= 2 and PF >= 17.
Precision 0.91, recall 0.610.
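As a rough illustration of method 11 and of how the precision and recall figures above could be computed, here is a Python sketch. It makes two assumptions that these notes do not confirm: condition (a) is read as "the prefix is a complete word AND PF >= 5" (by analogy with method 14), and precision and recall have their usual meanings (correct cuts divided by cuts made, and correct cuts divided by true boundaries). The function names and the sample values are hypothetical.

```python
def method11_cuts(word, sf, pf, lexicon):
    """Cut at k if (a) word[:k] is a complete word and PF(k) >= 5,
    or (b) SF(k) >= 2 and PF(k) >= 17.
    Reading (a) as a conjunction is an assumption."""
    cuts = []
    for k in range(1, len(word)):
        rule_a = word[:k] in lexicon and pf[k] >= 5
        rule_b = sf[k] >= 2 and pf[k] >= 17
        if rule_a or rule_b:
            cuts.append(k)
    return cuts

def precision_recall(predicted, gold):
    """Score a set of predicted cuts against the true boundaries.
    Items can be positions for one word, or (word, position) pairs for a test set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical SF/PF profiles for "reading"; not taken from the paper.
sf = {1: 3, 2: 2, 3: 1, 4: 4, 5: 1, 6: 1}
pf = {1: 2, 2: 1, 3: 1, 4: 20, 5: 6, 6: 9}
cuts = method11_cuts("reading", sf, pf, {"read"})
print(cuts, precision_recall(cuts, [4]))
```

Scoring (word, position) pairs pooled over the whole test set, rather than averaging per-word scores, is one reasonable design choice; the notes do not say which the authors used.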
Entropy-based techniques:
12. Left to right: completed word, PF-entropy >= 3. Precision 0.72, recall 0.728.
13. Sum of entropies greater than threshold = 4, and also make a break after a complete word (or before a complete word). Precision 0.609, recall 0.59.
14. Entropy version of 11: make a cut when:
a. Left to right completed word and predecessor entropy >= 0.8, OR
b. Right to left completed word and successor entropy >= 1.0.
Precision 0.874, recall 0.526.
15. Relaxation of 14: basically just a fudge, not interesting, I think. Cut as in 14, OR: if SF = 1 at point k, and EITHER SuccEntropy or PreEntropy >= 0.8 at k+1, cut at k+1.
Best: 11 and 15.
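Methods 14 and 15 can be sketched in Python as follows, using the thresholds quoted above. The sketch assumes "left to right completed word" means the prefix word[:k] is in the lexicon and "right to left completed word" means the suffix word[k:] is in the lexicon. The per-position SF and entropy values are taken as precomputed dicts, and all names and sample numbers are hypothetical.

```python
def method14_cuts(word, succ_h, pred_h, lexicon):
    """Cut at k if (a) word[:k] is a word and predecessor entropy >= 0.8,
    or (b) word[k:] is a word and successor entropy >= 1.0."""
    cuts = []
    for k in range(1, len(word)):
        left = word[:k] in lexicon and pred_h[k] >= 0.8
        right = word[k:] in lexicon and succ_h[k] >= 1.0
        if left or right:
            cuts.append(k)
    return cuts

def method15_cuts(word, sf, succ_h, pred_h, lexicon):
    """Method 14 plus the relaxation: if SF(k) = 1 and either entropy
    is >= 0.8 at k+1, also cut at k+1."""
    cuts = set(method14_cuts(word, succ_h, pred_h, lexicon))
    for k in range(1, len(word) - 1):    # k+1 must still be an internal position
        if sf[k] == 1 and (succ_h[k + 1] >= 0.8 or pred_h[k + 1] >= 0.8):
            cuts.add(k + 1)
    return sorted(cuts)

# Hypothetical profiles for "walked"; with an empty lexicon only the relaxation fires.
sf = {1: 1, 2: 1, 3: 1, 4: 3, 5: 1}
succ_h = {1: 0.0, 2: 0.0, 3: 0.0, 4: 1.58, 5: 0.0}
pred_h = {1: 1.0, 2: 0.0, 3: 0.0, 4: 0.9, 5: 0.0}
print(method14_cuts("walked", succ_h, pred_h, set()))
print(method15_cuts("walked", sf, succ_h, pred_h, set()))
```

The example shows how the relaxation recovers a cut that method 14 misses when the stem is absent from the lexicon, which may be part of why method 15 scores well despite looking like a fudge.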