Goldsmith:
Signature-based discovery of
morphological structure
0. What is the problem? To
find the morphological ( = word-internal ) structure
of a language, given a corpus of words
from the language in the range of 5K words to 1,000,000 words.
English: laugh, laughs,
laughing, laughed, boy, boys, girl, girls, government, governs.
Swahili: nilipenda,
nitapenda, ninapenda, unapenda, anapenda, ninakula, ninasoma, ninasema,
ninakisema.
1. Let’s break the problem into
governing principle (MDL and explicit
model of morphology) and heuristics
(path for searching for solutions in a search space).
2. A heuristic first: Zellig
Harris’s successor frequency and predecessor frequency, defined on a
large set of words.
The SF is defined for any
string in a corpus C that begins at least one word (i.e., is a prefix in the computational, not the
linguistic, sense of at least one word).
SF(string S) is defined as the number of distinct letters
lk such
that Slk is a
prefix of some word in corpus C.
Note that all words end with
a designated symbol (often written “#”, but sometimes with a space; and all
punctuation are treated as instances of “#”).
3. Example:
SF( gover ) = 1
SF ( govern ) = 6 because of govern, government, governed, governor,
governs, governing in some corpus C.
SF (
governm ) = 1.
Harris had the insight that
SF measurement was a clue to morpheme-boundaries.
4. Counting SF is transparent
if we store our data in a trie: a
standard data structure consisting of a directed acyclic graph where:
each node represents a string;
each node’s string is a prefix of the
string of any of its daughters’ nodes; and
no nodes have exactly one daughter (that
is, each node is either terminal or it has more than one daughter: each node is
either terminal or branching).
Look at this in Linguistica
5. You can do this backward just as well as forward.
6. Crude evaluation of this
approach:
a. It works well at the end of words (when it works) but very
poorly within the first 4 letters.
b. It works poorly or not at all when the stem occurs very few
times (e.g., 1 time).
c. It works poorly (wrongly) when all the suffixes of a stem
begin with the same letter(s) (mailman, mailmen; constructive, construction;
etc.)
7. Quantitative evaluation of
this approach: Hafer and Weiss.
8. Still, it can be used as
the beginning of a good bootstrapping algorithm for a language like English.
9. The bootstrap that we use
takes the peak approach (not peak and
plateau), and then adds a much more stringent condition, based on signatures.
10. At a simple level, what
the discovery of morphological structure is is the discovery of sub-word pieces
that (a) recur; and (b) recur as parts of patterns (groups) or recurring
pieces.
we
find all of the cartesian products in our corpus:
11. Difference between first
column and the second: the elements in the first occur with exactly the pieces in
the second; whereas the pieces in the second may reccur in other patterns, such
as:
[1] 

We will call these patterns signatures (a bit ambiguously: sometimes
we refer to the set of affixes as the signature, sometimes to the set of
affixes along with the corresponding stems).
12. The signatures allow for
a compact representation of words: when words can be broken into appropriate
pieces, they allow for the pieces to be stated just once. [1] above requires 18 letters, plus a certain amount of
structure, while a list of the words would look like this: barked, barking,
shouted, shouting, yelled, yelling, which would require 41 letters.
In general, if there are T
stems, then you save T-1 “copies” of all of the suffixes, and if there are F
distinct suffixes, you save F-1 “copies” of all of the stems. That can be a lot.
13. We know how to build
lists using pointers, and how to evaluate the amount of information contained
in them: the amount of information is equal to the sum of the (inverse) log
frequencies of the items to which the pointers point, plus some way of showing
that the list is finished, such as heading the list with a count of how many
items are to follow (that takes about log N bits to write).
14. How do we express the structure as in (1)? These signatures
are just a list of 2 lists. The first is a list of pointers to stems, and the
second is a list of pointers to suffixes.
So the size of a signature is:
(1) log
( # stems ) + log ( # suffixes ) +
(2) sum
over all stems of -log freq (stemi)
+
(3) sum
over all suffixes of –log freq (suffixi)
15. And the morphology
consists of a collection of signatures, a collection of suffixes, and a
collection of stems. Each of them has a length equal to:
(1) log
( number of items in the collection )
(2) sum
over all members of the size of the item.
16. In the case of
signatures, the size of the signature
is given in (14). In the case of stems and suffixes, the size is more concrete:
Stems:
acquire
acquisi
acquitt
acre
etc.
How much information does it take to express each of these?
17. If we were de Marcken, we
might say that the information in acquire
is equal to the sum of the (inverse) log frequencies (“information content”) of
each of the letters.
18. But the information
content is really less than that. Why?
19. If we place phonological
information in our list of stems and suffixes, we are (in effect) saying that
phonological attraction and repulsion are forces that are at work inside of morphemes and are blocked from
having an effect across morpheme
boundaries and across word boundaries.
This is basically correct (though it’s not the whole story).
20. We will have an MDL model
once we can assign a probability to a word in the corpus; we already have a
computation of the length of the
morphology in bits. The log of the corpus’ (inverse) probability will be
measured in bits as well.
Prob (word = stem + suffix,
in signature s) =
prob (s) * prob (stem| s) * prob (suffix|
s)
How does this predicted
probability correspond to observed frequencies?
21. Now we have an MDL model.
If MDL as a theory is correct, in some sense, then if we
modify the model so as to decrease the MDL (sum of two terms…), then the
resulting changes will better accord with the correct analysis of the data.
Thus we need to find ways of
modifying the grammar, and we test to see of they
decrease the DL (description length).
22. We take our initial cuts,
then, from the Zellig Harris successor frequency peaks, and use them to create signatures: to each stem, we attach all
of the suffixes that it appears with, and then we impose a series of conditions
on these signatures:
** a
suffix must appear 3 times;
** a
signature must either have 25 stems
associated with it; OR
** it
must not be composed primarily of 1-letter suffixes: they are very unreliable.Why?
It is easy to find lots of 1-letter suffixes with lots of short stems.
Implementation: NULL counts as “1”; for all other suffixes, add up 1 less than
their length. E.g.: NULL.ed -> 2; NULL.s -> 1; b.s.->0.
Threshold: 2.
**
We completely throw out any
SF-induced word-breaks that do not correspond to these signatures.
23. We can use our knowledge
of signatures to discover word-analysis for sets of words whose stems were so
short that they were in the danger zone:
e.g., do, doing. How will this save?
1.We
will go from having two stems do and doing to one (do), saving 3*4 bits (roughly, if each letter ~ 4 bits),
2.plus
we remove the pointers to do and doing from the NULL signature, and
3.we add
the pointer to the do stem to the NULL.ing signature.
What kind of changes will those
be, in terms of bits?
24. The other major problem
that Harris’ algorithm encountered: sets of suffixes that begin with the same
letter(s): the constructive/construction
or mailman/mailmen problem.
First 100 words of Brown
Corpus leads us to:
on.ve
ng.on
l.tion
an.en
Here MDL should be able to
help us, and in an especially interesting way, because MDL says that the right
analysis is right because of how all the pieces of the morphology fit together
in a global way. Constructi-on is
wrong because ion occurs a lot more
than on does (etc.).
So: we compute the DL of two
analyses, and chose the one for which the overall DL is smaller.
How do we choose which
signatures to reconsider? We look at the last k letters of the stems of a signature, and if the entropy of that
set is less than 1.5, we reconsider. In fact, we chose all lengths up to 4
letters whose entropy is less than 1.5, and pick the length which gives us the
best DL.
The DL will improve if we use
suffixes that occur more often, and even more, if we reuse a
stem that happened to exist in a different signature. In the latter case, we
will need to do some restructuring of the whole morphology to compute the
change in DL.
Let’s keep track of what can
change when we make the cut differently for the words in a given signature:
1. Biggest savings: the new
stem(s) might already exist (but associated with a different signature). Then
the “cost” of the entire stem vanishes.
2. The new signature S2
might already exist (but associated with a different set of stems). Then the “cost”
of both the old and the new signatures goes down.
(How much? let’s work it out.) Question for reflection: how many objects in the overall model should
contain copies of the pointer to the signature?
3. The new suffixes might
already exist. Then their frequencies might go down.

Sample output from
Logfile of Linguistica:
on.ve
Shift # letters: 1: Entropy
sufficiently small: 0
Shift # letters: 2: Entropy
sufficiently small: 1
Shift # letters: 3: Entropy
too large: 2.64644 (Threshold 1.5.)
Suffix: 'on' Number of stems: 12
DL of affix: 7.947
Length of pointers (DL) in these words to on: 79.469
> Information for this suffix is owned by this sig in
this proportion: 0.833 ; i.e. 7.834 bits
Suffix: 've'
Number of stems: 10
DL of affix: 8.210
Length of pointers (DL) in these words to ve: 82.099
> Information for this suffix is owned by this sig in
this proportion: 1.000 ; i.e. 9.401 bits
Length of pointers to this sig: 70.693
Current signature's DL: 265.653
=== Change in size of shifted
piece =
--------- Shifting different pieces: si----------
sion
did not exist before; DL for this new affix is 8.069
Total description cost
for the new suffix, including phonological information: 26.871
Pointers from each word
to 'sion': 40.347
sive
did not exist before; DL for this new affix is 8.069
Total description cost
for the new suffix, including phonological information: 26.871
Pointers from each word
to 'sive': 40.347
Stems:
* aggres
* deci
* expan
* explo
* exten
Pointers to this sig: 35.347
Total for this sig: 169.782
--------- Shifting different pieces: ti----------
tion
existed; old count was 12; New DL for this affix: 7.072
Pointers from each word
to 'tion': 35.362
tive
did not exist before; DL for this new affix is 8.069
Total description cost
for the new suffix, including phonological information: 26.871
Pointers from each word
to 'tive': 40.347
Stems:
* co-opera
* conserva
* posi
* recep
* representa
Pointers to this sig: 35.347
Total for this sig: 185.345
If we add 2 letters, total TD is 355.127
=== Change in size of shifted
piece
--------- Shifting different pieces: i----------
ion
existed; old count was 64; New DL for this affix: 5.322
Pointers from each word
to 'ion': 53.224
ive
existed; old count was 2; New DL for this affix: 7.947
Pointers from each word
to 'ive': 79.469
Stems:
* aggress
* co-operat
* conservat
* decis
* expans
* explos
* extens
* posit
* recept
* representat
Pointers to this sig: 60.693
Total for this sig: 259.880
If we add 1 letters, total TD is 259.880
Conclusion: change on ve
>> ion ive
25. Now we continue:
We have already used our
knowledge of signatures to see if
there are any more sets of words that can be decomposed using known signatures.
Next, we look to see if there
are known stems that can lead us to
hypothesize more suffixes. If a known stem T can be continued to a word W with
a suffix F, then hypothesize T is a suffix. Does it occur with other known
stems to form words? Does this inclusion decrease the other-all description
length? It will certainly decrease the use of some of the signatures, and this
will increase the description length of all of the words that used such
signatures. (Why?) But providing an analysis of a word (that was previously not
analyzed at all) into a known stem
plus a suffix always brings quite a bit of savings: the number of bits in the
stem (minus the length of the pointer to that stem).
26. Example of a problem: MDL
gives wrong answer when testing
NULL.nt (which should really be e.ent): This seems to be a mixture of two
problems: taking “NULL” too seriously, and a local minimum that is not a global
minimum (that old problem!)
stems:
absorbe accorda accueille accélère attira avança baigne
bénéficia changea cime circule comme commence compose concerne confesse
confirma considère coule couvre créa
crée céda date devienne diffère dise dispose donna donne déboucha dérive détienne effectua emploie
entra entre estime exerça exigea existe
explique fabrique forma fournisse garde gouverna habite impose indiqua interprète jette joua lança libère manque
monta montra montre mérite
mêle nia
opposa oppose organise passa passe pense porta porte possède prenne presse procède profita projette
prolifère propose prospère précisa préfère
publie puisse pénètre qui raconte refusa relève remonta remonte renferme retombe réalisa réglemente s'accorde
s'applique s'appropria s'effectue s'empara s'éloigna s'élève semble situe soie souffre survie tente
tourna transforme travailla travaille
trouva trouve trouvère utilisa utilise varie vienne éclata émergea étudia étudie
NULL.nt
Shift # letters: 1: Entropy sufficiently small: 0.663197
Shift # letters: 2: Entropy too large: 2.91303 (Threshold 1.5.)
Suffix: 'NULL' Number of stems: 1,300
DL of affix: 1.515
Length of pointers (DL) in these words to NULL: 43.942
> Information for this
suffix is owned by this sig in this proportion: 0.022 ;
i.e. 0.419 bits
Suffix: 'nt' Number of stems: 46
DL of affix: 6.336
Length of pointers (DL) in these words to nt:
183.743
> Information for this suffix is owned by this sig in
this proportion: 0.630 ; i.e. 5.927 bits
Length of pointers to this sig: 174.737
Current signature's DL: 416.620
=== Change in
size of shifted piece
--------- Shifting different pieces: e----------
e existed; old count was 121; New DL for this affix: 4.631
Pointers from each word
to 'e': 111.137
ent existed; old count was 34; New DL for this affix: 5.882
Pointers from each word to
'ent': 141.174
Stems:
* accélèr
* baign
* cim
* considèr
* détienn
* devienn
* diffèr
* interprèt
* jett
* libèr
* pénètr
* possèd
* préfèr
* prenn
* procèd
* projett
* prolifèr
* prospèr
* relèv
* s'élèv
* s'accord
* s'effectu
* trouvèr
* vienn
Pointers to this sig: 127.163
Total for this sig: 501.124
--------- Shifting different pieces: a----------
a existed; old count was 19; New DL for this affix:
6.275
Pointers from each word
to 'a': 31.373
ant existed; old count was 35; New DL for this affix:
5.860
Pointers from each word
to 'ant': 29.298
Stems:
* avanç
* exerç
* lanç
* s'éloign
* s'appropri
Pointers to this sig: 37.807
Total for this sig: 646.607
If we add 1 letters,
total TD is 1,147.732
NULL.nt: **
Conclusion: Keep original signature. **
|
|
NULL |
nt |
|
|
e |
ent |
|
a |
ant |
|
count |
1300 |
46 |
|
|
121 |
34 |
|
19 |
35 |
|
plog |
1.5 |
6.3 |
|
|
4.6 |
5.8 |
|
6.2 |
5.8 |
|
total ptrs to suffix |
44 |
183 |
|
|
111 |
141 |
|
31 |
29 |
|
ptrs to this sig |
175 |
|
|
|
127 |
|
|
37 |
|
|
DL |
416 |
|
|
|
501 |
|
|
646 |
|
|
|
|
|
|
|
1147 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
27. Building a gold standard (with Nikki Adams).
28. Onion-skin structure: Do
the stems have an internal structure?
Yes, generally they do. Apply
the same algorithm to the set of stems we have just discovered, to find “roots”.