Goldsmith:

Signature-based discovery of morphological structure

 

0. What is the problem? To find the morphological ( = word-internal ) structure of  a language, given a corpus of words from the language in the range of 5K words to 1,000,000 words.

 

English: laugh, laughs, laughing, laughed, boy, boys, girl, girls, government, governs.

Swahili: nilipenda, nitapenda, ninapenda, unapenda, anapenda, ninakula, ninasoma, ninasema, ninakisema.

 

1. Let’s break the problem into governing principle (MDL and explicit model of morphology) and heuristics (path for searching for solutions in a search space).

 

2. A heuristic first: Zellig Harris’s successor frequency and predecessor frequency, defined on a large set of words.

The SF is defined for any string in a corpus C that begins at least one word (i.e., is a prefix in the computational, not the linguistic, sense of at least one word).

SF(string S) is defined as the number of distinct letters lk such that Slk is a prefix of some word in corpus C.

Note that all words end with a designated symbol (often written “#”, but sometimes with a space; and all punctuation are treated as instances of “#”).

 

3. Example:

SF( gover  ) = 1

SF ( govern  ) = 6 because of  govern, government, governed, governor, governs, governing in some corpus C.

SF ( governm ) = 1.

 

Harris had the insight that SF measurement was a clue to morpheme-boundaries.

 

4. Counting SF is transparent if we store our data in a trie: a standard data structure consisting of a directed acyclic graph where:

      each node represents a string;

      each node’s string is a prefix of the string of any of its daughters’ nodes; and

      no nodes have exactly one daughter (that is, each node is either terminal or it has more than one daughter: each node is either terminal or branching).

 

Look at this in Linguistica

 

5. You can do this backward just as well as forward.

 

6. Crude evaluation of this approach:

      a. It works well at the end of words (when it works) but very poorly within the first 4 letters.

      b. It works poorly or not at all when the stem occurs very few times (e.g., 1 time).

      c. It works poorly (wrongly) when all the suffixes of a stem begin with the same letter(s) (mailman, mailmen; constructive, construction; etc.)

     

7. Quantitative evaluation of this approach: Hafer and Weiss.

 

8. Still, it can be used as the beginning of a good bootstrapping algorithm for a language like English.

 

9. The bootstrap that we use takes the peak approach (not peak and plateau), and then adds a much more stringent condition, based on signatures.

 

 

 

10. At a simple level, what the discovery of morphological structure is is the discovery of sub-word pieces that (a) recur; and (b) recur as parts of patterns (groups) or recurring pieces.

 

 we find all of the cartesian products in our corpus:

 

11. Difference between first column and the second: the elements in the first occur with exactly the pieces in the second; whereas the pieces in the second may reccur in other patterns, such as:

[1]

 

We will call these patterns signatures (a bit ambiguously: sometimes we refer to the set of affixes as the signature, sometimes to the set of affixes along with the corresponding stems).

 

12. The signatures allow for a compact representation of words: when words can be broken into appropriate pieces, they allow for the pieces to be stated just once. [1] above requires 18 letters, plus a certain amount of structure, while a list of the words would look like this: barked, barking, shouted, shouting, yelled, yelling, which would require 41  letters.

 

In general, if there are T stems, then you save T-1 “copies” of all of the suffixes, and if there are F distinct suffixes, you save F-1 “copies” of all of the stems. That can be a lot.

 

13. We know how to build lists using pointers, and how to evaluate the amount of information contained in them: the amount of information is equal to the sum of the (inverse) log frequencies of the items to which the pointers point, plus some way of showing that the list is finished, such as heading the list with a count of how many items are to follow (that takes about log N bits to write).

 

14. How do we express the structure as in (1)? These signatures are just a list of 2 lists. The first is a list of pointers to stems, and the second is a list of pointers to suffixes.

 

So the size of a signature is:

(1) log ( # stems ) + log ( # suffixes ) +

(2) sum over all stems of  -log freq (stemi) +

(3) sum over all suffixes of –log freq (suffixi)

 

15. And the morphology consists of a collection of signatures, a collection of suffixes, and a collection of stems. Each of them has a length equal to:

(1) log ( number of items in the collection )

(2) sum over all members of the size of the item.

 

16. In the case of signatures, the size of the signature is given in (14). In the case of stems and suffixes, the size is more concrete:

 

Stems:

acquire

acquisi

acquitt

acre

etc.

 

How much information does it take to express each of these?

 

17. If we were de Marcken, we might say that the information in acquire is equal to the sum of the (inverse) log frequencies (“information content”) of each of the letters.

18. But the information content is really less than that. Why?

 

19. If we place phonological information in our list of stems and suffixes, we are (in effect) saying that phonological attraction and repulsion are forces that are at work inside of morphemes and are blocked from having an effect across morpheme boundaries and across word boundaries. This is basically correct (though it’s not the whole story).

 

20. We will have an MDL model once we can assign a probability to a word in the corpus; we already have a computation of the length of the morphology in bits. The log of the corpus’ (inverse) probability will be measured in bits as well.

 

Prob (word = stem + suffix, in signature s) =

prob (s) * prob (stem| s) * prob (suffix| s)

 

How does this predicted probability correspond to observed frequencies?

 

21. Now we have an MDL model. If MDL as a theory is correct, in some sense, then if we modify the model so as to decrease the MDL (sum of two terms…), then the resulting changes will better accord with the correct analysis of the data.

 

Thus we need to find ways of modifying the grammar, and we test to see of they decrease the DL (description length).

 

22. We take our initial cuts, then, from the Zellig Harris successor frequency peaks, and use them to create signatures: to each stem, we attach all of the suffixes that it appears with, and then we impose a series of conditions on these signatures:

** a suffix must appear 3 times;

** a signature  must either have 25 stems associated with it; OR

** it must not be composed primarily of 1-letter suffixes: they are very unreliable.Why? It is easy to find lots of 1-letter suffixes with lots of short stems. Implementation: NULL counts as “1”; for all other suffixes, add up 1 less than their length. E.g.: NULL.ed -> 2; NULL.s -> 1; b.s.->0. Threshold: 2.

**

 

We completely throw out any SF-induced word-breaks that do not correspond to these signatures.

 

23. We can use our knowledge of signatures to discover word-analysis for sets of words whose stems were so short that they were in the danger zone: e.g., do, doing. How will this save?

 

1.We will go from having two stems do and doing to one (do), saving 3*4 bits (roughly, if each letter ~ 4 bits),

2.plus we remove the pointers to do and doing from the NULL signature, and

3.we add the pointer to the do stem to the NULL.ing signature.

 

What kind of changes will those be, in terms of bits?

 

24. The other major problem that Harris’ algorithm encountered: sets  of suffixes that begin with the same letter(s): the constructive/construction or mailman/mailmen problem.

First 100 words of Brown Corpus leads us to:

on.ve

ng.on

l.tion

an.en

 

 

Here MDL should be able to help us, and in an especially interesting way, because MDL says that the right analysis is right because of how all the pieces of the morphology fit together in a global way. Constructi-on is wrong because ion occurs a lot more than on does (etc.).

 

So: we compute the DL of two analyses, and chose the one for which the overall DL is smaller.

 

How do we choose which signatures to reconsider? We look at the last k letters of the stems of a signature, and if the entropy of that set is less than 1.5, we reconsider. In fact, we chose all lengths up to 4 letters whose entropy is less than 1.5, and pick the length which gives us the best DL.

 

The DL will improve if we use suffixes that occur more often, and even more, if we reuse a stem that happened to exist in a different signature. In the latter case, we will need to do some restructuring of the whole morphology to compute the change in DL.

 

Let’s keep track of what can change when we make the cut differently for the words in a given signature:

 

1. Biggest savings: the new stem(s) might already exist (but associated with a different signature). Then the “cost” of the entire stem vanishes.

2. The new signature S2 might already exist (but associated with a different set of stems). Then the “cost” of both the old and the new signatures goes down. (How much? let’s work it out.) Question for reflection: how many objects in the overall model should contain copies of the pointer to the signature?

3. The new suffixes might already exist. Then their frequencies might go down.

 

 

 

 

 

Sample output from Logfile of Linguistica:

on.ve

      Shift # letters: 1:    Entropy sufficiently small: 0

      Shift # letters: 2:    Entropy sufficiently small: 1

      Shift # letters: 3:    Entropy too large: 2.64644 (Threshold 1.5.)

 

 

      Suffix: 'on' Number of stems: 12

      DL of affix: 7.947

      Length of pointers (DL) in these words to on: 79.469

            > Information for this suffix is owned by this sig in this proportion: 0.833 ; i.e. 7.834 bits

 

      Suffix: 've' Number of stems: 10

      DL of affix: 8.210

      Length of pointers (DL) in these words to ve: 82.099

            > Information for this suffix is owned by this sig in this proportion: 1.000 ; i.e. 9.401 bits

 

      Length of pointers to this sig: 70.693

 

      Current signature's DL: 265.653

 

 

 === Change in size of shifted piece =

 

 --------- Shifting different pieces: si----------

 

 

      sion did not exist before; DL for this new affix is 8.069

       Total description cost for the new suffix, including phonological information: 26.871

       Pointers from each word to 'sion': 40.347

 

      sive did not exist before; DL for this new affix is 8.069

       Total description cost for the new suffix, including phonological information: 26.871

       Pointers from each word to 'sive': 40.347

 

 

            Stems:

            * aggres

            * deci

            * expan

            * explo

            * exten

 

      Pointers to this sig: 35.347

 Total for this sig: 169.782

 --------- Shifting different pieces: ti----------

 

      tion existed; old count was 12; New DL for this affix: 7.072

        Pointers from each word to 'tion': 35.362

 

      tive did not exist before; DL for this new affix is 8.069

       Total description cost for the new suffix, including phonological information: 26.871

       Pointers from each word to 'tive': 40.347

 

 

            Stems:

            * co-opera

            * conserva

            * posi

            * recep

            * representa

 

      Pointers to this sig: 35.347

 Total for this sig: 185.345

 

       If we add 2 letters, total TD is 355.127

 

 === Change in size of shifted piece

 

 --------- Shifting different pieces: i----------

 

      ion existed; old count was 64; New DL for this affix: 5.322

        Pointers from each word to 'ion': 53.224

      ive existed; old count was 2; New DL for this affix: 7.947

        Pointers from each word to 'ive': 79.469

 

 

            Stems:

            * aggress

            * co-operat

            * conservat

            * decis

            * expans

            * explos

            * extens

            * posit

            * recept

            * representat

 

      Pointers to this sig: 60.693

 Total for this sig: 259.880

 

       If we add 1 letters, total TD is 259.880

 

 

Conclusion: change on ve >> ion ive

 

 

 

 

 

25. Now we continue:

 

We have already used our knowledge of signatures to see if there are any more sets of words that can be decomposed using known signatures.

 

Next, we look to see if there are known stems that can lead us to hypothesize more suffixes. If a known stem T can be continued to a word W with a suffix F, then hypothesize T is a suffix. Does it occur with other known stems to form words? Does this inclusion decrease the other-all description length? It will certainly decrease the use of some of the signatures, and this will increase the description length of all of the words that used such signatures. (Why?) But providing an analysis of a word (that was previously not analyzed at all) into a known stem plus a suffix always brings quite a bit of savings: the number of bits in the stem (minus the length of the pointer to that stem).

 

26. Example of a problem: MDL gives wrong answer when testing NULL.nt (which should really be e.ent): This seems to be a mixture of two problems: taking “NULL” too seriously, and a local minimum that is not a global minimum (that old problem!)

 

stems:

absorbe accorda accueille accélère attira avança baigne bénéficia changea cime circule comme commence compose concerne confesse confirma considère coule couvre  créa crée céda date devienne diffère dise dispose donna donne  déboucha dérive détienne effectua emploie entra entre estime exerça exigea  existe explique fabrique forma fournisse garde gouverna habite impose indiqua  interprète jette joua lança libère manque monta montra montre mérite

 mêle nia opposa oppose organise passa passe pense porta porte  possède prenne presse procède profita projette prolifère propose prospère précisa  préfère publie puisse pénètre qui raconte refusa relève remonta remonte  renferme retombe réalisa réglemente s'accorde s'applique s'appropria s'effectue s'empara s'éloigna  s'élève semble situe soie souffre survie tente tourna transforme travailla  travaille trouva trouve trouvère utilisa utilise varie vienne éclata émergea  étudia étudie

 

NULL.nt

Shift # letters: 1:    Entropy sufficiently small: 0.663197

Shift # letters: 2:    Entropy too large: 2.91303 (Threshold 1.5.)

 

 

Suffix: 'NULL' Number of stems: 1,300

      DL of affix: 1.515

      Length of pointers (DL) in these words to NULL: 43.942

> Information for this suffix is owned by this sig in this proportion: 0.022 ; i.e. 0.419 bits

 

Suffix: 'nt' Number of stems: 46

      DL of affix: 6.336

      Length of pointers (DL) in these words to nt: 183.743

            > Information for this suffix is owned by this sig in this proportion: 0.630 ; i.e. 5.927 bits

 

      Length of pointers to this sig: 174.737

 

      Current signature's DL: 416.620

 

 

 === Change in size of shifted piece

 

 --------- Shifting different pieces: e----------

 

e existed; old count was 121; New DL for this affix: 4.631

        Pointers from each word to 'e': 111.137

ent existed; old count was 34; New DL for this affix: 5.882

Pointers from each word to 'ent': 141.174

 

 

            Stems:

            * accélèr

            * baign

            * cim

            * considèr

            * détienn

            * devienn

            * diffèr

            * interprèt

            * jett

            * libèr

            * pénètr

            * possèd

            * préfèr

            * prenn

            * procèd

            * projett

            * prolifèr

            * prospèr

            * relèv

            * s'élèv

            * s'accord

            * s'effectu

            * trouvèr

            * vienn

 

      Pointers to this sig: 127.163

 Total for this sig: 501.124

 --------- Shifting different pieces: a----------

 

a existed; old count was 19; New DL for this affix: 6.275

        Pointers from each word to 'a': 31.373

ant existed; old count was 35; New DL for this affix: 5.860

        Pointers from each word to 'ant': 29.298

 

 

            Stems:

            * avanç

            * exerç

            * lanç

            * s'éloign

            * s'appropri

 

      Pointers to this sig: 37.807

 Total for this sig: 646.607

 

       If we add 1 letters, total TD is 1,147.732

 

NULL.nt: ** Conclusion: Keep original signature. **

 

NULL

nt

 

 

e

ent

 

a

ant

count

1300

46

 

 

121

34

 

19

35

plog

1.5

6.3

 

 

4.6

5.8

 

6.2

5.8

total ptrs to suffix

44

183

 

 

111

141

 

31

29

ptrs to this sig

175

 

 

 

127

 

 

37

 

DL

416

 

 

 

501

 

 

646

 

 

 

 

 

 

1147

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

27.  Building a gold standard (with Nikki Adams).

 

28. Onion-skin structure: Do the stems have an internal structure?

Yes, generally they do. Apply the same algorithm to the set of stems we have just discovered, to find “roots”.