Linguistica
Download software now (latest: v3.2, December 2003; older versions temporarily unavailable)
John Goldsmith (email, homepage)
Departments of Linguistics and Computer Science
University of Chicago
What is Linguistica?
Linguistica is a program which can be used to explore the unsupervised
learning of natural language, with primary focus on morphology, which
is to say, word-structure. It runs under Windows, and is written in C++. Its
demands on memory depend on the size of the corpus analyzed. We are currently
developing a Linux and a Macintosh version.
Unsupervised learning refers to the computational task of making inferences
(or acquiring knowledge) about the structure that lies behind some set of
data without any direct access to that structure. In the case of unsupervised
learning of morphology, the task is to infer the morphemes, and the possible
combinations of morphemes, for a set of words, based on no knowledge
whatsoever of the language from which the words are drawn.
Segmentation is the first task of this process: figuring out where
the morpheme breaks are in the words, and what are the stems, what are the
suffixes, and so forth. Most of Linguistica's functionality, at this point,
goes into making these decisions. It has some limited capabilities for learning
allomorphy, which is to say, the ways in which stems or affixes are
modified in particular contexts (for example, most stems that end in -e in
English will drop the -e before various vowel-initial suffixes, such as -ed,
-ing, and -ity).
This document presents a brief description of how to use the program, and
links to other documents which explain the ideas incorporated in Linguistica.
Understanding Linguistica
This section presents a bit of background about Linguistica, to help you
better understand what the program does.
Using Linguistica
First, a couple of things to note.
How to Begin
The first operation is to read in a corpus, or a part of a corpus. The default
setting for Linguistica's corpus input is 5,000 words: this is the number
of words from a corpus that the program will read. If you wish to change this
setting, select "Words requested" in the Lower Tree on the left.
A pop-up window will appear in which you can specify a different number of
words to be read from the corpus. This number refers to the total number of
word tokens read, not the number of distinct word types.
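The difference between tokens and types can be illustrated with a minimal Python sketch. It is not Linguistica's own code: the function name `read_tokens`, the `words_requested` parameter, and the whitespace tokenization are all assumptions made for illustration.

```python
from collections import Counter


def read_tokens(text, words_requested=5000):
    """Return the first `words_requested` tokens of `text`.

    Assumption: tokens are whitespace-separated strings. How Linguistica
    itself splits a corpus into words is not described here.
    """
    return text.split()[:words_requested]


corpus = "the dog saw the cat and the cat saw the dog"
tokens = read_tokens(corpus, words_requested=8)

# Each distinct token is one type; the counter maps a type to its token count.
type_counts = Counter(tokens)

print(len(tokens))       # number of word tokens read
print(len(type_counts))  # number of distinct word types among them
```

With a limit of 8, the sketch reads eight tokens but only five types, because "the" and "cat" are repeated.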
To read a corpus, click on the third menu item, "Reading", and then click on "Read corpus". A window will appear in which you identify the location of the text file (not a word processor document) which you wish to read. If you have already run Linguistica and have previously read in a corpus, it will remember the location of the file, and you can simply click "Reread corpus" to reread the same old corpus. Shortcut key: Type Control-D to reread the same corpus.
When the reading is complete, "Words"
will appear in the Upper Tree area at the top left of the screen (which
you saw earlier), under "Lexicon". The Lexicon contains many collections of
information, including Words, Stems, Suffixes, Prefixes, and Signatures. When
these collections are empty, they do not appear in the Upper Tree area. You
may click on Words, and the words of the corpus will appear as a list in the
Collections area. The columns in the Collections area may be too narrow or
too wide for your purposes. You can change the width of a column by grabbing
its edge at the top of the column with your mouse and moving it to the left
or the right. You can also sort by the values in any column by clicking on
the title of that column. This is particularly useful with the "Corpus Count"
column: clicking on it brings the most frequent words to the top of the
list. You can return to an alphabetical display of the words by clicking
on the title of the first column, "Words". If you wish to see the words organized
into a "trie", you can click on the "Forward trie" line, also under "Lexicon"
in the Upper Tree area.
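For readers unfamiliar with the term, a forward trie stores words letter by letter from left to right, so that words sharing a beginning share a path. The following Python sketch shows the general idea only. The function name `build_forward_trie` and the use of "#" as an end-of-word marker are assumptions, and this is not a description of Linguistica's internal data structure.

```python
def build_forward_trie(words):
    """Build a nested-dict trie, reading each word from left to right.

    Assumption: "#" never occurs inside a word, so it can mark word ends.
    """
    root = {}
    for word in words:
        node = root
        for letter in word:
            # Reuse the branch for this letter if it exists, else create it.
            node = node.setdefault(letter, {})
        node["#"] = True  # a complete word ends at this node
    return root


trie = build_forward_trie(["jump", "jumps", "jumped", "walk"])

# The node reached by j-u-m-p ends a word and continues with "s" and "e".
jump_node = trie["j"]["u"]["m"]["p"]
print(sorted(jump_node))
```

A node with several continuations, such as the one reached by "jump" here, is a natural place to look for a morpheme boundary.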
Finding a suffixal system: signature-based analysis

In the Lexicon, then, we have 29 suffixes, and 105 signatures built up out of them, along with 819 stems. Of the 3,049 words, 1,103 were analyzed. You should click on each of these groups in turn, and see that they are displayed in the Collections area on the right as you do so. When the collections get large, it may take a while to display a collection (10 seconds or more if there are many more than 5,000 members).
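The relation between stems, suffixes, and signatures can be sketched in a few lines of Python. This sketch assumes the segmentation into (stem, suffix) pairs has already been made, which is the hard part that Linguistica performs. The function name `group_into_signatures`, the use of an empty string for the NULL suffix, and the alphabetical ordering of suffixes inside a signature are assumptions for illustration.

```python
from collections import defaultdict


def group_into_signatures(analyses):
    """Group stems by the set of suffixes each one was found with.

    `analyses` is an iterable of (stem, suffix) pairs; an empty suffix
    stands for NULL. Returns a dict mapping a signature string such as
    "NULL.ed.ing.s" to the list of stems that take exactly those suffixes.
    """
    suffixes_by_stem = defaultdict(set)
    for stem, suffix in analyses:
        suffixes_by_stem[stem].add(suffix or "NULL")

    stems_by_signature = defaultdict(list)
    for stem, suffixes in suffixes_by_stem.items():
        signature = ".".join(sorted(suffixes))
        stems_by_signature[signature].append(stem)
    return dict(stems_by_signature)


analyses = [
    ("jump", ""), ("jump", "ed"), ("jump", "ing"), ("jump", "s"),
    ("walk", ""), ("walk", "ed"), ("walk", "ing"), ("walk", "s"),
    ("lov", "e"), ("lov", "ed"), ("lov", "es"), ("lov", "ing"),
]
result = group_into_signatures(analyses)
for signature, stems in result.items():
    print(signature, stems)
```

In this toy input, "jump" and "walk" share one signature, while "lov" has a signature of its own.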
Saving to file
Prefixes
Allomorphy
Linguistica looks for reanalyses in which certain material that had previously been analyzed as part of a suffix is reintegrated into the stem, and other suffixes are marked as capable of deleting that material when it appears before them. For example, the words love, loves, loved, and loving, which had been analyzed as the stem lov with the signature e.ed.es.ing, will be reanalyzed with the stem love and the suffixes NULL, ed, s, and ing. The suffixes ed and ing are marked as capable of deleting a preceding e, and this is indicated by placing an e in angle brackets before the suffix, thus: <e>ing and <e>ed. The new signature for love is therefore NULL.<e>ed.<e>ing.s, and this signature correctly deals both with stems that end with -e and with those that do not.
You may note that Linguistica treats y-final nouns and verbs this way: thus academy/academies is treated as based on the stem academy and the suffixes NULL and <y>ies.
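The angle-bracket notation can be read as an instruction: delete the bracketed material from the end of the stem if it is there, then attach the rest of the suffix. The Python sketch below is one possible reading of that notation and not Linguistica's implementation. The function name `attach` is hypothetical.

```python
import re


def attach(stem, suffix):
    """Combine a stem with a suffix written in the <x>suffix notation.

    Assumption: "<e>ing" means "delete a stem-final e if present, then
    add ing", and "NULL" means "add nothing".
    """
    if suffix == "NULL":
        return stem
    match = re.fullmatch(r"<(\w+)>(\w+)", suffix)
    if match is None:
        return stem + suffix  # plain suffix, no deletion
    deleted, ending = match.groups()
    if stem.endswith(deleted):
        stem = stem[: -len(deleted)]
    return stem + ending


for stem in ("love", "jump"):
    print([attach(stem, s) for s in ("NULL", "<e>ed", "<e>ing", "s")])
print(attach("academy", "<y>ies"))
```

Under this reading, the single signature NULL.<e>ed.<e>ing.s produces the expected forms for both love and jump.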
Rich morphologies
Consider a typical initial template, arising in a 15,000-word corpus of Swahili:


By default, signatures are ranked by their robustness, which is roughly the number of letters saved by the analysis. That is, the robustness of a signature is (roughly) the total number of letters in the original words analyzed by the signature, minus the number of letters in the signature itself. The signatures can be re-sorted by clicking on the header at the top of any column of the display. "Remarks" gives an indication of which function was responsible for the identification of the signature.
Now you can click, successively, on the rest of the items within "For first-time
users" in the Tree: (1) Successor Freq 1, (2) Known stems and suffixes, (3)
Loose fit, (4) Check signatures, and (5) Find prefixes (of suffixal stems).
You will find the resulting data of these actions under "Lexicon" in the Statistics
area.
If you click on one of these items, details will appear in the Collections area.
These steps have now provided you with a morphological analysis of the suffixal
system of the language. We will discuss later what these steps consist of.