To search large text collections efficiently, the data must first be prepared by a process called automatic indexing. Many retrieval systems are advertised on the basis of how many megabytes per hour they can index, an important measure when many gigabytes must be processed. Automatic indexing begins with text analysis, tokenization, and stopword removal, all considered in this unit, so these processes must be fast if indexing as a whole is to be fast. Later stages of indexing are discussed in other parts of our text, namely: constructing a dictionary and document vector file, building an inverted file, and computing weights for terms.
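The overall pipeline, from raw text through tokenization and stopword removal to an inverted file, can be sketched in a few lines. This is a minimal illustration only: the function names, the tiny stopword list, and the use of an in-memory dictionary are assumptions for the example, while real systems add dictionaries, compression, and term weighting.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "of", "and", "to"}  # tiny illustrative list

def tokenize(text):
    """Text analysis: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each surviving term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in tokenize(text):
            if token not in STOPWORDS:   # stopword elimination
                index[token].add(doc_id)
    return index

docs = ["The cat sat", "A cat and a dog"]
index = build_inverted_index(docs)
print(sorted(index["cat"]))  # → [0, 1]
```

A query can then be answered by intersecting the posting sets of its terms rather than scanning every document.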
To begin, the incoming stream of characters must be classified by type and grouped into tokens, much as in a compiler. The discussion of tokenization should therefore be familiar to students and need not be studied in depth. Finding and eliminating stopwords can be viewed as part of this same process, so it too admits a solution based on lexical analysis technology; thus Chapter 7 warrants little study. One should know, however, that other approaches, such as hashing techniques, can also yield reasonable solutions.
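The two ideas above, a compiler-style scanner that groups characters by type and a hashed lookup for stopwords, can be combined in a short sketch. The names and the stopword list are illustrative assumptions; Python's built-in set plays the role of the hash table, giving constant-time membership tests.

```python
STOPWORDS = {"the", "of", "and", "is", "a"}  # hashed set: O(1) lookup

def scan(text):
    """Compiler-style lexical analysis: classify each character and
    group runs of letters into lowercase word tokens."""
    token = []
    for ch in text:
        if ch.isalpha():
            token.append(ch.lower())
        elif token:               # a non-letter ends the current token
            yield "".join(token)
            token = []
    if token:                     # flush a token at end of input
        yield "".join(token)

def index_terms(text):
    """Tokens that survive stopword elimination."""
    return [t for t in scan(text) if t not in STOPWORDS]

print(index_terms("The indexing of text is fast."))
# → ['indexing', 'text', 'fast']
```

Because both the scanner and the hashed lookup do constant work per character or token, the whole pass runs in a single sweep over the input, which is what makes fast indexing possible.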
Stemming methods, on the other hand, are relatively unusual, being rarely used outside information retrieval applications. Searchers familiar with Boolean systems are often accustomed to truncation, computer scientists often work with regular expressions, and linguists prefer morphological analysis to find word roots; stemming is not quite like any of these, but it yields comparable results and can be carried out algorithmically. Chapter 8 surveys this interesting field, giving not only descriptions of algorithms but also an overview of research in the area.
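The flavor of algorithmic stemming can be conveyed by a minimal suffix-stripping sketch. The rule list and the minimum-stem-length condition below are assumptions made for illustration; real stemmers such as Porter's, of the kind surveyed in Chapter 8, apply ordered rule sets with more careful conditions on the stem.

```python
# Suffixes ordered longest-first so that "ing" is tried before "s", etc.
SUFFIXES = ["ational", "ization", "ations", "ingly",
            "ness", "ing", "ers", "ies", "ed", "s"]

def stem(word):
    """Strip the longest matching suffix, but only if at least
    three letters of stem remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("indexing"))   # → 'index'
print(stem("libraries"))  # → 'librar'
print(stem("cats"))       # → 'cat'
```

Note that the output need not be a dictionary word ("librar"); it only needs to conflate related word forms to the same index term, which is why stemming differs from both truncation and true morphological analysis.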