IN - Automatic Indexing

Automatic indexing starts from texts and produces document vectors and a dictionary of index terms.

Lexical analysis may vary for each text, leading to different sets of tokens.
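
As a rough illustration of how the choice of lexical rules changes the token set, the sketch below applies two policies to the same string. Both policies, the function names, and the sample text are assumptions made for this example and are not prescribed by these notes.

```python
import re

def tokens_letters_only(text):
    # Policy A (assumed): a token is a maximal run of ASCII letters, lowercased.
    return re.findall(r"[a-z]+", text.lower())

def tokens_keep_digits_hyphens(text):
    # Policy B (assumed): letters and digits, keeping internal hyphens, lowercased.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())

sample = "State-of-the-art B-52 engines"
print(tokens_letters_only(sample))         # hyphens and digits split or dropped
print(tokens_keep_digits_hyphens(sample))  # hyphenated forms kept whole
```

Policy A breaks "State-of-the-art" into four tokens and loses "52", while policy B keeps "state-of-the-art" and "b-52" as single tokens, so the two analyzers would feed different terms to the later steps.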

Words that appear on the stopword list can be ignored.

Words can be conflated by Stemming or simple Plural Removal.
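
The steps summarized above can be sketched end to end as follows. This is a minimal sketch under stated assumptions: the stopword set is a tiny illustrative one, the plural-removal rules are one simple hypothetical rule set (not a full stemmer), and the dictionary is modeled as a term-to-id mapping with document vectors as sparse id-to-frequency mappings. None of these choices comes from the notes themselves.

```python
import re
from collections import Counter

# Illustrative stopword list only; real lists are much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "of", "the", "to"}

def tokenize(text):
    # Lexical analysis (assumed policy): lowercase runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())

def remove_plural(word):
    # Simple plural removal with hypothetical rules; short words are left alone.
    if len(word) <= 3:
        return word
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

def index_texts(texts):
    dictionary = {}  # term -> integer id, assigned in order of first appearance
    vectors = []     # one sparse vector (id -> term frequency) per text
    for text in texts:
        terms = [remove_plural(t) for t in tokenize(text) if t not in STOPWORDS]
        vector = {}
        for term, freq in Counter(terms).items():
            term_id = dictionary.setdefault(term, len(dictionary))
            vector[term_id] = freq
        vectors.append(vector)
    return dictionary, vectors

texts = ["The documents answer the queries.", "A query and a document."]
dictionary, vectors = index_texts(texts)
print(dictionary)
print(vectors)
```

In this example "documents" and "document", and "queries" and "query", are conflated to the same dictionary entries, so the two texts share two index terms even though their surface words differ.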

Terminology

automatic indexing: examine texts to yield a list of index terms
lexical analysis: convert an input stream of characters into tokens
query processing: analyze a query and use it to find documents
stoplist, or negative dictionary: list of words to be ignored in indexing
token: group of characters with collective significance
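
To make the terms above concrete, here is a small sketch that reads a character stream one character at a time, groups characters into tokens, and then drops tokens found on a stoplist. The rule that a token is a run of alphanumeric characters, the name `scan`, and the contents of `STOPLIST` are all assumptions made for illustration.

```python
STOPLIST = {"the", "of"}  # hypothetical negative dictionary

def scan(chars):
    """Yield lowercase tokens from an iterable of characters."""
    current = []
    for ch in chars:
        if ch.isalnum():
            # Still inside a token: collect the character.
            current.append(ch.lower())
        elif current:
            # A non-token character ends the current token.
            yield "".join(current)
            current = []
    if current:
        # Flush the last token at end of input.
        yield "".join(current)

tokens = list(scan("The index of terms"))
index_terms = [t for t in tokens if t not in STOPLIST]
print(tokens)
print(index_terms)
```

Because `scan` only ever looks at the current character and whether a token is in progress, it works on any character iterator, including a stream read incrementally from a file.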

Steps in Automatic Indexing

Lexical Analysis
    Issues
    Finite State Machine Approach
Stopword Removal
    Issues

Lexical Analyzer: Finite State Machine