Automatic Indexing begins with texts and leads
to Document Vectors and a Dictionary.
Lexical analysis may vary from text to text,
yielding different sets of tokens.
Words that appear on the Stopword List are ignored,
and words can be conflated by Stemming or
simple Plural Removal.
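Conflation by simple plural removal can be sketched with a few suffix rules. This is a minimal illustrative heuristic, not a full stemmer; the specific rules below are assumptions for demonstration.

```python
def remove_plural(word):
    """Strip common English plural endings (illustrative rules only)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"      # "queries" -> "query"
    if word.endswith("es") and word[-3:-2] in ("s", "x", "z"):
        return word[:-2]            # "boxes" -> "box"
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]            # "terms" -> "term"
    return word                     # "class" stays "class"
```

A real system would use a fuller stemmer (e.g. Porter's algorithm), but even these three rules conflate many English plurals with their singulars.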
Terminology
- automatic indexing: examining texts to yield a list of index terms
- lexical analysis: converting an input stream of characters into tokens
- query processing: analyzing a query and using it to find matching documents
- stoplist (negative dictionary): a list of words to be ignored during indexing
- token: a group of characters with collective significance
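Lexical analysis as defined above, converting a character stream into tokens, can be sketched in a few lines. The regular expression and lowercasing policy here are assumptions; real analyzers vary by collection.

```python
import re

def tokenize(text):
    """Convert an input stream of characters into tokens:
    lowercase runs of letters and digits with collective significance."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokenize("Automatic Indexing begins with texts.")
# -> ['automatic', 'indexing', 'begins', 'with', 'texts']
```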
Steps in Automatic Indexing
- Identify documents
- Identify fields
- Write an index specification and use it as the driver
- Parse names, dates, compounds, words, abbreviations, acronyms, numbers, special characters, and tags
- Transform tokens to canonical forms and types
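The steps above can be sketched as a small indexing pipeline. The field layout, stoplist, and canonical form (lowercasing) below are illustrative assumptions, not a prescribed implementation.

```python
import re

STOPWORDS = {"the", "of", "and", "a", "in", "to"}  # tiny illustrative stoplist

def index_document(doc_id, fields):
    """Parse each field into tokens, drop stopwords, transform to a
    canonical (lowercase) form, and return the document's index-term list."""
    terms = []
    for field, text in fields.items():
        for token in re.findall(r"[A-Za-z]+", text):   # parse words
            canonical = token.lower()                  # canonical form
            if canonical not in STOPWORDS:             # negative dictionary
                terms.append((field, canonical))
    return doc_id, terms
```

Running `index_document("d1", {"title": "The Art of Indexing"})` keeps `("title", "art")` and `("title", "indexing")` while the stopwords "the" and "of" are dropped.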
LEXICAL ANALYSIS
Issues
- Symmetry of document and query processing
- Specialization to collection: language, format, typography
- Implementation
- Lex, Yacc: SMART experience (too complex, too big, too hard to change)
- Finite state machine
Finite State Machine Approach
Stopword Removal
Issues
- What to remove
- quantity, selection
- Function words?
- High frequency words
- Alternative approaches
- Hashing function: MPHF (minimal perfect hash function)
- Trie
- Lexical analyzer
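Of the alternative approaches above, the trie is easy to sketch. This is a minimal nested-dict trie for stoplist membership tests; the `"$"` end-of-word marker is a hypothetical convention chosen for illustration.

```python
def build_trie(words):
    """Build a character trie from a stoplist, using nested dicts
    with '$' marking end-of-word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def in_stoplist(trie, word):
    """Walk the trie character by character; the word is a stopword
    only if the walk ends at an end-of-word marker."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```

A lookup costs one node visit per character, independent of stoplist size, which is why tries (like MPHFs) suit the high-frequency lookups done during indexing.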
Lexical Analyzer: Finite State Machine
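A lexical analyzer built on a finite state machine can be sketched with two states, inside a token and between tokens, switching on whether the current character is alphanumeric. The character classes and lowercasing are assumptions for illustration.

```python
def fsm_tokenize(text):
    """Two-state FSM lexical analyzer: accumulate characters while in
    the IN_TOKEN state, emit the token on the transition out of it."""
    tokens, current, in_token = [], [], False
    for ch in text:
        if ch.isalnum():                # transition: -> IN_TOKEN
            current.append(ch.lower())
            in_token = True
        elif in_token:                  # transition: IN_TOKEN -> BETWEEN
            tokens.append("".join(current))
            current, in_token = [], False
    if in_token:                        # flush final token at end of input
        tokens.append("".join(current))
    return tokens

fsm_tokenize("Stop-word removal!")
# -> ['stop', 'word', 'removal']
```

Unlike a Lex/Yacc-generated analyzer, a hand-written FSM like this is small and easy to change, which matches the SMART experience noted above.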