Digital Libraries - Automatic Indexing
Automatic Indexing begins with texts, and leads to:
- index term lists or Document Vectors (e.g., for a document, a list of all
  words in it along with how many times they appear: ([list,5],[vector,3])),
  and
- a Dictionary (e.g., a list of all unique words and their IDs).
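The two outputs above can be sketched in a few lines of Python. This is only an illustration: the function name build_index, the whitespace tokenization, and the choice to assign IDs in order of first appearance are assumptions, not part of the notes.

```python
from collections import Counter

def build_index(docs):
    """Return (dictionary, vectors) for a list of document strings."""
    dictionary = {}  # unique word -> integer ID (order of first appearance)
    vectors = []     # one {word_id: count} mapping per document
    for text in docs:
        # Assumption: lowercase and split on whitespace; real lexical
        # analysis is more involved.
        counts = Counter(text.lower().split())
        vector = {}
        for word, n in counts.items():
            word_id = dictionary.setdefault(word, len(dictionary))
            vector[word_id] = n
        vectors.append(vector)
    return dictionary, vectors

docs = ["list vector list", "vector index"]
dictionary, vectors = build_index(docs)
```

Here each document vector stores word IDs rather than the words themselves, so the dictionary is what links a vector entry back to its term.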
Words can be conflated by Stemming or
simple Plural Removal:
- stemming: computation to comput
- plural removal: salaries to salary
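A rough sketch of both kinds of conflation follows. The rule set and the suffix list are assumptions chosen to reproduce the two examples above; they are far simpler than a real stemmer and will mishandle many words.

```python
def remove_plural(word):
    """Simple plural removal using three suffix rules (illustrative only)."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"          # salaries -> salary
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]                # drop only the final "s"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]                # words -> word
    return word

# Hypothetical suffix list for a toy stemmer; a real stemmer uses many
# more rules and conditions.
SUFFIXES = ("ation", "ing", "er", "e")

def crude_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

With these rules, "computation", "computing", "computer" and "compute" all reduce to the same stem, which is the point of conflation: variant forms share one index term.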
Terminology
- lexical analysis
  - convert input stream of chars to tokens
- query processing
  - analyze query and use it to find documents
- stop word list, or stoplist, or negative dictionary
  - list of words to be ignored in indexing (e.g., a, an, and, of, the)
- token
  - char group with collective significance (e.g., word, number, name)
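The terms above fit together as in the following sketch. The token pattern (runs of letters or runs of digits), the lowercasing, and the "every query term must appear" matching rule are assumptions made for illustration; the stoplist is the example list from the notes.

```python
import re

STOPLIST = {"a", "an", "and", "of", "the"}   # example stoplist from the notes
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+")      # assumption: letter runs or digit runs

def tokenize(text):
    """Lexical analysis: convert a character stream to lowercase tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def index_terms(text):
    """Tokens that remain after removing stoplist words."""
    return [t for t in tokenize(text) if t not in STOPLIST]

def search(query, docs):
    """Query processing: indices of docs containing every non-stop query term."""
    terms = set(index_terms(query))
    return [i for i, d in enumerate(docs) if terms <= set(index_terms(d))]
```

Note that the query goes through the same lexical analysis and stoplist as the documents, so that query terms and index terms are comparable.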
Steps in Automatic Indexing
- Identify documents (e.g., an article in an encyclopedia)
- Identify fields (e.g., title, author, abstract)
- Parse (and transform to standard forms):
  names, dates, compounds, words, abbreviations,
  acronyms, numbers, special characters, tags
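The three steps can be illustrated on a toy collection. Everything about the input layout here is an assumption: documents separated by blank lines, fields written as "name: value" lines, and only dates and dotted acronyms being normalized. Month-name parsing also assumes an English locale.

```python
import re
from datetime import datetime

def identify_documents(collection):
    """Step 1: split a collection into documents (assumed blank-line separated)."""
    return re.split(r"\n\s*\n", collection.strip())

def split_fields(record):
    """Step 2: split 'name: value' lines into a dict of fields."""
    fields = {}
    for line in record.strip().splitlines():
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return fields

def normalize_date(text):
    """Step 3 (dates): convert a few known layouts to ISO form, else return unchanged."""
    for layout in ("%B %d, %Y", "%d %B %Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(text, layout).date().isoformat()
        except ValueError:
            continue
    return text

def normalize_acronym(token):
    """Step 3 (acronyms): write dotted forms such as U.S.A. as USA."""
    if re.fullmatch(r"(?:[A-Za-z]\.){2,}", token):
        return token.replace(".", "").upper()
    return token

collection = "title: Indexing\ndate: March 5, 1998\n\ntitle: Stemming\ndate: 1999-01-02"
records = [split_fields(r) for r in identify_documents(collection)]
```

Transforming to a standard form means that "March 5, 1998" and "5 March 1998" end up as the same index entry.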
For more information see: