CS5604, Unit IN

Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg VA 24061-0106

Abstract:

While many sales announcements for new retrieval systems claim support for conceptual retrieval, all retrieval systems fundamentally depend on tokens, such as words. We should therefore consider some of the basic aspects of document or lexical analysis, namely identifying words quickly and discarding stopwords.

In this Unit we explore lexical analysis, stopword removal and stemming, and discuss the underlying issues in terms of tokenization, construction of finite state machines, and suffix lookup. Data structures, algorithms, implementation guidelines, and experimental results are given.

This Unit has two chapters and two laboratory exercises. The lecture coverage provides an overview.
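
To make the three steps named above concrete, here is a minimal sketch of tokenization, stopword removal, and suffix stripping. It is an illustration only and is not taken from the Unit's chapters: the stopword list, the suffix table, the function names, and the min_stem threshold are all assumptions, and the suffix rule is a naive lookup, not a published stemming algorithm.

```python
import re

# Hypothetical stopword list for illustration; real lists are much longer.
STOPWORDS = {"a", "an", "and", "in", "of", "on", "that", "the", "to"}

# Hypothetical suffix table, checked in order (longest first).
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and return maximal runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]


def strip_suffix(token, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Run the three steps in sequence: tokenize, filter, stem."""
    return [strip_suffix(t) for t in remove_stopwords(tokenize(text))]


if __name__ == "__main__":
    print(index_terms("The systems claimed to support indexing of words"))
```

The regular expression here stands in for a hand-built finite state machine, and the set lookup stands in for whatever stopword structure an implementation chooses; both are simplifications made for brevity.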



fox@cs.vt.edu
Thu Oct 27 02:57:58 EDT 1994