Introduction

Beginning in the late 1950s, researchers became aware of the value of statistical data in the processes of automatic indexing and retrieval. The vector space and probabilistic models, along with their variants, embellishments, and alternative approaches, have led to theoretical, implementation, experimental, and a small number of commercial developments. In particular, the DowQuest service developed by Thinking Machines Inc. (TMI) and Personal Librarian from Personal Library Software Corp. (a derivative of the SIRE system) were available in the 1980s, and WIN from West Publishing Company debuted in October 1992. The WAIS system for network information access, an outgrowth of earlier efforts by TMI, became popular in 1991. For library catalog searching, the MARIAN system at Virginia Tech will become available by 1995.

Part of the allure of these approaches is their simplicity, from a user perspective. Users submit natural language statements, which the computer processes in black-box fashion, returning a ranked list of likely results. Interface designers can grapple in a variety of ways with the problem of helping users build a mental model of the process and understand why documents are retrieved and ranked as they are. As long as mostly relevant items appear at the top of the list, however, most users will be quite pleased.

Inside the retrieval system, a variety of operations, sometimes including stemming and nearest-neighbor term expansion, can be applied, along with computation of similarity values that rely on statistical characteristics of the documents. The likely benefit of longer queries is increased recall, and the likely benefit of good similarity and weighting schemes is increased precision.
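As a concrete illustration of such similarity computation, the sketch below implements one common form of the vector space model: tf-idf term weighting with cosine similarity. The function names and the particular weighting formula are illustrative assumptions, not the scheme of any specific system named above.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build sparse tf-idf weight vectors for a small tokenized corpus.
    Illustrative weighting: tf * log(N / df); real systems vary."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency in this doc
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(q, d):
    """Cosine similarity between two sparse weight vectors (dicts)."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0
```

A query is treated simply as another (short) weight vector, so ranking a collection amounts to computing the cosine between the query vector and each document vector.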

How can we identify the best similarity measures? Building upon various theoretical foundations, and testing on a variety of collections through numerous experimental studies, researchers have arrived at general guidelines. These have led to usable data structures, efficient algorithms for building and using those structures, and both prototype and commercial implementations (e.g., WIN from West, costing perhaps $8M). The developments include methods to store frequency and normalizing values, schemes to accumulate partial results and prune the search process down to the n best documents, ways to locate and store the nearest neighbors of terms, and approaches that let users or retrieved documents help screen the sometimes large sets of candidate terms for query expansion.
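The accumulate-and-prune idea above can be sketched as follows: partial similarity scores are gathered from an inverted file, one posting list at a time, then normalized and reduced to the n best documents. The data layout (postings as lists of (doc_id, weight) pairs, precomputed document norms) is an assumed simplification; production systems add pruning heuristics during accumulation.

```python
import heapq
from collections import defaultdict

def top_n(query_weights, postings, doc_norms, n):
    """Rank documents against a weighted query using an inverted file,
    returning the n best (score, doc_id) pairs. A simplified sketch."""
    acc = defaultdict(float)            # per-document score accumulators
    for term, q_w in query_weights.items():
        for doc_id, d_w in postings.get(term, []):
            acc[doc_id] += q_w * d_w    # accumulate partial dot products
    # Normalize by precomputed document lengths, then keep only the n best.
    return heapq.nlargest(n, ((s / doc_norms[d], d) for d, s in acc.items()))
```

Only documents sharing at least one term with the query ever receive an accumulator, which is what makes the inverted-file organization efficient for large collections.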

Users also can identify which documents they like, and so train the system through relevance feedback methods. These methods yield weights for the original query terms and for terms being added. Various approaches have been tested for combining the old query with the new terms. Expensive hardware (e.g., Connection Machines for the DowQuest service) or clever algorithms (for term selection) can handle the large new queries that can result from feedback. We must balance the benefits of added terms against the cost of extra computation and the scarcity of data regarding the distribution of the new terms in relevant versus other documents.
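One well-known way of combining the old query with terms from judged documents is Rocchio-style feedback, sketched below. The coefficient values are illustrative defaults, not those of any particular system discussed here.

```python
from collections import defaultdict

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style relevance feedback over sparse weight vectors:
    keep the old query (alpha), move toward the centroid of relevant
    documents (beta), away from nonrelevant ones (gamma)."""
    new_q = defaultdict(float)
    for t, w in query.items():
        new_q[t] += alpha * w
    for doc in relevant:
        for t, w in doc.items():
            new_q[t] += beta * w / len(relevant)
    for doc in nonrelevant:
        for t, w in doc.items():
            new_q[t] -= gamma * w / len(nonrelevant)
    # Drop terms driven to zero or below; negative weights are rarely kept.
    return {t: w for t, w in new_q.items() if w > 0}
```

Note how the expanded query can grow large: every term appearing in a judged document becomes a candidate, which is exactly why term-selection algorithms (or fast hardware) matter in practice.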

Donna Harman is carrying forward her studies of ranking and feedback in the largest experimental effort of its kind, the TIPSTER and TREC projects, in which almost forty research groups apply a variety of such approaches to a very large test collection. TIPSTER has entered Phase II of its funding, and TREC-3 will be held later in 1994.


fox@cs.vt.edu
Thu Oct 27 01:30:52 EDT 1994