Edward A. Fox
Department of Computer Science,
Virginia Tech, Blacksburg VA 24061-0106
Modern advanced retrieval systems have two main features that users find particularly useful, which are not found in Boolean or string-matching type systems. First, they attempt to rank results so that those most likely to be relevant are seen soonest. Second, they learn about the users' interest through a type of feedback mechanism.
In this Unit we explore ranking and feedback, and discuss the underlying issues in terms of weighting methods, similarity measures, query expansion and screening approaches. Data structures, algorithms, implementation guidelines, and experimental results are given.
This Unit has two chapters, one article (with about 3-4 pages of relevant content) and two exercises.
Beginning in the late 1950's, researchers became aware of the value of statistical data in the processes of automatic indexing and retrieval. The vector space and probabilistic models, along with variants, embellishments and alternative approaches, have led to theoretical, implementation, experimental and a small number of commercial developments. In particular, the DowQuest service developed by Thinking Machines Inc. (TMI), and Personal Librarian from Personal Library Software Corp. (a derivative of the SIRE system) were available in the 1980's, and WIN from West Publishing Company debuted in October 1992. The popular WAIS system for network information access became popular in 1991, as an outgrowth of the earlier efforts by TMI. For library catalog searching, the MARIAN system at Virginia Tech became available in 1995. The INQUERY has been under development since 1990, and is being applied in a number of experimental and demonstration projects.
Part of the allure of these approaches is their simplicity, from a user perspective. Users submit natural language statements which the computer then processes --- returning in black-box manner a ranked list of likely results. Interface designers can grapple in a variety of ways with the problem of helping users build a mental model of the process and understand why the documents are retrieved and ranked in the given order. As long as mostly relevant items are at the top of the list, however, normal users will be quite pleased.
Inside the retrieval system, a variety of operations, sometimes including stemming and nearest-neighbor term expansion, can be applied, along with computation of similarity values that rely on document statistical characteristics. The likely benefit of longer queries is increased recall, and the likely benefit of good similarity and weighting schemes is increased precision.
How can we identify the best similarity measures? Building upon various theoretical foundations, and testing on a variety of collections through numerous experimental studies, general guidelines have emerged. These have led to usable data structures, efficient algorithms for building and using those structures, and both prototype and commercial implementations (e.g., WIN from West costing perhaps $8M). These include methods to store frequency and normalizing values, schemes to accumulate partial results and prune the search process for the n best documents, ways to locate and store the nearest neighbors to terms, and approaches to let users or retrieved documents help with screening of sometimes large sets of query expansion candidate terms.
Users also can identify which documents they like, and train the system through relevance feedback methods. These lead to weights on query terms, and on terms being added. Various approaches have been tested regarding how the old query fits in with the new (terms). Expensive hardware (e.g., Connection Machines for the DowQuest service) or clever algorithms (for term selection) can handle the large new queries that can result from feedback. We balance the benefits of added terms, with the problems of extra computation, and the scarcity of data regarding the distribution of (new) terms in the relevant vs. other documents.
Donna Harman is carrying forward her studies of ranking and feedback in the largest experimental effort of its kind, the TIPSTER and TREC projects, with almost forty research groups exploring a variety of approaches of this kind to a very large test collection. TIPSTER has entered Phase II of its funding, and TREC-5 will be held later in 1996.
From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can operationalize this in terms of considering the MARIAN, and INQUERY systems. We can also look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: inverted vs. document files, dictionaries with nearest neighbors, accumulators for document similarities vs. pruning methods, etc.
Another objective calls for students to: read and understand research contributions ... . The SALT75b reference was a research article when written, and gives a flavor of the interaction of theory and experimentation in this field. The text chapters are more like review articles, summarizing the history, and trying to organize the results given in many papers.
In summary, this Unit has the following objectives, for students to be able to:
There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should repeat the CD-ROM exercise using the MARIAN system and also use the INQUERY system. Note that it is wisest to do the readings before the computer exercise, though you may want to experiment a little with the system earlier, to motivate and provide a concrete foundation for the chapters.
There are two parts to the exercise: searching for publications about CD-ROM using the MARIAN system, and using the INQUERY system.
The MARIAN system has been under development in the Computing Center as an alternative to VTLS for searching. There are interfaces for NeXT computers, for systems that emulate a VT100 with curses, for gopher+ clients, and for WWW clients.
MARIAN is now in beta test, and is being extended and improved.. It usually works, but try again later if there is no response or if the results seem not correct (in which case we would appreciate a detailed account of the anomalies observed).
Please find and report upon "publications about CD-ROM" using MARIAN. You can copy and paste the result set directly from your searches. In particular, please send the instructor:
Please study about the INQUERY system so you can use it effectively, including applying relevance feedback.
INQUERY WWW information includes a good introduction saying how it was used in the TIPSTER project. Please also read about its features and study the syntax used for advanced queries so you can try some of those as well as simple "natural language".
Please carry out the following exercise, for each of 2-3 databases that work with INQUERY, accessible from the WWW pages mentioned above (e.g., United States Holocaust Memorial Museum or sample databases or Thomas):
You should be able to answer each of the following questions.
In this collection of readings there are many typographical and other minor errors. Most are detailed below --- if any others are found, please tell the instructor so a complete list can be developed. In addition, many comments, clarifications, points of disagreement, etc. are provided.
In general, this is a good chapter, with introduction, history, experimental results, guidelines for implementers, enhancements, and a summary of related topics. The organization makes sense, but does cause some findings to be mentioned several times. This chapter should be read before Chapter 11, but there are cross references, and some repetition between the two.
Regarding Chapter 14, there are many specific comments to be made:
In Chapter 11, there is good discussion on feedback methods. Little is said about the underlying theories, or on issues relating to statistics, probability, or AI. However, many details are given regarding algorithms and implementations.
This article gives some coverage on the vector space model, and some results on ranking, especially using discrimination value (DV) weighting. Of particular relevance are the first page (p. 613) and Figure 1, the weighting discussion on p. 615 in Section 2, Table III on p. 618, and the last two pages (pp. 619-620). Figure 7 is very important. Note that in Figure 8, the key is misleading, since the better curve is for phrase assignment.