CS5604, Unit RR

Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg VA 24061-0106

Abstract:

Modern advanced retrieval systems have two main features that users find particularly useful, which are not found in Boolean or string-matching type systems. First, they attempt to rank results so that those most likely to be relevant are seen soonest. Second, they learn about the users' interest through a type of feedback mechanism.

In this Unit we explore ranking and feedback, and discuss the underlying issues in terms of weighting methods, similarity measures, query expansion and screening approaches. Data structures, algorithms, implementation guidelines, and experimental results are given.

This Unit has two chapters, one article (with about 3-4 pages of relevant content) and two exercises.

Unit Highlights

Lecture/Motivating Question:
Is there a single best combination of model, weighting, feedback, query expansion, and query screening to be used? If not, can we pick a very good one for each type of collection we encounter?

Special Class Presentations:
The lectures provide an overview on the chapters and reading and include discussion of the probabilistic model.

Homework Exercise
The MARIAN system makes use of specialized ranking methods (similar to the optimizations discussed in the text) to search essentially the same data as VTLS. It will be a useful comparison to redo the earlier assignment involving VTLS, regarding publications about CD-ROM. Also, it will be of value to work with the INQUERY system to gain experience with the underlying search and feedback software.

Computer Exercise
The MARIAN and INQUERY systems can be run on WWW.

Introduction

Beginning in the late 1950's, researchers became aware of the value of statistical data in the processes of automatic indexing and retrieval. The vector space and probabilistic models, along with variants, embellishments and alternative approaches, have led to theoretical, implementation, experimental and a small number of commercial developments. In particular, the DowQuest service developed by Thinking Machines Inc. (TMI), and Personal Librarian from Personal Library Software Corp. (a derivative of the SIRE system) were available in the 1980's, and WIN from West Publishing Company debuted in October 1992. The popular WAIS system for network information access became popular in 1991, as an outgrowth of the earlier efforts by TMI. For library catalog searching, the MARIAN system at Virginia Tech will become available by 1995. The INQUERY has been under development since 1990, and is being applied in a number of experimental and demonstration projects.

Part of the allure of these approaches is their simplicity, from a user perspective. Users submit natural language statements which the computer then processes --- returning in black-box manner a ranked list of likely results. Interface designers can grapple in a variety of ways with the problem of helping users build a mental model of the process and understand why the documents are retrieved and ranked in the given order. As long as mostly relevant items are at the top of the list, however, normal users will be quite pleased.

Inside the retrieval system, a variety of operations, sometimes including stemming and nearest-neighbor term expansion, can be applied, along with computation of similarity values that rely on document statistical characteristics. The likely benefit of longer queries is increased recall, and the likely benefit of good similarity and weighting schemes is increased precision.

How can we identify the best similarity measures? Building upon various theoretical foundations, and testing on a variety of collections through numerous experimental studies, general guidelines have emerged. These have led to usable data structures, efficient algorithms for building and using those structures, and both prototype and commercial implementations (e.g., WIN from West costing perhaps $8M). These include methods to store frequency and normalizing values, schemes to accumulate partial results and prune the search process for the n best documents, ways to locate and store the nearest neighbors to terms, and approaches to let users or retrieved documents help with screening of sometimes large sets of query expansion candidate terms.

Users also can identify which documents they like, and train the system through relevance feedback methods. These lead to weights on query terms, and on terms being added. Various approaches have been tested regarding how the old query fits in with the new (terms). Expensive hardware (e.g., Connection Machines for the DowQuest service) or clever algorithms (for term selection) can handle the large new queries that can result from feedback. We balance the benefits of added terms, with the problems of extra computation, and the scarcity of data regarding the distribution of (new) terms in the relevant vs. other documents.

Donna Harman is carrying forward her studies of ranking and feedback in the largest experimental effort of its kind, the TIPSTER and TREC projects, with almost forty research groups exploring a variety of approaches of this kind to a very large test collection. TIPSTER has entered Phase II of its funding, and TREC-4 will be held later in 1995.

Objectives

From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can operationalize this in terms of considering the MARIAN, and INQUERY systems. We can also look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: inverted vs. document files, dictionaries with nearest neighbors, accumulators for document similarities vs. pruning methods, etc.

Another objective calls for students to: read and understand research contributions ... . The SALT75b reference was a research article when written, and gives a flavor of the interaction of theory and experimentation in this field. The text chapters are more like review articles, summarizing the history, and trying to organize the results given in many papers.

In summary, this Unit has the following objectives, for students to be able to:

  1. effectively use the INQUERY system for carrying out searches, using natural language queries and relevance feedback;

  2. know how to make relevance judgments for IR research, and do that for the MARIAN system, characterized by its very short document descriptions;

  3. explain the key ideas behind the various ranking and feedback algorithms that aim to make these processes efficient; and

  4. compare and contrast the various ranking and feedback methods regarding retrieval effectiveness, user effort required, storage requirements, and efficiency.

Suggested Procedure

There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should repeat the CD-ROM exercise using the MARIAN system and also use the INQUERY system. Note that it is wisest to do the readings before the computer exercise, though you may want to experiment a little with the system earlier, to motivate and provide a concrete foundation for the chapters.

Exercises

There are two parts to the exercise: searching for publications about CD-ROM using the MARIAN system, and using the INQUERY system.

MARIAN Searching

The MARIAN system has been under development in the Computing Center as an alternative to VTLS for searching. There are interfaces for NeXT computers, for systems that emulate a VT100 with curses, for gopher+ clients, and for WWW clients.

MARIAN is now in beta test, and is being extended and improved.. It usually works, but try again later if there is no response or if the results seem not correct (in which case we would appreciate a detailed account of the anomalies observed).

Please find and report upon "publications about CD-ROM" using MARIAN. You can copy and paste the result set directly from your searches. In particular, please send the instructor:

INQUERY

Please study about the INQUERY system so you can use it effectively, including applying relevance feedback.

INQUERY WWW information includes a good introduction saying how it was used in the TIPSTER project. Please also read about its features and study the syntax used for advanced queries so you can try some of those as well as simple "natural language".

Please carry out the following exercise, for each of 2-3 databases that work with INQUERY, accessible from the WWW pages mentioned above (e.g., Stat-USA or sample databases or Thomas):

Study Questions

You should be able to answer each of the following questions.

  1. How might one implement a retrieval system with natural language queries, ranking and relevance feedback?
  2. What are the sources for new terms in query expansion, and what are the ways to screen or narrow down to find the best ones to add?
  3. What combinations of weighting factors and similarity functions seem best?
  4. What is the interaction between retrieval models and their ability to do ranking? query expansion? relevance feedback?
  5. What type of weighting is based on discrimination value?
  6. Which formula for relevance feedback, developed in connection with the probabilistic model, seems to perform best?
  7. What is the effect of document length on choice of ranking systems?
  8. What special data structures and algorithms are needed to support an efficient implementation of ranking and relevance feedback?
  9. What data structures are most useful for recording information relating to query expansion?

Comments on Readings

In this collection of readings there are many typographical and other minor errors. Most are detailed below --- if any others are found, please tell the instructor so a complete list can be developed. In addition, many comments, clarifications, points of disagreement, etc. are provided.

Textbook Chapter 14

In general, this is a good chapter, with introduction, history, experimental results, guidelines for implementers, enhancements, and a summary of related topics. The organization makes sense, but does cause some findings to be mentioned several times. This chapter should be read before Chapter 11, but there are cross references, and some repetition between the two.

Regarding Chapter 14, there are many specific comments to be made:

Textbook Chapter 11

In Chapter 11, there is good discussion on feedback methods. Little is said about the underlying theories, or on issues relating to statistics, probability, or AI. However, many details are given regarding algorithms and implementations.

SALT75b

This article gives some coverage on the vector space model, and some results on ranking, especially using discrimination value (DV) weighting. Of particular relevance are the first page (p. 613) and Figure 1, the weighting discussion on p. 615 in Section 2, Table III on p. 618, and the last two pages (pp. 619-620). Figure 7 is very important. Note that in Figure 8, the key is misleading, since the better curve is for phrase assignment.

Summary of Key Concepts

  1. Many users find it easier to submit natural language queries, i.e., queries that are simply (long) lists of good keywords or phrases, instead of build complex Boolean queries.
  2. Similarity measures that consider collection statistics can be used to prepare rankings of retrieved documents, that attempt to present relevant documents before others.
  3. Expanding user queries, with terms from relevant retrieved documents or other sources (e.g., morphological or thesaurus processing), can often improve effectiveness, especially with good term selection or screening, weighting, and similarity computation.
  4. Reweighting based on relevance feedback data can improve the effectiveness of document ranking --- essentially training the system regarding the terms that relate to an information need.
  5. Data structures and implementations for ranking and relevance feedback are derived from an overall model (e.g., vector, probabilistic) and tuned to classes of, and individual instances of, document collections.

References

1
E. A. Fox. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. PhD thesis, Cornell University Dept. of Computer Science, August 1983. Available from University Microfilms Int.

2
G. Salton, C. Buckley, and E. A. Fox. Automatic query formulations in information retrieval. Journal of the American Society for Information Science, 34(4):262--280, July 1983.

3
G. Salton, E. A. Fox, and E. Voorhees. Advanced feedback methods in information retrieval. Journal of the American Society for Information Science, 36(3):200--210, May 1985.

4
G. Salton, E. Voorhees, and E. Fox. A comparison of two methods for Boolean query relevance feedback. Information Processing & Management, 20(5-6):637--651, 1984.



Edward A. Fox
Mon Oct 2 18:13:11 EDT 1995