CS5604, Unit RR
Edward A. Fox
Department of Computer Science,
Virginia Tech, Blacksburg VA 24061-0106
Abstract:
Modern advanced retrieval systems have two main features that users
find particularly useful, which are
not found in Boolean or
string-matching type systems.
First, they attempt to rank results so that those most likely to be relevant
are seen soonest. Second, they learn about the users' interest through a
type of feedback mechanism.
In this Unit we explore ranking and feedback, and discuss the underlying
issues in terms of weighting methods, similarity measures, query
expansion and screening approaches. Data structures, algorithms,
implementation guidelines, and experimental results are given.
This Unit has two chapters, one article (with about 3-4 pages of relevant
content) and two exercises.
- Lecture/Motivating Question:
- Is there a single best combination
of model, weighting, feedback, query expansion, and query screening to
be used? If not, can we pick a very good one for each type of collection
we encounter?
- Special Class Presentations:
- The lectures
provide an overview on the
chapters and reading and include discussion of the
probabilistic model.
- Homework Exercise
- The MARIAN system makes use of specialized
ranking methods (similar to the optimizations discussed in the text)
to search essentially the same data as VTLS. It will be a useful
comparison to redo the earlier assignment involving VTLS, regarding
publications about CD-ROM. Also, it will be of value to work with
the INQUERY system to gain experience with the underlying
search and feedback software.
- Computer Exercise
- The MARIAN and INQUERY systems can be
run on WWW.
Beginning in the late 1950's, researchers became aware of the value of
statistical data in the processes of automatic indexing and retrieval.
The
vector space and probabilistic models, along with variants,
embellishments and alternative approaches, have led to theoretical,
implementation, experimental and a small number of commercial
developments. In particular, the DowQuest service developed by
Thinking Machines Inc. (TMI), and Personal Librarian from Personal
Library Software Corp. (a derivative of the SIRE system) were
available in the 1980's, and WIN from West Publishing Company
debuted in October 1992. The popular WAIS system for network
information access became popular in 1991, as an outgrowth of the
earlier efforts by TMI. For library catalog searching, the MARIAN
system at Virginia Tech will become available by 1995. The INQUERY
has been under development since 1990, and is being applied in a
number of experimental and demonstration projects.
Part of the allure of these approaches is their simplicity, from a user
perspective. Users submit natural language statements which the
computer then processes --- returning in black-box manner a ranked list
of likely results. Interface designers can grapple in a variety of ways
with
the problem of helping users build a mental model of the process and
understand why the documents are retrieved and ranked in the given
order. As long as mostly relevant items are at the top of the list,
however, normal users will be quite pleased.
Inside the retrieval
system, a variety of operations, sometimes including stemming
and nearest-neighbor term expansion, can be applied, along with
computation of similarity values that rely on document statistical
characteristics. The likely benefit of longer queries is increased
recall,
and the likely benefit of good similarity and weighting schemes is
increased precision.
How can we identify the best similarity measures? Building upon various
theoretical foundations, and testing on a variety of collections through
numerous experimental studies, general guidelines have emerged. These
have led to usable data structures, efficient algorithms for building
and
using those structures, and both prototype and commercial
implementations (e.g., WIN from West costing perhaps $8M).
These include methods to store frequency and normalizing values,
schemes to accumulate partial results and prune the search process for
the n best documents, ways to locate and store the nearest
neighbors to terms,
and approaches to let users or retrieved documents help with screening
of sometimes large sets of query expansion candidate terms.
Users also can identify which documents they like, and train the
system through relevance feedback methods. These lead to weights
on query terms, and on terms being added. Various approaches have
been tested regarding how the old query fits in with the new (terms).
Expensive hardware (e.g., Connection Machines for the DowQuest
service) or clever algorithms (for term selection) can handle the large
new queries that can result from feedback. We balance the benefits of
added terms, with the problems of extra computation, and the scarcity of
data regarding the distribution of (new) terms in the relevant vs. other
documents.
Donna Harman is carrying forward her studies of ranking and feedback
in the largest experimental effort of its kind, the TIPSTER and TREC
projects, with almost forty research groups exploring a variety of
approaches of this kind to a very large test collection.
TIPSTER has entered Phase II of its funding, and
TREC-4 will be held later in 1995.
From the Course Objectives a key point is the objective to:
critique, contrast, compare, and evaluate the efficiency, effectiveness,
and utility of commercially available and research prototype systems for
IS&R. We can operationalize this in terms of considering the MARIAN,
and INQUERY systems. We can also look at the fundamental issues in this
regard, involving data and files structures and their coupling with
algorithms: inverted vs. document files, dictionaries with nearest
neighbors, accumulators for document similarities vs. pruning methods,
etc.
Another objective calls for students to: read and understand research
contributions ... . The SALT75b reference was a research article when
written, and gives a flavor of the interaction of theory and
experimentation in this field. The text chapters are more like review
articles, summarizing the history, and trying to organize the results given
in many papers.
In summary, this Unit has the following objectives, for students to be
able to:
- effectively use the INQUERY system for
carrying out searches, using natural language queries and relevance
feedback;
- know how to make relevance judgments for IR research, and
do that for the MARIAN system, characterized by its
very short document descriptions;
- explain the key ideas behind the various ranking and feedback
algorithms that aim to make these processes efficient; and
- compare and contrast the various ranking and feedback methods
regarding retrieval effectiveness, user effort required, storage
requirements, and efficiency.
There are two main types of effort required. First, the readings (see
Section 5) should be carefully studied, keeping unit objectives in mind
(see Section 3). Second, students should repeat the CD-ROM exercise
using the MARIAN system and also use the INQUERY system.
Note that it is wisest to do the readings before the computer exercise,
though you may want to experiment a little with the system earlier, to
motivate and provide a concrete foundation for the chapters.
There are two parts to the exercise: searching for publications about
CD-ROM using
the MARIAN system, and using the INQUERY
system.
The MARIAN system has been under development in the Computing
Center as an alternative to VTLS for searching. There are interfaces for
NeXT computers, for systems that emulate a VT100 with curses, for
gopher+ clients, and for WWW clients.
MARIAN is now in beta test, and is being extended and improved.. It
usually works, but try again later if there is no response
or if the results seem not correct (in which case we would
appreciate a detailed account of the anomalies observed).
Please find and report upon "publications about CD-ROM" using
MARIAN.
You can copy and paste the result set directly from your searches.
In particular,
please send the instructor:
- the best query you used to search for works about CD-ROM,
- the results set (top 20) from that search,
- your explanation of how you think the system is ranking works
(e.g., by number of matching words with the query, using different
weights for different query terms),
- a detailed account of any problems you found with MARIAN.
Please study about the
INQUERY
system
so you can use it effectively, including applying relevance feedback.
INQUERY WWW information includes a good
introduction
saying how it was used in the TIPSTER project.
Please also read about its
features
and study the
syntax used for advanced
queries
so you can try some of those as well as simple "natural language".
Please carry out the following exercise, for each of 2-3 databases
that work with INQUERY, accessible from the WWW pages mentioned above
(e.g.,
Stat-USA
or
sample
databases
or
Thomas):
- Write down a 20-50 word version of a statement of something of
interest to you (or a member of your group), and send that to the
instructor.
- Use INQUERY to do an initial "natural language" search based on
the interest statement. Send the results to the instructor, along
with at least 5 relevance judgments.
- Use INQUERY to do a feedback search, and send the instructor
the results, along with a short statement on whether they seem
to be better than were those of the first query.
- Use INQUERY to do a search using at least one of the advanced
syntax capabilities, and send the instructor
the results, along with a short statement on whether they seem
to be better than were those of the first query.
You should be able to answer each of the following questions.
- How might one implement a retrieval system with natural
language queries, ranking and relevance feedback?
- What are the sources for new terms in query expansion, and what
are the ways to screen or narrow down to find the best ones to add?
- What combinations of weighting factors and similarity functions
seem best?
- What is the interaction between retrieval models and their ability
to do ranking? query expansion? relevance feedback?
- What type of weighting is based on discrimination value?
- Which formula for relevance feedback, developed in connection
with the probabilistic model, seems to perform best?
- What is the effect of document length on choice of ranking
systems?
- What special data structures and algorithms are needed to support
an efficient implementation of ranking and relevance feedback?
- What data structures are most useful for recording information
relating to query expansion?
In this collection of readings there are many typographical and other
minor errors. Most are detailed below --- if any others are found, please
tell the instructor so a complete list can be developed. In addition, many
comments, clarifications, points of disagreement, etc. are provided.
In general, this is a good chapter, with introduction, history,
experimental results, guidelines for implementers, enhancements, and a
summary of related topics.
The organization makes sense, but does cause some findings to be
mentioned several times.
This chapter should be read before Chapter 11, but there are cross
references, and some repetition between the two.
Regarding Chapter 14, there are many specific comments to be made:
- p. 363, very bottom, note that the robustness of natural language
queries that is mentioned depends purely on having long enough queries,
so that incorrect terms can be ignored.
- p. 364, very top, note that it is not certain how precise one can be
with natural language queries when searching large full-text collections.
- p. 365, figure, bottom section, in each group in the table, the
third line is misleading. The vector shown is the and of the query
and the record, and the number is the dot product.
- p. 368, figure, note that there are two names for each of the
equations at the bottom. Thus you can use w sup 1 or F sub 1
to refer to the first equation.
- p. 369, 4th paragraph, discusses two types of experiments. The
retrospective ones give an upper bound on performance using this
approach, showing what would be optimal in those circumstances. The
predictive experiments suggest how the method would work in
actual practice.
- p. 370, first equation, add in a left parenthesis. Regarding the
second equation, see also p. 375, where it is repeated. When less than
or equal to 0.5
we have the equation shown in the middle of p. 375, for w sub iq
which was developed by Fox
[1].
- p. 370-1, omit Sections 14.3.3-4.
- p. 372, end of section 14.4.1, note that tf*idf or something very
much like it is being discussed here.
- p. 373, top, note that latent semantic indexing (LSI) aims to
find the key concepts, or orthogonal semantic aspects, reducing the
dimensionality of the document-term matrix. Recently, there have been
some new algorithms to speed up the indexing.
- p. 374, item 2, note that this encourages use of tf*idf. However,
the tf component may be ill-advised when there are
very short documents (like those used with MARIAN).
- p. 377, item 1, the first line is really just a conjecture --- we do
not yet know how well we can do with large collections of long texts,
using only ranking, and avoiding adjacency and field restrictions. This is
also the case with the similar conjecture that underlies the discussion
under item 2.
- p. 378, par. after the figure, recall p. 34 and Figure 3.5.
- p. 380, 2nd from last par., recall Chapter 4. Note that a sort is
not necessary if one only wants the top k items. In that case, one can
do the job in linear time using an array of size k that stores the
document number and weight for the k items that have been seen so
far that have the highest similarity values. This process of accumulators
and finding the top k items was used in the REVTOLC experiment at
VPI&SU.
- p. 387, you can omit Section 14.8.
In Chapter 11, there is good discussion on feedback methods. Little is
said about the underlying theories, or on issues relating to statistics,
probability, or AI. However, many details are given regarding
algorithms and implementations.
- p. 249, 3rd line from bottom, omit the dash in three-vector.
- p. 251, 2nd par., note that they coined the term searchonym.
- p. 251, 3rd par., note that there are various other works on
automatic Boolean query construction including
[1,2,4,3].
- p. 253, 1st new par., note that Connection Machines have been
used for the WAIS system, which is discussed elsewhere in this unit.
- p. 253, end of first par. in 11.4.1, note that the discussion of (Fox
1987) is a bit misleading in that only survey results were discussed in that
article, with no effectiveness figures, and the interface undoubtedly
confounded observations.
- p. 254, first new par., note again that there was other work on
Boolean query expansion. Also, note that front end systems do not
usually affect the host --- this argument is weak.
This article gives some coverage on the vector space model, and some
results on ranking, especially using discrimination value (DV)
weighting.
Of particular relevance are the first page (p. 613) and Figure 1, the
weighting discussion on p. 615 in Section 2, Table III on p. 618, and the
last two pages (pp. 619-620). Figure 7 is very important.
Note that in Figure 8, the key is misleading, since the better curve is for
phrase assignment.
- Many users find it easier to submit natural language
queries,
i.e., queries that are simply (long) lists of good keywords or phrases,
instead of build complex Boolean queries.
- Similarity measures that consider collection statistics can be
used
to prepare rankings of retrieved documents, that attempt to present
relevant documents before others.
- Expanding user queries, with terms from relevant retrieved
documents or other sources (e.g., morphological or thesaurus
processing), can often improve effectiveness, especially with good term
selection or screening, weighting, and similarity computation.
- Reweighting based on relevance feedback data can improve the
effectiveness of document ranking --- essentially training the
system
regarding the terms that relate to an information need.
- Data structures and implementations for ranking and relevance
feedback are derived from an overall model (e.g., vector, probabilistic)
and tuned to classes of, and individual instances of, document
collections.
References
- 1
-
E. A. Fox.
Extending the Boolean and Vector Space Models of Information
Retrieval with P-Norm Queries and Multiple Concept Types.
PhD thesis, Cornell University Dept. of Computer Science, August
1983.
Available from University Microfilms Int.
- 2
-
G. Salton, C. Buckley, and E. A. Fox.
Automatic query formulations in information retrieval.
Journal of the American Society for Information Science,
34(4):262--280, July 1983.
- 3
-
G. Salton, E. A. Fox, and E. Voorhees.
Advanced feedback methods in information retrieval.
Journal of the American Society for Information Science,
36(3):200--210, May 1985.
- 4
-
G. Salton, E. Voorhees, and E. Fox.
A comparison of two methods for Boolean query relevance feedback.
Information Processing & Management, 20(5-6):637--651, 1984.
Edward A. Fox
Mon Oct 2 18:13:11 EDT 1995