CS5604, Unit IF
Edward A. Fox
Department of Computer Science
Virginia Tech, Blacksburg VA 24061-0106
Abstract:
Almost all commercially available information retrieval
systems allow users to submit Boolean queries. These,
along with many of the research systems, use inverted
files to aid in query processing. To better understand
these systems, we consider inverted files and methods to
create them, as discussed in Chapter 3. The actual
processing of Boolean queries involves operations on
sets (that are identified from the inverted file), which
are discussed in Chapter 12. Finally, in Chapter 15 we
consider extensions to Boolean query processing that have
been shown in research studies to yield significantly
higher precision at all levels of recall.
This Unit deals with three chapters and one computer
exercise. In addition, there is one optional reading, for
those with particular interest in extended Boolean
retrieval methods. Class time will be devoted primarily
to lectures, questions and individual work.
- Lecture/Motivating Question:
- What research
related to Boolean information retrieval systems might
make them more efficient and/or effective?
- Computer Exercise:
- The code related to
discussions in Chapter 12 is available online,
and will be studied and applied to some simple
problems.
Boolean query processing is at the heart of most
commercially available retrieval systems, and with
extensions, can be important for future systems.
On the computer we
experiment with programs that implement the most
important low-level operations.
The underlying structures are inverted files and postings
files. These can be more effectively built if disk sorting
is avoided, which is feasible with large primary
memories. Two such approaches are discussed in
Chapter 3.
The postings themselves yield sets of document
identifiers, on which operations like union and
intersection can take place. If bit vectors or hash tables
are used to represent these sets, then the required
operations can be easily carried out, as can be seen in
Chapter 12.
Focusing on effectiveness, on the other hand, we see that
Boolean queries should be interpreted in a soft
scheme. The tradeoff between the time complexity of those
computations and the likely improvement in retrieval
effectiveness must be considered.
From the Course Objectives a key point is the
objective to: critique, contrast, compare, and evaluate
the efficiency, effectiveness, and utility of commercially
available and research prototype systems for IS&R.
We can look at
the fundamental issues in this regard, involving data and
file structures and their coupling with algorithms:
building inverted files, performing set operations that
correspond to the Boolean operators, and evaluating
extended Boolean queries.
In particular, for these three areas, we support the objective
of being able to:
select ... algorithms and data structures for IS&R systems.
Another objective calls for students to: effectively use ...
search ... access systems --- for common tasks. This
depends on experimenting with the
laboratory routines and on applying an understanding of
the internal representations and processing schemes.
In particular, this Unit has the following objectives, for
students to be able to:
- describe the process of online searching,
including offline preparation, developing a query
strategy, interactively searching, and obtaining results;
- design and write efficient programs to build an
inverted file;
- design and write efficient programs to carry out
the set operations needed for Boolean query processing;
- design and write programs to compute MMM,
Paice, and P-norm similarities between documents and
extended Boolean queries;
- give plausible arguments regarding why the
extended Boolean query schemes may make retrieval
more effective; and
- apply knowledge about Boolean query processing
to the design of future, improved, information retrieval
systems, or the re-design (for improved efficiency and
effectiveness) of current ones.
There are two main types of effort required. First, the
readings (see Section 5) should be carefully studied,
keeping unit objectives in mind (see Section 3). Second,
students should work with the programs to better
understand the Boolean
operations.
In this exercise you should work with the code on
fox.cs.vt.edu that is in the directory called
/u1/gopher-data/CS5604/Code/bool.
It is accessible with gopher, too.
This subdirectory, bool,
corresponds to the code in Chapter 12.
Further, you can access it through FTP or WWW.
You should study the source (.c) and include (.h) files.
You may wish to focus on the files that are not in your
text, such as the driver programs.
Note that all of these are in the C language. Work
closely with 1-3 partners, making sure that at least one
person in the group knows C.
Please look at the README file and the makefile to see
what testing is recommended. You should do the regression
testing, and discuss the results.
You may need to take out some of the comments in the
makefile for this to work correctly.
The bvdriver and hdriver programs should work
on any DECstation if you transfer them by gopher, or should
run directly on fox.cs.vt.edu.
Please give to the instructor by
electronic mail or on paper:
a copy of all results from running the driver programs,
a brief discussion of their meaning, and
any comments you have about how these programs
might be used in a retrieval system. This should not take more
than the equivalent of a page. Hint: consider how you might get
away without building and accessing an inverted file if you could
use the set processing routines.
You should be able to answer each of the following
questions.
- Why is inverted file the name given to that
type of data structure?
- Given the binary document/query matrix below,
what would the document file look like, and what would
the inverted file look like? You may use simple lists,
properly labeled, one per line. Refer to text Figure 3.9
or other examples. The matrix has a row for each
document, and a column for each term.
Doc.         Terms
No.    1  2  3  4  5  6  7
 1     1  0  0  0  1  1  1
 2     1  0  0  0  0  0  0
 3     1  0  0  0  0  1  0
 4     0  1  1  0  0  1  0
- The FAST-INV algorithm essentially carries out
what common matrix operation? What is the time and
space complexity?
- What would be the space requirements for
bit vector and hash table representations of a collection
with 1 million documents and 200,000 terms?
- What would be the special advantages or
disadvantages of MMM vs. Paice vs. P-norm, if the
queries involved were all very, very long?
The supplemental, optional reading
[1]
gives additional background
on the overall topic of Chapter 15, that is, extended Boolean
approaches. It focuses on the P-norm model, and gives experimental
evidence for its utility. Less relevant to the course focus, but
also of interest, is the discussion of the mathematical properties
of the P-norm model, which some feel do not adequately deal with
the requirements of purist Boolean searchers.
In the rest of this collection of readings there are many
typographical and other minor errors. Most are
detailed below --- if any others are found, please tell the
instructor so a complete list can be developed. In
addition, many comments, clarifications, points of
disagreement, etc. are provided.
Note: these refer to the first printing of the text, and should
have been corrected in your edition.
Regarding Chapter 3, there are many specific comments
to be made:
- p. 29, line 5, change 10 percent to less
than 10 percent
- p. 29, line 2 of first bulleted entry, change in
the text to in the text or attached to it
- p. 29, 2nd bullet, consider the quotation To
be or not to be, that is the question and what is left after
stopword removal.
- p. 29, last two bullets, consider the abbreviation
AT&T, how the embedded punctuation is handled,
and what would be left from indexing after stopword
removal (where single letters are usually removed too).
- p. 29, top paragraph, is rather unclear. Think
of the process as taking a word and finding the
information for it --- usually a list of record identifiers
indicating where it occurs. Then the list can
be read and used.
- p. 30, early part of 3.2, consider the inverted file
as a way to start with a string or number, and find a set
of occurrences. This can be done in one step, by using a
B-tree with variable length leaves, or in two steps by
using other search schemes, and a pointer to the
description of the set of occurrences. Note that a
scheme that maintains a sorted order has the added
benefit of also supporting alphabetical browsing and
truncation operations in the list of terms.
- p. 32, 9th line from bottom, change 3.4.1 to
3.4.1 or 3.4.2.
- p. 33, bottom, exercise for the researcher ---
look at a real collection and plot a histogram of the number
of postings.
- p. 33, exercise for the researcher --- why and
when would it make sense to use a binary search over an
inverted file? Why should the lists be doubly linked?
What is the cost of updating a list?
- p. 35, Figure 3.6, in-depth reading --- for the
data in Figure 3.4, show these representations.
- p. 35, 2nd line, exercise for the researcher ---
why use a right-threaded tree?
- p. 36, Table 3.1, in-depth reading --- what is the
main benefit of new vs. old?
- p. 37, 3rd line of section 3.4.2, change
principles to facts.
- p. 40, middle of page, exercise for the researcher
--- How does the buffering help? What is the effect of
buffer size?
- p. 41, Figure 3.9, in-depth reading --- What is
the value of M for this example? Can you explain
how the 4 for term 11 in Load 2 gets to
that particular place, from its place in the entry in
Load 2 for document 4?
- p. 42, Table 3.2, fix the column headings for the
two right columns.
In Chapter 12, there are many points where I disagree
with the author, or where commentary is appropriate.
- p. 264, line 3, change last word operations
to expressions. Make this same change on p. 265,
line 20.
- p. 267, middle of page, change other
chapters to Chapter 3.
- p. 268, middle of page, correct the derivation, to
get the proper result set doc1, doc3, doc5.
- p. 268, bottom of page, the discussion on sets
does not seem very relevant or necessary.
- p. 269, line 2, it would be nice to have a
reference for the term element data type.
- p. 273, 4th line, change initialized to
identified. In the last sentence on the page, it is unclear
exactly what is referred to.
- p. 273, Figure 12.3, last group of lines, change
the Unite and Subtract section to correctly do
what is stated on p. 268.
- p. 274, line 12, should large and small
be reversed?
- p. 274, end of 3rd par., see also discussion in
Chapter 3.
- p. 275, line 3, change small to small to
medium since compression of bit vectors (studied by
Bookstein and his colleagues) can extend the utility of this approach.
- p. 276, middle par., change set of to set
drawn from all possible values of. In 2nd from last
sentence, add the example e.g., for Pascal after the
word compilers.
- p. 277, for the set structure, fix the
backslash for bits. A little after, the discussion on
garbage collection is not really well thought out.
- p. 282, first sentence at start of section 12.6, is
not very clear and unduly downplays the range of
applicability.
- p. 283, top par. is confusing and verbose. After
the typedefs, the discussion of char *, does not
match the actual declaration. Later, in the next par., the
7th line from page bottom, more explanation is needed
regarding hashing functions for several tables.
- p. 285, middle of page, add the before
application-supplied. Note that f(v) is not
explained here. However, the concept of hashing-based
applications having to allocate space based on the range
of values rather than number of entries, is important.
- p. 291 line 8, the use of N here corresponds to
W in the chart on the next page, Table 12.1. The first
sentence of the last par. on the page is overstated.
There are various typographical errors in Chapter 15,
and places where comments are of value.
- p. 393, line 6, change Chapter 10 to
Chapter 13
- p. 394, middle, move 1977 to after
McGill.
- p. 395, last line, change Fuzzy to
fuzzy.
- p. 398, text line 5, use P-values and P=2
- p. 399, line 3, change 10 to 14.
- p. 400, top structure declaration, fix the
indenting of comments.
- p. 401, Figure 15.3, note that the top part of the
figure shows the array indexed 0...7.
- p. 403, equation at bottom, should have
parenthesized expressions for all of what is on either side
of the divide symbol.
- p. 406, add space inside the string UNDEF-1
- Inverted files are widely used to support Boolean
query processing, which also involves set manipulation,
commonly done on sorted lists. Fast construction of
those files exploits rapid in-memory computations,
especially: sorting of small lists, and sorted list
insertion. Then the remaining steps generally have
linear complexity.
- Set manipulation can also be done using bit
vectors, or, for large ranges of values, hash tables. Most
processing has linear complexity, and it may be sub-linear
for bit vectors if operations on entire bytes or
words have hardware support.
- Computing similarity between Boolean queries
and documents with the MMM, Paice, and P-norm
schemes has led to improvements in precision, at fixed
recall levels, compared to conventional Boolean query
processing. Softening the query interpretation, with
ranking made possible as a result, can help better model
users' view of these queries.
References
- 1
-
G. Salton, E.A. Fox, and H. Wu.
Extended Boolean information retrieval.
Communications of the Association for Computing Machinery,
26(11):1022--1036, November 1983.
Edward A. Fox
Fri Sep 15 19:07:50 EDT 1995