CS5604, Unit IF

Edward A. Fox
Department of Computer Science
Virginia Tech, Blacksburg VA 24061-0106

Abstract:

Almost all commercially available information retrieval systems allow users to submit Boolean queries. These, along with many of the research systems, use inverted files to aid in query processing. To better understand these systems, we consider inverted files and methods to create them, as discussed in Chapter 3. The actual processing of Boolean queries involves operations on sets (that are identified from the inverted file), which are discussed in Chapter 12. Finally, in Chapter 15 we consider extensions to Boolean query processing that has been shown in research studies to yield significantly higher precision at all levels of recall.

This Unit deals with three chapters and one computer exercise. In addition, there is one optional reading, for those with particular interest in extended Boolean retrieval methods. Class time will be devoted primarily to lectures, questions and individual work.

Unit Highlights

Lecture/Motivating Question:
What research related to Boolean information retrieval systems might make them more efficient and/or effective?

Computer Exercise:
The code related to discussions in Chapter 12 is available online, and will be studied and applied to some simple problems.

Introduction

Boolean query processing is at the heart of most commercially available retrieval systems, and with extensions, can be important for future systems. On the computer we experiment with programs that implement the most important low-level operations.

The underlying structures are inverted files and postings files. These can be more effectively built if disk sorting is avoided, which is feasible with large primary memories. Two such approaches are discussed in Chapter 3.

The postings themselves yield sets of document identifiers, on which operations like union and intersection can take place. If bit vectors or hash tables are used to represent these sets, then the required operations can be easily carried out, as can be seen in Chapter 12.

Focusing on effectiveness, on the other hand, we see that Boolean queries should be interpreted in a soft scheme. Tradeoffs in time complexity of those computations, versus amount of improvement likely in a retrieval, must be considered.

Objectives

From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: building inverted files, performing set operations that correspond to the Boolean operators, and evaluating extended Boolean queries. In particular, for these three areas, we support the objective of being able to: select ... algorithms and data structures for IS&R systems.

Another objective calls for students to: effectively use ... search ... access systems --- for common tasks. This depends on experimentation with the laboratory routines and applying an understanding of the internal representations and processing schemes.

In particular, this Unit has the following objectives, for students to be able to:

  1. describe the process of online searching, including offline preparation, developing a query strategy, interactively searching, and obtaining results;

  2. design and write efficient programs to build an inverted file;

  3. design and write efficient programs to carry out the set operations needed for Boolean query processing;

  4. design and write programs to compute MMM, Paice, and P-norm similarities between documents and extended Boolean queries;

  5. give plausible arguments regarding why the extended Boolean query schemes may make retrieval more effective; and

  6. apply knowledge about Boolean query processing to the design of future, improved, information retrieval systems, or the re-design (for improved efficiency and effectiveness) of current ones.

Suggested Procedure

There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should work with the programs to better understand the Boolean operations.

Computer Exercise

In this exercise you should work with the code on fox.cs.vt.edu that is in the directory called /u1/gopher-data/CS5604/Code/bool. It is accessible with gopher, too. This subdirectory is entitled bool and corresponds to the code in chapter 12. Further, you can access it through FTP or WWW (click here with Mosaic).

You should study the source (.c) and include (.h) files. You may wish to focus on the files that are not in your text, such as the driver programs.

Note that all of these are in the C language. Work closely with 1-3 partners, making sure that at least one person in the group knows C.

Please look at the README file and the makefile to see what testing is recommended. You should do the regression testing, and discuss the results. You may need to take out some of the comments in the makefile for this to work correctly. The bvdriver and hdriver programs should work on any DECstation if you transfer them by gopher, or should run directly on fox.cs.vt.edu. Please give to the instructor by electronic mail or on paper: a copy of all results from running the driver programs, a brief discussion of their meaning, and any comments you have about how these programs might be used in a retrieval system. This should not take more than the equivalent of a page. Hint: consider how you might get away without building and accessing an inverted file if you could use the set processing routines.

Study Questions

You should be able to answer each of the following questions.

  1. Why is inverted file the name given to that type of data structure?
  2. Given the binary document/query matrix below, what would the document file look like, and what would the inverted file look like? You may use simple lists, properly labeled, one per line. Refer to text Figure 3.9 or other examples. The matrix has a row for each document, and a column for each term.

    Doc. Terms
    No. 1 2 3 4 5 6 7
    1 1 0 0 0 1 1 1
    2 1 0 0 0 0 0 0
    3 1 0 0 0 0 1 0
    4 0 1 1 0 0 1 0

  3. The FAST-INV algorithm essentially carries out what common matrix operation? What is the time and space complexity?
  4. What would be the space requirements for bit vector and hash table representations of a collection with 1 million documents and 200,000 terms?
  5. What would be the special advantages or disadvantages of MMM vs. Paice vs. P-norm, if the queries involved were all very, very long?

Comments on Readings

The supplemental, optional reading [1] gives additional background on the overall topic of Chapter 15, that is, extended Boolean approaches. It focuses on the P-norm model, and gives experimental evidence for its utility. Less relevant to the course focus, but also of interest, is the discussion of the mathematical properties of the P-norm model, which some feel do not adequately deal with the requirements of purist Boolean searchers.

In the rest of this collection of readings there are many typographical and other minor errors. Most are detailed below --- if any others are found, please tell the instructor so a complete list can be developed. In addition, many comments, clarifications, points of disagreement, etc. are provided. Note: these refer to the first printing of the text, and should have been corrected in your edition.

Textbook Chapter 3

Regarding Chapter 3, there are many specific comments to be made:

Textbook Chapter 12

In Chapter 12, there are many points where I disagree with the author, or where commentary is appropriate.

Textbook Chapter 15

There are various typographical errors in Chapter 15, and places where comments are of value.

Summary of Key Concepts

  1. Inverted files are widely used to support Boolean query processing, which also involves set manipulation, commonly done on sorted lists. Fast construction of those files exploits rapid in-memory computations, especially: sorting of small lists, and sorted list insertion. Then the remaining steps generally have linear complexity.

  2. Set manipulation can also be done using bit vectors, or for large ranges in values, hash tables. Most processing has linear complexity, and that may be sub- linear for bit vectors if operations on entire bytes or words has hardware support.

  3. Computing similarity between Boolean queries and documents with the MMM, Paice, and P-norm schemes has led to improvements in precision, at fixed recall levels, compared to conventional Boolean query processing. Softening the query interpretation, with ranking made possible as a result, can help better model users' view of these queries.

References

1
G. Salton, E.A. Fox, and H. Wu. Extended Boolean information retrieval. Communications of the Association for Computing Machinery, 26(11):1022--1036, November 1983.



Edward A. Fox
Fri Sep 15 19:07:50 EDT 1995