Questions for Form A

  1. Given query A and B and C, and a document with weights for terms A, B, C of 0.8, 0.6, 0.4, what is the similarity resulting from each of the following extended Boolean schemes: a) MMM. b) Paice. c) Pnorm.

  2. The Harman and Candela method for building an inverted file supports future growth by using linked lists that have ............(what information?) for each ............(term? document?), which can later be walked through to build a dictionary and a(n) ............(what type?) file. [Say what goes in the blanks above, considering the hints given in parentheses after each.]

  3. Using the FAST-INV algorithm, where the approach taken is for the document vector file to be copied to reflect the separation into loads, what is the number of passes required, and the purpose of each pass, through the document vectors?

  4. If a computer supports or'ing of pairs of 32-bit words, then finding the union of 2 bit vectors, for a collection of 1000 documents, would take roughly how many or operations?

  5. Assume a collection of 1000 documents and a hashing function that evenly distributes document ids across 13 buckets. Assume that on average, a term can be found in 25 documents. [Think through this one carefully - it is easy to get confused here. If you make any assumptions, or define any variables to use in formulas, be sure to explain them carefully.] a) To index all of the documents, and build an inverted file, creating a set for each term, how many Create and how many Insert operations are required? b) For a given term, what is the average number of documents assigned to a bucket? c) Approximately how many Member operations are required in the hashing approach to find the intersection for two terms?


fox@cs.vt.edu
Tue Aug 30 04:41:34 EDT 1994