- Given query A and B and C, and a document with
weights for terms A, B, C of 0.8, 0.6, 0.4, what is the
similarity resulting from each of the following extended
Boolean schemes:
a) MMM. b) Paice. c) Pnorm.
- The Harman and Candela method for building an inverted
file supports future growth by using linked lists that have
............(what information?) for each
............(term? document?), which
can later be walked through to build a dictionary and a(n)
............(what type?) file. [Say what goes in the blanks above,
considering the hints given in parentheses after each.]
- Using the FAST-INV algorithm, where the approach
taken is for the document vector file to be copied to
reflect the separation into loads, what is the number of passes
required,
and the purpose
of each pass, through the document vectors?
- If a computer supports or'ing of pairs of 32-bit words,
then finding the union of 2 bit vectors,
for a collection of 1000 documents, would take roughly
how many or operations?
- Assume a collection of 1000 documents
and a hashing function that evenly distributes
document ids across 13 buckets. Assume that on average,
a term can be found in 25 documents.
[Think through this one carefully - it is easy to get
confused here. If you make any assumptions, or define
any variables to use in formulas, be sure to explain them
carefully.]
a) To index all of the documents, and build an inverted
file, creating a set for each term, how many Create and how
many Insert operations are required?
b) For a given term, what is the average
number of documents assigned to a bucket?
c) Approximately
how many Member operations are required in the
hashing approach to find the intersection for two
terms?