Questions for Form A

In WAIS, text databases or information collections are often called "sources." Searching with WAIS can involve which of the following:
a) using a query to search for the right databases to work with.
b) using a query to search inside one or more databases.
c) examining a ranked list of results.
d) using a "like this document" scheme that is similar to (or a type of) relevance feedback.
e) all of the above.
f) exactly 2 out of choices a through d (which?).
g) exactly 3 out of choices a through d (which?).
h) none of the above.

When we refer to WAIS or MARIAN (or the new West Publishing system, WIN) making use of a natural language query, what do we mean? Give an example of a natural language query for finding books about Albert Einstein and his theory of general relativity. Do these systems really understand the query, as a natural language understanding system would aim to do?

Weighting schemes based on term frequency statistics (counts of the number of times a term occurs in a document) are most helpful:
a) when documents are very short (e.g., titles only).
b) when documents are of moderate length (e.g., a 200 word abstract).
c) when documents are very long (e.g., full-length books).

Please explain/justify your answer.

a) First, assume you are doing a retrospective study (i.e., using full relevance data to ascertain the optimal performance) based on the probabilistic model. Assume you know the number of documents in the collection (N), the number of relevant documents per query (R), the number of documents (n) having the term of interest (t), and the number of relevant documents having term t (r). Based on the studies of Sparck Jones, what formula will give you good results if you use it for weighting term t? Please write down the formula.

b) Now consider using this formula in the predictive case of relevance feedback. Explain briefly how you might get estimates for the four values mentioned above so that you can still use the formula. (Hint: Let w be the number of documents retrieved in the relevance feedback step, x the number of relevant documents found in that step, y_t the number of documents having term t found in that step, and z_t the number of relevant documents having term t found in that step.)

Assume that you are designing a retrieval system that will support matching of words or word stems (depending on user preference), that will automatically substitute a proper or canonical term when either of its two closest synonyms is used in a query, that will use TF*IDF weighting, and that will apply the Ide dec-hi relevance feedback scheme with cosine similarity. Please describe and explain the files needed to support all this. You may do so in any of the following ways: sketch the files with their contents and interrelationships (with arrows for pointers, possibly with sample data values), list a series of tables along with their attributes, or specify a relational schema.


fox@cs.vt.edu
Tue Aug 30 04:42:03 EDT 1994