- In WAIS, text databases or information collections are often called
"sources". Searching with WAIS can involve which of the
following?
- a) using a query to search for the right databases to work with.
- b) using a query to search inside one or more databases.
- c) examining a ranked list of results.
- d) using a "like this document" scheme that is similar to (or a
type of) relevance feedback.
- e) all of the above.
- f) exactly 2 out of choices a through d (which?).
- g) exactly 3 out of choices a through d (which?).
- h) none of the above.
- When we refer to WAIS or MARIAN (or the new West Publishing
system, WIN) making use of a natural language query, what do we
mean? Give an example of a natural language query for finding
books about Albert Einstein and his theory of general relativity. Do
these systems really understand the query, as a natural language
understanding system would aim to do?
- Weighting schemes based on term frequency statistics (counts of the
number of times a term occurs in a document) are most
helpful:
- a) when documents are very short (e.g., titles only).
- b) when documents are of moderate length (e.g., a 200 word
abstract).
- c) when documents are very long (e.g., full-length books).
Please explain/justify your answer.
- a) First, assume you are doing a retrospective study (i.e., using full
relevance data to ascertain the optimal performance) based on the
probabilistic model. Assume you know the number of documents in the
collection (N), the number of relevant documents per query (R), the
number of documents (n) having the term of interest (t), and the number
of relevant documents having term t (r). Based on the studies of Sparck
Jones, what formula will give you good results if you use it for
weighting term t? Please write down the formula.
b) Now consider using this in the predictive case of relevance feedback.
Explain briefly how you might get estimates for the four values
mentioned above so you can still use this formula. (Hint: Let w be the
number of documents retrieved in the relevance feedback step, x the
number of relevant documents found in that step, y_t the number of
documents having term t found in that step, and z_t the number of
relevant documents having term t found in that step.)
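For checking arithmetic on the question above, the relevance weight usually attributed to Robertson and Sparck Jones, in its common form with a 0.5 correction, can be sketched as below. The function name `rsj_weight`, the parameter `k`, and the sample counts are illustrative assumptions; the substitution of the hint's variables in the last call is only one possible reading of part b), not a model answer.

```python
import math

def rsj_weight(N, R, n, r, k=0.5):
    """Relevance weight for term t (0.5-corrected form).

    N: documents in the collection
    R: relevant documents for the query
    n: documents containing term t
    r: relevant documents containing term t
    k: smoothing constant (0.5 in the commonly cited form)
    """
    numerator = (r + k) * (N - n - R + r + k)
    denominator = (n - r + k) * (R - r + k)
    return math.log(numerator / denominator)

# Retrospective case: all four counts are known (made-up sample values).
retrospective = rsj_weight(N=100, R=10, n=20, r=8)

# Predictive case, assuming the feedback sample stands in for the
# collection: w retrieved, x relevant, y_t with term t, z_t relevant with t.
w, x, y_t, z_t = 20, 5, 8, 4
predictive = rsj_weight(N=w, R=x, n=y_t, r=z_t)
```

The constant 0.5 keeps the weight finite when a count such as r or n - r is zero, which is common with small feedback samples.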
- Assume that you are designing a retrieval system that will support
matching of words or word stems (depending on user preference), that
will automatically substitute a proper or canonical term when either
of its two closest synonyms is used in a query, that will use TF*IDF
weighting, and that will apply the Ide dec-hi relevance feedback scheme
with cosine similarity. Please describe and explain the files needed to
support all this. You may do so in any of the following ways:
sketch the files and their contents and interrelationships (with arrows
for pointers, possibly with sample data values), list a series of
tables along with their attributes, or specify a relational schema.
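To make the pieces named in this question concrete, here is a minimal in-memory sketch of one possible layout: a synonym table, an inverted file, TF*IDF document vectors, cosine ranking, and an Ide dec-hi update (add every relevant retrieved document, subtract only the top-ranked non-relevant one). All names and sample data are hypothetical, the dictionaries merely stand in for files, and stemming is left out for brevity.

```python
import math
from collections import Counter

# Hypothetical synonym file: synonym -> canonical term.
synonyms = {"auto": "car", "automobile": "car"}

# Hypothetical document file: doc_id -> text.
docs = {1: "car engine repair", 2: "automobile engine", 3: "garden flowers"}

def tokenize(text):
    # Lowercase, split on whitespace, and map synonyms to the canonical term.
    return [synonyms.get(t, t) for t in text.lower().split()]

# Inverted file: term -> {doc_id: term frequency}.
inverted = {}
for doc_id, text in docs.items():
    for term, tf in Counter(tokenize(text)).items():
        inverted.setdefault(term, {})[doc_id] = tf

# Dictionary file: term -> IDF, computed from posting-list lengths.
N = len(docs)
idf = {t: math.log(N / len(postings)) for t, postings in inverted.items()}

def tfidf_vector(text):
    # Sparse TF*IDF vector; terms unseen in the collection get weight 0.
    return {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(text)).items()}

# Document vector file, needed so feedback can add and subtract documents.
doc_vectors = {d: tfidf_vector(text) for d, text in docs.items()}

def cosine(u, v):
    dot = sum(weight * v.get(t, 0.0) for t, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def ide_dec_hi(query, relevant, nonrelevant_ranked):
    # Add all relevant document vectors to the query, then subtract
    # only the highest-ranked non-relevant document vector.
    new_query = dict(query)
    for d in relevant:
        for t, weight in doc_vectors[d].items():
            new_query[t] = new_query.get(t, 0.0) + weight
    if nonrelevant_ranked:
        for t, weight in doc_vectors[nonrelevant_ranked[0]].items():
            new_query[t] = new_query.get(t, 0.0) - weight
    return new_query

query = tfidf_vector("auto")
ranked = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
new_query = ide_dec_hi(query, relevant=[1], nonrelevant_ranked=[3])
```

In a file-based design each dictionary here would typically become its own file or table, with the inverted file pointing into the document file by document identifier.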