Questions for Form B

What are the main advantages and disadvantages of having a very short stopword list? Of having a list with several hundred words?

If one is working with a collection of book titles, such as the one in the MARIAN system, which type of conflation would be best:
a) stemming using a Porter-like algorithm.
b) morphological analysis to root forms.
c) no conflation - leave words as-is.
d) removal of plural-forming endings.
e) none of the above.

Please briefly justify your choice.

Compare the speed and quality of results for stemming (based on a Porter-type algorithm) versus morphological analysis (i.e., linguistically correct reduction to root forms). In what situations would each be most suitable?

What distinction does William Frakes draw between weak stemming and strong stemming, and when would each be more suitable than the other?

In a very large collection of legal documents, where the full text of the documents is indexed, a common problem is that too many documents are retrieved in response to a query calling for exact matching of words. If that collection is re-indexed with stemming done at indexing time, what is likely to happen to the number of documents retrieved for queries calling for matching of stems (i.e., will the number increase or decrease, and by how much)? What can be done to correct this situation?


fox@cs.vt.edu
Tue Aug 30 04:42:23 EDT 1994