Questions for Form A

In a DFA that implements stoplist word removal, arcs often are labeled by:
a) words in the stoplist.
b) words not in the stoplist.
c) letters that are present in one or more stoplist words.
d) all of the above.
e) exactly 2 out of choices a through c (which?).
f) none of the above.

What kinds of errors are likely to occur if a Porter-type stemmer is used to conflate terms?
a) two different words will conflate to the same stem.
b) two variants of a word will conflate to different stems.
c) no errors can occur.
d) both (a) and (b) above.
e) none of the above.

Briefly describe the changes that would be necessary to Christopher Fox's lexical analyzer if it were to recognize and conflate equivalent dates written in the following forms: 1 January 1993; 1/31/93; 01/31/93; 31/1/93; 31/01/93; and January 1, 1993. Be specific: say which parts of the analyzer would change, and how.

What are the advantages and disadvantages of stemming at indexing time? Of stemming at search time? Be sure to compare these two approaches against each other and against the option of using no stemming. Think about processing time and storage requirements, in particular.

Assume that PAT is used to store a collection of text with at least one million original words. Given the data available in the resulting PAT tree, would it be possible to implement successor variety conflation? Briefly sketch how this would work.


fox@cs.vt.edu
Tue Aug 30 04:42:23 EDT 1994