Questions for Form C

A DFA that implements stoplist word removal for a collection of n words is most likely to have:
a) O(n) states, O(n) arcs, more states than arcs.
b) O(n) states, O(n) arcs, fewer states than arcs.
c) O(n) states, O(n^2) arcs.
d) O(n^2) states, O(n) arcs.
e) None of the above.
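
For concreteness, one common way to realize such a DFA is as a character trie built over the stoplist, where each trie node is a state and each labeled edge is an arc. The sketch below uses illustrative names (build_trie, is_stopword) that are assumptions, not code from the chapter:

```python
# Sketch of a trie-based DFA recognizing a stoplist.
# Each dict node is a DFA state; each character key is a labeled arc.
# The "$" key marks an accepting (final) state.

def build_trie(words):
    """Build a character trie over the stoplist words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True  # accepting state: this word is in the stoplist
    return root

def is_stopword(trie, word):
    """Run the DFA over the word; accept only if we end in a final state."""
    node = trie
    for ch in word:
        if ch not in node:
            return False  # dead state: no arc for this character
        node = node[ch]
    return "$" in node

stoplist = ["the", "of", "and", "a", "to"]
trie = build_trie(stoplist)
```

With this construction, is_stopword(trie, "the") accepts while is_stopword(trie, "th") does not, since "th" ends in a non-final state.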

Explain why Christopher Fox feels that removing stoplist words is best done as part of lexical analysis rather than in a separate phase that uses hashing methods.

If a lexical analyzer is to be built that will recognize and conflate times written in a variety of forms, assuming 12- or 24-hour clocks, referring to various time zones, etc., would it be best to use a tool like lex or to write the analyzer by hand as an FSM? Briefly explain why.
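
To make the recognition task concrete, here is a minimal sketch of the kind of patterns involved, written with Python's re module rather than lex; the pattern, the conflate() helper, and the canonical 24-hour form are illustrative assumptions, not the chapter's code:

```python
import re

# Sketch: one regular expression covering a few time formats
# (12-hour with AM/PM, 24-hour, optional time-zone abbreviation).
TIME_RE = re.compile(
    r"\b(?P<hour>\d{1,2}):(?P<minute>\d{2})"   # e.g. 3:45 or 15:45
    r"(?:\s*(?P<ampm>[AaPp])\.?[Mm]\.?)?"      # optional AM/PM marker
    r"(?:\s*(?P<zone>[A-Z]{2,4}))?\b"          # optional zone, e.g. EST
)

def conflate(text):
    """Map the first recognized time to a canonical 24-hour HH:MM form."""
    m = TIME_RE.search(text)
    if not m:
        return None
    hour = int(m.group("hour"))
    if m.group("ampm"):
        hour = hour % 12                       # 12 AM -> 0, 12 PM -> 12 below
        if m.group("ampm").lower() == "p":
            hour += 12
    return "%02d:%02d" % (hour, int(m.group("minute")))
```

Under this sketch, "3:45 PM" and "15:45 EST" both conflate to "15:45", and "12:00 AM" to "00:00".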

What type of conflation method is most immune to spelling errors in the given words? Why?
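
As background, one family of conflation methods covered in the text is n-gram term matching, which compares words by their shared adjacent character pairs rather than by exact suffix rules. A sketch using Dice's coefficient over digrams (function names are illustrative):

```python
# Sketch: digram-based term matching with Dice's coefficient.

def digrams(word):
    """Set of unique adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice's coefficient: 2 * |shared digrams| / (|digrams a| + |digrams b|)."""
    da, db = digrams(a), digrams(b)
    if not da and not db:
        return 1.0
    return 2.0 * len(da & db) / (len(da) + len(db))
```

For example, dice("receive", "recieve") is 0.5 despite the transposition error, because the misspelling disturbs only the digrams around the transposed letters, while dice("abc", "xyz") is 0.0.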

All of the results on stemming retrieval effectiveness described by William Frakes are based on experiments with small or medium-sized collections. What do you expect the results would be if a very large collection were used? Briefly explain your answer.


fox@cs.vt.edu
Tue Aug 30 04:42:23 EDT 1994