-
A DFA that implements stoplist word removal for a collection of n
words is most likely to have:
- a)
- O(n) states, O(n) arcs, more states than arcs.
- b)
- O(n) states, O(n) arcs, fewer states than arcs.
- c)
- O(n) states, O(n**2) arcs.
- d)
- O(n**2) states, O(n) arcs.
- e)
- none of the above.
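For intuition when counting states and arcs, here is a minimal sketch (not from the text; the stoplist and the trie representation are illustrative assumptions) of a trie-style DFA built over a small stoplist, with a helper that tallies its states and arcs:

```python
# Minimal sketch of a trie-style DFA recognizing a stoplist.
# The word list and dict-of-dicts representation are assumptions.

def build_trie(words):
    """Build a trie; each node is a dict mapping a character to a child node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}  # distinct accepting state marking end-of-word
    return root

def count_states_and_arcs(node):
    """Return (states, arcs) for the trie rooted at this node."""
    states, arcs = 1, 0
    for child in node.values():
        s, a = count_states_and_arcs(child)
        states += s
        arcs += a + 1  # one arc from this node to each child
    return states, arcs

stoplist = ["the", "then", "this", "of", "off"]
trie = build_trie(stoplist)
print(count_states_and_arcs(trie))  # -> (15, 14); in a trie, arcs = states - 1
```

Because every non-root state in a trie has exactly one incoming arc, the arc count is always one less than the state count, and both grow linearly with the total length of the stoplist words.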
-
Explain why Christopher Fox argues that stoplist word removal is best
done as part of lexical analysis rather than in a separate phase that
uses hashing methods.
-
If a lexical analyzer is to be built that will recognize and conflate times
written in a variety of forms, assuming 12- or 24-hour clocks, referring
to various time zones, etc., would it be best to use a tool like lex or
to write the analyzer by hand as an FSM? Briefly explain why.
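To weigh the trade-off concretely, here is a sketch of the lex-style approach: a single regular expression covering a few of the time formats such an analyzer would face. The pattern is an illustrative assumption, not a complete time grammar (it handles only a handful of zone abbreviations, for instance):

```python
import re

# Sketch of a lex-style specification for times (illustrative, incomplete).
TIME = re.compile(
    r"^(?:[01]?\d|2[0-3]):[0-5]\d"             # 12- or 24-hour HH:MM
    r"(?:\s?[ap]\.?m\.?)?"                     # optional am/pm suffix
    r"(?:\s?(?:EST|CST|MST|PST|GMT|UTC))?$",   # optional time zone
    re.IGNORECASE,
)

for s in ["9:30", "21:05 GMT", "11:59 p.m. EST", "25:00"]:
    print(s, bool(TIME.match(s)))
```

A hand-written FSM would have to track the same distinctions (hour range, optional suffix, optional zone) with explicit states and transitions; the declarative pattern above is far easier to extend as new formats turn up.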
-
What type of conflation method is most immune to spelling errors in the
given words? Why?
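As background for comparing conflation families, here is a minimal sketch of character-bigram matching scored with the Dice coefficient, one of the methods discussed in this setting. The example words (including the deliberately misspelled one) are illustrative assumptions:

```python
def bigrams(word):
    """Set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice similarity of two words' bigram sets, in [0, 1]."""
    x, y = bigrams(a), bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))

print(dice("statistics", "statistical"))  # related word forms
print(dice("statistics", "sttaistics"))   # "sttaistics" has a transposition error
```

A single local error disturbs only the few bigrams that touch it, so the similarity score degrades gracefully rather than failing outright.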
-
All of the results on stemming retrieval effectiveness described by William
Frakes are based on experiments with small- or medium-sized collections.
What do you expect the results would be if a very large collection were
used? Briefly explain your answer.