- In a DFA that implements stoplist word removal, arcs are often labeled
by:
  - a) words in the stoplist.
  - b) words not in the stoplist.
  - c) letters that are present in one or more stoplist words.
  - d) all of the above.
  - e) exactly two of choices (a) through (c) (which?).
  - f) none of the above.
- What kinds of errors are likely to occur if a Porter-type stemmer is used
to conflate terms?
  - a) two different words will conflate to the same stem.
  - b) two variants of a word will conflate to different stems.
  - c) no errors can occur.
  - d) both (a) and (b) above.
  - e) none of the above.
- Briefly describe the changes that would be necessary to Christopher Fox's
lexical analyzer in order to recognize and conflate equivalent
dates, written in the following forms: 1 January 1993; 1/31/93; 01/31/93;
31/1/93; 31/01/93; and January 1, 1993.
Be specific, saying what parts of the analyzer would change, and how.
- What are the advantages and disadvantages of stemming at indexing time?
Of stemming at search time? Be sure to compare these two approaches
against each other and against the option of using no stemming.
Think about processing time and storage requirements, in particular.
- Assume that PAT is used to store a collection of text with at least one
million original words. Given the data available in the resulting PAT
tree, would it be possible to implement successor variety conflation?
Briefly sketch how this would work.