Computer Exercise

The exercise has two parts: using the stopword code and using the stemming code. All routines, in source and executable form, are on video.cs.vt.edu in /u1/fox/ir/Frakes. Look at the stopper and stemmer directories, studying the read.me and testfile files first.

In each directory, look at the results of running the executable routine (stopper or stemmer, respectively) on testfile. For stemmer, look at testfile.out, which holds the results, and testfile.uniq, which holds the unique words and is produced by running sort -u testfile.out and sending the output to testfile.uniq. Please send the instructor a list of all words in this example that have been overstemmed or understemmed. Also, run stemmer yourself to find at least three other words that would be overstemmed and three other words that would be understemmed. Send those two sets of three words to the instructor, with a short explanation and justification.
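One way to organize the search for stemming errors is to group words by the stem they produce. The following is only a minimal sketch in Python, not part of the course code. It assumes you have already paired each original word with its stem; the format of testfile.out is not described here, so the pairing step is left to you. The function names and the sample pairs are hypothetical.

```python
from collections import defaultdict


def group_by_stem(pairs):
    """Map each stem to the sorted list of distinct words that produced it."""
    groups = defaultdict(set)
    for word, stem in pairs:
        groups[stem].add(word.lower())
    return {stem: sorted(words) for stem, words in groups.items()}


def conflated(groups):
    """Keep only the stems shared by two or more distinct words."""
    return {stem: words for stem, words in groups.items() if len(words) > 1}


# Hypothetical (word, stem) pairs, invented for illustration only;
# they are not actual output of the course's stemmer.
sample_pairs = [
    ("connect", "connect"),
    ("connected", "connect"),
    ("connection", "connect"),
    ("university", "univers"),
    ("universe", "univers"),
    ("run", "run"),
]

groups = group_by_stem(sample_pairs)
shared = conflated(groups)
for stem in sorted(shared):
    print(stem, "<-", ", ".join(shared[stem]))
```

A group whose members are unrelated in meaning is a candidate for overstemming. Understemming shows up the other way around: related words that land in different groups. In both cases the grouping only surfaces candidates, and the judgment about relatedness remains yours.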

The stopper directory contains a variety of interesting result files. File testfile.out is the result of the regression test on testfile, and testfile.out.uniq is the set of unique words in that result file. File testfile.all.uniq, on the other hand, has all the unique words in testfile and can be compared with testfile.out.uniq by running diff on the two files. Please study testfile.uniq.diffs and send a message to the instructor explaining why these words appear there. Also explain how the words in that file relate to those in stop.wrd. Please comment on the size and quality of stop.wrd, and suggest some words that should be added and some that should be removed. Finally, indicate how you can tell that stopword removal is correct by describing a way to cross-check the results; this should be easy given the discussion above.
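The comparison of word sets described above can be sketched with set operations. This is a minimal illustration in Python, not the course's method. It assumes each word list holds one word per line and that case has already been normalized; the function names and the sample words are hypothetical.

```python
def read_word_list(path):
    """Read a word list, assuming one word per line (an assumption)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def cross_check(all_words, kept_words, stop_words):
    """Compare the words before and after stopword removal with the stop list."""
    all_words = set(all_words)
    kept_words = set(kept_words)
    stop_words = set(stop_words)
    removed = all_words - kept_words
    return {
        # Words present in the input but absent from the output.
        "removed": sorted(removed),
        # Should be empty: every removed word should be a stopword.
        "removed_not_in_stoplist": sorted(removed - stop_words),
        # Should be empty: no stopword should survive in the output.
        "stopwords_left_in_output": sorted(kept_words & stop_words),
        # Should be empty: the output should not introduce new words.
        "words_not_in_input": sorted(kept_words - all_words),
    }


# Hypothetical word sets, invented for illustration only.
sample_all = ["the", "retrieval", "of", "information", "is", "fast"]
sample_kept = ["retrieval", "information", "fast"]
sample_stop = ["the", "of", "is", "a", "and"]

report = cross_check(sample_all, sample_kept, sample_stop)
for name, words in report.items():
    print(name, words)
```

With real data, the three arguments could be loaded with read_word_list from testfile.all.uniq, testfile.out.uniq, and stop.wrd, provided those files really do hold one word per line. If the three "should be empty" lists are empty, the removed words are exactly the stopwords that occurred in the input.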

Look at the driver routines (stopper.c, stemmer.c) and any other routines you find of interest. Note that your textbook covers the program text well.


fox@cs.vt.edu
Oct 22 1996