Unit RR, Part A: Features
Feature = Word in Description
- flat, uniform, simple system
- assume term independence
- assume large but finite set
- compute weights based on frequency in description
- compute weights based also on frequency in collection
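As a minimal sketch of the two frequency counts named above, the following computes per-description term frequencies and collection-wide document frequencies. The function names, the tokenizer, and the three sample descriptions are illustrative assumptions, not part of the original material.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def term_frequencies(description):
    """Frequency of each word within one description."""
    return Counter(tokenize(description))

def document_frequencies(collection):
    """Number of descriptions in the collection that contain each word."""
    df = Counter()
    for description in collection:
        # Use a set so each word counts once per description.
        df.update(set(tokenize(description)))
    return df

# Hypothetical toy collection, invented for illustration only.
collection = [
    "student plays tennis; the student studies math",
    "student lives in the dorm",
    "professor teaches math",
]
tf = term_frequencies(collection[0])
df = document_frequencies(collection)
```

Each word is treated as an independent feature here, in line with the term-independence assumption above.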
Example: Student Description
- case 1: feature = word in description
- case 2: sub-vectors of features
- group features by type or by class
- typical groupings: classroom, course, professor, residence, sport
- process sub-vectors separately
- combine sub-vector results
- sub-vectors probably are not independent
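One possible way to process sub-vectors separately and then combine the results is sketched below. It assumes cosine similarity within each group and a weighted average across groups; both choices, along with the names `cosine`, `combined_similarity`, and `group_weights`, are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def combined_similarity(a, b, group_weights):
    """Weighted average of the per-group cosine similarities."""
    total = sum(group_weights.values())
    score = 0.0
    for group, weight in group_weights.items():
        # Each sub-vector is compared only with its counterpart.
        score += weight * cosine(a.get(group, {}), b.get(group, {}))
    return score / total

# Hypothetical student descriptions, grouped by feature type.
student_a = {"course": {"calculus": 1, "physics": 1}, "sport": {"tennis": 1}}
student_b = {"course": {"calculus": 1, "chemistry": 1}, "sport": {"tennis": 1}}
group_weights = {"course": 1.0, "sport": 1.0}
score = combined_similarity(student_a, student_b, group_weights)
```

A weighted average treats the groups as if they contributed independently, which, as noted above, is probably not true in practice.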
General Issues
- How many features? Which ones to use? (query expansion)
- use a small number
- use ones closely related to the original term (query)
- How to assign weights? (term weighting)
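- in collection (e.g., inverse document frequency (IDF), discrimination value (DV), noise)
- in document (e.g., term frequency (TF), optionally normalized and scaled for the collection)
- combination (e.g., product of the document and collection weights)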
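Both issues can be sketched in a few lines. The code below first adds a small number of closely related terms to a query, then combines an in-document weight with an in-collection weight as a product. The relatedness table, the thresholds, the max-TF normalization, and the log(N/df) form of IDF are all assumptions chosen for illustration; other variants are equally plausible.

```python
import math

def expand_query(query_terms, related_terms, max_added=2, min_score=0.5):
    """Add a small number of the terms most closely related to the query."""
    candidates = {}
    for term in query_terms:
        for other, score in related_terms.get(term, {}).items():
            if other not in query_terms and score >= min_score:
                candidates[other] = max(score, candidates.get(other, 0.0))
    # Highest relatedness first; ties broken alphabetically.
    best = sorted(candidates, key=lambda t: (-candidates[t], t))[:max_added]
    return list(query_terms) + best

def idf(term, doc_freq, n_docs):
    """Inverse document frequency, log(N / df); 0 for unseen terms."""
    df = doc_freq.get(term, 0)
    return math.log(n_docs / df) if df else 0.0

def tf_idf(term_freq, doc_freq, n_docs):
    """Product of max-normalized TF (in document) and IDF (in collection)."""
    if not term_freq:
        return {}
    max_tf = max(term_freq.values())
    return {t: (f / max_tf) * idf(t, doc_freq, n_docs)
            for t, f in term_freq.items()}

# Hypothetical relatedness scores and frequency counts.
related_terms = {"car": {"automobile": 0.9, "vehicle": 0.7, "road": 0.3}}
expanded = expand_query(["car"], related_terms)
weights = tf_idf({"car": 2, "red": 1}, {"car": 10, "red": 50}, n_docs=100)
```

In this toy data, "road" is left out because its score falls below the threshold, and "car" outweighs "red" because it is both more frequent in the document and rarer in the collection.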
Alternatives for Feature Selection
- Word
- Stem or root or non-plural form
- Thesaurus category
- Descriptor
- N-gram (e.g., bigram, trigram)
- Factor from the singular value decomposition (SVD) in a latent semantic indexing (LSI) scheme
- Attribute-value pair
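Three of these alternatives (word, non-plural form, and n-gram) are simple enough to sketch directly. The helper names below are hypothetical, and `naive_singular` is a deliberately crude stand-in for a real stemmer rather than a recommended method.

```python
def ngrams(items, n):
    """All contiguous n-item windows of a sequence, as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def naive_singular(word):
    """Very rough plural stripping; real stemmers handle many more cases."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

# Hypothetical input, invented for illustration only.
words = "students take courses".split()
non_plural = [naive_singular(w) for w in words]  # non-plural forms
word_bigrams = ngrams(words, 2)                  # n-grams over words
char_trigrams = ngrams("cats", 3)                # n-grams over characters
```

The same `ngrams` helper works over words or characters, since the outline does not say which unit the n-grams are built from.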