Unit RR, Part E: Evaluation
Measures
- Recall and precision
- Single-valued measures:
  - Average precision over fixed recall levels (e.g., r = .25, .50, .75)
  - E-measure (lower is better; a parameter sets the relative
    importance of recall vs. precision)
  - Expected search length
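
As an illustration of the set-based measures, here is a minimal Python sketch. It assumes the E-measure meant here is van Rijsbergen's form, E = 1 - (1 + b^2)PR / (b^2 P + R); the function names and the parameter name `b` are illustrative, not taken from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def e_measure(precision, recall, b=1.0):
    """Assumed van Rijsbergen form of E (lower is better).

    b > 1 gives recall more weight, b < 1 gives precision more weight.
    """
    if precision == 0.0 or recall == 0.0:
        return 1.0  # worst case: no relevant document retrieved
    f = (1 + b * b) * precision * recall / (b * b * precision + recall)
    return 1.0 - f


# Hypothetical example: 4 documents retrieved, 3 relevant overall, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
e = e_measure(p, r, b=1.0)
```

With b = 1 the two measures are weighted equally, so E is one minus the harmonic mean of precision and recall.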
What Are the Measures Computed On?
- Boolean query: on retrieved set
- Ranking query: on set retrieved for a given similarity threshold
- Ranking query: sequence of r-p pairs, one for each point in the
ranking (taken in descending order) where another relevant document is found
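
The r-p sequence for a ranking query can be sketched as below. This also averages precision at the recall levels .25, .50, .75; the interpolation rule used (highest precision at any point whose recall is at least the target level) is an assumption, since the slides do not specify one. All names are illustrative.

```python
def rp_pairs(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant)
    pairs, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            pairs.append((hits / len(relevant), hits / rank))
    return pairs


def avg_precision_at(pairs, levels=(0.25, 0.50, 0.75)):
    """Average interpolated precision over the given recall levels.

    Assumed interpolation: precision at level r is the maximum precision
    over all points with recall >= r (0 if that recall is never reached).
    """
    total = 0.0
    for level in levels:
        candidates = [p for r, p in pairs if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / len(levels)


# Hypothetical ranking in descending order of similarity; d8 is never retrieved.
ranking = ["d3", "d9", "d1", "d5", "d7", "d2"]
relevant = ["d1", "d3", "d7", "d8"]
pairs = rp_pairs(ranking, relevant)   # one pair per relevant document found
score = avg_precision_at(pairs)
```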
Applied to Relevance Feedback
- Test and control
  - Randomly divide the document collection into halves.
  - Search and get feedback data on one half.
  - Test the new query on the other half.
  - Compute measures on the second half only.
- Residual collection
  - Run search.
  - Define the test collection as the original collection less the
    documents already retrieved.
  - Compute measures on that residual collection.
- Rank freezing
  - Run search.
  - Define the test collection as the original collection, but freeze
    the ranks of either only the relevant retrieved documents (partial
    freezing) or all retrieved documents (full freezing).
  - Compute measures with frozen slots included; newly retrieved
    documents fill the unfrozen slots.
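
The three schemes above can be sketched in Python as follows. This is a rough illustration, not a reference implementation: the function names are hypothetical, and `freeze_ranks` assumes the post-feedback ranking covers the whole collection, so there are always enough documents to fill the unfrozen slots.

```python
import random


def split_halves(docs, seed=0):
    """Test and control: randomly split the collection into two halves."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def residual(new_ranking, relevant, already_retrieved):
    """Residual collection: remove already-retrieved documents from both
    the new ranking and the relevant set before computing measures."""
    seen = set(already_retrieved)
    ranking = [d for d in new_ranking if d not in seen]
    return ranking, set(relevant) - seen


def freeze_ranks(first_retrieved, new_ranking, relevant, full=False):
    """Rank freezing: keep frozen documents at their first-search ranks.

    Partial freezing (full=False) freezes only relevant retrieved documents;
    full freezing freezes every retrieved document. Remaining slots are
    filled in order from new_ranking (assumed to rank the whole collection).
    """
    relevant = set(relevant)
    frozen = {slot: d for slot, d in enumerate(first_retrieved)
              if full or d in relevant}
    frozen_docs = set(frozen.values())
    fillers = iter([d for d in new_ranking if d not in frozen_docs])
    total = len(set(new_ranking) | frozen_docs)
    return [frozen[slot] if slot in frozen else next(fillers)
            for slot in range(total)]


# Hypothetical data: top 3 of the first search, then the post-feedback ranking.
first_retrieved = ["d3", "d9", "d1"]
relevant = ["d1", "d3", "d7", "d8"]
new_ranking = ["d1", "d7", "d3", "d8", "d5", "d9", "d2"]

res_ranking, res_relevant = residual(new_ranking, relevant, first_retrieved)
partial = freeze_ranks(first_retrieved, new_ranking, relevant)
full = freeze_ranks(first_retrieved, new_ranking, relevant, full=True)
```

In each case the measures are then computed on the returned ranking, so documents the user has already judged cannot inflate the post-feedback scores.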