CS5604, Unit CL

Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg VA 24061-0106

Abstract:

Terms are not independent of each other. Documents are not independent of each other. We exploit these facts when we cluster, whether to construct term groupings or thesauri, to facilitate browsing, to improve the efficiency of retrieving documents from disk, or, in many cases, to improve retrieval effectiveness.

Clustering algorithms vary widely in their space and time requirements, their stability, the tightness of the resulting clusters, whether they produce cluster hierarchies, and their utility for information retrieval applications. Indeed, some collections are not really amenable to clustering, especially when similar documents are not relevant to the same queries, or when terms with similar document occurrence characteristics are not really searchonyms.

This Unit covers these issues, making use of one textbook chapter, one lecture, and an exercise.



fox@cs.vt.edu
Oct 22 1996