Ranking/Relevance Feedback Article

Summary of: "A Vector Space Model for Automatic Indexing," Salton, Wong, and Yang, 1975

Group II Submission: Lauren Barton, Martin Falck, Nelson Kile, Jr., Carolyn O'Hare, Robert Ryan

This article presents a new indexing model that attempts to increase the recall and precision of document retrieval (and other pattern-matching environments) by representing documents as vectors and attributing relevance based on their relative positions. Ideally, documents would be assigned vectors such that similarly relevant documents are tightly clustered in the document space. Once this is done, high recall and precision are assured by selecting all documents in the same tight cluster. However, this ideal situation requires "a priori knowledge of the retrieval history for the given collection." That is, if humans heuristically assign relevance scores to documents, then future selections will be precise because they have this a priori knowledge as a guide. Unfortunately, Salton et al. state that this strategy is not practical. Instead, they suggest that the best way to increase precision and recall is to achieve the maximum possible separation between the individual document vectors in their model. This is accomplished by minimizing the average similarity between document pairs, which guarantees that each given document can be retrieved when it lies near a user query "without also necessarily retrieving its neighbor." By decreasing the number of unwanted neighbors retrieved, the model increases its precision, since irrelevant nearby vectors are no longer pulled in. Therefore, since relevant documents are clustered tightly together and widely separated from irrelevant documents, both high recall and high precision are achieved. The document space is best considered when the documents are grouped into classes, each represented by a class centroid.
This centroid represents the average weight of the corresponding elements in the surrounding document vectors. After the space is broken into classes, each with a single cluster centroid, the authors suggest that a `main' centroid be developed from the average weights of the cluster centroids. The best possible space would consist of many vectors gathered into small clusters, with the clusters separated from one another as widely as possible. Intra-cluster vector distances would thus be minimized and inter-cluster distances maximized.

The authors then apply automatic indexing methods to evaluate their new model. These procedures "indicate that retrieval performance and document space density appear inversely related, in the sense that effective indexing methods in terms of recall and precision are associated with separated document spaces." Finally, the authors conclude that they are unable to establish a causal link between space density and the performance of the document frequency indexing methods. However, they suggest their vector space model is suitable for collections in several different subject areas, and they believe their results are superior to those of other manual or automatic indexing and analysis procedures.

===================================

Review of "A Vector Space Model for Automatic Indexing"
Article written by: G. Salton, A. Wong, and C.S. Yang
Reviewed by: Rick Compton, Fred Drake, Mark Missana, Steve Williams

In considering a vector space model for automatic document indexing, we can imagine a document space consisting of documents characterized by one or more index terms. These index terms may be weighted by importance or unweighted (present/not present). An n-dimensional vector can be used to represent a document indexed by n terms. Index vectors for two different documents can be compared for similarity in corresponding terms and term weights by computing a similarity coefficient.
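As a minimal sketch, a similarity coefficient of this kind might be computed as the cosine of the angle between two term-weight vectors. The example weights below are invented for illustration; the paper does not prescribe one specific coefficient.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two documents indexed by the same three terms, with hypothetical weights.
doc1 = [0.8, 0.0, 0.6]
doc2 = [0.9, 0.1, 0.4]

print(cosine_similarity(doc1, doc2))  # close to 1.0: similar term profiles
```

Identical term profiles give a coefficient of 1, and documents with no terms in common give 0, matching the intuition that nearby points in the space are similar documents.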
A single point on the unit sphere can also be used to represent a document. Given this representation, the distance between two document points in the space is, in general, inversely correlated with the similarity between the corresponding vectors. The next question is whether there is a particular document space that performs optimally in retrieval. Ideally, documents relevant to a given query should be clustered closely together. However, without previous knowledge of the queries themselves, some clustering patterns might well retrieve undesirable documents. Therefore, it seems logical to minimize the similarities between documents. It is sometimes thought that good spacing leads to good retrieval, and likewise that improved retrieval relies on good separation.

Because the number of vector comparisons can be quite large, a clustered document space is considered instead. Documents are grouped into clusters, and each cluster is represented by a cluster centroid; there is also a main centroid at the center of all the clusters. Experiments were performed in which term weighting was expressed using both term frequency alone and the product of term frequency and inverse document frequency. One test used many small clusters with considerable cluster overlap, while another used larger clusters with less overlap. The fraction y/x, where x is the average similarity between documents and their associated centroid and y is the average similarity between cluster pairs, was used as a measure of space density. The experiments tended to support the notion that recall and precision improve with decreased space density. Similar tests that used inverses of the weighting procedures likewise showed that poor performance was associated with increased space density. The need to increase similarity within clusters and decrease similarity from cluster to cluster was now evident.
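The centroid construction and the y/x density measure described above can be sketched as follows. The clusters and weights are invented for illustration, and cosine similarity stands in for the similarity coefficient; this is a sketch of the idea, not the paper's exact procedure.

```python
import itertools
import math

def cosine(u, v):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Element-wise average of equal-length term-weight vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

# Two hypothetical clusters of 3-term document vectors.
clusters = [
    [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    [[0.0, 1.0, 0.2], [0.1, 0.9, 0.1]],
]
centroids = [centroid(c) for c in clusters]
main_centroid = centroid(centroids)

# x: average similarity between each document and its own cluster centroid.
x = (sum(cosine(d, centroids[i]) for i, c in enumerate(clusters) for d in c)
     / sum(len(c) for c in clusters))

# y: average similarity between pairs of cluster centroids.
pairs = list(itertools.combinations(range(len(centroids)), 2))
y = sum(cosine(centroids[i], centroids[j]) for i, j in pairs) / len(pairs)

density = y / x  # smaller values: tight clusters, widely separated
print(round(x, 3), round(y, 3), round(density, 3))
```

In this toy space the documents sit close to their own centroids (x near 1) while the two centroids point in quite different directions (y small), so the density ratio is low, i.e. the favorable configuration.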
Additional experiments were performed in which unique terms were emphasized and terms appearing in many documents were de-emphasized. The positive results reflected the usefulness of term discrimination: good index terms should have a high discrimination value, and poor index terms the opposite. Surprisingly, further study showed that the best discriminators had neither very high nor very low document frequencies. It was found that high-frequency terms could obtain a higher discrimination value by being combined with other terms into phrases, and low-frequency terms could be grouped under a common semantic (thesaurus) class and likewise obtain a higher discrimination value. While these experiments did not demonstrate a conclusive causal relationship between recall-precision quality and space density, respectable results were achieved for several different types of document collections.

===================================

"A Vector Space Model for Automatic Indexing" by Salton, Wong, and Yang
RR Article Summary by Group I: Kalafut, Muhlenburg, Klein, Fitzgerald

This article begins with the hypothesis that document space density may have a strong relation to retrieval performance. In fact, the clustering of similar documents and the separation of dissimilar ones may be the optimum distribution of a document space. Using within-document frequency and normalized IDF weighting, it was shown that retrieval performance did affect space density. Afterwards, the reverse was shown: that document space density affected performance, the best performance coming from well-separated clusters of similar documents. The discrimination value of an index term measures how much the assignment of that term makes its document more distinguishable from similar documents. About 25 percent of the terms in the Medlars collection were the best discriminators. Finally, it was shown that automatic phrasing, and automatic phrasing combined with thesaurus classes, performed significantly better than standard term frequency.
===================================

From: (Group 5) Shirley Carr, Mike Joyce, Zakia Khan, Vas Madhava

Article Summary (SALT75b): A Vector Space Model for Automatic Indexing

In a vector space model, a document can be represented by a vector such as Di = (di1, di2, ..., dit), where Di represents a document and each dij represents the weight of a term within the document. With this approach, given two document vectors, one can compute a similarity between them in one of two ways: 1) taking the inner product of the two vectors, or 2) taking an inverse function of the angle between the two vectors. Instead of keeping each document as a full vector, the vectors can be normalized so that each document becomes a point in a document space. The distance between two document points is then inversely correlated with the similarity between the two vectors.

Ideally, we would like a space in which all the relevant documents are clustered together, but unfortunately this cannot be done without a priori knowledge of the documents' relevance. Another option is a space in which everything is separated, so that all the relevant documents can be accessed without any non-relevant ones in the vicinity. In practice, even this is hard to do. What can be done is to create clusters, each represented by a centroid. Using these centroids, the space density can be calculated by checking each document against its centroid, which decreases the necessary computations from O(n**2) to O(n). The best clustering is achieved when there is large intra-cluster similarity and little inter-cluster similarity.

The best term-weighting technique is: (term frequency weight) * (inverse document frequency). This gives the highest weight to terms that occur frequently in one document but rarely in any other document.
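The weighting just described can be sketched in a few lines. The small collection below and the log-based form of the inverse document frequency are assumptions for illustration; the summary gives only the product formula, not an exact IDF definition.

```python
import math

# Toy collection: each document is a list of its terms (invented data).
docs = [
    ["vector", "space", "model", "vector"],
    ["space", "density", "cluster"],
    ["vector", "retrieval"],
]
N = len(docs)

def tf_idf(term, doc):
    """(term frequency in doc) * (log-based inverse document frequency)."""
    tf = doc.count(term)                  # within-document frequency
    df = sum(term in d for d in docs)     # number of documents containing term
    return tf * math.log(N / df)

# "vector" occurs twice in doc 0 but also appears in doc 2, so its IDF is
# modest; "density" occurs once but only in doc 1, so it gets log(N/1).
print(tf_idf("vector", docs[0]))
print(tf_idf("density", docs[1]))
```

As the text says, the weight is largest for terms that are frequent within one document yet rare across the rest of the collection.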
Results from experiments using this weighting have shown that improved recall-precision is associated with decreased density, and conversely that when the space is compressed there is poor retrieval performance. This leads to the term discrimination model. In this model, a good discriminator is a term whose assignment to the collection decreases the similarity between the documents; conversely, if bad index terms are added, the space density increases. Thus, finding good discriminators is the next task. What was found is that good discriminators are those whose document frequency is neither too high nor too low. The following strategy can be used:

1) Terms with medium document frequency should be used directly.
2) Terms with very high document frequency should be transformed into lower-frequency terms by combining them into phrases (which by their very nature are lower in frequency).
3) Terms with very low frequency should be grouped into a thesaurus class, so that several different terms can be combined into one.

Such a technique has helped with the MEDLARS collection.
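The discrimination idea above can be sketched as the change in space density when a term is dropped from the vocabulary: a positive value means the term spreads the documents apart when assigned (a good discriminator). The vocabulary, weights, and this particular density-difference formulation are illustrative assumptions, not the paper's exact procedure.

```python
import itertools
import math

# Invented vocabulary and term-weight vectors. "alpha" appears in every
# document with the same weight (a high-frequency, poorly discriminating
# term); "gamma" appears in only some documents with high weight.
vocab = ["alpha", "beta", "gamma"]
docs = [
    {"alpha": 1.0, "beta": 0.2},
    {"alpha": 1.0, "gamma": 0.9},
    {"alpha": 1.0, "beta": 0.1, "gamma": 0.8},
]

def cosine(u, v, terms):
    """Cosine similarity restricted to the given index terms."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
    nu = math.sqrt(sum(u.get(t, 0.0) ** 2 for t in terms))
    nv = math.sqrt(sum(v.get(t, 0.0) ** 2 for t in terms))
    return dot / (nu * nv) if nu and nv else 0.0

def density(terms):
    """Average pairwise document similarity under the given vocabulary."""
    pairs = list(itertools.combinations(docs, 2))
    return sum(cosine(u, v, terms) for u, v in pairs) / len(pairs)

def discrimination_value(term):
    """Density without the term minus density with it: >0 is a good term."""
    without = [t for t in vocab if t != term]
    return density(without) - density(vocab)

for t in vocab:
    print(t, round(discrimination_value(t), 3))
```

Running this, the ubiquitous "alpha" comes out with a negative value (it compresses the space), while "gamma" comes out positive, mirroring the strategy above of phrasing very frequent terms and thesaurus-grouping very rare ones so that medium-frequency discriminators remain.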