CLASS SUMMARY 09/25/95
S. Carr, M. Joyce, B. Khan, Z. Khan, V. Madhava

* The class started with the handling of administrative work and general questions and answers.
* A quick review of some of the previous class material (Unit RR) was done.
* From Unit RR, Part E: Evaluation was discussed. The issue of effectiveness vs. efficiency was pointed out. Definitions of recall and precision were provided: recall measures how much of what is relevant has been retrieved, while precision measures how much of what is retrieved is relevant. Under the topic of evaluation of relevance feedback, the residual collection method was discussed. The problem of working with residual documents was stressed - the highly ranked relevant documents have been removed, so performance will appear lower in the second pass.
* Data structures - the use of term frequency (TF), inverse document frequency (IDF), stemming, and alternate words from a thesaurus to improve retrieval performance.
* Optimizations - the basic search process using accumulators was discussed, along with the problem of maintaining too many accumulators and its solution, pruning.
* In the CL Unit, Clustering - we discussed the key clustering concepts: cluster, centroid, low-level clusters, hierarchical clustering, agglomerative clustering, dendrogram.
* Cluster searching - depth-first searching, a recall-enhancing device; a problem with this type of searching is that too many terms are searched; an alternative is backtracking - backing up and going through paths not traversed before.
* Partitioning - another way to look at clustering.

====================================================

Tom Kalafut 09/25/95 Class Summary

After taking care of some administrative business, we reviewed the derivation of the F4 equation. We were then introduced to MARIAN, a newer VT library IR system that lets the user input more natural language queries.
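The recall and precision definitions above can be sketched in a few lines. This is a minimal illustration, not from the course material; the document identifiers and the helper name `recall_precision` are made up for the example.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    recall    = |retrieved & relevant| / |relevant|  -- how much of what is
                relevant has been retrieved
    precision = |retrieved & relevant| / |retrieved| -- how much of what is
                retrieved is relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)

# Hypothetical query: 10 documents retrieved, 4 of the 8 relevant ones found.
retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = ["d1", "d3", "d5", "d7", "d11", "d12", "d13", "d14"]
r, p = recall_precision(retrieved, relevant)
print(r, p)  # recall = 4/8 = 0.5, precision = 4/10 = 0.4
```

The two measures pull in opposite directions: retrieving everything drives recall to 1 while precision collapses, which is why the summaries treat them as a dual measure.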
We then covered inference networks, which have several parallel layers connecting documents to help retrieve relevant ones. We then went into methods of retrieval evaluation, keying on the comparison of the recall and precision measures. The best method is test and control, which involves breaking the document space into pieces and seeing how a set of queries works in each subset. A less representative evaluation method is the residual method, which involves removing selected documents from the search space after each query; it is skewed when there are only a few possible relevant works for a specialized query. A not-so-good method for evaluation is rank freezing. We then saw how to construct a good data dictionary holding the terms, their stems, ranking measures, and pointers to documents. We then closed Relevance Ranking with some optimization techniques, including pruning (stopping a term ranking once it is significantly high), sorting the ranks, using the lowest-frequency terms first, and screening the feedback.

We started Clustering similar subsets of documents just before ending class.

====================================================

9/25: 5604 RANKING AND RELEVANCE CLASS SUMMARY

GROUP 2: Lauren Barton, Martin Falck, Nelson Kile, Carolyn O'Hare, Robert Ryan

A review of last class's material on RR was held. New information was covered on:

Vector Feedback
* Uses information in searches to predict which other records may be relevant.
* Rocchio - uses all information on positive and negative results.
* Ide - uses only the positive results.

The Probabilistic Model
* Computes the probability of relevance based on whether or not index terms are present.
* Binary Independence Model - assumes that terms are independent.

Inference Networks
* Use Bayesian inference and "if" structures to determine relevance to a query.

Evaluation
* Precision and recall - a dual measure.
* Averages - average of recall and precision at specific points.
* E-measure - a single measure.
* Expected search length - tells how many documents need to be reviewed to find X relevant ones.

Data Structures
* Look at the frequencies of terms in a document - requires a larger index.
* Could use an inverse document frequency.
* Could stem, or expand with a thesaurus.

Optimizations
* Sort query terms by frequency - the highest-IDF term is the most important.
* The user may screen results.
* Multiple processors.

Began a new unit on Clustering
* Combine several similar documents into centroids.
* The centroids are then combined into structures that are similar to trees.
* The same document may also be placed into several centroids.
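The vector-feedback formulas named in the summaries can be sketched as below. This is an illustrative sketch, not the course's implementation: the weighting constants (alpha, beta, gamma) are conventional choices I am assuming, and Ide is shown using only positive results, as the summary characterizes it (other Ide variants also use nonrelevant documents).

```python
def rocchio(query, pos_docs, neg_docs, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio feedback: move the query vector toward the relevant
    documents and away from the nonrelevant ones."""
    n = len(query)
    new_q = [alpha * q for q in query]
    for d in pos_docs:
        for i in range(n):
            new_q[i] += beta * d[i] / len(pos_docs)
    for d in neg_docs:
        for i in range(n):
            new_q[i] -= gamma * d[i] / len(neg_docs)
    return [max(w, 0.0) for w in new_q]  # negative weights are commonly clipped

def ide(query, pos_docs):
    """Ide feedback as summarized above: add only the positive results."""
    n = len(query)
    new_q = list(query)
    for d in pos_docs:
        for i in range(n):
            new_q[i] += d[i]
    return new_q
```

With one relevant document `[0.0, 1.0]` and query `[1.0, 0.0]`, Rocchio yields `[1.0, 0.75]` and Ide yields `[1.0, 1.0]`: both pull the query toward terms the relevant document contains.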
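The accumulator-based ranking described in both summaries, with query terms processed in decreasing-IDF order and a pruning rule, can be sketched as follows. The tiny index, the accumulator cap, and the helper names are hypothetical; real systems use more careful pruning criteria than a hard cap.

```python
import math

index = {  # term -> {doc_id: term frequency}; contents are made up
    "cluster":  {"d1": 3, "d2": 1},
    "centroid": {"d1": 2},
    "the":      {"d1": 9, "d2": 7, "d3": 8},
}
N = 3  # documents in the collection

def idf(term):
    # rarer terms get higher weight
    return math.log(N / len(index[term]))

def rank(query, max_accumulators=2):
    acc = {}
    # highest-IDF (most selective) terms first
    for term in sorted(query, key=idf, reverse=True):
        for doc, tf in index[term].items():
            if doc not in acc and len(acc) >= max_accumulators:
                continue  # pruning: stop creating new accumulators
            acc[doc] = acc.get(doc, 0.0) + tf * idf(term)
    return sorted(acc.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank(["the", "cluster", "centroid"])
# "d1" ranks first; "d3" matches only the common term "the" and never
# earns an accumulator, which is exactly what the pruning is for.
```

Because the high-IDF terms are seen first, the documents pruned away are those matching only low-value terms, so the top of the ranking is largely unaffected.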
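The agglomerative clustering idea that closes the summaries can be sketched as a toy example: start with each document vector as its own cluster, repeatedly merge the two clusters whose centroids are closest, and record the merges (that merge history is what a dendrogram draws). The document vectors and function names here are invented for illustration.

```python
def centroid(cluster):
    """Mean vector of the documents in a cluster."""
    n = len(cluster[0])
    return [sum(v[i] for v in cluster) / len(cluster) for i in range(n)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(vectors):
    """Merge the closest pair of clusters until one remains;
    return the merge history."""
    clusters = [[v] for v in vectors]
    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: dist(centroid(clusters[ab[0]]), centroid(clusters[ab[1]])),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

docs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
history = agglomerate(docs)
# the two near-origin documents merge first; the distant one joins last
```

This is the hierarchical, tree-like structure the summaries mention; allowing a document to enter more than one cluster (overlapping centroids) would require relaxing the exclusive merge step shown here.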