Group II Submission: Lauren Barton, Martin Falck, Nelson Kile, Jr., Carolyn O'Hare, Robert Ryan

Author - Gerard Salton
Title - Another Look at Automatic Text-Retrieval Systems

The author uses studies to support his theory that automatically indexed systems are preferable to manually indexed systems, in contrast to the assertions of Blair and Maron's study. The author also provides theories on weighting models and processes for automatic indexing. The author discusses the following:

* The terms precision, recall, precision-enhancing devices, and recall-enhancing devices.
* The IBM STAIRS system, with the following features:
  - Words are normally extracted from document texts. Text words are broadened using truncation and may be supplemented by lists of synonyms supplied by the user.
  - Synonyms supplement searches for specific terms.
  - Includes a ranking feature that retrieves documents in decreasing order of weights.
  - Average precision (0.75).
  - Average recall (0.20).
* The assertions made in the Blair and Maron article based on an evaluation of STAIRS system usage:
  - When high recall is essential, users cannot simply broaden the search request because of output overload.
  - When high recall is desired, manual indexing is preferred.
* The Medlars evaluation -- Medlars is a large system with manual indexing by experts, maintained on medical research at the National Library of Medicine. A study was done to evaluate precision and recall. Although the results were widely scattered, the average recall was 0.58 and the average precision was 0.50. Three particular performance points were analyzed:
  - For high-precision searches, precision was about 0.80 and recall was 0.19.
  - For high-recall searches, recall reached 0.89 and precision was 0.20.
  - For average performance, recall was 0.58 and precision was 0.50.
  A summary of the recall failures and the precision failures showed that a substantial portion of the search failures were due to manual indexing and the controlled language.
* The NASA study -- in the mid-1970s a comparison between automatic and manual indexing was conducted using a NASA database. The following indexing systems were compared:
  - A natural-language text search consisting of a machine search of document titles and abstracts, not the entire text. This produced the best average recall (0.78) and a high order of precision (0.63).
  - A natural-language text-search system supplemented by a thesaurus of associated concepts prepared from the source documents.
  - A controlled-language indexing of the documents performed by human subject experts. This method produced a precision (0.74) better than the automatic abstract search, but also a substantially worse recall (0.56).
  - The controlled indexing supplemented by natural-language terms extracted from the documents.
* The Aslib-Cranfield study, which attempted to evaluate the performance of natural-language (single-term) indexing based on abstract searching supplemented by precision- and recall-enhancing mechanisms. The tests indicated that the single-term natural-language indexing provided better results than the comparable controlled-term indexing.
* The terms exhaustivity and specificity were defined. The author also suggests a possible weighting factor.
* The probabilistic retrieval model assumes that the most valuable documents are those whose probability of relevance to a query is the largest. The relevance can be estimated using the properties of the individual terms.
* The term-discrimination model assumes that the most useful terms for content identification are those best capable of distinguishing the documents of a collection from each other. The best content identifiers will be those occurring neither too rarely nor too frequently.
* Proposed process for automatic indexing:
  - Identify individual words.
  - Use a stop list.
  - Use suffix stripping.
  - Compute a term weighting factor.
  - Represent each document by the set of weighted word stems.
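The five-step process above can be sketched in code. The tiny stop list, the crude suffix stripper, and the tf * idf-style weighting below are illustrative assumptions for the sketch, not the article's exact algorithms (a real system would use a full stop list and a proper stemmer):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "by", "is", "are"}  # assumed tiny stop list

def stem(word):
    # Crude suffix stripping, illustrative only (real systems use e.g. Porter stemming).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_documents(docs):
    """Return one {stem: weight} dict per document, weighted by tf * idf."""
    stemmed = []
    for text in docs:
        words = re.findall(r"[a-z]+", text.lower())       # 1. identify individual words
        kept = [w for w in words if w not in STOP_WORDS]  # 2. apply the stop list
        stemmed.append([stem(w) for w in kept])           # 3. strip suffixes
    # 4. compute a term weighting factor; here idf = log(N / document frequency).
    n = len(docs)
    df = Counter()
    for doc in stemmed:
        df.update(set(doc))
    # 5. represent each document by its set of weighted word stems.
    indexed = []
    for doc in stemmed:
        tf = Counter(doc)
        indexed.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return indexed
```

Stems occurring in every document receive weight zero under this idf choice, which matches the idea that terms appearing everywhere do not discriminate between documents.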
* Proposed refinements for the above automatic indexing model:
  - Generate weighted word stems that are attached to the documents.
  - Use a thesaurus.
  - Use a phrase-formation process to generate term phrases that incorporate terms with high document frequencies.
  - Compute a combined term weight for assigned thesaurus classes and term phrases, and represent each document by the corresponding sets of weighted single terms, term phrases, and thesaurus classes.
* Conclusion -- the author believes that some of the statements made in the Blair and Maron study are incorrect. The author believes that the evidence supports that properly designed text-retrieval systems are preferable to manually indexed systems.

=================================================================

Review of Salton, "Another Look at Automatic Text-Retrieval Systems"
Reviewed by Rick Compton, Fred L. Drake, Jr., Mark Missana, and Stephen Williams

In "Another Look at Automatic Text-Retrieval Systems," Gerard Salton refutes an earlier study by Blair and Maron and cites several studies as evidence that automatic document-retrieval systems are competitive with or superior to systems based on manual indexing. Effectiveness of a retrieval system is based on measures of recall and precision. Recall is the amount of relevant material retrieved divided by the relevant material available in the system. Precision, on the other hand, is the relevant material retrieved divided by all materials retrieved. In practice, recall and precision tend to be inversely related. Automated retrieval systems have been enhanced to improve recall and precision. Term truncation (the use of a word stem to include variations) and the addition of synonyms or related words form a broader search and thus increase recall. Term weighting, the use of word phrases, and term hierarchies can provide a narrower search and thus improve precision. The STAIRS system, studied by Blair and Maron, is a "full-text retrieval system".
Using term truncation and synonym lists to broaden text words, and a ranking feature that returns documents in order of decreasing document weights, the STAIRS system averaged a recall of 20 percent while maintaining a precision of 75 percent. Blair and Maron's searchers were lawyers; for their application, 20% recall was considered unacceptable. Blair and Maron determined that the system was not user-friendly and that broadening the search request would merely result in a detrimental loss of precision. Their conclusion was that full-text retrieval systems should not be substituted for manual indexing. Salton asserts that Blair and Maron's conclusions were at least partially based on sentiment, as their sample study simply did not provide enough data to support this conclusion: no comparative data from a manually indexed collection was considered. Salton combines information from the study presented by Blair and Maron and two other studies of systems using large document collections to show that automatic retrieval systems are competitive with manual indexing. Lancaster's study of Medlars and Clarendon's evaluation of a NASA system broaden the evaluation space to put Blair and Maron's experience into perspective. In the 1960s, Lancaster conducted experiments with "the Medlars demand search service" using biomedical literature. After manual indexing and querying, the search and retrieval operations were performed manually. Lancaster's results indicate that even though STAIRS lacks a controlled language and manual indexing, its performance was "within the range of the high-performance Medlars searches". Lancaster's analysis also showed that manual indexing exhibited some problems that could be avoided by automated indexing. In the 1970s, a NASA study to compare automatic and manual indexing was headed by Clarendon.
When the automated search was based on titles and abstracts, the results were as good as or better than those for a controlled language using manual indexing. Additionally, for comparably sized collections, the NASA test results proved superior to those of the STAIRS system. The effectiveness of an automated indexing system depends on two characteristics. Exhaustivity is the extent to which all aspects of the document are categorized and recognized. Specificity, on the other hand, is the extent to which a single index term represents the contents of a document. More exhaustive indexing tends to increase recall; higher term specificity tends to increase precision. The best-weighted terms are those that occur frequently within a document and sparsely outside it. Both the "probabilistic retrieval model" and the "term-discrimination model" offer calculations for evaluating term weights. Four steps can improve the basic indexing process. First, apply a weighting measure to each word stem. Second, use a thesaurus to substitute for words with low frequencies within a document. Third, generate a list of high-frequency word phrases. And fourth, represent a document by a set of weights for each of the term types described above. In conclusion, there is no evidence to support the superiority of manual retrieval systems over text-based retrieval systems. Optimistically, one can only expect text-based systems to get better.

===================================

Article Summary: Another Look at Automatic Text-Retrieval Systems, Salton
Submitted by: Aleasa Chiles-Feggins, Mahmood Bahraini, Doug Walls, John Thomas and Kathleen Sgamma

Salton begins "Another Look at Automatic Text-Retrieval Systems" with an overview of some of the basic aspects of automatic text-retrieval systems and the measures of their effectiveness. He provides an overview of recall and precision, and the 80-20 rule.
The article discusses how query formulation and document indexing can be altered to achieve the desired level of recall versus precision by using term-broadening or term-narrowing devices, and provides details on these devices. This introduction sets the stage for his rebuttal of the conclusions reached by Blair and Maron in "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System". Salton believes that the Blair and Maron study of the STAIRS automatic text-retrieval system produced results that are in fact effective and typical of other similar systems, despite the negative conclusions reached. Blair and Maron essentially conclude that effective automatic text-retrieval systems are not achievable goals and are not worth pursuing. Their conclusions are particularly suspect because their own study actually produced desirable results; other available studies had provided support for automatic text-retrieval systems; and they drew their conclusions without comparing full-text retrieval versus manual systems, or large versus small document collections. Salton proceeds to give evidence that automatic indexing systems perform as well as if not better than manual systems. He also refutes Blair and Maron's concern that the costs incurred by automatic indexing involve "... 20 times the amount of information that a manually indexed system would ..." by citing studies of systems that index abstracts rather than the full text. The studies discussed by Salton include:
- The Medlars system - a controlled indexing-language system used for biomedical literature with automatic search and retrieval. The Medlars study did not support Blair and Maron's concern that high-recall searches produce "output overload," but Salton attributed this result not to the manual indexing but to the homogeneity of the biomedical collection.
In fact, Salton cites Lancaster's attribution of precision and recall failures to the manual indexing and controlled language used in Medlars. These features reduce the number of terms available for indexing and searching, giving users less flexibility in structuring their queries to match their recall/precision needs. Salton cites C.W. Cleverdon, who suggests that combining terms is better done during searching rather than during indexing. This appears to be true because manually determining the correct term to use is a subjective process that introduces human error during both input and search.
- Comparison of Manual and Automatic Indexing - a comparison of the STAR and IAA systems revealed that natural-language systems perform as well as if not better than controlled-language systems. For users desiring higher recall, the natural-language text-search system produced recall rates 20% higher than the controlled-language system while still producing a 63% rate of precision. The study also pointed out that collection size is not as important to search performance as the query and the homogeneity of the collection.
- The Aslib-Cranfield tests also produced somewhat better results for the natural-language text-search system versus the controlled-language system, even though the automatic indexing was limited to single terms. Salton indicates that complete automatic indexing will therefore produce even better results.
Having made his case for automatic indexing, and having soundly refuted Blair and Maron's results, Salton discusses some basic features and models of automatic indexing systems. Specifically, he discusses how effectiveness depends on the exhaustivity of indexing and the specificity of index terms. Both are measured by term frequency (tf) and inverse document frequency (idf). He discusses two models: probabilistic retrieval and term discrimination.
The probabilistic retrieval model produces a term-relevance weight that gives greater weight to terms found in fewer documents of a collection. The term-discrimination model evaluates the density of the document collection that results when a given term is assigned. If the document density remains high after assigning a term (as happens with a very low-frequency term), the term is of little use for distinguishing the documents from each other; likewise, if assigning the term yields results of high density (as happens with a high-frequency term), the term is also not useful for distinguishing documents. Therefore, the term-discrimination model favors terms that occur neither too frequently nor too rarely. Term-narrowing or term-broadening devices, discussed at the beginning of the article, can be used to achieve the desired level of specificity. Salton argues that term weighting should not be used just to rank the results of a search, but also to improve indexing. The Salton article was excellent for explaining fundamental concepts of automatic text-based retrieval systems and providing convincing evidence for continuing work in the development of automatically indexed systems. He provides a basic approach for automatic indexing that sums up the basic concepts of an automatic text-based retrieval system and offers improvement techniques using term weighting.

===================================

Heiman IR Document Summary

This article examines the effectiveness of automatic text-retrieval systems. Effectiveness is generally measured by two quantities: recall and precision. Recall is the proportion of all relevant documents that were retrieved; precision is how relevant the retrieved documents are to the user's needs. A paradox often occurs in retrieval systems: the higher the recall, the lower the precision, and vice versa. The author discusses several mechanisms for achieving an acceptable recall and precision rating.
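The recall and precision measures just defined can be computed directly from the sets of retrieved and relevant documents. A minimal sketch (the document identifiers are made up for illustration; the figures chosen reproduce the STAIRS averages of 20% recall and 75% precision):

```python
def recall_precision(retrieved, relevant):
    """Compute recall and precision for one query.

    recall    = |relevant documents retrieved| / |relevant documents in the collection|
    precision = |relevant documents retrieved| / |all documents retrieved|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# 4 documents retrieved, 3 of them relevant, out of 15 relevant in the collection:
# precision = 3/4 = 0.75, recall = 3/15 = 0.20 (the STAIRS-like trade-off).
r, p = recall_precision(retrieved={1, 2, 3, 4}, relevant={1, 2, 3} | set(range(10, 22)))
```

The example makes the inverse relationship concrete: retrieving more documents can only raise the numerator of recall, but it also raises the denominator of precision.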
He cites several studies that were able to achieve acceptable retrieval results using automatic indexing, and he uses these as a rebuttal of the Blair and Maron retrieval test. Blair and Maron argue that manual indexing achieves better search results. The article introduces methods of automatic indexing, including term weighting using inverse document frequency. Two approaches to retrieval theory are discussed: the probabilistic retrieval model and the term-discrimination model. Finally, the author ends with guidelines for effective automatic indexing and suggests again that automatic indexing, if implemented properly, is preferable to manual indexing.

===================================

Article Summary for IR Unit
Group 1: James Fitzgerald, Chris Klein, John Muhlenburg, Tom Kalafut

Another Look at Automatic Text-Retrieval Systems by Gerard Salton

This article was written in response to an article by Blair & Maron which stated that, for all intents and purposes, automatic retrieval systems gave no better performance than standard manual searches. The article starts by explaining some terms relevant to text-retrieval systems, such as precision and recall and the relation between the two (high precision results in low recall, while high recall results in low precision). It also gives a basic definition of term truncation and term weighting. The article then gives a synopsis of the Blair & Maron article's results, whose tests were done using the automatic text-retrieval system STAIRS. They list the three main conclusions of that article. [1] That when high recall is essential, broadening the search parameters results in too much information being returned to the user. From this, Blair and Maron argued that previous findings on the superiority of automatic text-retrieval systems may not be relevant for large collections of text. [2] That when high recall is necessary, manual indexing is preferable to full-text searching.
[3] That automatic text-retrieval systems (STAIRS in particular) are not user-friendly. The article then goes on to refute each of the points made in the Blair & Maron article. As arguments it uses test studies done with large text-retrieval systems, such as the Medlars search system, which involved searching more than 700,000 documents. They quote how that test produced high-precision results without the information overload that the Blair & Maron article predicts. They go on to point out that in the Medlars system, automatic indexing would solve some, but not all, of the problems associated with failures of manually indexed searches. They also point out that an inadequate user interface is not something unique to automatically indexed systems, but can be a problem in manually indexed systems as well. The article then moves on to compare manual and automatic indexing, pointing out specific problems with manual indexing: two skilled people creating a thesaurus will have only 60% of the index terms in common, and two people manually indexing a document using the same thesaurus will have only 30% of the indexing terms in common. They also point out differences in search intermediaries and in weighing the relevance of documents. For the comparison of manual versus automatic indexing, a test conducted in 1970 using a NASA database containing documents from Scientific and Technical Aerospace Reports (STAR) and International Aerospace Abstracts (IAA) was used as a comparison model. This comparison showed that the automatic search system had a recall rate over 20 percentage points higher than that of the manual system. The article then goes on to deal with the underlying theory of automatic indexing systems. In this section they deal with how to assign an importance measure to a term and how this can be calculated. They then describe a probabilistic retrieval model and a term-discrimination model.
The article then goes on to give an overview of how an automatic indexing system should work, touching on stemming, stop lists, thesaurus generation, and the computation of term weights. The article concludes that there is really no scientific basis for the claims made in the Blair & Maron article, and that automatic text-retrieval systems will become more powerful and more reliable as compared to conventional methods in the years to come.

===================================

From: (Group 5) Shirley Carr, Mike Joyce, Bushra Khan, Zakia Khan, Vas Madhava

IR ARTICLE SUMMARY: SALT86a

This article describes automatic text-retrieval systems from both a theoretical and a practical viewpoint. The effectiveness of a retrieval system can be evaluated on recall and precision. Although we would like high values for both, generally we trade off one for the other. Enhancing devices can be used to raise both values. Examples of recall-enhancing devices are the use of term truncation, synonyms, related terms, and broader terms. Examples of precision-enhancing devices are the use of term weighting, term phrases, term co-occurrences, and narrower terms. The Blair and Maron retrieval test, conducted with legal documents, achieved a precision of 75% and a recall of 20%. They concluded that when high recall is needed, it's not enough to just broaden the search request; doing so will just cause information overload. They also concluded that manual indexing is better than full-text searching and that most full-text systems are not user-friendly, even to expert users. Salton, however, faults their technique and asserts that for large document collections automatic systems are better than manual ones. The Medlars evaluation, done at the National Library of Medicine in the late 1960s, used manual professional indexing by subject-matter experts. The performance varied substantially, with the average recall and precision being 0.58 and 0.50, respectively.
Information overload was not a problem even for high-recall searches. They concluded that many of the search failures occurred because of the manual indexing and the controlled language that was used. Manual indexing is problematic because of the high level of subjectivity involved: indexes created by any two groups of people will be substantially different. The solution to the controlled-language issue is to use natural language. The issue of manual vs. automatic indexing and controlled language vs. natural language was addressed again in the mid-1970s with tests involving aerospace-related documents. The results were mixed: natural-language indexing produced the best recall, while controlled-language manual indexing produced the best precision. In the famous Aslib-Cranfield study, it was found that "single-term" natural-language indexing got better results than controlled-term indexing. Overall, the effectiveness of an indexing system depends on two main characteristics: exhaustivity and specificity. Exhaustivity refers to the degree to which all aspects of the document content are recognized and represented in the index, and specificity refers to the level of detail of a given index term. A more exhaustive system would increase recall, while a more specific system would increase precision. Ideally, indexing should be done based on linguistic considerations, but in actuality it's done using statistical and probabilistic methodologies. The following term-weighting function could be used to derive the best terms in a collection:

  Weight = Term Frequency * Inverse Document Frequency

where

  Term Frequency = the number of times a term appears in a document
  Inverse Document Frequency = the inverse of the number of documents in which the term appears

The best terms are those that occur frequently in one document but rarely elsewhere. The probabilistic retrieval model assumes that the most valuable documents are those whose probability of relevance to a query is the largest.
The term-relevance factor it uses gives more justification to the idf factor described above. Under the term-discrimination model, the most useful terms are those that distinguish one document from another. Thus the value of each term is measured by noting the decrease in the density of the collection when that term is assigned; the density here refers to the extent to which the documents resemble each other. High-frequency terms thus become the least desirable. Low-frequency terms are also not desirable, because they don't change the space density of the collection either. The best terms are those with medium occurrence levels. Techniques such as term broadening (e.g. using a thesaurus) and term narrowing (e.g. using phrases) can be used to change the discrimination value of a term. The following steps can be used as a blueprint for automatic indexing:
1) Identify the individual words.
2) Use a stop list to weed out irrelevant terms.
3) Use suffix stripping to get to the stem form.
4) Compute a term weight for each term.
5) Represent each document by a chosen set of weighted word stems.
This process can be improved by doing the following:
1) Generating weighted word stems that are attached to the documents.
2) Using a thesaurus to replace terms with low document frequencies.
3) Using phrase formation for high-frequency terms.
4) Representing each document by a chosen set of weighted word stems, thesaurus classes, and term phrases.
In addition, the queries themselves can be converted to weighted sets of terms as described above. Then fuzzy matches between the query and the document sets can be produced to obtain a ranked output of documents. Overall, the article concludes that machine text processing will get better as time goes by, while conventional processing can't get any better.
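The query-matching step just described can be sketched by scoring each document's weighted term set against a weighted query. The cosine similarity used here is one common choice of "fuzzy match" and an illustrative assumption, not necessarily the exact matching function the article has in mind; the term and document names are made up:

```python
import math

def similarity(query_weights, doc_weights):
    """Cosine similarity between two {term: weight} vectors (a common fuzzy match)."""
    dot = sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())
    qn = math.sqrt(sum(w * w for w in query_weights.values()))
    dn = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (qn * dn) if qn and dn else 0.0

def rank(query_weights, docs):
    """Return document ids ordered by decreasing similarity to the query,
    i.e. the ranked output described in the article."""
    scores = {doc_id: similarity(query_weights, dw) for doc_id, dw in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

query = {"index": 1.0, "automat": 0.5}
docs = {
    "d1": {"index": 2.0, "manual": 1.0},   # shares the term "index" with the query
    "d2": {"retriev": 1.0},                # shares no terms with the query
}
order = rank(query, docs)  # "d1" ranks ahead of "d2"
```

Because matching is done on weighted term overlap rather than exact Boolean conditions, documents that share only some query terms still receive a nonzero score and appear lower in the ranking instead of being discarded.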