Review of Salton, Fox, Wu "Extended Boolean Information Retrieval"
by Rick Compton, Fred L. Drake, Jr., Mark Missana, and Stephen Williams

In "Extended Boolean Information Retrieval", Salton, Fox and Wu introduce an extended Boolean document retrieval model that uses a query syntax similar to that of a standard Boolean query and allows for the retrieval of items not retrievable with a normal Boolean system. Retrieved documents are also ranked in decreasing order of document-query similarity. The model evaluated here is called the "p-norm" model.

The Boolean retrieval model, though good for database record retrieval, where all terms are known, is a poor model for document retrieval. Boolean queries are rigid and are unable to rank retrieved documents according to relevance to the initial query. The "p-norm" retrieval model allows the user to write a more general query that supports both term weighting and Boolean operator parameters that vary the "strictness" of each operator.

The "p-norm" model calculates the similarity of a document to the given query as the normalized distance between the document values and the query values in n-dimensional space, where n is the number of query terms. The parameter 'p' selects the "order" of normalization. A p value of 1 gives the average. A p value of 2 gives the Euclidean distance. Furthermore, a p value of infinity gives the MIN (for AND) or MAX (for OR) of the document's term weights. The p = infinity model is equivalent to a strict "fuzzy Boolean retrieval". A p value of 1 corresponds to a vector norm retrieval. Using the same retrieval model, Salton, Fox and Wu were able to parameterize the effectiveness of the spectrum of retrieval models by simply varying the value of 'p'.

Four document collections were used that represented a wide variety of collection types and sizes. Each query was performed using four retrieval methods:

- Straight Boolean query (binary weights).
- 'p-norm' with 1 <= p <= 9. Documents have binary weights.
- 'p-norm' with 1 <= p <= infinity. Documents have weights that are proportional to inverse document frequency times term frequency.
- Vector processing with cosine match and natural language query.

The results of their analyses show that the Boolean retrieval method fared poorly with all sample collections. The 'p-norm' model with a 'p' value of 1 or 2 and proportionally weighted terms did the best. In three out of the four document collections, the collection size was inversely related to the precision. An optimal retrieval was shown for comparison. Those values show that there is still much room for improvement.

In conclusion, the 'p-norm' retrieval model is a vast improvement over the strict Boolean query interpretation. The ability to vary the value of 'p' for each operator, and to weight each term and phrase, gives the user much more control over the recall and precision of the retrieved documents. Retrieved items are also given to the user in ranked order, with the most relevant items at the top of the list. This form of document retrieval seems to be superior both in its precision and usability.

= = = = = = = = = = = == = = = = = = = = = = = == =

5604n - Salt83d Article Summary
Group 5
Shirley Carr, Mike Joyce, Bushra Khan, Zakia Khan, Vas Madhava

Salton, Fox and Wu, November 1983, "Extended Boolean Information Retrieval," Communications of the ACM, Vol. 26, No. 11, pp. 1022-1036.

In the article "Extended Boolean Information Retrieval" Salton and his coauthors propose associating an operator weight with the AND and OR Boolean operators in a query. For example, instead of AND and OR, Boolean operators are expressed as AND(12) (which means "AND with weight 12") and OR(2). After establishing the conceptual foundation, the authors suggest general guidelines for operator weights based upon the performance of the model as evaluated with four document collections. The article also offers several potential applications of the proposed model.
The starting point in the article is a discussion of various information retrieval models, such as the conventional Boolean model and the weighted-document-terms model. The authors' so-called p-norm model is then introduced. The p-norm model introduces both document term weights and query term weights into the query-document similarity calculation, as well as coefficients (p) for how strictly the operators are interpreted.

Using the p-norm model as a reference, Salton observes that the effect of a standard vector-processing retrieval model is obtained when p is set to 1. When p is set to infinity and the query and document term weights are limited to 0 or 1, it produces a conventional Boolean retrieval model. And finally, for values of p strictly between 1 and infinity, it produces an intermediate retrieval model.

p = 1, and weights                  vector-processing retrieval model
1 < p < infinity, and weights       intermediate retrieval model
p = infinity, and binary weights    Boolean retrieval model

Examples of queries using the various models are given below.

Vector query
  (catalogue, 6.3), (catalog, 5.3), (mechanization, 4.1), (automation, 2.7), (computerization, 1.6)

Weighted query
  and(p1)( (or(p2)((catalogue, 6.3), (catalog, 5.3)), 5.9),
           (or(p2)((mechanization, 4.1), (automation, 2.7), (computerization, 1.6)), 2.8) )

Boolean query
  (catalogue or catalog) and (mechanization or automation or computerization)

The article also included a series of evaluations that looked at the performance of the p-norm model in various query configurations. Both recall and precision were evaluated using collections covering biomedicine, library science, electrical engineering, and computer-related areas. In general, the results indicate that the p-norm model provides substantial improvements in retrieval effectiveness over the conventional Boolean or vector systems. While the value of p could be tuned for specific query requirements, the best results were obtained with p values between 2 and 5.
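The way p moves a query between the vector and Boolean extremes can be illustrated with a small Python sketch. This is only an illustration under stated assumptions: the function names `pnorm_or` and `pnorm_and` are ours, the document weights are hypothetical values in [0, 1], and the formulas are the commonly cited p-norm forms rather than code taken from the article.

```python
import math

def pnorm_or(doc_weights, query_weights, p):
    """Similarity of a document to an OR clause interpreted with strictness p."""
    pairs = list(zip(doc_weights, query_weights))
    if math.isinf(p):
        # Limit of the p-norm; with equal query weights this is the MAX document weight.
        return max(a * d for d, a in pairs) / max(a for _, a in pairs)
    num = sum((a * d) ** p for d, a in pairs)
    den = sum(a ** p for _, a in pairs)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p):
    """Similarity of a document to an AND clause interpreted with strictness p."""
    pairs = list(zip(doc_weights, query_weights))
    if math.isinf(p):
        # Limit of the p-norm; with equal query weights this is the MIN document weight.
        return 1.0 - max(a * (1.0 - d) for d, a in pairs) / max(a for _, a in pairs)
    num = sum((a * (1.0 - d)) ** p for d, a in pairs)
    den = sum(a ** p for _, a in pairs)
    return 1.0 - (num / den) ** (1.0 / p)

# Hypothetical two-term document and an unweighted query (assumed values).
doc = [0.8, 0.2]
query = [1.0, 1.0]
for p in (1, 2, 5, math.inf):
    print(p, round(pnorm_or(doc, query, p), 3), round(pnorm_and(doc, query, p), 3))

# Nested evaluation of the weighted query above, with hypothetical document weights:
# each or-clause score is fed to the outer and-operator as if it were a term weight.
or1 = pnorm_or([0.0, 0.9], [6.3, 5.3], 2)            # catalogue, catalog
or2 = pnorm_or([0.0, 0.6, 0.0], [4.1, 2.7, 1.6], 2)  # mechanization, automation, computerization
score = pnorm_and([or1, or2], [5.9, 2.8], 2)
```

At p = 1 the two operators give the same score (0.5 for this document), and at p = infinity they give the MAX (0.8) and MIN (0.2) of the document weights, which matches the table above.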
At the end of the article Salton proposes using the operator weighting capabilities of the p-norm model in a relevance feedback system. The proposed process starts with a conventional Boolean query submitted by the user. Based upon these selection criteria, a set of potentially relevant documents is retrieved. Because the retrieval is based upon a Boolean query, the documents are unranked. A "p-norm" version (i.e., with document, term, and operator weights assigned) of the original query is prepared and processed against the set of potentially relevant documents just retrieved. This processing ranks the documents by their relevance to the query. The ranked set of documents is presented to the user. Documents identified as relevant by the user are used to generate new operator weights in an improved version of the query.

===================================

Article Summary for IF
Salton, Fox & Wu - Extended Boolean Information Retrieval
Group 2: Lauren Barton, Martin Falck, Nelson Kile, Carolyn O'Hare, Robert Ryan

This article introduces a new (at the time of writing) Boolean retrieval scheme. First, to build the case for improvements, some of the limitations of conventional Boolean retrieval strategies and the vector processing model are discussed. Later progress, notably the introduction of fuzzy set logic, is mentioned as the background for this extended retrieval scheme. The algorithm is introduced by graphically illustrating distance measurements as coordinates from a point 0. From this, formulas are derived for calculating similarity measures between queries and documents. In this p-Norm model, both weighted document and weighted query terms are used and a p value is assigned to the Boolean connectives. This extended Boolean retrieval model does require some modification of equivalence properties. Using p values varying between 1 and infinity, the results are then compared with conventional and vector processing models.
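For readers who want the formulas themselves, the similarity measures can be written roughly as follows. The notation is our own assumption (d_i for the document weight of term i, a_i for its query weight, n query terms), so the symbols may differ from the article's.

```latex
% Two-term case, unweighted query, Euclidean distance (p = 2):
\mathrm{sim}(D, Q_{\mathrm{or}})  = \sqrt{\frac{d_1^2 + d_2^2}{2}}
\qquad
\mathrm{sim}(D, Q_{\mathrm{and}}) = 1 - \sqrt{\frac{(1 - d_1)^2 + (1 - d_2)^2}{2}}

% General p-norm case with query term weights a_i, 1 \le p \le \infty:
\mathrm{sim}(D, Q_{\mathrm{or}(p)})
  = \left[ \frac{\sum_{i=1}^{n} a_i^p \, d_i^p}{\sum_{i=1}^{n} a_i^p} \right]^{1/p}
\qquad
\mathrm{sim}(D, Q_{\mathrm{and}(p)})
  = 1 - \left[ \frac{\sum_{i=1}^{n} a_i^p \, (1 - d_i)^p}{\sum_{i=1}^{n} a_i^p} \right]^{1/p}
```

The or-similarity grows with the distance of the document from the point where no query term is present, while the and-similarity grows as the document approaches the point where every query term is fully present. Setting p = 2, n = 2 and all a_i = 1 in the general forms gives back the two-term Euclidean case.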
The next section lays out a design using extended retrieval, using varying term weights and p values to tune the system to desired term importance and query formulation. By mixing p values and analyzing the results, this scheme can be useful in relevance feedback. The inverse document frequency is used as a measure of term importance.

A formal study utilizing four document collections is used to demonstrate the effectiveness of the p-Norm model. Metrics used were average precision at three recall points and percent difference from conventional Boolean, among others. The results show improvements in retrieval effectiveness when p values are lower. Similar results are found for the weighted p-Norm model. The article concludes by suggesting how this model may be implemented while maintaining compatibility with conventional retrieval environments.

= = = = = = = = = = = == = = = = = = = = = = = == =

"Extended Boolean Information Retrieval" by Salton, Fox, and Wu
IF Article Summary by Group I: Kalafut, Muhlenburg, Klein, Fitzgerald

Conventional Boolean information retrieval systems usually use an inverted file of key words or index terms to retrieve documents based on a query with terms joined by and's and or's. This conventional Boolean method has several drawbacks: the output size cannot be controlled, the output cannot be ranked, query terms cannot be weighted, and the output can be counterintuitive. The vector-processing retrieval model resolves the last 2 problems but lacks the Boolean model's inherent query structure. The fuzzy-set model allows for both document term weights and output ranking, but it also suffers from the lack of output document discrimination almost as much as the Boolean model.

The article introduces an extended Boolean model with both weighted query and weighted document terms. There are 3 possible document classes for two-term queries: those with both query terms, those with only one, and those with none.
The basis of the extended Boolean model is that in a two-term query, point (1,1) represents both terms present in a document, and point (0,0) represents neither query term being present. Based on that, similarity between a document and a query is basically a Euclidean distance: for an or query, how far the document lies from (0,0); for an and query, how close it lies to (1,1). The extended Boolean model generalizes this to the p-norm model, with weights on both the query and the document terms, to determine the similarity between a query and a particular document. Then we can vary p to obtain a retrieval system between the standard vector-processing model (p=1) and the conventional Boolean model (p=infinity). The higher the p value, the stricter the and's and or's become.

The extended retrieval system has 3 main advantages: structured queries, term weights, and the variability of query structure interpretation through varying the p value. The p-norm metric gives rise to phrases for narrowing broad, high-frequency terms and thesaurus classes for broadening specific, low-frequency terms. With a p value of infinity, the and operator represents a strict phrase assignment: unless all of the query terms are present, the document is disregarded. The or operator with a p value of infinity implements a strict thesaurus: any term being present selects the document. With more medial values of p, like 3, the and's and or's are looser, meaning that more matching terms still count for more than fewer, and the lowest value of p (p=1) means all terms are independent, so and's and or's are no longer distinguished.

The article continues by suggesting the use of the 2 weighting factors - term frequency and inverse document frequency - before presenting some results using these factors. The article then presents some results of the extended retrieval system as compared to the strict Boolean and the strict vector methods, run on 4 standard document collections, which suggest that the extended system with p values between 2 and 5 performs best.
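To make the two weighting factors concrete, here is a minimal Python sketch of term frequency times inverse document frequency, scaled so that document weights stay in [0, 1] as the p-norm formulas expect. The function name `tfidf_weights`, the sample documents, and the particular scaling (dividing by the largest term frequency in the document and by the largest idf in the collection) are our assumptions for illustration, not details taken from the article.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) for term in df}
    # Fall back to 1.0 if every term occurs in every document (all idf values are 0).
    max_idf = max(idf.values()) or 1.0
    weights = []
    for doc in docs:
        tf = Counter(doc)
        max_tf = max(tf.values()) if tf else 1
        # Assumed scaling: tf relative to the document's most frequent term,
        # idf relative to the rarest term in the collection.
        weights.append({t: (tf[t] / max_tf) * (idf[t] / max_idf) for t in tf})
    return weights

# Hypothetical toy collection.
docs = [
    ["catalog", "automation", "catalog"],
    ["catalog", "library"],
    ["automation", "computer", "library"],
]
w = tfidf_weights(docs)
```

In this toy collection the rarest term ("computer") receives the highest weight in its document, and a term repeated within a document outweighs a term of equal rarity that occurs once.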
After restating the article's conclusion that the extended retrieval model appears to perform the best, the authors suggest an iterative approach to implementing the extended model on a retrieval system with an inverted file structure already in place, which is also an ongoing topic of research.
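One way to picture layering the extended model on an existing inverted file is a two-stage process: a conventional Boolean lookup selects candidates, and a p-norm pass then ranks them. The Python sketch below is our own simplified illustration, not the authors' implementation; the index contents, the document weights, the helper names, and the choice of unweighted query terms with p = 2 are all assumptions.

```python
def boolean_candidates(index, clauses):
    """Conventional Boolean lookup for an AND of OR-clauses using an inverted index."""
    result = None
    for clause in clauses:
        hits = set()
        for term in clause:
            hits |= index.get(term, set())
        result = hits if result is None else result & hits
    return result or set()

def pnorm_or(ds, p):
    # Unweighted p-norm OR over document weights ds.
    return (sum(d ** p for d in ds) / len(ds)) ** (1.0 / p)

def pnorm_and(ds, p):
    # Unweighted p-norm AND over document weights ds.
    return 1.0 - (sum((1.0 - d) ** p for d in ds) / len(ds)) ** (1.0 / p)

def rank(candidates, weights, clauses, p_and=2.0, p_or=2.0):
    """Order the Boolean result set by p-norm similarity, best first."""
    scored = []
    for doc_id in candidates:
        w = weights[doc_id]
        clause_scores = [pnorm_or([w.get(t, 0.0) for t in clause], p_or)
                         for clause in clauses]
        scored.append((pnorm_and(clause_scores, p_and), doc_id))
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Hypothetical document term weights; the inverted index is derived from them.
weights = {
    "d1": {"catalog": 0.9, "automation": 0.8},
    "d2": {"catalogue": 0.3, "mechanization": 0.2},
    "d3": {"catalog": 0.7},
}
index = {}
for doc_id, terms in weights.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

clauses = [["catalogue", "catalog"],
           ["mechanization", "automation", "computerization"]]
candidates = boolean_candidates(index, clauses)
ranking = rank(candidates, weights, clauses)
```

Here the Boolean stage drops d3 (it matches no term of the second clause) and the p-norm stage orders the remaining documents, placing d1 ahead of d2. A feedback step that adjusts p or the weights from user judgments would sit on top of this loop and is not shown.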