Extended Boolean Queries and Retrieval
Problems with Boolean
- A AND B AND C AND D AND E --- if a document misses even one term
- get nothing, instead of documents with 4 of the terms, or later those with 3, etc.
- don't have an easy way to reformulate for all the combinations
- A OR B OR C OR D OR E --- if a document has several of the terms
- counts just the same as one that has only a single term
- don't have an easy way to show a preference for more than one occurrence
- A NOT B --- eliminates documents with even a casual use of term B
- No ranking
- so users must fuss with retrieved set size, structural reformulation
- so users must scan entire retrieved set
- No weights on query terms
- so users cannot give more importance to some terms --- retrieval:2 AND system:1
- so users cannot give more importance to some clauses --- retrieval:1 AND (MMM OR Paice):2
- No weights on document terms
- so indexers are forced to make strict binary decisions --- forcing fewer index terms and lower recall
- so no use can be made of importance of a term in a document --- if occurs frequently
- so no use can be made of importance of a term in the collection --- if occurs rarely
Fuzzy Set Theory
- Zadeh since 1965
- Studied here in EE
- Recently adopted in Japan: numerous patents: fuzzy controls, shower heads
- Start with notion of sets for: tall, small, large, bright, kind, ...
- Use range [0,1] instead of choice (0,1)
- Redefine AND as MIN
- Redefine OR as MAX
- Evaluate NOT B as 1 - value(B)
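As a minimal sketch, the three fuzzy operators above can be written directly over membership values in [0,1] (function names are illustrative):

```python
def fuzzy_and(a, b):
    """Fuzzy AND: the minimum of the two membership values."""
    return min(a, b)

def fuzzy_or(a, b):
    """Fuzzy OR: the maximum of the two membership values."""
    return max(a, b)

def fuzzy_not(a):
    """Fuzzy NOT: the complement of the membership value."""
    return 1.0 - a

# Example: membership 0.8 in "tall", 0.3 in "bright"
print(fuzzy_and(0.8, 0.3))  # 0.3
print(fuzzy_or(0.8, 0.3))   # 0.8
```

Note that MIN/MAX reduce to classical AND/OR when the memberships are restricted to {0, 1}.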
Applying Fuzziness to IR
- If want Boolean laws to apply, must use MIN/MAX definitions.
- Can apply to automatic document indexing with term weight =
- 0, if term not present in document;
- 0.5 + 0.5 * (TF / MAX-TF), if term is present in the document (MAX-TF = highest TF of any term in that document);
- some reduced value, if a related term is present instead.
- Have no simple way to consider query term weights.
- Still have problems:
- A AND B AND C AND D AND E --- only term with lowest value counts
- A OR B OR C OR D OR E --- only term with highest value counts
- Computational and space costs are higher than for Boolean.
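The indexing rule above can be sketched as follows; the function name and raw term-frequency inputs are assumptions for illustration (the reduced value for related terms is omitted):

```python
def fuzzy_term_weight(tf, max_tf):
    """Fuzzy indexing weight of a term in a document:
    0 if the term is absent, else 0.5 + 0.5 * (TF / MAX_TF),
    where max_tf is the highest TF of any term in the document."""
    if tf == 0:
        return 0.0
    return 0.5 + 0.5 * (tf / max_tf)

# A present term always gets weight in (0.5, 1.0]:
print(fuzzy_term_weight(5, 10))   # 0.75
print(fuzzy_term_weight(10, 10))  # 1.0
print(fuzzy_term_weight(0, 10))   # 0.0
```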
MMM Model
- Idea: generalize MIN and MAX by redefining AND and OR as linear combination of them:
- AND: Cand * MIN + (1-Cand) * MAX
- OR: Cor * MAX + (1-Cor) * MIN
- Good values seem to be Cand in [0.5,0.8] and Cor in [0.2,1].
- Problem: still considers only 2 terms (the one with the lowest weight and the one with the highest weight) as opposed to all terms in the query.
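A sketch of the MMM operators over a document's weights for the query terms, with the softness coefficients as parameters (defaults chosen from the ranges above; names are illustrative):

```python
def mmm_and(weights, c_and=0.7):
    """MMM AND: Cand * MIN + (1 - Cand) * MAX of the term weights."""
    return c_and * min(weights) + (1 - c_and) * max(weights)

def mmm_or(weights, c_or=0.7):
    """MMM OR: Cor * MAX + (1 - Cor) * MIN of the term weights."""
    return c_or * max(weights) + (1 - c_or) * min(weights)

weights = [0.2, 0.5, 0.9]
print(mmm_and(weights))  # 0.7*0.2 + 0.3*0.9 = 0.41
print(mmm_or(weights))   # 0.7*0.9 + 0.3*0.2 = 0.69
```

With c_and = 1 and c_or = 1 the operators degenerate to the pure fuzzy MIN/MAX; the middle weight 0.5 never affects the result, which is exactly the problem noted above.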
Paice Model
- Idea: consider all of the terms in the query.
- Idea: use a normalized geometric series, down-weighting the contribution of terms not close to the fuzzy set value (i.e., MIN for AND, MAX for OR).
- Formula has a single coefficient, r, which works well set to 1 for AND queries and 0.7 for OR queries.
- Sort document terms based on their weight:
- in ascending order for AND queries;
- in descending order for OR queries.
- Evaluate similarity for that document by dividing
- SUM (for i in [1,n]) of r**(i-1) * d_i, where d_i is the weight of the i-th term in sorted order,
- by the normalization value
- SUM (for i in [1,n]) of r**(i-1)
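The Paice evaluation can be sketched as below; the sorting direction and the role of r follow the notes, while the function signature is illustrative:

```python
def paice(weights, r, is_and):
    """Paice similarity: sort the document's weights for the query terms
    (ascending for AND, descending for OR), then take a geometric-series
    weighted sum normalized by the sum of the coefficients."""
    ws = sorted(weights, reverse=not is_and)
    numerator = sum(r**i * w for i, w in enumerate(ws))   # r**(i-1) * d_i, 1-based
    denominator = sum(r**i for i in range(len(ws)))
    return numerator / denominator

weights = [0.2, 0.5, 0.9]
print(paice(weights, r=1.0, is_and=True))    # mean of all weights
print(paice(weights, r=0.7, is_and=False))   # highest weight dominates
```

With r = 1 every term counts equally (the result is the arithmetic mean); as r shrinks toward 0, the operator approaches pure fuzzy MIN (for AND) or MAX (for OR).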
P-Norm Model
- Idea: consider all of the terms in the query.
- Idea: parameterize the strictness of each AND or OR operator with a p-value.
- Idea: have a general model, p-norm, that has as special cases the standard Boolean model (with fuzzy set interpretation --- when p is infinity) and the vector-space model (with inner-product similarity --- when p is one).
- Thus we get a spectrum of models with decreasing strictness, i.e., strict AND ... soft AND ... vector ... soft OR ... strict OR:
- p-norm AND with p=infinity behaves like strict Boolean AND (i.e., MIN)
- p-norm AND with p at moderate values softens the strictness of the AND
- p-norm AND with p=1 behaves like p-norm OR with p=1 and behaves like vector space model
- p-norm OR with p at moderate values softens the strictness of the OR
- p-norm OR with p=infinity behaves like strict Boolean OR (i.e., MAX)
- Idea: use L-p family of norms to compute similarity by measuring:
- distance from 0 point (i.e., none of query terms present) for OR;
- 1 - distance from 1 point (i.e., all of query terms present) for AND.
- Idea: visualize all this with equi-similarity contours at fixed p-values.
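Ignoring query-term weights for simplicity, the p-norm similarities can be sketched as below (an unweighted sketch, not the full model): OR measures the normalized L-p distance from the 0 point, and AND measures 1 minus the distance from the 1 point. At p = 1 both reduce to the mean of the document weights (the vector-style case), and as p grows they approach MAX and MIN respectively:

```python
def pnorm_or(weights, p):
    """Unweighted p-norm OR: normalized L-p distance from the all-zeros point."""
    n = len(weights)
    return (sum(w**p for w in weights) / n) ** (1 / p)

def pnorm_and(weights, p):
    """Unweighted p-norm AND: 1 - normalized L-p distance from the all-ones point."""
    n = len(weights)
    return 1 - (sum((1 - w)**p for w in weights) / n) ** (1 / p)

weights = [0.2, 0.5, 0.9]
print(pnorm_or(weights, 1))    # mean: same as pnorm_and with p=1
print(pnorm_or(weights, 100))  # close to MAX = 0.9
print(pnorm_and(weights, 100)) # close to MIN = 0.2
```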
Comparison of Extended Boolean Models
- All seem to work best when AND is interpreted fairly strictly, and OR is interpreted less strictly.
- All are computationally more expensive than Boolean, but at the same time are more effective (i.e., precision at given recall level).
- Computational costs seem to be (in the general case): MMM < Paice < P-norm
- Effectiveness (i.e., precision at given recall level) seems to be: MMM < Paice < P-norm
Implementation Issues
- Need to parse and represent queries (with clause and term weights).
- One way to evaluate "similarity" for a document is to "walk" the query tree in a depth-first traversal --- can be done by recursive evaluation.
- Need to store document weights (unless assume binary weights, or compute at retrieval time based on postings or other statistics).
- Can first do standard Boolean processing and then use an extended Boolean model to prepare a ranking for those retrieved.
- However, to improve recall, should really retrieve all documents that have any of the query terms, and then compute "similarity" for those, to get a full ranking.
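The depth-first recursive evaluation mentioned above can be sketched as follows, here with plain fuzzy MIN/MAX operators and a nested-tuple query representation (both illustrative assumptions; MMM, Paice, or p-norm operators could be plugged in at each internal node):

```python
def evaluate(node, doc_weights):
    """Depth-first walk of a query tree given as nested tuples:
    ('AND', child, ...), ('OR', child, ...), ('NOT', child), or a term string."""
    if isinstance(node, str):                  # leaf: a query term
        return doc_weights.get(node, 0.0)      # 0 if term absent from document
    op, *children = node
    values = [evaluate(c, doc_weights) for c in children]
    if op == 'AND':
        return min(values)                     # fuzzy AND
    if op == 'OR':
        return max(values)                     # fuzzy OR
    if op == 'NOT':
        return 1.0 - values[0]                 # fuzzy NOT
    raise ValueError(f'unknown operator: {op}')

query = ('AND', 'retrieval', ('OR', 'MMM', 'Paice'))
doc = {'retrieval': 0.8, 'Paice': 0.6}
print(evaluate(query, doc))  # min(0.8, max(0.0, 0.6)) = 0.6
```

Ranking is then a matter of calling evaluate once per candidate document (all documents containing any query term, per the recall point above) and sorting by the returned similarity.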