Unit RR, Part C: Probabilistic Model

For a thorough introduction to the probabilistic model, covering work through 1979, see Chapter 6 of the monograph by Van Rijsbergen.


Explanation Adapted from Van Rijsbergen's Chapter 6

Our object is to compute, for a given query,
P(relevance/document)

Let us assume (following Robertson) that:
The relevance of a document to a request is independent of other documents in the collection.

Then, following Maron and Cooper, we can now state the probability ranking principle:
If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data.

Notation

We now use a special notation to describe a binary document vector:
x = (x1, x2, ..., xn)

where xi = 0 or 1 indicates absence or presence of the ith index term.

Now we consider the two events:

w1 = document is relevant

w2 = document is non-relevant.

Since we cannot estimate P(wi/x) directly, we use Bayes' Theorem:

P(wi/x) = P(x/wi) P(wi) / P(x)

where P(wi) is the prior probability of relevance (i=1) or non-relevance (i=2), and the factor P(x/wi) is proportional to what is commonly known as the likelihood of relevance or non-relevance given x.
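
As a small illustration of the Bayes' Theorem step, the Python sketch below computes P(w1/x) from the two likelihoods and the prior. It assumes that w1 and w2 are the only two possible events, so that P(x) = P(x/w1) P(w1) + P(x/w2) P(w2). The function name and the numbers are hypothetical and chosen only for illustration.

```python
def posterior_relevance(like_rel, like_non, prior_rel):
    """Return P(w1/x) given P(x/w1), P(x/w2) and the prior P(w1)."""
    prior_non = 1.0 - prior_rel  # P(w2), assuming w1 and w2 are exhaustive
    # P(x), obtained by summing over the two events
    evidence = like_rel * prior_rel + like_non * prior_non
    return like_rel * prior_rel / evidence


# Hypothetical values, not taken from any real collection.
posterior = posterior_relevance(like_rel=0.2, like_non=0.05, prior_rel=0.1)
print(round(posterior, 4))
```

Because P(x) is the same for both events, it cancels whenever the two posteriors are compared, which is why only the likelihoods and priors matter for ranking.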

Binary Independence Model

We make the major assumption of term independence:

P(x/wi) = P(x1/wi) P(x2/wi) ... P(xn/wi)

Then, to simplify the equations we define:

pi = P(xi = 1/w1)

qi = P(xi = 1/w2)

The likelihood functions then are

P(x/w1) = PRODUCT(i=1 to n) (pi**xi) ((1 - pi)**(1-xi))

P(x/w2) = PRODUCT(i=1 to n) (qi**xi) ((1 - qi)**(1-xi))

Thus, for example, P((0,1,1,0,0,1)/w1) = (1 - p1)p2p3(1 - p4)(1 - p5)p6.
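
The product form of the likelihood can be checked with a few lines of Python. This is only a sketch: the function name and the probability values are assumptions made for illustration, and the vector is the one used in the example above.

```python
from math import prod


def likelihood(x, probs):
    """P(x/w) under term independence, where probs[i] = P(xi = 1/w)."""
    if len(x) != len(probs):
        raise ValueError("x and probs must have the same length")
    # Each term contributes its probability if present, 1 - probability if absent.
    return prod(p if xi == 1 else 1.0 - p for xi, p in zip(x, probs))


# Hypothetical values of p1..p6 (probability of each term given relevance).
p = [0.5, 0.8, 0.6, 0.1, 0.2, 0.9]
x = (0, 1, 1, 0, 0, 1)

# Equals (1 - p1) p2 p3 (1 - p4) (1 - p5) p6
value = likelihood(x, p)
print(value)
```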

Going back to Bayes' Theorem, we substitute, take logs, and get a linear discriminant function where the coefficient for xi (which is essentially a term weight) becomes:

log { [pi (1 - qi)] / [qi (1 - pi)] }
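
To make the substitution step concrete: the log of P(w1/x) / P(w2/x) can be written as SUM(i=1 to n) ci xi + C, where ci is the term weight log { [pi (1 - qi)] / [qi (1 - pi)] } and C = SUM(i=1 to n) log [(1 - pi) / (1 - qi)] + log [P(w1) / P(w2)] does not depend on x. The Python sketch below checks numerically that the linear form agrees with the direct computation. All names and numbers are hypothetical, and every pi and qi is assumed to lie strictly between 0 and 1.

```python
from math import isclose, log, prod


def term_weight(p, q):
    """Coefficient of xi: log of [p (1 - q)] / [q (1 - p)]."""
    return log((p * (1.0 - q)) / (q * (1.0 - p)))


def log_odds_direct(x, p, q, prior_rel):
    """log of P(w1/x) / P(w2/x), computed from the two likelihood products."""
    like_rel = prod(pi if xi else 1.0 - pi for xi, pi in zip(x, p))
    like_non = prod(qi if xi else 1.0 - qi for xi, qi in zip(x, q))
    return log((like_rel * prior_rel) / (like_non * (1.0 - prior_rel)))


def log_odds_linear(x, p, q, prior_rel):
    """The same quantity as a linear function of x plus a constant."""
    constant = sum(log((1.0 - pi) / (1.0 - qi)) for pi, qi in zip(p, q))
    constant += log(prior_rel / (1.0 - prior_rel))
    weighted = sum(xi * term_weight(pi, qi) for xi, pi, qi in zip(x, p, q))
    return weighted + constant


# Hypothetical parameters for a three-term vocabulary.
p = [0.8, 0.6, 0.3]
q = [0.2, 0.5, 0.4]
x = (1, 0, 1)

direct = log_odds_direct(x, p, q, prior_rel=0.1)
linear = log_odds_linear(x, p, q, prior_rel=0.1)
print(isclose(direct, linear))
```

Since C is the same for every document, ranking by the sum of the weights of the terms present gives the same order as ranking by P(w1/x).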

Retrospective Evaluation

If we estimate

pi = r / R

qi = (n-r) / (N-R)

we get the F4 formula of Robertson and Sparck Jones

log [ { r / (R-r) } / { (n-r) / (N-n-R+r) } ]

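with the following variable definitions:

N = number of documents in the collection

R = number of documents relevant to the query

n = number of documents containing term i

r = number of relevant documents containing term i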
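
A minimal Python sketch of the F4 weight follows. The function name and the example counts are hypothetical, and the reading of the four counts is stated in the docstring as an assumption inferred from the estimates of pi and qi. The formula is undefined when any of the four cells is zero, so the sketch simply rejects such input; a commonly used remedy is to add 0.5 to each cell before taking the ratio.

```python
from math import log


def f4_weight(r, R, n, N):
    """F4 term weight from retrospective counts.

    Assumed meaning of the counts:
      N = documents in the collection, R = relevant documents,
      n = documents containing the term,
      r = relevant documents containing the term.
    """
    cells = (r, R - r, n - r, N - n - R + r)
    if any(c <= 0 for c in cells):
        raise ValueError("every cell of the contingency table must be positive")
    relevant_odds = r / (R - r)                   # pi / (1 - pi)
    nonrelevant_odds = (n - r) / (N - n - R + r)  # qi / (1 - qi)
    return log(relevant_odds / nonrelevant_odds)


# Hypothetical counts: 1000 documents, 20 relevant, term in 100 documents,
# 10 of which are relevant.
weight = f4_weight(r=10, R=20, n=100, N=1000)
print(weight)
```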

Summary