Daily summary 9/19 Chris Ye

Today we discussed unit SS part B in class, concentrating mainly on the "Shift-Or" operation. The "Shift-Or" operation works with bit masks that record the positions at which each character occurs in the pattern.

We also started Unit RR with the Basic Vector Space Model. The example in the notes is a three-dimensional vector space, but it scales easily to an n-dimensional one. In our three-dimensional case, the dimensions are system, information, and retrieval. A document is located somewhere in this three-dimensional space, and its location depends on how strongly it relates to each axis. We can now tell how a document is related to others by looking at those positions. We can also position a query in the vector space in the same way as the documents.

To measure the similarity of two documents, we borrowed the concept of the vector inner product: the similarity between two points is measured by the cosine of the angle between their two vectors.

At the end of class, we went through the probabilistic model.

= = = = = == = = === = = = = = = = = = ==== = = == = = = == = =

Class Summary, Sep 19 -- Srinivas R. Gaddam

One of the string matching algorithms, the Shift-Or algorithm, was explained. The difference between this algorithm and the others is that here the pattern is preprocessed and not the text. This algorithm has some advantages over the other algorithms and is suitable for hardware implementation.

Unit 5 was then continued. There are two models to make use of the different features -- Vector Space Model and Probabilistic Model.

Vector Space Model: In this model, each document is represented by a point in a t-dimensional vector space, where each dimension represents a term or concept found in the documents. A set of documents can be represented by a centroid point. In this model, a query can also be represented by a point and as such can be treated as a document.
This increases the model's functionality because queries and documents become interchangeable, which the Boolean model does not allow. In this way an entire document can be used as a query. When a query is given, it is located in the vector space, and the set of all documents "near" this point is the result of the query. Similarity between two points is measured by the cosine of the angle between the query and document vectors.

Vector Feedback can be used to improve the results obtained. There are three scenarios, among them:
retrospective - everything is known, including relevance
predictive - results from initial searches are available, and some relevant documents are used

Formulas: the Rocchio method uses the information about all the nonrelevant documents, whereas the Ide method uses only the nonrelevant documents with the highest value. One of these methods is used to shift the point that represents the query closer to the documents that are relevant. Normally, three such shifts (or iterations) are sufficient to get the result.

Probabilistic Model: The assumption that must be made in order to apply probability theory to information retrieval was stated and justified. It is: "The relevance of a document to a request is independent of the other documents in the collection." This is justified because there are so many documents that their dependence on one another can be ignored. The probability ranking principle states that if the documents resulting from a query are ranked in decreasing probability of relevance to the user, then that ranking is the best possible solution to the query.
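
The Shift-Or algorithm mentioned in both summaries can be sketched in a few lines. This is only a minimal illustration, not the version shown in class: the function name `shift_or_search` is hypothetical, Python integers stand in for the hardware bit vector, and it assumes the usual convention that a 0 bit means "still matching".

```python
def shift_or_search(pattern: str, text: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1

    # Preprocess the pattern, not the text: masks[c] has a 0 bit at
    # position i exactly when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)

    # Bit i of state is 0 when pattern[0..i] matches the text ending here.
    state = all_ones
    matches = []
    for j, ch in enumerate(text):
        # The "shift" extends every partial match by one character;
        # the "or" discards those that the current character breaks.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if state & (1 << (m - 1)) == 0:
            matches.append(j - m + 1)
    return matches


print(shift_or_search("abra", "abracadabra"))
```

Each text character costs one shift and one OR, which is part of why the method is considered suitable for hardware implementation.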
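
The vector space ideas in these summaries (cosine similarity, centroids, and query shifting by feedback) can be tried out with a small sketch. Everything concrete here is an assumption made for illustration: the three documents and their term weights are invented, the function names are hypothetical, and the Rocchio weights `alpha`, `beta`, and `gamma` are arbitrary values rather than ones given in class.

```python
import math

# Axes of the three-dimensional example: system, information, retrieval.
TERMS = ("system", "information", "retrieval")


def cosine(u, v):
    """Cosine of the angle between two term vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Point representing a set of documents: the per-axis average."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rank(query, docs):
    """Document names ordered by decreasing cosine similarity to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Shift the query toward relevant and away from nonrelevant documents.

    Uses all nonrelevant documents. Negative weights are clipped to zero,
    which is a design choice of this sketch.
    """
    rel = centroid(relevant)
    non = centroid(nonrelevant)
    return [max(0.0, alpha * q + beta * r - gamma * n)
            for q, r, n in zip(query, rel, non)]


# Invented term weights, in the order given by TERMS.
docs = {"d1": [2, 0, 0], "d2": [0, 3, 2], "d3": [1, 1, 1]}
query = [0, 1, 1]  # a query is positioned just like a document

print(rank(query, docs))
shifted = rocchio(query, relevant=[docs["d2"]], nonrelevant=[docs["d1"]])
print(cosine(query, docs["d2"]), cosine(shifted, docs["d2"]))
```

After one feedback step the shifted query lies closer in angle to the document marked relevant. Repeating the step corresponds to the iterations described above.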