Tom Kalafut 5604n - 09/12 class summary

We covered Boolean queries before the break to finish up the Inverted Files module. This included basic query construction, p-norm distances, and extended Boolean query implementation. After the break, we started the string searching section, specifically Part A, which covers PAT trees. This included an introduction to PAT trees, algorithms on PAT trees such as range and prefix searching, the basics of PAT tree construction, and a brief encounter with PAT arrays.

========================================================================

Girishchandra K. Saligram
Summary for Class on September 12th 1995:

The class began with Dr. Fox making a few changes to the accounts set up on video. He then informed us of the new mailing address for the course, which is 5604i@fox.cs.vt.edu. Someone asked what all these accounts are to be used for. The uses are:

fox.cs.vt.edu: For algorithm animation
video.cs.vt.edu: For web pages
vtaix: For all course related work

The lecture then commenced with a discussion of Extended Boolean Information Retrieval. We were given a brief introduction to the article to be read for this unit. Then we took a detailed look at Extended Boolean Queries and Retrieval. We went over the main problems of Boolean queries, which are:

1) With AND, OR & NOT we may not get the information we were looking for, because matching is exact. With AND, if one term is missing we get nothing; with OR, a document matching several terms is treated the same as one matching only one; and with NOT, even a casual use of a term eliminates the document.
2) There is no ranking of the retrieved information, so the user has to contend with a volume of haphazardly arranged data.
3) There are no weights on query terms, so all terms are treated equally. Thus importance cannot be given to some desired terms.
4) There are no weights on document terms.
Thus the strict binary decisions that have to be made leave no scope for attaching importance to terms.

The first step in getting around these Boolean problems is to 'fuzzify' the operators. On this note we started off on Fuzzy Set Theory. The basic concepts of fuzzy logic were illustrated:

1) Redefine AND as MIN
2) Redefine OR as MAX
3) Redefine NOT B as 1 - value(B)

We then saw how to apply values in the range [0,1] to a document. All this was explained using fundamental mathematical principles and graphs.

Next we explored a few more models to improve query and retrieval.

MMM Model: Here, OR and AND are considered to be mixed. Whenever one is evaluated, the other is mixed in to a certain extent. Coefficients for OR and AND are defined to specify the level of mixing.

Paice Model: This considers all the terms in the query. It uses a geometric sequence to assign a weight to each term.

P-Norm Model: This again considers all terms in the query, but the strictness of each AND & OR operator is parameterized. This concept was illustrated with a few graphs of the model's behavior.

Upon comparing these models it was found that P-Norm is the best but it is expensive. If a cheap model is needed it would be better to use the MMM model. Lastly we were introduced to some hints on implementation.

The lecture ended with the start of the fourth unit on String Searching. We started off seeing how PAT trees are built and how they can be used for searching. The concept of a sistring (semi-infinite string) was introduced.

========================================================================

Class Summary, Sep 12 -- Srinivas R. Gaddam

The purposes of the various class accounts were explained. They are
(i) fox: for algorithmic animations
(ii) video: for accessing material from the web
(iii) vtaix: for all other purposes

Different problems with Boolean expressions were addressed. For example, for an expression using 'AND', if even one keyword is missing then we get nothing.
It is desired that we get the results that have 4 keywords first, then those with 3, and so on. For 'OR', even if there are several keywords, it counts as one and there is no way of showing preference. 'NOT' eliminates even casual use of the word. If a word occurs very rarely, then it has significant value for retrieval purposes. Since the concepts of weights and ranks are absent, this value cannot be taken advantage of. Two schemes that extend Boolean logic with weights and ranks are the MMM and Paice models.

The application of Fuzzy set theory, developed by Zadeh in 1965, to IR systems was then discussed. The various characteristics of Fuzzy theory were discussed. There is a formula to assign term weights to documents, but not to query terms. But this theory has its own drawbacks; for example, the fuzzy equivalent of a Boolean expression with 'AND' depends only on the term with the lowest value.

To overcome the above problem, the MMM model redefines AND as containing both 'AND' and 'OR' in different proportions. The same is done for 'OR'. But the problem is that only the terms with the lowest and highest values are considered, and the rest of the terms in the query are ignored. The Paice and P-Norm models solve this problem by taking all the terms into consideration.

Experiments have shown that the MMM, Paice, and P-Norm models perform significantly better than the standard Boolean model, though they take more time as they are computationally more intensive. In particular, MMM < Paice < P-norm in terms of Effectiveness (precision at given recall).

Unit 4 was then introduced. There are 3 types of text searching -- Cluster, Hashing, and Sorted Indexing. In this unit, we shall be studying PAT trees under Sorted Indexing. The concept of sistrings was explained.

Question: How can one construct a PAT tree for a small string like "This is a string"? An example would be very helpful.
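
As a partial, hedged illustration for the question above (not the construction shown in class): a PAT tree is a Patricia tree over the sistrings of the text, and a PAT array lists the same sistrings in sorted order, so the array is an easier place to start. The Python sketch below lists the sistrings of "This is a string" and sorts them. It assumes that index points are placed at the start of each word and that comparison is by plain character code (so 'T' sorts before lowercase letters). The function names are made up for this example.

```python
import bisect

text = "This is a string"

def word_starts(s):
    """Index points: positions where a word begins (an assumption of this sketch)."""
    return [i for i in range(len(s))
            if s[i] != " " and (i == 0 or s[i - 1] == " ")]

def sistring(s, pos):
    """The sistring at pos: the text from pos onward (conceptually unbounded to the right)."""
    return s[pos:]

def build_pat_array(s):
    """Index points sorted by the sistrings they denote (a PAT array)."""
    return sorted(word_starts(s), key=lambda p: sistring(s, p))

def prefix_search(s, pat_array, prefix):
    """Index points whose sistring starts with prefix, found by binary search."""
    # Truncating sorted sistrings to len(prefix) keeps them in sorted order.
    keys = [sistring(s, p)[:len(prefix)] for p in pat_array]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix)
    return sorted(pat_array[lo:hi])

pat_array = build_pat_array(text)
for p in pat_array:
    print(p, repr(sistring(text, p)))

print(prefix_search(text, pat_array, "s"))
```

The four sistrings here start at positions 0, 5, 8 and 10. A prefix search for "is" finds only position 5, because the "is" inside "This" does not begin at an index point. Building the actual tree would additionally require branching on the bits of these sistrings, which this sketch does not attempt.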