Muhlenberg John
Class Notes 9/17/95

Administrative notes:

Updated web pages.
No longer required to complete the units in order.
Do what's required: quizzes and assignments.
Fixed the problem retrieving the C code from fox in the IF module.
The Nov. 6th class will meet in the Library. Short class, 6:30 through 8:30.
Access the quizzes through the password for the video.
No bonus points awarded yet. Article summaries are being held onto temporarily.
Can get to VTLS through web & gopher, and telnet vtls.vt.edu.

Lecture notes:

Search string unit. PAT part completed last week.

Naive algorithm: the simplest of the string search algorithms. The complexity of the naive algorithm is M*N for a search string of length M and a match string of length N, i.e. O(MN) worst-case complexity. Big-O here is not the typical case but the worst case, and the probability of hitting it is much lower. The algorithm takes the pattern we're looking for (the match string) and makes a comparison; if it matches, it marks the spot. The search continues through the string of text data. The algorithm moves forward one character at a time.

KMP algorithm: somewhat smarter than the naive algorithm. Example: matching on abrxcadabara with match string abca. Compare the pattern in the match string to a string of the same length in the search string: compare abrx to abca. Step forward to after the x, since a doesn't occur again until after the letter x.

Boyer-Moore algorithm: works with a longer match string. It makes use of repetition in the match string by comparing right to left. It tries a match heuristic and an occurrence heuristic (look for the next occurrence of the thing that didn't match). Most match string patterns don't have repetitions in them, so the match heuristic is generally not of much use. Got lost on this one.

Boyer-Moore-Horspool: a simplification of Boyer-Moore. Gets rid of the pattern matching (the match heuristic) of the previous Boyer-Moore algorithm. It's not quite as good as the full Boyer-Moore algorithm.
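Since the Horspool simplification was hard to follow in class, here is a minimal sketch of it next to the naive algorithm. This is not code from the lecture; the function names naive_search and horspool_search are made up for illustration. The Horspool version keeps only the occurrence heuristic: after each right-to-left comparison it shifts by a distance looked up from the text character under the last position of the pattern.

```python
def naive_search(text, pattern):
    """Try every position in the text, moving forward one character at a time."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    return [pos for pos in range(n - m + 1) if text[pos:pos + m] == pattern]


def horspool_search(text, pattern):
    """Boyer-Moore-Horspool sketch: occurrence heuristic only."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []

    # For every character in the pattern except the last position, record the
    # distance from its rightmost occurrence to the end of the pattern.
    # Characters that are not in the table shift by the whole pattern length.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}

    matches = []
    pos = 0
    while pos <= n - m:
        # Compare right to left, as Boyer-Moore does.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)  # mark the spot
        # Shift according to the text character aligned with the pattern's end.
        pos += shift.get(text[pos + m - 1], m)
    return matches


if __name__ == "__main__":
    print(naive_search("abracadabra", "abra"))
    print(horspool_search("abracadabra", "abra"))
```

Both functions return the list of starting positions, so they can be checked against each other on any input. The difference is only in how far each one moves forward after a comparison.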
Shift-OR algorithm: a finite state algorithm. A 0 in the state number implies a match or the start of a string. A 0 on the left implies a string match (shown on the page at left). The unique thing about this algorithm is that it doesn't have to buffer the data being searched as it goes by; you can monitor the stream for a match. It is simple because it uses only a left shift and an OR. Only 1 character of text has to be stored at any one moment.

Karp-Rabin string matching: take the match string and encode it as a number, then search the text data. This encoding is called a signature. For a three-character match string, the text abaracada would be broken into the strings aba, bar, ara, rac... and a signature would be computed for each string, to be compared against the signature of the match string.

Summary: see web page.

Ranking and Relevance

Ranking search results to clarify what's relevant and important.

[Diagram: a matrix with terms 1 through t across the top and documents 1 through d down the side.]

In the vector space matrix, the entry specified by the pair (i,j) is 1 if that term occurs in that document, 0 otherwise.

Query expansion uses additional information, asked for or received, that gives the computer more information about the potential answer. More important terms have heavier weights.

Alternative feature selection:

Word: simple word expansion.
Stem: the stem or non-plural form.
N-gram: pairwise; deals with things of size N. A bigram is `th', a trigram is `thi'. Used to extract two or three letter strings resembling the match string. Usage is similar to spell checkers.
Factor: from SVD (Singular Value Decomposition), related to factor analysis. Squeezes the matrix of documents & terms down and uses factor analysis (like a sparse matrix).
Attribute value pair: e.g. publish dates added to a list of documents, providing a decreasing weight or relevance.

-break-

The vector space model has been around since the 60s and 70s. Mentioned in the SALT75 article. Illustrated with a 3 axis system, where documents are characterized by three dimensions. In general we model it as a T dimensional space where there is one dimension for every word.
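To make the term/document matrix and the n-gram features above concrete, here is a small sketch. It is not from the lecture: the function names term_document_matrix and ngrams, the sample documents, and the choice to split words on whitespace are all assumptions made for illustration.

```python
def term_document_matrix(documents):
    """Build a 0/1 matrix with one row per document and one column per term.

    Assumption: terms are lowercase whitespace-separated words.
    """
    doc_terms = [set(doc.lower().split()) for doc in documents]
    vocabulary = sorted(set().union(*doc_terms)) if doc_terms else []
    # Entry (i, j) is 1 if term j occurs in document i, 0 otherwise.
    matrix = [[1 if term in terms else 0 for term in vocabulary]
              for terms in doc_terms]
    return vocabulary, matrix


def ngrams(word, n):
    """Return all substrings of length n, e.g. bigrams for n=2."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    docs = ["string search algorithms", "search ranking"]
    vocabulary, matrix = term_document_matrix(docs)
    print(vocabulary)
    for row in matrix:
        print(row)
    print(ngrams("this", 2), ngrams("this", 3))
```

Each row of the matrix is one document's vector in the T dimensional space, with one dimension for every word in the vocabulary. Weighted versions would replace the 0/1 entries with heavier weights for more important terms.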
Clustering and centroids can be used to locate nearby documents. This is good for finding documents `like this one'. This kind of system finds things through similarity. Similarity is defined as the cosine of the angle between the query and document vectors.

Vector feedback: lets the user tailor the query toward the documents near it. The user feeds back the relevant documents, and the query is drawn toward them and away from the non-relevant ones. The problem is knowing which ones are relevant and which ones are not. Use feedback to allow re-ranking by the user's choices. This requires dynamic manipulation of the query. Uses statistics to recompute the ranking. There will be a lot of non-relevant documents. Rocchio developed an algorithm with positive and negative feedback that allows the user to tailor which documents they are searching for. Ide had better results by decreasing only the highest-ranked bad documents. Feedback policies take the average of the vectors and move in that direction.

Probabilistic model: makes assumptions in order to calculate the probabilities. Probability ranking principle. Assumes we get good data, which is not always true. Then there was this long part on the probability in the web pages. Probabilities can be used to calculate the degree of similarity.
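The cosine similarity and the positive/negative vector feedback described in these notes can be sketched as follows. This is an illustration, not the formulation given in class: the function names and the weights alpha, beta and gamma (with their default values) are assumptions. The feedback step moves the query toward the average of the relevant vectors and away from the average of the non-relevant ones.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_u * norm_v)


def centroid(vectors, size):
    """Average of a list of vectors; all zeros if the list is empty."""
    if not vectors:
        return [0.0] * size
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def feedback(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update; alpha, beta, gamma are assumed example weights."""
    size = len(query)
    pos = centroid(relevant, size)
    neg = centroid(non_relevant, size)
    # Move toward the relevant average and away from the non-relevant average.
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos, neg)]


if __name__ == "__main__":
    query = [1.0, 0.0, 0.0]
    good = [[1.0, 1.0, 0.0]]
    bad = [[0.0, 0.0, 1.0]]
    new_query = feedback(query, good, bad)
    print(new_query)
    print(cosine_similarity(query, good[0]), cosine_similarity(new_query, good[0]))
```

After the update the query sits closer to the document marked relevant, so re-ranking by cosine similarity would move that document and its neighbors up. This sketch leaves negative weights in the new query as they are; some systems set them to zero instead.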