Tom Kalafu
10/09/95

Class Summary: We went over Indexing, which includes lexical analysis or
tokenization, stopword removal, stemming, and plural removal. We reviewed some
lexical analysis basics: grammars, finite state machines, Turing machine
theory, and the UNIX tools lex and yacc. We then learned how finite state
machines, which save space and are quick, can help in stopword removal. We
then went over conflation methods, or more specifically, stemming and plural
removal methods. We reviewed an example and saw that it is not very clear how
to handle multiple dictionary entries, short words like 'kings', and spelling
variations. We were then introduced to different measures for evaluating
conflation methods, since conflation is not a perfect science. Near the end of
class, we were introduced to the next unit, which involves word processing,
document management, markup, and the OHCO model.

==================================================

9 Oct 95
CLASS SUMMARY, AUTOMATIC INDEXING
Group 2: Lauren Barton, Martin Falck, Nelson Kile, Carolyn O'Hare, Robert Ryan

This lecture was about the different methods available to automatically
extract the key terms from a document and create an index to it. To
automatically index a document, the following steps are required:
  - Identify documents
  - Identify fields (title, author)
  - Write an index
  - Parse the characters into tokens
  - Transform into standard data form

Lexical analysis is used to break the document into tokens. It may be
accomplished by:
  - Lex, YACC - UNIX tools (large, complex, and hard to change).
  - Finite state machine - moves from state to state to convert characters
    into tokens. Fast, small, and relatively easy to design.

The lexical analyzer must be designed to account for conditions such as
changing case and hyphens. The analyzer may also remove stopwords. Stopwords
are common terms that, if used in the index, would create a very large index
that would not be very useful. Stopwords include words like "the" and "that".
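The finite-state-machine approach to stopword removal mentioned above can be
sketched briefly. This is a minimal illustration, not material from the
lecture: the stopword list is a toy sample, and encoding the machine as a
dictionary of (state, character) transitions is just one convenient choice
(the resulting table is a trie, which is one kind of deterministic FSM).

```python
# Compile a small stopword list into a character-level FSM.
# State 0 is the start state; each stopword traces a path of
# transitions ending in an accepting state.
STOPWORDS = ["the", "that", "this", "what", "which"]

def build_fsm(words):
    trans = {}          # (state, character) -> next state
    accept = set()      # states that complete a stopword
    fresh = 1
    for word in words:
        state = 0
        for ch in word:
            if (state, ch) not in trans:
                trans[(state, ch)] = fresh
                fresh += 1
            state = trans[(state, ch)]
        accept.add(state)
    return trans, accept

def is_stopword(word, trans, accept):
    """Run the FSM over the word; reject on any missing transition."""
    state = 0
    for ch in word:
        if (state, ch) not in trans:
            return False
        state = trans[(state, ch)]
    return state in accept

trans, accept = build_fsm(STOPWORDS)
print([w for w in ["the", "king", "that", "throne"]
       if not is_stopword(w, trans, accept)])   # ['king', 'throne']
```

Each lookup costs one table probe per character, independent of the size of
the stopword list, which is the speed and space advantage the summary notes.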
Stopwords could also be high-frequency words for a specialized document set.
For example, in a document set on computers, "computer" may be used as a
stopword since it would apply to all the documents.

Once the keywords have been identified and the stopwords have been removed,
the indexer will determine whether any of the terms may be modified so that
they can be combined with other terms. This is done so that the term can be
found more easily (e.g., combine "man" and "men"). Conflation is the combining
of similar words into a single term. Automatic stemmers are used to strip
suffixes off a term so that it may be conflated. Some automatic stemming
approaches are:

  - Affix removal:
    - Longest match (Lovins method) - removes suffix strings one at a time;
      an iterative approach. Uses a set of rules to correct spelling changes
      when suffixes are removed. A clunky method.
    - Simple removal - compact but accurate. Assumes words are
      consonant/vowel-consonant groups, possibly followed by a vowel.
      Looks for patterns.
  - Successor variety - uses the number of probable letters that follow a
    letter to determine whether there is a break in the word (a suffix). If
    there is a jump in the number of possible following letters, there is
    probably a suffix attached.
  - N-gram - breaks up the word into n-sized groups of letters. These groups
    are used to determine similarity to other words.
  - Table lookup - not practical due to size.

Conflation effectiveness is measured by testing E-measure, precision and
recall, or space, with and without conflation, to see what improvement there
is, if any. The goal is to optimize the retrieval effectiveness of the index.

We began a discussion of the chapter on SGML/document translations.
Metadata - a description added to data (e.g., describing a photo).

==================================================

S. Carr, M. Joyce, B. Khan, Z. Khan, V. Madhava
CLASS SUMMARY CS5604 (Nova) - Oct 09, 1995

UNIT IN (100%):

* We covered two areas of indexing: Automatic Indexing and Conflation.
* Automatic Indexing:
  - The goal is to take a work and construct a representation of it: an index
    term list.
  - This representation is by index terms (or tokens) that are then used as
    the description of the document.

* Processes used in automatic indexing:

      Text -> Lexical Analysis -> Document Vectors
              (supported by Stop-Word Removal, Plural Removal,
               and a Stemmer that consults a Dictionary)

* Steps in automatic indexing:
  1) Identify the documents. This will vary depending on how "document" is
     defined; i.e., it could just as well be defined as sentences or
     paragraphs.
  2) Identify the fields. This depends on the form of the document. Examples
     would be Title, Author, Abstract, etc.
  3) Create an index. This is often experience driven.
  4) Parse all the different fields possible. E.g., abbreviations and
     acronyms need to be expanded, and the different forms of dates need to
     be recognized.
  5) Transform to canonical, or standard, form. Here the purpose is to
     convert from the many different formats that might exist to one format.
     For example, the many different ways in which a date might be written
     need to be converted to one standard form.

A. Lexical Analysis:
  - Converts a stream of characters (text) into tokens.
  - Can be done for various portions of the document, e.g., not on the title
    but on the abstract and body.
  - Can be done for both the document and the query.
  - Can be specialized to the document collection to account for
    peculiarities within the collection.
  - There are 2 main approaches:
    1) Use Yacc and Lex
       - Yacc can be used for context-free languages.
       - But experience with the SMART system showed that this is too complex
         and difficult to deal with.
       [ Aside: In the hierarchy defined by Noam Chomsky, there are four
         types of languages (type 0 is "high"):

             type 0 - recursively enumerable   /\  Need more
             type 1 - context sensitive        |   powerful
             type 2 - context free             |   machines
             type 3 - regular                  |

         A regular language is a language for representing simple patterns
         such as repetitions of a character or characters (e.g.,
         "aaabbcaaabbc").

         A context-free language is a language for representing slightly more
         complicated patterns that include paired elements, such as
         arithmetic expressions with balanced parentheses (e.g., "(a+b)*(a-b)")
         or block structures (e.g., "begin - end") in a programming language.

         A Turing Machine (TM) is an automaton whose temporary storage is a
         tape. The tape has a read-write head that can travel right or left.
         The TM changes state as it processes characters on the tape. The
         processing performed by the TM is defined by a transition function,
         often called a program. ]

    2) Finite State Machine (FSM) Approach
       - This is especially good for simple languages like regular
         expressions.
       - An FSM is composed of states and transitions.
       - The full behavior of a lexical analyzer can be specified by an FSM.
       - It can be used for truncation, capitalization, stopword removal,
         etc.
       - The main advantages are speed and size. These days FSMs can even be
         implemented in hardware or firmware.

B. Stopword Removal
  - The goal is to get rid of terms we don't want indexed.
  - The issue becomes what to remove and how to decide that.
  - Normally function words are removed, e.g., what, which, why, etc.
  - High-frequency words in certain contexts can be removed, e.g., the word
    "computer" when dealing with a collection of computer documents.
  - Implementation techniques:
    - Hashing functions
    - Tries
    - Lexical analyzer and FSM (described above); stopword lists can be
      implemented in FSMs.

C. Conflation
  - Here the goal is to fuse or combine multiple words into one, to increase
    recall and reduce the size of the index.
  - Stemming is not the same as truncation.
    Stemming is typically done by a machine to get to a root form of a word,
    while truncation just chops off the ends of words. Truncation is often
    not linguistically correct and is not language dependent.

  - Measures for conflation include:
    a) Linguistic correctness - does it make sense in the language?
    b) Retrieval effectiveness - will it help recall or precision?
    c) Space savings
    d) Speed

  - Problems:
    a) Overstemming - if too much is removed, too many words will map to one
       stem.
    b) Understemming - what could have been reduced has not been. This wastes
       space and time.
    c) Language-specific problems:
       - Multiple dictionary entries for similar concept words
       - Conflating short words is a problem
       - There are many spelling variations to worry about

  - Other issues:
    a) When should it be done: at index time or at search time? If done at
       index time, the original terms are lost. If done at search time, it
       takes longer.
    b) Who should do it: the computer or the searcher?

  - Conflation Methods:
    a) Table lookup - impractical because it's too time consuming.
    b) Successor variety - successively look at a string, adding a letter at
       a time, to see how many successor words are possible. The transition
       where this number comes down and then up defines the word.
       - Complicated; not used much.
    c) N-gram (bigram, trigram) - here, the successive two- and three-letter
       sequences shared between pairs of words are computed. The words that
       have the highest number of matches are considered similar.
       - Good for noisy data, or when spelling correction is needed.
    d) Affix removal - based on language-specific rules. Two types:
       i)  Longest match (Lovins method) - look for matches (longest,
           iterative, or partial) or recode using fixed rules.
       ii) Simple removal (Porter algorithm) - assume words are of the form
           [C](VC)m[V] and use suffix rules to conflate.

  - Conclusions:
    - There is at least some space savings in using stemming.
    - Most algorithms give similar performance.
    - It is better to stem than not to stem.
    - Strong stemming is worse than weak stemming.
    - In the future, linguistic accuracy may be more critical.

UNIT SD (10%):

* Word Processing
  - Trend: Paper -> Diskettes -> Online Discs
           Batch Formatters -> Interactive
  - Two levels of tools have been in use:
    a) Page level, such as Word and WordPerfect. These are evolving into
       type b):
    b) Book level, such as Interleaf.
  - Authoring now refers to more than just documents; it also covers
    multimedia.
  - Steps in authoring include:
    - Developing a hierarchical outline
    - Including figures, tables, and graphs
    - Creating citations
    - Following a rhetorical structure, such as telling a story, following a
      timeline, etc.
    - Providing reader aids such as a table of contents, an index, etc.
    - Using devices to capture or maintain reader attention, such as
      highlighting
    - Layout, which includes placement, size, etc.

* Document Management
  - Includes elements outside of authoring, such as publishing, distributing,
    archiving, etc.
  - There are many outstanding issues:
    - Metadata - how to describe items
    - Lifecycle issues - how to deal with physical data protection
    - Rights management - how to protect authorship rights, especially in an
      electronic environment
    - Usability - since paper has had a lead time of many centuries, how do
      we make electronic document management equally efficient?

* Markup and the OHCO Model
  - There are six approaches to markup:
    - Punctuational
    - Presentational - problem: it distracts attention from content
    - Procedural - problem: it's too programming oriented
    - Descriptive - a good way, especially for non-programmers
    - Referential
    - Metamarkup - goes one step above and describes classes of documents,
      e.g., dissertation, dictionary, etc.

* OHCO - Ordered Hierarchy of Content Objects
  - Each content object (e.g., chapters in a book) is demarcated by tags.
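The OHCO view above, in which each content object is demarcated by tags, can
be illustrated with a small sketch: a document is a tree of typed nodes with
ordered children, and serializing the tree produces descriptive markup. This
is not from the lecture; the element names and the tuple encoding of content
objects are hypothetical choices for the example.

```python
# A content object is (kind, content), where content is either text
# (a leaf) or an ordered list of child content objects. This mirrors
# the OHCO idea: a book contains chapters, chapters contain titles
# and paragraphs, and so on.

def emit(obj, depth=0):
    """Serialize a content-object tree as descriptive markup."""
    kind, content = obj
    pad = "  " * depth
    if isinstance(content, str):            # leaf: plain text content
        return f"{pad}<{kind}>{content}</{kind}>"
    lines = [f"{pad}<{kind}>"]
    for child in content:                   # children are ordered
        lines.append(emit(child, depth + 1))
    lines.append(f"{pad}</{kind}>")
    return "\n".join(lines)

book = ("book", [
    ("chapter", [
        ("title", "Automatic Indexing"),
        ("para", "Each content object is demarcated by tags."),
    ]),
])
print(emit(book))
```

The nesting of the emitted tags makes the ordered hierarchy explicit, which
is exactly what descriptive markup (as opposed to presentational or
procedural markup) is meant to capture.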