Edward A. Fox
Department of Computer Science,
Virginia Tech, Blacksburg VA 24061-0106
This Unit has two articles that should be studied; they will be supplemented by the instructor's lectures on knowledge-based information retrieval (KBIR). An optional third article is also recommended.
The area of Knowledge-Based Information Retrieval (KBIR) is an active one, with IR researchers trying to apply various AI schemes to improve IR systems, and with AI researchers trying to extend their systems and apply their methods in the domain of IR. This latter tack involves not only traditional IR, but also fact extraction, where large document collections are analyzed (either in bulk, or in connection with message or document routing tasks) for particular types of information, so that a database or fact base can be loaded with the extracted particulars.
From the IR perspective, KBIR is still unproven. In limited domains, for small collections, it seems to have particular promise, especially when specialized knowledge bases can be constructed to help with particular tasks or problems (e.g., accurately selecting search terms). KBIR also seems to help build systems that are more usable or efficient, where the knowledge is reflected in the system behavior. In part this is the tack taken by Belkin et al. in developing Distributed Expert-Based Information Systems (DEBIS) [1]. However, it is still unclear if and how KBIR can be applied to the important problem of ``understanding'' large numbers of documents in a heterogeneous collection, with truly robust analysis, and with significant improvements in effectiveness when compared to the best statistically-based systems.
From an IR perspective, the question is where and how KBIR can be of benefit. One possibility is in the interface. Natural language processing (NLP) methods can be used so that the user enters natural language statements that are ``understood'', and receives natural language responses (e.g., explanations, questions needed for clarification, statements regarding progress of the session, etc.). The user may be ``understood'' or modeled, and that modeling may continue across sessions so that the system adapts to the user's needs and characteristics.
A second possibility is that the system actually ``understands'' the users' interest statements and/or the documents. This can result from NLP, and lead to useful knowledge representations of the interest statements and documents.
Third, the system may guide the user, through dialog and possibly through a machine learning scenario, to develop and refine problem descriptions and ultimately query representations. In some cases this may simply involve working with an existing back-end retrieval system, and forming the best possible Boolean query. In other cases, the back-end system may be more flexible. In either case, query expansion and term disambiguation are important issues.
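The query expansion and Boolean query formation described above can be sketched as follows. This is a minimal illustration, not any particular system's method; the small synonym table is a hypothetical stand-in for a real thesaurus or knowledge base.

```python
# Toy thesaurus; a real system would use a curated knowledge base.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}

def expand_term(term):
    """Query expansion: return the term plus any known synonyms."""
    return [term] + SYNONYMS.get(term, [])

def boolean_query(terms):
    """Form a Boolean query for a back-end system: AND together the
    user's terms, with each term OR-ed against its synonyms."""
    clauses = []
    for t in terms:
        clauses.append("(" + " OR ".join(expand_term(t)) + ")")
    return " AND ".join(clauses)

print(boolean_query(["fast", "car"]))
# (fast OR quick OR rapid) AND (car OR automobile OR vehicle)
```

Term disambiguation would refine this further, e.g., by selecting only the synonyms that match the intended sense of each term.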
Fourth, if the above occur, the system may use AI methods to match the representations, yielding some type of ranking or selection that considers the uncertainty and the statistical data that are available. Such matching may involve logical or statistical (e.g., Bayesian) inference, or (conceptual) graph matching operations. KBIR methods can also be applied to help increase parallelism (e.g., with blackboards), or to improve the efficiency of these operations, where rules or heuristics or AI search techniques help prune the space.
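A minimal sketch of the Bayesian-style matching just mentioned: under an (assumed) term-independence model, per-term evidence can be combined as log-odds to rank documents. The smoothing constant and probability estimates here are illustrative placeholders, not values from any cited system.

```python
import math

def bayes_score(query_terms, doc_term_probs, prior=0.5):
    """Rank a document by combining per-term evidence as log-odds,
    assuming terms contribute independently (a naive Bayes view)."""
    score = math.log(prior / (1 - prior))
    for t in query_terms:
        # Smoothed probability that the document supports term t.
        p = doc_term_probs.get(t, 0.01)
        score += math.log(p / (1 - p))
    return score

doc_a = {"neural": 0.8, "network": 0.7}
doc_b = {"neural": 0.1}
ranked = sorted([("a", doc_a), ("b", doc_b)],
                key=lambda x: bayes_score(["neural", "network"], x[1]),
                reverse=True)
```

Inference networks used in systems such as Inquery are considerably richer than this, but the core idea of propagating uncertain evidence to a ranking is the same.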
More detail on these approaches is given below, from an AI perspective on IR operations.
In the AI field, one key area is natural language processing (NLP) and the related area of computational linguistics. This can be applied to the human-computer interface, to make IR easier for casual users, and to encourage a more flexible dialog, where user modeling, explanation, tutoring and learning can all take place.
NLP can also be applied to query and/or document analysis, leading to a detailed knowledge representation. This is cheaper when applied only to the query, where at least phrase identification usually occurs. More complex representations, such as conceptual graphs [4] or frame systems (e.g., for chemical reactions), can also be developed.
Of great potential is NLP applied to the documents, so a richer representation can be developed. Once again, identifying phrases is a sensible goal, though it is still not clear that this approach is better than statistical phrase identification methods. In general, parsing for large collections must be very fast, and so may be shallow or partial, perhaps only identifying dependencies. Clearly it must be robust, immune to minor spelling or grammar errors, or to the occurrence of new words.
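The statistical phrase identification mentioned above as an alternative to parsing can be sketched very simply: count adjacent word pairs across the collection and keep the frequent ones as candidate phrases. Real methods add statistical association tests and stop-word handling; this is only the skeleton.

```python
from collections import Counter

def find_phrases(docs, min_count=2):
    """Candidate phrases: adjacent word pairs occurring at least
    min_count times across the collection."""
    pairs = Counter()
    for doc in docs:
        words = doc.lower().split()
        for a, b in zip(words, words[1:]):
            pairs[(a, b)] += 1
    return [pair for pair, c in pairs.items() if c >= min_count]

docs = ["information retrieval systems", "modern information retrieval"]
phrases = find_phrases(docs)
# [('information', 'retrieval')]
```

Note how robust this is by construction: unknown words and ungrammatical text cannot break it, which is part of why such methods remain competitive with parsing for large collections.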
With richer representations, one often thinks about IR at the level of concepts. This may involve use of a thesaurus, or of AI-constructed equivalents, which map from the world of words to the world of ``concepts.'' Sometimes these are called ``topics'' or ``frames'' or a semantic network or graph representation may be used. The two readings for this unit relate here: the first illustrates the need to have many aliases for a concept, to enhance recall; the second focuses on (sense) disambiguation through the use of NLP (semantics oriented) and use of (lexical and domain) memory.
Given such representations, inference methods can help. This may include query expansion. Graph matching may be needed for complex queries, and may follow an initial screening or filtering that is based on simpler, statistically based selection. Bayesian network or rule based schemes have good potential to help during the matching.
Matching can be improved, as can the interface, through user modeling. There is often some initial setup of stereotypes, knowledge acquisition so the user can be accurately characterized, and differential analysis to compare the user with stereotypical behavior.
Finally, machine learning can be used in a variety of IR situations. The normal relevance feedback scheme is one example. It is essential to have training data or samples, so that one can tune parameters. Good methods must converge quickly to (near) optimal solutions. Genetic algorithms and neural nets have been considered, but it is not yet clear if and how they can help.
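The normal feedback scheme referred to above is commonly implemented as Rocchio-style query modification: move the query vector toward relevant documents and away from non-relevant ones. The weights below are conventional illustrative defaults; vectors are represented as term-to-weight dictionaries for simplicity.

```python
def rocchio(query, relevant, nonrelevant,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Relevance feedback: reweight the query toward the centroid of
    relevant documents and away from non-relevant ones."""
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)  # negative weights are usually clipped
    return new_q

q = rocchio({"cats": 1.0}, [{"cats": 1.0, "felines": 1.0}], [])
```

After one round, terms from the relevant documents (e.g., ``felines'') enter the query even though the user never typed them, which is one route to solving the vocabulary problem by learning.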
From the Course Objectives in the Syllabus, an important goal is to ``read and understand research contributions ...''; you will gain experience by reading the two CACM articles, through the in-class discussion, and from the series of lectures about a broad range of research investigations.
This Unit has the following objectives, for students to be able to:
There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). The class discussion should be of value in making the readings clearer. Second, if possible, the (optional) lectures should be attended to obtain an overview of key concepts and research in the field.
Work with Inquery is optional but encouraged.
You are encouraged, but not required, to visit the U. Mass. Amherst Center for Intelligent Information Retrieval home page and then to look around, read, and try out the Inquery system that is pointed to from there. In particular, try to understand how it works, as opposed to what it does, which was covered in a previous unit.
Please write a short essay, of about 500 words, explaining how you would build a system for filtering and routing your electronic mail. Assume you have a mail handling system that can do the mail related part of this, and all you need to do is develop a module that will take a mail message as input and produce as output a ranked list of names of folders in which this message should be filed. Your module should be knowledge-based, and so may work with rules like Information Lens. It should have some type of lexicon, to help solve the vocabulary problem. It should have some memory and reasoning component to help with message classification. Give as many details as you can about data structures and algorithms you would use.
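As a starting point for thinking about the exercise, here is one possible shape for such a module. This is a hypothetical sketch, not a model answer: the lexicon entries, rules, and folder names are all invented, and a real solution should say much more about its memory and reasoning components.

```python
# Lexicon: many surface words map to one concept (vocabulary problem).
LEXICON = {
    "meeting": "SCHEDULING", "agenda": "SCHEDULING",
    "bug": "SUPPORT", "error": "SUPPORT", "crash": "SUPPORT",
}

# Information-Lens-style rules: if a concept is present,
# add weight toward a folder.
RULES = [
    ("SCHEDULING", "calendar", 2.0),
    ("SUPPORT", "bug-reports", 3.0),
]

def route(message):
    """Take a mail message; return folder names ranked by score."""
    words = message.lower().split()
    concepts = {LEXICON[w] for w in words if w in LEXICON}
    scores = {}
    for concept, folder, weight in RULES:
        if concept in concepts:
            scores[folder] = scores.get(folder, 0.0) + weight
    return sorted(scores, key=scores.get, reverse=True)

print(route("please send the agenda for the meeting"))
# ['calendar']
```

Your essay should go beyond this skeleton: for instance, the memory component might record which folder the user actually chose for past messages, so that rule weights (or new rules) can be learned over time.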
If you want to see how this might work, you are invited to try out use of sift-mail and sift-tcl, created by Laurence Lundblade. The executable files are in /u1/bin on fox.cs.vt.edu and so if your path (like that specified in the instructor's .login file) includes /u1/bin then you will be able to run them. Documentation is available on fox.cs.vt.edu in /u1/lgl/sift-mail-0.7.3/doc or on the WWW.
You should be able to answer each of the following questions.
The two articles are relatively focused, and do not give a broad overview of KBIR; that overview, however, should become clear from the lectures. Note too that the third article is highly recommended, and will help with the paper exercise.
This article deals with the problem of finding names or phrases to describe a concept. This is a difficult problem, approached in different ways by different individuals. Fundamentally, it is the cause of many indexing difficulties: an indexer and a searcher may not think of the same terms, which leads to decreases in recall.
The article presents empirical data from several domains regarding naming. Discussion then goes on to possible solutions. The main one advanced is essentially that of thesaurus construction --- finding a number of words and phrases that can be considered as alternatives when describing a given concept. The article suggests soliciting data from a number of individuals as a reliable way to get coverage, which can be automated and possibly improved by a ``failure analysis'' approach such as when ``unlimited aliasing'' is built into the system.
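The unlimited-aliasing idea can be sketched as an index built from many individuals' naming data: every name anyone supplied for an object becomes an access term for it. The sample data here is invented for illustration.

```python
from collections import defaultdict

def build_alias_index(naming_data):
    """naming_data: {object_id: [names supplied by many individuals]}.
    Returns an index from each alias to the set of objects it names."""
    index = defaultdict(set)
    for obj, names in naming_data.items():
        for name in names:
            index[name.lower()].add(obj)
    return index

data = {"text-editor": ["editor", "ed", "change", "modify"],
        "mailer":      ["mail", "send", "post"]}
idx = build_alias_index(data)
print(idx["modify"])
# {'text-editor'}
```

The cost of this recall-enhancing strategy is visible in the data structure itself: the more aliases accepted, the more likely one alias names several objects, so some disambiguation step is still needed downstream.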
This article is in sharp contrast to the Furnas et al. article in that it focuses on a particular system, for indexing or analyzing patent documents. By picking a limited domain, one can benefit from the characteristics of the special sublanguage involved, such as stereotypical reference forms that allow resolution of the sticky problem of anaphora.
Also as a result of using a sublanguage, a semantic approach to parsing can be adopted, and a relatively small set of relations can be used to indicate how the elements of noun phrases fit together. There is enough detail, however, to illustrate how parsing and natural language processing might be applied to other similar document collections.
The representation for memory is like a semantic network, but specialized into a hierarchy for fast lookup / classification. Problems of disambiguation are at the heart of the discussion. An open question is how well this approach would scale up to larger domains, and to larger numbers of documents.
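The hierarchy-for-fast-lookup idea can be illustrated with a toy classifier that walks from a root concept down to the most specific node whose cue words match a phrase. The concepts and cue words below are invented; the article's memory is far richer, but the lookup pattern is similar.

```python
# Toy concept hierarchy: each node lists its children.
HIERARCHY = {
    "substance": ["acid", "polymer"],
    "acid": [],
    "polymer": [],
}
# Cue words suggesting each child concept (hypothetical).
CUES = {"acid": {"acid", "acidic"}, "polymer": {"polymer", "resin"}}

def classify(phrase, node="substance"):
    """Walk down the hierarchy to the most specific matching concept;
    fall back to the current node if no child's cues match."""
    words = set(phrase.lower().split())
    for child in HIERARCHY.get(node, []):
        if CUES.get(child, set()) & words:
            return classify(phrase, child)
    return node

print(classify("an acidic solution"))
# acid
```

Because lookup descends a hierarchy rather than scanning a flat network, classification cost grows with the depth of the hierarchy rather than the total number of concepts --- which is exactly why the scale-up question raised above is interesting.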