In terms of the breadth of topics considered, FOA admittedly casts an ambitious net. The fundamental reason I like to teach this subject (as well as others) is that I believe it is possible to give students some "job skills" that are quite immediately rewarding and at the same time use this opportunity to acquaint them with some of the most exciting, intellectually stimulating open questions facing modern science and philosophy. The danger, of course, is that mixing these very different agendas creates a muddle that accomplishes neither.
One of the premises guiding construction of this text, however, is that the new electronic media now available can be especially useful to an instructor attempting to create a unique learning experience for his/her class. Thus FOA has been designed to be easily tailored for classes ranging from early undergraduates (requiring only CS??/Data Structures as a prerequisite) to advanced graduate seminars in IR, machine learning and computational linguistics. I have further attempted to be as explicit as possible about major dependencies within the material presented, both inside this text and where it depends on other disciplines. When combined with some simple tools, these dependencies can be used by an instructor to identify the sections they find crucial, find the prerequisites for these along with supporting review questions and exercises, incorporate materials I've omitted (?!), and finally schedule it all across the quarter or semester.
A first approximation of FOA's basic structure is given in Figure ?. Chapter ? gives a quick, orienting tour through many of the most important questions and methods considered in greater depth later. Section ? then walks through each of the steps required to actually build a "text search engine" much like those now in widespread use across the WWW. The conceptual focus of this section is the fundamental ABOUT(Document,Keyword) relation, connecting statistically derived "keywords" to the documents they describe. At this juncture, if nothing else, the student should be in a much better position to evaluate these products.
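As a rough illustration of the ABOUT(Document,Keyword) relation, the sketch below stores it as an inverted index that maps each keyword to the documents it describes. It is only a minimal sketch under stated assumptions, not the engine developed in the text: the names `STOP_WORDS`, `tokenize`, `build_about` and `search`, the three toy documents, and the use of raw term frequency as the keyword weight are all hypothetical choices made for this example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stop list for illustration; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_about(docs):
    """Build the ABOUT relation as an inverted index.

    Returns a mapping keyword -> {doc_id: raw term frequency}.
    Raw frequency is an assumed, deliberately naive weighting.
    """
    about = defaultdict(dict)
    for doc_id, text in docs.items():
        for keyword, count in Counter(tokenize(text)).items():
            about[keyword][doc_id] = count
    return about


def search(about, query):
    """Rank documents by the summed weights of the query's keywords."""
    scores = Counter()
    for keyword in tokenize(query):
        for doc_id, count in about.get(keyword, {}).items():
            scores[doc_id] += count
    return [doc_id for doc_id, _ in scores.most_common()]


# Toy collection, invented for this example.
docs = {
    "d1": "Indexing the documents of a collection",
    "d2": "Weighting keywords in an index",
    "d3": "Evaluation of retrieval: retrieval measures",
}
about = build_about(docs)
print(search(about, "retrieval measures"))  # prints ['d3']
```

Note that "index" and "indexing" are treated as unrelated keywords here; conflating such variants, and choosing better weights than raw counts, are exactly the kinds of refinements a plain sketch like this leaves out.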
Of course this basic exposition has had to cut a number of corners, and the resulting search engine must be considered only a "plain vanilla" base from which more sophisticated techniques can be explored. These techniques draw from a wide range of disciplines: mathematics (esp. probability and machine learning), artificial intelligence (esp. as it applies to reasoning about structures of and among documents beyond the basic ABOUT relation), and computational linguistics. Interface issues, connections to database technologies, and some lower-level implementation issues are also considered.
A second extension from the core is to consider the mind-boggling range of new media, network protocols, standards, browsers, etc. that collectively constitute the World Wide Web (WWW) explosion. A remarkable number of issues transfer from prior IR research: creating an inverted index of the Web's pages (like Alta Vista); the role of taxonomic classification systems (like Yahoo!); bibliometric analyses of Web citation patterns; etc. Of course many things also change as new standards (e.g., HTML, Java, OpenDoc, OLE/ActiveX) emerge and new forms of automatic interaction (e.g., "agents") are allowed. This part of the FOA syllabus is perforce the most dynamic, and will depend especially on the FOA WWW pages for currency.
In stark contrast to this quickly changing topic, we next turn to some of the oldest questions in the philosophy of language: What does it mean for a word to have "meaning"? How do two language users (for example a document's author and its readers) come to understand one another? Our review of work in this area will highlight the role of the mediating artifact (spoken word, written document, photo, movie) and the role of an interpretive context within which each expression must be understood. These philosophical concerns then set the stage for an analysis of the social context within which people read and write, and how the publishing industry is transforming itself as a bridge between authors and readers. Specific topics include the changing economics of publishing and legal questions that arise as intellectual properties move from physical to digital substrates. Two hallowed social institutions, the library and the scientific enterprise, are analyzed in depth to see how these changes are transforming them.
Finally, FOA is turned in on itself as the technologies for seeking knowledge are applied to the task of education. Virtually every level of American education is changing, and the relative roles of classroom lectures, remote "distance learning," textbooks, CD-ROMs, and WWW resources must all be reconsidered. This FOA (text/lecture/WWW site) is itself an example of one evolving experiment.
Obviously this is more material than is likely to be included in any one course. As suggested by Figure ?, the idea is that with the foundation provided by the core of Section 2, instructors can pick and choose (assisted by FOA curriculum construction tools) from the advanced topics to form a course matching their priorities.
Finding Out About
Information Retrieval and other techniques for seeking knowledge
1. Introduction
1.1. Abstract
1.2. Finding out ...
1.3. ... about
1.4. Defining (a simple version of) the IR problem
1.5. Preview
2. Core technologies: Building a search engine
2.1. Indexing structures
2.2. Weighting indices
2.2.1. Statistics of communication
2.2.2. Language distribution
2.2.3. Factors in index weighting
2.2.4. Weighting methods
2.3. Evaluation of IR systems
2.3.1. What is success?
2.3.2. Measures
2.3.3. Experimental collections
2.3.4. RAVE
3. Advanced techniques
3.1. Violated assumptions (in the basic model)
3.2. Mathematical approaches
3.2.1. Probabilistic retrieval
3.2.2. Machine learning
3.3. Drawing inferences based on other representations
3.3.1. AI knowledge representation basics
3.3.2. Exploiting other (non-index) information
3.3.3. Publication information
3.3.4. Inter-document links (Citation)
3.3.5. Intra-document links
3.3.6. Keyword structures
3.3.7. Social relations
3.3.8. Maps
3.4. Computational linguistics
3.4.1. A quick history of natural language processing
3.4.2. Corpus-based linguistics
3.4.3. Statistical language inference
3.4.4. Applications of NLP to FOA
3.5. Interface design for IR systems
3.6. Relation to Database
3.7. Implementation issues
4. IR in the (brave new WWW) Large
4.1. New Media
4.1.1. Email
4.1.2. News
4.1.3. WWW
4.1.4. Beyond textual documents
4.1.5. Active documents
4.2. Agents
4.3. What's new; what will stay the same
5. Taking language seriously: Engineering meaningful representations
5.1. Preliminaries
5.1.1. Shannon/Weaver coding
5.1.2. Requirement of mediating artifact
5.1.3. Oral vs. written communication
5.1.4. Inference beyond decoding
5.2. Meaning
5.2.1. Grice's "meaning"
5.2.2. Language games
5.2.3. Semiotics
5.2.4. Context: mutual knowledge
5.3. The IR language game
5.3.1. About(Topic, Material, SezWho)
5.3.2. Relevance
5.3.3. Relevance feedback
5.4. Text-based intelligence
6. The social context
6.1. Economics of publishing
6.2. Legal issues
6.3. Political issues
6.4. Libraries without walls
6.5. New science in the InfoVerse
6.6. From the authors' perspective
6.7. Research and the rest of the job
6.8. CLOE
7. Searching for an education
7.1. FOA self-reference!
7.2. Teaching:learning :: Writing:Reading dualities
7.3. Pedagogical structures