
Indexing is the key component of the search system. An effective indexing process will yield high-quality records that represent accurately collection of resources, and that better describe the documents. Searching a high-quality index leads to precise identification and correct retrieval of resources.
In effect, indexing for a search system is virtually the same as for a traditional database. In a traditional database, a record is a set of values treated as a unit. For example in a library catalog, a record for a book may consists of title, author, publisher, subject, and so on.
An example of two sample records in a database is shown below
Record: 012
------------
Author: Lincoln D. Stein
Title: How to Set Up and Maintain a World Wide Web Site
Publisher: Addison-Wesley Publishing Company, Inc. (1995)
Subject: Web, Netscape, HTML
ISBN: 0-201-63389-2
Record: 345
------------
Author: David Flanagan
Title: Java in a Nutshell
Publisher: O'Reilly & Associates, Inc., (1996)
Subject: Java
ISBN: 1-56592-183-6
Record: 678
------------
Author: W. Richard Stevens
Title: TCP/IP Illustrated, Volume 3.
Publisher: Addison-Wesley Publishing Company, Inc. (1996)
Subject: TCP/IP, Networks, HTTP, NNTP.
ISBN: 0-201-63495-3
While for a search system, a record for a Web resource or document is a URC (Uniform Resource Characteristic). A URC is a set of attributes that describes a resource or document on the Web. For example, the simplest type of URC are Hotlists that only consist of a document title and document location (URL - Uniform Resource Location), such as
Record: 210
Title: Table of Contents
URL: http://www.netscape.com/search/
Record: 543
Title: Virginia Tech Computer Science Department Home Page
URL: http://www.cs.vt.edu/
Record: 876
Title: CS6204: WWW:Beyond the Bascics
URL: http://ei.cs.vt.edu/~wwwbtb/
Let us recall some basic functions about database. The most important thing about a database is to find information by knowing something about the information you want, without necessarily knowing the detail of how it is stored.
A database is a set of records. Each record in a database can be uniquely identified by a key value. The database is called a primary index if it has these unique keys. We can find the right information by knowing the right key, without necessarily knowing how it is stored.
However searching only on the primary key is somewhat unconvenient, which limits the capability of the database. Sometimes we do not know the primary key, but want to search terms based on other keys. For example, we want search books by an author, or in a subject, or by book title on a library database. Searching information on the Web by heading or document title on a Web indexing database is another example. To this end - searching based on another key, we need to create another database organized by that key, which called inverted index or secondary index. Usually we may have several inverted indexes. Inverted indexing is an important technique that speed the search process and enhance the capabilities of the database. Many critiria or strategies on indexing differ the ways of creating inverted indexes. An inverted index is derived from the primary index --- the original database of records and makes the search process faster. In fact, an inverted index record is the sum of all possible searches that computed and stored in advance according to the strategy. Inverted indexes may be created for some or all of the terms in the database. When all the terms are indexed, it is called full index. Full index is fast and effective, especially when the user knows only a piece of information in the record.
Fulltext indexing is another widespread use of inverted index. Indexes can be built from the words in the document, the full text of the resource. Because the entire resource is parsed, or scanned to derive the information that makes up the record, this indexing technique is called fulltext indexing. This would allow the user search certain words or phrase, or combinations of words among documents.
Documents on the Web have a great varieties both on type and size and reside on widely distributed Web servers. Many data types and formats, such as words, pictures, audio, video, multimedia, newgroups, and network services, exist on the Web. Also on the Web, some resources are small files, and some are huge databases containing thousands of entries. Thus new challenges arise for indexing for the Web. Practically speaking, a wide variety of classification schemes are used.
A common way to derive a record - URC (Uniform Resource Characteristic) data from a resource or document on the Web is to create a fulltext index of that document. However the practical disadvantage of a fulltext indexing is that it requires great amount of storage: Parsing a resource and pulling out all the words creates a huge volumn of data. To reduce a URC from fulltext index down to a reasonable size, some operations may be applied to the fulltext index data. This is exactly what RBSE and Lycos indexers do. For example, to create a record - URC describing the document, a parser can be used to derive the most important keywords, such as the title of the document, heading, subheadings (words in the section titles), the most weighty words (according to the frequency of occurences of that word over all documents), descriptive words around clickable hypertext links, and so on. It depends on what strategies or criteria the search system likes to chose for indexing.
The use of robot indexer is a successful approach to derive URCs. There are a few of early robot indexers, such as Jumpstation, the World Wide Web Worm, RBSE Project Sprider. The following table (from N. J. Yeager & R. E. McGrath's book, 1996, p.238-239) shows an example of different URC formats for Web robots with different strategies:
URC format for Web robots
---------------------------------------------------------------
Robot URC
---------------------------------------------------------------
Jumpstation Title of HTML document, words in
section titles (level headers).
World Wide Web Worm Title of HTML document, full pathname
(URL), descriptive words around
clickable hypertext link that
lead to the document.
RBSE Title of HTML document, 20 most weighted
words* (derived via WAIS fulltex
indexing).
Lycos Title, heading, subheadings, 100 most
weighted words* (derived via fulltext
indexing), first 20 lines, size in
bytes, number of words.
CUI W3 Title, location (URL), document that
linked to and from this document,
keywords, abstracts.
---------------------------------------------------------------
*weight of word = term frequency / document frequency
term frequency = frequency of occurence of that word over
all documents in the document collection
document frequency = frequency of occurence of that word
in a given document
The most popular and successful Web search services today, Lycos and Yahoo, create indexes from robot gatherers.
Copyright © 1996 Aixiang (I Song) Yao, All Rights Reserved
Aixiang (I Song) Yao<ayao@csgrad.cs.vt.edu>
Last modified: Sat Oct 3013:15:51 1996