WWW: Beyond the Basics

15. Searching and Database on the Web

15.3. Indexing

Indexing is the key component of the search system. An effective indexing process will yield high-quality records that represent accurately collection of resources, and that better describe the documents. Searching a high-quality index leads to precise identification and correct retrieval of resources.

In effect, indexing for a search system is virtually the same as for a traditional database. In a traditional database, a record is a set of values treated as a unit. For example in a library catalog, a record for a book may consists of title, author, publisher, subject, and so on.

An example of two sample records in a database is shown below

     Record: 012
     ------------
     Author:     Lincoln D. Stein
     Title:      How to Set Up and Maintain a World Wide Web Site
     Publisher:  Addison-Wesley Publishing Company, Inc. (1995)
     Subject:    Web, Netscape, HTML
     ISBN:       0-201-63389-2

     Record: 345
     ------------
     Author:     David Flanagan
     Title:      Java in a Nutshell
     Publisher:  O'Reilly & Associates, Inc., (1996)
     Subject:    Java 
     ISBN:       1-56592-183-6

     Record: 678
     ------------   
     Author:     W. Richard Stevens
     Title:      TCP/IP Illustrated, Volume 3.
     Publisher:  Addison-Wesley Publishing Company, Inc. (1996)
     Subject:    TCP/IP, Networks, HTTP, NNTP.
     ISBN:       0-201-63495-3
 

While for a search system, a record for a Web resource or document is a URC (Uniform Resource Characteristic). A URC is a set of attributes that describes a resource or document on the Web. For example, the simplest type of URC are Hotlists that only consist of a document title and document location (URL - Uniform Resource Location), such as

    Record: 210
    Title:  Table of Contents
    URL:    http://www.netscape.com/search/

    Record: 543
    Title:  Virginia Tech Computer Science Department Home Page
    URL:    http://www.cs.vt.edu/

    Record: 876
    Title:  CS6204: WWW:Beyond the Bascics
    URL:    http://ei.cs.vt.edu/~wwwbtb/

Let us recall some basic functions about database. The most important thing about a database is to find information by knowing something about the information you want, without necessarily knowing the detail of how it is stored.

A database is a set of records. Each record in a database can be uniquely identified by a key value. The database is called a primary index if it has these unique keys. We can find the right information by knowing the right key, without necessarily knowing how it is stored.

However searching only on the primary key is somewhat unconvenient, which limits the capability of the database. Sometimes we do not know the primary key, but want to search terms based on other keys. For example, we want search books by an author, or in a subject, or by book title on a library database. Searching information on the Web by heading or document title on a Web indexing database is another example. To this end - searching based on another key, we need to create another database organized by that key, which called inverted index or secondary index. Usually we may have several inverted indexes. Inverted indexing is an important technique that speed the search process and enhance the capabilities of the database. Many critiria or strategies on indexing differ the ways of creating inverted indexes. An inverted index is derived from the primary index --- the original database of records and makes the search process faster. In fact, an inverted index record is the sum of all possible searches that computed and stored in advance according to the strategy. Inverted indexes may be created for some or all of the terms in the database. When all the terms are indexed, it is called full index. Full index is fast and effective, especially when the user knows only a piece of information in the record.

Fulltext indexing is another widespread use of inverted index. Indexes can be built from the words in the document, the full text of the resource. Because the entire resource is parsed, or scanned to derive the information that makes up the record, this indexing technique is called fulltext indexing. This would allow the user search certain words or phrase, or combinations of words among documents.

Documents on the Web have a great varieties both on type and size and reside on widely distributed Web servers. Many data types and formats, such as words, pictures, audio, video, multimedia, newgroups, and network services, exist on the Web. Also on the Web, some resources are small files, and some are huge databases containing thousands of entries. Thus new challenges arise for indexing for the Web. Practically speaking, a wide variety of classification schemes are used.

A common way to derive a record - URC (Uniform Resource Characteristic) data from a resource or document on the Web is to create a fulltext index of that document. However the practical disadvantage of a fulltext indexing is that it requires great amount of storage: Parsing a resource and pulling out all the words creates a huge volumn of data. To reduce a URC from fulltext index down to a reasonable size, some operations may be applied to the fulltext index data. This is exactly what RBSE and Lycos indexers do. For example, to create a record - URC describing the document, a parser can be used to derive the most important keywords, such as the title of the document, heading, subheadings (words in the section titles), the most weighty words (according to the frequency of occurences of that word over all documents), descriptive words around clickable hypertext links, and so on. It depends on what strategies or criteria the search system likes to chose for indexing.

The use of robot indexer is a successful approach to derive URCs. There are a few of early robot indexers, such as Jumpstation, the World Wide Web Worm, RBSE Project Sprider. The following table (from N. J. Yeager & R. E. McGrath's book, 1996, p.238-239) shows an example of different URC formats for Web robots with different strategies:

          URC format  for Web robots
  ---------------------------------------------------------------
    Robot                           URC
  ---------------------------------------------------------------
  Jumpstation              Title of HTML document, words in 
                             section titles (level headers).
  World Wide Web Worm      Title of HTML document, full pathname
                             (URL), descriptive words around
                             clickable hypertext link that
                             lead to the document.
  RBSE                     Title of HTML document, 20 most weighted
                             words* (derived via WAIS fulltex
                             indexing).
  Lycos                    Title, heading, subheadings, 100 most
                             weighted words* (derived via fulltext
                             indexing), first 20 lines, size in
                             bytes, number of words.
  CUI W3                   Title, location (URL), document that
                             linked to and from this document,
                             keywords, abstracts.
  ---------------------------------------------------------------
   *weight of word = term frequency / document frequency
    term frequency = frequency of occurence of that word over
          all documents in the document collection
    document frequency = frequency of occurence of that word
          in a given document

The most popular and successful Web search services today, Lycos and Yahoo, create indexes from robot gatherers.

[PREV][NEXT][UP][HOME][VT CS]

Copyright © 1996 Aixiang (I Song) Yao, All Rights Reserved

Aixiang (I Song) Yao<ayao@csgrad.cs.vt.edu>
Last modified: Sat Oct 3013:15:51 1996