WWW: Beyond the Basics

15. Searching and Database on the Web

15.2. Gathering

The resources and documents on the Web are complex, widely distributed, and dynamic, because anybody who has access to a Web server can put documents and other resources on the Web. In fact, it is impossible to collect every document on the Web because there are simply too many. There is also no guarantee that a document on the Web will be maintained for any specific period of time.

Robot gatherers are the basic gathering technology on the Web. A gatherer locates Web servers and collects resources to be indexed. Typically, gatherers fetch documents by traversing the hypertext links of a document. They should also have a policy for limiting the search of a given Web server. There are two main strategies for limiting the search of a Web server: breadth-first and depth-first. A breadth-first search is wide and shallow; a depth-first search is narrow but deep. Beginning with a single document, a gatherer follows a hypertext link and selects the next document to index. A breadth-first strategy first traverses all the hypertext links in the original document and gathers the documents they point to, then follows the links found in those gathered documents. A depth-first strategy traverses one link and gathers that document for indexing, then follows one of the links in the gathered document, and so on to the end of one path before turning to another path. Depth-first gathering usually sets a maximum number of links to follow; otherwise the gatherer might follow links forever.
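The two traversal orders can be illustrated with a minimal sketch. It assumes a small in-memory link graph in place of real Web pages; the names LINKS, gather_breadth_first, gather_depth_first, and max_depth are hypothetical and chosen only for this example. A real gatherer would fetch each document over the network and extract its links.

```python
from collections import deque

# Hypothetical link graph standing in for real Web pages:
# each key is a document, each value lists the documents it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": [],
    "F": ["A"],  # a link back to the start, to show that cycles are skipped
}

def gather_breadth_first(start, links, max_depth):
    """Gather documents level by level, at most max_depth links from start."""
    order, seen = [], {start}
    queue = deque([(start, 0)])
    while queue:
        doc, depth = queue.popleft()
        order.append(doc)
        if depth == max_depth:
            continue  # limit reached: do not follow this document's links
        for target in links.get(doc, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order

def gather_depth_first(start, links, max_depth):
    """Follow one path as deep as max_depth allows before trying the next."""
    order, seen = [], set()

    def visit(doc, depth):
        seen.add(doc)
        order.append(doc)
        if depth == max_depth:
            return  # limit reached: back up and try another path
        for target in links.get(doc, []):
            if target not in seen:
                visit(target, depth + 1)

    visit(start, 0)
    return order

print(gather_breadth_first("A", LINKS, 2))
print(gather_depth_first("A", LINKS, 2))
```

With this graph, the breadth-first gatherer collects both documents linked from A before going any deeper, while the depth-first gatherer finishes the path through B before it returns to C. In both functions, max_depth plays the role of the link limit that keeps the gatherer from following links forever.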

The gatherer fetches and collects documents and submits them to the indexer. The indexer assigns each document or resource a unique identifier (called the primary key), notes its storage location, and creates a record, a set of values or related terms describing the document. This process is indexing. Let us now turn to examining indexing.
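The hand-off from gatherer to indexer can be sketched as follows. This is only an illustration under stated assumptions: the function names, the record layout, and the choice of lowercased words as the describing terms are all hypothetical, and the example URL is a placeholder.

```python
from itertools import count

def make_indexer():
    """Return an empty index and a function that adds documents to it."""
    next_key = count(1)  # source of unique identifiers (primary keys)
    index = {}           # primary key -> record

    def add_document(location, text):
        key = next(next_key)
        # Assumption: the record's terms are the distinct lowercased words.
        terms = sorted(set(text.lower().split()))
        index[key] = {"location": location, "terms": terms}
        return key

    return index, add_document

index, add_document = make_indexer()
key = add_document("http://example.org/a.html", "Web robots gather Web documents")
print(key, index[key])
```

Each call assigns the next unused primary key and stores a record holding the storage location and the terms that describe the document.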


Copyright © 1996 Aixiang (I Song) Yao, All Rights Reserved

Aixiang (I Song) Yao<ayao@csgrad.cs.vt.edu>
Last modified: Sat Oct 3013:15:51 1996