Federated Searching

what we've learned about the problem
James Powell, jpowell@vt.edu
UH3004, Fall 1997

Problems to solve:

accepting and distributing a query
When users search a collection of databases, they expect the query to be mapped to the requirements of each site, without their intervention where possible. They also expect the query to be submitted to every site automatically.
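
A minimal sketch of query distribution, assuming each site can be described by a small table entry. The site names, URLs and parameter names below are invented for illustration and do not refer to real services.

```python
from urllib.parse import urlencode

# Hypothetical site descriptions: where to send the query, what the
# query parameter is called, and any fixed parameters the site expects.
SITES = {
    "site_a": {"url": "http://search-a.example/cgi-bin/find",
               "query_param": "q", "extra": {}},
    "site_b": {"url": "http://search-b.example/query",
               "query_param": "terms", "extra": {"db": "all"}},
}

def build_requests(query, sites=SITES):
    """Return one ready-to-submit URL per site for the same user query."""
    urls = {}
    for name, site in sites.items():
        params = dict(site["extra"])          # fixed parameters first
        params[site["query_param"]] = query   # then the user's query
        urls[name] = site["url"] + "?" + urlencode(params)
    return urls
```

The user types the query once; the table supplies the per-site details.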

language barriers
Increasingly, vast collections of data are becoming available in languages other than English.  In most cases, no translations are available.  But full-text and fielded searching are practical for most non-English document collections.  By using a language dictionary, a query could be translated into the language of the remote site, allowing a person to locate useful information without any knowledge of the language in which the documents are written.  The researcher can then work with translation software or a human translator to translate the specific documents determined to be relevant.
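
Dictionary-based query translation might look like the sketch below. The three-word English-to-German dictionary is a toy assumption; a real dictionary would also have to deal with word forms, phrases and words with several meanings.

```python
# Toy English-to-German dictionary (illustrative only).
EN_DE = {"library": "bibliothek", "history": "geschichte", "book": "buch"}

def translate_query(query, dictionary):
    """Translate a query word by word; unknown words pass through unchanged."""
    words = query.lower().split()
    return " ".join(dictionary.get(word, word) for word in words)
```

Passing unknown words through unchanged keeps proper names and technical terms searchable.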

search option variations
Beyond exact match searching, search sites employ many mechanisms that can improve the precision of a search.  Among them are word truncation, boolean operators, field level operators, and proximity and range searches.  These mechanisms can be implemented in different ways on different systems.  Also, some mechanisms are not valid for all types of queries.  For example, a full text search site may not have any information about fields within the documents it indexes.  A range search might be useful when trying to locate documents dated within a specific range.
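
One way to cope with these variations is to hold the query in a neutral form and render it in each site's syntax. The sketch below handles only truncation and AND; the three syntax tables are hypothetical.

```python
# Hypothetical per-site syntax rules. A truncation value of None means
# the site does not support word truncation.
SYNTAX = {
    "site_a": {"truncation": "*", "and": " AND "},
    "site_b": {"truncation": "?", "and": " + "},
    "site_c": {"truncation": None, "and": " "},
}

def render_query(terms, site):
    """Render (word, truncated) pairs, all joined with AND, for one site."""
    rules = SYNTAX[site]
    parts = []
    for word, truncated in terms:
        if truncated and rules["truncation"] is not None:
            parts.append(word + rules["truncation"])
        else:
            # Unsupported truncation: the stem is sent as a plain word,
            # which may match fewer documents than the user intended.
            parts.append(word)
    return rules["and"].join(parts)
```

For example, render_query([("librar", True), ("history", False)], "site_a") gives "librar* AND history".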

result set variability
Result sets are often presented in radically different formats, with or without information explaining why an item was included or how it was ranked.  Sometimes results are presented in small blocks, and sometimes advertisements and other unrelated materials are included.  Merging result sets is an enormous problem.

differing mechanisms for ranking documents
In order to merge result sets, you must have result sets that rank items similarly and order the result list similarly, or you need to modify each result set so that it conforms to a particular set of formatting and ranking characteristics.  It is often difficult to determine exactly how a search engine selects one document over another.  In fact, sometimes this is a closely guarded trade secret.
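
As one illustration of making result sets conform to a common ranking, the sketch below rescales each site's scores to the range 0 to 1 (min-max normalization) before merging. This technique is chosen here for simplicity and is an assumption. It presumes each site returns numeric scores, which many do not, and it does not make scores from different ranking algorithms truly comparable.

```python
def normalize(results):
    """Rescale a list of (title, score) pairs so scores fall between 0 and 1."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every item as a top hit.
        return [(title, 1.0) for title, _ in results]
    return [(title, (score - lo) / (hi - lo)) for title, score in results]

def merge(result_sets):
    """Merge {site: [(title, score), ...]} into one list, best score first."""
    merged = []
    for site, results in result_sets.items():
        for title, score in normalize(results):
            merged.append((score, site, title))
    # Sort on the score only; ties keep their original order.
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged
```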

selecting sites to search
Some users may want to search every available site, until they discover that some sites are down and others contain nothing remotely related to their query.  Providing the user with a mechanism for preselecting a subset of searchable indexes saves time and computing resources.  Think of it as the first step in refining a search.
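
A preselection step could be as simple as the filter sketched below. The site records, subject labels and the "up" flag are hypothetical; in practice the availability flag would come from some periodic check.

```python
# Hypothetical site registry.
SITES = [
    {"name": "site_a", "subjects": {"history", "literature"}, "up": True},
    {"name": "site_b", "subjects": {"chemistry"}, "up": True},
    {"name": "site_c", "subjects": {"history"}, "up": False},
]

def select_sites(sites, subject):
    """Return names of sites that are available and cover the subject."""
    return [site["name"] for site in sites
            if site["up"] and subject in site["subjects"]]
```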

determining the characteristics of a search site
Many search sites are similar in that they present one text entry field and a button to initiate the search.  But behind the scenes, many web search forms contain hidden form fields, database selection options and even elements that define the appearance of the result set.  Most of these items have nothing to do with searching, but the search CGI program expects to receive them nonetheless.
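
The hidden fields of a search form can be discovered by parsing the form's HTML. The sketch below uses Python's standard html.parser module. The class name, function name and sample form are made up for illustration, and select lists and radio buttons are ignored.

```python
from html.parser import HTMLParser

class FormFieldParser(HTMLParser):
    """Collect the action URL, hidden fields and text fields of a form."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.hidden = {}
        self.text_fields = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
        elif tag == "input":
            name = attrs.get("name")
            if name is None:
                return  # unnamed inputs are not submitted
            kind = (attrs.get("type") or "text").lower()
            if kind == "hidden":
                self.hidden[name] = attrs.get("value") or ""
            elif kind == "text":
                self.text_fields.append(name)

def describe_form(html):
    """Return a small description of the first form found in the HTML."""
    parser = FormFieldParser()
    parser.feed(html)
    return {"action": parser.action, "hidden": parser.hidden,
            "text_fields": parser.text_fields}

SAMPLE = """
<form action="/cgi-bin/search" method="get">
  <input type="text" name="q">
  <input type="hidden" name="db" value="main">
  <input type="hidden" name="fmt" value="long">
  <input type="submit" value="Search">
</form>
"""
```

The hidden values found this way would be sent along with the user's query, since the search program expects them.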

how to handle delays and lags in response
Because searchable sites can be anywhere in the world, response time can vary dramatically.  Since merging results is difficult and time consuming, sites that are slow to respond or that fail to respond at all must be abandoned within a reasonable period of time.  Such failures need to be reported to the end user, along with information that will allow them to decide whether to try the site again or remove it from the list of sites to search.
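
A sketch of abandoning slow sites, using Python's concurrent.futures. The function query_site is a stand-in that sleeps instead of contacting a real server, and the site names and delays are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def query_site(name, delay):
    """Stand-in for a real network search; sleeps to simulate latency."""
    time.sleep(delay)
    return [name + " result"]

def federated_search(site_delays, timeout):
    """Query all sites in parallel; give up on any still pending at timeout.

    Returns (results, timed_out): results maps site name to its result
    list, and timed_out lists the sites to report back to the user.
    """
    executor = ThreadPoolExecutor(max_workers=max(1, len(site_delays)))
    futures = {executor.submit(query_site, name, delay): name
               for name, delay in site_delays.items()}
    done, not_done = wait(futures, timeout=timeout)
    results = {futures[f]: f.result() for f in done}
    timed_out = sorted(futures[f] for f in not_done)
    # Do not block on the slow sites; their threads finish in the background.
    executor.shutdown(wait=False)
    return results, timed_out
```

The timed_out list is what would be shown to the user, who can then retry those sites or drop them from the list.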

where should most of the work be performed?
Because of Java and the ability to write once and run anywhere, it is now possible to ask the desktop system to perform some of the tasks, such as result set merging or query mapping.  But distributing the workload may not provide many benefits if the client is waiting on a server that is in turn waiting on ten or one hundred servers to respond.