Edward A. Fox
Department of Computer Science,
Virginia Tech, Blacksburg VA 24061-0106
Most computer users undertake some type of searching: in a text editor, with utilities like grep, or with special search systems. We begin with the most powerful string search system that is presently available, PAT (recently renamed OpenText, herein still referred to as PAT for historic reasons and because PAT is still a key part of OpenText) --- studying its use and the underlying algorithms and data structures. Chapter 5 explains these in detail, and a lecture provides clarification. The PAT User's Guide [1] and related exercises allow us to achieve a good working knowledge of the underlying model and its implementation.
Common string searching, on the other hand, simply takes a pattern and tries to find match(es) in a target string. Since this is such an important operation, fast algorithms are critical. Chapter 10 surveys the numerous algorithms, considers their strengths and weaknesses, and gives empirical performance data. Practice with the grep family of UNIX commands reinforces these results. Algorithm simulations help clarify the operation of the KMP and Boyer-Moore algorithms.
This Unit has 2 chapters, 2 exercises, 2 lectures, and a session in the laboratory.
The most common type of search is string searching, covered in Chapter 10. UNIX has the grep family of search tools, and they are among the most frequently used UNIX commands, consuming a large share of the CPU and disk resources of many systems. String searching using fast algorithms is reasonably efficient, and sometimes those same algorithms can be simply changed to work with specialized computer architectures (such as parallel processors, or fast disk subsystems, sometimes with search chips on each disk head).
Plain string matching is valuable in situations when there is no space to build supporting indexes, when searching occurs infrequently, when complex patterns are involved, when data is transient or generally encrypted, or as a post-processing operation on retrieved sets that must be further reduced in size. On the other hand, when there is repeated use to search a collection using strings or pattern expressions, such as for dictionary or literary analysis, then some type of supporting index is needed.
Thus, PAT, a commercial system that uses Patricia trees and arrays, applies the algorithms and data structures discussed in Chapter 5. PAT's elegant model considers a collection as a long text string, and makes efficient use of the specialized supporting index structures. PAT will be used for searching a variety of test collections (i.e., 2), and to carry out a wide variety of operations that are well supported by Patricia trees.
From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can operationalize this in terms of considering the PAT system and UNIX tools in the grep family. We can also look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: building PAT trees and arrays, testing the many operations they support on real collections, and comparing the speed of various string matching schemes.
Another objective calls for students to: read and understand research contributions ... Certainly Chapter 5 is representative of the research literature, and Chapter 10 has some of that flavor as well as some of the attributes of a survey article.
In summary, this Unit has the following objectives, for students to be able to:
There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should use the PAT system and the UNIX string searching functions. Note that it is wisest to do the readings and exercises together for each of the two main topics.
There are two parts to the exercise: using the PAT system and using UNIX for string searching. The laboratory session will help with these.
The PAT User's Guide is on reserve in the Library. It is not required, but helps you learn more about searching with PAT.
To work with PAT, connect to vtaix.cc.vt.edu using TELNET or similar facility. You probably will have the best luck using a UNIX system or X terminal such as are available to CS graduate students, or in McB 116 or 118.
Be sure you have the proper environment set up. This is explained
in the page on UNIX
Use in CS5604 under the bullet
about Path and Environment. In particular you should
An online copy of the Query Language Reference Manual is available on vtaix.cc.vt.edu as /usr/local/opentext/examples/manual/data/manual and can be searched with PAT. In particular, to use PAT in command line mode with that Manual, simply say
For more convenient access to that manual, you can read or browse using the command ./go_lector or can search and read using the command ./go_otquery_xm (once you have gone to a device that can handle the X Window System and set the display to your device). If you don't have access to X, you can use the character mode version with command ./go_otquery_ch but that mode is a bit hard to manage.
To work with Lector, use its FormatS menu, select Outline - entries so you can see all the commands, and then use Start (to get to the beginning) and the paging commands to find a page where a command you are interested in appears. Next use the left mouse button to click on the command you select, which will bring it to the top of the page. Finally, go back to the FormatS menu, and switch back to the Standard view to read details. Note that End does not end your session (for which you use Exit in the File menu); it just moves you to the last page of the manual.
To search using Open TextQuery (i.e., after issuing command ./go_otquery_xm), simply fill in a word or string in the first box under Search For and hit Return. The large box below keeps a history of queries submitted along with results. Each line begins with a count of the number of matches. When you double click on a line, it brings up another window with a line for each of the matches. If you double click on one of those, it brings up a third window, which is Lector positioned just prior to the point of match. That lets you read more around the match, and the following pages. When you are done, close the first window, which closes the others too.
There are several options regarding searches. You can limit the context of a search using the Region selections - just use the downward arrows to see the popup menu of regions, such as inside a particular tag. Unless you have reason to do otherwise, use ENT (entry) for limiting things.
Under the Query menu, entry for Advanced Options, you can select weights or Boolean options, though I have had screen display problems with these.
Once you are familiar with all this, please do the following and turn in the results and you answers to the instructor.
Here you should also use the manual database, in particular, work on vtaix in directory /usr/local/opentext/examples/manual/data with file manual and issue commands like fgrep search manual. To decide what routine to use, do UNIX manual lookups, as in man grep to learn about grep, egrep, and fgrep. You can pipe results into wc to get counts. Thus, from your home directory (which you can reach with command cd), you could issue a command like
You should study to be able to answer each of the following questions.
In both chapters, there is a dearth of examples. That and the terse, technical language make them similar to many hard journal articles. The first lecture should help with Chapter 5. The second lecture and the xtango demonstrations should help with Chapter 10. There is good coverage of KMP and Boyer-Moore in [2].
Regarding Chapter 5, there are some specific comments to be made:
In Chapter 10, there are many points needed to make the algorithms clear. A few corrections and useful pointers follow.