CS5604, Unit SS

Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg VA 24061-0106

Abstract:

Most computer users undertake some type of searching: in a text editor, with utilities like grep, or with special search systems. We begin with the most powerful string search system that is presently available, PAT (recently renamed OpenText, herein still referred to as PAT for historic reasons and because PAT is still a key part of OpenText) --- studying its use and the underlying algorithms and data structures. Chapter 5 explains these in detail, and a lecture provides clarification. The PAT User's Guide [1] and related exercises allow us to achieve a good working knowledge of the underlying model and its implementation.

Common string searching, on the other hand, simply takes a pattern and tries to find match(es) in a target string. Since this is such an important operation, fast algorithms are critical. Chapter 10 surveys the numerous algorithms, considers their strengths and weaknesses, and gives empirical performance data. Practice with the grep family of UNIX commands reinforces these results. Algorithm simulations help clarify the operation of the KMP and Boyer-Moore algorithms.

This Unit has 2 chapters, 2 exercises, 2 lectures, and a session in the laboratory.

1. Unit Highlights

Lecture/Motivating Question:
When is string matching the best approach for searching, and what are the tradeoffs in those situations regarding using Patricia trees/arrays vs. string search?

Special Class Presentations:
There will be lectures on PAT and Patricia trees/arrays, and on string searching. Also, the class will meet for a laboratory session.

Computer Exercises:
Exercises with PAT and the grep tools will help make the concepts clear.

2. Introduction

The most common type of search is string searching, covered in Chapter 10. UNIX has the grep family of search tools, and they are among the most frequently used UNIX commands, consuming a large share of the CPU and disk resources of many systems. String searching using fast algorithms is reasonably efficient, and sometimes those same algorithms can be simply changed to work with specialized computer architectures (such as parallel processors, or fast disk subsystems, sometimes with search chips on each disk head).

Plain string matching is valuable in situations when there is no space to build supporting indexes, when searching occurs infrequently, when complex patterns are involved, when data is transient or generally encrypted, or as a post-processing operation on retrieved sets that must be further reduced in size. On the other hand, when there is repeated use to search a collection using strings or pattern expressions, such as for dictionary or literary analysis, then some type of supporting index is needed.

Thus, PAT, a commercial system that uses Patricia trees and arrays, applies the algorithms and data structures discussed in Chapter 5. PAT's elegant model considers a collection as a long text string, and makes efficient use of the specialized supporting index structures. PAT will be used for searching a variety of test collections (i.e., 2), and to carry out a wide variety of operations that are well supported by Patricia trees.

3. Objectives

From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can operationalize this in terms of considering the PAT system and UNIX tools in the grep family. We can also look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: building PAT trees and arrays, testing the many operations they support on real collections, and comparing the speed of various string matching schemes.

Another objective calls for students to: read and understand research contributions ... Certainly Chapter 5 is representative of the research literature, and Chapter 10 has some of that flavor as well as some of the attributes of a survey article.

In summary, this Unit has the following objectives, for students to be able to:

  1. effectively use the PAT system for string searching and other types of text analysis;

  2. explain how the PAT system carries out the operations available, referring to the relevant file and data structures;

  3. explain the key ideas behind the various string search algorithms that have evolved in recent decades; and

  4. compare and contrast the various string searching algorithms regarding complexity of code, efficiency, and utility for various types of patterns.

4. Suggested Procedure

There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should use the PAT system and the UNIX string searching functions. Note that it is wisest to do the readings and exercises together for each of the two main topics.

4.1 Computer Exercise

There are two parts to the exercise: using the PAT system and using UNIX for string searching. The laboratory session will help with these.

4.1.1 PAT

The PAT User's Guide is on reserve in the Library. It is not required, but helps you learn more about searching with PAT.

To work with PAT, connect to vtaix.cc.vt.edu using TELNET or similar facility. You probably will have the best luck using a UNIX system or X terminal such as are available to CS graduate students, or in McB 116 or 118.

Be sure you have the proper environment set up. This is explained in the page on UNIX Use in CS5604 under the bullet about Path and Environment. In particular you should

cp ~fox/.login ~
one time to be sure your path is proper. The next time you login, things should work OK.

An online copy of the Query Language Reference Manual is available on vtaix.cc.vt.edu as /usr/local/opentext/examples/manual/data/manual and can be searched with PAT. In particular, to use PAT in command line mode with that Manual, simply say

cd /usr/local/opentext/examples/manual/scripts
to get to the right place for such commands and then run PAT as
./go_pat
whereupon you can issue PAT commands, ending with the PAT quitcommand when done.

For more convenient access to that manual, you can read or browse using the command ./go_lector or can search and read using the command ./go_otquery_xm (once you have gone to a device that can handle the X Window System and set the display to your device). If you don't have access to X, you can use the character mode version with command ./go_otquery_ch but that mode is a bit hard to manage.

To work with Lector, use its FormatS menu, select Outline - entries so you can see all the commands, and then use Start (to get to the beginning) and the paging commands to find a page where a command you are interested in appears. Next use the left mouse button to click on the command you select, which will bring it to the top of the page. Finally, go back to the FormatS menu, and switch back to the Standard view to read details. Note that End does not end your session (for which you use Exit in the File menu); it just moves you to the last page of the manual.

To search using Open TextQuery (i.e., after issuing command ./go_otquery_xm), simply fill in a word or string in the first box under Search For and hit Return. The large box below keeps a history of queries submitted along with results. Each line begins with a count of the number of matches. When you double click on a line, it brings up another window with a line for each of the matches. If you double click on one of those, it brings up a third window, which is Lector positioned just prior to the point of match. That lets you read more around the match, and the following pages. When you are done, close the first window, which closes the others too.

There are several options regarding searches. You can limit the context of a search using the Region selections - just use the downward arrows to see the popup menu of regions, such as inside a particular tag. Unless you have reason to do otherwise, use ENT (entry) for limiting things.

Under the Query menu, entry for Advanced Options, you can select weights or Boolean options, though I have had screen display problems with these.

Once you are familiar with all this, please do the following and turn in the results and you answers to the instructor.

  1. Tell how many times the word system occurs.
  2. Tell how many times the word the occurs.
  3. Give the entry under which the phrase Pat system occurs.
  4. True or False: Entries in which including occurs in the Examples region also have region occuring in the Examples region, but not vice versa. (Be sure to explain how you found your answer.)

4.1.2 UNIX String Searching

Here you should also use the manual database, in particular, work on vtaix in directory /usr/local/opentext/examples/manual/data with file manual and issue commands like fgrep search manual. To decide what routine to use, do UNIX manual lookups, as in man grep to learn about grep, egrep, and fgrep. You can pipe results into wc to get counts. Thus, from your home directory (which you can reach with command cd), you could issue a command like

fgrep the /usr/local/opentext/examples/manual/data/manual|wc
For each of the following:
  1. find occurrences of the string the exactly as given [Why are there fewer here than PAT finds?]
  2. find occurrences of the string the where case is ignored [Why are there still fewer here than PAT finds?]
  3. find lines where the occurs as a separate word in lower case, not as part of a longer word (e.g., not cases like then)
Send your answers for these three experiments to the instructor.

4.2 Study Questions

You should study to be able to answer each of the following questions.

  1. Can you prepare a table for all of the operations discussed in regard to PAT searching, showing the computational complexity?
  2. If the skipped bits were not skipped, what effect would that have on the size of a PAT tree?
  3. For what operations would keeping the size of the associated subtree, for each internal node, be useful?
  4. Is the ``delayed reading paradigm'' a type of ``lazy evaluation'' scheme?
  5. Why are the results in Figures 10.7 and 10.11 different?
  6. Why is the ``match heuristic'' of Boyer-Moore not as valuable as one might think?
  7. What methods for string searching work well when the pattern has ``don't care'' symbols?
  8. What methods are likely to be used in the grep family of UNIX commands, and why?

5. Comments on Readings

In both chapters, there is a dearth of examples. That and the terse, technical language make them similar to many hard journal articles. The first lecture should help with Chapter 5. The second lecture and the xtango demonstrations should help with Chapter 10. There is good coverage of KMP and Boyer-Moore in [2].

5.1 Textbook Chapter 5

Regarding Chapter 5, there are some specific comments to be made:

5.2 Textbook Chapter 10

In Chapter 10, there are many points needed to make the algorithms clear. A few corrections and useful pointers follow.

6. Summary of Key Concepts

  1. The model of a text collection as a string, with various grammars or structures superimposed for reference purposes, is an important contribution made by those who have worked with PAT.

  2. Prefix, pattern, frequency, repetition, and range type questions are some places where Patricia trees excel.

  3. String searching speed will vary quite widely depending on the algorithm chosen. Some are tuned for average case, and some for worst case.

  4. String searching can be speeded up by pre-processing the pattern so that it can ``jump-ahead'' to the best place for resuming search after a mismatch.

  5. If one knows roughly how many matches there are for a pattern, that could be considered too in selecting an algorithm.

7. References

1
Heather Fawcett. PAT 3.3 User's Guide. Center for the New Oxford English Dictionary, University of Waterloo, 1989.

2
Gerard Salton. Automatic Text Processing. Addison-Wesley, 1989.



Edward A. Fox
1 Oct 1996