CS5604, Unit SS

Edward A. Fox
Department of Computer Science, Virginia Tech, Blacksburg VA 24061-0106

Abstract:

Most computer users undertake some type of searching: in a text editor, with utilities like grep, or with special search systems. We begin with the most powerful string search system that is presently available, PAT (recently renamed OpenText) --- studying its use and the underlying algorithms and data structures. Chapter 5 explains these in detail, and a lecture provides clarification. The PAT User's Guide [1] and related exercises allow us to achieve a good working knowledge of the underlying model and its implementation.

Common string searching, on the other hand, simply takes a pattern and tries to find match(es) in a target string. Since this is such an important operation, fast algorithms are critical. Chapter 10 surveys the numerous algorithms, considers their strengths and weaknesses, and gives empirical performance data. Practice with the grep family of UNIX commands reinforces these results. Algorithm animations help clarify the operation of the KMP and Boyer-Moore algorithms.

This Unit has 2 chapters, 2 exercises, 2 lectures, and a session in the laboratory.

Unit Highlights

Lecture/Motivating Question:
When is string matching the best approach for searching, and what are the tradeoffs in those situations regarding using Patricia trees/arrays vs. string search?

Special Class Presentations:
There will be lectures on PAT and Patricia trees/arrays, and on string searching. Also, the class will meet for a laboratory session.

Computer Exercises:
Exercises with PAT and the grep tools, as well as algorithm animations, will help make the concepts clear.

Introduction

The most common type of search is string searching, covered in Chapter 10. UNIX has the grep family of search tools, and they are among the most frequently used UNIX commands, consuming a large share of the CPU and disk resources of many systems. String searching using fast algorithms is reasonably efficient, and sometimes those same algorithms can be simply changed to work with specialized computer architectures (such as parallel processors, or fast disk subsystems, sometimes with search chips on each disk head).

Plain string matching is valuable in situations when there is no space to build supporting indexes, when searching occurs infrequently, when complex patterns are involved, when data is transient or generally encrypted, or as a post-processing operation on retrieved sets that must be further reduced in size. On the other hand, when there is repeated use to search a collection using strings or pattern expressions, such as for dictionary or literary analysis, then some type of supporting index is needed.

Thus, PAT, a commercial system that uses Patricia trees and arrays, applies the algorithms and data structures discussed in Chapter 5. PAT's elegant model considers a collection as a long text string, and makes efficient use of the specialized supporting index structures. PAT will be used for searching a variety of test collections (i.e., 2), and to carry out a wide variety of operations that are well supported by Patricia trees.

Objectives

From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can operationalize this in terms of considering the PAT system and UNIX tools in the grep family. We can also look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: building PAT trees and arrays, testing the many operations they support on real collections, and comparing the speed of various string matching schemes.

Another objective calls for students to: read and understand research contributions ... Certainly Chapter 5 is representative of the research literature, and Chapter 10 has some of that flavor as well as some of the attributes of a survey article.

In summary, this Unit has the following objectives, for students to be able to:

  1. effectively use the PAT system for string searching and other types of text analysis;

  2. explain how the PAT system carries out the operations available, referring to the relevant file and data structures;

  3. explain the key ideas behind the various string search algorithms that have evolved in recent decades; and

  4. compare and contrast the various string searching algorithms regarding complexity of code, efficiency, and utility for various types of patterns.

Suggested Procedure

There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should use the PAT system and the UNIX string searching functions. Note that it is wisest to do the readings and exercises together for each of the two main topics.

Computer Exercise

There are two parts to the exercise: using the PAT system and using UNIX for string searching. The laboratory session will help with these, and also demonstrate working with algorithm animations.

PAT

The PAT User's Guide is on reserve in the Library.

To work with PAT, connect to fox.cs.vt.edu using TELNET or similar facility. Be sure you have the proper environment set up. This is explained in the page on UNIX Use in CS5604 under the bullet about Path and Environment. In particular you should
cp ~fox/.login ~
one time to be sure your path is proper. The next time you login, things should work OK.

An online copy of the PAT User's Guide is available on fox.cs.vt.edu in /info1/oed and can be searched with PAT. To use PAT in command line mode, simply say cd /info1/oed; pat manfox

To actually read the manual online, setup for an X Window System session to take place on your UNIX workstation or X terminal, and use LectorMotif -F spec-manual -X manual from that same directory. Try the Full Text selection in the FOrmats menu.

You can use any of the man pages in /u1/man by simply saying something like man -P /u1/man pat to work with the pat command, for example.

To use PAT, as mentioned above, move to the data area using the cd /info1/oed command and then use ls -l to look at the files. Consider for example the file manual which has the PAT manual. To browse through the file simply use more manual or else use Lector to get a nice X display with the LectorMotif -F spec-manual -X manual command mentioned previously. To search the manual, say pat manfox.

Similarly you can say more 2e to see the 2nd edition of the Oxford English Dictionary, or display it using X with

LectorMotif -F spec-oed -X 2e or search it with pat 2efox and various PAT commands.

Please refer to the Quick Reference Guide to PAT which is at the back of the User's Guide. You will note a series of boxes under the heading Examples --- they are present on both sides of the page. The first part of your assignment is to try out most of the examples shown, on one or another of the PAT databases built (preferably the manual, to save time, though you need to use the OED for anything involving docs. Be sure to send the instructor a log of the interactions including results. Also record how long each operation takes. (Hint: When you quit from PAT it gives you a time.) As becomes necessary, you may make minor substitutions or other small changes to the commands listed under Examples if they yield no or improper results. Since you do not have write permission in the directories you are working with, you should issue a command like {SaveFile ``/u1/loginid/pat.out''} where loginid is replaced by your account name and where the brackets and quotes are all required. Then your results will be in your home directory in file pat.out so you can rename the file, print it out, etc. For more information about this command, look in the relevant page of the manual database (or in the printed manual).

See the file 2e.docs to learn more about the actual tagging used with the OED. This will help you use the ``docs'' type commands, and to carry out the following required searches in the OED collection:

  1. find examples of usage from the first decade of the 19th century [Hint: How would you make a range search for dates in that period? What happens when you run the command "1990" within docs Q?]
  2. find definitions where the words quality and attribute occur in close proximity
  3. what is the longest repeated phrase that begins with search [Hint: be sure to look for the word ``search '', with the space included in the search expression.]

UNIX String Searching

Here you should also use the manual database, in particular, work in directory /info1/oed with file manual and issue commands like fgrep search manual. To decide what routine to use, do UNIX manual lookups, as in man grep to learn about grep, egrep, and fgrep. You can pipe results into wc to get counts. Be sure to time each command and to save the results, for each of the following:

  1. find occurrences of the string search exactly as given [Why are there fewer here than PAT finds?]
  2. find occurrences of the string search where case is ignored [Why are there still fewer here than PAT finds?]
  3. find lines where search occurs as a separate word in lower case, not as part of a longer word (e.g., searches)

Lab Algorithm Animation

For the laboratory session, you should work with the xtango algorithm animations on fox.cs.vt.edu, with your .login file set to match that in fox. Issue the following commands:

cd /u1/bin; kmp; boyer after doing the proper setenv operation to enable display on your X device. Then you can repeat the two animations as desired. Use the sample pattern and text from your textbook, or others that you think of, and try the Run Animation button and/or the pause button.

Study Questions

You should be able to answer each of the following questions.

  1. Can you prepare a table for all of the operations discussed in regard to PAT searching, showing the computational complexity?
  2. If the skipped bits were not skipped, what effect would that have on the size of a PAT tree?
  3. For what operations would keeping the size of the associated subtree, for each internal node, be useful?
  4. Is the ``delayed reading paradigm'' a type of ``lazy evaluation'' scheme?
  5. Why are the results in Figures 10.7 and 10.11 different?
  6. Why is the ``match heuristic'' of Boyer-Moore not as valuable as one might think?
  7. What methods for string searching work well when the pattern has ``don't care'' symbols?
  8. What methods are likely to be used in the grep family of UNIX commands, and why?

Comments on Readings

In both chapters, there is a dearth of examples. That and the terse, technical language make them similar to many hard journal articles. The first lecture should help with Chapter 5. The second lecture and the xtango demonstrations should help with Chapter 10. There is good coverage of KMP and Boyer-Moore in [2].

Textbook Chapter 5

Regarding Chapter 5, there are some specific comments to be made:

Textbook Chapter 10

In Chapter 10, there are many points needed to make the algorithms clear. A few corrections and useful pointers follow.

Summary of Key Concepts

  1. The model of a text collection as a string, with various grammars or structures superimposed for reference purposes, is an important contribution made by those who have worked with PAT.

  2. Prefix, pattern, frequency, repetition, and range type questions are some places where Patricia trees excel.

  3. String searching speed will vary quite widely depending on the algorithm chosen. Some are tuned for average case, and some for worst case.

  4. String searching can be speeded up by pre-processing the pattern so that it can ``jump-ahead'' to the best place for resuming search after a mismatch.

  5. If one knows roughly how many matches there are for a pattern, that could be considered too in selecting an algorithm.

References

1
Heather Fawcett. PAT 3.3 User's Guide. Center for the New Oxford English Dictionary, University of Waterloo, 1989.

2
Gerard Salton. Automatic Text Processing. Addison-Wesley, 1989.



Edward A. Fox
Fri Sep 15 19:09:17 EDT 1995