Edward A. Fox
Department of Computer Science,
Virginia Tech, Blacksburg VA 24061-0106
Common string searching, on the other hand, simply takes a pattern and tries to find match(es) in a target string. Since this is such an important operation, fast algorithms are critical. Chapter 10 surveys the numerous algorithms, considers their strengths and weaknesses, and gives empirical performance data. Practice with the grep family of UNIX commands reinforces these results. Algorithm animations help clarify the operation of the KMP and Boyer-Moore algorithms.
This Unit has 2 chapters, 2 exercises, 2 lectures, and a session in the laboratory.
The most common type of search is string searching, covered in Chapter 10. UNIX has the grep family of search tools, and they are among the most frequently used UNIX commands, consuming a large share of the CPU and disk resources of many systems. String searching using fast algorithms is reasonably efficient, and sometimes those same algorithms can be simply changed to work with specialized computer architectures (such as parallel processors, or fast disk subsystems, sometimes with search chips on each disk head).
Plain string matching is valuable in situations when there is no space to build supporting indexes, when searching occurs infrequently, when complex patterns are involved, when data is transient or generally encrypted, or as a post-processing operation on retrieved sets that must be further reduced in size. On the other hand, when there is repeated use to search a collection using strings or pattern expressions, such as for dictionary or literary analysis, then some type of supporting index is needed.
Thus, PAT, a commercial system that uses Patricia trees and arrays, applies the algorithms and data structures discussed in Chapter 5. PAT's elegant model considers a collection as a long text string, and makes efficient use of the specialized supporting index structures. PAT will be used for searching a variety of test collections (i.e., 2), and to carry out a wide variety of operations that are well supported by Patricia trees.
From the Course Objectives a key point is the objective to: critique, contrast, compare, and evaluate the efficiency, effectiveness, and utility of commercially available and research prototype systems for IS&R. We can operationalize this in terms of considering the PAT system and UNIX tools in the grep family. We can also look at the fundamental issues in this regard, involving data and files structures and their coupling with algorithms: building PAT trees and arrays, testing the many operations they support on real collections, and comparing the speed of various string matching schemes.
Another objective calls for students to: read and understand research contributions ... Certainly Chapter 5 is representative of the research literature, and Chapter 10 has some of that flavor as well as some of the attributes of a survey article.
In summary, this Unit has the following objectives, for students to be able to:
There are two main types of effort required. First, the readings (see Section 5) should be carefully studied, keeping unit objectives in mind (see Section 3). Second, students should use the PAT system and the UNIX string searching functions. Note that it is wisest to do the readings and exercises together for each of the two main topics.
There are two parts to the exercise: using the PAT system and using UNIX for string searching. The laboratory session will help with these, and also demonstrate working with algorithm animations.
The PAT User's Guide is on reserve in the Library.
To work with PAT, connect to fox.cs.vt.edu using TELNET or
similar facility.
Be sure you have the proper environment set up. This is explained
in the page on UNIX
Use in CS5604 under the bullet
about Path and Environment. In particular you should
cp ~fox/.login ~
one time to be sure your path is proper. The next time you
login, things should work OK.
An online copy of the PAT User's Guide is available on fox.cs.vt.edu in /info1/oed and can be searched with PAT. To use PAT in command line mode, simply say cd /info1/oed; pat manfox
To actually read the manual online, setup for an X Window System session to take place on your UNIX workstation or X terminal, and use LectorMotif -F spec-manual -X manual from that same directory. Try the Full Text selection in the FOrmats menu.
You can use any of the man pages in /u1/man by simply saying something like man -P /u1/man pat to work with the pat command, for example.
To use PAT, as mentioned above, move to the data area using the cd /info1/oed command and then use ls -l to look at the files. Consider for example the file manual which has the PAT manual. To browse through the file simply use more manual or else use Lector to get a nice X display with the LectorMotif -F spec-manual -X manual command mentioned previously. To search the manual, say pat manfox.
Similarly you can say more 2e to see the 2nd edition of the Oxford English Dictionary, or display it using X with
LectorMotif -F spec-oed -X 2e or search it with pat 2efox and various PAT commands.
Please refer to the Quick Reference Guide to PAT which is at the back of the User's Guide. You will note a series of boxes under the heading Examples --- they are present on both sides of the page. The first part of your assignment is to try out most of the examples shown, on one or another of the PAT databases built (preferably the manual, to save time, though you need to use the OED for anything involving docs. Be sure to send the instructor a log of the interactions including results. Also record how long each operation takes. (Hint: When you quit from PAT it gives you a time.) As becomes necessary, you may make minor substitutions or other small changes to the commands listed under Examples if they yield no or improper results. Since you do not have write permission in the directories you are working with, you should issue a command like {SaveFile ``/u1/loginid/pat.out''} where loginid is replaced by your account name and where the brackets and quotes are all required. Then your results will be in your home directory in file pat.out so you can rename the file, print it out, etc. For more information about this command, look in the relevant page of the manual database (or in the printed manual).
See the file 2e.docs to learn more about the actual tagging used with the OED. This will help you use the ``docs'' type commands, and to carry out the following required searches in the OED collection:
Here you should also use the manual database, in particular, work in directory /info1/oed with file manual and issue commands like fgrep search manual. To decide what routine to use, do UNIX manual lookups, as in man grep to learn about grep, egrep, and fgrep. You can pipe results into wc to get counts. Be sure to time each command and to save the results, for each of the following:
For the laboratory session, you should work with the xtango algorithm animations on fox.cs.vt.edu, with your .login file set to match that in fox. Issue the following commands:
cd /u1/bin; kmp; boyer after doing the proper setenv operation to enable display on your X device. Then you can repeat the two animations as desired. Use the sample pattern and text from your textbook, or others that you think of, and try the Run Animation button and/or the pause button.
You should be able to answer each of the following questions.
In both chapters, there is a dearth of examples. That and the terse, technical language make them similar to many hard journal articles. The first lecture should help with Chapter 5. The second lecture and the xtango demonstrations should help with Chapter 10. There is good coverage of KMP and Boyer-Moore in [2].
Regarding Chapter 5, there are some specific comments to be made:
In Chapter 10, there are many points needed to make the algorithms clear. A few corrections and useful pointers follow.