Introduction - 1
- Three types of text searching, using:
- cluster trees
- hashing
- sorted indexes: inverted files, PAT trees and arrays
Introduction - 2
- Model of text collection
- Position = sistring
- Match can refer to structures too, e.g., a doc or chapter including "PAT Array"
Introduction - 3
- Query language
- See PAT User's Guide and the Quick Reference Guide
- Strings: exact, regexp, context
- Ranges, Proximity conditions
- Boolean operations, Sets
- Frequent text, Repetitions
PAT Trees - 1
- Sistrings (semi-infinite strings)
- Lexicographic sistring ordering
- "A SA..." < "AMP..." < "E ST..."
- ID = position (e.g., 2 for "HIS IS...")
- PAT tree = Patricia tree of all sistrings in a text
PAT Trees - 2
- 01 - THIS IS A SAMPLE STRING
- 02 - HIS IS A SAMPLE STRING
- 03 - IS IS A SAMPLE STRING
- 04 - S IS A SAMPLE STRING
- 06 - IS A SAMPLE STRING
- 07 - S A SAMPLE STRING
- 09 - A SAMPLE STRING
- 11 - SAMPLE STRING
- 12 - AMPLE STRING
- 13 - MPLE STRING
- 18 - STRING
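The listing above can be reproduced with a few lines of Python. This is a minimal sketch, not the PAT system's code: the names `sistring` and `sorted_positions` are made up for illustration, positions are 1-based as on the slide, and the end of the text simply ends the string (a real sistring is conceptually padded to the right).

```python
TEXT = "THIS IS A SAMPLE STRING"

def sistring(text, pos):
    """Return the sistring that starts at 1-based position pos."""
    return text[pos - 1:]

def sorted_positions(text):
    """Return all positions, ordered by the lexicographic order of their sistrings."""
    return sorted(range(1, len(text) + 1), key=lambda p: sistring(text, p))

# The positions shown on the slide, formatted the same way.
SLIDE_LINES = [f"{p:02d} - {sistring(TEXT, p)}"
               for p in (1, 2, 3, 4, 6, 7, 9, 11, 12, 13, 18)]
```

Comparing `sistring(TEXT, 9)`, `sistring(TEXT, 12)` and `sistring(TEXT, 16)` with `<` gives the ordering "A SA..." < "AMP..." < "E ST..." from the previous slide.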
PAT Trees - 3
- Patricia tree
- Binary digital tree
- n external nodes with key values
- n-1 internal nodes
- each internal node indicates the bit for branching (count of bits to skip or absolute bit position)
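To make the node counts concrete, here is a small in-memory sketch of a Patricia-style tree over the sistrings of the sample text. It is an illustration under stated assumptions, not the PAT implementation: keys are 8-bit character codes padded with NUL characters to a fixed width, internal nodes store the absolute bit position (the slide also allows a skip count), and the tuple layout and function names are hypothetical.

```python
TEXT = "THIS IS A SAMPLE STRING"

def key_bits(text, pos):
    """Bit string of the sistring at 1-based pos, NUL-padded to a fixed width."""
    s = text[pos - 1:].ljust(len(text), "\0")
    return "".join(format(ord(c), "08b") for c in s)

def build(keys):
    """Build the tree from a sorted list of distinct (bits, pos) pairs.

    External nodes are ("ext", pos); internal nodes are ("int", bit, left, right).
    """
    if len(keys) == 1:
        return ("ext", keys[0][1])
    first, last = keys[0][0], keys[-1][0]
    # Keys are sorted, so the first bit where the smallest and largest keys
    # differ is the first bit where any two keys in this group differ.
    bit = next(i for i in range(len(first)) if first[i] != last[i])
    zeros = [k for k in keys if k[0][bit] == "0"]
    ones = [k for k in keys if k[0][bit] == "1"]
    return ("int", bit, build(zeros), build(ones))

def pat_tree(text):
    """Patricia-style tree over every sistring of text."""
    keys = sorted((key_bits(text, p), p) for p in range(1, len(text) + 1))
    return build(keys)

def count(node):
    """Return (internal nodes, external nodes) in the subtree rooted at node."""
    if node[0] == "ext":
        return (0, 1)
    l_int, l_ext = count(node[2])
    r_int, r_ext = count(node[3])
    return (l_int + r_int + 1, l_ext + r_ext)

def descend(node, bits):
    """Follow only the branching bits; the caller must verify the skipped bits."""
    while node[0] == "int":
        node = node[2] if bits[node[1]] == "0" else node[3]
    return node[1]
```

For the 23-character sample text, `count(pat_tree(TEXT))` gives 22 internal and 23 external nodes, matching the n-1 and n figures above.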
Algorithms on PAT Trees
- Prefix searching
- Proximity searching
- Range searching
- Longest repetition searching
- Most frequent searching
- Regular expression searching
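As one example from this list, longest repetition searching can be sketched without a tree. The sketch assumes that lexicographically sorted sistrings stand in for the PAT tree: the longest repeated substring is the longest common prefix between two neighbors in sorted order. The function names are hypothetical, and only two of the matching positions are reported.

```python
TEXT = "THIS IS A SAMPLE STRING"

def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def longest_repetition(text):
    """Return (substring, [two 1-based positions]) for the longest repeated
    substring, or ("", []) if no character repeats."""
    order = sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])
    best_len, best_pair = 0, []
    # Only neighbors in sorted order need to be compared.
    for p, q in zip(order, order[1:]):
        n = common_prefix_len(text[p - 1:], text[q - 1:])
        if n > best_len:
            best_len, best_pair = n, sorted((p, q))
    if best_len == 0:
        return ("", [])
    start = best_pair[0] - 1
    return (text[start:start + best_len], best_pair)
```

On the sample text this reports "IS " (with the trailing blank) at positions 3 and 6.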
Building PAT Trees
- Naive solution: about 18n chars of index for n chars of text
- 3-4 words/ internal node (2 ptrs)
- 1 word/ external node (ptr)
- Bucketing of external nodes
- Supernodes of internal tree nodes
- 2-3 disk accesses in index + final access to verify skipped bits
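The 18n figure can be checked with a little arithmetic. The sketch below assumes 4-byte words and one indexed sistring per character of text; neither assumption is stated on the slide, and the function name is made up for illustration.

```python
def pat_tree_index_bytes(n, words_per_internal, bytes_per_word=4):
    """Estimated size of a naive PAT tree over n sistrings.

    Assumes n external nodes of 1 word each and n-1 internal nodes of
    words_per_internal words each (3 to 4 on the slide).
    """
    internal_words = (n - 1) * words_per_internal
    external_words = n * 1
    return (internal_words + external_words) * bytes_per_word

N = 1_000_000
LOW = pat_tree_index_bytes(N, 3) / N     # close to 16 bytes per character
HIGH = pat_tree_index_bytes(N, 4) / N    # close to 20 bytes per character
MID = pat_tree_index_bytes(N, 3.5) / N   # close to 18 bytes per character
```

Under these assumptions the index costs roughly 16 to 20 bytes per character of text, which is where a figure of about 18n comes from.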
PAT Arrays
- Sorted sistrings -> small array of IDs
- Simulate tree op: O(log n) accesses
- Fast prefix, range searches
- Naive build would require n log n disk accesses -- too long!
- Fast builds --> recall FAST-INV
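A PAT array and its prefix and range searches can be sketched in memory as follows. This is an assumption-laden illustration: the array is built with a plain in-memory sort (not the fast disk-based build mentioned above), positions are 1-based, and all function names are hypothetical.

```python
TEXT = "THIS IS A SAMPLE STRING"

def build_pat_array(text):
    """In-memory build: 1-based positions sorted by their sistrings."""
    return sorted(range(1, len(text) + 1), key=lambda p: text[p - 1:])

def _bound(text, array, key, strict):
    """First index whose sistring, cut to len(key), is >= key (> key if strict)."""
    lo, hi = 0, len(array)
    while lo < hi:                      # binary search: O(log n) probes
        mid = (lo + hi) // 2
        head = text[array[mid] - 1:][:len(key)]
        if head < key or (strict and head == key):
            lo = mid + 1
        else:
            hi = mid
    return lo

def prefix_search(text, array, prefix):
    """Positions whose sistring starts with prefix, in text order."""
    start = _bound(text, array, prefix, strict=False)
    end = _bound(text, array, prefix, strict=True)
    return sorted(array[start:end])

def range_search(text, array, low, high):
    """Positions whose sistring begins with a string between low and high, inclusive."""
    start = _bound(text, array, low, strict=False)
    end = _bound(text, array, high, strict=True)
    return sorted(array[start:end])
```

Both searches are two binary searches over the array, so each needs O(log n) probes into the text; for example, `prefix_search(TEXT, build_pat_array(TEXT), "IS")` finds positions 3 and 6.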
Summary
- Contrast on space, speed, and operations with:
- Inverted files
- Signature, clustered files
- Slow build: needs lots of RAM, disk
- Supports many special operations and a very general model