Inverted Files Unit Notes
IR Models & File Structures - 1
- Documents, structure (later units)
- Terms/Concepts/Authority Info.
- Dictionary, lexicon, thesaurus (U-KB)
- Access (U-IR): hash table, B-tree
- Document access (later units)
- Operations and structures: ...
IR Models & File Structures - 2
- Document access structures
- Graph: hypertext (U-HT)
- Tree: clustered files (U-CL)
- String: string search (U-SS)
- Sequential file
- Unordered: signature file
- Ordered (Ch 3, 5): ...
IR Models & File Structures - 3
- Document access operations:
- Sequential file - Ordered:
- Build IF, Lookup term (Ch 3)
- Manipulate term sets (Ch 12)
- Compute query-doc sim (Ch 15)
Boolean Set Processing / Venn Diagram
Boolean Set Processing / Inverted File Results
- An inverted file is so called because it views the term-document matrix by term rather than by document (the view a reader normally takes).
- Extending it to handle proximity takes extra space, since locations must be stored with each entry.
- Entries for a term might include (doc,wt) pairs plus:
- List of locations inside document as:
- Byte offset (or offsets for start/end); or
- Paragraph number, sentence number, word number; or
- Pointer into a structure tree (or pointers for a span):
- e.g., chapter no. / section no. / subsection no. / par. no.;
- e.g., reference no. / title part / subtitle field;
- e.g., dictionary headword / part of speech / sense / definition.
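The entry layout above can be sketched in a few lines. This is only an illustration under stated assumptions: locations are word numbers, the weight is taken to be raw term frequency, and the names `build_inverted_file`, `postings`, and `within` are hypothetical rather than taken from the course text.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map term -> {doc_id: [word positions]} (the by-term view of the matrix)."""
    inverted = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            inverted[word].setdefault(doc_id, []).append(position)
    return inverted


def postings(inverted, term):
    """Return (doc, wt, locations) entries; wt is assumed to be term frequency."""
    return [(doc_id, len(locs), locs)
            for doc_id, locs in sorted(inverted.get(term, {}).items())]


def within(inverted, term_a, term_b, distance):
    """Docs where term_a and term_b occur at most `distance` words apart."""
    hits = set()
    a_docs = inverted.get(term_a, {})
    b_docs = inverted.get(term_b, {})
    for doc_id in a_docs.keys() & b_docs.keys():
        if any(abs(pa - pb) <= distance
               for pa in a_docs[doc_id] for pb in b_docs[doc_id]):
            hits.add(doc_id)
    return hits


# Toy collection (made up for illustration).
docs = {
    1: "inverted file structures support boolean retrieval",
    2: "a signature file is an unordered sequential file",
    3: "boolean queries over an inverted file",
}
inverted = build_inverted_file(docs)
file_postings = postings(inverted, "file")
near = within(inverted, "inverted", "file", 1)
```

The position lists are the "extra space" that proximity costs: a plain Boolean file would need only the document numbers.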
Online Searching - 1
- Phases
- Clarify info. need / problem
- Identify access points: a (author), t (title), s (subject), ...
- Identify concepts, terms
- Develop, try, adapt search strategies
- Examine results, use feedback
Online Searching - 2
- Query organization
- String of pearls (DNF)
- List of required concepts (CNF)
- Concept organization building on:
- Elements: descriptors, phrases, words, stems/roots
- Relationships: synonym, cross-reference (xref), broader term (BT)
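The two query organizations can be compared with a small sketch. It assumes each term already has a set of document numbers; the `index` contents and the function names `eval_dnf` and `eval_cnf` are hypothetical, chosen only for illustration.

```python
def eval_dnf(index, clauses):
    """String of pearls: OR together groups whose terms must all match (AND)."""
    result = set()
    for clause in clauses:
        sets = [index.get(term, set()) for term in clause]
        if sets:  # an empty clause contributes nothing
            result |= set.intersection(*sets)
    return result


def eval_cnf(index, concepts):
    """Required concepts: AND together groups of alternative terms (OR)."""
    result = None
    for alternatives in concepts:
        matched = set()
        for term in alternatives:
            matched |= index.get(term, set())
        result = matched if result is None else result & matched
    return result if result is not None else set()


# Hypothetical term -> document-number sets.
index = {
    "car": {1, 2, 5},
    "automobile": {3, 5},
    "safety": {2, 3, 4},
    "crash": {1, 3},
}

# (car AND safety) OR (automobile AND crash)
dnf_hits = eval_dnf(index, [["car", "safety"], ["automobile", "crash"]])

# (car OR automobile) AND (safety OR crash)
cnf_hits = eval_cnf(index, [["car", "automobile"], ["safety", "crash"]])
```

In the CNF form, synonyms for one concept sit together in one OR group, so adding a synonym widens that concept without touching the others.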
Boolean Operations (Ch 12) - 1
- Set is for a term
- Set element represents a document
- Document representations (keys)
- Bit vector: doc # (e.g., 1 ... N = 256)
- Hash table: doc. name or id that is hashed into some bucket
- Iterating over a set is like applying a LISP map function to its elements
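A rough sketch of the two key representations follows. The `BitVectorSet` class is a hypothetical illustration, not the structure from Ch 12; for the hash-table case, Python's built-in `set` (itself hash-based) stands in, and `map` plays the role of the LISP map functions.

```python
class BitVectorSet:
    """Set of document numbers 1..domain_size, stored as bits of one integer."""

    def __init__(self, domain_size, members=()):
        self.domain_size = domain_size
        self.bits = 0
        for doc in members:
            self.add(doc)

    def add(self, doc):
        if not 1 <= doc <= self.domain_size:
            raise ValueError(f"doc # {doc} outside 1..{self.domain_size}")
        self.bits |= 1 << (doc - 1)

    def _combine(self, bits):
        result = BitVectorSet(self.domain_size)
        result.bits = bits
        return result

    def union(self, other):
        return self._combine(self.bits | other.bits)

    def intersection(self, other):
        return self._combine(self.bits & other.bits)

    def difference(self, other):
        return self._combine(self.bits & ~other.bits)

    def __iter__(self):
        # Yield member doc numbers in ascending order.
        return (doc for doc in range(1, self.domain_size + 1)
                if (self.bits >> (doc - 1)) & 1)


# Bit vector keyed by doc #, with N = 256 as in the example above.
N = 256
cat = BitVectorSet(N, [3, 17, 200])
dog = BitVectorSet(N, [17, 42, 200])
both = list(cat.intersection(dog))

# Hash table keyed by doc name or id (names here are made up).
cat_names = {"doc-a", "doc-c"}
dog_names = {"doc-c", "doc-d"}

# Iterating over the result set, map-style.
labels = list(map(str.upper, sorted(cat_names & dog_names)))
```

The bit vector costs N bits per term however few documents contain it, while the hash table grows with the number of elements actually present.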
Boolean Operations - 3
- Use of operations for IS&R
- Lookup term in IF, create set for it, enter elements (docs) in the set
- Operate (union, ...) on sets
- Iterate on result set for output
- Table 12.2 terminology
- Domain size = possible # docs
- No. elements = actual # docs
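The three steps and the two size terms can be tied together in a short sketch. The toy `inverted_file`, the query, and the `lookup` helper are assumptions made for illustration; the `density` figure is not in Table 12.2 terminology and is added here only to relate the two sizes.

```python
# Hypothetical inverted file: term -> list of doc numbers.
inverted_file = {
    "boolean": [1, 3, 6],
    "inverted": [1, 2, 3],
    "signature": [3, 5],
}
domain_size = 8  # possible # docs: doc numbers 1..8 in this toy collection


def lookup(term):
    """Look up a term in the IF, create a set for it, and enter its docs."""
    return set(inverted_file.get(term, []))


# Operate on sets: boolean AND inverted NOT signature.
result = (lookup("boolean") & lookup("inverted")) - lookup("signature")

# Iterate on the result set for output.
output = [f"doc {doc}" for doc in sorted(result)]

num_elements = len(result)            # actual # docs in the set
density = num_elements / domain_size  # fraction of the domain occupied
```

A low ratio of elements to domain size is the case where a bit vector wastes most of its space.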
Preview of Unit 4
- PAT 3.3 User's Guide
- copies in library
- Lectures
- Exercises, Lab: PAT, visualization