NOTE: credit given to Joyce, Muhlenberg, Ryan.
On August 28, 1995, we reviewed some IR research both for and against the 80-20 rule. We then went over the history of IR breakthroughs. After that, we got a functional as well as a topical view of IR science and technologies. We then reviewed Chapter 1's introduction to IR which included a domain analysis of IR systems. Finally we previewed Chapter 2's review of data structures and algorithms.
This was the first evening that we used the V-tel interactive audio/video conferencing system. After a few minor technical difficulties, we started our group debates of the Fox, Samuelson, and Dongarra articles. Each paper gave insight into a different aspect of the world of Digital Libraries. Our group, Group 4, absorbed two new class members, so we spent the first 15 minutes of the class getting acquainted and ensuring they understood the class format. We then began our debate.
Dr. Fox zoomed in on each group to provide comments and answer questions. Following the debates, he spoke at length on Information Retrieval. We discussed the 80-20 rule (80% of the inquiries are usually on 20% of the data) as it pertained to the precision/recall curve. We discussed the Blair and Maron article and the rebuttal in the Salton article.
1. The first 45 minutes were spent discussing debate topics in the DL module. Our group (1) was assigned to work on questions 1, 2, & 3. Went over the discussion topics. ...
2. Check the web server announcements and news section regularly for information on class happenings. Account requests for Video have been submitted; accounts on Fox and vtaix should work now. Add two 0's to your password if it was too short. A DEC Alpha will be set up in N.Va soon. Images of articles will be online @ HTTP://ei.cs.vt.edu/TR/search
3. Importance of IR systems
Finding relevant information is an essential human skill, and it is of great
value to society.
Blair and Maron Article
The IBM STAIRS text retrieval system was evaluated. The conclusions drawn in
the article didn't match the results of the experiment. The study reconfirmed
the 80/20 rule and exposed some of the weaknesses of IR systems: most people
give up before all the relevant articles have been found.
A controlled vocabulary restricts the words used to query a system. This speeds indexed searching.
Free text searching and controlled vocabulary have similar, if not identical,
response times. Automatic indexing also has similar response times. Stemming
involves taking a word and removing endings to get to the stem word. This is
a resolution enhancing device. (computing becomes compute)
eaf: Actually, it is "effectiveness" not "response
time" and stemming helps with "recall" not "resolution".
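To make the stemming idea concrete, here is a minimal suffix-stripping sketch in Python. This is illustrative only; the lecture did not specify an algorithm, and real stemmers such as Porter's use far more elaborate rules:

```python
# Toy suffix-stripping stemmer (illustrative; NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word: str) -> str:
    for suffix in SUFFIXES:
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("computing"))  # -> "comput"
print(stem("cats"))       # -> "cat"
```

Note that naive stripping yields "comput" rather than "compute"; what matters for recall is that "computing", "computed", etc. conflate to one index term.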
Boolean operators (AND, OR, NOT, ...)
Timeline
Hashing involves the encoding of a word to a token and then using the value of that token to do an indexed search.
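A hedged sketch of that idea, using a small hand-rolled hash table; the bucket count and chaining scheme are my choices, not from the lecture:

```python
# Tiny hashed index: a word hashes to a bucket; chaining handles collisions.
TABLE_SIZE = 8  # deliberately small so collisions are likely

table = [[] for _ in range(TABLE_SIZE)]

def bucket(word: str) -> int:
    return hash(word) % TABLE_SIZE

def insert(word: str, doc_id: int) -> None:
    table[bucket(word)].append((word, doc_id))

def lookup(word: str) -> list:
    # Compare stored words, since different words can share a bucket.
    return [d for w, d in table[bucket(word)] if w == word]

insert("retrieval", 1)
insert("hashing", 2)
print(lookup("retrieval"))  # [1]
```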
Term weighting assigns terms and equivalent terms a value that
represents closeness to the actual query. The search is done and the
weights are calculated based on the number of hits * the weighted value for
that word.
eaf: This is not quite right about term weighting and
closeness (similarity) - see discussion in later units.
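As a rough sketch only (per the note above, real similarity measures are more subtle), hit counts times per-term weights can be summed like this; the document and weights below are hypothetical:

```python
from collections import Counter

def score(doc_words, query_weights):
    # Sum of (hits for each query term) * (that term's weight).
    counts = Counter(doc_words)
    return sum(counts[term] * weight for term, weight in query_weights.items())

doc = "ice cream and more ice".split()
weights = {"ice": 2.0, "cream": 1.0}  # hypothetical per-term weights
print(score(doc, weights))  # 2 hits * 2.0 + 1 hit * 1.0 = 5.0
```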
A recall-enhancing device is a thesaurus.
Phrase matching enhances the precision of a query by using phrases such as 'ice cream' as opposed to 'ice' and 'cream'.
B trees.
Superimposed encoding - HyperCard-like; computing a number or a value based on the wording of a query. Similar to hashing.
Hashing has collisions.
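A sketch of the superimposed-coding idea (the sizes and hash choices are mine, not from the lecture): each word turns on a few bits, a document's signature is the OR of its words' bits, and a query word is checked against those bits. Matches can be false drops, but a word that is actually present is never missed:

```python
SIG_BITS = 64       # signature width (arbitrary choice for illustration)
BITS_PER_WORD = 3   # how many bits each word turns on

def word_signature(word: str) -> int:
    sig = 0
    for i in range(BITS_PER_WORD):
        sig |= 1 << (hash((word, i)) % SIG_BITS)
    return sig

def doc_signature(words) -> int:
    sig = 0
    for w in words:
        sig |= word_signature(w)  # superimpose: OR the word signatures together
    return sig

def may_contain(doc_sig: int, word: str) -> bool:
    ws = word_signature(word)
    return doc_sig & ws == ws  # all of the word's bits must be set

sig = doc_signature(["information", "storage", "retrieval"])
print(may_contain(sig, "retrieval"))  # True
```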
Tries; the Cranfield experiments and tests. Complicated systems are not much
better than the mechanical/manual systems used for IR queries.
eaf: Actually, the effectiveness of automatic methods
was found to be roughly the same as that of more complex and
human-labor-intensive approaches.
Other timeline information is found on the Web page.
Functional view of IR.
Foundations of IR. See Web page diagrams.
Read Chapters 1 & 2 of the book.
Comparison of information systems
AI systems are usually smaller than IR systems; they tend to be more detailed and work with smaller pieces of information.
DBMS systems tend to deal with categorical relationships such as tables. They usually deal with larger-scale systems where the indexes are known.
IR is the study of unstructured information. It deals with mathematical probabilities and other mathematical areas that the DBMS and AI systems do not.
An E-measure is a measurement of the effectiveness of an IR system. It is similar to a benchmark number.
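One common form of the E-measure (I'm assuming van Rijsbergen's formulation here, which the Frakes text follows; the lecture did not spell out the formula) combines precision and recall into a single number, with lower values meaning a more effective system:

```python
def e_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    # E = 1 - (1 + b^2) * P * R / (b^2 * P + R)
    # beta trades off recall vs. precision; lower E is better.
    return 1.0 - (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# A perfect system (P = R = 1) gives E = 0.
print(e_measure(1.0, 1.0))              # 0.0
print(round(e_measure(0.75, 0.20), 2))  # using the STAIRS study's figures
```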
Class stopped before the Salton article.
Professor Fox then discussed:
* news and announcements.
* the Monticello Electronic Project
We then began going over the IR overheads
A) Importance of understanding IR
B) Blair and Maron article
1. weaknesses of IR
2. problems in releasing critical results of IBM STAIRS project.
a) lawyers doing DB searches only had a 20%
successful hit rate.
3. Precision vs. recall relationship and graph
...the more data you pull, the lower the
signal/noise ratio
...the more you pull, the more "chaff you mix
with the wheat"
4. concept of controlled vocabulary
C) Salt86 article on use of:
a) Booleans
b) weighted words
c) auto index
D) Partial Timeline of Progress(major headings discussed)
1. superimposed coding
2. hashing
3. Cranfield experiments
4. Pat trees
5. B-trees
6. Lex
E) Functional View of IR
F) Topical Hierarchy of IR
1. technology
2. library & information
3. experimental computer
4. data structures
G) Textbook Chapter 1-Introduction to IR
1. Introduction
2. Domain Analysis
3. IR vs. Other Systems
4. Evaluation
H) Textbook Chapter 2-Introduction to Data Structures and
Algorithms (in-depth discussion postponed until
next week)
1. Basic concepts
2. Data Structures
3. Algorithms
The class broke up into discussion groups to debate the assigned topics on digital libraries. I think my group got sidetracked somewhat as far as the real point of each topic, but this generated a lot of discussion pro and con, which I assume was the purpose of the exercise. We then assigned ourselves writing and reviewing tasks to submit our discussion comments to you.
Finally, we got into the lecture on IS & R, with a discussion of the historical development of products and concepts. You also brought up some of the debates in the literature with regard to the effectiveness of these systems. This is a good topic for this class since, as computer science students, we tend to assume that automation is better and forget that there is still a considerable population that fails to see the benefits. This is also a point that we don't like to hear, since many of our jobs depend on the business of information automation.
Since my group had another new person besides myself, we spent most of the discussion period organizing our group, getting articles, and trying to come to grips with the course expectations. We will debate our Digital Library questions more next Monday, and are preparing by writing ideas for each topic. We will debate these and then perform group revisions via e-mail.
To summarize the lecture, key data structures (B-trees, PAT trees, hashing, tries) for IR systems were reviewed. Since I do not have an undergrad degree in Computer Science, my familiarity with data structures is limited to one required class for my current Info. Systems program at VA Tech. Therefore, I will review that class material and forward any questions that arise. I am also working on better understanding the discussion on indexing and the Salton article, having not reviewed the materials before class.
For the first part of the class we discussed the debate topics and assigned roles for completing the group project.
For the remainder of the class, Prof Fox reviewed class notes on an overview of topics related to Information Retrieval. The following topics were discussed:
Finally, we reviewed the following topics in Chapter One of the textbook:
Chapter 2 was left for us to read on our own.
Major events: turn in opscan for the pretest, group discussion, and lecture.
Group Discussion. (I'm in Group. We had questions 1, 5, and 8). The session began with a quick agreement on how to structure the discussion (about 10 minutes, maybe a little more, for each question, with the remaining time for wrap-up). In general, there was agreement on our agree/disagree positions on the three questions. The discussion was useful in that different points of view were presented to support those positions. A write-up of our consolidated points of view is being submitted separately.
...
Lecture. The lecture began with a preface, covered an article by Salton that responded to an earlier article by Blair and Maron, briefly presented a timeline of IS&R milestones and a functional view of IR, and covered chapter 1 of the text.
The preface for the lecture involved a discussion of a graph (the same graph as in the text, page 11) that illustrates two major IR concepts: recall and precision. Recall is how much you get; precision is how useful the retrieved documents are. The graph shows that there is an inverse relationship between precision and recall: as you retrieve more things, the precision falls off. Get too much and people give up because they can't find the relevant data.
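For a concrete reading of those two terms, precision and recall can be computed from a retrieved set and a relevant set (the document IDs below are made up):

```python
def precision_recall(retrieved: set, relevant: set):
    hits = retrieved & relevant             # relevant documents we actually found
    precision = len(hits) / len(retrieved)  # fraction of results that are useful
    recall = len(hits) / len(relevant)      # fraction of relevant items we got
    return precision, recall

p, r = precision_recall({1, 2, 3, 4}, {3, 4, 5, 6, 7})
print(p, r)  # 0.5 0.4
```

Pulling in more documents typically raises recall while dragging precision down, which is exactly the inverse relationship the graph shows.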
Article by Salton: The next part of the lecture focused on the claims by Blair and Maron and the responses by Salton.
Blair and Maron
- studied a library of legal documents
- STAIRS
- 0.75 precision, 0.20 recall - high precision but very
unspectacular recall
conclusions that cause the stir:
- broadening the search request results in output overload
- When high-recall is desired, manual indexing is preferable
to full-text searching
- It's hard to achieve adequate performance with full-text
systems, such as STAIRS, because the user interface is not
particularly user friendly.
- free text (user selects the text strings to search for) vs.
controlled vocab (controlled vocab is a list of key words
selected by an indexer)
- manual indexing (high skill level required, art form) and
automatic indexing (needed automatic indexing for large
collections)
- scale-up of lab results
The article had a lot more to offer than covered in the lecture. In fact, I found the "meat" of the article to be its coverage of information retrieval concepts. For example, it introduced recall and precision, truncation, synonyms, and weights. One of the nice things about the article is that besides introducing these concepts, it presented these concepts within a framework of recall and precision enhancing devices. Additional observations supplied in group write-up.
Timeline of progress
- Work began at the end of WW II
- The key works came during the 50s and 60s - hashing, weights, tries, stemming
- Refinements continue.
Functional view
- interesting model of components and relationships
- elements of the model include operations, documents, knowledge base, and queries
- working with documents and problem descriptions is a difficult problem; to make the problem tractable, work with surrogates.
Chapter 1 of Text
- Article by Frakes (now works at VA Tech). Introduces IR concepts and the relationship of IR systems to other information systems.
- The stuff of IR - documents, operations, and queries (similar to the model above)
- Domain analysis of IR (a software reuse connection): the major components and relationships. The conceptual model establishes the general approach; the rest follows.
- IR vs. other systems - see table on page 9; good summary, interesting distinctions. Fox mentions structured vs. unstructured document types.
- Precision and recall introduced - just like in the Salton article.
- Frakes introduces E for system evaluation - a single-value measure of recall and precision.
It's 9:10, class poops out on chapter 2, me too
Definition: Subfield of computer science that deals with the automated storage and retrieval of documents.
All information retrieval systems have a database structure such as inverted files, flat files, signature files, PAT trees, graphs, or hashing. They allow a user to add, delete, and change documents in the database, and to use queries to search for and retrieve documents.
There are 3 operations mentioned in the book:
In the design of IR systems, hardware is very important because it affects the speed as well as determining the amounts and types of information stored in the system.
As mentioned in class, here is an example of how IR works: one has to build a database, which involves breaking the text in the document into words. The words are compared against a stoplist, which eliminates common words, and go through the process of stemming (a recall-enhancing device). The remaining words are counted and weighted. Then all the information is stored in the database.
To search the database, the user enters a query, which is parsed. According to the specified Boolean operators, terms are looked up in the inverted file described in the previous paragraph. The retrieved document set is then ranked and presented to the user. In some systems, the user can make relevance judgements and use that information to modify the query.
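The steps in the two paragraphs above can be sketched end to end; the documents and stoplist below are invented, and stemming and weighting are omitted for brevity:

```python
from collections import defaultdict

STOPLIST = {"the", "a", "and", "of"}  # tiny illustrative stoplist

docs = {
    1: "the cat and the hat",
    2: "the cat sat",
    3: "a hat of straw",
}

# Build the inverted file: term -> set of document ids.
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in STOPLIST:
            inverted[word].add(doc_id)

def query_and(*terms):
    # Boolean AND: intersect the posting sets of all query terms.
    result = inverted.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result

print(sorted(query_and("cat", "hat")))  # [1]
```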
To evaluate retrieval effectiveness, one can use a recall-precision graph. The higher the precision, the lower the recall. If there is 100% recall, that means one has found everything; but there is a question of how many of the retrieved items are what one is looking for.
The automatic indexing steps are tokenization, stoplist removal, and weighting, which help retrieve information effectively in terms of both precision and time.
Held group discussions on the required articles. Prepared answers to the required questions. The Blair-Maron paper on information retrieval was discussed. Information retrieval follows the 80-20 rule in that 80 percent of the material is found with 20 percent of the effort, and the remaining 20 percent is found with 80 percent of the effort. The precision/recall graph of information retrieval was discussed. The graph shows that the greater the precision you are trying to achieve, the fewer things you will find with that criterion. To find more things that may be relevant, the precision drops. Discussed the following terms:
* free text searches - look for normal words.
* controlled vocabulary - only certain words are allowed.
* automatic indexing - keywords in the article are determined by an algorithm.
* manual indexing - keywords in the article are determined by an expert on the subject.
* stemming - removing prefixes and suffixes to get the base keywords. A recall-enhancing function.
* weighting - assigns more merit to some keywords than others.
* tokenize - find keywords automatically, removing common words and stemming words.
* E-measure - a single measurement of the effectiveness of an information storage and retrieval system.
Discussed a timeline of information storage and retrieval. Discussed the functional view of information storage and retrieval and the foundations of it. Generally discussed B-trees, hashing, fast string matching, and PAT arrays.