Thanks to all who have asked questions or made comments!
(I'm not thanking you individually below too, only to save space.)
1. Given the volume of information stored in traditional libraries,
can we ever hope to convert all of it into a digital form?
eaf:
For the foreseeable future we will work with a mixture of digital and
other forms in our libraries. Clearly, with enough money, we could
decide to get rid of paper, or at least to capture all of what is on
it, though archivists often want to keep the originals. Since
costs are high, people only convert the most important works. One
important piece of advice: develop methods to convert new works
now so we don't have to pay to convert them later too.
2. What role is the Library of Congress playing in digital
libraries?
eaf:
Staff there have been involved in many of the digital library efforts.
They have an effort to have at least 5M works in digital form by the year
2000. They are active in efforts to use handles, have advised IBM
on their DL systems, and have attended most of the DL conferences.
They are a sponsor of ACM DL'96.
3. Will the librarian of the future need to have a strong
background in computer science?
eaf:
Many computer scientists will work on digital libraries.
Many library schools require training in computing for librarians.
Library schools not active in new approaches are, in some places,
being shut down (e.g., Columbia University).
This class seemed a bit chaotic, and I had a less than good feeling about how the evening went. I believe things will go better as time goes on. It was, to say the least, a very interesting evening.
Question:
I am not able to access any course assignments except for the Pre-Test and
the DL assignment page. I get the following message for all other
assignment pages (I'm using the IR page as the example):
"The requested URL/~cs5604/f95/U-IR/U-IR.html not found on this server."
eaf:
The first several assignments are available now. If you wish to work ahead,
note that I have not yet revised later assignments from their version
from last year, which is generally harder than what I will require this
year. Nevertheless,
if you ever want to see an assignment in the version from last year,
simply go to http://ei.cs.vt.edu/~cs5604/Assignments.html and pick
the one of interest. Note that there may be slight changes from that
form to simplify the work this year.
I need some clarification on the netlib assignment:
1. Is there a way to get more syntax information on netlib commands
besides the article on electronic mail.
2. I would like to see /u1/README/netlib on fox.cs.vt.edu. Do you have
this on the home page or is it only in that UNIX directory?
eaf: See the current version of the DL assignment at
http://ei.cs.vt.edu/~cs5604/f95/U-DL/U-DL.html
which gives a URL for
the WWW version of netlib and also a WWW location for the
README file mentioned. You can use the WWW version instead of
email or xnetlib.
1. On the Pre-Test, how do you want us to answer questions that we are
not able to answer? Ex. #94 - Is the KMS system better than Mosaic and
WWW for accessing course notes? I have not used KMS or Mosaic or the WWW
before this class, so I do not feel that I can honestly answer this
question.
eaf: Just do the best you can. Ignore questions you cannot answer.
2. How much UNIX do we need to know? I am an Information Systems major
and I have no experience with UNIX. I have a feeling that I am not the
only one in the class that does not know UNIX.
eaf: There is not very much you need to do with UNIX. I
have summaries of info you need under the pages about computer
tools at http://ei.cs.vt.edu/~cs5604/CompTools.html
Assignments give specific instructions when you need
to do something. If you are concerned, join a group with a
person who knows UNIX. You do not need to know UNIX for
quizzes or the final.
3. Is there any way we can get more
information on the topics that we are
going to cover? To me, the course notes are summaries and I can not grasp
the subject by reviewing these notes.
eaf: As I mentioned and have written, the lectures are optional.
All that you need to learn is spelled out in the assignments.
They refer to the textbook and the CACM readings. If you are
unclear, ask me for assistance.
One question I do have at this time is concerning the Digital Library notes
on the WWW. There were pages on clustering and SS. Was this covered in
class, or will these be covered in future CL and SS units? I wasn't able
to follow the notes very well, but can hold off on those questions until
those units if possible.
eaf:
We will work on clustering and string searching in later units.
You are not responsible for that in the DL unit.
Was the Web server on ei changed? Starting this morning (9/1), I
get a "trailer" on each page returned by ei.cs.vt.edu For example,
when accessing the CS5604, Fall 1995, Table of Contents page, I now
get the page and the following trailer:
gatekeeper.mitre.org - - [01/Sep/1995:08:20:48 -0400] "GET /~cs5604/
HTTP/1.0" 200 1647
eaf:
We rebooted ei and made some changes in the WWW server that led to the
situation you noticed (see copy of your msg below). We went back to an old
version of the server as a result. The trailer info you noticed is not
harmful, and will be eliminated when we find the configuration option that
controls it.
I just wanted to let you know that the group discussion in the video
conference format was quite difficult. I was sitting at a table in the
front and I found it difficult to hear the group members over the
conversations you were having with the other students.
eaf:
I'm sorry. No one predicted that. Now the studio people and I
know better. In the future if we do something like that we will
move people who want to ask me questions up to the front.
In closing, I have some general comments, having reviewed the course format
and syllabus. I appreciate the goals of the Keller Plan, and certainly
think the method is well-suited for my own learning style. I appreciate
the value of continually working with the material in order to gain
understanding, but I would like to make a point about procrastination. The
course load presented in the material is quite intensive, and
procrastination is not the only threat to accomplishing the course goals -
time is also a factor. I have come home from work every day this week and
worked on this course for several hours, but am just beginning to make a
dent in the workload for this week. I am not suggesting that you lighten
the workload, but just want to make you aware that this is a very ambitious
course to take while working full time. Also, it is harder for our groups
to meet than for your students at Blacksburg, since we are geographically
dispersed. Therefore, our group will meet in conjunction with class and
via e-mail.
eaf:
Good idea to meet in person in connection with class, as a group.
I'll work to reduce the overhead for those like you who are taking
this after a full-time job, and am open to suggestions to help
further.
I would like to see the composition of the groups
changed each week. Mike Dunn used a similar group discussion
format in CS6704 in the Fall of 1995. He, however, changed the
composition of the groups each week. Although there was some
uneasiness and commotion at the outset, as the class became more
familiar with the course material and the discussion process,
the comfort level with rotating groups seemed to increase. In
my own case, I benefited by having the opportunity to be exposed
to different points of view and the different experiences that
my classmates brought to the discussions.
eaf:
Please remind me at the start of each class and we will switch if
people wish to.
Communication through VTEL is great. Being able to communicate directly
and seeing the operation of the computer system makes us feel like you're
not teaching us from Blacksburg. I have noticed, though, that
there must be a delay in responses because of the long distance.
eaf:
I am glad you like this!
The delays are probably due to compression and decompression
rather than electrical connections, unless they route the signal
in a roundabout way, or there are circuit delays too.
You haven't indicated what format you are willing to accept information in....
It's my impression that things need to be a little clearer in terms of what is expected of us on a weekly basis (especially writing assignments and the like). Maybe a summary on each module web page.
I speculate that you found it hard to get everyone's attention from
Blacksburg. One of the problems of a televideo course.
eaf:
Please send information to me in ASCII as a text only file.
I'll gradually add in more hints, but the Assignments are pretty
thorough for each unit, and the Syllabus gives overall info.
Having the lights out, for the LCD, makes it harder than I expected
for me to see people --- please let me know if you need a break or
have other ideas to keep people awake.
I would like to make what I hope is a constructive comment
about the lectures. As you noted, towards the end of class
people are starting to fade from the long day. Rather than
discuss the reading assignments directly, I think we may
benefit more from a higher level discussion of the topic du
jour. Speaking for myself, I find that when tired I can enjoy
and absorb a relaxed conceptual discussion, whereas I can't
digest details. The details of the topic I can get while
doing the reading in a more comfortable environment.
eaf:
In most units I don't do too much that is in the readings,
unless I have observed that people seem to have trouble understanding
what is covered there and believe I can give more or better examples.
By the way, if you ask questions I'll say more, and less formally,
so you can help set the tone too.
Turned in the opscan forms for the pretest. Answering 120
questions takes some time. Many of the questions suppose a
fairly high-level use of or experience level with Web
technology. For example, questions
50-61, "Survey of WWW Usage,"
and several questions in the 62-79 block, assume usage of Web
technology. This could be discouraging or intimidating to a
student not familiar with the technology.
eaf:
There are many interested in how well these "paperless" courses
work and so we are trying to do a careful evaluation. I'm
grateful for your time with the pre- and post-tests. My hope
is that you will know the answers to all the questions and so
the post-test will show significant gains. If you have other
suggestions regarding evaluation, please let me know.
For the purposes of the unit IR exercise, do conference proceedings
and "congress" proceedings count as "books"? We're trying to
determine what material should be included as results of our
searching?
eaf:
Yes, please do include these.
A natural-language text-search system is a computer system for managing collections of text objects or documents so users can locate ones of interest through a search process wherein the words of the user query can be any word in the natural language selected (in our case, English). Usually, the query is a sentence or paragraph or even a document, and statistical matching is done based on it. Usually the indexing language is the same as the natural language.
Controlled-language indexing involves describing documents in a collection only with words from a limited / restricted (i.e., controlled) vocabulary. A term or descriptor list or classification system or thesaurus usually records all of the allowable index terms, which are usually assigned by an expert human indexer. The indexing language is thus a restricted one.
Subject: cs5604b Implementation of Boolean op. in C++
By looking at the examples in Ch.12, I really feel that
using a language like C++ (or another OOL, for that matter)
would be a much more natural approach to implement
set operations. Do you know of any good C++ book
that addresses IS&R algorithms? Now, I'm kinda curious...
eaf:
Good observation. The chapter is organized to show
that abstract data types are involved, and object-oriented
(OO) programming is the next step. The MARIAN system developed
at Virginia Tech is implemented using C++. The first OO IR
system I heard about was RUBRIC, the precursor of Verity's
TOPIC system. I'm not aware of any IR texts with C++ code
in them, but would like to hear about them if you find one.
The amount of network traffic among my five team members is intense. I thought about a way to decrease the amount of E-Mail needed to stay in communication and teach a valuable IS&R lesson. I was wondering if it would be possible to give us write access to a designated WEB page to use as an electronic slate.
At first the slate would be blank, and Teammember #1 would place the assignment outline on the WEB page and sign their initials. Then Teammember #2 would turn the outline into bullet themes. The third and fourth people would flesh out the composition. The final person would review, make final edits, and then sign off when the task was complete.
Advantages:
I know I'm 'preaching to the converted' to list the advantages of this
form of electronic medium but I thought you might want to consider this
idea, perhaps for next term's class.
eaf:
One of the reasons for your having accounts (I'll assume yours
is ryan for the sake of discussion below) on video.cs.vt.edu
is so that you can put things on the WWW directly. So, please
use the UNIX command "mkdir ~/public_html" to make a directory,
and make it accessible with "chmod o+rx ~/public_html". Then
you can place WWW-accessible files in that directory: after
"cd ~/public_html", create files like "test.html" in
HTML form or just plain ASCII text files like "test.txt". You
can tell your group members and me to look at them with URLs:
http://video.cs.vt.edu:90/~ryan/test.html
http://video.cs.vt.edu:90/~ryan/test.txt
and then use UNIX group permissions (contact me to setup a group if you want) to allow group members to edit.
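Collected into one session, the commands above might look like this (the username ryan and the file contents are just for illustration, per the assumption above):

```shell
# Create the web-visible directory and let the server (others) read and enter it.
mkdir -p ~/public_html
chmod o+rx ~/public_html

# Put a plain-text draft there and make it world-readable.
cd ~/public_html
echo "Draft outline for the unit exercise." > test.txt
chmod o+r test.txt

# The file should then be visible at:
#   http://video.cs.vt.edu:90/~ryan/test.txt
```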
I'd be happy if you create files on the WWW and just give me the URL for them by mail, to reduce traffic in this way.
Finally, we are working on other automation methods for later this semester. Stay tuned!
Questions, Sept. 5, 1995
1. Can every concept be searched by describing it using one of these
normal forms? Isn't it restrictive?
eaf:
Mathematically we can prove that any Boolean expression
can be put in CNF or DNF. Also, if we assume that every document
has a unique representation, then we can construct a perfect Boolean
query that can retrieve precisely those documents that are relevant
(by construction, if we know which ones are relevant). Regarding
concepts, however, it is unclear how a semantic matter like that can
be mapped to something in Boolean logic. In part that is why we
move to richer representations, with weighting, soft operators, etc.
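As a concrete (if brute-force) illustration of that proof idea, a DNF can always be read off a Boolean function's truth table, one conjunct per satisfying row. The helper name below is hypothetical, just a sketch:

```python
from itertools import product

def truth_table_dnf(f, names):
    """Build a DNF for Boolean function f over variables `names`:
    one conjunct for each satisfying row of the truth table."""
    terms = []
    for values in product([False, True], repeat=len(names)):
        if f(*values):
            lits = [n if v else f"not {n}" for n, v in zip(names, values)]
            terms.append("(" + " and ".join(lits) + ")")
    return " or ".join(terms) if terms else "False"

# (a or b) and c, rewritten in disjunctive normal form
print(truth_table_dnf(lambda a, b, c: (a or b) and c, ["a", "b", "c"]))
```

The resulting expression is usually far from minimal, but it shows by construction that every Boolean function has a DNF.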
2. Will state-of-the-art NLP techniques be of any use in such systems?
eaf:
For over 20 years people have tried to apply NLP methods to IR.
There may be some progress in recent years, but only in restricted
domains. While there is great hope in the long term, present results
are often inconclusive.
I have some questions about the computer exercise of unit IR. For
the books that deal with CD-ROM, do you mean books about
CD-ROM technology or books that are stored on CD-ROM? Also, when
I use VTLS, under the keyword "CD-ROM", there are about 138 items, does
this number sound correct?
eaf:
Please try to only get books about CD-ROM, not CD-ROMs that are
cataloged in the library. Yes, there are over a hundred entries.
Is it possible for you to make papers available in pdf format for
downloading? What do you think about this idea?
eaf:
I prefer now to use HTML since it is simpler and more flexible.
N. D. Barnette is trying out use of PDF for his courses.
The grad school is accepting PDF for theses and dissertations.
I find it valuable but am waiting for them to support other UNIX
systems and have better font handling. From my DECstation it is
hard to create PDF, or even view it, and the LaTeX fonts are not
usable without extra work that has not yet been done in CS.
I have a couple of questions about PAT trees.
1. How are the sistrings at the right-most end of the text represented in the tree?
The example presented in the textbook (pg. 69) and used in the course notes shows a PAT tree for the first eight sistrings of the text. None of these sistrings "runs into the end" of the text used in the example. For instance, what happens when the three right-most sistrings - sistring 12 (i.e., "111"), sistring 13 (i.e., "11") and sistring 14 (i.e., "1") - are processed?
Are the "special null characters" mentioned in paragraph 2 on page 68
factors in this processing?
eaf:
Yes, nulls (I'll use "0") are added, so "12" is really "11100000..."
and "13" is "1100000..." and "14" is "1000000...". We simply extend
the tree to add each of these in, into the correct place, creating more
internal nodes as needed.
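A few lines of Python make the null padding concrete (a sketch; the 14-bit text and the 8-bit sistring length are chosen to match the question's sistrings 12-14):

```python
def sistrings(bits, length=8):
    """Sistrings of a bit string, padded with '0' (the null character)
    so that sistrings starting near the end still have `length` bits."""
    padded = bits + "0" * length
    return {i + 1: padded[i:i + length] for i in range(len(bits))}

text = "01100100010111"     # 14-bit text ending in "111"
s = sistrings(text)
print(s[12], s[13], s[14])  # 11100000 11000000 10000000
```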
2. How are character strings represented in a PAT tree?
The example presented in the textbook (pg. 69-70) and used in the course
notes introduces sistrings with a discussion of a character string then
jumps into a discussion of the navigation of a PAT tree in terms of zeros
(go left) and ones (go right). I'm not clear on the connection. How are
character strings represented in a PAT tree? Is the missing link that the
sistrings in the PAT tree are the binary representations of the characters
(e.g., "A" = x41 = b01000001)? I was somewhat confortable with this
interpretation until it seemed to crumble in face of the discussion in
paragraph 1 on page 74. The discussion on page 74 speaks about the size of
the PAT tree as it relates to text characters, not text bits.
eaf:
Your first thought was correct. PAT trees are Patricia trees
which are binary trees and the entries are based on binary representations
of the data. Don't worry about the p. 74 comments --- it just uses
characters as equivalent to bytes to get a size.
I wanted to attempt the next quiz this weekend but I understand you don't
want us progressing until we have mastered the previous section according
to the Keller methodology.
eaf:
Yes, it is best to go in sequence, but you can overlap; that is,
proceed once you have submitted work. You may have to go back to
answer questions, but that can help reinforce learning.
questions about the Extended Boolean models in Section 15
1. A question about the relationship between the number of index terms associated with a document and the number of terms in the MMM query-document similarity functions.
The discussion of the MMM model on page 396 defines a document D as having n terms: A1, A2, A3, ..., An. These terms have the corresponding weights w1, w2, w3, ..., wn.
In the middle of page 396, query-document similarity functions are defined for "and" and "or" operations.
The "or" function is used to illustrate the question. The SIM(or) function is defined as:
SIM (Qor, D) = (A1 or A2 or A3 or .... Am)
Notice I have changed the subscript. The text uses n, I've used m.
In the discussion on page 396, m = n. Does m have to equal n? If not,
what happens if m doesn't equal n?
eaf: Clearly, queries will have fewer terms than the number of terms
in the document collection, i.e., m<n.
On the other hand, when processing a Boolean
query, the only terms that matter are those in the query, so we need
only consider the truncated document vector which shows just the terms
in the document that are in the query. In that case n=m.
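For instance, the MMM (Mixed Min and Max) similarity functions over that truncated weight vector can be sketched as follows (the softness coefficients are illustrative, not values from the text):

```python
def sim_or(weights, c_or1=0.9, c_or2=0.1):
    """MMM 'or': a softened max over the document's weights for the
    query terms; c_or1 + c_or2 is typically 1 (values illustrative)."""
    return c_or1 * max(weights) + c_or2 * min(weights)

def sim_and(weights, c_and1=0.9, c_and2=0.1):
    """MMM 'and': a softened min over the same weights."""
    return c_and1 * min(weights) + c_and2 * max(weights)

# weights of the query's m terms in one document (the truncated vector)
w = [0.8, 0.3, 0.5]
print(sim_or(w))   # approximately 0.75
print(sim_and(w))  # approximately 0.35
```

Setting c_or1 = c_and1 = 1 recovers strict Boolean max/min behavior, which is what makes the model "extended" Boolean.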
Is the unstated idea that the query-document similarity functions only
operate on those terms that satisfy the query? For example, a document
might have n index terms, but only m of those index terms might
satisfy a specific query (m <= n). If so, then the notation in the book
seems misleading.
eaf: Thanks for clarifying matters.
2. A clarification about how the query-document similarity values are used
in the document retrieval process. The discussion on page 396 defines the
query-document similarity functions. In an actual retrieval, would an
appropriate query-document similarity value be computed for each document
satisfying the retrieval criteria and used to rank the documents retrieved?
This interpretation is suggested by the discussion on page 402 at the
opening of section 15.3.2; however, the discussion on page 396 is vague.
eaf: Technically, we need only specify the function and not worry about
the implementation, but I want you
to understand implementation issues. Yes, what you suggest is one possible
approach, though I'm not sure what you mean about satisfying the retrieval
criteria. One method is to have a similarity computation carried out for
any document that has any of the query terms. Another is to undertake a
Boolean search, rank the documents retrieved by carrying out similarity
computations, and then repeat the process with "softened" versions of the
query. If many documents have one or more query terms, one may try
heuristics like sorting the query terms on weight (e.g., IDF) and
processing documents with the most highly weighted terms first.
3. I'm confused about the references to fuzzy-set theory in the discussion
of the Extended Boolean models in section 15.2. The discussion of the MMM
and Paice Models repeatedly mentions fuzzy-set theory. For example, the
first paragraph in section 15.2.2 states "there is a fuzzy set associated
with each index term." However, the description of the query-document
similarity functions in paragraph 2 on page 396 and the description of the
implementation of the document weights at the top of page 399 suggest that
rather than there being "a fuzzy set associated with each index term," each
index term only has a weight, not a fuzzy set.
eaf: In the general case of retrieval, we refer to "document weights".
When one uses an extended Boolean approach, all document weights must be
fuzzy set membership values. Only with such values, in [0,1], can the
various formulas make sense and be applied. So, both expressions are
allowed.
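A tiny sketch of the fuzzy-set reading: each document weight is a membership value in [0, 1], and the classical fuzzy connectives are just min and max:

```python
# Each document's weight for a term is its membership value in that
# term's fuzzy set, so it must lie in [0, 1].
def fuzzy_and(*memberships):
    """Classical fuzzy conjunction: the minimum membership."""
    return min(memberships)

def fuzzy_or(*memberships):
    """Classical fuzzy disjunction: the maximum membership."""
    return max(memberships)

# document memberships for terms t1 and t2
print(fuzzy_and(0.7, 0.4))  # 0.4
print(fuzzy_or(0.7, 0.4))   # 0.7
```

The extended Boolean models in this chapter soften these strict min/max operators, but the [0, 1] membership values are the common starting point.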
Question: How can one construct a PAT tree for a small string like
"This is a string" ? An example would be very helpful..
eaf:
See p. 69 for an example. The steps are: convert the text to bits, form the (null-padded) sistrings, and insert each sistring into the tree one at a time, adding internal nodes where a new sistring first differs from those already present.
telnet hot tip and thanks
Martha, the TA, had a very useful tip for me concerning the telnet system from Northern Virginia, which may help other students in class. After you issue the c vttelnet command and it returns a COMPLETED, you must type in a period and hit return. (You won't see the period, but somehow this forces the system to respond instead of timing out like it did.)
Thanks for responding so soon about PAT. I will give it a try
later. Just signed on with a new reputable internet provider
that should have a more stable e-mail system. I will send you
my new email address when I get the software configured. (Full
unlimited time PPP account and Unix shell account for only $15
a month!)
eaf: Great news! I hope this works out well and that
others can benefit too. Thanks for sharing all this.
The PAT command now works. But
what format is the manual? When I do a man manual I get the
following (see attached). What program should I be using to
display the manual?
eaf:
The manual is now accessible at http://ei.cs.vt.edu/~cs5604/lib/manual.sgml
for those who do not want to logon to fox.cs.vt.edu; I hope this helps.
To understand it, see the hints below. We are working on instructions
for using Panorama on PCs to be able to view it more nicely,
in addition to the
LectorMotif discussed in the assignment that works under X.
If you want to try Panorama, there is a free copy for MSWindows
which can be downloaded from
http://www.ncsa.uiuc.edu/SDG/Software/WinMosaic/Viewers/panorama.htm
The size of the executable file is 1,004,785 bytes. It will run on any
machine that runs Netscape or Mosaic (386s and higher).
Hints regarding the manual:
A suggestion - I know you see this as an on-line course but myself and
everybody I know are printing out the lecture notes anyway. Could you
put up a document with all notes for each unit together so that it would
be more efficient to print out?
eaf:
I presume you are talking about the Course Notes.
If you can figure out a good way to linearize a tree of nodes in the WWW
and print out the entire tree as a compact document, you will save many
people lots of paper. I don't know of any utility to do this. There
is a "linear" command with print options with KMS but nothing similar
seems to exist for WWW. If you find something I'll be happy to use it!
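As a sketch of what such a utility might look like, a depth-first walk over the link tree could concatenate pages in reading order. This toy version works over an in-memory "site" rather than HTTP, and all the names are hypothetical:

```python
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    """Collect href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def linearize(pages, start):
    """Depth-first walk over a tree of pages, emitting each page once.
    `pages` maps URL -> HTML; a real tool would fetch over HTTP instead."""
    seen, order = set(), []
    def visit(url):
        if url in seen or url not in pages:
            return
        seen.add(url)
        order.append(url)
        parser = LinkParser()
        parser.feed(pages[url])
        for link in parser.links:
            visit(link)
    visit(start)
    return "\n".join(pages[u] for u in order)

pages = {
    "toc.html": '<html><a href="unit1.html">1</a> <a href="unit2.html">2</a></html>',
    "unit1.html": "<html>Unit 1 notes</html>",
    "unit2.html": "<html>Unit 2 notes</html>",
}
print(linearize(pages, "toc.html"))
```

The hard parts a real tool would face, cycles aside, are deciding where the "tree" ends and stripping navigation links, which is why a general solution is elusive.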
The comp.theory.info-retrieval news group recently contained an article announcing a fast text searching facility. Perhaps this is something applicable to the class. I've included an edited version of the readme file.
agrep - a tool for fast text searching allowing errors.
agrep is similar to egrep (or grep or fgrep), but it is much more general (and usually faster). Version 2.4 of agrep 1) incorporates Boyer-Moore type filtering to speed up searching considerably, 2) allows multiple patterns via the -f option (this is similar to fgrep, but from our experience agrep is much faster), 3) searches for the "best match" without having to specify the number of errors allowed, and 4) no longer requires ASCII input. Several more options were added.
The three most significant features of agrep that are not supported by the grep family are
1) the ability to search for approximate patterns; for example, "agrep -2 homogenos foo" will find homogeneous as well as any other word that can be obtained from homogenos with at most 2 substitutions, insertions, or deletions. "agrep -B homogenos foo" will generate a message of the form "best match has 2 errors, there are 5 matches, output them? (y/n)".
2) agrep is record oriented rather than just line oriented; a record is by default a line, but it can be user defined; for example,
"agrep -d '^From ' 'pizza' mbox"outputs all mail messages that contain the keyword "pizza". Another example:
"agrep -d '$$' pattern foo"will output all paragraphs (separated by an empty line) that contain pattern.
3) multiple patterns with AND (or OR) logic queries. For example,
"agrep -d '^From ' 'burger,pizza' mbox"outputs all mail messages containing at least one of the two keywords (, stands for OR).
"agrep -d '^From ' 'good;pizza' mbox"outputs all mail messages containing both keywords.
Putting these options together one can ask queries like
agrep -d '$$' -2 '<CACM>;TheAuthor;Curriculum;<198[5-9]>' bib
which outputs all paragraphs referencing articles in CACM between 1985 and 1989 by TheAuthor dealing with curriculum. Two errors are allowed, but they cannot be in either CACM or the year (the <> brackets forbid errors in the pattern between them).
Other features include searching for regular expressions (with or without errors), unlimited wild cards, limiting the errors to only insertions or only substitutions or any combination, allowing each deletion, for example, to be counted as, say, 2 substitutions or 3 insertions, restricting parts of the query to be exact and parts to be approximate, and many more.
agrep is available by anonymous ftp from cs.arizona.edu (192.12.69.5) as agrep/agrep-2.04.tar.Z (uncompressed form as agrep/agrep-2.04.tar). The tar file contains the source code (in C), man pages (agrep.1), and two additional files, agrep.algorithms and agrep.chronicle, giving more information. To compile, simply run make in the agrep directory after untar'ing the tar file (tar -xf agrep-2.04.tar will do it).
The agrep directory also includes two postscript files: agrep.ps.1 is a technical report from June 1991 describing the design and implementation of agrep; agrep.ps.2 is a copy of the paper as appeared in the 1992 Winter USENIX conference.
Please mail bug reports (or any other comments)
to sw@cs.arizona.edu or to udi@cs.arizona.edu.
eaf:
Thanks, this is relevant to the SS unit. It is used in the
Glimpse
system, available on the WWW.
How can formula 4 of the probabilistic model be used outside a relevance
feedback context?
eaf:
Here is a hypothetical solution, never tested as far as I know:
You can set R to the number of documents you want.
Then you could set r to p*R where you assume that the probability of
a relevant document having the term is p.
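The text's "formula 4" is not reproduced in this digest, so as an assumption, here is the classic Robertson/Sparck Jones relevance weight, which has the usual shape of that formula in probabilistic IR, with the guessed R and r = p*R plugged in:

```python
import math

def relevance_weight(r, R, n, N):
    """Robertson/Sparck Jones relevance weight (whether this matches the
    text's formula 4 is an assumption). r = relevant docs containing the
    term, R = relevant docs, n = docs containing the term, N = collection
    size; the 0.5 terms smooth zero counts."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

# With no feedback: guess R, then set r = p * R as suggested above.
R, p = 10, 0.5
r = p * R
print(relevance_weight(r, R, n=100, N=10000))
```

With these guesses the weight behaves like a collection-frequency (IDF-style) weight: the rarer the term, the higher its weight.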
Why is this model in Ch 14? Doesn't it assume that a
relevance judgement
is made for the query? (If not, how could we estimate the value of R?) I'm
confused...
eaf:
Yes, it is unfortunate that this is in chapter 14 instead of
chapter 11 where it belongs. The authors/editor should have
fixed this confusion. I guess the only reason is that someone
might just read chapter 14 and not 11, and so will learn about
the prob. model and a little on feedback just from that.
I missed the class where you announced that there were only going to be 10 quizzes. However, when asking people about this I can't get a solid answer on how this works. Apparently there is a lot of confusion about this. Can you explain?
Also, once we have passed the quiz on a unit, is it ok to look at the
alternate quizzes?
eaf:
There are 11 units, each with exercises and quizzes. However, because of
all that the No. VA class is learning about distance learning and
video conferencing, I am giving a 10-point bonus there.
Regarding alternate quizzes: yes, you are encouraged to look at all versions for a unit once you have passed one version.
A question about the expansion of the acronym MST. On page 248 MST
is expanded to maximum spanning tree and on page 432 it is expanded to
minimal spanning tree. There is also a reference on page 257 (para. 2,
line 4) to maximal spanning tree and on page 419 in the abstract to minimal
spanning tree. What gives?
eaf:
MST can refer to either minimal or maximal spanning tree. We want either
the smallest or largest value to decide, and since sometimes we work with
distances and other times with similarities, the comparison will vary.
Thus, we have a minimal ST if we focus on distances, and a maximal ST
if we focus on similarities.
In general we want the best tree --- that is what you should remember.
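A sketch of the point in Python: the same Kruskal procedure yields either tree, and the only change is the sort direction (the edge data is made up for illustration):

```python
def mst(edges, nodes, maximize=False):
    """Kruskal's algorithm with union-find. Sorting edges in descending
    order yields a maximal spanning tree (similarities); ascending order
    yields a minimal one (distances)."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    tree = []
    for w, u, v in sorted(edges, reverse=maximize):
        ru, rv = find(u), find(v)
        if ru != rv:            # keep the edge only if it joins two components
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

edges = [(1, "a", "b"), (2, "b", "c"), (3, "a", "c")]  # (weight, u, v)
nodes = {"a", "b", "c"}
print(mst(edges, nodes))                 # minimal ST, total weight 3
print(mst(edges, nodes, maximize=True))  # maximal ST, total weight 5
```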
I was wondering if you
could point me to a source which lays out the differences between all of the
compression formats such as: JPEG, MPEG, GIF, TIFF, etc.
I don't need the kind of
detail found in the Wallace paper but if you know of a good Web source which
describes these differences I would appreciate it if you could pass it along.
eaf:
See the basic overview starting at
http://ei.cs.vt.edu/~mm/s95/sspace/CS2984_MM_Links_669.html
or some of the following specifics:
... information about search engines.
I've had the opportunity to look at the
marketing material provided by a couple vendors (e.g., FreeWAIS, ConQuest,
Topic) and I've visited the web site at the University of Colorado
to look at their Harvest material. While this is useful material, I
am interested in looking into the concepts on which these products are
based. A survey article, for example one that looks at the strengths
and weaknesses of various searching schemes, would be most interesting.
I'm also interested in finding out more about the use of intelligent
agents in searching.
eaf: The online sources are actually a good way to start.
FreeWAIS is a standard vector system with the addition of handling Z39.50, a protocol for client-server IR. The WAIS version of that protocol makes use of a collection of description files, one per information source, which is searched first. Then the selected information sources themselves are searched. Feedback occurs by adding relevant documents into the search.
TOPIC is discussed a bit in the course notes. It evolved out of RUBRIC, which was presented at the 1986 IEEE Expert Systems in Government Conference by Tong et al. and also by that group at SIGIR'87.
ConQuest is like these other systems, but also uses proximity of occurrence to help with ranking. I know how it works, but am not sure what has been published.
Writing a good review would be an interesting exercise or M.S. project. I don't know of any that exists, though there may be.
For the last three units, it states that the course notes for those lectures are on the KMS system. Is there any way to access these notes through some viewer without using KMS?
The system I use at home doesn't support X Window applications (which I got the impression you need in order to run KMS).
Is there any way these notes could be made available through
regular WWW access? Or are they already accessible through WWW
and regular Web browsers, but I just don't know how to access them?
eaf: All the notes for these 3 units are available over
WWW. I have moved what I could of the MM material from KMS into an Adobe PDF
file, since KMS can produce PostScript. If the video course stays up, I may
do more of that for the other units.
The Aug 1995 issue of Communications of the ACM is a special issue
on hypermedia applications. The theme is designing real-world,
commercial-scale applications that incorporate hypermedia-based navigation
and annotation support. While the articles are excellent reading, the
experience can be enhanced by accessing the online versions through the
ACM SIGLINK home page at http://www.acm.org/siglink
eaf: Thanks for the pointer!
Several years ago there was much discussion about compound documents.
I haven't heard much about compound documents recently. Have compound
documents become multimedia documents?
eaf:
Well, things come in spurts, and get specialized.
There was a debate between ODA and SGML related to this, and
in my opinion SGML won, so compound documents are now handled in
SGML. Some of this is done with HyTime.
Part of the push has been toward multimedia, with MIME becoming commonplace in mail and on the WWW. There are still efforts to do something better than MIME, including X.400 extensions that give mail richer compound-document support.
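As a concrete taste of MIME's compound-document mechanism, here is a small sketch using Python's standard email library to bundle text and an image part into one multipart message. The image bytes are a placeholder, not real data.

```python
# Build a MIME compound document: a plain-text part plus an
# attached image, wrapped together as multipart/mixed.
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Compound document example"
msg.set_content("This is the text part of the document.")

# Attach a (fake) image part; real code would read bytes from a file.
msg.add_attachment(b"\x89PNG placeholder bytes",
                   maintype="image", subtype="png",
                   filename="figure1.png")

# Once a second part is attached, the message becomes multipart/mixed.
print(msg.get_content_type())
```

Each part carries its own content type, so a mail reader or Web client can render the text inline and hand the image to an appropriate viewer.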
There is a journal,
Electronic Publishing: Origination,
Dissemination and Design (Wiley),
and I believe there will be a European document processing
conference sometime next year.
Then there are the vendor efforts, by Microsoft and Apple. They may be the real places of battling about compound documents.
Unique application of KB Technology
I happened upon this novel use of a Mailbot. Enjoy:
http://www.delphi.com/power/powrmail.htm
Besides helping the Power Rangers communicate with Zordon, I am
also acting as the official Power Rangers Web Site Mailbot. Click
on any of the Power Rangers below to send them a quick note. You
can even send some email to the Monsters and see what they're up
to! I will get your letter delivered, no matter where in the
universe the Power Rangers and Monsters are, and I will forward
their reply to you within 24 hours!
eaf: Well, I guess we need all the help this world can get!
5604n, Review of Web Page Editors
The December 11th issue of InformationWeek contains a competitive review of several Web page editors, including NaviPress, PageMill, and HoTMetaL. (Others are mentioned in sidebars.)
You can see the review on the Web, at:
http://techweb.cmp.com/iw/557/57olweb.htm
You can also get a free trial of the winner, at:
http://www.navisoft.com
eaf: For our purposes it was good to explore HoTMetaL since
it helps convey ideas of SGML. I'll have to look at the others, and
welcome comments regarding their applicability.
SWISH indexing:
I have never used SWISH, but in my research on WWW robots it looked like the most likely candidate for an indexing program to run against Web pages. To get to the site, use www.eit.com/software/swish/
Add swish.html to the end if you want to see the documentation.
"SWISH stands for Simple Web Indexing System for Humans. With it, you
can index directories of files and search the generated indexes."
eaf: Thanks, this looks interesting! However, for our course it
is necessary to have a WWW interface. You can use Harvest now to search
our WWW pages, at
http://ei.cs.vt.edu/Harvest/brokers/cs5604/
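The core of what an indexer like SWISH does can be sketched as a toy inverted index. This is illustrative only, not SWISH's implementation; the filenames and texts are invented.

```python
# Toy inverted index: map each term to the set of files containing it,
# then answer conjunctive queries by intersecting those sets.
from collections import defaultdict

def build_index(files):
    """files: dict of filename -> text. Returns term -> set of filenames."""
    index = defaultdict(set)
    for name, text in files.items():
        for term in set(text.lower().split()):
            index[term].add(name)
    return index

def lookup(index, *terms):
    """Filenames containing all of the query terms."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

files = {"a.html": "digital library research",
         "b.html": "library catalog systems"}
idx = build_index(files)
hits = lookup(idx, "library", "digital")
```

A real indexer adds a crawler to gather the files, stemming or stopword removal, and an on-disk index format; the term-to-document mapping above is the essential data structure.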
JPEG references:
1. Proceedings of the IEEE, Vol. 83, No. 2, Feb. 1995. Mario Kovac & N. Ranganathan. Pages 247-??. (I didn't copy the end of the article.) The first three pages are an introduction to JPEG, giving a decent overview of the status of the JPEG standard and how the algorithm works. The rest is very hardware-oriented.
2. IEEE Spectrum, October 1991, pages 16-19. Peng H. Ang, Peter A. Ruetz, David
Auld. All authors appear to be with LSI Logic Corp. and interested in VLSI
design. I found this article good for understanding JPEG
compression. It also discusses MPEG and CCITT encoding.
eaf: Yes, JPEG has become popular. The class reading assignment
is the first public announcement, by the chair of the committee.
These other articles may be of interest, and should help people like you
who have an in-depth interest, particularly regarding hardware.
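For readers who want the gist of what these articles describe: JPEG transforms each 8x8 block of pixels with a 2-D discrete cosine transform, then quantizes the coefficients, which is where the loss (and most of the compression) comes from. A naive sketch follows; it omits the zig-zag ordering, per-frequency quantization tables, and entropy coding of real JPEG.

```python
# Naive 8x8 DCT-II plus uniform quantization, the core of JPEG's
# lossy stage (real JPEG uses a per-frequency quantization table).
from math import cos, pi, sqrt

N = 8

def dct2(block):
    """2-D DCT-II of an 8x8 block (list of lists of pixel values)."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = sum(block[x][y]
                    * cos((2 * x + 1) * u * pi / (2 * N))
                    * cos((2 * y + 1) * v * pi / (2 * N))
                    for x in range(N) for y in range(N))
            cu = sqrt(1 / N) if u == 0 else sqrt(2 / N)
            cv = sqrt(1 / N) if v == 0 else sqrt(2 / N)
            out[u][v] = cu * cv * s
    return out

def quantize(coeffs, q=16):
    """Divide each coefficient by a step size and round."""
    return [[round(c / q) for c in row] for row in coeffs]

flat = [[100] * N for _ in range(N)]     # a flat gray block
coeffs = quantize(dct2(flat))
# A flat block reduces to a single nonzero DC coefficient: 63 of the
# 64 values quantize to zero, which is why smooth areas compress well.
```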