CS4624 Glossary
The following are terms with their definitions or explanations. More terms
will be added, and missing definitions will be supplied.
- Boolean queries
- formal representation of a question or information need using AND, OR,
or NOT to connect terms
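A minimal sketch of how such a query might be evaluated with set operations. The `postings` table and the `docs_for` helper are hypothetical, invented for illustration, and NOT is treated here as "AND NOT" (set difference), which is an assumption about how a system interprets it.

```python
# Hypothetical postings for illustration: term -> set of ids of documents containing it.
postings = {
    "hypertext": {1, 2, 4},
    "retrieval": {2, 3, 4},
    "video": {4, 5},
}

def docs_for(term):
    """Return the set of documents indexed by a term (empty if the term is unknown)."""
    return postings.get(term, set())

# hypertext AND retrieval: set intersection
both = docs_for("hypertext") & docs_for("retrieval")
# hypertext OR video: set union
either = docs_for("hypertext") | docs_for("video")
# (hypertext AND retrieval) NOT video: set difference
result = both - docs_for("video")
```

Only documents that satisfy the whole expression are returned; there is no partial matching or ranking in this model.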
- browsing
- looking around, reading, viewing objects or information, sometimes with
the help of some organization or tool (e.g., a browser) --- often with
no or only a vague objective, sometimes as in window-shopping
- CD-DA
- compact disc digital audio (i.e., standard music CD)
- CD-ROM
- compact disc read only memory (storing about 600 Mbytes of digital data
in similar format to a music CD)
- clustering
- grouping similar items together to form clusters whose centroid or representative
characterizes the group
- CODER
- COmposite Document Expert/extended/effective Retrieval system developed
at Virginia Tech
- composite
- something made up of parts - for example, in the Amsterdam model
there are atomic objects (treated as a whole, not decomposed) and
composite objects (made up of atomic or other composite objects)
- controlled vocabulary
- a fixed terminological set from which indexing and query terms are selected
- CSCW
- computer supported cooperative work --- group work supported with collaboration
technology
- digital library
- a collection of digital representations of information content, along
with hardware, software, and personnel to support the functions of a traditional
library plus knowledge worker operations like searching, browsing, and navigation
- digital tree
- a hierarchical organization of data where at each level there is a multiway
branch, e.g., 10-way so each digit of a number can determine the next step
in a path from the root
- distance function
- a function that computes the distance between a pair of items, e.g.,
d(d1,d2) with properties:
- d(d1,d2) = 0 if d1=d2
- d(d1,d2) = d(d2,d1)
- d(d1,d3) <= d(d1,d2)+d(d2,d3)
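As an illustration, the sketch below checks these properties for Euclidean distance over a few sample points. The choice of Euclidean distance, the sample `points`, and the helper `check_metric` are assumptions made for the example. The third property is checked in its usual triangle-inequality form, d(d1,d3) <= d(d1,d2) + d(d2,d3).

```python
import math
from itertools import product

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def check_metric(d, items, tol=1e-9):
    """Check identity, symmetry, and the triangle inequality on a finite sample."""
    for a in items:
        if abs(d(a, a)) > tol:                      # d(a, a) = 0
            return False
    for a, b in product(items, repeat=2):
        if abs(d(a, b) - d(b, a)) > tol:            # d(a, b) = d(b, a)
            return False
    for a, b, c in product(items, repeat=3):
        if d(a, c) > d(a, b) + d(b, c) + tol:       # d(a, c) <= d(a, b) + d(b, c)
            return False
    return True

# Hypothetical sample items, represented as 2-D vectors.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 1.0)]
```

Passing the check on a sample does not prove the properties in general; it only fails to find a counterexample.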
- document
- an article, book, or other work, typically containing text or other
media, that has some type of information content
- E measure
- a single-valued measure of the effectiveness of an information retrieval
system (with 0=best, 1=worst), which is a function of both recall and precision,
as well as a factor that determines the relative importance between these
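The glossary does not give a formula. One widely used formulation, due to van Rijsbergen, is E = 1 - (1 + b^2)PR / (b^2 P + R), where P is precision, R is recall, and b sets their relative importance (b > 1 favors recall). The sketch below assumes that formulation; the function name and the handling of the zero case are choices made for the example.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 0 is best, 1 is worst (assumed formulation)."""
    b2 = b * b
    denominator = b2 * precision + recall
    if denominator == 0:
        # Nothing relevant was retrieved: treat as the worst case.
        return 1.0
    return 1.0 - (1.0 + b2) * precision * recall / denominator
```

With b = 1 this reduces to one minus the harmonic mean of precision and recall.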
- Engelbart
- Doug E. was the inventor of the mouse and other early interactive technologies
who first demonstrated a powerful hypertext system working as a CSCW tool,
and led work on Augment and NLS, to augment human knowledge and skills
- Envision
- a system developed at Virginia Tech in connection with the NSF-funded,
ACM-supported project "A User-Centered Database from the Computer Science
Literature" 1991-1995
- exhaustivity
- measure of the degree to which the content of a collection is "covered",
typically used to describe a controlled vocabulary
- faceted classification
- a system for categorizing information in which different aspects or facets
are separately considered
- flat file
- a component of a file system or entry on a storage device, that is treated
as having no special structure beyond that of bytes, characters, words and/or
lines
- FSA (finite state automaton)
- an abstract machine made up of states (including a special "start"
or "initial" one as well as one or more "final" states)
where one takes a state-state transition if the input token matches that
for the transition --- that can recognize a regular language and so is equivalent
to a regular expression --- often used for document analysis
- filtering
- producing output by restricting input according to some criteria ---
in connection with text, images, speech, electromagnetic waves or signals
- Guide
- a hypertext system marketed by OWL (Office Workstations Ltd.) that includes
scrolling and note capabilities
- hashing
- computing an address to look for an item by applying a mathematical
function to a key for that item
- HTML (HyperText Markup Language)
- an application of SGML, defined by a simple Document Type Definition
developed in 1993, that is used for tagging documents on the World-Wide
Web, which can then be rendered with viewers like Mosaic or Netscape
- HTML+ (HyperText Markup Language - extended)
- an extended version of HTML, proposed in 1994, adding extra elements
such as for interactive forms
- HTTP (HyperText Transfer Protocol)
- a standard used as the basis of the World-Wide Web for communication
between clients and servers, proposed in 1993, that allows for retrieval
of data and following of hypermedia links
- HyperCard
- a hypertext/hypermedia system developed by Apple, provided free of charge
with new systems in 1987 and then sold by Claris, which implements a card-based
model derived from Xerox NoteCards, and uses an object-oriented scripting
language called HyperTalk
- hypermedia
- a collection of information objects or nodes in multimedia formats with
links (i.e., hypertext extended to multimedia)
- hypertext
- a term coined by Theodor Nelson for a collection of information objects
or nodes, containing text (and sometimes other multimedia formats in which
case it is often called hypermedia), with links, that thus serves as an
information graph that can be traversed by a hypertext system, which can
present each node and follow links from anchors in nodes to other nodes
(at which time the target node is also presented) --- information with a
nonlinear organization
- Hypertext Compendium
- an ACM Database and Electronic Products offering that includes most
of the early (through 1990) publications on hypertext, available in ASCII,
using KMS, or in HyperCard form
- Hypertext on Hypertext
- an ACM Database and Electronic Products offering that includes the articles
appearing in the July 1988 CACM special issue on hypertext, available in
KMS, HyperCard, and HyperTies forms
- HyTime
- ISO standard 10744, describing the structure of time-based hypermedia
documents
- indexing
- the process of building an index, such as when a collection of text
documents is analyzed to automatically identify its word or word stems that
are then recorded and made to point to locations in the collection where
they occur
- indexing language
- the set of terms used during indexing, possibly all the words in a collection,
or a fixed set of terms found in a controlled vocabulary or thesaurus, possibly
including phrases or other more complex forms
- IDF (inverse document frequency)
- a weighting formula used in some information retrieval systems whereby
the importance of a term is based on the reciprocal of its document frequency
in the collection; for example log (N/n) when the term occurs in n documents
from a collection of N documents
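The log(N/n) example can be computed directly. In this sketch the function name `idf`, the use of the natural logarithm, and the argument checks are assumptions for illustration.

```python
import math

def idf(N, n):
    """Inverse document frequency log(N/n) for a term found in n of N documents."""
    if not 0 < n <= N:
        raise ValueError("need 0 < n <= N")
    return math.log(N / n)
```

A term that occurs in every document gets weight 0, and rarer terms get larger weights.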
- INQUERY
- INQUERY Information Retrieval System (U. Mass. Amherst), which implements
a probabilistic model based on use of a Bayesian inference network
- Intermedia
- a system developed in the IRIS project at Brown University, that is
the precursor to both StorySpace and Hyper-G. The IRIS group produced
ACM HonH - Hypertext on Hypertext.
A good article on it is discussed in the course notes: HAAN92.
- inverted file
- a file structure in which words or other terms used to index a collection
of information are connected with a list of pointers to the locations where
those words occur --- the inverted form of documents containing terms, where
terms point to document (occurrences)
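A small sketch of building such a structure in memory. The sample `docs`, the whitespace tokenization, and the (document id, word position) pointer format are assumptions for the example; real systems typically store the lists on disk and normalize terms further.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each lowercased word to a list of (doc_id, position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

# Hypothetical three-document collection.
docs = {
    1: "hypertext links nodes",
    2: "nodes hold text",
    3: "links connect nodes",
}
index = build_inverted_file(docs)
```

Looking up `index["nodes"]` then returns every location of the word without scanning the documents themselves.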
- JPEG
- Joint Photographic Experts Group, ISO/CCITT standard for compressing
still images (grayscale or color), available in lossless form for roughly
3:1 compression or in lossy form for 10:1 or more compression using the
discrete cosine transform (DCT), coefficients based on the frequency response
of the Human Visual System, a zig-zag run-length sequencing, and Huffman
or arithmetic coding
- KMS
- hypertext/hypermedia system for expert users, computer-supported collaborative
work, implementing a 2-frame/window model with a powerful scripting language
- Licklider
- J.C.R. Licklider was author of Libraries of the Future (1965), director
at ARPA involved in early funding of the ARPAnet, director of MIT's Project
Mac in the late 1960s-70s
- link
- one of the distinguishing types of logical objects that characterizes
hypertext; something that connects two anchors (usually directed, from
the source to the target anchor); a type of pointer from one part (e.g., node)
of a hypertext to another, sometimes labelled (by name or type), which
is resolved by the hypertext system when the source anchor is selected, and
then usually causes presentation of the target node
- MARC
- a record format developed by the Library of Congress for library catalogs,
that can describe an individual book, journal, or other work, using a collection
of fields and subfields
- MARIAN
- an experimental OPAC retrieval system developed starting in 1990 at
Virginia Tech Computing Center, first used as an alternative to searching
with the VTLS system, but also used as the search component of the Envision
digital library system
- MBone
- MBONE stands for the Multicast Backbone on the Internet, launched
in 1992. It carries audio and video conferences, plus shared whiteboards.
- memex
- an imaginary device described by Vannevar Bush in his seminal article
"As We May Think" in the July 1945 Atlantic Monthly, implementing
hypertext-style associative linking between documents and images, described
using microform technology
- MIDI
- Musical Instrument Digital Interface
- MIME
- Multipurpose Internet Mail Extensions (multimedia mail protocol defined by RFC 1521)
- MJPEG
- Motion JPEG, a video compression scheme in which each frame is separately
compressed using the JPEG standard
- Mosaic
- an Internet application used to browse and navigate on the World-Wide
Web, that can render documents provided in HTML, follow links among such
documents, use HTTP as well as other protocols (e.g., gopher, FTP, UUCP),
and manipulate multimedia information carried using the MIME standard
- MPEG
- Moving Picture Experts Group --- digital video standard
- natural language
- a language used by humans to communicate, e.g., Chinese, English, Farsi,
French, Hindi, Russian, Zulu
- natural language text-search
- a method of searching text collections in which user queries are supplied
as natural language texts or at least phrases or word strings, usually involving
the vector-space or probabilistic model of partial matching
- Nelson
- Theodor N. coined the terms hypertext and hypermedia, was a great proponent
of these ideas, worked at Brown on some of the early systems, and proposed
and worked toward Xanadu, a universal system for shared hypertext publishing
and editing
- netlib
- a software system developed at Bell Labs by Dongarra and Grosse for searching
of numerical analysis information, including algorithms and code
- NII
- National Information Infrastructure, the framework for U.S. efforts
in the information industry, electronic publishing, and high-performance
computing and communication (HPCC)
- NREN
- the National Research and Education Network, an evolving U.S. network
to support the research and education community, building upon the NSFNET
- NSFNET
- an expansion of the ARPANET to serve the NSF community, leading toward
the larger future NREN
- OPAC
- online public access catalog --- an automated system to allow searching
in library catalogs
- paperless society
- a vision proposed by F.W. Lancaster and others in which electronic publishing
and communication would largely eliminate the need for paper
- PAT
- a system developed at the Univ. of Waterloo at the Centre for the New
OED, later taken over by Open Text Corp., which supports dictionary, SGML
collection, and other types of searching, using a Patricia tree representation
to give very rapid response to queries involving strings or phrases
- Patricia tree
- a data structure, somewhat like a trie, but implemented as a binary
digital tree, where every semi-infinite string (sistring) from a large string
(the concatenation of all text in a collection) is entered in the tree,
and is associated with a pointer to the start of the sistring
- plasticity
- a property of electronic information in that it can be easily reshaped,
republished, reused because it is in a manipulable digital representation
- precision
- a measure of how precise or specific an information retrieval system
is, or behaves for a given query, computed as the ratio of the number of
relevant items retrieved to the total number of items retrieved
- query
- a formal representation of a search need or anomalous state of knowledge
(ASK, a la Belkin), that can be processed by an information retrieval system
- RAID
- Redundant Array of Inexpensive Disks - a method of combining several
relatively cheap (e.g., SCSI-2) disks into a single unit where the disks
operate in parallel to give higher throughput. Thus, data may be striped
across the disks so playback or recording runs at the sum of the transfer
speeds of the disks. Some levels allow for hot spares so that
extra disks keep error correction data that allows one of the disks to
be replaced in case of failure while the array keeps running.
- ranking
- ordering the set of documents or items found by an information retrieval
system in response to a query, usually in descending order of estimated
relevance to the query
- recall
- a measure of how comprehensive or thorough an information retrieval
system is, or behaves for a given query, computed as the ratio of number
of relevant items retrieved to the total number of relevant items
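Recall and precision (defined earlier in this glossary) can be computed together from the set of retrieved items and the set of relevant items. The function name and the convention of returning 0.0 for an empty set are assumptions for this sketch.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, retrieving items 1-4 when items 2, 4, 6, 8, and 10 are relevant gives 2 hits, so precision is 2/4 and recall is 2/5.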
- regular expression
- a string following the rules of a regular language, used to describe
a class of strings (that can be recognized by an FSA), allowing alternatives,
specifying a sequence, and indicating number of occurrences (0, 1, any number,
at least one)
- relevance judgment
- a decision made by a human regarding if a particular document is relevant
to a particular query
- searching
- purposefully trying to find some object or information, sometimes
with the help of a search system or search engine, sometimes using
an information retrieval system, sometimes by submitting a formal query,
often following some search strategy or plan
- search tree
- any data structure that involves a tree and can be used to speed up
search for an item or keyword, such as a trie or Patricia tree
- SGML
- Standard Generalized Markup Language, ISO standard 8879, published in
1986, a flexible system to describe and represent documents, actually a
metalanguage to describe classes of documents through Document Type Definitions
(DTDs) and then documents that are in those classes
- signature file
- a file, sometimes implemented using superimposed coding, in which a
document or document block is described by a signature, usually a fairly
long bit string, in which bits are set if some term in the block hashes
to that bit location --- a conjunctive query can be processed by building
a signature for the query; every document that satisfies the query is then guaranteed
to have a signature matching that of the query (though some other documents may also
match, as false drops, and need to be discarded)
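A rough sketch of this scheme using superimposed coding. The signature width, the number of bits per term, the use of SHA-256 as the hash, and all function names are assumptions chosen for the example, not part of any particular system.

```python
import hashlib

WIDTH = 64          # signature length in bits (assumed)
BITS_PER_TERM = 3   # bit positions set per term (assumed)

def term_signature(term):
    """Hash a term to up to BITS_PER_TERM positions in a WIDTH-bit signature."""
    sig = 0
    for i in range(BITS_PER_TERM):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % WIDTH)
    return sig

def block_signature(terms):
    """Superimpose (bitwise OR) the signatures of all terms in a block."""
    sig = 0
    for term in terms:
        sig |= term_signature(term)
    return sig

def may_match(query_terms, block_sig):
    """True if every bit of the query signature is also set in the block signature."""
    query_sig = block_signature(query_terms)
    return (query_sig & block_sig) == query_sig

def search(query_terms, blocks):
    """Conjunctive search: filter by signature, then verify to discard false drops."""
    results = []
    for block_id, terms in blocks.items():
        if may_match(query_terms, block_signature(terms)):
            if set(query_terms) <= set(terms):   # verification step
                results.append(block_id)
    return results
```

The signature test can admit blocks that do not contain the query terms, but it never rejects one that does, which is why the verification step is still needed.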
- similarity measure
- a method of estimating the similarity or "closeness" between
two entities, such as two documents or a document and a query, where 0 represents
none and higher values indicate more
- SMART
- an experimental information retrieval system developed initially at
Harvard University in the early 1960s and then continued through the 1990s
at Cornell University, under the supervision of Gerard Salton
- specificity
- how precise or exact a term or indexing language is in its ability to
describe
- stemming (suffix stripping)
- removing (usually automatically) the ending of a word, typically with
a fast algorithm, to form a canonical representation that usually approximates
the root form
- stop word list
- a list of words or terms that is excluded from indexing and searching,
i.e., ignored as irrelevant, usually made up of function words or words
that occur very often in a given collection
- superimposed coding
- a scheme for developing a signature for a block of text, i.e., a short
record with bits set because terms in that block hash to their location,
that allows rapid search for conjunctive queries, and usually does not find
many records that have a suitable signature but do not satisfy the query
- term
- a word, word stem, keyword, root, phrase, acronym, abbreviation, descriptor,
controlled vocabulary entry, thesaurus category or other construct meant
to characterize some object or concept
- term broadening
- a process used by searchers or information retrieval systems to replace
a single term with another or with a collection of terms that occur more
often, and have wider or less precise coverage and/or meaning
- term narrowing
- a process used by searchers or information retrieval systems to replace
a single term or phrase with another that occurs less often, and has narrower
or more specific coverage and/or meaning
- term weighting
- a process of associating a value with a term, usually real-valued, and possibly
estimating a probability, that reflects the term's relative importance in a collection
or document
- TF (term frequency)
- a weighting scheme usually used in information retrieval systems to
rate the value of a term in a document based on the number of times it occurs
in that document
- thesaurus
- an information structure listing words or other terms, along with relationships
between them, such as: broader than, narrower than, cross reference to,
synonym of
- trie
- a digital tree, in which a multiway branch occurs at each level, such
as for the letters of the alphabet, where information entered is represented
by the path from the root to a node (possibly leaf) marked as "final"
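A minimal sketch of such a structure, branching on characters. The class and method names are hypothetical, and storing children in a dictionary (rather than a fixed 26-way array) is a design choice made for brevity.

```python
class Trie:
    """Each node branches on the next character; 'final' marks the end of an entry."""

    def __init__(self):
        self.children = {}
        self.final = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.final = True

    def _walk(self, prefix):
        """Follow the path for prefix from the root; return None if it does not exist."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.final

    def with_prefix(self, prefix):
        """Return all stored entries that start with prefix, in sorted order."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, path = stack.pop()
            if node.final:
                found.append(path)
            for ch, child in node.children.items():
                stack.append((child, path + ch))
        return sorted(found)
```

Because all entries sharing a prefix lie under one node, a prefix lookup such as `with_prefix("inform")` only visits that subtree.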
- truncation
- cutting off the (right) end of a word or term, such as when a searcher
asks for "inform*" to locate all words with "inform"
as a prefix
- TULIP
- The University Licensing Program --- Elsevier, 40 bitmap journals on
materials
- URL
- defined by Tim Berners-Lee's 1993 IETF Draft "Uniform Resource
Locators" --- describing a document or service on the internet as a
string which identifies the protocol, server machine, and additional information
(e.g., file path)
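The parts named above can be pulled apart with Python's standard `urllib.parse` module. The URL used here is a made-up example, not one taken from the course.

```python
from urllib.parse import urlsplit

# Hypothetical URL for illustration.
parts = urlsplit("http://www.example.org/docs/glossary.html")

protocol = parts.scheme   # the access protocol
server = parts.netloc     # the server machine
path = parts.path         # additional information, here a file path
```
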
- URN
- Uniform Resource Name that will identify a document or service, as
does a URL, but in a location-independent, logical, robust manner
- video-on-demand
- a system or service, usually involving storage of a large number of
video programs, that can support a number of users each of whom can at
any time request playback/delivery of any of the stored programs
- volatility
- a measure of how rapidly a collection of information changes
- WAIS
- Wide Area Information Server, originally developed using Z39.50, allowing
client-server searching over the Internet, first of a collection of sources
and then of actual information collections, usually involving a vector-space
type search, often with relevance feedback
- WWW
- World-Wide Web, a logical infrastructure on the Internet in which documents
and multimedia objects are linked, making use of HTTP, the HyperText Transfer
Protocol, and represented in various forms including HTML, the HyperText
Markup Language
- Z39.50
- the Information Retrieval Protocol, an ANSI and ISO standard for client-server
computing between information retrieval systems, especially library catalogs
(OPACs), adapted in WAIS
Copyright 1996 Edward A. Fox