Glossary


The following terms are listed with definitions or brief explanations. More terms and definitions will be added.

Boolean queries

formal representation of a question or information need using AND, OR, or NOT to connect terms
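
As a rough illustration, the three connectives can be mapped onto set operations over the documents that contain each term. The tiny collection and the helper name `having` below are made up for this sketch.

```python
# Hypothetical toy collection: document id -> set of index terms.
docs = {
    1: {"digital", "library", "hypertext"},
    2: {"digital", "audio"},
    3: {"library", "catalog"},
}


def having(term):
    """Return the ids of the documents indexed by `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}


all_ids = set(docs)

# digital AND library: documents indexed by both terms
and_hits = having("digital") & having("library")

# digital OR library: documents indexed by either term
or_hits = having("digital") | having("library")

# digital AND NOT library: NOT is taken relative to the whole collection
not_hits = having("digital") & (all_ids - having("library"))
```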

CD-DA

compact disc digital audio (i.e., standard music CD)

CD-ROM

compact disc read only memory (storing about 600 Mbytes of digital data in similar format to a music CD)

clustering

grouping similar items together to form clusters whose centroid or representative characterizes the group

CODER

COmposite Document Expert/extended/effective Retrieval system developed at Virginia Tech

controlled vocabulary

a fixed terminological set from which indexing and query terms are selected

CSCW

computer supported cooperative work --- group work supported with collaboration technology

digital library

a collection of digital representations of information content, along with hardware, software, and personnel to support the functions of a traditional library plus knowledge worker operations like searching, browsing, and navigation

digital tree

a hierarchical organization of data where at each level there is a multiway branch, e.g., 10-way so each digit of a number can determine the next step in a path from the root

distance function

a function that computes the distance between a pair of items, e.g., d(d1,d2) with properties:

  1. d(d1,d2) = 0 if d1=d2
  2. d(d1,d2) = d(d2,d1)
  3. d(d1,d3) <= d(d1,d2)+d(d2,d3)
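
The three properties above can be spot-checked on a small sample of items. The sketch below uses Euclidean distance between points purely as an example; `is_metric_on` is a hypothetical helper, and passing the check on a sample does not prove the properties in general.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples of numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_metric_on(points, d, tol=1e-9):
    """Check the three listed properties of d on every triple from `points`."""
    for a, b, c in product(points, repeat=3):
        if a == b and abs(d(a, b)) > tol:        # property 1: d(a,a) = 0
            return False
        if abs(d(a, b) - d(b, a)) > tol:         # property 2: symmetry
            return False
        if d(a, c) > d(a, b) + d(b, c) + tol:    # property 3: triangle inequality
            return False
    return True


def squared(a, b):
    """Squared Euclidean distance, which breaks the triangle inequality."""
    return euclidean(a, b) ** 2
```

Squared Euclidean distance is included as a counterexample: it is symmetric and zero for identical items, but it fails property 3 on collinear points such as (0,0), (1,0), (2,0).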

document

an article, book, or other work, typically containing text or other media, that has some type of information content

E measure

a single-valued measure of the effectiveness of an information retrieval system (with 0=best, 1=worst), which is a function of both recall and precision, as well as a factor that determines the relative importance between these
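
One common formulation, due to van Rijsbergen, is E = 1 - (1 + b^2)PR / (b^2 P + R), where P is precision, R is recall, and b sets their relative importance. The sketch below assumes that formulation; the function name and the handling of zero values are choices made for the example.

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) * P * R / (b^2 * P + R); 0 is best, 1 is worst.

    b = 1 weights precision and recall equally; values of b above 1
    give more weight to recall.
    """
    if precision == 0 or recall == 0:
        # Assumption for this sketch: treat a zero in either measure as worst case.
        return 1.0
    b2 = b * b
    return 1.0 - (1 + b2) * precision * recall / (b2 * precision + recall)
```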

Engelbart

Doug E. was the inventor of the mouse and other early interactive technologies who first demonstrated a powerful hypertext system working as a CSCW tool, and led work on Augment and NLS, to augment human knowledge and skills

Envision

a system developed at Virginia Tech in connection with the NSF-funded, ACM-supported project "A User-Centered Database from the Computer Science Literature" 1991-1995

exhaustivity

measure of the degree to which the content of a collection is "covered", typically used to describe a controlled vocabulary

faceted classification

a system for categorizing information in which different aspects or facets are separately considered

flat file

a component of a file system or entry on a storage device, that is treated as having no special structure beyond that of bytes, characters, words and/or lines

FSA (finite state automaton)

an abstract machine made up of states (including a special "start" or "initial" one as well as one or more "final" states) where one takes a state-state transition if the input token matches that for the transition --- that can recognize a regular language and so is equivalent to a regular expression --- often used for document analysis
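
A minimal sketch of such a machine, assuming a deterministic automaton with one transition per (state, token) pair. The state names and the language it recognizes (an "a" followed by any number of "b"s, i.e. the regular expression ab*) are chosen only for illustration.

```python
START = "q0"
FINAL = {"q1"}

# (current state, input token) -> next state
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
}


def accepts(tokens):
    """Return True if the token sequence drives the machine from START to a final state."""
    state = START
    for token in tokens:
        key = (state, token)
        if key not in TRANSITIONS:
            return False  # no matching transition: reject
        state = TRANSITIONS[key]
    return state in FINAL
```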

filtering

producing output by restricting input according to some criteria --- in connection with text, images, speech, electromagnetic waves or signals

Guide

a hypertext system marketed by OWL (Office Workstations Ltd.) that includes scrolling and note capabilities

hashing

computing an address to look for an item by applying a mathematical function to a key for that item
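
A small sketch of the idea, assuming a deliberately simple function (sum of the key's bytes modulo the table size) and chained buckets to handle keys that land on the same address. Real systems use better-distributed functions; all names here are hypothetical.

```python
TABLE_SIZE = 11

# Each address holds a list (chain) of (key, value) pairs.
table = [[] for _ in range(TABLE_SIZE)]


def address(key):
    """Map a string key to a table address in the range 0..TABLE_SIZE-1."""
    return sum(key.encode("utf-8")) % TABLE_SIZE


def store(key, value):
    """Append the item to the chain at the key's address."""
    table[address(key)].append((key, value))


def lookup(key):
    """Look only in the chain at the computed address."""
    for stored_key, value in table[address(key)]:
        if stored_key == key:
            return value
    return None
```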

HTML (HyperText Markup Language)

an application of SGML, defined by a simple Document Type Definition developed in 1993, that is used for tagging documents on the World-Wide Web, which can then be rendered with viewers like Mosaic or Netscape

HTML+ (HyperText Markup Language - extended)

an extended version of HTML, proposed in 1994, adding extra elements such as for interactive forms

HTTP (HyperText Transfer Protocol)

a standard used as the basis of the World-Wide Web for communication between clients and servers, proposed in 1993, that allows for retrieval of data and following of hypermedia links


HyperCard

a hypertext/hypermedia system developed by Apple, provided free of charge with new systems in 1987 and then sold by Claris, which implements a card-based model derived from Xerox NoteCards, and uses an object-oriented scripting language called HyperTalk

hypermedia

a collection of information objects or nodes in multimedia formats with links (i.e., hypertext extended to multimedia)

hypertext

a term coined by Theodor Nelson for a collection of information objects or nodes, containing text (and sometimes other multimedia formats, in which case it is often called hypermedia), with links, that thus serves as an information graph that can be traversed by a hypertext system, which can present each node and follow links from anchors in nodes to other nodes (at which time the target node is also presented) --- information with a nonlinear organization

Hypertext Compendium

an ACM Database and Electronic Products offering that includes most of the early (through 1990) publications on hypertext, available in ASCII, using KMS, or in HyperCard form

Hypertext on Hypertext

an ACM Database and Electronic Products offering that includes the articles appearing in the July 1988 CACM special issue on hypertext, available in KMS, HyperCard, and HyperTies forms

HyTime

ISO standard 10744, describing the structure of time-based hypermedia documents

indexing

the process of building an index, such as when a collection of text documents is analyzed to automatically identify its words or word stems, which are then recorded and made to point to locations in the collection where they occur

indexing language

the set of terms used during indexing, possibly all the words in a collection, or a fixed set of terms found in a controlled vocabulary or thesaurus, possibly including phrases or other more complex forms

IDF (inverse document frequency)

a weighting formula used in some information retrieval systems whereby the importance of a term is based on the reciprocal of its document frequency in the collection; for example log (N/n) when the term occurs in n documents from a collection of N documents
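
The log(N/n) example can be computed directly. The sketch below assumes the natural logarithm and a toy collection invented for the illustration; other bases and smoothed variants are also in use.

```python
import math


def idf(N, n):
    """Inverse document frequency log(N/n) for a term found in n of N documents."""
    return math.log(N / n)


# Hypothetical collection: each document is reduced to its set of terms.
collection = [
    {"digital", "library"},
    {"digital", "audio"},
    {"digital", "video"},
    {"library", "catalog"},
]


def document_frequency(term):
    """Number of documents in the toy collection that contain the term."""
    return sum(1 for doc in collection if term in doc)


N = len(collection)
weights = {t: idf(N, document_frequency(t)) for t in ("digital", "library", "catalog")}
```

Rarer terms receive larger weights: here "catalog" (1 document) outweighs "library" (2 documents), which outweighs "digital" (3 documents).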

INQUERY

INQUERY Information Retrieval System (U. Mass. Amherst), which implements a probabilistic model based on use of a Bayesian inference network

inverted file

a file structure in which words or other terms used to index a collection of information are connected with a list of pointers to the locations where those words occur --- the inverted form of documents containing terms, where terms point to document (occurrences)
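
A minimal in-memory sketch of the structure, assuming whitespace tokenization and pointers of the form (document id, word position). The sample documents are invented for the example.

```python
from collections import defaultdict

# Hypothetical collection: document id -> text
docs = {
    1: "digital library systems",
    2: "digital audio on compact disc",
    3: "library catalog",
}


def build_inverted_file(docs):
    """Map each word to a list of (doc_id, position) pointers."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for position, word in enumerate(docs[doc_id].split()):
            index[word].append((doc_id, position))
    return dict(index)


index = build_inverted_file(docs)
```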

JPEG

Joint Photographic Experts Group, ISO/CCITT standard for compressing still images (grayscale or color), available in lossless form for roughly 3:1 compression or in lossy form for 10:1 or more compression using the discrete cosine transform (DCT), coefficients based on the frequency response of the Human Visual System, a zig-zag run-length sequencing, and Huffman or arithmetic coding

KMS

hypertext/hypermedia system for expert users, computer-supported collaborative work, implementing a 2-frame/window model with a powerful scripting language

Licklider

J.C.R. Licklider was the author of Libraries of the Future (1965), a director at ARPA involved in early funding of the ARPAnet, and director of MIT's Project MAC in the late 1960s-70s

MARC

a record format developed by the Library of Congress for library catalogs, that can describe an individual book, journal, or other work, using a collection of fields and subfields

MARIAN

an experimental OPAC retrieval system developed starting in 1990 at Virginia Tech Computing Center, first used as an alternative to searching with the VTLS system, but also used as the search component of the Envision digital library system

memex

an imaginary device described by Vannevar Bush in his seminal article "As We May Think" in the July 1945 Atlantic Monthly, implementing hypertext-style associative linking between documents and images, described using microform technology

MIDI

Musical Instrument Digital Interface

MIME

Multipurpose Internet Mail Extensions (multimedia mail protocol defined by RFC 1521)

MJPEG

Motion JPEG, a video compression scheme in which each frame is separately compressed using the JPEG standard

Mosaic

an Internet application used to browse and navigate on the World-Wide Web, that can render documents provided in HTML, follow links among such documents, use HTTP as well as other protocols (e.g., gopher, FTP, UUCP), and manipulate multimedia information carried using the MIME standard

MPEG

Moving Picture Expert Group --- digital video standard

natural language

a language used by humans to communicate, e.g., Chinese, English, Farsi, French, Hindi, Russian, Zulu

natural language text-search

a method of searching text collections in which user queries are supplied as natural language texts or at least phrases or word strings, usually involving the vector-space or probabilistic model of partial matching

Nelson

Theodor N. coined the terms hypertext and hypermedia, was a great proponent of these ideas, worked at Brown on some of the early systems, and proposed and worked toward Xanadu, a universal system for shared hypertext publishing and editing

netlib

a software system developed at Bell Labs by Dongarra and Grosse for searching of numerical analysis information, including algorithms and code

NII

National Information Infrastructure, the framework for U.S. efforts in the information industry, electronic publishing, and high-performance computing and communication (HPCC)

NREN

the National Research and Education Network, an evolving U.S. network to support the research and education community, building upon the NSFNET

NSFNET

an expansion of the ARPANET to serve the NSF community, leading toward the larger future NREN

OPAC

online public access catalog --- an automated system to allow searching in library catalogs

paperless society

a vision proposed by F.W. Lancaster and others in which electronic publishing and communication would largely eliminate the need for paper

PAT

a system developed at the Univ. of Waterloo at the Centre for the New OED, later taken over by Open Text Corp., which supports dictionary, SGML collection, and other types of searching, using a Patricia tree representation to give very rapid response to queries involving strings or phrases

Patricia tree

a data structure, somewhat like a trie, but implemented as a binary digital tree, where every semi-infinite string (sistring) from a large string (the concatenation of all text in a collection) is entered in the tree, and is associated with a pointer to the start of the sistring

plasticity

a property of electronic information in that it can be easily reshaped, republished, reused because it is in a manipulable digital representation

precision

a measure of how precise or specific an information retrieval system is, or behaves for a given query, computed as the ratio of the number of relevant items retrieved to the total number of items retrieved

query

a formal representation of a search need or anomalous state of knowledge (ASK, a la Belkin), that can be processed by an information retrieval system

RAID


Redundant Array of Inexpensive Disks --- a method of combining several relatively cheap (e.g., SCSI-2) disks into a single unit in which the disks operate in parallel to give higher throughput. Data may be striped across the disks so that playback or recording runs at the sum of the transfer speeds of the disks. Some levels keep error correction data on extra disks, and may allow hot spares, so that one of the disks can be replaced in case of failure while the array keeps running.


ranking

ordering the set of documents or items found by an information retrieval system in response to a query, usually in descending order of estimated relevance to the query

recall

a measure of how comprehensive or thorough an information retrieval system is, or behaves for a given query, computed as the ratio of number of relevant items retrieved to the total number of relevant items
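
Recall and precision (defined earlier in this glossary) share the same numerator and differ only in the denominator. A short sketch, with the function name and the zero-division convention chosen for the example:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the items the system returned
    relevant:  ids of the items judged relevant to the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 items retrieved, 6 relevant overall, 2 in common
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 5, 6, 7, 8})
```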

regular expression

a string following the rules of a regular language, used to describe a class of strings (that can be recognized by an FSA), allowing alternatives, specifying a sequence, and indicating number of occurrences (0, 1, any number, at least one)
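
For illustration, the constructs named above map onto the notation of Python's standard `re` module as follows. The pattern itself is an arbitrary example.

```python
import re

# (cat|dog)  alternatives
# s?         0 or 1 occurrence
# [0-9]*     any number of occurrences (including none)
# !+         at least one occurrence
# Writing the parts one after another specifies a sequence.
pattern = re.compile(r"(cat|dog)s?[0-9]*!+")


def matches(text):
    """True if the whole string belongs to the class described by the pattern."""
    return pattern.fullmatch(text) is not None
```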

relevance judgment

a decision made by a human as to whether a particular document is relevant to a particular query

search tree

any data structure that involves a tree and can be used to speed up search for an item or keyword, such as a trie or Patricia tree

SGML

Standard Generalized Markup Language, ISO standard 8879, published in 1986, a flexible system to describe and represent documents, actually a metalanguage to describe classes of documents through Document Type Definitions (DTDs) and then documents that are in those classes

signature file

a file, sometimes implemented using superimposed coding, in which a document or document block is described by a signature, usually a fairly long bit string, in which bits are set if some term in the block hashes to that bit location --- a conjunctive query can be processed by building a signature for the query, and then every document that satisfies the query is guaranteed to have a signature containing all the bits of the query signature (though other documents may also match, and need to be discarded)
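
A rough sketch of the scheme, assuming 64-bit signatures, three bits per term, and SHA-256 as a convenient source of bit positions. These parameters and all function names are choices made for the example, not part of any particular system.

```python
import hashlib

SIG_BITS = 64       # signature length in bits (assumed)
BITS_PER_TERM = 3   # bit positions set per term (assumed)


def term_bits(term):
    """Hash a term to a mask with up to BITS_PER_TERM bits set."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_TERM):
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIG_BITS
        mask |= 1 << position
    return mask


def signature(terms):
    """Superimpose (bitwise OR) the masks of all terms in a block or query."""
    sig = 0
    for term in terms:
        sig |= term_bits(term)
    return sig


def search(blocks, query_terms):
    """Conjunctive query: filter by signature, then discard false matches."""
    query_sig = signature(query_terms)
    candidates = [b for b in blocks if (signature(b) & query_sig) == query_sig]
    return [b for b in candidates if all(t in b for t in query_terms)]
```

The signature test can only produce extra candidates, never miss a block that truly contains all query terms, which is why the final verification step is enough to make the result exact.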

similarity measure

a method of estimating the similarity or "closeness" between two entities, such as two documents or a document and a query, where 0 represents none and higher values indicate more
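
The cosine of the angle between term-count vectors is one widely used measure of this kind. The sketch below assumes each entity is given as a list of terms; the function name is hypothetical.

```python
import math
from collections import Counter


def cosine(a_terms, b_terms):
    """Cosine similarity of two term lists: 0 means no shared terms, 1 means same direction."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```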

SMART

an experimental information retrieval system developed initially at Harvard University in the early 1960s and then continued through the 1990s at Cornell University, under the supervision of Gerard Salton

specificity

how precise or exact a term or indexing language is in its ability to describe

stemming (suffix stripping)

removing (usually automatically) the ending of a word, typically with a fast algorithm, to form a canonical representation that usually approximates the root form
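
A deliberately naive sketch of suffix stripping, assuming a short hand-picked suffix list and a minimum stem length. It is not the Porter algorithm or any other published stemmer, and it will mis-stem many English words.

```python
# Longer suffixes are listed before the shorter ones they end with.
SUFFIXES = ("ations", "ation", "ings", "ing", "ed", "es", "s")
MIN_STEM = 3  # do not strip if fewer than this many letters would remain


def stem(word):
    """Strip the first matching suffix, if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word
```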

stop word list

a list of words or terms that are excluded from indexing and searching, i.e., ignored as irrelevant, usually made up of function words or words that occur very often in a given collection
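
As a small illustration, a stop word list can be applied as a filter during tokenization. The list below is a made-up fragment, not a standard one.

```python
# Hypothetical, very short stop word list of English function words.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def index_terms(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]
```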

superimposed coding

a scheme for developing a signature for a block of text, i.e., a short record with bits set because terms in that block hash to their location, that allows rapid search for conjunctive queries, and usually does not find many records that have a suitable signature but do not satisfy the query

term

a word, word stem, keyword, root, phrase, acronym, abbreviation, descriptor, controlled vocabulary entry, thesaurus category or other construct meant to characterize some object or concept

term broadening

a process used by searchers or information retrieval systems to replace a single term with another or with a collection of terms that occur more often, and have wider or less precise coverage and/or meaning

term narrowing

a process used by searchers or information retrieval systems to replace a single term or phrase with another that occurs less often, and has narrower or more specific coverage and/or meaning

term weighting

a process of associating a value with a term, usually real-valued and possibly an estimate of a probability, that reflects the term's relative importance in a collection or document

TF (term frequency)

a weighting scheme usually used in information retrieval systems to rate the value of a term in a document based on the number of times it occurs in that document
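
In its simplest form the weight is the raw occurrence count, which the sketch below computes. Normalized or damped variants exist but are not shown, and the function name is hypothetical.

```python
from collections import Counter


def term_frequencies(text):
    """Count how many times each word occurs in one document."""
    return Counter(text.lower().split())


tf = term_frequencies("to be or not to be")
```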

thesaurus

an information structure listing words or other terms, along with relationships between them, such as: broader than, narrower than, cross reference to, synonym of

trie

a digital tree, in which a multiway branch occurs at each level, such as for the letters of the alphabet, where information entered is represented by the path from the root to a node (possibly leaf) marked as "final"
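
A compact sketch of the structure, assuming one branch per character and a boolean flag for the "final" mark. Class and method names are chosen for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # one branch per character
        self.final = False   # True if the path from the root spells an entered word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        """Follow or create the path for `word` and mark its last node final."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.final = True

    def contains(self, word):
        """True only if the path exists and ends at a node marked final."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.final
```

Note that "inform" can be a final node and an interior node at the same time once "information" is also entered, which is why the mark is stored on the node rather than implied by being a leaf.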

truncation

cutting off the (right) end of a word or term, such as when a searcher asks for "inform*" to locate all words with "inform" as a prefix
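
A sketch of how a system might expand such a request against its vocabulary, assuming a trailing "*" is the only truncation symbol. The function name and word list are invented for the example.

```python
def expand_truncation(pattern, vocabulary):
    """Return the vocabulary words matched by a right-truncated pattern such as 'inform*'."""
    if pattern.endswith("*"):
        prefix = pattern[:-1]
        return sorted(w for w in vocabulary if w.startswith(prefix))
    # No truncation symbol: only an exact match qualifies.
    return sorted(w for w in vocabulary if w == pattern)


vocabulary = {"inform", "informal", "information", "index", "conform"}
hits = expand_truncation("inform*", vocabulary)
```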

TULIP

The University Licensing Program --- an Elsevier project distributing 40 journals on materials in bitmap (page image) form

URL

defined by Tim Berners-Lee's 1993 IETF Draft "Uniform Resource Locators" --- describing a document or service on the internet as a string which identifies the protocol, server machine, and additional information (e.g., file path)
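
The parts named above can be pulled out of a URL string with Python's standard `urllib.parse` module. The address used here is a made-up example.

```python
from urllib.parse import urlsplit

# Hypothetical URL, for illustration only.
parts = urlsplit("http://www.example.org/docs/glossary.html")

protocol = parts.scheme   # the access protocol
server = parts.netloc     # the server machine
path = parts.path         # additional information, here a file path
```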

URN

Uniform Resource Name, which will identify a document or service, as does a URL, but in a location-independent, logical, robust manner

volatility

a measure of how rapidly a collection of information changes

WAIS

Wide Area Information Servers, originally developed using Z39.50, allowing client-server searching over the Internet, first of a collection of sources and then of actual information collections, usually involving a vector-space type search, often with relevance feedback

WWW

World-Wide Web, a logical infrastructure on the Internet in which documents and multimedia objects are linked, making use of HTTP, the HyperText Transfer Protocol, and represented in various forms including HTML, the HyperText Markup Language

Z39.50

the Information Retrieval Protocol, an ANSI and ISO standard for client-server computing between information retrieval systems, especially library catalogs (OPACs), adapted in WAIS

