Glossary
The following are some terms with definitions or explanations. More terms and definitions will be added.
- Boolean queries
- formal representation of a question or information need using AND, OR, or NOT to connect terms
- CD-DA
- compact disc digital audio (i.e., standard music CD)
- CD-ROM
- compact disc read only memory (storing about 600 Mbytes of digital data in similar format to a music CD)
- clustering
- grouping similar items together to form clusters whose centroid or representative characterizes the group
- CODER
- COmposite Document Expert/extended/effective Retrieval system developed at Virginia Tech
- controlled vocabulary
- a fixed terminological set from which indexing and query terms are selected
- CSCW
- computer supported cooperative work --- group work supported with collaboration technology
- digital library
- a collection of digital representations of information content, along with hardware, software, and personnel to support the functions of a traditional library plus knowledge worker operations like searching, browsing, and navigation
- digital tree
- a hierarchical organization of data where at each level there is a multiway branch, e.g., 10-way so each digit of a number can determine the next step in a path from the root
- distance function
- a function that computes the distance between a pair of items, e.g., d(d1,d2) with properties:
- d(d1,d2) = 0 if d1=d2
- d(d1,d2) = d(d2,d1)
- d(d1,d3) <= d(d1,d2)+d(d2,d3)
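The three properties listed for a distance function can be checked numerically on a small sample. The following is a minimal sketch, assuming Euclidean distance as the example function; the names `euclidean` and `satisfies_properties` and the sample points are illustrative and do not come from this glossary.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def satisfies_properties(d, items, tol=1e-9):
    """Check identity, symmetry, and the triangle inequality on sample items."""
    for a, b, c in product(items, repeat=3):
        if a == b and abs(d(a, b)) > tol:
            return False  # d(d1,d2) = 0 if d1=d2 is violated
        if abs(d(a, b) - d(b, a)) > tol:
            return False  # d(d1,d2) = d(d2,d1) is violated
        if d(a, c) > d(a, b) + d(b, c) + tol:
            return False  # d(d1,d3) <= d(d1,d2)+d(d2,d3) is violated
    return True


# Hypothetical sample points, chosen only for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
print(euclidean(points[0], points[1]))
print(satisfies_properties(euclidean, points))
```

A check like this can only refute the properties on the sample given; it does not prove that a function satisfies them in general.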
- document
- an article, book, or other work, typically containing text or other media, that has some type of information content
- E measure
- a single-valued measure of the effectiveness of an information retrieval system (with 0=best, 1=worst), which is a function of both recall and precision, as well as a factor that determines the relative importance between these
- Engelbart
- Doug E. was the inventor of the mouse and other early interactive technologies who first demonstrated a powerful hypertext system working as a CSCW tool, and led work on Augment and NLS, to augment human knowledge and skills
- Envision
- a system developed at Virginia Tech in connection with the NSF-funded, ACM-supported project "A User-Centered Database from the Computer Science Literature" 1991-1995
- exhaustivity
- measure of the degree to which the content of a collection is "covered", typically used to describe a controlled vocabulary
- faceted classification
- a system for categorizing information in which different aspects or facets are separately considered
- flat file
- a component of a file system or entry on a storage device, that is treated as having no special structure beyond that of bytes, characters, words and/or lines
- FSA (finite state automaton)
- an abstract machine made up of states (including a special "start" or "initial" one as well as one or more "final" states) where one takes a state-state transition if the input token matches that for the transition --- that can recognize a regular language and so is equivalent to a regular expression --- often used for document analysis
- filtering
- producing output by restricting input according to some criteria --- in connection with text, images, speech, electromagnetic waves or signals
- Guide
- a hypertext system marketed by OWL (Office Workstations Ltd.) that includes scrolling and note capabilities
- hashing
- computing an address to look for an item by applying a mathematical function to a key for that item
- HTML (HyperText Markup Language)
- an application of SGML, defined by a simple Document Type Definition developed in 1993, that is used for tagging documents on the World-Wide Web, which can then be rendered with viewers like Mosaic or Netscape
- HTML+ (HyperText Markup Language - extended)
- an extended version of HTML, proposed in 1994, adding extra elements such as for interactive forms
- HTTP (HyperText Transfer Protocol)
- a standard used as the basis of the World-Wide Web for communication between clients and servers, proposed in 1993, that allows for retrieval of data and following of hypermedia links
- HyperCard
- a hypertext/hypermedia system developed by Apple, provided free of charge with new systems in 1987 and then sold by Claris, which implements a card-based model derived from Xerox NoteCards, and uses an object-oriented scripting language called HyperTalk
- hypermedia
- a collection of information objects or nodes in multimedia formats with links (i.e., hypertext extended to multimedia)
- hypertext
- a term coined by Theodor Nelson for a collection of information objects or nodes, containing text (and sometimes other multimedia formats in which case it is often called hypermedia), with links, that thus serves as an information graph that can be traversed by a hypertext system, which can present each node and follow links from anchors in nodes to other nodes (at which time the target node is also presented) --- information with a nonlinear organization
- Hypertext Compendium
- an ACM Database and Electronic Products offering that includes most of the early (through 1990) publications on hypertext, available in ASCII, using KMS, or in HyperCard form
- Hypertext on Hypertext
- an ACM Database and Electronic Products offering that includes the articles appearing in the July 1988 CACM special issue on hypertext, available in KMS, HyperCard, and HyperTies forms
- HyTime
- ISO standard 10744, describing the structure of time-based hypermedia documents
- indexing
- the process of building an index, such as when a collection of text documents is analyzed to automatically identify its words or word stems, which are then recorded and made to point to locations in the collection where they occur
- indexing language
- the set of terms used during indexing, possibly all the words in a collection, or a fixed set of terms found in a controlled vocabulary or thesaurus, possibly including phrases or other more complex forms
- IDF (inverse document frequency)
- a weighting formula used in some information retrieval systems whereby the importance of a term is based on the reciprocal of its document frequency in the collection; for example log (N/n) when the term occurs in n documents from a collection of N documents
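The log (N/n) form of IDF given above can be computed directly. This is a small sketch, assuming each document is represented as a set of terms; the function name `idf`, the sample collection, and the choice to return 0.0 for a term that occurs nowhere are all assumptions made for illustration.

```python
import math


def idf(term, documents):
    """Inverse document frequency log(N/n) for a term over a list of term sets."""
    total = len(documents)                               # N
    containing = sum(1 for doc in documents if term in doc)  # n
    if containing == 0:
        return 0.0  # assumption: an unseen term gets weight 0 rather than an error
    return math.log(total / containing)


# Hypothetical four-document collection.
docs = [
    {"digital", "library"},
    {"digital", "video"},
    {"library", "catalog"},
    {"hypertext"},
]
print(idf("hypertext", docs))  # rare term: log(4/1)
print(idf("digital", docs))    # more common term: log(4/2)
```

Terms that occur in few documents receive a larger weight than terms that occur in many.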
- INQUERY
- INQUERY Information Retrieval System (U. Mass. Amherst), which implements a probabilistic model based on use of a Bayesian inference network
- inverted file
- a file structure in which words or other terms used to index a collection of information are connected with a list of pointers to the locations where those words occur --- the inverted form of documents containing terms, where terms point to document (occurrences)
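The inverted file structure described above can be sketched as a mapping from each word to a list of (document, position) pointers. This is a minimal in-memory illustration; `build_inverted_file` is a hypothetical name, whitespace tokenization is an assumption, and a real system would store the lists on disk.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each word to the (document number, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Assumption: lowercase the text and split on whitespace to get words.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


# Hypothetical three-document collection.
docs = ["hypertext links nodes", "digital library", "hypertext library"]
index = build_inverted_file(docs)
print(index["library"])
print(index["hypertext"])
```

Looking up a word then costs one dictionary access instead of a scan of every document.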
- JPEG
- Joint Photographic Experts Group, ISO/CCITT standard for compressing still images (grayscale or color), available in lossless form for roughly 3:1 compression or in lossy form for 10:1 or more compression using the discrete cosine transform (DCT), coefficients based on the frequency response of the Human Visual System, a zig-zag run-length sequencing, and Huffman or arithmetic coding
- KMS
- hypertext/hypermedia system for expert users, computer-supported collaborative work, implementing a 2-frame/window model with a powerful scripting language
- Licklider
- J.C.R. Licklider was author of Libraries of the Future (1965), a director at ARPA involved in early funding of the ARPAnet, and director of MIT's Project MAC in the late 1960s-70s
- MARC
- a record format developed by the Library of Congress for library catalogs, that can describe an individual book, journal, or other work, using a collection of fields and subfields
- MARIAN
- an experimental OPAC retrieval system developed starting in 1990 at Virginia Tech Computing Center, first used as an alternative to searching with the VTLS system, but also used as the search component of the Envision digital library system
- memex
- an imaginary device described by Vannevar Bush in his seminal article "As We May Think" in the July 1945 Atlantic Monthly, implementing hypertext-style associative linking between documents and images, described using microform technology
- MIDI
- Musical Instrument Digital Interface
- MIME
- Multipurpose Internet Mail Extensions (multimedia mail protocol defined by RFC 1521)
- MJPEG
- Motion JPEG, a video compression scheme in which each frame is separately compressed using the JPEG standard
- Mosaic
- an Internet application used to browse and navigate on the World-Wide Web, that can render documents provided in HTML, follow links among such documents, use HTTP as well as other protocols (e.g., gopher, FTP, UUCP), and manipulate multimedia information carried using the MIME standard
- MPEG
- Moving Picture Experts Group --- digital video standard
- natural language
- a language used by humans to communicate, e.g., Chinese, English, Farsi, French, Hindi, Russian, Zulu
- natural language text-search
- a method of searching text collections in which user queries are supplied as natural language texts or at least phrases or word strings, usually involving the vector-space or probabilistic model of partial matching
- Nelson
- Theodor N. coined the terms hypertext and hypermedia, was a great proponent of these ideas, worked at Brown on some of the early systems, and proposed and worked toward Xanadu, a universal system for shared hypertext publishing and editing
- netlib
- a software system developed at Bell Labs by Dongarra and Grosse for searching of numerical analysis information, including algorithms and code
- NII
- National Information Infrastructure, the framework for U.S. efforts in the information industry, electronic publishing, and high-performance computing and communication (HPCC)
- NREN
- the National Research and Education Network, an evolving U.S. network to support the research and education community, building upon the NSFNET
- NSFNET
- an expansion of the ARPANET to serve the NSF community, leading toward the larger future NREN
- OPAC
- online public access catalog --- an automated system to allow searching in library catalogs
- paperless society
- a vision proposed by F.W. Lancaster and others in which electronic publishing and communication would largely eliminate the need for paper
- PAT
- a system developed at the Univ. of Waterloo at the Centre for the New OED, later taken over by Open Text Corp., which supports dictionary, SGML collection, and other types of searching, using a Patricia tree representation to give very rapid response to queries involving strings or phrases
- Patricia tree
- a data structure, somewhat like a trie, but implemented as a binary digital tree, where every semi-infinite string (sistring) from a large string (the concatenation of all text in a collection) is entered in the tree, and is associated with a pointer to the start of the sistring
- plasticity
- a property of electronic information in that it can be easily reshaped, republished, reused because it is in a manipulable digital representation
- precision
- a measure of how precise or specific an information retrieval system is, or behaves for a given query, computed as the ratio of the number of relevant items retrieved to the total number of items retrieved
- query
- a formal representation of a search need or anomalous state of knowledge (ASK, a la Belkin), that can be processed by an information retrieval system
- RAID
- Redundant Array of Inexpensive Disks --- a method of combining several relatively cheap (e.g., SCSI-2) disks into a single unit where the disks operate in parallel to give higher throughput. Thus, data may be striped across the disks so playback or recording runs at the sum of the transfer speeds of the disks. Some levels allow for hot spares so that extra disks keep error correction data that allows one of the disks to be replaced in case of failure while the array keeps running.
- ranking
- ordering the set of documents or items found by an information retrieval system in response to a query, usually in descending order of estimated relevance to the query
- recall
- a measure of how comprehensive or thorough an information retrieval system is, or behaves for a given query, computed as the ratio of number of relevant items retrieved to the total number of relevant items
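Precision and recall, as defined in this glossary, are both ratios over the same set of relevant items retrieved. The sketch below computes them for one query and combines them into an E value. The E formula used here is the commonly cited van Rijsbergen form, E = 1 - (1 + b^2)PR / (b^2 P + R); that this is the form intended by the E measure entry is an assumption, as are the function names and the sample data.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query from item identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def e_measure(precision, recall, b=1.0):
    """Assumed van Rijsbergen form: 0 is best, 1 is worst; b weights recall vs. precision."""
    if precision == 0.0 or recall == 0.0:
        return 1.0
    return 1.0 - ((1 + b * b) * precision * recall) / (b * b * precision + recall)


# Hypothetical query: 4 items retrieved, 5 relevant overall, 2 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10])
print(p, r)
print(e_measure(p, r))
```

With b = 1 the two ratios are weighted equally; larger values of b give more weight to recall.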
- regular expression
- a string following the rules of a regular language, used to describe a class of strings (that can be recognized by an FSA), allowing alternatives, specifying a sequence, and indicating number of occurrences (0, 1, any number, at least one)
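The constructs named above (alternatives, sequence, and counts of 0, 1, any number, or at least one) map onto the operators of most regular expression libraries. A brief sketch using Python's standard `re` module follows; the pattern itself is an arbitrary example, not one taken from this glossary.

```python
import re

# (ab|c)  alternatives: "ab" or "c"
# d*      any number of "d" (including none)
# e+      at least one "e"
# f?      zero or one "f"
# Writing the parts one after another specifies a sequence.
pattern = re.compile(r"(ab|c)d*e+f?")

for candidate in ["abef", "cddde", "abdf", "abeff"]:
    # fullmatch requires the whole string to belong to the described class.
    print(candidate, pattern.fullmatch(candidate) is not None)
```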
- relevance judgment
- a decision made by a human as to whether a particular document is relevant to a particular query
- search tree
- any data structure that involves a tree and can be used to speed up search for an item or keyword, such as a trie or Patricia tree
- SGML
- Standard Generalized Markup Language, ISO standard 8879, published in 1986, a flexible system to describe and represent documents, actually a metalanguage to describe classes of documents through Document Type Definitions (DTDs) and then documents that are in those classes
- signature file
- a file, sometimes implemented using superimposed coding, in which a document or document block is described by a signature, usually a fairly long bit string, in which bits are set if some term in the block hashes to that bit location --- a conjunctive query can be processed by building a signature for the query, and then every document that satisfies the query is guaranteed to have a signature that matches that of the query (though other documents may match as well and need to be discarded)
- similarity measure
- a method of estimating the similarity or "closeness" between two entities, such as two documents or a document and a query, where 0 represents none and higher values indicate more
- SMART
- an experimental information retrieval system developed initially at Harvard University in the early 1960s and then continued through the 1990s at Cornell University, under the supervision of Gerard Salton
- specificity
- how precise or exact a term or indexing language is in its ability to describe
- stemming (suffix stripping)
- removing (usually automatically) the ending of a word, typically with a fast algorithm, to form a canonical representation that usually approximates the root form
- stop word list
- a list of words or terms that are excluded from indexing and searching, i.e., ignored as irrelevant, usually made up of function words or words that occur very often in a given collection
- superimposed coding
- a scheme for developing a signature for a block of text, i.e., a short record with bits set because terms in that block hash to their location, that allows rapid search for conjunctive queries, and usually does not find many records that have a suitable signature but do not satisfy the query
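Superimposed coding can be sketched as OR-ing together a few hashed bit positions per term, then testing whether every bit of the query signature is also set in a block's signature. The sketch below is illustrative only: the 64-bit signature width, three bits per term, and the use of `zlib.crc32` as the hash are all assumptions, and the function names are hypothetical.

```python
import zlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_TERM = 3    # assumed number of bits set per term


def term_bits(term):
    """Bit mask for one term, built from several deterministic hash values."""
    mask = 0
    for i in range(BITS_PER_TERM):
        h = zlib.crc32(f"{i}:{term}".encode("utf-8"))
        mask |= 1 << (h % SIGNATURE_BITS)
    return mask


def signature(terms):
    """Superimpose (bitwise OR) the masks of all terms in a block."""
    sig = 0
    for term in terms:
        sig |= term_bits(term)
    return sig


def search(blocks, query_terms):
    """Conjunctive search: signature test first, then discard false matches."""
    query_sig = signature(query_terms)
    candidates = [i for i, block in enumerate(blocks)
                  if (signature(block) & query_sig) == query_sig]
    # A candidate may match by coincidence of bits, so verify against the terms.
    return [i for i in candidates if set(query_terms) <= set(blocks[i])]


# Hypothetical blocks of index terms.
blocks = [["digital", "library"], ["hypertext", "links"],
          ["digital", "video", "library"]]
print(search(blocks, ["digital", "library"]))
```

In practice the block signatures would be computed once and stored, so that a query only needs the cheap bitwise test on most blocks.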
- term
- a word, word stem, keyword, root, phrase, acronym, abbreviation, descriptor, controlled vocabulary entry, thesaurus category or other construct meant to characterize some object or concept
- term broadening
- a process used by searchers or information retrieval systems to replace a single term with another or with a collection of terms that occur more often, and have wider or less precise coverage and/or meaning
- term narrowing
- a process used by searchers or information retrieval systems to replace a single term or phrase with another that occurs less often, and has narrower or more specific coverage and/or meaning
- term weighting
- a process of associating a value with a term, usually real-valued and possibly an estimate of a probability, that reflects the term's relative importance in a collection or document
- TF (term frequency)
- a weighting scheme usually used in information retrieval systems to rate the value of a term in a document based on the number of times it occurs in that document
- thesaurus
- an information structure listing words or other terms, along with relationships between them, such as: broader than, narrower than, cross reference to, synonym of
- trie
- a digital tree, in which a multiway branch occurs at each level, such as for the letters of the alphabet, where information entered is represented by the path from the root to a node (possibly leaf) marked as "final"
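A trie as described above can be sketched with one node per character and a "final" flag on nodes that end an entered word. The class and method names below are hypothetical, and the branching is stored in a dictionary rather than a fixed array per letter, which is a simplification. The prefix lookup also illustrates how a truncated search such as "inform*" could be answered.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # multiway branch: one child per next character
        self.final = False  # True if the path from the root spells an entered word


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.final = True

    def _walk(self, text):
        """Follow the path for text from the root; None if it leaves the trie."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.final

    def with_prefix(self, prefix):
        """All entered words that start with prefix, in sorted order."""
        start = self._walk(prefix)
        if start is None:
            return []
        results, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.final:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for w in ["inform", "information", "informal", "index"]:
    trie.insert(w)
print(trie.contains("info"))
print(trie.with_prefix("inform"))
```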
- truncation
- cutting off the (right) end of a word or term, such as when a searcher asks for "inform*" to locate all words with "inform" as a prefix
- TULIP
- The University Licensing Program --- Elsevier, 40 bitmap journals on materials
- URL
- defined by Tim Berners-Lee's 1993 IETF Draft "Uniform Resource Locators" --- describing a document or service on the internet as a string which identifies the protocol, server machine, and additional information (e.g., file path)
- URN
- Uniform Resource Name that will identify a document or service, as does a URL, but in a location-independent, logical, robust manner
- volatility
- a measure of how rapidly a collection of information changes
- WAIS
- Wide Area Information Server, originally developed using Z39.50, allowing client-server searching over the Internet, first of a collection of sources and then of actual information collections, usually involving a vector-space type search, often with relevance feedback
- WWW
- World-Wide Web, a logical infrastructure on the Internet in which documents and multimedia objects are linked, making use of HTTP, the HyperText Transfer Protocol, and represented in various forms including HTML, the HyperText Markup Language
- Z39.50
- the Information Retrieval Protocol, an ANSI and ISO standard for client-server computing between information retrieval systems, especially library catalogs (OPACs), adapted in WAIS