Introduction to Computer Text File Formats

The way a document can be displayed and printed is determined by the format in which the file is stored. Several standards have been proposed by governments and companies. This document briefly describes some attributes of the most popular formats. The reader should be aware that many more formats exist than are discussed here, and new ones are frequently proposed as standards.

The lowest common denominator of text file formats is the American Standard Code for Information Interchange (ASCII). There are actually two forms of ASCII: standard and extended. Standard ASCII contains codes for only 128 characters (i.e. a 7-bit binary code). It is transportable across virtually all networks and can be read and manipulated on virtually all computers. Extended ASCII is a non-standard format containing codes for 256 characters (i.e. an 8-bit, or 1-byte, code). The first (lower) 128 codes of extended ASCII are just the standard ASCII character codes. The second (upper) 128 character codes are machine dependent: each hardware vendor defines the codes for its own platform, and the characters assigned to the codes can differ from font (character typeface) to font. They are typically used for purposes such as specifying special graphic characters, extended device control (printer, modem, etc.), and application-specific file format codes.
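
The difference between the lower and upper halves can be illustrated with a short Python sketch. It is only an illustration: the three code tables named here (cp437 for DOS, mac_roman for the Macintosh, latin_1) are examples chosen for this sketch and are not discussed in the text above, and the byte value 0x82 is an arbitrary pick from the upper 128 codes.

```python
def is_standard_ascii(data: bytes) -> bool:
    """Return True when every byte fits in 7 bits (codes 0-127)."""
    return all(byte < 128 for byte in data)


# One byte from the upper half, interpreted under three different
# 8-bit code tables (the tables are assumptions made for this sketch).
upper_byte = bytes([0x82])
interpretations = {
    name: upper_byte.decode(name)
    for name in ("cp437", "mac_roman", "latin_1")
}

print(is_standard_ascii(b"plain text"))
print(is_standard_ascii(upper_byte))
for name, char in interpretations.items():
    print(name, repr(char))
```

The same stored byte yields a different character under each table, which is why only the lower 128 codes can be relied upon when a file moves between platforms.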

Even transferring plain text files between different systems can present problems. The most common problem is the way in which different systems represent the end of a line. The Macintosh stores a carriage return character (ASCII code 13) at the end of each line, whereas Unix uses a linefeed character (ASCII code 10). DOS does it a third way, ending each line with a carriage return followed by a linefeed.
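
A minimal Python sketch of how the three conventions might be detected and converted follows. The function names are hypothetical, and the sketch assumes the text is handled as raw bytes so that no automatic translation of line endings takes place.

```python
CR = 0x0D  # carriage return, ASCII code 13
LF = 0x0A  # linefeed, ASCII code 10


def detect_line_ending(data: bytes) -> str:
    """Guess the convention from the first line terminator found."""
    for i, byte in enumerate(data):
        if byte == CR:
            # A CR immediately followed by LF is the DOS pair.
            if data[i + 1:i + 2] == b"\n":
                return "DOS (CR+LF)"
            return "Macintosh (CR)"
        if byte == LF:
            return "Unix (LF)"
    return "none found"


def convert_line_endings(data: bytes, target: bytes = b"\n") -> bytes:
    """Rewrite every line ending in data as target."""
    # Collapse CR+LF first so the pair is not counted as two line breaks.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", target)


print(detect_line_ending(b"first\r\nsecond\r\n"))
print(convert_line_endings(b"first\r\nsecond\rthird\n"))
```

When reading or writing such files from a program, opening them in binary mode keeps the original bytes intact until the conversion is applied deliberately.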

The PostScript file format was developed by Adobe Systems Inc. It was designed as a typesetting language for specifying high-quality page layouts. There are actually several forms of PostScript; only two are briefly mentioned here: PostScript and Encapsulated PostScript. The driving force behind the development of the PostScript language (see the sample file) was the need for a device driver language for laser printers. It has since developed into a display language as well and is defined in two language levels, Level 1 and Level 2. Encapsulated PostScript (EPS) is simply an extension of the PostScript language to achieve machine independence. The aim was to allow documents to be transportable across hardware platforms without any loss of document data, structure, or display information. This goal is only now being realized through the development of the Adobe Acrobat system, which will be discussed later in the course. EPS files are typically much larger than the corresponding PostScript files, since they include font table information and PostScript functions to aid portability. For example, the sample document above required 30K (about 30,000 characters) of storage for the PostScript version and over 900K (about 900,000 characters) for an EPS version.
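
To give a feel for what a page description looks like, the following Python sketch writes out a very small PostScript program that places lines of text on a page. It is not the sample file mentioned above; the function names, the font, and the page coordinates are assumptions made for illustration, and the output uses only basic operators (findfont, scalefont, setfont, moveto, show, showpage).

```python
def ps_escape(text: str) -> str:
    """Escape the characters that are special inside a PostScript string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def make_postscript_page(lines, font="Helvetica", size=12):
    """Return a one-page PostScript program that prints the given lines."""
    # PostScript units are points (72 per inch). Start one inch from the
    # left edge and one inch below the top of an assumed 792-point-tall page.
    x, y = 72, 720
    leading = size + 2  # vertical distance between baselines
    out = ["%!PS", f"/{font} findfont {size} scalefont setfont"]
    for line in lines:
        out.append(f"{x} {y} moveto ({ps_escape(line)}) show")
        y -= leading
    out.append("showpage")
    return "\n".join(out) + "\n"


print(make_postscript_page(["Text File Formats", "ASCII (standard and extended)"]))
```

Every position and font choice is stated explicitly in the output, which is what lets a printer reproduce the layout exactly.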

Rich Text Format (RTF) was developed by Microsoft Corp. for many of the same reasons as PostScript. It was not, however, designed as a full-featured typesetting language. As such, it does not store enough layout information about a document to ensure WYSIWYG portability across platforms. Most word processors support RTF (the Open or Save dialogs usually provide options for specifying the document type). RTF (see the sample file), unlike PostScript, is a text format only. A document containing graphics that is converted to RTF and then re-opened as an RTF document may lose the original document graphics.
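
As a rough illustration of the kind of markup an RTF file contains, the Python sketch below builds a minimal RTF document from a list of paragraphs. It is not the sample file referred to above; the function names and the default font are assumptions, and the sketch assumes the text contains only standard ASCII characters.

```python
def rtf_escape(text: str) -> str:
    """Escape the three characters that have special meaning in RTF."""
    return text.replace("\\", "\\\\").replace("{", "\\{").replace("}", "\\}")


def make_rtf(paragraphs, font="Times New Roman", half_points=24):
    """Return a minimal RTF document (assumes ASCII-only paragraph text)."""
    # The header names the format version, the character set, and one font.
    header = "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 " + font + ";}}"
    # \fs takes the font size in half-points, so 24 means 12 point.
    body = "".join(
        f"\\pard\\f0\\fs{half_points} {rtf_escape(p)}\\par\n" for p in paragraphs
    )
    return header + "\n" + body + "}"


print(make_rtf(["Rich Text Format", "Braces {like these} must be escaped."]))
```

The output records fonts and paragraph breaks but says nothing about where text falls on the page, in contrast to a PostScript page description.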

HyperText Markup Language (HTML) is the format underlying World-Wide Web (WWW) documents, which WWW browser programs such as Mosaic display and traverse. HTML is a document structure language. As such, it does not provide operations for a writer to specify exactly how a document should be displayed; only the outline or structure may be indicated.

Any HTML document viewed with Mosaic, Netscape Navigator, or Internet Explorer can be viewed in raw HTML form (i.e. with the HTML formatting codes not interpreted). By choosing View source from the File menu (View menu if using Netscape), or by choosing Save as ... from the File menu and setting the saved file type to HTML, one can look at the HTML document structure codes. The HTML format provides operation codes for linking (connecting) documents together. This is the basis for hypertext, and it forces a non-linear method of information organization upon the author. A writer must continually think about the relationships that exist in their information and how best to present those relationships to readers. HTML is actually an application of a more general and powerful document specification language, the Standard Generalized Markup Language (SGML). An extended presentation of HTML, with regard to writing HTML documents for publication on the Internet, will be covered in the Network Hypermedia section of the course.
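
The structure codes and link codes described above can be seen by reading raw HTML with a program. The Python sketch below uses the standard library's html.parser module to collect the headings and the link targets from a small document; the sample markup, the class name, and the file names in the links are invented for this illustration.

```python
from html.parser import HTMLParser

HEADING_TAGS = ("h1", "h2", "h3")


class OutlineParser(HTMLParser):
    """Collect heading text and link targets from an HTML document."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.links = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # The href attribute names the document being linked to.
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in HEADING_TAGS:
            self._in_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.headings[-1] += data


# Hypothetical sample document; the linked file names are made up.
SAMPLE = """<html><head><title>Formats</title></head>
<body>
<h1>Text File Formats</h1>
<p>See the <a href="ascii.html">ASCII notes</a> and the
<a href="postscript.html">PostScript notes</a>.</p>
</body></html>"""

parser = OutlineParser()
parser.feed(SAMPLE)
parser.close()
print(parser.headings)
print(parser.links)
```

The markup marks what each piece of text is (a heading, a paragraph, a link) and leaves the choice of fonts and spacing to the browser.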

Further Exploration

Common Internet File Formats


Author: N. Dwight Barnette
Curator: Computer Science Dept : VA TECH © Copyright 1994.
Last Updated: 5/25/96