Structured Glossary of Technical Terms

3.1. Preservation and Media Conversion Technologies

Many different technologies have been proposed to address the problems of preservation. These can be divided into three broad categories: those directed at preserving both the content and physical embodiment of the original, those directed at preserving the content and copying the physical embodiment, and those directed at preserving the content only, without concern for the physical embodiment. Conservation and paper deacidification fall into the first category. The remaining technologies described below fall into the other categories.

In the second category every effort is made to copy the physical embodiment or format of the original as faithfully as possible, normally onto another medium. The term media conversion technologies is thus used for this class (note: this does not exclude copying a paper document onto another paper document: media conversion has still occurred). Media Conversion includes photocopying (3.1.3), microform recording (3.1.4), and the use of electronic digitization techniques (3.1.5).

The third category makes no attempt to preserve or copy the physical embodiment of the original. For example, merely rekeying the text (see 3.2.8) of a document composed entirely of text preserves only content and nothing else if no attempt is made to capture font and other formatting information.

Among librarians, the term “reformatting” has traditionally been used for “media conversion.” The former term is not used in this Glossary because of possible confusion with the concept of Document Format (1.2). Furthermore, “reformatting” does not do justice to the concept of copying onto microform (3.1.4) or of digital scanning (3.1.5). 20

This necessarily brief glossary of different preservation approaches also summarizes some of the key issues involved in comparing the various alternatives.

3.1.1. Conservation Treatment 21

The treatment of a document to preserve it in its original form, in recognition that the original medium, format, and content are all important for research and other purposes. Pure conservation approaches are normally hand-tailored to the individual document and, as such, may be relatively expensive. Use is normally, therefore, limited to those situations where such expensive treatment is justified by the research requirements.

3.1.2. Paper Deacidification and Strengthening 22

The treatment by chemicals to stabilize a document (in paper, by alkalization to neutralize the acid content) and/or to strengthen it (in paper by the use of a support coating or by impregnation). The alkalization treatment also usually entails depositing an alkaline reserve to buffer against further acidification.

Deacidification or strengthening can be applied to individual documents or, with some treatment processes, to a large number of documents at once (mass or bulk deacidification). The latter is a relatively cheap approach, and pilot plants have been or are being established in a number of countries to support different processes. There is, however, no standard approach at this time even though there appear to be a number of promising alternatives. There are also a number of unanswered questions at this time regarding the longevity of chemical stabilization processes, toxicity, the feasibility of scaling processes to full production requirements, the potential continuing “offgassing” implications to patrons resulting from the storage of thousands of treated volumes in confined library spaces, and other issues. Recent research appears to be addressing many of these concerns.

Deacidification is essentially a stabilization process that arrests deterioration. It does not turn brittle books back to their original state, although coating or impregnation can strengthen the paper to extend its useful life. Its greatest utility may lie in arresting embrittlement in books that are not too far gone, or for prophylactic protection of new or old books that have not yet started to turn brittle. Deacidification may also “buy time” in anticipation of later preservation by other processes.

3.1.3. Photocopying

Photocopying refers to the process of preserving the document by making a full-size facsimile copy (usually bound similarly to the original) on archival (1.5.1) paper, creating a photographic copy of the images of the pages contained in the document, possibly using a photocopier (3.2.1). As used here, photocopying refers to an in-line process in which the original is scanned and one or more photocopies are made in a single pass. Unlike microform recording (3.1.4), the process does not automatically generate any retained intermediate form from which more copies can be made in the future. In actual practice, however, when photocopying is used for preservation it is customary to make a second photocopy that is retained in unbound form, so that further copies can readily be made in the future from this master copy.

A distinction is made between straight photocopying, which does not necessarily involve the use of archival paper (1.5.1), and preservation photocopying, which does require the use of archival paper.

The advantages of making such a facsimile are that normally a single paper facsimile is produced that is quite faithful to the original, no machine interface is required other than the photocopier itself, the medium (1.1) and format (1.2) of the original are retained, and the cost is usually less than that of other processes, particularly if the original is a monochrome document. Furthermore, library patrons prefer paper facsimiles to the use of, say, microforms (3.1.4), except where bulky documents, such as newspapers, are involved. One disadvantage, as compared with microform recording (3.1.4) and electronic digital preservation (3.1.5), is that second copies made from the master copy are normally of poorer quality than, say, prints of microforms made from master microforms. Furthermore, the cost of making subsequent copies is higher than the cost of printing microforms. Another disadvantage, shared to a greater or lesser extent with microforms, is that photocopying does not precisely reproduce all the information in the original, and there is some loss of information, especially for graphic objects (1.4.2.3) involving other than line art (1.4.2.3.1).

3.1.4. Microform Recording

Microform Recording refers to the process of preserving the document by filming the original document onto a microform film negative (1.1.2), that is, storing microimages of the pages or segments of the document on film. Positive film copies, which can be produced inexpensively, are made from this original film negative or master. Such a positive copy is both a storage (3.3) and distribution (3.5) technology, and is normally viewed using a microform reader (3.6.2.2), or paper positive prints may be made from the positive microform using printing devices designed for the purpose. Access to microfilm (1.1.2) using such a reader is serial (cf 3.3.1.6), whereas access to microfiche (1.1.2) is random (cf 3.3.1.6) like a book.

The advantages of microform are that the process is economically competitive with other processes; that film has a long useful life (3.3.5); and that microform copies–made from a second negative 23 (known as the printing master) copied from the original negative–may be made cheaply and distributed among other institutions, so that access is not limited to a single facsimile. Microform preservation is a well-tried, tested, and accepted method of preservation.

The disadvantages are that there is usually a loss of information in the recording process, particularly in recording continuous tone imagery (1.4.2.3.4), since the film used is usually of high contrast; 24 and that readers dislike using microform readers compared with, say, reading books.

Microform-preserved documents can subsequently be converted to other media besides paper. They can be scanned (3.2.3) and converted to digitally-encoded documents (3.1.5) to take advantage of the benefits of digital encoding for storage, distribution, and access. However, any loss of information in the original recording process will be perpetuated in the subsequent digital recording.

3.1.5. Electronic Digitization

Electronic Digitization refers to the capture of the document in electronic form through a process of scanning (see 3.2.3) and digitization. The scanned image is stored electronically, usually on magnetic (see 3.3.1.6.1 and 3.3.1.6.2) or optical (see 3.3.1.6.3 and 3.3.1.6.4) storage media. The electronically stored image may be further transformed for reasons such as compression (see 3.3.2) or information interpretation (see 3.3.3); and subsequently selected through the use of access technologies (see 3.4), distributed through the use of distribution technologies (see 3.5), or viewed through the use of presentation technologies (see 3.6).

When originally scanned, or as a result of subsequent transformations, the document may in whole or in part be stored in image (3.1.5.1), unformatted text (3.1.5.2.1), formatted text (3.1.5.2.2), or compound (3.1.5.3) form. The distinction is important insofar as it affects inter alia the extent to which information such as text in the scanned document may be interpreted (3.2.5, 3.2.6, 3.2.7) and used for purposes of information access (3.4, in particular 3.4.2, but see also 3.1.5.1, 3.1.5.2, 3.2.4). An image representation is an electronic pictorial representation composed of dots (black and white, greyscale, or color) much like a halftone (1.4.2.3.2) printed photograph, and no distinction is made between text and other information (such as graphs, pictures, and so forth) contained in the document. In other words, the letter “b” is not stored as a character per se, but as a “digital picture” of the letter “b”, and the series of numbers stored to represent the picture would be quite distinct among different typestyles used. Text representations, on the other hand, represent text as text, with a specific code used to denote the letter “b” independent of what typestyle is used.
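The difference between the two representations can be illustrated with a minimal Python sketch. The two 5x7 dot patterns are hypothetical hand-drawn glyphs, not taken from any real typeface, and the use of 1 for a black dot is an assumption made for the example.

```python
# Two hypothetical 5x7 dot patterns for the letter "b" in different
# typestyles. Assumption for this sketch: 1 = black dot, 0 = white dot.
GLYPH_B_PLAIN = [
    "10000",
    "10000",
    "11110",
    "10001",
    "10001",
    "10001",
    "11110",
]
GLYPH_B_BOLD = [
    "11000",
    "11000",
    "11110",
    "11011",
    "11011",
    "11011",
    "11110",
]


def image_bits(glyph):
    """Flatten a glyph into the series of numbers stored for its picture."""
    return [int(dot) for row in glyph for dot in row]


def text_code(character):
    """Return the single character code used by a text representation."""
    return ord(character)


plain_bits = image_bits(GLYPH_B_PLAIN)
bold_bits = image_bits(GLYPH_B_BOLD)

# The stored pictures differ between typestyles ...
print(plain_bits != bold_bits)
# ... whereas the character code is the same whatever the typestyle.
print(text_code("b"))
```

The image form stores 35 numbers per glyph even at this very coarse resolution, while the text form stores one code.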

Image representations cannot be searched for words or phrases: text representations can. Image representations of text may be converted into formatted or unformatted text representations using OCR (3.2.4) or ICR (3.2.5) techniques, but with loss of accuracy. In the context of preservation, image representations are likely to dominate, since the cost of transforming image into text representations with sufficient accuracy may be prohibitively high, at least in the immediate future. Thus full-text searching, for example, is not likely to be a feature of digitally-preserved documents. This is unlike the situation that exists with documents where the text already exists in digital electronic form, such as if the publisher had preserved the original tapes used in typesetting.

If and when OCR and ICR techniques are able to convert image format to text format with sufficient accuracy and performance, then the archives of digitally-preserved material in image format can be converted to text format using those techniques (3.2.4, 3.2.5), provided the original material was scanned with sufficiently high resolution (3.2.3). Furthermore, promising research has been done recently on the searching of documents for retrieval purposes using the “corrupted” (erroneous) text derived from the OCR or ICR scanning of image documents at existing levels of OCR/ICR accuracy and performance.
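One simple way to picture retrieval over “corrupted” text is approximate word matching. The sketch below uses Python's standard `difflib` module as a stand-in technique; it is not the method used in the research mentioned above, and the sample string and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib


def fuzzy_find(query, ocr_text, cutoff=0.8):
    """Return the words of ocr_text that approximately match query.

    cutoff is a similarity threshold between 0 and 1 (an assumed value).
    """
    words = ocr_text.lower().split()
    return difflib.get_close_matches(query.lower(), words, n=3, cutoff=cutoff)


# Hypothetical OCR output with recognition errors ("rv" read as "w",
# the letter "o" read as the digit "0").
corrupted = "presewation of brittle b0oks"

matches = fuzzy_find("preservation", corrupted)
print(matches)
```

An exact search for “preservation” would miss this passage, while the approximate match still finds the misrecognized word.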

The advantage of electronic digitization is that it potentially combines the advantages of photocopying and microform recording while eliminating some of the disadvantages. Paper facsimiles can be produced at will by printing-on-demand (3.5.4) on paper (or writing the appropriate signals on whatever might be the appropriate output medium, in the case of video, film, or sound), thus eliminating the need for awkward microform readers. Alternatively, the stored images can be reconstructed and viewed at computer workstations (3.6.2.6). Furthermore, the stored digital images can be distributed essentially at will across data networks (3.5.5) for sharing among institutions. The content of the stored images can also be interpreted (3.2.5, 3.2.6, 3.2.7) at any time after recording (whenever it might become economically desirable to do so) for purposes of, say, creating indices for access purposes (3.4.1).

Another key advantage is the robustness of digital encoding. Further copies, including copies made in new formats (3.3.3) on other digital electronic storage media (3.3.1.6) for purposes of extending the useful life of the digital copy (see Introduction and 3.3.5), can be made without loss of information, as contrasted with photocopying (3.1.3) or microform recording (3.1.4). Furthermore, scanned images can be digitally enhanced (3.2.9) to improve the image quality.
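The claim that digital copies can be made without loss of information can be checked mechanically. The following minimal sketch copies a short byte sequence through ten generations and compares checksums; the use of SHA-256 from Python's `hashlib` is an assumption made for illustration, not a practice described in this Glossary.

```python
import hashlib


def copy_generations(data, generations):
    """Copy a byte string repeatedly, as if migrating between storage media."""
    current = data
    for _ in range(generations):
        # Build a fresh bit-for-bit copy of the previous generation.
        current = bytes(bytearray(current))
    return current


# A hypothetical three-byte fragment of a stored image.
original = bytes([0b10110010, 0b00001111, 0b11110000])
tenth_copy = copy_generations(original, 10)

# Identical checksums indicate that no information was lost.
same = hashlib.sha256(original).digest() == hashlib.sha256(tenth_copy).digest()
print(same)
```

By contrast, each generation of a photocopy or microform copy is an analog reproduction, so no equivalent bit-for-bit check exists.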

The disadvantages are that this is a new and relatively untried technology, and the cost and other trade-offs are uncertain at this time. There are also concerns about the useful life (3.3.5) of present storage media, both in terms of the physical properties of the media and in terms of the robustness of the recording format (3.3.3) and of the means of access. Some, however, take the view that it will be both functionally and economically imperative in any event to recopy the data from storage medium to storage medium every few years to take advantage of the rapidly declining storage costs and increasing storage capacities of the technology, and that the useful life of a given medium is not the relevant issue (see Introduction and 3.3.5).

3.1.5.1 Image Document

A representation of the document image is electronically captured (usually with the aid of a digital image scanner; see 3.2.3) or created without interpretation of its actual content. This is stored as a sequence of 1’s or 0’s (known as bits), a “digital photograph” as it were. In certain image representations, a “1” indicates “black” and a “0” indicates “white” (Binary Encoding), but usually the representation is encoded in more complex representations (see 3.3.4 Encoding Method). In some representations, for example, the average grey level of a small area of the page, termed a “pixel”, is encoded (Greyscale Encoding; see also 1.4.1.1.2). Such a pixel is a grey dot. The number of dots per inch is termed the pixel resolution. This pixel resolution may range from 100 per inch to several thousand per inch.
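The effect of pixel resolution and encoding on storage can be worked through with a little arithmetic. The sketch below assumes a hypothetical 8.5 x 11 inch page scanned at 300 dots per inch, and 8 bits per pixel for greyscale; none of these figures comes from this Glossary, and compression (3.3.2) is ignored.

```python
def image_size_bits(width_in, height_in, dots_per_inch, bits_per_pixel):
    """Return the uncompressed size in bits of a scanned page."""
    pixels_wide = round(width_in * dots_per_inch)
    pixels_high = round(height_in * dots_per_inch)
    return pixels_wide * pixels_high * bits_per_pixel


# Binary encoding: one bit per dot (1 = black, 0 = white).
binary_bits = image_size_bits(8.5, 11, 300, 1)
# Greyscale encoding: an assumed 8 bits per pixel (256 grey levels).
grey_bits = image_size_bits(8.5, 11, 300, 8)

print(binary_bits // 8, "bytes for the binary-encoded page")
print(grey_bits // 8, "bytes for the greyscale-encoded page")
```

Doubling the resolution quadruples the number of pixels, so storage grows with the square of the pixel resolution.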

It is not unusual, for reasons of storage economy, to convert a greyscale-encoded image document into a binary-encoded image document of higher resolution at the time an image document is stored. Compression techniques (3.3.2) are used to achieve this. The resultant stored image represents a compromise between scanning resolution, image fidelity, and storage space.
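One way such a conversion might be carried out is ordered dithering, in which each grey pixel is replaced by a small cell of binary dots. The sketch below is an illustration under stated assumptions, not necessarily the technique the text has in mind: it uses a standard 2x2 Bayer pattern, and it treats grey values as darkness from 0 (white) to 255 (black).

```python
# Ranks of a standard 2x2 Bayer ordered-dither pattern. Each grey pixel
# becomes a 2x2 cell of binary dots, doubling the resolution.
BAYER_2X2 = [[0, 2],
             [3, 1]]


def grey_to_binary(grey_rows):
    """Convert rows of darkness values (0 = white, 255 = black) to binary dots.

    Returns rows of 1 (black) and 0 (white) at twice the input resolution.
    """
    out = []
    for row in grey_rows:
        for dy in range(2):
            out_row = []
            for darkness in row:
                for dx in range(2):
                    # Each position in the cell has its own threshold.
                    threshold = (BAYER_2X2[dy][dx] + 0.5) * 255 / 4
                    out_row.append(1 if darkness > threshold else 0)
            out.append(out_row)
    return out


# One row of three grey pixels: white, mid-grey, black.
binary = grey_to_binary([[0, 128, 255]])
for dots in binary:
    print(dots)
```

Under these assumptions each 8-bit grey pixel is stored as 4 bits, and the mid-grey pixel becomes a cell in which half of the dots are black.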

The electronically-encoded sequence of 1’s and 0’s that represent an Image Document is also known as a Bitmap.

Image Documents are generally accessed by associating an index entry, such as a page number, with a segment of the Image Document. See discussion following under 3.1.5.2 regarding other issues associated with searching and retrieving Image Documents.

3.1.5.2 Text Document

Only the text of the document is captured as character representations, that is, each alphabetic character has a unique representation (see discussion above) following a standard means of encoding, such as the ASCII standard. With electronic digital storage, a character representation generally takes far less space than the same character stored in image form. Usually, each character representation of a letter of, say, the Roman alphabet takes 8 bits (1 byte) of storage space. When stored in image form, the representation may take several orders of magnitude more storage space, depending upon the size of the character, the scanning resolution, and the degree of compression (see 3.3.2) used. See also 3.3.4.2.
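The storage comparison can be made concrete with a rough calculation. The character dimensions (0.1 x 0.15 inch) and the 300 dots per inch resolution below are assumptions chosen for illustration, and the image is taken to be binary-encoded and uncompressed.

```python
TEXT_BITS_PER_CHARACTER = 8  # one byte per character, as stated in the text


def char_image_bits(width_in, height_in, dots_per_inch, bits_per_pixel=1):
    """Return the uncompressed bits needed to store one character as an image."""
    pixels_wide = round(width_in * dots_per_inch)
    pixels_high = round(height_in * dots_per_inch)
    return pixels_wide * pixels_high * bits_per_pixel


# A hypothetical character cell of 0.1 x 0.15 inch scanned at 300 dots per inch.
image_bits_300 = char_image_bits(0.1, 0.15, 300)
ratio = image_bits_300 / TEXT_BITS_PER_CHARACTER

print(image_bits_300, "bits as an image versus", TEXT_BITS_PER_CHARACTER, "as text")
print("ratio:", ratio)
```

Higher resolutions or greyscale encoding raise the ratio further, while compression lowers it.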

Storing a document as a text document facilitates full-text or partial-text retrieval (see 3.4.2), where documents or parts of documents can be selected and retrieved by searching for the occurrence of keywords or strings of text. This is not possible with Image Documents (3.1.5.1), unless they have been wholly or partially converted to Text Documents using Optical Character Recognition (OCR) techniques (3.2.4, 3.2.5), a process that is not sufficiently accurate for most preservation purposes (see, however, 3.2.4 for a discussion of the use of such techniques for the construction of indices).
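Full-text retrieval over Text Documents can be sketched in a few lines. The document identifiers and contents below are hypothetical, and a simple case-insensitive substring search stands in for whatever retrieval method a real system would use.

```python
def find_documents(documents, keyword):
    """Return the identifiers of documents whose text contains the keyword."""
    needle = keyword.lower()
    return [doc_id for doc_id, text in documents.items()
            if needle in text.lower()]


# Hypothetical collection of Text Documents keyed by identifier.
documents = {
    "doc-1": "Deacidification arrests the deterioration of paper.",
    "doc-2": "Microform copies may be made cheaply.",
}

hits = find_documents(documents, "microform")
print(hits)
```

No comparable search can be run directly on a bitmap, because the stored bits carry no character codes.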

3.1.5.2.1 Unformatted Text

The character representation of the text contains no information to indicate font style, font size, or page layout. In this sense, unformatted character text representations are an example of irreversible compression (see 3.3.2.3).

3.1.5.2.2 Formatted Text

The character representation of the text also contains sufficient information to describe one or more of font type, font size, or page layout. In this sense, formatted text may, if the document segment contains only textual material, represent a form of reversible compression (see 3.3.2.2).
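The sense in which unformatted text is irreversible can be shown with a small sketch. The run structure (text, font, size) and the font names are assumptions made for illustration, not a format defined in this Glossary.

```python
def strip_formatting(formatted_runs):
    """Reduce formatted text (text, font, size runs) to unformatted text."""
    return "".join(text for text, _font, _size in formatted_runs)


# Two hypothetical formatted versions of the same words.
version_a = [("Preservation", "Times", 12), (" matters", "Times", 12)]
version_b = [("Preservation", "Helvetica", 18), (" matters", "Times", 10)]

plain_a = strip_formatting(version_a)
plain_b = strip_formatting(version_b)

# Both versions collapse to the same unformatted text, so the original
# font and size cannot be recovered from it.
print(plain_a == plain_b)
```

Because different formatted originals yield identical unformatted text, the discarded formatting cannot be reconstructed, whereas the formatted runs retain it.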

3.1.5.3 Compound Document

The document is captured as a combination of image and formatted or unformatted text.

3.1.6. Rekeying of Text

Rekeying of Text refers to a preservation technology where the text in a document is literally reentered by hand into a composition or other device for republication or reproduction purposes, often with the use of a digital computer. See also 3.2.8.

3.1.6.1 Unformatted Text

In the rekeying of the text, no attempt is made to key sufficient information to indicate font style, font size, or page layout.

3.1.6.2 Formatted Text

In the rekeying of text, information is captured to indicate one or more of font style, font size, or page layout.

3.1.7. Reprinting or Republication

The document is preserved by producing a new edition or reprint, possibly by reprinting from retained intermediate forms of the document, such as reprinting a book from photocomposition tapes. Alternatively, the document may be recreated from scratch.