Joint Study in Digital Preservation-Products • CLIR

The Cornell / Xerox / Commission on Preservation and Access Joint Study in Digital Preservation

Products

The first phase of the Joint Study was successfully concluded in December 1991 and substantially achieved the original goals of the Workplan of May 10, 1990. Products of the Joint Study that support these goals include the following:

A. the development, implementation, and testing of a Networked Scanning System for creating, storing, printing, and accessing digital images;

B. the creation of an Electronic Library consisting of the digital files for 950 deteriorating volumes;

C. the production of acceptable Paper Facsimiles to replace each of the original volumes by remote printing across the network;

D. the Cataloging of both the digital files and the paper facsimiles in the national and local on-line bibliographic databases;

E. the initial definition of the Document Control Structure to provide access points beyond basic bibliographic information;

F. the design of a Print Request Server that will enable researchers to obtain a print on demand version of any scanned volume;

G. the prototype implementation of Electronic Access to portions of the digital library from a distant workstation attached to the network;

H. an investigation of the feasibility of producing Microfilm directly from the digital files.

These products have contributed to the development of a scanning application suited to the preservation of research library materials. They are summarized in what follows. More details of the scanning and other processes are given in the following section on Process.

A. Networked Scanning System

A networked system for creating, storing, printing, and accessing an electronic library was developed and tested. This system allows for the distribution of various functions to a number of locations served by a high-speed network. Among other things, the system represents the first step in providing user access at remote locations to library materials.

Xerox Corporation and Cornell University have developed the College Library Access and Storage System (CLASS) to meet preservation reformatting needs. The CLASS system is designed as a network-compatible, distributed system composed of scanning workstations, which are based on IBM PC or PC-compatible computers running DOS, high-speed/high-resolution scanners, and application software; an optical storage server composed of a UNIX-based server with a locator database pointing to an optical jukebox; a print server consisting of a Xerox Docutech printer and network conversion server; a print request server running as an X-Windows application on a SUN Sparcstation; and a prototype viewing workstation running on an IBM PC under DOS and Microsoft Windows (clients for other environments are under development).

The transition from a standalone scanning workstation to a fully networked system has required a significant investment of time, money, and expertise. The system in place at the conclusion of this study meets the design goals of this architecture, although it is still in an early state, with ongoing changes being made to improve reliability. Xerox Corporation has prepared a project time-line highlighting the development of the various components of the CLASS System (Supplement I).

The architectural design of the networked system is predicated on a client/server environment in which the geographical proximity of system components, information, and reader is not important. The development of high bandwidth networks provides an opportunity to transfer large amounts of information quickly, meeting the needs of this application. The components of the system are distributed across Cornell’s campuswide TCP/IP network that forms part of the worldwide Internet. A representation of the system architecture is shown in Figure 1.

The system developed in this study creates files of bitmapped images that represent pages of books. These files are being stored on a large capacity optical jukebox. Until recently, the cost of storage for such files would have been prohibitive, but new information storage devices and larger capacity media make the use of digital image files possible, with every indication that technology costs will follow their historic decline well into the future.

An integral component of the study has been the use of the networked Xerox Docutech printer, a high-speed graphic printer that uses digital images to format printed pages. The network server that connects Docutech to the Cornell TCP/IP network accepts print requests for compressed images or encoded documents (e.g., ASCII, PostScript, Interpress). On-line Docutech finishing hardware offers binding and stitching options that support on-demand printing.

FIGURE 1. 1991 SYSTEM ARCHITECTURE

omitted

B. The Electronic Library: Image Capture and File Format

Digital versions of 950 books were created. The University intends to maintain and expand this embryonic electronic library and make it accessible to the broad national and international research community. The electronic library will be periodically copied to conform with newer technologies and standards. It will also serve as the basis for an experimental testbed by which further studies of storage, distribution, and access technologies may be evaluated.

At the end of this phase of the project, the electronic library contained approximately 285,000 digital files, each file representing a page of a book.[9] Each page was scanned as a bitmapped image and stored at a resolution of 600 dpi (dots per inch). The prototype scanner captures images using a complex scanning and interpolation scheme. The nominal scanning resolution of the scanner used in the Joint Study was 600 dpi. However, the definition of scanning resolution is not straightforward, and an explanation is required.

In fact, a 400 x 400 dpi aperture is used. A single scanline in the fast direction (across the platen) is sampled at 400 dpi, with 256 levels of grey (8 bits). This grey-level scanline is in turn sampled at 600 dpi. Thus, three 600 dpi samples are derived from two 400 dpi samples. The two end samples are directly converted to bits (a 1 or a 0) according to a thresholding algorithm; the two equal parts of the middle sample are averaged before thresholding. Thus, a single scanline 1/400th of an inch wide contains 600 bits (1’s or 0’s) per inch of its length.

In the long direction (along the platen), this process is repeated 600 times per inch. Thus, information from overlapping scanlines, each 1/400th of an inch wide, is obtained. The result is a 600×600 dpi bitmapped image. (See diagram, Appendix IV.)
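The two-into-three resampling described above can be illustrated with a minimal Python sketch. It is not the scanner's actual algorithm: the report does not specify the thresholding method, so the fixed cut-off of 128, the bit polarity, and the function names below are assumptions made only for illustration.

```python
from typing import List

# Assumed fixed cut-off; the report does not describe the thresholding algorithm.
THRESHOLD = 128


def to_bit(grey: float, threshold: int = THRESHOLD) -> int:
    """Return 1 if the grey value is at or above the threshold, else 0.

    The polarity (which grey values map to 1) is an assumption.
    """
    return 1 if grey >= threshold else 0


def resample_scanline(grey_400: List[int], threshold: int = THRESHOLD) -> List[int]:
    """Turn 8-bit samples taken at 400 dpi into 1-bit samples at 600 dpi.

    Each pair of input samples (a, b) yields three output bits:
    a thresholded, the average of a and b thresholded, and b thresholded.
    """
    if len(grey_400) % 2 != 0:
        raise ValueError("expected an even number of 400 dpi samples")
    bits: List[int] = []
    for i in range(0, len(grey_400), 2):
        a, b = grey_400[i], grey_400[i + 1]
        bits.extend([
            to_bit(a, threshold),
            to_bit((a + b) / 2, threshold),
            to_bit(b, threshold),
        ])
    return bits


# One inch of hypothetical grey data: 400 samples in, 600 bits out.
one_inch = [200, 40] * 200
print(len(resample_scanline(one_inch)))
```

Under these assumptions, every inch of a scanline (400 grey samples) produces 600 bits, which matches the ratio given in the text.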

The scanner used is under development by Xerox and is not yet generally available in the marketplace. It represents an effective compromise among speed, resolution, and quality. Although higher resolution scanners are available, they are too slow with today’s technology to be competitive in a production environment. No doubt, this will change with time.

The files were encoded in Aldus/Microsoft TIFF Version 5.0, which meets the new Internet Engineering Task Force standard definition for exchange of black and white images within the Internet.[10] TIFF (tagged image file format) provides a means for labeling a file so that it can be deciphered by application software, thus making it possible to exchange files among applications.[11] The files were compressed prior to storage and transmission using facsimile-compatible CCITT Group 4 compression.[12] Because the images are binary representations, the compression algorithms resulted in considerable storage and transmission economy as well as a lossless means for compressing and decompressing the files. In this study, image compression has resulted in an approximate 40:1 reduction in file size. Even compressed, however, the digital files are large. File size varies depending on the size of the page and its content. An average 6″ X 9″ page composed of black and white text requires approximately 50 kilobytes of storage when compressed. A page of similar size that contains a halftone image can yield a file as much as ten times as large.
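As a rough check on these figures, the following Python sketch works through the storage arithmetic for a 6″ X 9″ page scanned at 600 dpi with one bit per pixel. The function name is hypothetical, and the 40:1 ratio and 50-kilobyte average are the approximate values reported above, so the results are order-of-magnitude estimates only.

```python
def uncompressed_bytes(width_in: float, height_in: float, dpi: int = 600) -> int:
    """Size in bytes of a 1-bit-per-pixel bitmap of the given page size."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8  # eight 1-bit pixels per byte


raw = uncompressed_bytes(6, 9)   # 3600 x 5400 pixels
compressed = raw / 40            # applying the approximate 40:1 ratio

print(f"uncompressed: {raw:,} bytes")
print(f"compressed at 40:1: {compressed:,.0f} bytes")

# Whole-library estimate using the reported page count and average page size.
pages = 285_000
avg_page_bytes = 50_000          # "approximately 50 kilobytes", taken as decimal
print(f"library estimate: {pages * avg_page_bytes / 1e9:.2f} GB")
```

The uncompressed page comes to about 2.4 megabytes, and a 40:1 reduction gives roughly 60 kilobytes, the same order as the 50-kilobyte average cited. At that average, 285,000 pages would occupy on the order of 14 gigabytes.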

The files were stored on removable optical disks pending the development of software to store the images on the optical jukebox. At the end of Phase 1 covered by this Report, the images had not been fully transferred to the jukebox. Testing was in process.

C. Paper Facsimiles

Digital images were used to create hardcopy facsimiles for each of the volumes scanned in this project. The paper output is considered of sufficient quality and durability that the facsimiles serve as replacements for the deteriorating originals.

A primary goal of the Joint Study was to evaluate the paper output from the Xerox Docutech printer. The quality of the paper copy is very high: there is less than 1% variation in print size from the original; skew results only when the page trim is not parallel to text; front to back registration is reproduced within 1/100th of an inch of the original; the contrast between text and background is sharp; and the 600 dpi resolution compares favorably with the capture capabilities of photocopy. Illustrated material is exceptionally well rendered. As the copies are printed on paper that meets the ANSI standards for permanence, and the Docutech printer meets the machine and toner requirements for proper adhesion of print to page, the product is considered to be the archival equivalent of preservation photocopy.[13]

Library staff and faculty advisors evaluated the quality of the paper product. Their subjective approval was a critical factor in the decision to replace the rapidly self-destructing books with the paper facsimiles. In most cases the original volumes were discarded after being scanned. A discussion of the quality achieved in this project and a comparison with light-lens processes is found in Section V, Findings.

D. Bibliographic Access

The digital books were cataloged in Cornell’s local on-line catalog and in the Research Libraries Group’s national database (RLIN). Although existing cataloging conventions were followed, some modifications will be necessary if digital technology is to become an accepted preservation format. Issues still to be resolved include what additional technical information is required to facilitate access, how preservation information should be conveyed, and what links can be drawn between the catalog records and other forms of indexing to the digital book.

In determining how to represent the digital files, catalogers started from the premise that the digitized book is a preservation product analogous to a preservation microform, and its treatment in the catalog record should be parallel. Sample records and a report on cataloging considerations and instructions by Judith Brugger, Catalog Management and Authorities Librarian, are located in Supplement II. The use of digital technology as a preservation medium, however, is a new concept in cataloging. Unlike microfilm reproductions, computer files are not accorded a “preservation reproduction” status. For instance, Chapter 9, the computer file chapter of Anglo-American Cataloging Rules, 2nd Edition, assumes that all items being cataloged are originally produced in machine-readable format. Computer files for digital books, therefore, are generally considered new editions, rather than versions of the original volume.

This interpretation precludes the parallel treatment of computer files and microforms. In 1980, the Research Libraries Group, Inc., with funding from the National Endowment for the Humanities, developed modifications to RLIN that resulted in the current system capabilities to highlight microfilm generational information and to display institutional decisions to reformat particular titles (the so-called “queuing” function). The latter capability was designed to assist institutions in avoiding duplicative filming efforts. Unfortunately, in this study, Cornell was unable to “queue” records for titles to be scanned, nor were other institutions aware that a title had been preserved by Cornell when they searched for preservation replacements. Moreover, if paper replacements had not been prepared and cataloged, there would have been no RLIN record in the book file indicating the availability of paper copies and digital versions on demand. Records for the digital book appear only in the computer file of RLIN, which is not normally searched by an institution looking for a replacement for a brittle book.[14]

In the future, Cornell would like to see enhancements to the Research Libraries Group’s RLIN (Research Libraries Information Network) to record preservation information for material that has been reformatted using digital technology. These enhancements may be addressed by the reorganization and redefinition of certain data elements in the MARC record and by the movement toward format integration and/or a multiple versions approach.[15]

A cataloging issue still to be resolved involves the links to be drawn between the basic bibliographic record and other forms of indexing. The catalog record must carry information regarding the document structure file that accompanies the image files (see next section). Currently, both the call number field and the local notes field (USMARC 590) are reserved to record information on how to request a printed copy and ultimately to view the digital files directly. The means to assure a smooth transition from bibliographic record, to indexes, and finally to the electronic library has yet to be developed and tested.

E. Document Control Structure

A document structure is used to organize the individual images captured during the scanning process. It will also be used to provide direct access to components of the book. The arrangement of a physical book provides information to readers. For instance, the table of contents and the index are placed so that they can be easily found and used by any reader. The document structure file is designed to assist a reader in using an electronic version of the book.

Requirements for the document control structure have been defined and a prototype created for a number of books. Cornell recommends a collaborative process involving other libraries and consortia to define further the document control structure and to establish it in a standardized form for use in digital libraries of multiple institutions.

Because the digital images comprising a book are not text-searchable, there is a need to find easy ways for users to search and reference the major parts of each book to facilitate access. For example, the page numbers printed on the originals have to be incorporated into the document structure and correlated with the image file numbers so that a request to retrieve a particular page number recalls the image with that number printed on it.

The digital book as currently configured consists of two parts. First, individual pages are stored as a collection of discrete bitmapped images. Second, the document structure links the images into a single document. A database entry for each document also contains descriptive information such as the author, title, and document identification number. Further enhancements to the document structure will allow references to the major parts of the book, such as table of contents, chapters, indexes, and so forth.
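The two-part arrangement described above, page images plus a structure that links them and maps printed page numbers to image files, can be sketched in Python as follows. This is an illustrative data structure only, not the Xerox specification (which the report does not reproduce); the class, field, and file names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DigitalBook:
    """Hypothetical document structure: descriptive data plus ordered page images."""
    doc_id: str
    author: str
    title: str
    image_files: List[str] = field(default_factory=list)       # in reading order
    page_labels: Dict[str, int] = field(default_factory=dict)  # printed number -> image index

    def add_page(self, image_file: str, printed_label: Optional[str] = None) -> None:
        """Append an image; record its printed page number if the page carries one."""
        self.image_files.append(image_file)
        if printed_label is not None:
            self.page_labels[printed_label] = len(self.image_files) - 1

    def image_for_page(self, printed_label: str) -> str:
        """Return the image file that has the requested page number printed on it."""
        return self.image_files[self.page_labels[printed_label]]


book = DigitalBook(doc_id="doc-0001", author="Example Author", title="Example Title")
book.add_page("img0001.tif")          # unnumbered title page
book.add_page("img0002.tif", "iii")   # front matter in roman numerals
book.add_page("img0003.tif", "1")     # first numbered page of the text
print(book.image_for_page("1"))
```

Keeping the printed labels separate from the file order means that a request for page 1 returns the third image rather than the first, which is the correlation between page numbers and image file numbers that the text calls for.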

The creation and storage of the document structure are critical to the system design. Xerox Corporation has produced detailed specifications for the database and the software to implement the document structure architecture. At this time, these are described only in internal Xerox project reports.

Although a file exists with the elements defined to hold structure data, only the most basic structure information (the order of the files) has been collected. For the purposes of testing the print request server and the view station described below, complete document structure records for a small number of books were created. Cornell is continuing the process of software testing and development.

F. Print On Demand Access

The distributed design of the CLASS system has allowed Cornell to separate physically the scanning from the printing and storage functions. Cornell staff members are developing a print request server that will enable researchers to request from their offices printed copies of documents stored in the digital library. At the close of this phase of the study, individual requests are being handled by the scanning technicians, and the Print Request Server is undergoing initial testing.

A prototype print request server has been developed and tested that simulates the process of identifying relevant material and initiating a print request. Functional specifications for the print request server are located in Supplement III. For the print request server to be an integrated part of the distributed digital library, it must interact with several components of the CLASS system. Since Xerox is still in the process of developing some of these, Cornell decided to implement temporary alternatives for some database information. For instance, the request server is designed to communicate with the image server to retrieve the document structure. Only page-linking information has been recorded in the Document Control Structure. A fully functional request server also depends on the development of a complete Docutech job ticket to record information on requests for printing. As of this writing, not all job ticket information had been specified for use by the request server. Thus the request server is still a prototype, and work will continue on its development next year.

G. Electronic Access

A prototype view station which can be used from anywhere on the network provides electronic access to the digital library. The view station retrieves and displays electronic books stored in the digital library. Further work must be done to develop view station software that is suitable for use by library patrons. However, the feasibility of viewing books remotely has been established.

The prototype view station used in the study was developed by Xerox with design advice provided by a committee of Cornell librarians and computer professionals. The view station software represents the first step in providing a level of electronic browsing and retrieval at the desktop. The view station offers all the search, retrieval, and printing functions present on the scanning workstation, but without any updating or editing capabilities. Some enhancements were added to facilitate navigating through multiple documents.

Using the view station, a library patron can search the digital library by author and title, with the results being displayed in a window. Images of pages are then displayed, with readability a function of the page size of the original, the window size, and the screen resolution of the monitor. The images were found to be readable by most users. The option of enlarging the page to fill the screen and a zoom feature make it easier to read small text. Books originally consisting of pages no larger than 6″ X 9″ are easily read on screen. Larger texts, which must be scaled down to fit the screen, are more difficult to read without enlarging. In general, the quality of the on-screen image proved acceptable if screen viewing is used primarily for rapid browsing and retrieval. For extended reading, a print-on-demand request of the 600 dpi digital images provides a workable use copy.[16]
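The dependence of readability on page size and window size can be illustrated with a small Python calculation. The 1024 x 768 pixel window is a hypothetical figure, not a specification of the prototype view station, and the function name is an assumption.

```python
def fit_dpi(page_w_in: float, page_h_in: float, window_w_px: int, window_h_px: int) -> float:
    """Pixels available per inch of original page when the whole page is fitted to the window."""
    return min(window_w_px / page_w_in, window_h_px / page_h_in)


WINDOW = (1024, 768)  # hypothetical window size in pixels

for name, (w, h) in {"6 x 9 in": (6, 9), "8.5 x 11 in": (8.5, 11)}.items():
    dpi = fit_dpi(w, h, *WINDOW)
    print(f"{name}: about {dpi:.0f} pixels per inch of page, "
          f"{600 / dpi:.1f}x reduction from the 600 dpi scan")
```

In such a window, a full 6″ X 9″ page is shown at roughly 85 pixels per inch of the original, while an 8.5″ X 11″ page drops to about 70, which is consistent with larger pages being harder to read until they are enlarged or zoomed.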

H. Digital-to-Microfilm Feasibility

Microfilm can be produced directly from the digital files. The advantage in producing film is that it can serve as the preservation backup for an emerging technology. In the unlikely event the digital files were to become unreadable, the microfilm could be scanned and digitized at a fraction of the costs of initial capture. While preliminary experiments in this area were performed with promising results, the complementary roles of digital technology and microfilm require further examination.

Cornell has conducted preliminary tests to establish the feasibility of producing microfilm from the high resolution digital files. The first test, conducted in September 1991, was relatively modest. The digital images for several pages were transferred to magnetic tape and sent to Image Graphics, Inc. of Shelton, Connecticut for output onto microfilm using Image Graphics’ MICROGRAPHICS EBR SYSTEM 3000, an electron beam recorder.

Image Graphics successfully recorded the digital images at 10X reduction on 70mm, non-perforated KODAK Direct Electron Recording Film, SO-219, a film designed expressly for use in recorders that expose film by means of an electron beam brought to bear directly on the emulsion.[17] The company produced both negative and positive versions of the film. Density readings on the negative version averaged 0.95. The positive film was inspected at Cornell on a light box and a microfilm reader. The images contained in the test strip were crisp, with sharp contrast between text and background. More significantly, the quality of the resulting images was faithful to the quality obtained in the digital files as represented in the paper copies. While issues of quality control that center on film base, processing, and resolution were not evaluated, the results appear promising, especially for illustrated material where the potential to create a high quality reproduction favors digital technology.

In late fall 1991, the digital files for a 70-page volume that contains halftones and other illustrations were sent to Image Graphics to produce microfilm. The film was not completed in time for evaluation under this phase of the project,[18] but it will be subject to full technical and bibliographic inspection. A standard microfilm version for the same volume has been prepared for comparison purposes. Both copies will be evaluated, principally to determine how the digital film compares in quality and technical specifications to the light lens microfilm. Yale University’s proposed project to convert large quantities of preserved library materials from microfilm to digital images will provide valuable comparative data on the means, costs, and benefits involved.[19] The issue of image quality should also be studied carefully. Cornell will continue to investigate the process and costs of creating microfilm from the digital files.