The Cornell / Xerox / Commission on Preservation and Access Joint Study in Digital Preservation
In addition to the development of a scanning system, the study resulted in the adoption of a process that applies digital scanning technology to the preservation of, and access to, library materials. This process parallels in many respects that used in preservation microfilming or photocopying projects.
Material representing a wide range of subjects was selected for this study. The first 535 volumes came from Cornell’s Mathematics Library, representing the period 1850-1916. These materials were chosen for a number of reasons: the Cornell mathematics collection is especially strong; mathematics monographs from 1850 on are considered of current scholarly interest; the material is in poor physical condition and had been identified as one of the library’s highest preservation priorities; potential users are technically sophisticated (2/3 of Cornell’s Math faculty have Sun or equivalent workstations); the material falls outside of copyright restrictions; and very few libraries nationwide have strong retrospective holdings in this subject area. Further, the mathematics faculty had determined that these books had to be replaced in paper form, so that many of the volumes had been scheduled for preservation photocopy.
The mathematics monographs chosen included the works of significant authors and those individual titles that have contributed substantially to the development of the discipline. All were in need of preservation, and at the time of selection, had not been reprinted or microfilmed. A faculty advisory committee reviewed the selections made by the Mathematics Librarian and his assistant. The advisory committee also assisted in the evaluation of the quality of the paper copy produced from the digital files.
Cornell bibliographers, representing the sciences, social sciences, humanities, and various area collections, selected the remaining 415 volumes. They chose items primarily on the basis of deteriorating condition, and sought a selection representative of the cross section of materials typically found in research libraries. Items that were covered by copyright, or that had been microfilmed or recently reprinted, were excluded.
Selection decisions were also guided by the limitations of the prototype scanning system and by adherence to current compression standards. For instance, all items had to be disbound and trimmed, and the page size could not exceed 8 1/2 x 11 inches, including the dimensions of foldouts.
Two project staff technicians performed the actual scanning as well as all pre- and post-scanning functions. They collated each item to ensure completeness, repaired torn pages, and ordered replacements for missing pages through interlibrary loan. Annotations and marginalia that did not obscure text were left intact. Bibliographers decided whether the technicians should attempt to capture or delete this information during scanning.
The volumes were then disbound and the binder’s edge trimmed parallel to the text. The scanning technicians prepared a worksheet for each volume, recording bibliographic information, physical description, document control structure information, the scanning settings used, and basic workflow. For the last three months of the project, they also recorded time spent on the various scanning functions.
C. Set Up
Prior to actual scanning, the technicians performed a variety of setup functions, using the CLASS software in quality control mode. These included keying in primary bibliographic data, defining the page size and page trim, establishing front-to-back registration, and scanning sample pages for on-screen review to identify a default range of settings for the entire volume. Scanning settings included choosing an image type (line art, photo, or halftone); setting the brightness level (density); adjusting the background setting (for paper that is yellowed or colored); and selecting filters, screens, and Tone Reproduction Curves (TRCs).
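As an illustration only, the per-volume default settings described above could be modeled as a small record with page-level overrides. This is a minimal sketch: the class names, field names, and value scales below are hypothetical and do not reflect the actual CLASS software, whose interface is not described in this report.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Optional


class ImageType(Enum):
    """The three image types named in the report."""
    LINE_ART = "line art"
    PHOTO = "photo"
    HALFTONE = "halftone"


@dataclass(frozen=True)
class ScanSettings:
    """Hypothetical per-volume defaults; names and scales are assumptions."""
    image_type: ImageType
    brightness: int                  # density level; the numeric scale is assumed
    background_suppression: bool     # for yellowed or colored paper
    filter_name: Optional[str] = None
    screen_name: Optional[str] = None
    trc_name: Optional[str] = None   # Tone Reproduction Curve


def override_for_page(defaults: ScanSettings, **changes) -> ScanSettings:
    """Return a copy of the volume defaults with page-specific changes applied."""
    return replace(defaults, **changes)


# Volume defaults chosen during setup, then adjusted for one illustrated page.
volume_defaults = ScanSettings(
    image_type=ImageType.LINE_ART, brightness=0, background_suppression=True
)
plate_settings = override_for_page(volume_defaults, image_type=ImageType.HALFTONE)
```

Keeping the defaults immutable and deriving overrides from them mirrors the workflow in the text: one set of settings per volume, with manual adjustment only for unusual pages.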
The image display window enabled technicians to preview each scanned page on the screen at 600 dpi resolution, although in production this was normally done for only a few pages of each book, to determine standard production settings for the entire book. Highly illustrated texts containing halftone images, however, required more manual intervention to adjust the settings.
The final step in setup involved the scanning of the production note that is reproduced in every book. The production note describes the scanning process, identifies the paper used for printing, and serves as notice of Cornell’s copyright of the digital files.
D. Production Scanning
Once setup was complete, the technicians moved into production and scanned rapidly, performing little on-screen inspection for the rest of the book. The quality control windows on the scanning workstation were closed to improve response time, and the technicians concentrated on scanning: raising the platen, positioning each page, lowering the platen, and pressing the scan button. Technicians occasionally confirmed image placement (right-hand or left-hand page) by checking the position of the page in the book icon on the monitor. Very little quality control was required during production scanning, especially for books that were largely textual and printed in a consistent manner. If technicians came upon unusual material (an illustration or a very faint page), they returned to the quality control mode to check the on-screen image and to make any necessary adjustments in the settings.
In production mode, technicians could scan at a rate of about 5 pages per minute. Total production times, however, had to allow for setup and other factors. Since the pages of the scanned books are brittle, automatic document feeders are not used; however, Cornell plans to experiment with automatic feeders on books that have already been scanned to assess the degree of brittleness that can be tolerated. The use of such feeders may be realistic for certain classes of books that are (a) minimally brittle, and (b) available from other libraries, so that replacement pages may be obtained in the event a page from the original book is destroyed by a document feeder.
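The production rate above gives a rough lower bound on scanning time per volume. The following is a minimal sketch of that arithmetic; the function name and the 300-page volume are hypothetical, and the result excludes setup, quality control, and rescans, which the study treats separately.

```python
PAGES_PER_MINUTE = 5.0  # approximate production-mode rate reported in the study


def production_scan_minutes(page_count: int,
                            pages_per_minute: float = PAGES_PER_MINUTE) -> float:
    """Estimate production scanning time only (no setup or quality control)."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if pages_per_minute <= 0:
        raise ValueError("pages_per_minute must be positive")
    return page_count / pages_per_minute


# A hypothetical 300-page volume takes about an hour of production scanning.
minutes = production_scan_minutes(300)
print(f"{minutes:.0f} minutes")  # prints "60 minutes"
```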
To print a volume, the scanning technicians initiated a command to transmit the digital files over the Cornell TCP/IP network for printing. Transmission time averaged 6 to 10 seconds per page, depending on the file size. The delay was caused not by the network but by the time taken to transfer the files from the local disk, a result of the particular disk technology used. About midway through the study, Xerox provided software that made the printing command a background task, so that digital books could be queued for printing after working hours. This enhancement led to an increase in scanning productivity.
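The per-page transmission figures suggest why queuing print jobs after hours helped. As a rough sketch, the range of 6 to 10 seconds per page can be scaled to a whole volume; the function name and the 300-page example are hypothetical and not drawn from the study.

```python
from typing import Tuple


def transmission_minutes(page_count: int,
                         seconds_per_page: Tuple[float, float] = (6.0, 10.0)
                         ) -> Tuple[float, float]:
    """Return (low, high) estimates, in minutes, for sending a volume to the printer."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    low, high = seconds_per_page
    return (page_count * low / 60.0, page_count * high / 60.0)


# A hypothetical 300-page volume would tie up the workstation for 30 to 50 minutes.
low, high = transmission_minutes(300)
print(f"{low:.0f} to {high:.0f} minutes")  # prints "30 to 50 minutes"
```

At that scale, a foreground print command would block scanning for a substantial part of a shift, which is consistent with the productivity gain the report attributes to background printing.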
Since rescans for quality control reasons were few, printing was done directly on paper meeting ANSI permanence standards.
F. Quality Control and Rescans
For quality control, the scanning technicians found it easier and more reliable to inspect the paper version, rather than to view the on-screen images. The paper copy was inspected page by page for completeness, order, legibility, and, by direct comparison, fidelity to the original. Any missing pages or those deemed unacceptable were rescanned from the original copy. The rescan rate was under one percent of all copies made. Until the volume was proofed and a final version accepted, the digital files were stored on the local hard disk of the scanning workstation.
G. Binding, Cataloging, and Shelving the Paper Replacement
All printing was done on Gilbert Neu Tech twenty-five percent rag, alkaline paper. The final paper version was bound with a one and a half inch binder’s margin, using the double fan adhesive method of leaf attachment and a full cloth binding by a local book bindery. Catalogers referred to the bound volume and the project worksheet that contained information on the digital files in creating bibliographic records. The completed volume was then sent to the stacks to replace the original book, which in most cases was withdrawn from the library.
H. Storing and Accessing the Digital Files
Once a satisfactory facsimile had been produced to replace the original, the final versions of the digital files were transferred to optical disks. During the phase of the project when the jukebox was not yet available, technicians transferred the digital files to local, removable 5.25″ optical disks using a disk drive attached to the scanning workstation. The scanning technicians manually reloaded these disks to retrieve the digital files for subsequent printing and viewing. As with after-hours printing, copying to optical disk became a background function made possible by a software upgrade.
Once the optical jukebox becomes available, the final versions will be transferred directly to it.
I. Technology Refreshing
The networked document imaging system used as the foundation of this project relies on emerging technologies. The obsolescence of formats and software access tools associated with a rapidly changing technology is a concern not only to those prepared to use the technology but to the preservation community as well. Preservation using this medium will therefore entail recopying the files on a regular schedule set well below the medium’s expected longevity, a process that has become known as “refreshing.” Software programs are also becoming available that check the stability of a disk every time it is used.
Our cost study suggests that the costs of refreshing are likely to be offset by space savings as compression and storage capabilities improve. Cornell is committed to a process that will continually “refresh” our digital library so that each volume is copied every four years. Technology refreshing will be done at the University by the Information Technology department rather than the library. To rely on a unit outside the library to maintain actual collections is something of a departure in library practice, although not completely without precedent, in that libraries often rely on their information technology colleagues to maintain and refresh the on-line catalogs. Costs for this process are included in our cost study. Procedures to ensure that this work is done are being developed. The responsibility for ensuring that the digital library will serve the interests of library users in the future depends on an institutional commitment in the present.
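A four-year refresh cycle can be tracked with a simple due-date check per volume. The sketch below assumes that each volume's last copy date is recorded somewhere; the function names and the example dates are hypothetical and are not part of the procedures under development at Cornell.

```python
from datetime import date

REFRESH_INTERVAL_YEARS = 4  # copy cycle stated in the report


def next_refresh(last_copied: date,
                 interval_years: int = REFRESH_INTERVAL_YEARS) -> date:
    """Return the date by which a volume's digital files should be recopied."""
    target_year = last_copied.year + interval_years
    try:
        return last_copied.replace(year=target_year)
    except ValueError:
        # February 29 has no counterpart in a non-leap target year.
        return last_copied.replace(year=target_year, day=28)


def is_due(last_copied: date, today: date,
           interval_years: int = REFRESH_INTERVAL_YEARS) -> bool:
    """True when the refresh date has been reached or passed."""
    return today >= next_refresh(last_copied, interval_years)


# A volume last copied on 1 June 1992 (hypothetical date) falls due on 1 June 1996.
print(next_refresh(date(1992, 6, 1)))  # prints "1996-06-01"
```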
A program of maintenance also includes regular backup of the digital files, a process familiar to central university information technology organizations. Optical disks can be duplicated on additional optical disks or other electronic media, which can be stored in a separate, climate-controlled location. But the backup costs can be high, especially if one must also refresh the additional copies on a regular schedule. Of course, the paper facsimile provides a form of backup. As previously noted, Cornell is also investigating the use of microfilm produced from the digital files as a backup.