The Cornell / Xerox / Commission on Preservation and Access Joint Study in Digital Preservation
1. Digital image technology, for the purposes of this report, is defined as the electronic copying of scanned documents in image form. The text contained in these images is not converted to alphanumeric representation at the time of scanning, although the potential exists for such conversion, in whole or in part, from the digital files at some later time. The present capabilities of optical character recognition are inadequate for capturing both the information and the presentation of the original page, which is critical when replacing rapidly self-destructing books, especially when one considers the vast number of languages, illustrations, type faces, and printing techniques present in the collections of modern research libraries. The creation of digital images does not preclude the use of OCR capabilities. In fact, it represents the first step in that direction–the scanning of paper copies to which character recognition can then be applied. See for instance: Stephen Smith and Craig Stanfill, “An Analysis of the Effects of Data Corruption on Text Retrieval Performance,” (Thinking Machines Corporation, Cambridge, MA: December 14, 1988).
4. Contrary to the frequently expressed concern about the longevity of the physical storage medium itself, it is the obsolescence of standards, formats, and access software tools that is of greatest concern. The physical medium will normally long outlive these considerations.
5. This Report covers the period of the project ending December 31, 1991. Subsequent to this date, Cornell project staff have verified that digitally-produced microfilm produced by this project does not match microfilm preservation standards. This is not surprising given the scanning resolution. However, such microfilm may nevertheless be adequate for preserving texts produced at 4 point type and larger. In addition, early experiments suggest that halftone images can be scanned with resultant quality superior to that normally obtained with most production microfilming processes. Quality issues will be discussed in subsequent reports.
6. The current national preservation program to preserve brittle material is based on the replacement of originals with copies that faithfully capture their intellectual content, including text, illustrations, and presentation. In order to preserve the largest number of items possible, the time spent in copying material should occur just once and should result in the production of a print master that can be used to make subsequent copies at lower costs. Information about the availability of copies should be widely publicized and included in the national on-line bibliographic databases. Finally, a preservation master of the original should be stored and maintained in a manner that will guarantee its long-term availability.
7. For instance, in the field of mathematics from which over half of the materials were selected, users “object to the inconvenience of microfilm, especially for monographs…Hardcopy reformatting (through photocopying) of older monographs is the preferred way to provide access in many libraries.” Constance C. Gould and Karla Pearce, Information Needs in the Sciences: An Assessment, (Mountain View, CA: Research Libraries Group, Inc., 1991), pp. 65-68.
8. For a Xerox Corporation perspective on the importance of co-development, see William Anderson, William Crocca, and Steven Barley, “Customer Co-Development: The Cornell/Xerox Joint Study Project Interim Report,” PARC Technical Report SSL-91-139.
11. Digital files must be created in a manner that provides users with instructions on how to gain access to the information contained in them. It is one thing to store information on a disk, and another to gain access to it. Material can not be considered preserved if one can not “read” it. Thus a file must contain documentation on its format. Though there are many competing file formats, TIFF is in wide use. Unfortunately there are multiple TIFF formats, but a committee currently exists to address this issue. Today TIFF comes close to representing an industry standard. Aldus Corporation and Microsoft Corporation, “Tag Image File Specification Revision 5.0” (Aldus/Microsoft Technical memorandum, August 1988).
13. Norvell M.M. Jones, Archival Copies of Thermofax, Verifax, and Other Unstable Records, National Archives Technical Information Paper No. 5 (Washington: National Archives and Records Administration, 1990). ANSI Standard Z39.48-1984, currently being revised, covers the requirements for permanent/durable paper. See also RLG Preservation Manual (1986) and the Reproduction of Library Materials (ALA) draft photocopy guidelines of the Subcommittee on Preservation Photocopying Guidelines. The guidelines currently available for preservation photocopying place greater emphasis on image stability and paper permanence than image quality.
14. Cornell did prepare a Preservation Scope Note for the mathematics material which appears in the RLIN Conspectus. Preservation Scope Notes provide RLG and individual institutions with information about large preservation projects, both in progress and completed, to assist in the planning and coordination of preservation activities.
17. The film emulsion layer is unusually thin and characterized by extremely fine grains and a relatively high silver to gel ratio; the support is ESTAR base, a clear 4-mil polyester film. Based on discussions with technical experts at Kodak and University Microfilms, it appears that the archival properties of the S0-219 are questionable. Image Graphics is investigating the use of Image Link film for subsequent tests.
19. Donald J. Waters, From Microfilm to Digital Imagery. On the feasibility of a project to study means, costs, and benefits of converting large quantities of preserved library materials from microfilm to digital images (Washington: The Commission on Preservation and Access, 1991).
20. The selection process is described by Steven Rockey in “The Cornell-Xerox-CPA Project to Digitally Reformat Books,” paper presented to the AMS/MAA Joint Mathematics Meetings, Baltimore, MD, January 8-11, 1992. A bibliography of the mathematics books preserved in this project is included as Appendix VII. A bibliography of all volumes scanned in this project can be prepared by conducting a search on RLIN using the Series Note (“CXJSP”), and downloading the on-line records.
22. It is anticipated that as data exchange standards are developed and implemented, the time between refreshing will increase from four years to ten years and beyond. See for instance, Charles M. Dollar, “The Impact of Information Technologies on Archival Principles and Practices: Some Considerations,” Draft Version 16, November 15, 1990, pg. 63.
23. This study investigated the quality achieved with binary scanning only. Depending on the object being scanned, grey scale or color scanning may be superior, and the advantages/disadvantages of the various approaches need to be examined. Scanning resolutions and file formats can represent a complex tradeoff between time, file size, fidelity, on-screen display, printing, and equipment availability. The study had as a primary emphasis the production of printed facsimiles that were largely black and white text in a timely and cost-effective manner. With binary scanning, large files may be compressed efficiently and in a lossless manner using CCITT Group IV Facsimile compression algorithms. Grey scale compression, using JPEG, is much less economical and is “lossy,” which may make it inappropriate as a preservation method. It appears that while binary files produce a high quality printed version, other combinations of spatial resolution with grey and/or color will also be adequate. Grey scale can offer an advantage for on-screen viewing. For instance, on a low resolution screen display, two bits of grey at 100 dpi may be more readable than 600 dpi or 300 dpi binary. The advantage is lost, however, when the on-screen image is enlarged. The quality associated with binary or grey scale is also dependent on the equipment used; for instance, binary scanning produces a better paper copy when it is printed on a binary printer. See Michael Ester, “Image Quality and User Perception,” LEONARDO Digital Image, Digital Cinema Supplemental Issue, (1990) pg. 51-63.
24. Generational loss is acknowledged in the draft photocopying guidelines of the Subcommittee on Preservation Photocopying Guidelines, of the Reproduction of Library Materials Section of ALA. The August 1991 version emphasizes that acceptable copy image quality should consider reproducibility (i.e., can the text be copied again). The generational loss with microfilm is not as great, but does represent about a 10% reduction in resolution with each generation. As such, the technical specifications for microfilm vary from one generation to the next. See, for example, Research Libraries Group, Inc., RLG Preservation Microfilming Handbook, edited by Nancy E. Elkington, (Mountain View, CA: The Research Libraries Group, Inc., 1992), Appendix 18. See also, Don Willis, A Hybrid Systems Approach to Preserving Printed Materials using Microfilm and Digital Imaging, presentation at the AIIM conference, April 1991.
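As a rough illustration of the microfilm figure in note 24, the sketch below estimates how much resolution remains after successive film generations. It assumes the 10% loss compounds multiplicatively from one generation to the next, which the note does not state explicitly; the function and constant names are hypothetical.

```python
# Sketch only: assumes the ~10% loss per generation (note 24) compounds
# multiplicatively. The names below are illustrative, not from the report.
LOSS_PER_GENERATION = 0.10


def relative_resolution(generations: int) -> float:
    """Fraction of the original resolution left after the given number
    of copy generations (0 = the camera master itself)."""
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return (1.0 - LOSS_PER_GENERATION) ** generations


if __name__ == "__main__":
    for gen in range(4):
        print(f"generation {gen}: {relative_resolution(gen):.1%} of original resolution")
```

Under this assumption a second-generation copy would retain roughly four fifths of the master's resolution, which is one way to see why specifications differ by generation.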
25. A process of auto-segmentation, which incorporates the windowing function automatically as a page is scanned, is being refined by Xerox. When available, it will increase the speed of capture for illustrated text.
26. An excellent discussion of relating photographic quality indexes with digital scanning is presented in AIIM Technical Report (TR 26), “A Tutorial on Photographic and Electronic Imaging Resolution,” draft, 2/5/92. See also Tom Bagg, “Image Quality,” paper presented to the Digital Image Applications Group, Sept. 25, 1986; and Don Willis, “A Hybrid Systems Approach to Preserving Printed Materials using Microfilm and Digital Imaging,” draft paper, 1991, unnumbered.
28. Costs associated with digital technology are derived from Table A. The numbers in [brackets] refer to line numbers in Table A. Overhead reflects the general and administrative costs and profit margin that would be included by an outside vendor. The 1992 cost of photocopying is based on two quotes for photocopying and binding a 300 page book (Library Bindery Service and Ridley’s Book Bindery). The average annual inflation rate is calculated at 5%.
29. The numbers in [brackets] for digital technology refer to line numbers in Table A. A book scanned in 1992 will be refreshed twice in the next decade, in 1996 and 2000. Overhead reflects the general and administrative costs and profit margin that would be included by an outside vendor. Microfilm figures are based on 1992 prices quoted by MicrogrAphics Preservation Service (MAPS). Cost of archival master is based on $.195/frame for one-up and two-up filming. Cost of print master is $15. For two-up filming, assume six books can be stored on each roll; for one-up filming, assume three books. The cost of one book on the print master will be $5.00 (one-up) or $2.50 (two-up). Storage costs are based on $1/year to store one roll of film. The cost of book storage/year will equal $1 divided by 3 (one-up) or by 6 (two-up). Since two generations are being stored, the cost equals $.66 (one-up) and $.33 (two-up) per year times 10 years, or $6.66 and $3.33 respectively.
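The per-book microfilm arithmetic in note 29 can be checked with a short calculation. The sketch below uses only the figures given in the note ($15 per print master roll, $1 per roll per year of storage, two stored generations, ten years, three or six books per roll); the function and variable names are hypothetical.

```python
# Sketch only: reproduces the per-book arithmetic of note 29.
# All dollar figures come from the note; the names are illustrative.
PRINT_MASTER_COST_PER_ROLL = 15.00  # dollars per print master roll
STORAGE_COST_PER_ROLL_YEAR = 1.00   # dollars to store one roll for one year
GENERATIONS_STORED = 2              # archival master and print master
YEARS = 10
BOOKS_PER_ROLL = {"one-up": 3, "two-up": 6}


def per_book_costs(layout: str) -> dict:
    """Return print master and storage costs for one book in the given layout."""
    books = BOOKS_PER_ROLL[layout]
    storage_per_year = GENERATIONS_STORED * STORAGE_COST_PER_ROLL_YEAR / books
    return {
        "print_master": PRINT_MASTER_COST_PER_ROLL / books,
        "storage_per_year": storage_per_year,
        "storage_total": storage_per_year * YEARS,
    }


if __name__ == "__main__":
    for layout in BOOKS_PER_ROLL:
        costs = per_book_costs(layout)
        print(layout, {name: round(value, 2) for name, value in costs.items()})
```

Exact division gives a ten-year one-up storage cost closer to $6.67 than to the $6.66 quoted, which suggests the note truncates the annual figure to $.66 before multiplying.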
30. The numbers in [brackets] for digital technology refer to line numbers in Table A. Overhead reflects the general and administrative costs and profit margin that would be included by an outside vendor. The binding cost included here assumes that 20% of all requests for subsequent copies will be bound with a full cloth library binding, 40% will be bound using Docutech in-line tape binding, and 40% will be unbound or stapled. If we assumed that all subsequent copies were bound in a full cloth binding the total digital cost would rise to $19.81 in 1992 dollars.
1-1. Chart IEEE Std 167A-1987. Prepared by the IEEE Facsimile Subcommittee and printed by Eastman Kodak Company. For use in accordance with IEEE Std 167-1966, Test Procedure for Facsimile. Copyright 1987, Institute of Electrical and Electronics Engineers.
2-1. The research and development flavor of the study was reflected in fluctuations in scanning productivity. Between April 5 and May 24, 1991–an eight week period–the average weekly scan rate was 6,795 pages, which represents 22.65 books/week. This highly productive period was followed by a week in which only 7.5 books were scanned. System upgrades occurred at regular intervals throughout the year and a reduction in scanning production invariably accompanied software installation. Installation itself usually took a day for testing and debugging. Technicians had to prepare for the installation by clearing the hard disk of work in progress. They then had to learn the new system. Difficulties associated with installing new software on a networked system also were common. For instance, during the week that the P1 software was installed, 3,883 images were scanned; the week the P2.0 software was installed only 3,245 images were scanned; and the week the P2.1 software was installed only 2,778 images were scanned.
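The productivity figures in note 2-1 can be put on a common footing by converting weekly page counts to books per week. The sketch below assumes a 300 page book, which is the size implied by 6,795 pages equalling 22.65 books, and it treats one scanned image as one page, which the note does not confirm. The names and week labels are hypothetical.

```python
# Sketch only: converts the weekly counts in note 2-1 to books per week.
# Assumptions: 300 pages per book (implied by 6,795 / 22.65) and one
# scanned image per page. Names and labels are illustrative.
PAGES_PER_BOOK = 300


def books_per_week(pages_scanned: int) -> float:
    """Equivalent number of 300 page books for a weekly page count."""
    return pages_scanned / PAGES_PER_BOOK


WEEKLY_COUNTS = {
    "eight week average": 6795,
    "first upgrade week": 3883,
    "second upgrade week": 3245,
    "third upgrade week": 2778,
}

if __name__ == "__main__":
    baseline = WEEKLY_COUNTS["eight week average"]
    for label, pages in WEEKLY_COUNTS.items():
        share = pages / baseline
        print(f"{label}: {books_per_week(pages):.2f} books/week ({share:.0%} of average)")
```

On these assumptions, each upgrade week ran at roughly half the eight week average or less.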
3-1. For instance, subsequent iterations of system software will increase the speed of scanning. Xerox has developed a fast scan capability which delays the document structure building until after the actual scanning has been completed. This upgrade has been tested on a scanning workstation located in Cornell’s book store and its use at 300 dpi scanning led to a doubling of the production rate. Cornell did experiment with using a feed mechanism. It was determined that pages that were only marginally brittle (i.e., it took five double corner folds before the paper broke) could survive most paper jams. Libraries may be willing to risk a paper jam to achieve faster production rates for material held by a number of libraries. Before feed mechanisms can be used with this system, however, registration and deskewing must become software functions.