The Cornell / Xerox / Commission on Preservation and Access Joint Study
in Digital Preservation
Notes
1. Digital image technology, for the purposes
of this report, is defined as the electronic copying of scanned documents
in image form. The text contained in these images is not converted to alphanumeric
representation at the time of scanning, although the potential exists for
such conversion, in whole or in part, from the digital files at some later
time. The present capabilities of optical character recognition are inadequate
for capturing both the information and the presentation of the original
page, which is critical when replacing rapidly self-destructing books,
especially when one considers the vast number of languages, illustrations,
type faces, and printing techniques present in the collections of modern
research libraries. The creation of digital images does not preclude the
use of OCR capabilities. In fact it represents the first step in that direction--the
scanning of paper copies to which character recognition can then be applied.
See for instance: Stephen Smith and Craig Stanfill, "An Analysis of the
Effects of Data Corruption on Text Retrieval Performance," (Thinking Machines
Corporation, Cambridge, MA: December 14, 1988).
2. The Joint Study compared the quality and costs associated
with monochromatic scanning and photocopying only.
3. Conceivably, this may at some point allow librarians
to propose other service alternatives as a substitute for traditional shelf
storage.
4. Contrary to the frequently expressed concern about
the longevity of the physical storage medium itself, it is the obsolescence
of standards, formats, and access software tools that is of greatest concern.
The physical medium will normally long outlive these considerations.
5. This Report covers the period of the project ending
December 31, 1991. Subsequent to this date, Cornell project staff have
verified that the microfilm produced digitally by this project does
not meet microfilm preservation standards. This is not surprising given
the scanning resolution. However, such microfilm may nevertheless be adequate
for preserving texts produced at 4 point type and larger. In addition,
early experiments suggest that halftone images can be scanned with resultant
quality superior to that normally obtained with most production microfilming
processes. Quality issues will be discussed in subsequent reports.
6. The current national preservation program to preserve
brittle material is based on the replacement of originals with copies that
faithfully capture their intellectual content, including text, illustrations,
and presentation. In order to preserve the largest number of items possible,
material should be copied just once, and that copying should result
in the production of a print master that can be used to make subsequent
copies at lower costs. Information about the availability of copies should
be widely publicized and included in the national on-line bibliographic
databases. Finally, a preservation master of the original should be stored
and maintained in a manner that will guarantee its long-term availability.
7. For instance, in the field of mathematics from which
over half of the materials were selected, users "object to the inconvenience
of microfilm, especially for monographs...Hardcopy reformatting (through
photocopying) of older monographs is the preferred way to provide access
in many libraries." Constance C. Gould and Karla Pearce, Information
Needs in the Sciences: An Assessment, (Mountain View, CA: Research
Libraries Group, Inc., 1991), pp. 65-68.
8. For a Xerox Corporation perspective on the importance
of co-development, see William Anderson, William Crocca, and Steven Barley, "Customer
Co-Development: The Cornell/Xerox Joint Study Project Interim Report," PARC
Technical Report SSL-91-139.
9. One thousand books were chosen for scanning. Fifty
of the most heavily illustrated ones have been reserved for scanning using
the windowing capabilities recently developed by Xerox.
10. A. Katz and D. Cohen, Network Fax Working Group
of the Internet Engineering Task Force, A File Format for the Exchange
of Images in the Internet, Request for Comments 1314, April
1992.
11. Digital files must be created in a manner that provides
users with instructions on how to gain access to the information contained
in them. It is one thing to store information on a disk, and another to
gain access to it. Material cannot be considered preserved if one
cannot "read" it. Thus a file must contain documentation on its format. Though
there are many competing file formats, TIFF is in wide use. Unfortunately
there are multiple TIFF formats, but a committee currently exists to address
this issue. Today TIFF comes close to representing an industry standard.
Aldus Corporation and Microsoft Corporation, "Tag Image File Specification
Revision 5.0" (Aldus/Microsoft Technical memorandum, August 1988).
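
As a minimal illustration of a file carrying information about its own format, the sketch below reads the eight-byte header with which a TIFF file identifies itself: a byte-order mark ("II" or "MM"), the number 42, and the offset of the first image file directory. The function name read_tiff_header is hypothetical, and the sketch does not address the differences among TIFF variants noted above.

```python
import struct


def read_tiff_header(data: bytes) -> tuple[str, int]:
    """Return (byte order, offset of first image file directory) from a TIFF header.

    Hypothetical helper for illustration; it inspects only the first 8 bytes.
    """
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")

    # Bytes 0-1: "II" means little-endian, "MM" means big-endian.
    order = data[:2]
    if order == b"II":
        endian, name = "<", "little"
    elif order == b"MM":
        endian, name = ">", "big"
    else:
        raise ValueError("not a TIFF file: unknown byte-order mark")

    # Bytes 2-3: the number 42; bytes 4-7: offset of the first directory.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: magic number is not 42")
    return name, ifd_offset


if __name__ == "__main__":
    sample = b"II*\x00\x08\x00\x00\x00"  # little-endian header, directory at byte 8
    print(read_tiff_header(sample))
```

A reader that cannot interpret these eight bytes cannot locate anything else in the file, which is the sense in which access depends on knowing the format.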
12. The International Telegraph and Telephone Consultative
Committee (CCITT) has originated two algorithms, Group 3 and Group 4, that
are in wide use for black and white images.
13. Norvell M.M. Jones, Archival Copies of Thermofax,
Verifax, and Other Unstable Records, National Archives Technical
Information Paper No. 5 (Washington: National Archives and Records Administration,
1990). ANSI Standard Z39.48-1984, currently being revised, covers the
requirements for permanent/durable paper. See also RLG Preservation
Manual (1986) and the Reproduction of Library Materials (ALA)
draft photocopy guidelines of the Subcommittee on Preservation Photocopying
Guidelines. The guidelines currently available for preservation photocopying
place greater emphasis on image stability and paper permanence than image
quality.
14. Cornell did prepare a Preservation Scope Note for
the mathematics material which appears in the RLIN Conspectus. Preservation
Scope Notes provide RLG and individual institutions with information about
large preservation projects, both in progress and completed, to assist
in the planning and coordination of preservation activities.
15. Format Integration and Its Effect on the USMARC
Bibliographic Format, Library of Congress, 1988. Prepared by Network
Development and MARC Standards Office.
16. Performance issues associated with reading material
from the network will be addressed in the Testbed Project, begun in January
1992.
17. The film emulsion layer is unusually thin and characterized
by extremely fine grains and a relatively high silver to gel ratio; the
support is ESTAR base, a clear 4-mil polyester film. Based on discussions
with technical experts at Kodak and University Microfilms, it appears that
the archival properties of the SO-219 are questionable. Image Graphics
is investigating the use of Image Link film for subsequent tests.
18. Subsequent to the close of Phase 1, the microfilm
was indeed produced. The quality will be discussed in subsequent reports.
19. Donald J. Waters, From Microfilm to Digital
Imagery: On the feasibility of a project to study means, costs,
and benefits of converting large quantities of preserved library materials
from microfilm to digital images (Washington: The Commission on Preservation
and Access, 1991).
20. The selection process is described by Steven Rockey
in "The Cornell-Xerox-CPA Project to Digitally Reformat Books," paper presented
to the AMS/MAA Joint Mathematics Meetings, Baltimore, MD, January 8-11,
1992. A bibliography of the mathematics books preserved in this project
is included as Appendix VII. A bibliography of all volumes scanned in this
project can be prepared by conducting a search on RLIN using the Series
Note ("CXJSP"), and downloading the on-line records.
21. Disbinding books with minimal artifactual value
met little faculty resistance when high-quality replacement facsimiles
were produced, and additional copies can be printed on demand.
22. It is anticipated that as data exchange standards
are developed and implemented, the time between refreshing will increase
from four years to ten years and beyond. See for instance, Charles M. Dollar, "The
Impact of Information Technologies on Archival Principles and Practices:
Some Considerations," Draft Version 16, November 15, 1990, p. 63.
23. This study investigated the quality achieved with
binary scanning only. Depending on the object being scanned, grey scale
or color scanning may be superior, and the advantages/disadvantages of
the various approaches need to be examined. Scanning resolutions and file
formats can represent a complex tradeoff between time, file size, fidelity,
on-screen display, printing, and equipment availability. The study had
as a primary emphasis the production of printed facsimiles that were largely
black and white text in a timely and cost-effective manner. With binary
scanning, large files may be compressed efficiently and in a lossless manner
using CCITT Group 4 facsimile compression algorithms. Grey scale compression,
using JPEG, is much less economical and is "lossy," which may make it inappropriate
as a preservation method. It appears that while binary files produce a
high quality printed version, other combinations of spatial resolution
with grey and/or color will also be adequate. Grey scale can offer an advantage
for on-screen viewing. For instance, on a low resolution screen display,
two bits of grey at 100 dpi may be more readable than 600 dpi or 300 dpi
binary. The advantage is lost, however, when the on-screen image is enlarged.
The quality associated with binary or grey scale also depends on the
equipment used; for instance, binary scanning produces a better paper copy
when it is printed on a binary printer. See Michael Ester, "Image Quality
and User Perception," LEONARDO Digital Image, Digital Cinema
Supplemental Issue (1990), pp. 51-63.
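
The file-size side of this tradeoff can be sketched with simple arithmetic. The example below computes the uncompressed size of one page image at the three resolution and bit-depth combinations named in this note. The 8.5 x 11 inch page and the function name raw_page_bytes are assumptions made for illustration and are not figures from the study; compressed sizes depend on page content and are not estimated here.

```python
# Assumed page dimensions in inches (hypothetical; not taken from the study).
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def raw_page_bytes(dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of one scanned page at the given settings."""
    width_px = round(PAGE_WIDTH_IN * dpi)
    height_px = round(PAGE_HEIGHT_IN * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


if __name__ == "__main__":
    settings = [
        ("600 dpi binary", 600, 1),
        ("300 dpi binary", 300, 1),
        ("100 dpi, two bits of grey", 100, 2),
    ]
    for label, dpi, bits in settings:
        print(f"{label}: {raw_page_bytes(dpi, bits):,} bytes uncompressed")
```

Under these assumptions a 600 dpi binary page is about 4.2 million bytes before compression, four times the 300 dpi figure, which is why lossless compression matters for binary files.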
24. Generational loss is acknowledged in the draft photocopying
guidelines of the Subcommittee on Preservation Photocopying Guidelines,
of the Reproduction of Library Materials Section of ALA. The August 1991
version emphasizes that acceptable copy image quality should consider reproducibility
(i.e., can the text be copied again). The generational loss with microfilm
is not as great, but does represent about a 10% reduction in resolution
with each generation. As such the technical specifications for microfilm
vary from one generation to the next. See, for example, Research Libraries
Group, Inc., RLG Preservation Microfilming Handbook, edited
by Nancy E. Elkington, (Mountain View, CA: The Research Libraries Group,
Inc., 1992), Appendix 18. See also, Don Willis, A Hybrid Systems
Approach to Preserving Printed Materials using Microfilm and Digital Imaging,
presentation at the AIIM conference, April 1991.
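
The cumulative effect of a per-generation loss can be sketched as follows. The example assumes that the roughly 10% reduction compounds, that is, each generation retains 90% of the resolution of the one it was copied from; the note does not state how the loss accumulates, so this is an assumption, and the function name resolution_retained is hypothetical.

```python
def resolution_retained(generations: int, loss_per_generation: float = 0.10) -> float:
    """Fraction of the original resolution left after a number of copy generations.

    Assumes each generation loses a fixed fraction of the resolution of the
    generation it was copied from (compounding loss).
    """
    if generations < 0:
        raise ValueError("generations must be zero or greater")
    return (1.0 - loss_per_generation) ** generations


if __name__ == "__main__":
    for n in range(4):
        print(f"after {n} copy generation(s): {resolution_retained(n):.1%} of original resolution")
```

On this assumption a third-generation copy retains a little under three quarters of the original resolution, which is consistent with technical specifications that differ by generation.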
25. A process of auto-segmentation, which incorporates
the windowing function automatically as a page is scanned, is being refined
by Xerox. When available, it will increase the speed of capture for illustrated
text.
26. An excellent discussion of relating photographic
quality indexes with digital scanning is presented in AIIM Technical Report
(TR 26), "A Tutorial on Photographic and Electronic Imaging Resolution," draft,
2/5/92. See also Tom Bagg, "Image Quality," paper presented to the Digital
Image Applications Group, Sept. 25, 1986; and Don Willis, "A Hybrid Systems
Approach to Preserving Printed Materials using Microfilm and Digital Imaging," draft
paper, 1991, unnumbered.
27. Nonetheless, Xerox has concurred with the figure
used in the cost study.
28. Costs associated with digital technology are derived
from Table A. The numbers in [brackets] refer to line numbers in
Table A. Overhead reflects the general and administrative costs and profit
margin that would be included by an outside vendor. The 1992 cost of photocopying
is based on two quotes for photocopying and binding a 300 page book (Library
Bindery Service and Ridley's Book Bindery). The average annual inflation
rate is calculated at 5%.
29. The numbers in [brackets] for digital technology
refer to line numbers in Table A. A book scanned in 1992 will be refreshed
twice in the next decade, in 1996 and 2000. Overhead reflects the general
and administrative costs and profit margin that would be included by an
outside vendor. Microfilm figures are based on 1992 prices quoted by MicrogrAphics
Preservation Service (MAPS). Cost of archival master is based on $.195/frame
for one-up and two-up filming. Cost of print master is $15. For two-up
filming, assume six books can be stored on each roll; for one-up filming,
assume three books. The cost of one book on the print master will be $5.00
(one-up) or $2.50 (two-up). Storage costs are based on $1/year to store
one roll of film. The cost of book storage/year will equal $1 divided by
3 (one-up) or by 6 (two-up). Since two generations are being stored, the
cost equals $.66 (one-up) and $.33 (two-up) per year times 10 years, or
$6.66 and $3.33 respectively.
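
The per-book microfilm figures in this note can be reproduced with the short sketch below. All dollar amounts and books-per-roll counts come from the note itself; the function and variable names are hypothetical. Exact fractions are used so that the rounding is visible.

```python
from fractions import Fraction

# Figures quoted in the note (1992 prices).
PRINT_MASTER_COST_PER_ROLL = Fraction(15)  # dollars per print-master roll
STORAGE_COST_PER_ROLL_YEAR = Fraction(1)   # dollars per roll per year
GENERATIONS_STORED = 2                     # two film generations are stored
YEARS = 10
BOOKS_PER_ROLL = {"one-up": 3, "two-up": 6}


def per_book_costs(mode: str) -> dict[str, Fraction]:
    """Per-book print-master and storage costs for a given filming mode."""
    books = BOOKS_PER_ROLL[mode]
    storage_per_year = GENERATIONS_STORED * STORAGE_COST_PER_ROLL_YEAR / books
    return {
        "print_master": PRINT_MASTER_COST_PER_ROLL / books,
        "storage_per_year": storage_per_year,
        "storage_total": storage_per_year * YEARS,
    }


if __name__ == "__main__":
    for mode in BOOKS_PER_ROLL:
        costs = per_book_costs(mode)
        print(mode, {name: f"${float(value):.2f}" for name, value in costs.items()})
```

The exact one-up storage figures are 2/3 of a dollar per year and 20/3 dollars over ten years; the note truncates these to $.66 and $6.66, where rounding would give $.67 and $6.67.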
30. The numbers in [brackets] for digital technology
refer to line numbers in Table A. Overhead reflects the general and administrative
costs and profit margin that would be included by an outside vendor. The
binding cost included here assumes that 20% of all requests for subsequent
copies will be bound with a full cloth library binding, 40% will be bound
using Docutech in-line tape binding, and 40% will be unbound or stapled.
If we assumed that all subsequent copies were bound in a full cloth binding
the total digital cost would rise to $19.81 in 1992 dollars.
1-1. Chart IEEE Std 167A-1987. Prepared by the IEEE
Facsimile Subcommittee and printed by Eastman Kodak Company. For use in
accordance with IEEE Std 167-1966, Test Procedure for Facsimile. Copyright
1987, Institute of Electrical and Electronics Engineers.
2-1. The research and development flavor of the study
was reflected in fluctuations in scanning productivity. Between April 5
and May 24, 1991--an eight week period--the average weekly scan rate was
6,795 pages, which represents 22.65 books/week. This highly productive
period was followed by a week in which only 7.5 books were scanned. System
upgrades occurred at regular intervals throughout the year and a reduction
in scanning production invariably accompanied software installation. Installation
itself usually took a day for testing and debugging. Technicians had to
prepare for the installation by clearing the hard disk of work in progress.
They then had to learn the new system. Difficulties associated with installing
new software on a networked system also were common. For instance, during
the week that the P1 software was installed, 3,883 images were scanned;
the week the P2.0 software was installed only 3,245 images were scanned;
and the week the P2.1 software was installed only 2,778 images were scanned.
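
The productivity figures in this note can be compared with a short calculation. The sketch below derives the book length implied by the weekly averages and expresses each installation week as a share of the eight-week average. It assumes that "pages" and "images" are comparable units, which the note does not state; the function names are hypothetical.

```python
# Figures quoted in the note.
AVERAGE_PAGES_PER_WEEK = 6795
AVERAGE_BOOKS_PER_WEEK = 22.65
INSTALL_WEEK_IMAGES = {"P1": 3883, "P2.0": 3245, "P2.1": 2778}


def pages_per_book(pages_per_week: float, books_per_week: float) -> float:
    """Average book length implied by the weekly page and book counts."""
    return pages_per_week / books_per_week


def share_of_average(images: int, average: int = AVERAGE_PAGES_PER_WEEK) -> float:
    """Images scanned in one week as a fraction of the average weekly rate."""
    return images / average


if __name__ == "__main__":
    implied = pages_per_book(AVERAGE_PAGES_PER_WEEK, AVERAGE_BOOKS_PER_WEEK)
    print(f"implied pages per book: {implied:.0f}")
    for release, images in INSTALL_WEEK_IMAGES.items():
        print(f"{release}: {images:,} images, {share_of_average(images):.0%} of the average week")
```

The implied figure of 300 pages per book matches the 300 page book used elsewhere in the cost comparisons, and each installation week falls well below the average weekly rate.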
2-2. Statistics prepared by Dorothy Wright, Preservation
Librarian, Mann Library, Cornell University, December 1991.
3-1. For instance, subsequent iterations of system
software will increase the speed of scanning. Xerox has developed a fast
scan capability which delays the document structure building until after
the actual scanning has been completed. This upgrade has been tested on
a scanning workstation located in Cornell's book store and its use at 300
dpi scanning led to a doubling of the production rate. Cornell did experiment
with using a feed mechanism. It was determined that pages that were only
marginally brittle (i.e., it took five double corner folds before the paper
broke) could survive most paper jams. Libraries may be willing to risk
a paper jam to achieve faster production rates for material held by a number
of libraries. Before feed mechanisms can be used with this system, however,
registration and deskewing must become software functions.