Preservation in the Age of Large-Scale Digitization: A White Paper

Oya Y. Rieger
February 2008

Copyright 2008 by the Council on Library and Information Resources. No part of this publication may be reproduced or transcribed in any form without permission of the publishers. Requests for reproduction or other uses or questions pertaining to permissions should be submitted in writing to the Director of Communications at the Council on Library and Information Resources.

About the Author



1. Introduction: Large-Scale Digitization Initiatives in the Limelight
1.1 Interplay between Access and Preservation
1.2 Terminology
1.3 Outline

2. Overview of Leading Large-Scale Digitization Initiatives
2.1 Motivating Factors in Partnerships: Library Perspective
2.2 Motivating Factors in Partnerships: Commercial Entities
2.2.1 Google
2.2.2 Microsoft
2.3 Large-Scale Digitization Efforts by Nonprofit Entities
2.3.1 Open Content Alliance
2.3.2 Million Book Project

3. Framework for Assessing Preservation Aspects of Large-Scale Digitization Initiatives
3.1 Selection for Digitization and Preservation Reformatting
3.2 Content Creation
3.2.1 Image-Quality Procedures for Large-Scale Digitization Initiatives
3.2.2 Preservation Metadata
3.2.3 Descriptive and Structural Metadata
3.2.4 Quality Control
3.3 Technical Infrastructure
3.4 Organizational Infrastructure

4. Implications of LSDIs for Book Collections
4.1 Pressure for Relieving Space
4.2 Impact on Traditional Preservation and Conservation Programs
4.3 Print-on-Demand Books

5. Recommendations
5.1 Reassess Digitization Requirements for Archival Images
5.2 Develop a Feasible Quality Control Program
5.3 Balance Preservation and Access Requirements
5.4 Enhance Access to Digitized Content
5.5 Understand the Impact of Contractual Restriction on Preservation Responsibilities
5.6 Lend Support for Shared Print-Storage Initiatives
5.7 Promote the Use of Registry of Digital Masters
5.8 Outline a Large-Scale Digitization Initiative Archiving Action Agenda
5.9 Devise Policies for Designating Digital Preservation Levels
5.10 Capture and Share Cost Information
5.11 Revisit Library Priorities and Strategies
5.12 Shift to an Agile and Open Planning Model
5.13 Re-envision Collection Development for Research Libraries

6. Conclusion: Why Join Forces?

Large-Scale Digitization Initiatives: Survey of Preservation Implications

About the Author

Oya Rieger is interim assistant university librarian for digital library and information technologies at the Cornell University Library, where she oversees the institution’s repository development, digital preservation, electronic publishing, digitization, and e-scholarship initiatives. Her responsibilities also include coordinating the library’s large-scale digitization collaborations with Microsoft and Google. She is the coauthor of the award-winning Moving Theory into Practice: Digital Imaging for Libraries and Archives (Research Libraries Group 2000). A member of several digital imaging and preservation working groups, Ms. Rieger cochaired a group charged with developing ANSI/NISO Technical Metadata for Digital Images. Having earned a B.S. in economics, a master’s degree in public administration, and an M.S. in information systems, she is currently pursuing a Ph.D. degree in a joint Cornell program with the Communication, Information Science, and Science and Technology Studies departments. Her research interests focus on the sociocultural aspects of digital technologies and scholarly communication.

Acknowledgments


I sincerely appreciate the invitation from the Council on Library and Information Resources (CLIR) to write a white paper focusing on two of my favorite topics: digitization and preservation. I am especially grateful to Kathlin Smith, CLIR’s editor and director of communications, who guided me with great expertise and constant encouragement as the paper evolved from its inception to the final stages. The deep preservation background of Connie Brooks, CLIR preservation consultant, was instrumental in making sure that the paper addresses the preservation community’s questions. The paper also benefited from Linda Harteker’s thorough copy editing.

Special thanks go to several colleagues who were generous with their feedback during the external review. They include Bill Carney, Steve Chapman, Michele Cloonen, Paul Conway, Ricky Erway, Dale Flecker, Evelyn Frangakis, Amy Friedlander, Gary Frost, Janet Gertz, Paul Gherman, Anne Kenney, Bob Kieft, Katherine Kott, Bill Lefurgy, Anne Okerson, Vicky Reich, Brian Schottlaender, Abby Smith, and Don Waters. I also appreciate the Google Book Search, Microsoft Live Search, Million Book Project, and Open Content Alliance representatives’ willingness to review the paper to confirm the accuracy of information presented about their respective initiatives. These individuals included Laura DeBonis, Jodi Healy, and Jennifer Parson from Google; Jay Girotto, Jessica Jobes, and Michel Cote from Microsoft; Denise Troll Covey and Gloriana St. Clair from Million Book Project (both from the Carnegie Mellon University Libraries); and Brewster Kahle from the Open Content Alliance.

Foreword


The digitization of millions of books under programs such as Google Book Search and Microsoft Live Search Books is dramatically expanding our ability to search and find information. For scholars, it is the unparalleled scale of these undertakings that holds such promise. But it is likewise the scale of such projects that gives rise to concerns that the quality of the digitized material is inconsistent, and that the files sometimes lack important bibliographic information in their metadata.

The primary aim of large-scale digitization projects, which is to quickly create a critical mass of digitized books, stands in contrast to that of earlier projects, which frequently sought to create fewer, but higher-quality, scans for scholarly use. These changes in scale and quality raise a new challenge: that of maintaining the massive new collections. The point of the large-scale projects, which is to make content accessible, is interwoven with the question of how one keeps that content, whether digital or print, fit for use over time.

This paper examines large-scale initiatives to identify issues that will influence the availability and usability, over time, of the digital books that these projects create. As an introduction, the paper describes four key large-scale projects and their digitization strategies. Issues range from the quality of image capture to the commitment and viability of archiving institutions, as well as those institutions’ willingness to collaborate. The paper also attempts to foresee the likely impacts of large-scale digitization on book collections. It offers a set of recommendations for rethinking a preservation strategy. It concludes with a plea for collaboration among cultural institutions. No single library can afford to undertake a project on the scale of Google Book Search; it can, however, collaborate with others to address the common challenges that such large projects pose.

Although this paper covers preservation administration, digital preservation, and digital imaging, it does not attempt to present a comprehensive discussion of any of these distinct specialty areas. Deliberately broad in scope, the paper is designed to be of interest to a wide range of stakeholders. These stakeholders include scholars; staff at institutions that are currently providing content for large-scale digital initiatives, are in a position to do so in the future, or are otherwise influenced by the outcomes of such projects; and leaders of foundations and government agencies that support, or have supported, large digitization projects. The paper recommends that Google and Microsoft, as well as other commercial leaders, also be brought into this conversation.

The commercial partners, as well as the participating libraries, are investing significant resources in digitization projects. How can we secure, or improve, a long-term return on this investment? Can we strike a better balance between quantity and quality? This paper outlines a range of issues relevant to the stewardship of digital resources being created by large-scale projects and to the relationship of these new resources to our print legacies. Our goal is to stimulate discussion among stakeholders and to generate productive thinking about collaborative approaches to enduring access.

CLIR is deeply grateful to Oya Rieger for so ably taking on this timely and important task. In writing this white paper, Ms. Rieger drew on her own experience and knowledge of the field as well as on responses to surveys she conducted of partners in large-scale digitization initiatives. CLIR also thanks the many experts who provided thoughtful feedback on the first draft of the paper. CLIR encourages comments from the community at large.

Charles Henry
President, CLIR
