Several research libraries are either already involved in large-scale digitization initiatives (LSDIs) or are contemplating or planning involvement in such endeavors. Two of the most visible large-scale projects, Google Book Search and Microsoft Live Search Books, have generated a flurry of debates, exchanges of opinion, and articles in various library forums and publications. Because such collaborations have far-reaching impact and are deemed inherently interesting for a general audience, the scope of commentaries has expanded to include mainstream media such as The New Yorker and The Atlantic Monthly.1 Everyone has an opinion to express, and polarization has emerged between supporters and critics of such collaborations. There is also a group in the middle that continues to contemplate with mixed feelings the range of issues associated with LSDIs.
The goal of this white paper is to consider the potential links between large-scale digitization and long-term preservation of print and digital content, with an emphasis on research library collections. Research libraries serve as stewards of cultural heritage resources, notably books and journals, but also photographs, recordings, and other information sources. This paper focuses on books, particularly the large collections that are or may be digitized as a result of a partnership with Google, Microsoft, the Open Content Alliance (OCA), or similar agencies.
1.1 Interplay between Access and Preservation
The primary motivation of all partners in LSDIs is to make it easier to find and access books. Nonetheless, access and preservation goals are usually interrelated, since access to scholarly materials depends upon their being fit for use over time. The connection between preservation and access in the digital world is complex. For example, a library may opt to archive its digitized content as a backup in case the print counterparts are damaged or lost. However, the institution may not be able to provide online discovery and retrieval of archived digital content through a Web portal, owing to lack of funds, copyright restrictions, or other reasons.
While many LSDI libraries have acknowledged their intent to assume long-term responsibility for preserving digital books,2 there is not yet a common understanding of what such responsibility entails. Who will ensure that digital content created through such initiatives remains accessible over time—a responsibility that is different from merely preserving it? Will responsibility for perpetual digital access be assigned to the corporate or nonprofit partners or to the libraries?
There is significant uncertainty about the long-term strategies of initiatives such as Google Book Search and Microsoft Live Search Books. These are relatively new programs and there is no evidence to suggest that the corporate and nonprofit partners have any long-term business plans for maintaining access to digitized collections or for migrating delivery platforms through future technology cycles.3 Their online delivery and retention decisions will most likely be based on use patterns and business interests. The recent announcement that the Arts and Humanities Research Council and Joint Information Systems Committee (JISC) will cease funding the Arts and Humanities Data Service (AHDS) gives cause for concern about the long-term viability of even government-funded archiving services.4 Such uncertainties strengthen the case for libraries taking responsibility for preservation—both from archival and access perspectives. This possibility, however, raises other questions, such as the rights to archive and provide access to digitized content still under copyright.
The interplay between the goals of access and those of preservation is also evident in discussions about the quality of digitized content resulting from current LSDI efforts. In the context of LSDIs, digital preservation can represent two distinct but related operations. It can refer to (1) preserving digital objects that result from the conversion of print materials or (2) digitizing print materials (digital reformatting) to produce digital surrogates. These two aspects of digital preservation are often conflated. The confusion arises partly from the fact that they are complementary goals and often exist within the same initiative. Although the primary incentive of the Google and Microsoft programs is to enhance access (and the image and metadata technical specifications are not pegged for digital reformatting), this does not preclude the possibility of using the resulting digital books as digital surrogates. However, some have observed that the image and optical character recognition (OCR) quality of books scanned in the LSDI projects does not adhere to reformatting best practices developed by librarians and archivists over the past 15 years. There are questions about whether materials are being converted at a quality that will stand the test of time. If participating cultural institutions intend to use the resulting digital files as surrogates for analog books, or even as a just-in-case backup if an original book is lost or damaged, how can we define a digital preservation strategy that is built on the recognition that LSDIs are primarily access-driven projects?
There is not yet a clear and consistent taxonomy for digital preservation terminology, although there are some excellent glossaries.5 Terms such as archiving and preservation are used interchangeably, sometimes depending on the preferences of specific communities. For example, Open Archival Information System (OAIS) uses archive when referring to an organization that intends to preserve information for access and use by a “designated community.”6
In this paper, digital preservation is used interchangeably with archiving. Both terms refer to a range of managed activities to support the long-term maintenance of bitstreams to make sure that digital objects are usable.7 The definition does not include the processes required to provide continued access to digital content through various delivery methods (referred to henceforth as “enduring access” to differentiate it from bitstream preservation). According to the Preservation Management of Digital Material Handbook, preserving access entails ensuring the “usability of a digital resource, retaining all qualities of authenticity, accuracy, and functionality deemed to be essential for the purposes the digital material was created and/or acquired for.”8 Providing enduring access within the scope of an LSDI is a complicated responsibility. In addition to being subject to usage restrictions imposed by partners such as Google and Microsoft on digital copies provided to LSDI libraries, many digitized materials will remain in copyright for several years and cannot be made accessible online by participating libraries.
This paper uses the terms mass digitization and large-scale digitization interchangeably, although some draw a difference between these two terms.9
This paper starts with an overview of the prominent LSDIs and some of their key goals. It then provides a framework within which to assess the preservation components of digitization initiatives, including selection, content creation, and technical and organizational infrastructure. Next, the paper highlights some of the primary implications of LSDIs with regard to book collections. It concludes with a set of recommendations designed to further discussion and decision making on this important issue.
1 Jeffrey Toobin. 2007. “Google’s Moon Shot: The Quest for the Universal Library.” The New Yorker (February 5). Available at http://www.newyorker.com/reporting/2007/02/05/070205fa_fact_toobin. See also Michael Hirschorn. 2007. “The Hapless Seed.” The Atlantic Monthly 299(5) [June]: 134-139.
2 See Appendix, LSDIs: Survey of Preservation Implications, question 2.
3 According to Section 4.5 (Ownership and Control of Google Services) of the Cooperative Agreement between Google and the Committee on Institutional Cooperation (CIC), “… Google is not required to make any or all of the Google Digital Copy available through the Google Services.” Available at http://www.cic.uiuc.edu/programs/CenterForLibraryInitiatives/Archive/PressRelease/LibraryDigitization/AGREEMENT.pdf.
4 The AHDS has pioneered and encouraged awareness and use among Britain’s university researchers in the arts and humanities of best practices in preserving digital data created by publicly funded research projects. The decision to cease funding is perceived as undermining the effort put into these awareness activities.
5 Cornell University Library. Digital Preservation Management Tutorial. Available at http://www.library.cornell.edu/iris/tutorial/dpm/terminology/g_resources.html.
6 The OAIS Reference Model defines a designated community as “an identified group of potential users of the archives’ contents who should be able to understand a particular set of information.” ISO 14721:2003 OAIS. Available at http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=24683&ICS1=49&ICS2=140&ICS3.
7 The Trusted Digital Repositories report defines digital preservation as “the managed activities necessary for ensuring both the long-term maintenance of a bitstream and continued accessibility of content.” Trusted Digital Repositories: Attributes and Responsibilities. An RLG-OCLC Report. May 2002. Available at http://www.rlg.org/legacy/longterm/repositories.pdf. Bitstream preservation aims to keep the digital objects intact and readable. It ensures bitstream integrity by monitoring for corruption to data fixity and authenticity; protecting digital content from undocumented alteration; securing the data from unauthorized use; and providing media stability. Digital objects are items stored in a digital repository and in their simplest form consist of data, metadata, and an identifier.
8 The Preservation Management of Digital Material Handbook is maintained by the Digital Preservation Coalition in collaboration with the National Library of Australia and the PADI Gateway. Available at http://www.dpconline.org/text/intro/definitions.html.
9 According to Karen Coyle, mass digitization is the conversion of materials on an industrial scale without making a selection of individual materials. The goal of mass digitization is not to create collections but to digitize everything, or in this case, every book ever printed. In contrast, large-scale projects aim to create collections and produce complete sets of documents. See Karen Coyle. 2006. “Mass Digitization of Books.” Journal of Academic Librarianship 32(6) [November]: 641–645.