2. Overview of Leading Large-scale Digitization Initiatives

The main players in LSDIs are cultural institutions, commercial entities such as Google and Microsoft, and nonprofit groups including OCA and the Million Book Project (MBP). Although the key motivation of these stakeholders is a desire to expand access to scholarly resources, their goals differ in some ways depending on their organizational missions. The purpose of this section is to highlight the operating principles of the key players and to lay a foundation for a discussion of the preservation implications of LSDIs. Table 1 summarizes the goals and highlights the distinguishing features of the LSDI participants.

2.1 Motivating Factors in Partnerships: Library Perspective

Some 34 cultural entities, including the 12-member Committee on Institutional Cooperation (CIC), have signed digitization agreements with Google or Microsoft. In addition, several cultural institutions are participating in the OCA and the MBP. Some libraries opt to be involved in only one initiative; others are diversifying their digitization strategies through multiple partnerships.10

Answers to frequently asked questions (FAQs) issued by cultural institutions participating in LSDIs indicate three major reasons for participating in large-scale projects: access, preservation, and research and development.11

Access. According to the FAQs, the libraries’ primary motivation for partnership is to support their core mission of advancing knowledge and to transform the ways in which users search and access library content. Several participating libraries also say that these initiatives support their vision to enhance access to information in support of scholarship at local institutions and beyond. A related motivation for participation is to make the institutional collections visible worldwide.

Although most of the libraries engaged in LSDIs have significant experience in digitization, their past efforts are dwarfed by the magnitude of the Google, Microsoft, and OCA endeavors. For example, before partnering with Google, the University of Michigan, considered a leader in this domain, had been digitizing about 5,000 volumes per year. Other LSDI institutions, such as Cornell University Library and the University of Wisconsin-Madison Libraries, have created between two and three million pages of content through initiatives carried out over the past 15 years. This is roughly equivalent to 7,000 to 10,000 book titles. At this rate, it would take them hundreds of years to convert their entire collections. Because such undertakings are costly and demanding, most libraries recognize that a logical step is to accelerate comprehensive retrospective conversion through partnerships with commercial entities.12 Google and Microsoft have significantly raised the bar, as we are now measuring digitization initiatives in terms of millions of books, rather than millions of pages. The University of Michigan-Google LSDI is now scanning 30,000 volumes per week. At this rate, the library’s entire collection (excluding materials that do not qualify) will be converted in five years.
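The rates quoted above can be checked with simple arithmetic. The following is a minimal sketch rather than part of the original analysis: the function names are illustrative, and the 52-week scanning year is an assumption not stated in the text.

```python
"""Back-of-the-envelope check of the digitization rates quoted above."""

# Figures taken from the text.
PAGES_RANGE = (2_000_000, 3_000_000)   # pages created over the period
TITLES_RANGE = (7_000, 10_000)         # stated book-title equivalent
YEARS_ELAPSED = 15
GOOGLE_VOLUMES_PER_WEEK = 30_000       # University of Michigan-Google rate
GOOGLE_YEARS = 5

# Assumption (not in the text): scanning runs all 52 weeks of the year.
WEEKS_PER_YEAR = 52


def titles_per_year(titles: int, years: int) -> float:
    """Average number of titles converted per year."""
    return titles / years


def years_to_convert(collection_size: int, rate_per_year: float) -> float:
    """Years needed to convert a collection at a constant annual rate."""
    if rate_per_year <= 0:
        raise ValueError("rate_per_year must be positive")
    return collection_size / rate_per_year


def implied_volumes(per_week: int, years: int,
                    weeks_per_year: int = WEEKS_PER_YEAR) -> int:
    """Total volumes scanned at a constant weekly rate."""
    return per_week * weeks_per_year * years


if __name__ == "__main__":
    for pages, titles in zip(PAGES_RANGE, TITLES_RANGE):
        rate = titles_per_year(titles, YEARS_ELAPSED)
        print(f"{titles:,} titles: about {pages / titles:.0f} pages per title, "
              f"{rate:.0f} titles per year")
    total = implied_volumes(GOOGLE_VOLUMES_PER_WEEK, GOOGLE_YEARS)
    print(f"Implied five-year total: {total:,} volumes")
```

Under the 52-week assumption, 30,000 volumes per week over five years implies about 7.8 million volumes. By contrast, a hypothetical collection of one million titles (a figure chosen only for illustration) would take about 1,500 years at the faster pre-partnership rate of roughly 667 titles per year.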

Preservation. LSDI libraries often note the desire to ensure that library materials remain accessible to future generations as a further motivation for participation. Some institutions plan to use digitized copies, to the extent allowed by copyright law, as backups for works in case they go out of print, deteriorate, or are lost or damaged. Publishers often do not keep copies of their out-of-print books, whereas libraries have a perpetual responsibility for their materials.13

Thirteen of 14 LSDI libraries that responded to a Web survey conducted in conjunction with this white paper expressed a commitment to archive their digitized materials (see Appendix). However, the extent of this commitment is likely to vary among institutions and has not been fully articulated.

Research and development. Some libraries, such as Stanford, perceive their participation as an opportunity to gain experience in “handling truly large amounts of digital material.”14 Some LSDI libraries mention developing advanced tools for search and retrieval and experimenting with text mining as possible activities. Grogg and Ashmore reveal that most LSDI institutions are in the early stages of exploring how to use these new digital collections and often state that “future uses are under discussion.”15

2.2 Motivating Factors in Partnerships: Commercial Entities

Both Google and Microsoft cite the creation of a searchable database of full-text books as their main motivation for partnership in LSDIs. The following sections provide an overview of the LSDIs and their access-related goals. The summaries are based on e-mail exchanges with representatives of the companies engaged in the digitization initiatives and a review of the organizations’ press releases.

The summaries do not provide information on business models or financial motivations. With the exception of a few publicly available agreements, most of the contracts between the commercial partners and cultural institutions are covered by nondisclosure clauses. RLG Programs, part of OCLC Programs and Research, is leading an effort to coordinate a series of stakeholder meetings to devise best practices in support of LSDIs. One of the outcomes of the effort is a paper by Peter B. Kaufman and Jeff Ubois on “best practices for deal-making.”16 It is based on an analysis of publicly available agreements from commercial and noncommercial mass-digitization partnerships and commentaries on these agreements and others whose documentation is not publicly available.


2.2.1 Google17

The Google Book Search program aims to digitize the full text of books, both public domain and in copyright. The outcome will be a comprehensive, searchable index of a large body of published books in several languages. As of December 2007, 28 libraries were participating in the Google project, with the goal of scanning all or part of their collections and making those texts searchable online. Google is also collaborating with more than 10,000 publishers around the world in addition to its library partners. Google’s business model is based on attracting as many users as possible to its site by offering a far-reaching search engine.

In 2005, a group of publishers and authors filed suit against Google, claiming that it is digitizing books without permission in order to use the information for the company’s benefit. Google argues that only a limited amount of information (in the form of snippets) is displayed for materials in copyright or whose copyright status is unknown, and that this feature encourages users to obtain the book from other sources, such as bookstores and libraries. A reading of relevant publicly available documents reveals that Google’s position varies on allowing participating libraries to share the digital copies of their public domain holdings with academic institutions for noncommercial purposes.

2.2.2 Microsoft18

Microsoft launched its Live Search Books in 2005 through a partnership with the OCA (described in section 2.3.1) to create a database of full-text books. In 2006, the company expanded its effort by recruiting additional library partners and by contracting with Kirtas Technologies19 to undertake part of the digitization activities. Microsoft is focusing on public domain materials published before 1923. The participating libraries set their own digitization requirements for the digital copies they will receive for their own use and have the option to make those copies available through the OCA as well as through Microsoft Live Search Books. Microsoft allows academic institutions to share digital copies with other nonprofit entities as long as those entities agree not to make the files available to other commercial Internet search services.

On a complementary track, Microsoft offers the Live Search Books Publisher Program to add content through direct partnerships with publishers.20 Live Search has distinguished itself from Google Book Search by focusing on delivering results with a unique interface and on providing advanced tools to support search and retrieval. As with all the search products released under the Live Search brand, Live Search Books appears as a tab on the Live Search navigation bar, along with the recently launched Live Search Academic.

2.3 Large-Scale Digitization Efforts by Nonprofit Entities

This section describes the OCA and MBP, two large, fast-moving projects by nonprofit entities with different motivations. It excludes several consortial, regional, governmental, and international initiatives as well as library partnerships with organizations such as JSTOR and Chadwyck-Healey.21

2.3.1 Open Content Alliance

Based on a collaboration of cultural, technology, nonprofit, and governmental organizations, the Open Content Alliance was conceived in 2005 by the Internet Archive and Yahoo!22 Its goal is to build open-access digital collections and make them available through the Internet Archive and The Open Library.23 OCA distinguishes itself as a librarian-driven project. Unlike the Google and Microsoft initiatives, the OCA focuses on the creation of a “permanent archive” of multilingual digitized text and multimedia content.24 All content in the OCA archive is searchable through all major search engines.25 The files are hosted by the Internet Archive, Microsoft, and the Library of Alexandria. Other copies of these files are going into many different repository systems and may be publicly accessible from them in the future. By storing and maintaining data in multiple repositories, the OCA reports that it has been able to preserve the files, test the preservation action, and restore lost files. In addition, the images created with Microsoft funds are added to the Microsoft Live Search Books portal. Although currently focusing on public domain materials, OCA has been in discussion with some publishers to explore new business models around making copyrighted content available. OCA is partially funded by Microsoft and Adobe.


Table 1. Goals and Distinguishing Features of LSDI Participants

2.3.2 Million Book Project26

The MBP is led by the Carnegie Mellon University School of Computer Science and University Libraries.27 A distinguishing feature of MBP is its extensive digital library research agenda, which includes large-scale information storage and management, search engines for multilingual data, image processing, OCR in non-Romance languages, copyright laws and digital-rights management, and language processing. Created with a $3 million National Science Foundation (NSF) grant for equipment and travel, the MBP attracted international partners and matching funds exceeding US$100 million. The initial NSF-funded project officially ended in July 2007; however, partners continue to work together. Since 2001, the project has scanned more than 1.4 million books in China, India, and Egypt. It has included 26 partnering institutions, some contributing to content creation, others to the digital library research agenda. The Internet Archive is a project partner and helps acquire books for digitization. The primary countries that contribute materials for digitization (India, China, and Egypt) prefer to host the books they scan. They might eventually share their content with the Internet Archive or with OCLC, but there currently are no firm plans to do that.28


10 Richard K. Johnson provides a useful synopsis of the implications of book-digitization projects, with examples of core library interests in digitization partnerships, in his article “In Google’s Broad Wake: Taking Responsibility for Shaping the Global Digital Library.” ARL: A Bimonthly Report 250 (February 2007). Available at http://www.arl.org/bm~doc/arlbr250digprinciples.pdf.

11 Examples of FAQs include
Stanford: http://www-sul.stanford.edu/about_sulair/special_projects/google_sulair_project_faq.html
Harvard: http://hul.harvard.edu/hgproject/faq.html
University of Michigan: http://www.lib.umich.edu/staff/google/public/faq.pdf
Cornell: http://wiki.library.cornell.edu/wiki/x/gng.

12 Although not at the same scale as Google and Microsoft, there are other methods to support an ambitious digitization initiative. For example, the Association of Southeastern Research Libraries, a consortium of 38 libraries, is exploring how to digitize selected portions of members’ print and archival collections as a cooperative initiative. Information about this initiative is available at http://www.aserl.org/documents/ASERL_RFP_Digitization_REVISED.pdf.

13 See University of Michigan Library/Google Digitization Partnership FAQ. August 2005. Available at http://www.lib.umich.edu/staff/google/public/faq.pdf. University of Michigan President Mary Sue Coleman has been an outspoken advocate of the preservation role of the digital materials created through the university’s partnership with Google. Noting that about five million of the books in the University of Michigan Library are either brittle or at risk because they are printed on acidic paper, she maintains that the digital copies may be the only versions of work that will survive into the future.

14 Stanford Google Library Project FAQ. January 18, 2006. Available at http://www-sul.stanford.edu/about_sulair/special_projects/google_sulair_project_faq.html.

15 Jill E. Grogg and Beth Ashmore. 2007. “Google Book Search Libraries and Their Digital Copies.” Searcher (April). Available at http://www.infotoday.com/searcher/apr07/Grogg_Ashmore.shtml.

16 Peter B. Kaufman and Jeff Ubois. 2007. “Good Terms: Improving Commercial-Noncommercial Partnerships for Mass Digitization.” D-Lib Magazine 13 (11-12).

17 Thanks to Laura DeBonis, Jennifer Parson, and Jodi Healy at Google for reviewing the information presented in this section of the paper. Additional information about Google Book Search is available at http://books.google.com/intl/en/googlebooks/about.html.

18 Thanks to Jay Girotto, Jessica Jobes, and Michel Cote at Microsoft for reviewing the information presented in this section of the paper.

19 Kirtas Technologies: http://www.kirtas-tech.com/.

20 Microsoft Live Search Books Publisher Program: http://publisher.live.com/.

21 Collaborative Digitization Programs in the United States, a Web site maintained by Ken Middleton from Middle Tennessee State University, provides links to collaborative digitization projects that focus on cultural heritage materials (http://www.mtsu.edu/~kmiddlet/stateportals.html). The June 2005 issue of Library Hi Tech had collaborative digitization as its theme. It is also important to acknowledge that there have been several successful regional, international, and statewide collaborations in the United States and elsewhere, although at a much smaller scale than the Google and Microsoft initiatives. For example, the Collaborative Digitization Program (http://www.cdpheritage.org/index.cfm) and the Florida Digital Archive (http://www.fcla.edu/digitalArchive/) are often cited as exemplary collaborative digitization and archiving endeavors.

22 Open Content Alliance: http://www.opencontentalliance.org/faq.html.

23 Internet Archive: http://www.archive.org/index.php. The Open Library: http://www.openlibrary.org/toc.htm.

24 The OCA will seed the archive with collections from the following organizations: European Archive, Internet Archive, National Archives (UK), O’Reilly Media, Prelinger Archives, University of California, and University of Toronto.

25 One exception to this statement is the content digitized through the Microsoft Live Search Books initiative and contributed to the Open Content Alliance.

26 In addition to the Million Book Project FAQ, information about the initiative was provided by Dean of University Libraries Gloriana St. Clair and Principal Librarian for Special Projects Denise Troll Covey at the Carnegie Mellon University Libraries.

27 Million Book Project: http://www.library.cmu.edu/Libraries/MBP_FAQ.html.

28 This information is based on July 17, 2007, e-mail correspondence with Denise Troll Covey and Gloriana St. Clair at the Carnegie Mellon University Libraries.