APPENDIX Large-Scale Digitization Initiatives: Survey of Preservation Implications

In July 2007, a Web-based survey questionnaire was distributed to 20 research libraries in the United States, the United Kingdom, and Canada. The goal of the survey was to gather information about the preservation activities of large-scale digital initiatives (LSDIs). The survey was distributed only to libraries that were actively participating in the Google, Microsoft, or Open Content Alliance (OCA) initiatives as of July, and was not sent to those who had signed agreements but were still in the planning stages.123 To maintain the anonymity of respondents, we do not identify which libraries completed the survey; however, all are among those listed on page 7.

Fourteen of the 20 institutions were able to provide information about their large-scale digitization efforts. Six libraries were not able to participate for a variety of reasons, including privacy concerns and insufficient experience.

1. LSDI Participation

The two tables below summarize the distribution of respondents’ participation in LSDIs. As the tables show, many respondents participated in more than one initiative.

The total number of materials digitized by eight of the fourteen participating libraries was 22 million. The other six libraries did not respond to this question in quantitative terms; they characterized their selection efforts as evolving and were not able to quantify the number of materials digitized or slated for digitization. For example, one respondent commented that because of the nature of its catalog records and the period of material being considered for digitization (nineteenth-century), it was difficult to determine in advance the number of items that would fall within the scope of the project.

Seven libraries included both in-copyright and public domain materials in their LSDIs; the other seven included only public domain content.

When asked about the duration of the project, nine institutions’ responses fell within the one- to six-year range. The others replied that the duration of their projects was undetermined or that the response to the question was confidential.

2. Digital Preservation Plans

Thirteen of the fourteen respondents expressed their intent to archive their digitized materials, that is, to assume long-term responsibility for preserving their digitized books. Twelve libraries said that their efforts were in the exploratory or planning stages; the other two libraries characterized their preservation efforts as “plans in place.” Nine libraries indicated that they are developing a plan to ingest, store, and archive digitized content. Three libraries identified their repositories as ready to ingest, store, and archive.

When asked about collaboration in preservation efforts, two institutions stated that they already have partnerships in place, five institutions do not have any immediate collaboration plans, and four institutions indicated that they were considering a collaborative approach. Three institutions did not provide information in response to this question.

3. Challenges Ahead

Thirteen respondents commented on the challenges they faced. Many emphasized that the scale and pace of their LSDI require extremely robust systems, effective and reliable tracking tools, and tested preservation ingest procedures. Seven respondents stressed the difficulty associated with storing large amounts of data. The following comments illustrate the challenges perceived by the respondents:

  • One library plans to base its preservation infrastructure on FEDORA architecture, but it has not yet tested it with such a large quantity of individual files.
  • In regard to storage, one respondent observed that “the obvious challenges have also been the most basic.” This library had found it very difficult to determine storage needs in advance.
  • Creating and storing 4–8 gigabytes of data daily has put an enormous stress on one library’s networking and storage system. Several respondents stressed the time-consuming nature of data transfer.
  • Three respondents expressed concerns about the lack of a clear institutional plan covering how long and why the library would be archiving the digitized books.
  • The biggest challenge, according to one respondent, is the unproven state of preservation standards. This institution would like to be certified as a trusted digital repository, but at this point it does not consider certification possible “because the criteria are not realistic (as acknowledged by the group that developed and just revised them!).”
  • Three libraries cited as a key challenge the mischaracterization of mass digitization as preservation reformatting. They emphasized that the LSDIs were aiming at access, not preservation. One respondent noted that the resulting digital content may nevertheless meet some preservation needs.
  • One participant expressed concern about the quality of some items reformatted through mass digitization programs and noted that some of the digital content was not suitable to be used with evolving viewing technologies.
  • Two respondents mentioned the impact of LSDIs on traditional preservation and conservation efforts. They indicated that an LSDI may draw attention to preservation needs that were not being addressed through mass digitization.
  • Several respondents expressed concern about long-term financial challenges and the cost of the archival efforts.
  • Appraisal and selection issues and the cost-effectiveness of maintaining duplicate copies of digitized content, especially given the current financial climate and competing priorities, were additional topics of concern.
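The difficulty of sizing storage in advance can be made concrete with a rough projection. The Python sketch below extrapolates the 4–8 gigabytes per day reported by one library. The 365-day production year, the decimal unit conversion, and the function name are illustrative assumptions, not survey findings, and the result ignores backup copies and derivative files.

```python
def projected_storage_tb(gb_per_day: float, days: int) -> float:
    """Return cumulative storage in decimal terabytes (1 TB = 1000 GB).

    Assumes a constant daily output and a single stored copy; both are
    simplifying assumptions made for illustration only.
    """
    if gb_per_day < 0 or days < 0:
        raise ValueError("gb_per_day and days must be non-negative")
    return gb_per_day * days / 1000.0


# Range reported by one respondent: 4 to 8 gigabytes per day.
low_estimate = projected_storage_tb(4, 365)
high_estimate = projected_storage_tb(8, 365)
```

Under these assumptions, a single copy grows by roughly 1.5 to 3 terabytes per year; each additional replica multiplies that figure.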

4. Technical Requirements for Digitization

When asked to share imaging specifications (e.g., resolution, bit depth, file format, use of image-quality targets) for the digital copies they will archive, six libraries declined to provide information because of confidentiality obligations. Several libraries participating in the Google Initiative said that they have the “same specifications as for all other Google partners.” Among the eight libraries that were able to provide information, one described its requirement as 600 dpi, 1-bit TIFF; the rest characterized their technical parameters as 300–400 dpi, 8–12 bit JPEG or JPEG2000.
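To compare the reported specifications, it can help to estimate the uncompressed size of a single page image. The following Python sketch is illustrative only: the 6 x 9 inch page is a hypothetical trim size, not a survey finding, and the calculation covers raw pixel data alone, ignoring file headers, row padding, and the compression that JPEG and JPEG2000 apply.

```python
def uncompressed_bytes(width_in: float, height_in: float,
                       dpi: int, bits_per_pixel: int) -> int:
    """Estimate raw pixel data size in bytes for one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Round up to a whole number of bytes.
    return (total_bits + 7) // 8


# Hypothetical 6 x 9 inch page (an assumption made for illustration).
bitonal_600 = uncompressed_bytes(6, 9, 600, 1)    # 2,430,000 bytes
gray_300 = uncompressed_bytes(6, 9, 300, 8)       # 4,860,000 bytes
gray_400 = uncompressed_bytes(6, 9, 400, 8)       # 8,640,000 bytes
```

On this assumed page size, a 600 dpi bitonal image holds about half the raw data of a 300 dpi, 8-bit image, even though its spatial resolution is twice as high.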

The respondents were also asked to provide information about metadata standards used for description, structuring, and preservation. Of 10 libraries providing information, all listed MARC or MARC XML as their primary standard for descriptive metadata. Seven of these libraries are also using METS and are considering including MODS descriptive records. Three libraries are capturing MIX using JHOVE.124

5. Quality Control

One section of the questionnaire concerned inspecting the quality of digitized images received from a vendor. Eleven libraries indicated that they have a quality control (QC) strategy in place for this purpose. However, with one exception, they characterized their QC programs as evolving and noted the challenges faced because of the ambitious scale of digitization and limited resources. Their comments revealed a wide range of QC implementations, depending on institutional resources and initiative parameters. For example, one respondent said that his library inspects approximately five percent of newly digitized books for image quality and checks all files to ensure that they open. Checksums are run as files are transferred to other media. Most of the responding libraries described their QC efforts as “small-sample based” and referred to their QC processes as “spot checks.” Three institutions did not provide information about their quality control programs because of nondisclosure agreements.
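A spot-check routine of the kind described above might be sketched as follows in Python. This is a minimal illustration, not any respondent's actual workflow: the function names are hypothetical, SHA-256 is an assumed choice of checksum (the survey does not name an algorithm), and the five percent default simply mirrors the one figure reported.

```python
import hashlib
import random


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_path, copy_path):
    """Return True if the copy has the same checksum as the source."""
    return sha256_of_file(source_path) == sha256_of_file(copy_path)


def select_qc_sample(book_ids, fraction=0.05, seed=None):
    """Pick a random sample of books for manual image-quality review.

    At least one book is selected whenever the batch is non-empty.
    """
    ids = sorted(book_ids)
    if not ids:
        return []
    sample_size = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, sample_size)
```

For a batch of 200 books, `select_qc_sample` with the default fraction returns 10 identifiers for visual inspection, while `transfer_is_intact` can be applied to every file moved to other media.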

When asked about the procedures for handling images or other associated deliverables of unacceptable quality, such as optical character recognition (OCR) files, six respondents (all Microsoft and/or Open Content Alliance participants) indicated that the digital objects were sent back to the digitization service provider for correction. Three libraries recorded problems (two shared this information with the imaging center) but did not ask the service provider to make corrections. Five respondents said that they were either in the process of making decisions on this issue or that they could not share the information because of confidentiality obligations.

6. Condition of Materials

One survey question aimed to elicit respondents’ experience with respect to the physical condition of materials during digitization. Nine institutions checked “no or minimal damage,” four had no opinion, and one respondent expressed concern about the level of damage. Some institutions that reported minimal harm noted that damage was not more than that experienced through normal use. One library conducted a pre- and post-condition survey early on and found no damage or minimal damage to its materials. The library that expressed concern stated that some books would be prone to damage regardless of how carefully they were handled. In keeping with curators’ or preservation librarians’ decisions, some libraries disbind books in which the text runs into the gutter and books that cannot be opened 180 degrees.

7. Completeness of Digitization Process

Asked whether they were tracking information about the completeness of the digitization process (e.g., missing pages, undigitized foldouts), six libraries replied that they were considering recording such information. Five libraries already had a system in place to capture such information, and one of them described an ongoing inventory database development effort to record why books were rejected for scanning. This database will also support collection development efforts. Another respondent recognized that a certain percentage of books with errors and missing pages will be discovered only upon access by users and questioned how corrections will be managed for requests received from users. Three libraries were not able to share information because of nondisclosure terms.


FOOTNOTES

123 Because of the geographically distributed nature of the Million Book Project, the survey did not include the MBP participants. Several MBP project partners contribute only to the digital library research and development agenda. The MBP-related information presented in Section 2.3.2 of the paper was provided by the Carnegie Mellon University Libraries.

124 Information about the metadata standards referenced in this section is available at Standards at the Library of Congress: http://www.loc.gov/standards/.