Acknowledgments: The author wishes to thank Taylor Surface, Pam Kircher, Leah Houser, and Linda Evers for their contributions.
This paper reports on four aspects of the Online Computer Library Center’s (OCLC’s) current activities in digital preservation. Section 1 discusses recent strategic integration at OCLC to support digital preservation initiatives. Section 2 describes digital preservation activities of the Digital and Preservation Resources (DPR) centers, which are creating digital masters for the library community. Section 3 outlines technical considerations associated with building a Digital Archive, and Section 4 provides a list of activities in which OCLC plans to engage with the digital preservation community.
Section 1: Strategic Integration
In September 2000, OCLC’s Board approved a strategic plan under which libraries and OCLC will transform WorldCat from a bibliographic database and online union catalog into a globally networked information resource of text, graphics, sound, and motion. The rebirth of WorldCat in Oracle will create a global knowledge base supported by a set of integrated, Web-based tools and services that facilitate contribution, description, discovery, access, exchange, delivery, and preservation of knowledge objects as well as the expertise of participating institutions.
To realize this strategy, three of OCLC’s primary business units are developing new products and services while enhancing current offerings. These units are Metadata and Cataloging Services, Cooperative Discovery Services, and DPR.
Formalized in November 2001, DPR is OCLC’s newest division and the topic of this paper. Today, 35 years after OCLC founder Frederick Kilgour’s vision of pooling library resources began to be realized, DPR has taken on the task of building on his model. Our vision is to extend the OCLC cooperative to support the challenges of creating and sustaining access to and preservation of the global knowledge base’s contents.
At present, DPR is home to three major initiatives:
- expanding Preservation Resources, a state-of-the-art preservation reformatting facility in Bethlehem, Pennsylvania, into regional DPR centers
- building a Digital Archive
- launching and supporting growth of the DPR Cooperative
This report focuses on two of these initiatives: the expansion of Preservation Resources’ capabilities into regional DPR centers and the construction of the Digital Archive.
Digital preservation takes place within a continuum, ideally starting from the point of digital-object creation, at a DPR center or elsewhere, and continuing through the processes involved in the long-term retention of those objects. DPR centers and the Digital Archive are two segments of the digital preservation continuum that have begun to converge as a result of OCLC’s new business direction. That continuum is supported by the integration of infrastructure, metadata, and processes. If any of these three elements does not extend from one entity to another, preservation is not possible. The assumption is that a distributed, interoperable environment is the only viable approach to digital preservation. Digital preservation activities will occur both in the DPR centers and in the Digital Archive to support and reinforce the concept of preservation as a continuum of interdependent activities.
Section 2: Digital Mastering
With the creation of regional DPR centers, we are building on Preservation Resources’ 15-plus years of experience with preservation microfilming. That translates into harnessing technology, skills, and processes for library-specific applications by creating a cost-effective, high-quality “digital factory” geared to meet the cultural heritage community’s needs for digital preservation reformatting. We will maintain a test bed environment to experiment with digital imaging and metadata application processes in order to identify best practices, build tool sets, and anticipate future needs.
Recognizing the unique nature of materials in the information-services and cultural-heritage communities, Preservation Resources adapted commercially available technology to meet the needs of these communities. Adaptation in this case is expensive and somewhat difficult because the need is quite specific and, compared with that of the imaging industry as a whole, relatively small. As Kenney and Rieger state, “Determining how to digitize and present library materials involves a fairly complex decision-making process that takes into consideration a range of issues, beginning with the nature of the source document but encompassing user needs, institutional goals and resources, and technological capabilities. These all map together as a matrix for making informed decisions rather than exacting standards” (2000, 24).
Preservation Resources and DPR strive to support preservation librarians as they work through this complex decision-making process. The DPR centers are also developing the infrastructure with which to support the following assertion from the “Benchmark for Digital Reproductions of Monographs and Serials,” endorsed by the Digital Library Federation (2002): “Digital masters are digital objects that are optimally formatted and described with a view to their quality (functionality and use value), persistence (long-term access), and interoperability (e.g., across platforms and software environments).”
Complying with this benchmark requires a scanning environment capable of creating an accurate digital representation of our printed heritage. Consequently, we have determined that DPR centers must have three capacities. First, they must have material-handling skills to recognize bibliographic anomalies and other cultural representations. Second, they must have the technology with which to meet or exceed accepted quality standards. Finally, they must be able to engage in cost-effective standard setting for the broader commercial-service community.
Our challenge is to support this sophisticated scanning environment with technology and personnel to produce comparable levels of quality for various cultures, languages, and types of materials in DPR centers worldwide.
Preservation metadata are any metadata used by an institution that is carrying out some form of digital preservation. Preservation metadata could include discovery, administrative, and structural metadata. Structural metadata should be sufficiently detailed to allow reconstruction of the sequence of the original artifact, a point that is being addressed as an addendum to the DLF benchmark (2002). More commonly, though, the term preservation metadata is applied to metadata serving either of two functions:
- enabling preservation managers to take appropriate action to preserve a digital object’s bit stream over the long term; or
- ensuring that the content of the archived object can be rendered and interpreted.
Integrating digital preservation activities into the larger digital information life cycle and its associated workflows depends heavily on creating preservation metadata early in the process. Digital masters for digitally reformatted monographs and serials must have descriptive, structural, and administrative metadata, and the metadata must be made available in well-documented formats. OCLC is likely to adopt the Metadata Encoding and Transmission Standard (METS) and create tools to apply METS in DPR centers at the point of digital-object creation.
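As a rough illustration of what creating such metadata at the point of digital-object creation might look like, the following sketch wraps a scanned volume in a METS-style XML document. It is not OCLC's tool and has not been validated against the METS schema. The function names, the sample identifiers, and the choice of a Dublin Core title as the only descriptive element are assumptions made for the example, and administrative metadata are omitted for brevity. The structural map records page order so that the sequence of the original artifact can be reconstructed.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_mets(object_id, title, page_files):
    """Return a METS-style XML string for a scanned volume.

    page_files is an ordered list of (file_id, href, mime_type) tuples,
    one per page image, in the reading order of the original artifact.
    """
    ET.register_namespace("mets", METS_NS)
    ET.register_namespace("xlink", XLINK_NS)
    ET.register_namespace("dc", DC_NS)
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": object_id, "LABEL": title})

    # Descriptive metadata: a single Dublin Core title (illustrative only).
    dmd = ET.SubElement(mets, f"{{{METS_NS}}}dmdSec", {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, f"{{{METS_NS}}}mdWrap", {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, f"{{{METS_NS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC_NS}}}title").text = title

    # File inventory: one <file> per page image.
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp", {"USE": "master"})
    for file_id, href, mime_type in page_files:
        f = ET.SubElement(grp, f"{{{METS_NS}}}file",
                          {"ID": file_id, "MIMETYPE": mime_type})
        ET.SubElement(f, f"{{{METS_NS}}}FLocat",
                      {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": href})

    # Structural metadata: ORDER preserves the page sequence.
    smap = ET.SubElement(mets, f"{{{METS_NS}}}structMap", {"TYPE": "physical"})
    volume = ET.SubElement(smap, f"{{{METS_NS}}}div", {"TYPE": "volume"})
    for order, (file_id, _href, _mime) in enumerate(page_files, start=1):
        page = ET.SubElement(volume, f"{{{METS_NS}}}div",
                             {"TYPE": "page", "ORDER": str(order)})
        ET.SubElement(page, f"{{{METS_NS}}}fptr", {"FILEID": file_id})

    return ET.tostring(mets, encoding="unicode")


def page_sequence(mets_xml):
    """Rebuild the ordered list of page image locations from the wrapper."""
    ns = {"mets": METS_NS}
    root = ET.fromstring(mets_xml)
    hrefs = {}
    for f in root.iterfind(".//mets:file", ns):
        hrefs[f.get("ID")] = f.find("mets:FLocat", ns).get(f"{{{XLINK_NS}}}href")
    pages = root.findall(".//mets:structMap/mets:div/mets:div", ns)
    pages.sort(key=lambda d: int(d.get("ORDER")))
    return [hrefs[p.find("mets:fptr", ns).get("FILEID")] for p in pages]
```

Calling page_sequence on the output of build_mets returns the image locations in their original reading order, which is the property the structural metadata are meant to guarantee.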
To that end, staff at the DPR Center in Bethlehem, Pennsylvania, have created programs to automate population of the TIFF header with preservation metadata and are reviewing the NISO standards for still images. The issue of preservation metadata was addressed earlier in defining the Digital Archive system architecture (see Section 3 of this paper), but further work is clearly needed (see Section 4).
Our community has much work ahead to develop processes at the point of digital-object creation that will support persistence. One area we are investigating is the high-volume, low-cost application of an authentication process at the point of creation. We may define authentication as a means for ascertaining that the digital material is what it purports to be and has not been altered since its creation.
Without data security, preservation is compromised. However, with the powerful flexibility of digital formats comes the ability to alter the original with ease and without detection. We are considering how to cost-effectively implement an authentication mechanism and are engaged in discussions about how to license and adapt third-party authentication software to our community’s requirements.
The software we are evaluating functions basically as a digital notary public. The creator of a digital object uses the software to add a digital signature and time stamp to the object. That information is sent to the authentication software company for long-term retention. Future users can verify the security of the digital object by sending the registration information to the software company, which will determine whether it matches the original signature and time stamp on file. This service also records changes of ownership, further verifying its authenticity and providing a means of digital provenance. We will begin conducting an authentication pilot project with two library partners in the summer of 2002.
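The notary workflow described above can be sketched in a few lines. This is an illustrative model only and does not represent the third-party product under evaluation: the class and method names are hypothetical, and a SHA-256 fingerprint stands in for a true public-key digital signature. It shows the three ideas in the text: registering a fingerprint and time stamp at creation, verifying a later copy against the record on file, and recording changes of ownership as provenance.

```python
import hashlib
import hmac
from datetime import datetime, timezone


class NotaryRegistry:
    """Hypothetical stand-in for the authentication service's long-term store."""

    def __init__(self):
        self._records = {}  # object id -> registration record

    def register(self, object_id, content, owner):
        """Record a fingerprint and time stamp for a newly created object."""
        digest = hashlib.sha256(content).hexdigest()
        stamp = datetime.now(timezone.utc).isoformat()
        self._records[object_id] = {
            "digest": digest,
            "registered_at": stamp,
            "owners": [owner],
        }
        # The receipt is what the creator keeps alongside the object.
        return {"object_id": object_id, "digest": digest, "registered_at": stamp}

    def transfer(self, object_id, new_owner):
        """Append a change of ownership to the provenance chain."""
        self._records[object_id]["owners"].append(new_owner)

    def verify(self, object_id, content):
        """Return True only if the content matches the fingerprint on file."""
        record = self._records.get(object_id)
        if record is None:
            return False
        digest = hashlib.sha256(content).hexdigest()
        return hmac.compare_digest(digest, record["digest"])

    def provenance(self, object_id):
        """Return the recorded owners, oldest first."""
        return list(self._records[object_id]["owners"])
```

Because even a one-byte change produces a different fingerprint, verify returns False for any altered copy, which is the property that makes undetected alteration difficult.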
Section 3: Building a Digital Archive
A logical extension of various OCLC and Preservation Resources services is the construction of a Digital Archive. This activity focuses OCLC’s longstanding strengths in research, software development, and cooperative work on the preservation mission.
Preservation Resources staff and computer scientists in OCLC’s Office of Research have been working together since 1995 to understand better what is required for long-term digital preservation, both from a user perspective and a scalable, maintainable systems perspective. The current project to build the Digital Archive’s infrastructure, metadata, and processes began in January 2001 and will proceed in multiple phases, as shown in figure 1.
Fig. 1. The Digital Archive is being built in multiple phases
The project has three major goals:
- To build a general-purpose digital archive for libraries, archives, and museums that may be used to store a variety of types of information and upon which various products and services may be built
- To identify workflows for capturing and managing digital objects; and
- To implement a metadata set for the archived objects
When Phase 1 is completed in May 2002, the system will facilitate the capture of Web documents, creation of preservation metadata for digital objects, ingestion of objects into the Digital Archive, and long-term retention of these digital information assets. This phase has two limitations: object formats are restricted to text and still images, and objects can be ingested into the archive only one at a time. It does, however, have a set of tools that enable users to manage a complex workflow involving selection, cataloging, and archiving. The user can also generate a copy of the metadata and the object for in-house storage and dissemination.
Viewers will see objects in the Digital Archive by clicking on a URL in a bibliographic record in WorldCat, which they will access through FirstSearch, CORC, or a local catalog. They will also be able to access the Digital Archive by typing its URL into a Web browser.
Object owners will control access to their objects by creating content groups and related authorization groups. They will be able to delete their objects from the archive as well. For users familiar with OCLC’s CORC and FirstSearch interfaces, the system will be easy to use; however, the harvest software, the archive-object viewing interface, and the administration module have new interfaces.
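One way to picture the relationship between content groups and authorization groups is the small model below. It is a sketch under assumptions, not the Digital Archive's administration module: the class, its methods, and the rule that an owner can always view his or her own objects are invented for illustration.

```python
class ArchiveAccess:
    """Toy model of owner-controlled access via content groups."""

    def __init__(self):
        self._owner_of = {}    # object id -> owner
        self._group_of = {}    # object id -> content group name
        self._authorized = {}  # (owner, content group) -> set of viewers

    def deposit(self, owner, object_id, content_group):
        """Place an object in one of the owner's content groups."""
        self._owner_of[object_id] = owner
        self._group_of[object_id] = content_group
        self._authorized.setdefault((owner, content_group), set())

    def authorize(self, owner, content_group, viewer):
        """Add a viewer to the authorization group for a content group."""
        self._authorized[(owner, content_group)].add(viewer)

    def can_view(self, user, object_id):
        """Owners always see their objects; others need authorization."""
        owner = self._owner_of.get(object_id)
        if owner is None:
            return False
        if user == owner:
            return True
        return user in self._authorized[(owner, self._group_of[object_id])]

    def delete(self, user, object_id):
        """Only the owner may remove an object from the archive."""
        if self._owner_of.get(object_id) != user:
            raise PermissionError("only the owner can delete this object")
        del self._owner_of[object_id]
        del self._group_of[object_id]
```

Granting access at the level of the content group rather than the individual object means that an owner sets permissions once and every object later added to that group inherits them.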
Our decision to focus initially on Web documents was influenced by earlier work with the U.S. Government Printing Office (GPO) on a digital project. Having expressed the need to improve capture of Web-based government documents for long-term retention, the GPO was willing to work with us to define high-level user requirements for this data format.
As the project has progressed, we have involved other interested parties, mostly state libraries whose needs are similar to those of the GPO. Since 2001, the GPO has been joined by Ohio’s Joint Electronic Records Repository Initiative, which includes the State Library of Ohio, the Ohio Historical Society, the Ohio Supercomputing Center, and the State of Ohio Department of Administrative Services; the Connecticut State Library; the Library of Michigan; Arizona State Library, Archives, and Public Records; and the University of Edinburgh, Scotland. Staff members from these institutions have met with us, commented on prototypes and workflows, provided input regarding the metadata element set, and participated in interface usability testing.
Preservation Metadata for the Digital Archive
Characteristics of objects and user groups are major factors in metadata decisions and in the tools created to support the metadata-creation process. The first objects in the OCLC Digital Archive will be born-digital and mostly public-domain government documents published on the Web and consisting of text and still images presented in HTML, PDF, JPEG, GIF, BMP, TIFF, and ASCII text formats.
Phase 1 users are mainly viewing objects created by others. As a result, they may not know, or may be unable to obtain, preservation metadata elements such as the recommended hardware for rendering an object. Our users also want to integrate selection, capture, cataloging, and archiving into a streamlined workflow. Finally, users want this integrated workflow to be as seamless as possible so that current staff can ingest objects and their preservation metadata into the archive efficiently.
Consequently, we have created new tools to make metadata creation easier, using as our foundation CORC, OCLC’s tool set for creating descriptive metadata for electronic objects. CORC now supports a preservation metadata record that can be populated with data from a bibliographic record and updated with preservation data extracted from objects by the archive. Users may also enter data manually. We have also created a new harvester that launches from CORC and that uses tools within Oracle9i FS to extract technical information about the object. Finally, we are building a management module to enable users to assign objects to content groups and then specify access to that group.
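To make the idea of machine-extracted technical metadata concrete, the sketch below identifies the Phase 1 formats from the leading bytes of a file and records a few basic technical properties. It does not use or imitate the Oracle9i FS tools mentioned above. The function names and the returned fields are assumptions, and the signature checks are simple heuristics (for example, a text file that happens to begin with "BM" would be misreported as BMP).

```python
import hashlib


def identify_format(data: bytes) -> str:
    """Guess the format of an object from its leading bytes."""
    if data.startswith(b"%PDF-"):
        return "PDF"
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "GIF"
    if data.startswith(b"BM"):
        return "BMP"
    if data.startswith((b"II*\x00", b"MM\x00*")):
        return "TIFF"
    # HTML has no fixed signature, so look for a typical opening tag.
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "HTML"
    # Anything left that uses only 7-bit bytes is treated as plain text.
    if data and all(byte < 0x80 for byte in data):
        return "ASCII text"
    return "unknown"


def technical_metadata(data: bytes) -> dict:
    """Collect technical properties that need no human input."""
    return {
        "format": identify_format(data),
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Values extracted this way could pre-populate a preservation metadata record, leaving staff to supply only the elements that a machine cannot determine.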
OCLC staff kept two factors in mind when determining what preservation metadata elements would be needed in the first phase of the Digital Archive:
- user requirements
- object types
Some of the questions we asked ourselves were:
- What metadata are needed for these object types (i.e., Web documents)?
- When are the metadata available, and to whom?
- How are metadata captured, extracted, or created? By people or by a machine?
- How are the objects going to be accessed and by whom?
In answering those questions, we sought a balance among three elements:
- preservation and maintenance of access to an object
- what users can create practically
- what the archive can extract or create
The Digital Archive’s preservation metadata set is being developed by an OCLC team whose work is informed by the OCLC/RLG Working Group on Preservation Metadata as well as by other digital preservation initiatives. A report from the working group recommends a preservation metadata set and is available for review and comment (OCLC 2002).
When we compared the OCLC preservation metadata set with other element sets such as CEDARS or METS, we found that the convergences and issues for discussion were similar to the findings reported by the OCLC/RLG Working Group in its first white paper. To summarize those findings, the points of convergence are as follows:
- based on the Open Archival Information System (OAIS) reference model
- prescribes metadata for preservation
- able to extend the use of the archive to other object types
Issues for Discussion
- Scope: We are dealing with born-digital Web documents; other projects are dealing with converted materials or other formats.
- Granularity: We must determine at what level the metadata need to be assigned: logical object, file, or both.
- Interoperability: This is an open question, but we are using an XML wrapper; communication with other groups is key.
- Implementation: While differences in implementation may not change how well the object is preserved, they may drive what tools are created for an archiving workflow or for accessing an object.
The OAIS Reference Model
Critical to enabling interoperability with other digital archives is our compliance with the OAIS reference model, which merits a brief explanation here. The International Organization for Standardization will soon publish the OAIS standard as ISO 14721:2002 (Garrett 2002).
OAIS grew out of the need of NASA and other national space agencies to capture, access, and store vast quantities of digital information for the long term. While the details of the reference model are complex, the overall concept is a straightforward sequence of input-process-output.
Fig. 2. The OAIS model has six functional areas and three types of information packages
Figure 2 depicts the conceptual relationships of the six functional areas and the three variations of information packages (Sawyer 2002). The sequence proceeds as follows:
- A producer provides a submission information package (SIP) to the Ingest entity.
- An archival information package (AIP) is created and delivered to Archival Storage.
- Related descriptive information is provided to Data Management.
- A consumer searches for and requests information using appropriate descriptive information and access aids.
- The appropriate AIP is retrieved from Archival Storage and transformed by the Access entity into the appropriate dissemination information package (DIP) for delivery to the consumer.
- Activities are carried out under the guidance of the Administration entity.
- Preservation strategies and techniques are recommended by Preservation Planning and put in place by the Administration entity.
The three types of information packages are also shown in Figure 2:
- A producer submits a SIP to the OAIS.
- The OAIS holds and preserves the information using AIPs.
- In response to consumer queries and resulting orders, DIPs are returned.
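The input-process-output sequence just described can be traced in a minimal model. This sketch is only an aid to reading the reference model and is not the Digital Archive's implementation. All class and field names are assumptions, the Administration and Preservation Planning entities are omitted, and the fixity check on access is an addition for illustration rather than something the sequence above prescribes.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SIP:
    """Submission information package supplied by a producer."""
    producer: str
    content: bytes
    descriptive: dict


@dataclass
class AIP:
    """Archival information package held in Archival Storage."""
    aip_id: str
    content: bytes
    descriptive: dict
    fixity_sha256: str


@dataclass
class DIP:
    """Dissemination information package delivered to a consumer."""
    aip_id: str
    content: bytes
    descriptive: dict


class MiniOAIS:
    def __init__(self):
        self.archival_storage = {}  # aip id -> AIP
        self.data_management = {}   # aip id -> descriptive information

    def ingest(self, sip):
        """Ingest: turn a SIP into an AIP and register its description."""
        aip_id = f"aip-{len(self.archival_storage) + 1}"
        descriptive = dict(sip.descriptive, producer=sip.producer)
        fixity = hashlib.sha256(sip.content).hexdigest()
        self.archival_storage[aip_id] = AIP(aip_id, sip.content, descriptive, fixity)
        self.data_management[aip_id] = descriptive
        return aip_id

    def search(self, **criteria):
        """Data Management: find AIPs whose description matches all criteria."""
        return [
            aip_id
            for aip_id, info in self.data_management.items()
            if all(info.get(key) == value for key, value in criteria.items())
        ]

    def access(self, aip_id):
        """Access: retrieve the AIP, check fixity, and return a DIP."""
        aip = self.archival_storage[aip_id]
        if hashlib.sha256(aip.content).hexdigest() != aip.fixity_sha256:
            raise ValueError("fixity check failed for " + aip_id)
        return DIP(aip_id, aip.content, dict(aip.descriptive))
```

Keeping the three package types as distinct classes mirrors the model's point that what a producer submits, what the archive preserves, and what a consumer receives need not be identical.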
It may be useful to describe how we are implementing an OAIS-compliant AIP in the Digital Archive. Here is a list of the components we incorporated in the system architecture for the AIP:
- 100 percent Java application developed by OCLC using JavaBeans, Enterprise JavaBeans, and JavaServer Pages
- IBM AIX (UNIX) servers
- Oracle9i software
Oracle’s content management products provide us with middleware, which we are using as a foundation for the Digital Archive. This middleware provides a Java abstract interface to the data repository for insertion, deletion, and other manipulation of content objects and metadata, including
- Object-level rights via ACL and ACE lists
- Fully extensible object-oriented database schema
- XML-enabled loading tools
- Extraction of structural metadata
- Metadata and future full-text searching
- HTTP, NFS, FTP, and Windows Explorer (SMB) protocol agents
The Digital Archive infrastructure builds on OCLC’s existing procedures, staff, and environmentally controlled computer rooms. OCLC’s experience in media migration keeps the bits alive as technology changes. Further, our experience with record conversions will help keep both the metadata and content viable. As with WorldCat and other OCLC databases, copies of the Digital Archive’s content and metadata will be stored securely in underground facilities off-site.
Every digital archive must plan for when vendors no longer support the tools with which it was built and is maintained and for when technological advances require a new system architecture. Consequently, our Digital Archive itself, like the data it holds, must have a planned migration path.
Under our plan for DPR’s Digital Archive, we will extract objects and metadata from the repository in a system-neutral format to allow reloading into a new architecture. For significant system upgrades, support staff may be required to move data around, thus demonstrating their ability to do so in the event of a completely new system architecture.
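A migration drill of this kind might look like the following sketch, which writes objects and metadata to plain files with a JSON manifest, reloads them, and confirms through checksums that nothing changed in transit. The file layout, the manifest fields, and the function names are assumptions chosen for illustration; the paper does not specify the system-neutral format OCLC will use.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def export_archive(objects, out_dir):
    """Write each object and its metadata to out_dir in a neutral layout.

    objects maps an object id to a (content bytes, metadata dict) pair.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest = []
    for n, (object_id, (content, metadata)) in enumerate(sorted(objects.items()), start=1):
        filename = f"object-{n:06d}.bin"
        (out / filename).write_bytes(content)
        manifest.append({
            "id": object_id,
            "file": filename,
            "sha256": hashlib.sha256(content).hexdigest(),
            "metadata": metadata,
        })
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def reload_archive(in_dir):
    """Read an exported archive back, verifying every checksum."""
    src = Path(in_dir)
    manifest = json.loads((src / "manifest.json").read_text(encoding="utf-8"))
    objects = {}
    for entry in manifest:
        content = (src / entry["file"]).read_bytes()
        if hashlib.sha256(content).hexdigest() != entry["sha256"]:
            raise ValueError("checksum mismatch for " + entry["id"])
        objects[entry["id"]] = (content, entry["metadata"])
    return objects


def migration_drill(objects):
    """Export to a temporary directory and reload, as a rehearsal."""
    with tempfile.TemporaryDirectory() as tmp:
        export_archive(objects, tmp)
        return reload_archive(tmp)
```

If migration_drill returns a dictionary equal to the one passed in, the round trip preserved both content and metadata, which is the evidence such a rehearsal is meant to provide.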
Section 4: Next Steps
As indicated in figure 1, we divided construction of the Digital Archive into multiple phases. Phases 2 and 3 will address the challenges posed by historical newspapers and e-journals; Phase 4 will probably focus on flat-text and still images. We have not yet identified the sequence in which we will accommodate audio, video, and dynamic formats such as relational databases and interactive instructional materials.
Inherent in each phase will be subsets of activities in which we intend to engage with the digital preservation community. Among these activities are investigation and development in several areas, including criteria for selection and required preservation-service level, new opportunities for cooperative activity in digital preservation, economic sustainability for digital repositories with preservation responsibilities, digital-rights management, preservation strategies, metadata requirements, and standards work based on the OAIS Reference Model.
Current trends in information technology and the emerging capabilities with which to build a global knowledge base offer exciting opportunities for libraries. Toward this end, DPR and other OCLC divisions are creating the tools and services libraries need to provide economical preservation of and access to materials. The expansion of Preservation Resources into DPR centers around the world and the construction of a large-scale, OAIS-compliant Digital Archive are tangible evidence of that work.
The magnitude of this task exceeds matters of hardware and software. We are aware of the need to build collaborative channels through which our members and the broader community can conveniently inform and immediately benefit from our ongoing work. Toward that end, we have launched and will continue to sponsor the DPR Cooperative, a group of diverse organizations and individuals who have joined us in accepting the challenge of exploring new opportunities in digital preservation.
OCLC takes as a profound responsibility the need that libraries and other organizations have to preserve cultural memory. Thus, it is imperative that we demonstrate the ability to provide a sustainable approach to long-term digital preservation and a commitment to do so with and for our community. This paper has described a methodology for expanding existing offerings and building new ones under a proven cost-recovery model. While these offerings will undergo transformations, we are building them with the belief that users in centuries to come will find our early collaborative efforts in digital preservation to have been worthwhile.
References
All URLs were valid as of July 10, 2002.
Digital Library Federation. 2002. Benchmark for Digital Reproductions of Monographs and Serials as Endorsed by the DLF. Available at http://www.diglib.org/standards/bmarkfin.htm.
Garrett, John (Web page curator). 2002. ISO Archiving Standards - Reference Model Papers. Available at http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html.
Kenney, Anne R., and Oya Y. Rieger, editors and principal authors. 2000. Moving Theory into Practice: Digital Imaging for Libraries and Archives. Mountain View, Calif.: Research Libraries Group.
OCLC. April 2002. A Recommendation for Preservation Description Information: A Report by the OCLC/RLG Working Group on Preservation Metadata. Available at http://www.oclc.org/research/pmwg/pres_desc_info.pdf.
Sawyer, Donald M. 2002. Framework for Digital Archiving: OAIS Reference Model. Presentation delivered at the OCLC Steering by Standards Teleconference on the OAIS Imperative: Enduring Record or Digital Dust? Columbus, Ohio, April 19, 2002.