APPENDIX 2 Digital Preservation in the United States: Survey of Current Research, Practice, and Common Understandings • CLIR

Daniel Greenstein, Digital Library Federation
Abby Smith, Council on Library and Information Resources
March 2002

Libraries and archives have long preserved significant parts of the published and unpublished record. They do this to ensure that the information in those records will be available to those who need it. Preservation has always been seen as a necessary condition for access. When information is recorded on paper and other analog media, the major challenges to preservation are posed by the fragility of the medium and by the costs of providing suitable storage, which are often high.

In the United States, preservation has traditionally been a distributed activity. Each library or archives is responsible for maintaining the accessibility of its own holdings, for its own users. Together, these individual collections constitute the national collection. The materials have traditionally been used on-site, although they may be loaned to other institutions through lending agreements that are designed, in part, to protect the artifact being lent. Sharing of resources occurs through reformatting (onto microforms, through preservation photocopying, and so forth). But in each case, the physical artifacts are assets that belong to the library or archives. The information in these artifacts may or may not belong to the institution; in fact, rarely are intellectual property rights given to the repository in which the materials are held. In the analog realm, fulfilling preservation responsibilities has entailed both meeting the information needs of (mostly onsite) users and protecting institutional assets. Preservation responsibilities are assumed upon the acquisition of a physical item and they continue through its life cycle.

These interests (preservation, physical possession or ownership, and access) are seldom as allied in the digital realm as they are in the world of analog media. The function of preservation for the purpose of providing physical or intellectual access does not fall automatically to an institution through the agency of physical ownership. The stakeholders in digital preservation often come from the same sectors as do stakeholders in the analog realm. They include creators, distributors or publishers, repositories or libraries and archives, and users. But these stakeholders may play very different roles in the digital realm than they do in the analog realm, roles that can put them in conflict with one another in areas where their interests once were parallel. Digital stakeholders can also create new alliances of interests.

One critical challenge to digital preservation in the near term is technical: the rapid rate at which hardware and software become obsolete means that information written in a specific code to run on specific hardware may be stranded by the adoption of newer, better code and hardware. This is the problem facing individuals who want to read an early version of a Lotus 1-2-3 spreadsheet that they have on a 5-1/4-inch disk they used to run on an IBM PC. The implication is that decisions about selection for preservation that can be deferred in the analog realm must be addressed early in the life cycle of digital files.

This paper summarizes activities under way in the United States that are designed to address the variety of preservation challenges (technical, legal, and social) and the changing roles and responsibilities of preservation stakeholders. It is divided into the following major sections:

Common understandings among stakeholders describes the agreements that exist among those who take an interest in the long-term management of digital information.
Practical preservation activity reports real archiving efforts and the circumstances under which they have emerged.
Experimental preservation activity discusses significant practical experimentation in data archiving.
Preservation research sets forth key areas for focused research and presents examples of projects in those areas.

Common Understandings Among Stakeholders

Limited but highly influential agreements about key issues exist among those who take an interest in the long-term management of digital information, interests that are intrinsically, if at times confusingly, interrelated. Those who create or publish such information, those who wish to use the information, and those who act as archival repositories for it all have a stake in maintaining digital assets over time. They often have different purposes in mind when they speak of making the information accessible in the future, but they share the conviction that such longevity is highly desirable.

The interests of the creator or distributor, user, and repository are interrelated because each group has a formative influence over whether, how, and at what cost digital information will be made accessible over the long term. The first decisive factor is how digital information is created and distributed. This may determine whether, how, and at what cost the information can be preserved and made accessible to users over time. The choice of some formats may make it more difficult to manage the digital object and ensure future, or even current, access. The selection of simple or standard formats (e.g., PDF files, TIFF images, or ASCII text) can simplify certain storage issues.

Another deciding influence is how, to whom, and under what terms or conditions archived digital information is to be distributed. This will determine how, by whom, and at what cost that information is created, distributed, and accessioned into an archive. Accordingly, preservation practice usually represents some ongoing negotiation between creators or publishers, archives, and users. Each stakeholder makes choices that can influence the long-term accessibility of a digital asset. The Inter-university Consortium for Political and Social Research (ICPSR), for example, was designed to ensure long-term access to important social science research data sets. This membership organization states that “to ensure that data resources are available to future generations of scholars, ICPSR preserves data, migrating them to new storage media as changes in technology warrant” (ICPSR, no date). To support its activity, ICPSR has a sustainable, mission-driven business model, and it defines criteria for data entry, use, and preservation within the framework of that model. It has worked successfully for 40 years.

Stakeholders have reached a common understanding about what constitutes a trusted digital repository and what activities the repository must routinely undertake, even though the way in which some of the basic preservation functions will be undertaken remains uncertain. A viable digital archival repository must have several attributes. For example, it must be explicit about what digital information it preserves, why, and for whom. It also must be clear about the attributes of the archived information it intends to preserve. It must offer services that meet the minimum requirements of data creators and users. It must be prepared to negotiate and accept deposits of appropriate digital information from those who create or distribute that information, and the terms of those negotiations must be clear to all. The repository must also obtain enough control of deposited information to ensure its long-term preservation; this responsibility may include gaining access to data in order to check on their integrity while protecting those same data from access by unauthorized parties. The repository must make information available to users under conditions negotiated and agreed on with depositors. Finally, given the rapidly changing technological environment in which the repository will take in and tend to digital information, it must seek new solutions as technology evolves.

Another area of common understanding is the emergence of the Open Archival Information System (OAIS) as the standard reference model. This model supplies a conceptual framework for discussing and describing archival practice. OAIS articulates the roles and interrelationships of the three groups that have a key stake in digital process, i.e., creator or distributor, user, and repository. The reference model identifies preservation as a process that begins when digital information is created; this is a critical point of difference from the standard analog model, which considers preservation much later in the life cycle of an artifact. Finally, the OAIS model identifies the core functions and organizational features of a digital archival repository. This has influenced perceptions of what constitutes a trusted archives. OAIS is on the International Organization for Standardization (ISO) standards track and is the reference model of choice of those involved in digital preservation worldwide.

Today, there are four commonly understood technical approaches to digital preservation. These approaches are not mutually exclusive; indeed, there is an emerging consensus that all four approaches, and probably others not yet devised, will be deployed for the variety of digital object types and the demands for access to them.

Migration. In this approach, digital information is stored in software-independent formats. The information is reformatted as needed so that it can be accessed using current hardware and software. Most digital archival repositories rely almost wholly on data migration. It is doubtful that the strategy will work well for mixed media.

Technology preservation. Under this approach, data are preserved along with the hardware/software on which they depend. Given the variety of hardware and software platforms and the rate at which they change, this strategy generally is not believed to be economically viable. Still, many data rescue efforts (see Digital archaeology below) rely on the persistence of outmoded hardware and software.

Emulation. Often considered a form of technology preservation, emulation entails storing digital information alongside detailed information about how it looked, felt, and functioned in its original software/hardware environment. The look, feel, and functionality of the digital information are then “emulated” or re-created on successive generations of hardware/software. Emulation is particularly pertinent to mixed media. Individuals who are conducting research on the technical and economic viability of this approach include Jeff Rothenberg at the RAND Corporation and researchers at CAMiLEON. Emulation is in the exploratory phase; it has never been successfully used for preservation in a sustainable way.

Persistent object preservation. The opposite of migration, persistent object preservation (POP) entails explicitly declaring the properties (e.g., content, structure, context, presentation) of the original digital information that ensure its persistence. Of the strategies listed here, POP is the only one that starts with and remains focused on preserving the digital information from its inception. Other strategies attempt to counter or overcome the generic technical problem of obsolescence.
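As an illustration of the migration approach described above, the following minimal Python sketch re-encodes a text file from a legacy character encoding into UTF-8 and records what was done. It is a sketch under stated assumptions, not a description of any repository's actual practice: the function name `migrate_text`, the fields of the returned record, and the choice of UTF-8 and SHA-256 are all hypothetical choices made for this example.

```python
import hashlib
from pathlib import Path


def migrate_text(source: Path, target: Path, source_encoding: str) -> dict:
    """Re-encode a text file as UTF-8 and return a small migration record.

    Hypothetical helper for illustration only; the record fields are
    assumptions, not a published metadata standard.
    """
    raw = source.read_bytes()
    # Decoding raises UnicodeDecodeError on bytes that are invalid in the
    # stated source encoding, so a wrong encoding guess fails loudly.
    text = raw.decode(source_encoding)
    migrated = text.encode("utf-8")
    target.write_bytes(migrated)
    return {
        "source": source.name,
        "target": target.name,
        "source_encoding": source_encoding,
        "target_encoding": "utf-8",
        # Digests of both versions let a later reader confirm which
        # original a migrated copy was derived from.
        "source_sha256": hashlib.sha256(raw).hexdigest(),
        "target_sha256": hashlib.sha256(migrated).hexdigest(),
    }
```

The original file is left untouched, so the migration can be repeated or audited later. Real migrations of structured or mixed-media formats are considerably harder than this plain-text case.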

Another important technical approach merits mention: digital archaeology or data mining. Although not a preservation strategy as such, digital archaeology enables digital information to be rescued or recovered from disks, tapes, and other storage media that are no longer readable as a result of physical deterioration, neglect, obsolescence, or similar reasons.

To remain viable over the long term, digital information must be accompanied by appropriate documentation or metadata. Key preservation metadata initiatives are reviewed in a white paper by the Online Computer Library Center (OCLC) and the Research Libraries Group (RLG).¹
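One way to picture documentation that travels with a digital object is a small record that explicitly declares the four kinds of properties named above under persistent object preservation (content, structure, context, presentation). The Python sketch below is illustrative only. The class name `PreservationRecord`, the `identifier` field, and all sample values are hypothetical, and the layout does not follow any of the metadata schemes reviewed by OCLC and RLG.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreservationRecord:
    """Hypothetical declaration of an object's significant properties."""

    identifier: str    # assumed local identifier, not a DOI or PURL
    content: str       # what the object says or contains
    structure: str     # how the content is organized
    context: str       # who created it, when, and why
    presentation: str  # how it is meant to look to a user

    def to_json(self) -> str:
        # Plain JSON text keeps the record readable without special software.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Sample values are invented for illustration.
record = PreservationRecord(
    identifier="example-object-001",
    content="Quarterly survey responses",
    structure="One row per respondent, comma-separated",
    context="Collected by a hypothetical research group",
    presentation="Tabular, fixed column order",
)
```

Calling `record.to_json()` yields a text document that could be stored alongside the object it describes.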

Practical Preservation Activity

Several practical preservation efforts under way demonstrate the range of experience and expertise around the country.

Active preservation programs are under way in archives where preservation is often legally mandated. For example, the archives of national and state governments are legally bound to preserve selected records of government, including electronic records, in perpetuity. Business archives, such as those at financial, pharmaceutical, chemical, and other companies, may maintain records for legal and other reasons. Statutes of limitations often govern these mandates; consequently, such archives do not typically keep data in perpetuity as do government archives. These systems can be said to be more analogous to records management than to archiving; nonetheless, managing digital records even for seven years can provide technical challenges. Archives are also established at not-for-profit institutions, such as universities, that maintain records (including electronic records) for legal, business, and cultural reasons.²

Preservation is also under way in organizations in which data creators and producers perceive the long-term commercial value of digital information. Publishers such as Elsevier Science preserve the electronic scholarly journals they produce. The entertainment industry, most notably music and film companies, has large investments in digital assets that it wishes to reuse over time, and it has developed digital asset management systems tailored for its specific needs.

Preservation programs also are active in organizations that perceive a noncommercial value of digital information for use and reuse. Libraries, archives, and museums that digitize objects in their collections for online presentation, for example, may seek to maintain those objects over time rather than to rescan them as they become obsolete.

In places where data archives and systems vendors see commercial possibilities in the provision, supply, and support of long-term data storage facilities, preservation has become vital to commercial development. Data warehousing is a cottage industry with numerous related trade associations, exhibitions, and certification procedures. Data archives are beginning to emerge in the library community; for example, both OCLC and RLG are considering offering data archiving facilities on a cost-recovery basis.

Specific research communities, where data creators are also data users and where both groups recognize the importance of being able to reuse research data, undertake large-scale preservation of their intellectual assets. Both the ICPSR and the Roper Center preserve social science and government statistical data.

There are also major preservation activities in communities where data creators and data users recognize their interdependence and the value of the digital information in which they maintain a common interest. Through PubMed Central, the National Library of Medicine acts as a digital archival repository for medical publications and other medical information. Finally, archival repositories may be developed as a by-product of a commercial process. The Internet Archive is an archive of “snapshots” taken of selected Web pages by Alexa. An information company can use information gained from those snapshots for commercial purposes. Alexa assesses the visibility of Web pages by seeing who links into a site.

Experimental Preservation Activity

The InterPARES (International Research on Permanent Authentic Records in Electronic Systems) Project is a major international research initiative involving archival scholars, computer engineering scholars, and representatives of national archival institutions and private industry. Its goal is “to develop the theoretical and methodological knowledge essential for the permanent preservation of records generated electronically, and, on the basis of this knowledge, to formulate model policies, strategies, and standards capable of ensuring their preservation.” The InterPARES Project is investigating numerous issues in digital preservation, including the authenticity of electronic records.

The National Archives and Records Administration is developing a strategic and technical framework within which it may preserve in perpetuity selected electronic records of the federal government. It is closely involved with the InterPARES Project, the OAIS reference standard, the National Partnership for Advanced Computational Infrastructure led by the San Diego Supercomputer Center, and others. It is an international leader in research in selected areas, including requirements and processes for the preservation and reproduction of authentic records, development of the persistent archives method, application of advanced computing tools to records-management processes, and integration of digital preservation technologies with infrastructure technologies for e-government and e-business.

Under the auspices of The Andrew W. Mellon Foundation’s e-journal archiving program, seven major libraries (the New York Public Library and the university libraries of Cornell, Harvard, Massachusetts Institute of Technology [MIT], Pennsylvania, Stanford, and Yale) are engaged in planning digital archival repositories for different kinds of scholarly journals. Yale, Harvard, and Pennsylvania have worked with commercial publishers on archiving the full range of their electronic journals; Cornell and the New York Public Library have worked on archiving journals in specific disciplines. MIT’s project involves archiving “dynamic” e-journals (i.e., those that change frequently), and Stanford is investigating the development of archiving software tools under the auspices of its LOCKSS (Lots of Copies Keep Stuff Safe) program.

RLG and OCLC are jointly conducting preservation research. At present, their work focuses on the attributes of a digital archival repository and on preservation metadata.

The Andrew W. Mellon Foundation has invested in an investigation of emulation as a viable preservation strategy. Jeff Rothenberg at the RAND Corporation is conducting this research.

The IBM Almaden Research Center is investigating the possibility of using a universal virtual machine for digital preservation.

The University of Pennsylvania is conducting work on data provenance.

Preservation Research

There are currently nine areas of significant research into preserving digital files. They are:

1. Architecture and performance of archival repositories. Key research is under way at the San Diego Supercomputer Center, Stanford University, the National Archives and Records Administration, the Culpeper Center of the Library of Congress, Cornell University, Yale University, MIT, and Harvard University.
2. Persistent identification of and naming for archived information (e.g., International Digital Object Identifier [DOI], Persistent Uniform Resource Locator [PURL]).
3. Methods for recording and ensuring authenticity of archived information (digital signatures, watermarking, mechanisms for recording information about provenance). Determining the authenticity of a digital object is likely to require the use of techniques whose reliability is still being debated. Techniques appropriate to digital images may include digital signatures and watermarking. Checksums and other technical routines that produce message digests are appropriate for objects in virtually all formats. They help determine authenticity by analyzing the object’s structure and composition and whether it has been changed in any way since a particular benchmark point. Information may be found at:
   - Authenticity in a Digital Environment (CLIR 2000). Report of a group of experts convened by CLIR to address the question: What is an authentic digital object? https://www.clir.org/pubs/reports/pub92/contents.html
   - The importance of verifying the authenticity of an information object is well described in The Evidence in Hand: Report of the Task Force on the Artifact in Library Collections (CLIR 2001) https://www.clir.org/activities/details/artifact-docs.html
   - MD5 unofficial home page http://userpages.umbc.edu/~mabzug1/cs/md5/md5.html
   - On checksum, see http://www.checksum.org/
   - On digital signatures, see http://www.w3.org/DSig/ and information from the Electronic Privacy Information Center
   - On digital watermarking, see The Information Hiding Homepage. Steganography and Digital Watermarking. Available at: http://www.cl.cam.ac.uk/~fapp2/steganography/.
4. Degradation and testing of magnetic and other media used to store digital information (work being conducted at the National Institute of Standards and Technology).
5. Attributes of preservable digital information. These efforts focus on specific kinds of digital information. For example, research communities interested in social science and in space data have defined standards for formatting and describing information in their respective fields.
6. Attributes of trusted digital archival repositories. This work centers on specific kinds of digital information and on the organizations that arise to preserve it. Participants in the Mellon e-journals archiving program, for example, are looking at the organizational, business, and rights issues that surround archives that are established to preserve scholarly e-journals.
7. Development of standards (including standards for data and metadata formats, digital storage media, and data management practice). Formal standardization takes place through bodies such as the ISO, World Wide Web Consortium (W3C), National Information Standards Organization (NISO), and Internet Engineering Task Force (IETF) and reflects the emerging consensus of stakeholder communities. It is important to distinguish between the standards themselves and the understandings that need to be reached among stakeholders about how the standards are to be applied in certain instances (see item 5).
8. Automatic copying and distribution of digital information (LOCKSS).
9. Policies and implementation mechanisms for the preservation risk management and assessment of Web-accessible content (Project Prism at Cornell University).
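The checksum and message-digest technique mentioned under authenticity in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than any repository's actual procedure: the function names `message_digest` and `is_unchanged` are hypothetical, and SHA-256 is an assumption chosen for the example (the resources listed above point to MD5). A matching digest indicates only that the bytes have not changed since the benchmark digest was recorded. It says nothing about provenance.

```python
import hashlib
from pathlib import Path


def message_digest(path: Path, algorithm: str = "sha256",
                   chunk_size: int = 65536) -> str:
    """Return the hexadecimal message digest of a file's bytes."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, benchmark_digest: str,
                 algorithm: str = "sha256") -> bool:
    """Compare a file's current digest with one recorded at a benchmark point."""
    return message_digest(path, algorithm) == benchmark_digest
```

In use, a digest would be recorded when an object is deposited and recomputed at intervals. Any mismatch signals that the stored bytes differ from those originally received.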

If preservation activity in the near future bears any resemblance to that activity in the past two years or so, there will be further significant and unpredictable changes in this dynamic field.

References

ICPSR. No date. “About ICPSR.” Available from http://www.icpsr.umich.edu/ORG/about.html.

Web Sites Noted in Text

Alexa. http://info.alexa.com

CAMiLEON. www.si.umich.edu/CAMILEON/

Cornell University. www.library.cornell.edu/preservation/digital.html http://rmc-www.library.cornell.edu/online/studentrecords/

Electronic Privacy Information Center. www.epic.org/

Elsevier e-journal archiving. www.elsevier.nl www.elsevier.nl/homepage/about/resproj/tulip.shtml https://old.diglib.org/preserve/yale0206.htm

Harvard University. www.news.harvard.edu/gazette/1999/03.25/diglibrary.html

IBM Almaden Project: www.almaden.ibm.com

International Digital Object Identifier (DOI). www.doi.org

International Organization for Standardization (ISO). www.iso.org

Internet Archive. www.archive.org/about

Internet Engineering Task Force (IETF). www.ietf.org

InterPARES Project. www.interpares.org

Inter-university Consortium for Political and Social Research (ICPSR). www.icpsr.umich.edu

Library of Congress National Audio-Visual Conservation Center in Culpeper. http://lcweb.loc.gov/rr/mopic/avprot/avprhome.html

Lots of Copies Keep Stuff Safe (LOCKSS). http://lockss.stanford.edu

Massachusetts Institute of Technology (MIT). http://web.mit.edu/newsoffice/nr/2000/libraries.html

Mellon e-journal archiving. https://old.diglib.org/preserve/ejp.htm

National Partnership for Advanced Computational Infrastructure (NPACI). www.npaci.edu/online/v6.2/perm.html

National Archives and Records Administration (NARA). www.nara.gov www.nara.gov/nara/vision/eap/eapspec.html www.nara.gov/nara/electronic

National Information Standards Organization (NISO). http://www.niso.org

National Institute of Standards and Technology (NIST). www.nist.gov; www.itl.nist.gov/div895

Online Computer Library Center (OCLC). www.oclc.org/research/pmwg/

Open Archival Information System (OAIS) standard “reference model.” http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html

Persistent Uniform Resource Locator (PURL). www.purl.org

Project Prism. http://prism.cornell.edu/PrismWeb/AboutPrism.htm

PubMed Central. www.pubmedcentral.nih.gov

Research Libraries Group (RLG). www.rlg.org/longterm/index.html; www.rlg.org/pr/pr2000-oclc.html

Roper Center. www.ropercenter.uconn.edu/catalog40/StartQuery.html

Rothenberg, Jeff (RAND Corporation). www.rand.org/methodology/isg/archives.html

San Diego Supercomputer Center. www.sdsc.edu/DigitalLibraries.html

Stanford University. www.sul.stanford.edu/depts/spc/indaids.html

University of Pennsylvania (work on data provenance). http://db.cis.upenn.edu/Research/provenance.html

World Wide Web Consortium (W3C). www.w3.org

Yale University. www.yale.edu/opa/newsr/01-02-23-02.all.html

FOOTNOTES

¹ See http://www.rlg.org/longterm/index.html.

² The National Archives and Records Administration’s Center for Electronic Records is perhaps the largest government archive for electronic records (http://www.nara.gov/nara/electronic/).
