APPENDIX 2
Digital Preservation in the United States: Survey of Current Research,
Practice, and Common Understandings
Daniel Greenstein, Digital Library Federation
Abby Smith, Council on Library and Information Resources
March 2002
Libraries and archives have long preserved significant
parts of the published and unpublished record. They do this to ensure
that the information in those records will be available to those
who need it. Preservation has always been seen as a necessary condition
for access. When information is recorded on paper and other analog
media, the major challenges to preservation are posed by the fragility
of the medium and by the costs of providing suitable storage, which
are often high.
In the United States, preservation has traditionally
been a distributed activity. Each library or archives is responsible
for maintaining the accessibility of its own holdings, for its own
users. Together, these individual collections constitute the national
collection. The materials have traditionally been used on-site, although
they may be loaned to other institutions through lending agreements
that are designed, in part, to protect the artifact being lent. Sharing
of resources occurs through reformatting (onto microforms, through
preservation photocopying, and so forth). But in each case, the physical
artifacts are assets that belong to the library or archives. The
information in these artifacts may or may not belong to the institution;
in fact, rarely are intellectual property rights given to the repository
in which the materials are held. In the analog realm, fulfilling
preservation responsibilities has entailed both meeting the information
needs of (mostly onsite) users and protecting institutional assets.
Preservation responsibilities are assumed upon the acquisition of
a physical item and they continue through its life cycle.
These interests (preservation, physical possession
or ownership, and access) are seldom as allied in the digital
realm as they are in the world of analog media. The function of preservation
for the purpose of providing physical or intellectual access does
not fall automatically to an institution through the agency of physical
ownership. The stakeholders in digital preservation often come from
the same sectors as do stakeholders in the analog realm. They include
creators, distributors or publishers, repositories or libraries and
archives, and users. But these stakeholders may play very different
roles in the digital realm than they do in the analog realm, roles
that can put them in conflict with one another in areas where their
interests once were parallel. Digital stakeholders can also create
new alliances of interests.
One critical challenge to digital preservation in
the near term is technical: the rapid rate at which hardware and
software become obsolete means that information written in a specific
code to run on specific hardware may be stranded by the adoption
of newer, better code and hardware. This is the problem facing individuals
who want to read an early version of a Lotus 1-2-3 spreadsheet that
they have on a 5-1/4-inch disk they used to run on an IBM PC. The
implication is that decisions about selection for preservation that
can be deferred in the analog realm must be addressed early in the
life cycle of digital files.
This paper summarizes activities under way in the
United States that are designed to address the variety of preservation
challenges (technical, legal, and social) and the changing
roles and responsibilities of preservation stakeholders. It is divided
into the following major sections:
- Common understandings among stakeholders describes the
agreements that exist among those who take an interest in the long-term
management of digital information.
- Practical preservation activity reports real archiving
efforts and the circumstances under which they have emerged.
- Experimental preservation activity discusses significant
practical experimentation in data archiving.
- Preservation research sets forth key areas for focused
research and presents examples of projects in those areas.
Common Understandings Among Stakeholders
Limited but highly influential agreements about key
issues exist among those who take an interest in the long-term management
of digital information, interests that are intrinsically, if
at times confusingly, interrelated. Those who create or publish such
information, those who wish to use the information, and those who
act as archival repositories for it all have a stake in maintaining
digital assets over time. They often have different purposes in mind
when they speak of making the information accessible in the future,
but they share the conviction that such longevity is highly desirable.
The interests of the creator or distributor, user,
and repository are interrelated because each group has a formative
influence over whether, how, and at what cost digital information
will be made accessible over the long term. The first decisive factor
is how digital information is created and distributed. This may determine
whether, how, and at what cost the information can be preserved and
made accessible to users over time. The choice of some formats may
make it more difficult to manage the digital object and ensure future,
or even current, access. The selection of simple or standard formats
(e.g., PDF files, TIFF images, or ASCII text) can simplify certain
storage issues.
Another deciding influence is how, to whom, and under
what terms or conditions archived digital information is to be distributed.
This will determine how, by whom, and at what cost that information
is created, distributed, and accessioned into an archive. Accordingly,
preservation practice usually represents some ongoing negotiation
between creators or publishers, archives, and users. Each stakeholder
makes choices that can influence the long-term accessibility of a
digital asset. The Inter-university Consortium for Political and
Social Research (ICPSR), for example, was designed to ensure long-term
access to important social science research data sets. This membership
organization states that "to ensure that data resources are available
to future generations of scholars, ICPSR preserves data, migrating
them to new storage media as changes in technology warrant" (ICPSR,
no date). To support its activity, ICPSR has a sustainable, mission-driven
business model, and it defines criteria for data entry, use, and
preservation within the framework of that model. It has worked successfully
for 40 years.
Stakeholders have reached a common understanding about
what constitutes a trusted digital repository and what activities
the repository must routinely undertake, even though the way in which
some of the basic preservation functions will be undertaken remains
uncertain. A viable digital archival repository must have several
attributes. For example, it must be explicit about what digital information
it preserves, why, and for whom. It also must be clear about the
attributes of the archived information it intends to preserve. It
must offer services that meet the minimum requirements of data creators
and users. It must be prepared to negotiate and accept deposits of
appropriate digital information from those who create or distribute
that information, and the terms of those negotiations must be clear
to all. The repository must also obtain enough control of deposited
information to ensure its long-term preservation; this responsibility
may include gaining access to data in order to check on their integrity
while protecting those same data from access by unauthorized parties.
The repository must make information available to users under conditions
negotiated and agreed on with depositors. Finally, given the rapidly
changing technological environment in which the repository will take
in and tend to digital information, it must seek new solutions as
technology evolves.
Another area of common understanding is the emergence
of the Open Archival Information System (OAIS) as the standard reference
model. This model supplies a conceptual framework for discussing
and describing archival practice. OAIS articulates the roles and
interrelationships of the three groups that have a key stake in digital
process, i.e., creator or distributor, user, and repository. The
reference model identifies preservation as a process that begins
when digital information is created; this is a critical point of
difference from the standard analog model, which considers preservation
much later in the life cycle of an artifact. Finally, the OAIS model
identifies the core functions and organizational features of a digital
archival repository. This has influenced perceptions of what constitutes
a trusted archives. OAIS is on the International Organization for
Standardization (ISO) standards track and is the reference model
of choice of those involved in digital preservation worldwide.
Today, there are four commonly understood technical
approaches to digital preservation. These approaches are not mutually
exclusive; indeed, there is an emerging consensus that all four approaches,
and probably others not yet devised, will be deployed for the variety
of digital object types and the demands for access to them.
Migration. In this approach, digital information
is stored in software-independent formats. The information is reformatted
as needed so that it can be accessed using current hardware and software.
Most digital archival repositories rely almost wholly on data migration.
It is doubtful that the strategy will work well for mixed media.
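As a rough illustration of migration, the sketch below re-encodes text from a legacy character encoding into UTF-8 and records message digests before and after the step. The function name, the record class, the choice of code page 437 (an encoding used on early IBM PCs) as the source, and UTF-8 as the target are all assumptions made for this example. Migrating spreadsheets or mixed media would be far more involved.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class MigrationRecord:
    """Hypothetical audit record for one migration step."""
    source_encoding: str
    target_encoding: str
    source_sha256: str
    target_sha256: str


def migrate_text(legacy: bytes, source_encoding: str = "cp437"):
    """Decode legacy bytes and re-encode them as UTF-8.

    Returns the migrated bytes plus a record of what was done.
    """
    text = legacy.decode(source_encoding)  # interpret the old byte values
    migrated = text.encode("utf-8")        # rewrite them in the current format
    record = MigrationRecord(
        source_encoding=source_encoding,
        target_encoding="utf-8",
        source_sha256=hashlib.sha256(legacy).hexdigest(),
        target_sha256=hashlib.sha256(migrated).hexdigest(),
    )
    return migrated, record


# Byte 0x82 is an accented "e" in code page 437 (the assumed source encoding).
migrated, record = migrate_text(b"caf\x82")
print(migrated.decode("utf-8"))
```

Keeping the digests of both versions lets a repository show later which source file a migrated copy came from.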
Technology preservation. Under this approach,
data are preserved along with the hardware/software on which they
depend. Given the variety of hardware and software platforms and
the rate at which they change, this strategy generally is not believed
to be economically viable. Still, many data rescue efforts (see Digital
archaeology below) rely on the persistence of outmoded hardware and
software.
Emulation. Often considered a form of technology
preservation, emulation entails storing digital information alongside
detailed information about how it looked, felt, and functioned in
its original software/hardware environment. The look, feel, and functionality
of the digital information are then "emulated" or re-created on successive
generations of hardware/software. Emulation is particularly pertinent
to mixed media. Individuals who are conducting research on the technical
and economic viability of this approach include Jeff Rothenberg at
the RAND Corporation and researchers at CAMiLEON. Emulation is in
the exploratory phase; it has never been successfully used for preservation
in a sustainable way.
Persistent object preservation. The opposite
of migration, persistent object preservation (POP) entails explicitly
declaring the properties (e.g., content, structure, context, presentation)
of the original digital information that ensure its persistence.
Of the strategies listed here, POP is the only one that starts with
and remains focused on preserving the digital information from its
inception. Other strategies attempt to counter or overcome the generic
technical problem of obsolescence.
Another important technical approach merits mention: digital
archaeology or data mining. Although not a preservation strategy
as such, digital archaeology enables digital information to be rescued
or recovered from disks, tapes, and other storage media that are
no longer readable as a result of physical deterioration, neglect,
obsolescence, or similar reasons.
To remain viable over the long term, appropriate documentation
or metadata must accompany digital information. Key preservation
metadata initiatives are reviewed in a white paper by the Online
Computer Library Center (OCLC) and the Research Libraries Group (RLG).1
Practical Preservation Activity
Several practical preservation efforts under way
demonstrate the range of experience and expertise around the
country.
Active preservation programs are under way in archives
where preservation is often legally mandated. For example, the archives
of national and state governments are legally bound to preserve selected
records of government, including electronic records, in perpetuity.
Business archives, such as those at financial, pharmaceutical, chemical,
and other companies, may maintain records for legal and other reasons.
Statutes of limitations often govern these mandates; consequently,
such archives do not typically keep data in perpetuity as do government
archives. These systems are more analogous to records
management than to archiving; nonetheless, managing digital records
even for seven years can pose technical challenges. Archives are
also established at not-for-profit institutions, such as universities,
that maintain records (including electronic records) for legal, business,
and cultural reasons.2
Preservation is also under way in organizations in
which data creators and producers perceive the long-term commercial
value of digital information. Publishers such as Elsevier Science
preserve the electronic scholarly journals they produce. The entertainment
industry, most notably music and film companies, has large investments
in digital assets that it wishes to reuse over time, and it has
developed digital asset management systems tailored to its specific
needs.
Preservation programs also are active in organizations
that perceive a noncommercial value of digital information for use
and reuse. Libraries, archives, and museums that digitize objects
in their collections for online presentation, for example, may seek
to maintain those objects over time rather than to rescan them as
they become obsolete.
In places where data archives and systems vendors
see commercial possibilities in the provision, supply, and support
of long-term data storage facilities, preservation has become vital
to commercial development. Data warehousing is a cottage industry
with numerous related trade associations, exhibitions, and certification
procedures. Data archives are beginning to emerge in the library
community; for example, both OCLC and RLG are considering offering
data archiving facilities on a cost-recovery basis.
Specific research communities, where data creators
are also data users and where both groups recognize the importance
of being able to reuse research data, undertake large-scale preservation
of their intellectual assets. Both the ICPSR and the Roper Center
preserve social science and government statistical data.
There are also major preservation activities in communities
where data creators and data users recognize their interdependence
and the value of the digital information in which they maintain a
common interest. Through PubMed Central, the National Library of
Medicine acts as a digital archival repository for medical publications
and other medical information. Finally, archival repositories may
be developed as a by-product of a commercial process. The Internet
Archive is an archive of "snapshots" taken of selected Web pages
by Alexa. An information company can use information gained from
those snapshots for commercial purposes. Alexa assesses the visibility
of Web pages by seeing who links into a site.
Experimental Preservation Activity
The InterPARES (International Research on Permanent
Authentic Records in Electronic Systems) Project is a major international
research initiative involving archival scholars, computer engineering
scholars, and representatives of national archival institutions and
private industry. Its goal is "to develop the theoretical and methodological
knowledge essential for the permanent preservation of records generated
electronically, and, on the basis of this knowledge, to formulate
model policies, strategies, and standards capable of ensuring their
preservation." The InterPARES Project is investigating numerous issues
in digital preservation, including the authenticity of electronic
records.
The National Archives and Records Administration is
developing a strategic and technical framework within which it may
preserve in perpetuity selected electronic records of the federal
government. It is closely involved with the InterPARES Project, the
OAIS reference standard, the National Partnership for Advanced Computational
Infrastructure led by the San Diego Supercomputer Center, and others.
It is an international leader in research in selected areas, including
requirements and processes for the preservation and reproduction
of authentic records, development of the persistent archives method,
application of advanced computing tools to records-management processes,
and integration of digital preservation technologies with infrastructure
technologies for e-government and e-business.
Under the auspices of The Andrew W. Mellon Foundation's
e-journal archiving program, seven major libraries (the New York
Public Library and the university libraries of Cornell, Harvard,
Massachusetts Institute of Technology [MIT], Pennsylvania, Stanford,
and Yale) are engaged in planning digital archival repositories for
different kinds of scholarly journals. Yale, Harvard, and Pennsylvania
have worked with commercial publishers on archiving the full range
of their electronic journals; Cornell and the New York Public Library
have worked on archiving journals in specific disciplines. MIT's
project involves archiving "dynamic" e-journals (i.e., those that
change frequently), and Stanford is investigating the development
of archiving software tools under the auspices of its LOCKSS (Lots
of Copies Keep Stuff Safe) program.
RLG and OCLC are jointly conducting preservation research.
At present, their work focuses on the attributes of a digital archival
repository and on preservation metadata.
The Andrew W. Mellon Foundation has invested in an
investigation of emulation as a viable preservation strategy. Jeff
Rothenberg at the RAND Corporation is conducting this research.
The IBM Almaden Research Center is investigating the
possibility of using a universal virtual machine for digital preservation.
The University of Pennsylvania is conducting work
on data provenance.
Preservation Research
There are currently nine areas of significant research
into preserving digital files. They are:
- Architecture and performance of archival repositories. Key research
is under way at the San Diego Supercomputer Center, Stanford University,
the National Archives and Records Administration, the Culpeper
Center of the Library of Congress, Cornell University, Yale University,
MIT, and Harvard University.
- Persistent identification of and naming for archived information
(e.g., International Digital Object Identifier [DOI], Persistent
Uniform Resource Locator [PURL]).
- Methods for recording and ensuring authenticity of archived information
(digital signatures, watermarking, mechanisms for recording information
about provenance). Determining the authenticity of a digital object
is likely to require the use of techniques whose reliability is
still being debated. Techniques appropriate to digital images may
include digital signatures and watermarking. Checksums and other
technical routines that produce message digests are appropriate
for objects in virtually all formats. They help determine authenticity
by analyzing the object's structure and composition and whether
it has been changed in any way since a particular benchmark point.
Information may be found at
- Degradation and testing of magnetic and other media used to store
digital information (work being conducted at the National Institute
of Standards and Technology).
- Attributes of preservable digital information. These efforts
focus on specific kinds of digital information. For example, research
communities interested in social science and in space data have
defined standards for formatting and describing information in
their respective fields.
- Attributes of trusted digital archival repositories. This work
centers on specific kinds of digital information and on the organizations
that arise to preserve it. Participants in the Mellon e-journals
archiving program, for example, are looking at the organizational,
business, and rights issues that surround archives that are established
to preserve scholarly e-journals.
- Development of standards (including standards for data and metadata
formats, digital storage media, and data management practice).
Formal standardization takes place through bodies such as the ISO,
World Wide Web Consortium (W3C), National Information Standards
Organization (NISO), and Internet Engineering Task Force (IETF)
and reflects the emerging consensus of stakeholder communities.
It is important to distinguish between the standards themselves
and the understandings that need to be reached among stakeholders
about how the standards are to be applied in certain instances
(see the item above on attributes of preservable digital information).
- Automatic copying and distribution of digital information (LOCKSS).
- Policies and implementation mechanisms for the preservation risk
management and assessment of Web-accessible content (Project Prism
at Cornell University).
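The checksum technique mentioned in the list above can be sketched briefly. The example below computes a SHA-256 message digest at a benchmark point, checks later whether an object has changed, and compares several copies to flag any that disagree with the majority. The function names and sample data are hypothetical, and the majority comparison is only a simplified illustration of keeping multiple copies; it is not the LOCKSS protocol. A matching digest shows that the bytes are unchanged since the benchmark. It does not by itself establish provenance.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 message digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, benchmark_digest: str) -> bool:
    """True if data still matches the digest recorded at the benchmark point."""
    return digest(data) == benchmark_digest


def find_damaged_copies(copies: dict[str, bytes]) -> list[str]:
    """Return the names of copies whose digest differs from the most common one."""
    digests = {name: digest(data) for name, data in copies.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority)


# Hypothetical object and benchmark digest recorded at deposit time.
original = b"archived object bytes"
benchmark = digest(original)

print(is_unchanged(original, benchmark))
print(find_damaged_copies({"site_a": original,
                           "site_b": original,
                           "site_c": original + b"!"}))
```

Because a digest works on raw bytes, the same routine applies to objects in virtually any format.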
If preservation activity in the near future bears
any resemblance to that activity in the past two years or so, there
will be further significant and unpredictable changes in this dynamic
field.
References
ICPSR. No date. "About ICPSR." Available from http://www.icpsr.umich.edu/ORG/about.html.
Web Sites Noted in Text
Alexa. http://info.alexa.com
CAMiLEON. www.si.umich.edu/CAMILEON/
Cornell University. www.library.cornell.edu/preservation/digital.html; http://rmc-www.library.cornell.edu/online/studentrecords/
Electronic Privacy Information Center. www.epic.org/
Elsevier e-journal archiving. www.elsevier.nl; www.elsevier.nl/homepage/about/resproj/tulip.shtml; www.diglib.org/preserve/yale0206.htm
Harvard University. www.news.harvard.edu/gazette/1999/03.25/diglibrary.html
IBM Almaden Project. www.almaden.ibm.com
International Digital Object Identifier (DOI). www.doi.org
International Organization for Standardization (ISO). www.iso.org
Internet Archive. www.archive.org/about
Internet Engineering Task Force (IETF). www.ietf.org
InterPARES Project. www.interpares.org
Inter-university Consortium for Political and Social
Research (ICPSR). www.icpsr.umich.edu
Library of Congress National Audio-Visual Conservation
Center in Culpeper. http://lcweb.loc.gov/rr/mopic/avprot/avprhome.html
Lots of Copies Keep Stuff Safe (LOCKSS). http://lockss.stanford.edu
Massachusetts Institute of Technology (MIT). http://web.mit.edu/newsoffice/nr/2000/libraries.html
Mellon e-journal archiving. http://www.diglib.org/preserve/ejp.htm
National Partnership for Advanced Computational Infrastructure
(NPACI). www.npaci.edu/online/v6.2/perm.html
National Archives and Records Administration (NARA). www.nara.gov; www.nara.gov/nara/vision/eap/eapspec.html; www.nara.gov/nara/electronic
National Information Standards Organization (NISO). http://www.niso.org
National Institute of Standards and Technology (NIST). www.nist.gov; www.itl.nist.gov/div895
Online Computer Library Center (OCLC). www.oclc.org/research/pmwg/
Open Archival Information System (OAIS) standard "reference
model." http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html
Persistent Uniform Resource Locator (PURL). www.purl.org
Project Prism. http://prism.cornell.edu/PrismWeb/AboutPrism.htm
PubMed Central. www.pubmedcentral.nih.gov
Research Libraries Group (RLG). www.rlg.org/longterm/index.html; www.rlg.org/pr/pr2000-oclc.html
Roper Center. www.ropercenter.uconn.edu/catalog40/StartQuery.html
Rothenberg, Jeff (RAND Corporation). www.rand.org/methodology/isg/archives.html
San Diego Supercomputer Center. www.sdsc.edu/DigitalLibraries.html
Stanford University. www.sul.stanford.edu/depts/spc/indaids.html
University of Pennsylvania (work on data provenance). http://db.cis.upenn.edu/Research/provenance.html
World Wide Web Consortium (W3C). www.w3.org
Yale University. www.yale.edu/opa/newsr/01-02-23-02.all.html
Footnotes
1 See http://www.rlg.org/longterm/index.html.
2 The National Archives and Records Administration's Center for Electronic
Records is perhaps the largest government archive for electronic
records (http://www.nara.gov/nara/electronic/).