In a very short time, preservation has developed into a critically important part of managing a library’s most precious assets, its collections.
-Abby Smith, “Preservation in the Future Tense”
The vision of creating digital libraries that will be able to preserve our heritage currently rests on technological quicksand. There is as yet no viable long-term strategy to ensure that digital information will be readable in the future. Not only are digital documents vulnerable to loss via media decay and obsolescence, but they become equally inaccessible and unreadable if the software needed to interpret them (or the hardware on which that software runs) is lost or becomes obsolete.
This report explores the technical depth of this problem, analyzes the inadequacies of a number of ideas that have been proposed as solutions, and elaborates the emulation strategy, which is, in my view, the only approach yet suggested to offer a true solution to the problem of digital preservation (Rothenberg 1995a). Other proposed solutions involve printing digital documents on paper, relying on standards to keep them readable, reading them by running obsolete software and hardware preserved in museums, or translating them so that they “migrate” into forms accessible by future generations of software. Yet all of these approaches are short-sighted, labor-intensive, and ultimately incapable of preserving digital documents in their original forms. Emulation, on the other hand, promises predictable, cost-effective preservation of original documents, by means of running their original software under emulation on future computers.
2. The Digital Longevity Problem
Documents, data, records, and informational and cultural artifacts of all kinds are rapidly being converted to digital form, if they were not created digitally to begin with. This rush to digitize is being driven by powerful incentives, including the ability to make perfect copies of digital artifacts, to publish them on a wide range of media, to distribute and disseminate them over networks, to reformat and convert them into alternate forms, to locate them, search their contents, and retrieve them, and to process them with automated and semi-automated tools. Yet the longevity of digital content is problematic for a number of complex and interrelated reasons (UNACCIS 1990, Lesk 1995, Morris 1998, Popkin and Cushman 1993, Rothenberg 1995b, Getty 1998).
It is now generally recognized that the physical lifetimes of digital storage media are often surprisingly short, requiring information to be “refreshed” by copying it onto new media with disturbing frequency. The technological obsolescence of these media (and of the hardware and software necessary to read them) poses a different and equally urgent threat. Moreover, most digital documents and artifacts exist only in encoded form, requiring specific software to bring their bit streams to life and make them truly usable; as these programs (or the hardware/software environments in which they run) become obsolete, the digital documents that depend on them become unreadable, held hostage to their own encoding. This problem is paradoxical, given the fact that digital documents can be copied perfectly, which is often naively taken to mean that they are eternal. This paradox prompted my ironic observation (Rothenberg 1997), “Digital documents last forever, or five years, whichever comes first.” There is currently no demonstrably viable technical solution to this problem; yet if it is not solved, our increasingly digital heritage is at grave risk of being lost (Michelson and Rothenberg 1992, Morelli 1998, Swade 1998).
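The paradox of perfect copying without lasting readability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample text is hypothetical, and UTF-8 and Latin-1 merely stand in for the "original software" and for a later, mismatched interpreter. A checksum confirms that refreshing preserves every bit, yet the same bits yield different text depending on how they are decoded.

```python
import hashlib

# Hypothetical document content, chosen only for illustration.
original_text = "Café menu: crème brûlée"

# The document as it exists on a storage medium: an encoded bit stream.
stored_bits = original_text.encode("utf-8")

# "Refreshing": copy the bit stream onto a new medium. The copy is bit-perfect.
refreshed_bits = bytes(bytearray(stored_bits))
copy_is_perfect = (
    hashlib.sha256(stored_bits).hexdigest()
    == hashlib.sha256(refreshed_bits).hexdigest()
)

# Reading with the interpreter the document was written for (assumed: UTF-8).
as_intended = refreshed_bits.decode("utf-8")

# Reading with a different interpreter (assumed: Latin-1). No error is raised,
# but the accented characters no longer come out as the author wrote them.
misread = refreshed_bits.decode("latin-1")

print("copy is bit-perfect:", copy_is_perfect)
print("intended reading:   ", as_intended)
print("mismatched reading: ", misread)
```

The checksum comparison succeeds even though the mismatched reading is wrong, which is the point: verifying the bits says nothing about whether a future reader can still interpret them.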
In addition to the technical aspects of this problem, there are administrative, procedural, organizational, and policy issues surrounding the management of digital material. Digital documents are different from traditional paper documents in ways that have significant implications for the means by which they are generated, captured, transmitted, stored, maintained, accessed, and managed. Paramount among these differences is the greatly reduced lifetime of digital information without some form of active preservation. This mandates new approaches to accessioning and saving digital documents to avoid their loss. These approaches raise nontechnical issues concerning jurisdiction, funding, responsibility for successive phases of the digital document life cycle, and the development of policies requiring adherence to standard techniques and practices to prevent the loss of digital information. However, few of these nontechnical issues can be meaningfully addressed in the absence of a sound, accepted technical solution to the digital longevity problem.
3. Preservation in the Digital Age
The goal of any preservation program is to ensure long-term, ready access to the information resources of an institution.
-Abby Smith, “Preservation in the Future Tense”
Preservation constitutes one leg of a tripod that supports informational institutions such as libraries, the other legs being access and the development and management of collections (Fox and Marchionini 1998, Schurer 1998). Without preservation, access becomes impossible, and collections decay and disintegrate.
Informational artifacts include documents, data, and records of all kinds, in all media, which I refer to as “documents” here, for simplicity (Roberts 1994). The essence of preserving informational artifacts is the retention of their meaning. This requires the ability to recreate the original form and function of a document when it is accessed, for example, to establish its authenticity, validity, and evidential value and to allow the document’s users to understand how its creator and original viewers saw it, what they were (and were not) able to infer from it, what insights it may have conveyed to them, and what aesthetic value it may have had for them.
My focus is on digital documents, by which I mean informational artifacts, some or all aspects of whose intended behavior or use rely on their being encoded in digital form. The term “digital” in this context denotes any means of representing sequences of discrete symbolic values (each value having two or more unambiguously distinguishable states) so that they can, at least in principle, be accessed, manipulated, copied, stored, and transmitted entirely by mechanical means, with high reliability. [1]
The preservation of digital documents involves a number of distinctive requirements. In particular, all such documents possess a unique collection of core digital attributes that must be retained. These attributes include their ability to be copied perfectly, to be accessed without geographic constraint, to be disseminated at virtually no incremental cost (given the existence of appropriate digital infrastructure), and to be machine-readable so that they can be accessed, searched, and processed by automated mechanisms that can modify them, reformat them, and perform arbitrary computations on their contents in all phases of their creation and distribution. Furthermore, new inherently digital (“born-digital”) document forms, such as dynamic, distributed, interactive hypertext and hypermedia, must retain their unique functionality, including their ability to integrate information from disparate traditional sources, such as books, periodicals, newspapers, mail, phone messages, data, imagery, and video (Bearman 1991, Bearman 1992, Bikson 1997, Kenney 1997, Michelson and Rothenberg 1992).
In response to the difficulty of saving digital documents (due to factors such as media decay and software and hardware obsolescence, discussed in detail below), it is sometimes suggested that they be printed and saved in hard-copy form. This is a rear-guard action and not a true solution. Printing any but the simplest, traditional documents results in the loss of their unique functionality (such as dynamic interaction, nonlinearity, and integration), and printing any document makes it no longer truly machine-readable, which in turn destroys its core digital attributes (perfect copying, access, distribution, and so forth). Beyond this loss of functionality, printing digital documents sacrifices their original form, which may be of unique historical, contextual, or evidential interest (Bearman 1993, Hedstrom 1991, U. S. District Court for the District of Columbia 1993).
Proposed alternatives to printing digital documents include translating digital documents into standard forms or extracting their contents without regard to their original nature. Though these approaches have traditional analogues (such as the translation of ancient texts into the vernacular to give them a larger audience), they are fraught with danger. The meaning of a document may be quite fragile, since meaning is in the eye of the beholder: what may be a trivial transformation to a casual reader may be a disastrous loss to a scholar, historian, or lawyer. Examples of loss of meaning abound in our daily experience of converting digital documents from their native form into that of some other software application in order to read them. At its best, such conversion often sacrifices subtleties (such as format, font, footnotes, cross-references, citations, headings, numbering, shape, and color); at its worst, it leaves out entire segments (such as graphics, imagery, and sound) or produces meaningless garbage (Horsman 1994).
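The kind of loss that conversion can cause may be illustrated with a deliberately tiny sketch. Everything in it is hypothetical: the `Span` structure, the two converter functions, and the sample contract sentence are assumptions invented for illustration and do not describe any real file format or application. The sketch translates a document with emphasis and a footnote into plain text and then tries to read it back.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    """A run of text with the formatting a (hypothetical) native format keeps."""
    text: str
    bold: bool = False
    footnote: Optional[str] = None


def migrate_to_plain_text(doc: List[Span]) -> str:
    """Translate to a simpler target form that can hold only characters."""
    return "".join(span.text for span in doc)


def import_plain_text(text: str) -> List[Span]:
    """Read the migrated form back; nothing records what was dropped."""
    return [Span(text)]


original = [
    Span("Payment is due "),
    Span("within 30 days", bold=True,
         footnote="Unless agreed otherwise in writing."),
    Span("."),
]

migrated = migrate_to_plain_text(original)
restored = import_plain_text(migrated)

print(migrated)              # the words survive
print(restored == original)  # the emphasis and the footnote do not
```

To a casual reader the migrated sentence looks complete; to a lawyer, the missing footnote changes what the sentence means, and the migrated form gives no indication that anything was dropped.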
While it is often useful to create contemporary vernacular transcriptions of historical documents (such as Shakespeare’s Sonnets or the Declaration of Independence), society places a high value on retaining the originals so that we may verify that content has not been lost in transcription (whether inadvertently or for nefarious ends), as well as for scholarly and aesthetic purposes. For digital documents, retaining an original may not mean retaining the original medium (which rapidly decays and becomes obsolete), but it should mean retaining the functionality, look, and feel of the original document.
The skills and judgment developed in preservation professionals (the ability to discover the original form of an object and the intent of its creator, and to prolong the life of the object or return the object as nearly as possible to its state at the time of its creation) are precisely the same skill sets that are needed for the future, albeit practiced in a radically different context.
-Abby Smith, “Preservation in the Future Tense”
[1] I use the term “digital information” in preference to “electronic information” because it more accurately captures the essential aspects of the problem. Digital information can in principle be represented in nonelectronic form, for example, by using optical or quantum techniques, whereas electronic information is not necessarily digital.