In 1996, the Commission on Preservation and Access and the Research Libraries Group issued the final report of the Task Force on the Archiving of Digital Information. Chaired by John Garrett and Donald Waters, the task force spent over a year analyzing the problem, considering options, consulting with others around the world, and formulating a series of recommendations. The conclusion reached by the impressive group of 21 experts was alarming: there is, at present, no way to guarantee the preservation of digital information. And it is not simply a technical problem. A serious commitment to preserving digital information requires a legal environment that enables preservation. It also means that specific organizations (libraries, government agencies, corporations) must take responsibility for preservation by enacting new policies and creating the economic means to secure survival of this generation’s knowledge into the future.

The Council on Library and Information Resources, which absorbed the Commission on Preservation and Access in July 1997, continues to search for answers to the troubling question of how digital information will be preserved. Raising public awareness is an important goal, and we have pursued it vigorously. Since the publication of the task force report in 1996, we have spoken to library and scholarly groups here and abroad, published a number of papers, and produced an hour-long documentary film on the subject for broadcast on public television. The film especially has made an impression, and several observers have wondered why we have spent so much time in describing the problems and so little in finding solutions.

In fact, we have also been seeking solutions, and the present paper by Jeff Rothenberg is the first in a series resulting from our efforts. Each paper in the series will propose an approach to the preservation of digital information. Each approach addresses the important parts of the problem. We believe that it is best to assemble as many ideas as possible, to place them before a knowledgeable audience, and to stimulate debate about their strengths and weaknesses as solutions to particular preservation problems.

Jeff Rothenberg is a senior research scientist at the RAND Corporation. His paper is an important contribution to our efforts.

Executive Summary

There is as yet no viable long-term strategy to ensure that digital information will be readable in the future. Digital documents are vulnerable to loss via the decay and obsolescence of the media on which they are stored, and they become inaccessible and unreadable when the software needed to interpret them, or the hardware on which that software runs, becomes obsolete and is lost. Preserving digital documents may require substantial new investments, since the scope of this problem extends beyond the traditional library domain, affecting such things as government records, environmental and scientific baseline data, documentation of toxic waste disposal, medical records, corporate data, and electronic-commerce transactions.

This report explores the technical depth of the problem of long-term digital preservation, analyzes the inadequacies of a number of ideas that have been proposed as solutions, and elaborates the emulation strategy. The central idea of the emulation strategy is to emulate obsolete systems on future, unknown systems, so that a digital document’s original software can be run in the future despite being obsolete. Though it requires further research and proof of feasibility, this approach appears to have many advantages over the other approaches suggested and is offered as a promising candidate for a solution to the problem of preserving digital material far into the future. Since this approach was first outlined, it has received considerable attention and, in the author’s view, is the only approach yet suggested to offer a true solution to the problem of digital preservation.

The long-term digital preservation problem calls for a long-lived solution that does not require continual heroic effort or repeated invention of new approaches every time formats, software or hardware paradigms, document types, or recordkeeping practices change. The approach must be extensible, since we cannot predict future changes, and it must not require labor-intensive translation or examination of individual documents. It must handle current and future documents of unknown type in a uniform way, while being capable of evolving as necessary. Furthermore, it should allow flexible choices and tradeoffs among priorities such as access, fidelity, and ease of document management.

Most approaches that have been suggested as solutions (printing digital documents on paper, relying on standards to keep them readable, reading them by running obsolete software and hardware preserved in museums, or translating them so that they “migrate” into forms accessible by future generations of software) are labor-intensive and ultimately incapable of preserving digital documents in their original forms.

The best way to satisfy the criteria for a solution is to run the original software under emulation on future computers. This is the only reliable way to recreate a digital document’s original functionality, look, and feel. Though it may not be feasible to preserve every conceivable attribute of a digital document in this way, it should be possible to recreate the document’s behavior as accurately as desired, and to test this accuracy in advance.

The implementation of this emulation approach involves: (1) developing generalizable techniques for specifying emulators that will run on unknown future computers and that capture all of those attributes required to recreate the behavior of current and future digital documents; (2) developing techniques for saving, in human-readable form, the metadata needed to find, access, and recreate digital documents, so that emulation techniques can be used for preservation; and (3) developing techniques for encapsulating documents, their attendant metadata, software, and emulator specifications in ways that ensure their cohesion and prevent their corruption. The only assumption that this approach makes about future computers is that they will be able to perform any computable function and (optionally) that they will be faster and/or cheaper to use than current computers.
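
To make the third step more concrete, the following is a minimal sketch of what encapsulation might look like. It is an illustration only, not a format proposed in this report: the function names (`encapsulate`, `open_capsule`), the field names, and the choice of JSON with a SHA-256 digest are all assumptions made for the example. The sketch bundles a document, its original software, an emulator specification, and human-readable metadata into a single text capsule, and it refuses to open a capsule whose contents have changed.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def encapsulate(document: bytes, software: bytes,
                emulator_spec: str, metadata: dict) -> str:
    """Bundle all parts into one text capsule (hypothetical layout).

    Binary parts are hex-encoded so the whole capsule stays plain text;
    the metadata and emulator specification remain directly readable.
    """
    parts = {
        "document": document.hex(),
        "software": software.hex(),
        "emulator_spec": emulator_spec,
        "metadata": metadata,
    }
    # Sorting keys gives a stable serialization to compute the digest over.
    payload = json.dumps(parts, sort_keys=True)
    capsule = {"digest": sha256_hex(payload.encode("utf-8")),
               "payload": payload}
    return json.dumps(capsule, indent=2)


def open_capsule(capsule: str) -> dict:
    """Verify the digest, then return the parts with binaries decoded."""
    outer = json.loads(capsule)
    payload = outer["payload"]
    if sha256_hex(payload.encode("utf-8")) != outer["digest"]:
        raise ValueError("capsule contents do not match recorded digest")
    parts = json.loads(payload)
    parts["document"] = bytes.fromhex(parts["document"])
    parts["software"] = bytes.fromhex(parts["software"])
    return parts


if __name__ == "__main__":
    capsule = encapsulate(
        document=b"example document bytes",
        software=b"example application bytes",
        emulator_spec="Specification of the original platform (placeholder).",
        metadata={"title": "Example", "created": "1998"},
    )
    parts = open_capsule(capsule)
    print(parts["metadata"]["title"])
```

A digest of this kind only detects corruption; it does not repair it, and it does not by itself guarantee that the parts stay together over time. A real scheme would also have to address how the capsule itself remains readable on future systems, which is the harder problem the report discusses.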