4. The Scope of the Problem

The preservation of digital documents is a matter of more than purely academic concern. A 1990 House of Representatives report cited a number of cases of significant digital records that had already been lost or were in serious jeopardy of being lost (U. S. Congress 1990), and the 1997 documentary film Into the Future (Sanders 1997) cited additional cases (Bikson 1994, Bikson and Law 1993, Fonseca, Polles and Almeida 1996, Manes 1998, NRC 1995).

In its short history, computer science has become inured to the fact that every new generation of software and hardware technology entails the loss of information, as documents are translated between incompatible formats (Lesk 1992). The most serious losses are caused by paradigm shifts (such as those between networked, hierarchical, and relational databases), which often require the complete redesign of documents (or databases) to migrate to the new paradigm. Whenever this happens, documents that are not in continuing use may well be orphaned by the demise of their original formats, that is, they may be abandoned to save the cost of migration, while each document that does migrate may be turned into something unrecognizably different from its original, which is generally lost.[2] Even when migration does not involve a major paradigm shift, it can result in subtle losses of context and content. Yet aside from this ad hoc migration process, there exists no proven strategy for preserving (or translating) documents over time (Bikson and Frinking 1993).

The scope of this problem extends beyond the traditional library domain, affecting government records, environmental and scientific baseline data, documentation of toxic waste disposal, medical records (whose lifetime must exceed 100 years), corporate data (such as the documentation of drug trials for pharmaceutical companies or of geologic survey data for petrochemical companies), electronic-commerce transactions, and electronic records needed to support forthcoming digital government initiatives (Bikson 1994, Erlandsson 1994).

The library and archives communities have identified at least some critical aspects of this problem and have recognized that preserving digital documents may require substantial new investments and commitments by institutions and government agencies (Bikson forthcoming, Hedstrom 1993, Indiana University 1997, Waters and Garrett 1996). Showing admirable foresight, these communities have begun to discuss alternative economic and administrative policies for funding and managing digital preservation and have begun to develop conceptual frameworks for metadata that are not restricted to the print medium (Day 1997, Dublin Core Metadata Initiative 1998, IEEE 1997, Giguere 1996, Rothenberg 1996). Yet the lack of any long-term technical solution to the problem of digital preservation limits the efficacy of such explorations, since attempting to allocate responsibilities and assess costs for a nonexistent process is of questionable value.

4.1 The need for triage

The practical problem of digital preservation can be viewed at three different time scales. In the short term, many organizations are faced with an urgent need to save digital material that is in imminent danger of becoming unreadable or inaccessible, or to retrieve digital records that are already difficult to access. Yet the often heroic efforts needed to save or retrieve such material may not be generally applicable to preserving digital documents far into the future, and the techniques employed may not even generalize to similar urgent problems that arise later. These short-term efforts therefore do not provide much leverage, in the sense that they are not replicable for different document types, though they may still be necessary for saving crucial records.

In the medium term, organizations must quickly implement policies and technical procedures to keep digital records from becoming vulnerable to loss. For the vast bulk of records (those being generated now, those that have been generated fairly recently, or those that have been translated into formats and stored on media that are currently in use), the medium-term issue is how to prevent these records from becoming urgent cases of imminent loss within the next few years, as current media, formats, and software evolve and become obsolete.

In the long term (which is the focus of this report), it is necessary to develop a truly long-lived solution to digital longevity that does not require continual heroic effort or repeated invention of new approaches every time formats, software or hardware paradigms, document types, or recordkeeping practices change. Such an approach must be extensible, in recognition of the fact that we cannot predict future changes, and it must not require labor-intensive (and error-prone) translation or examination of individual records. It must handle current and future records of unknown type in a uniform way, while being capable of evolving as necessary.

4.2 Types of digital information affected

Though most early digital documents have consisted largely of text, the generation of multimedia records has increased rapidly in recent years, to include audio recordings, graphical charts, photographic imagery, and video presentations, among others. In the digital domain, all of these media can be combined into hypermedia records, whose impact and expressivity can be expected to stimulate their increased use. Whereas the bulk of existing digital documents may be textual, multimedia and hypermedia records are likely to become ever more popular and may well become dominant in the near future. Any solution to digital preservation that is limited to text will therefore quickly become obsolete. A true long-term solution should be completely neutral to the form and content of the digital material it preserves.

4.3 Contextual issues

The preservation and management of digital records involves interrelated technical, administrative, procedural, organizational, and policy issues, but a sound technical approach must form the foundation on which everything else rests. Preserving digital records may require substantial new investments and commitments by organizations, institutions and agencies, forcing them to adopt new economic and administrative policies for funding and managing digital preservation. Yet it is impossible to allocate responsibilities or assess costs for an undefined process: until a viable technical approach to digital longevity has been identified and developed, it is premature to spend much effort attempting to design the administrative and organizational environment that will embed whatever technical approach is ultimately adopted.

5. Technical Dimensions of the Problem

Digital media are vulnerable to loss by two independent mechanisms: the physical media on which they are stored are subject to physical decay and obsolescence, and the proper interpretation of the documents themselves is inherently dependent on software.

5.1 Digital media suffer from physical decay and obsolescence

There is reasonably widespread (though by no means universal) awareness of the fact that digital storage media have severely limited physical lifetimes. The National Media Lab has published test results for a wide range of tapes, magnetic disks, CD-ROMs, and other media (Van Bogart 1996), showing that a tape, disk, or even CD that is picked at random (that is, without prior evaluation of the vendor or the specific batch of media) is unlikely to have a lifetime of even five years (Lancaster 1986). Vendors and media scientists may argue vehemently about such numbers, but accurate estimates are ultimately largely irrelevant, since the physical lifetime of media is rarely the constraining factor for digital preservation. Even if archival-quality media were introduced to the market, they would probably fail, since they would quickly be made obsolete, despite their physical longevity, by newer media having increased capacity, higher speed, greater convenience, and lower price (Schurer 1998). This is a natural outgrowth of the exponential improvement in storage density, speed, and cost that has characterized digital media development for the past several decades: the market makes older storage media obsolete as newer, better media become available. The short lifetimes of eight-inch floppy disks, tape cartridges and reels, hard-sectored disks, and seven-track tapes, among others, demonstrate how quickly storage formats become inaccessible.

Media obsolescence manifests itself in several ways: the medium itself disappears from the market; appropriate drives capable of reading the medium are no longer produced; and media-accessing programs (device drivers) capable of controlling the drives and deciphering the encodings used on the medium are no longer written for new computers. Upgrading to a new computer system therefore often requires abandoning an old storage medium, even if an organization still has documents stored on that medium.

The dual problems of short media lifetime and rapid obsolescence have led to the nearly universal recognition that digital information must be copied to new media (refreshed) on a very short cycle (every few years). Copying is a straightforward solution to these media problems, though it is not trivial: in particular, the copy process must avoid corrupting documents via compression, encryption, or changing data formats.
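
As an illustration of what a non-corrupting refresh might look like, the following Python sketch copies a file byte for byte and then confirms that the copy is bit-identical to the original by comparing checksums. The function names (sha256_of, refresh_copy) and the choice of SHA-256 are assumptions made for this example, not part of any procedure described in this report.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(source: Path, destination: Path) -> str:
    """Copy source to destination unchanged and verify the result.

    Hypothetical helper: raises IOError if the copy's bit stream
    differs from the original's; otherwise returns the shared digest.
    """
    before = sha256_of(source)
    # Plain byte copy: no compression, encryption, or format conversion.
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh of {source} produced a different bit stream")
    return after
```

A check of this kind shows only that the bit stream survived the copy; it says nothing about whether the document can still be interpreted.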

In addition, as media become more dense, each copy cycle aggregates many disks, tapes, or other storage units onto a single new unit of storage (say, a CD or its successor, a DVD, digital video disk or digital versatile disk). This raises the question of how to retain any labeling information and metadata that may have been associated with the original media: since it is infeasible to squeeze the contents of the labels of 400 floppy disks to fit on the label of a single CD, label information must be digitized to ensure that it continues to accompany the records it describes. But whereas labels are directly human-readable, digitized information is not; labels and metadata must therefore be digitized in such a way that they remain more easily readable by humans than are the documents they describe. This may seem a relatively trivial aspect of the problem, but it has serious implications (Bearman 1992), as discussed below.
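
One way to keep digitized labels more readable than the records they describe is to store them as plain text alongside those records. The sketch below is a minimal illustration under that assumption; the function name build_manifest, the manifest layout, and the sample file names and labels are all hypothetical.

```python
from typing import Dict


def build_manifest(labels: Dict[str, str]) -> str:
    """Render {file name: original label text} as a plain-text manifest.

    Hypothetical layout: one "File:" line and one "Label:" line per
    record, separated by blank lines. Each label is assumed to fit on
    a single line.
    """
    lines = ["MANIFEST OF ORIGINAL MEDIA LABELS", ""]
    for name in sorted(labels):
        lines.append(f"File:  {name}")
        lines.append(f"Label: {labels[name]}")
        lines.append("")
    return "\n".join(lines)


# Made-up example: labels transcribed from two floppy disks.
example = build_manifest({
    "disk_002.img": "Survey data, region B, 1987",
    "disk_001.img": "Survey data, region A, 1986",
})
print(example)
```

Plain text is itself an encoding, so a manifest of this kind reduces the interpretation problem for labels without eliminating it.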

5.2 Digital documents are inherently software-dependent

Though media problems are far from trivial, they are but the tip of the iceberg. Far more problematic is the fact that digital documents are in general dependent on application software to make them accessible and meaningful. Copying media correctly at best ensures that the original bit stream of a digital document will be preserved. But a stream of bits cannot be made self-explanatory, any more than hieroglyphics were self-explanatory for the 1,300 years before the discovery of the Rosetta Stone. A bit stream (like any stream of symbols) can represent anything: not just text but also data, imagery, audio, video, animated graphics, and any other form or format, current or future, singly or combined in a hypermedia lattice of pointers whose formats themselves may be arbitrarily complex and idiosyncratic. Without knowing what is intended, it is impossible to decipher such a stream. In certain restricted cases, it may be possible to decode the stream without additional knowledge: for example, if a bit stream is known to represent simple, linear text, deciphering it is amenable to cryptographic techniques. But in general, a bit stream can be made intelligible only by running the software that created it, or some closely related software that understands it.
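
The ambiguity of a bit stream can be seen in even a tiny case. The Python sketch below reads the same four bytes as text, as small integers, as a 32-bit integer in two byte orders, and as a floating-point number; the sample bytes and variable names are chosen purely for illustration.

```python
import struct

bits = b"ABCD"  # the same 32 bits throughout

as_text = bits.decode("ascii")                  # 'ABCD'
as_small_ints = tuple(bits)                     # (65, 66, 67, 68)
(as_big_endian,) = struct.unpack(">I", bits)    # 1094861636
(as_little_endian,) = struct.unpack("<I", bits) # 1145258561
(as_float,) = struct.unpack(">f", bits)         # an IEEE 754 single-precision value

for label, value in [
    ("ASCII text", as_text),
    ("four 8-bit integers", as_small_ints),
    ("32-bit integer, big-endian", as_big_endian),
    ("32-bit integer, little-endian", as_little_endian),
    ("32-bit float, big-endian", as_float),
]:
    print(f"{label}: {value!r}")
```

Nothing in the bytes themselves indicates which of these readings, if any, was intended.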

This point cannot be overstated: in a very real sense, digital documents exist only by virtue of software that understands how to access and display them; they come into existence only by virtue of running this software.

When all data are recorded as 0s and 1s, there is, essentially, no object that exists outside of the act of retrieval. The demand for access creates the ‘object,’ that is, the act of retrieval precipitates the temporary reassembling of 0s and 1s into a meaningful sequence that can be decoded by software and hardware.
-Abby Smith, “Preservation in the Future Tense”

As this statement implies, the only reliable way (and often the only possible way) to access the meaning and functionality of a digital document is to run its original software, that is, either the software that created it or some closely related software that understands it (Swade 1998). Yet such application software becomes obsolete just as fast as do digital storage media and media-accessing software. And although we can save obsolete software (and the operating system environment in which it runs) as just another bit stream, running that software requires specific computer hardware, which itself becomes obsolete just as quickly. It is therefore not obvious how we can use a digital document’s original software to view the document in the future on some unknown future computer (which, for example, might use quantum rather than binary states to perform its computations). This is the crux of the technical problem of preserving digital documents.

5.3 Additional considerations

Any technical solution must also be able to cope with issues of corruption of information, privacy, authentication, validation, and preserving intellectual property rights. This last issue is especially complex for documents that are born digital and therefore have no single original instance, since traditional notions of copies are inapplicable to such documents. Finally, any technical solution must be feasible in terms of the societal and institutional responsibilities and the costs required to implement it.


2 Documents whose continued use is crucial to the individuals or organizations that own them are more likely to be included in the migration process, in recognition of their importance, but this does not guarantee that their meaning will not be inadvertently lost or corrupted.