Avoiding Technological Quicksand: Sections 7-8 • CLIR

7. Criteria for an Ideal Solution

In contrast to the above strategies, an ideal approach should provide a single, extensible, long-term solution that can be designed once and for all and applied uniformly, automatically, and in synchrony (for example, at every future refresh cycle) to all types of documents and all media, with minimal human intervention. It should provide maximum leverage, in the sense that implementing it for any document type should make it usable for all document types. It should facilitate document management (cataloging, deaccessioning, and so forth) by associating human-readable labeling information and metadata with each document. It should retain as much as desired (and feasible) of the original functionality, look, and feel of each original document, while minimizing translation so as to minimize both labor and the potential for loss via corruption. If translation is unavoidable (as when translating labeling information), the approach should guarantee that this translation will be reversible, so that the original form can be recovered without loss.

The ideal approach should offer alternatives for levels of safety and quality, volume of storage, ease of access, and other attributes at varying costs, and it should allow these alternatives to be changed for a given document, type of document, or corpus at any time in the future. It should provide single-step access to all documents, without requiring multiple layers of encapsulation to be stripped away to access older documents, while allowing the contents of a digital document to be extracted for conversion into the current vernacular, without losing the original form of the document. It should offer up-front acceptance testing at accession time, to demonstrate that a given document will be accessible in the future. Finally, the only assumptions it should make about future computers are that they will be able to perform any computable function and (optionally) that they will be faster and/or cheaper to use than current computers.

8. The Emulation Solution

In light of the foregoing analysis, I propose that the best (if not the only) way to satisfy the above criteria is to somehow run a digital document’s original software. This is the only reliable way to recreate a digital document’s original functionality, look, and feel. The central idea of the approach I describe here is to enable the emulation of obsolete systems on future, unknown systems, so that a digital document’s original software can be run in the future despite being obsolete. Though it may not be feasible to preserve every conceivable attribute of a digital document in this way, it should be possible to recreate the document’s behavior as accurately as desired, and to test this accuracy in advance.

The implementation of this emulation approach would involve: (1) developing generalizable techniques for specifying emulators that will run on unknown future computers and that capture all of those attributes required to recreate the behavior of current and future digital documents; (2) developing techniques for saving, in human-readable form, the metadata needed to find, access, and recreate digital documents, so that emulation techniques can be used for preservation; and (3) developing techniques for encapsulating documents, their attendant metadata, software, and emulator specifications in ways that ensure their cohesion and prevent their corruption. Since this approach was first outlined (Michelson and Rothenberg 1992, Rothenberg 1995a), it has received considerable attention and has been cited as the only proposed approach that appears to offer a true solution to the problem of digital preservation (Erlandson 1996).

8.1 The right stuff

In principle, the proposed solution involves encapsulating three kinds of information with each digital document. In practice, there are a number of ways of doing this, some of which would be safer (but would use more storage), while others would involve somewhat more risk (but would use less storage). Figure 1 shows a logical view of this encapsulation. For clarity all items are shown explicitly, representing the logical model, although in practice, items that are required by many different documents might be stored in centralized repositories and pointed to by each document, rather than being replicated as part of each document.

The first kind of information to be encapsulated comprises the document and its software environment. Central to the encapsulation is the digital document itself, consisting of one or more files representing the original bit stream of the document as it was stored and accessed by its original software. In addition, the encapsulation contains the original software for the document, itself stored as one or more files representing the original executable bit stream of the application program that created or displayed the document. A third set of files represents the bit streams of the operating system and any other software or data files comprising the software environment in which the document’s original application software ran. It must be guaranteed that these bit streams will be copied verbatim when storage media are refreshed, to avoid corruption. This first group of encapsulated items represents the original document in its entire software context: given a computing platform capable of emulating the document’s original hardware platform, this information should recreate the behavior of the original document.
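One way to check that a bit stream has been copied verbatim during a refresh is to compare a cryptographic fingerprint recorded at accession time with one computed from the new copy. The following is a minimal sketch under that assumption; the use of SHA-256 and the function names `digest` and `refresh` are illustrative choices, not part of the approach described here.

```python
import hashlib


def digest(bit_stream: bytes) -> str:
    """Return a SHA-256 fingerprint of a stored bit stream."""
    return hashlib.sha256(bit_stream).hexdigest()


def refresh(bit_stream: bytes, recorded_digest: str) -> bytes:
    """Copy a bit stream to "new media" and confirm the copy is verbatim.

    In this sketch the new medium is simply another bytes object; a real
    refresh would write to, and read back from, the new storage device.
    """
    copy = bytes(bit_stream)
    if digest(copy) != recorded_digest:
        raise ValueError("copy differs from the recorded original")
    return copy


# Hypothetical document bit stream and its fingerprint recorded at accession.
original = b"\x00\x01document bit stream\xff"
recorded = digest(original)
refreshed = refresh(original, recorded)
```

The same check would apply unchanged to the application, operating system, and other files in the encapsulation, since all of them are treated as opaque bit streams.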

The second type of information in the encapsulation of a document consists of a specification of an emulator for the document’s original computing platform. The specification must provide sufficient information to allow an emulator to be created that will run on any conceivable computer (so long as the computer is capable of performing any computable function). This emulator specification cannot be an executable program, since it must be created without knowledge of the future computers on which it will run. Among other things, it must specify all attributes of the original hardware platform that are deemed relevant to recreating the behavior of the original document when its original software is run under emulation. Only one emulator specification need be developed for any given hardware platform: a copy of it (or pointer to it) can then be encapsulated with every document whose software uses that platform. This provides the key to running the software encapsulated with the document: assuming that the emulator specification is sufficient to produce a working emulator, the document can be read (accessed in its original form) by running its original software under this emulator.

The final type of information in the encapsulation of a document consists of explanatory material, labeling information, annotations, metadata about the document and its history, and documentation for the software and (emulated) hardware included in the encapsulation. This material must first explain to someone in the future how to use the items in the encapsulation to read the encapsulated digital document. In order to fulfill this function, at least the top level of this explanatory material must remain human-readable in the future, to serve as a “bootstrap” in the process of opening and using the encapsulation. This is one place where standards may find a niche in this approach: simple textual annotation standards (which might evolve over time) would provide one way of keeping explanatory material human-readable. If translation of this explanatory material is required to keep it human-readable (that is, if the annotation standards themselves evolve), the translation might be performed when the encapsulation is copied to new media: I refer to this limited form of translation as transliteration.⁵ Any such translation must be reversible without loss, to ensure (and make it possible to verify) that the explanatory material is not corrupted. (These same techniques must be used to store emulator specifications, which must also remain human-readable in the future.) Additional metadata in the encapsulation describe the original document and provide labeling information that must accompany the document. Finally, additional metadata must provide historical context, provenance, life cycle history, and administrative information to help manage the document over time.
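The requirement that transliteration be reversible without loss can be tested mechanically: translate the annotation into the new form, translate it back, and compare the result with the original. The sketch below assumes two invented annotation conventions ("NAME=value" lines for the old standard and "NAME: value" lines for the new one); neither is an actual standard, and the function names are hypothetical.

```python
def to_new(old_text: str) -> str:
    """Transliterate 'NAME=value' lines into 'NAME: value' lines."""
    lines = []
    for line in old_text.splitlines():
        name, value = line.split("=", 1)
        lines.append(f"{name}: {value}")
    return "\n".join(lines)


def to_old(new_text: str) -> str:
    """Inverse of to_new: recover 'NAME=value' lines."""
    lines = []
    for line in new_text.splitlines():
        name, value = line.split(": ", 1)
        lines.append(f"{name}={value}")
    return "\n".join(lines)


def transliterate(old_text: str) -> str:
    """Transliterate an annotation, refusing any result that cannot be reversed."""
    new_text = to_new(old_text)
    if to_old(new_text) != old_text:
        raise ValueError("transliteration is not reversible for this annotation")
    return new_text


# Hypothetical top-level annotation for an encapsulation.
annotation = "TITLE=Quarterly report\nREADER=run viewer under the emulator"
updated = transliterate(annotation)
```

The round-trip comparison is what makes corruption detectable: an annotation whose field name happens to contain ": " would be translated back incorrectly, and the check rejects it rather than storing a damaged form.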

8.2 Annotate, Encapsulate, Transliterate and Emulate

Given a suitable emulator specification for a given obsolete hardware platform (which need only be created once for all documents whose software uses that platform), the process of preserving a digital document can be summarized as a sequence of four steps: annotate, encapsulate, transliterate and emulate. That is, (1) create any annotations needed to provide context for the document and to explain how to open and use the encapsulation; (2) encapsulate with the document all of the items described in the previous section; (3) when necessary (optionally, at each media refresh cycle), transliterate annotations to keep them human-readable; and (4) in the future, open the encapsulation, create the specified emulator, and run the emulator on a future computer. This allows the original software to be run under emulation, thereby recreating the saved document.
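The four steps can be sketched as operations on a single encapsulation record. This is an illustration only: the `Encapsulation` class, its field names, and the idea of passing in an emulator builder as a function are assumptions made for the example, and the stand-in builder below merely decodes text rather than interpreting a real emulator specification.

```python
from dataclasses import dataclass, field, replace
from typing import Callable

# An emulator is modeled as a function from (software files, document) to output.
Emulator = Callable[[dict, bytes], str]


@dataclass(frozen=True)
class Encapsulation:
    document: bytes        # original bit stream of the document
    software: dict         # application, OS, and other environment files
    emulator_spec: str     # human-readable specification (or a pointer to one)
    annotations: dict = field(default_factory=dict)


def annotate(title: str, instructions: str) -> dict:
    """Step 1: create annotations that explain the encapsulation."""
    return {"title": title, "how_to_open": instructions}


def encapsulate(document, software, emulator_spec, annotations) -> Encapsulation:
    """Step 2: bundle the document with its software, spec, and annotations."""
    return Encapsulation(document, dict(software), emulator_spec, dict(annotations))


def transliterate(enc: Encapsulation, rewrite: Callable[[str], str]) -> Encapsulation:
    """Step 3: rewrite annotations only; the bit streams are left untouched."""
    new_annotations = {key: rewrite(text) for key, text in enc.annotations.items()}
    return replace(enc, annotations=new_annotations)


def emulate(enc: Encapsulation, build_emulator: Callable[[str], Emulator]) -> str:
    """Step 4: create the specified emulator and run the saved software."""
    emulator = build_emulator(enc.emulator_spec)
    return emulator(enc.software, enc.document)


def stand_in_builder(spec: str) -> Emulator:
    # Placeholder: a real builder would interpret the specification.
    return lambda software, document: document.decode("ascii")


enc = encapsulate(
    b"Hello, future reader",
    {"app": b"<application bits>", "os": b"<operating system bits>"},
    "specification for obsolete platform HW",
    annotate("Greeting", "Build the emulator, then run app"),
)
# A reversible rewrite: the added marker can be stripped to recover the original.
enc = transliterate(enc, lambda text: "[v2] " + text)
rendered = emulate(enc, stand_in_builder)
```

Note that only step 3 ever changes anything in the record, and it changes only the annotations; the document and software bit streams pass through every step unmodified.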

The sequence of events that must work in order for the emulation approach to allow an obsolete digital document to be read is illustrated in figure 2. The items in the top row of this figure represent elements that must be present for the scheme to work. Starting from the left, we must have a way of interpreting an emulator specification to produce a working hardware emulator (whether this interpretation is performed manually or automatically), and we must have a readable emulator specification for the required obsolete hardware (the original hardware and software are denoted HW and OS, respectively). This combination of a readable emulator specification and an interpreter for such specifications allows us to produce an emulator for the original hardware.

As shown in the middle of the top row, we then need a working, current computer and operating system (denoted HW’ and OS’) that can run the emulator: together, these produce a running OS’ environment, which is required to support both the emulation branch (shown by heavy lines at the left of the figure) and the media-access branch (shown at right). Following the media-access branch down from the upper right, the obsolete digital document itself must also exist on some current storage medium (to which it will have presumably migrated from its original medium) for which physical drives and device software are available. Assuming we can run the necessary driver software for this medium under the current hardware/operating system environment (HW’/OS’), we can thereby access the bit stream of the original document. Finally, going back to the main, emulation branch, running the emulator of the original, obsolete hardware (HW) in the current HW’/OS’ environment effectively “runs” the original hardware (under emulation); this allows us to run the original, saved (obsolete) operating system (OS), which in turn allows us to run the original, saved (obsolete) application software (SW) needed to read the saved (obsolete) digital document.

Though it may appear prohibitively inefficient to have to create and use an emulator to read each old document, three factors should be kept in mind. First, the inclusion of contextual annotation in the encapsulation makes it unnecessary to use emulation to perform routine management functions on the document, such as copying it, filing it, or distributing it. Emulation is needed only when the document is to be read or when its content is to be extracted for translation into some vernacular form.⁶

Second, an emulator specification for a given obsolete hardware platform need be created only once for all documents whose software uses that platform. This provides tremendous leverage: if an emulator specification is created for any document or document type, it will confer longevity on all other digital documents that use any of the software that runs on the given hardware platform.

Third, an emulator for a given obsolete platform need be created only once for each future platform on which emulation is required to run. Once created for each new generation of computer, the emulator for a given obsolete platform can be run whenever desired on any computer of that new generation. Generating new, running emulators for new computing platforms from saved emulator specifications will therefore be a rare process: once it has been done to access any document on a new platform, the resulting emulator for that platform can be used to access all other documents saved using the emulation scheme. The process of generating an emulator from its specifications can therefore be relatively inefficient (since it need be performed only infrequently), so long as the emulator that is generated is reasonably efficient when it runs.
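The third factor amounts to generating each emulator once per pair of obsolete and future platforms and reusing it thereafter. A minimal sketch, assuming a hypothetical `EmulatorRegistry` class and a caller-supplied (possibly slow) generation function:

```python
from typing import Callable


class EmulatorRegistry:
    """Generate an emulator once per (obsolete, host) platform pair, then reuse it."""

    def __init__(self, generate: Callable[[str, str], object]):
        self._generate = generate   # possibly slow: interprets a saved specification
        self._built = {}
        self.generations = 0        # how many times generation actually ran

    def get(self, obsolete_platform: str, host_platform: str):
        key = (obsolete_platform, host_platform)
        if key not in self._built:
            self._built[key] = self._generate(obsolete_platform, host_platform)
            self.generations += 1
        return self._built[key]


def slow_generate(obsolete_platform: str, host_platform: str) -> str:
    # Stand-in for the real work of producing an emulator from its specification.
    return f"emulator of {obsolete_platform} on {host_platform}"


registry = EmulatorRegistry(slow_generate)
first = registry.get("HW", "HW'")
second = registry.get("HW", "HW'")    # reused, not regenerated
third = registry.get("HW", "HW''")    # a newer generation of host computer
```

In this example three documents are opened but generation runs only twice, once for each host platform, which is why the cost of generation matters much less than the running speed of the resulting emulator.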

8.3 Ancillary issues

Saving proprietary software, hardware specifications, and documentation, as required by this emulation strategy, raises potential intellectual property issues. Hardware specifications of the kind required for emulation are not necessarily proprietary, and since emulator specifications are not currently produced by hardware vendors (or anyone else), their intellectual ownership is as yet undefined. While application software and its documentation are often proprietary, the application programs required to access saved documents in general need be no more than readers for the desired document format, rather than editing programs. Such readers (along with their documentation) are often provided free by software vendors to encourage the use of their editing software. Operating system software and drivers, on the other hand, may very well be proprietary, and intellectual property restrictions or fees for these essential items must be respected if this approach is to work. Since the whole point of encapsulating this software is to make it available in the future, when it would otherwise be obsolete, one possible strategy would be to negotiate the free use of obsolete software, or to amend the copyright law to extend the principle of fair use to cover obsolete software.

For this strategy to work, responsibility for developing emulator specifications of the kind required would have to be accepted by one or more agencies, institutions, or market segments. Similarly, explanatory text standards would have to be developed and maintained, and responsibility would have to be accepted for refreshing media and performing transliteration (translating this explanatory material into new human-readable forms) when necessary.

8.4 Strengths and limitations of the emulation approach

It may appear that emulating a hardware platform simply to run application software is unnecessarily roundabout. If what is really desired is to emulate the behavior of the original digital document, why go to the trouble of running its original software at all? The answer to this is that we do not yet have any formal (or even informal) way of describing the full range of behaviors possible for even the simplest of digital documents, such as are produced by word processing programs. Describing the behavior of dynamic, interactive, hypermedia documents poses a far greater challenge. The only adequate specification of the behavior of a digital document is the one implicit in its interaction with its software. The only way to recreate the behavior of a digital document is to run its original software.

It may then be argued that instead of actually running original software, we might emulate the behavior of that software. This would provide considerable leverage over emulating the behavior of individual documents, since a given application program may be used for thousands (or millions) of different documents; emulating the behavior of that program would avoid having to understand and recreate the behavior of each individual document. However, we have no adequate way of specifying the behavior of most programs. The only meaningful specification of a program’s behavior is implicit in its interaction with its underlying software/hardware environment; that is, programs are self-describing, but only when they run. The only way to tell what a program really does is to run it.

Alternatively, it might be argued that most application programs run under an operating system (though some may run on a “bare” machine), so we might emulate the behavior of the OS to provide a virtual platform for all applications that run under that OS. This would provide even greater leverage than would emulating applications, since many different applications run on a given OS. (Although some applications run on several different operating systems, there are many more application programs than there are operating systems.) Emulating an OS would avoid having to emulate all the applications that run on it. However, it is at least as difficult to emulate the behavior of an OS as it is to emulate the behavior of an application program; in fact, it is probably more difficult, since an OS interacts with every aspect of the computing environment, whereas most applications are far more constrained. So the argument against emulating an application applies a fortiori against emulating an OS. Nevertheless, I do not rule out this possibility: in some cases, it may be preferable to emulate a hardware platform along with an OS to produce a virtual hardware/software platform that can run application programs. The approach proposed here allows for this variation, though it assumes that emulating hardware platforms will usually make the most sense.

Emulating the underlying hardware platform appears to be the best approach, given the current state of the art. We do not have accurate, explicit specifications of software, but we do (and must) have such specifications for hardware: if we did not, we could not build hardware devices in the first place. Why is it that we can specify hardware but not software? Any specification is intended for some reader or interpreter. Application software is intended to be interpreted automatically by hardware to produce an ephemeral, virtual entity (the running application) whose behavior we do not require to be fully specified (except to the hardware that will run it), since it is intended to be used interactively by humans who can glean its behavior as they use it. On the other hand, a hardware specification is interpreted (whether by humans or software) to produce a physical entity (a computer) whose behavior must be well-specified, since we expect to use it as a building block in other hardware and software systems. Hardware specifications are by necessity far more rigorous and meaningful than those of software. Emulating hardware is therefore entirely feasible and is in fact done routinely.⁷

Hardware emulation is also relatively easy to validate: when programs intended for a given computer run successfully on an emulator of that computer, this provides reasonable assurance that the emulation is correct. Test suites of programs could be developed specifically for the purpose of validation, and an emulator specification could be tested by generating emulators for a range of different existing computers and by running the test suite on each emulator. A test suite of this kind could also be saved as part of the emulator specification and its documentation, allowing an emulator generated for a future computer to be validated (in the future, before being used) by running the saved test suite. In addition, the computer museum approach dismissed above might be used to verify future emulators by comparing their behavior with that of saved, obsolete machines.
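Validation against a saved test suite can be illustrated with a toy machine. The instruction set below (LOAD, ADD, MUL, OUT on a single accumulator) is invented for this example and stands in for a real hardware platform; `SAVED_TEST_SUITE` plays the role of the test programs stored with the emulator specification.

```python
def reference_emulator(program):
    """Emulate a toy accumulator machine and return everything it outputs."""
    acc = 0
    output = []
    for op, *args in program:
        if op == "LOAD":
            acc = args[0]
        elif op == "ADD":
            acc += args[0]
        elif op == "MUL":
            acc *= args[0]
        elif op == "OUT":
            output.append(acc)
        else:
            raise ValueError(f"unknown instruction: {op}")
    return output


def faulty_emulator(program):
    """A deliberately wrong emulator: it executes MUL as if it were ADD."""
    patched = [("ADD", *args) if op == "MUL" else (op, *args) for op, *args in program]
    return reference_emulator(patched)


# Test programs and their expected outputs, saved alongside the specification.
SAVED_TEST_SUITE = [
    ([("LOAD", 2), ("ADD", 3), ("OUT",)], [5]),
    ([("LOAD", 4), ("MUL", 5), ("OUT",)], [20]),
    ([("LOAD", 1), ("OUT",), ("ADD", 1), ("OUT",)], [1, 2]),
]


def validate(emulator, suite):
    """Return the indices of test programs whose output differs from the expected one."""
    return [
        index
        for index, (program, expected) in enumerate(suite)
        if emulator(program) != expected
    ]


good_failures = validate(reference_emulator, SAVED_TEST_SUITE)
bad_failures = validate(faulty_emulator, SAVED_TEST_SUITE)
```

A newly generated emulator would be accepted only if `validate` reports no failures; here the faulty emulator is caught by the one test program that exercises multiplication, which also shows why a saved suite needs to cover every attribute of the platform that is deemed relevant.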

Furthermore, of the potential emulation approaches discussed here, emulating hardware has the greatest leverage. Except for special-purpose embedded processors (such as those in toasters, automobiles, watches, and other products), computers are rarely built to run a single program: there are generally many more programs than hardware platforms, even though some programs may run on more than one platform. At any given moment, there are relatively few hardware platforms in existence, though new hardware platforms may appear with greater frequency than new operating systems or applications (unless we consider each version of an OS to be a different instance, which it really is). Emulating hardware obviates the need to emulate the behavior of operating systems, application programs, and individual digital documents, all of which are problematic; it therefore appears to be by far the most efficient as well as the most viable emulation approach.

Finally, hardware emulation is a well-understood, common technique. It has been used for decades, both to help design new hardware and to provide upward compatibility for users.

8.5 Natural experiments related to the emulation approach

Two classes of natural experiments suggest that the emulation approach described here should work. Though none of these experiments addresses all of the questions that must be answered in order to use emulation as a basis for digital preservation, they show that key pieces of the strategy have worked in the past.

The first class consists of examples of bundling digital documents with their original software to ensure that they are accessible and readable. For example, Apple Macintosh software is often distributed with a README file that describes the software, explains how to install it, and gives other information such as restrictions on use and information about bugs. (This is also an example of encapsulating explanatory annotation with software.) In order to ensure that the README file will be readable by any user, distribution disks typically include a copy of a simple text-editing program (SimpleText) that can display the README file. Though most users already have at least one copy of SimpleText on their systems, as well as other, more powerful editors, most software vendors prefer not to assume that this will be the case. In the emulation approach, digital documents would be bundled with their original software, just as the README file is bundled with software capable of reading it in this example.

A second example of bundling occurs in the PC world, involving the compression scheme called PKZIP. When a file is compressed using this software, a decompression program, such as PKUNZIP, is required to expand the file. However, an option in PKZIP allows a simple version of an unzip program to be bundled with each compressed file. Choosing this option creates an executable file which, when run, expands automatically to the original file, avoiding the issue of whether the recipient of a compressed file will have the appropriate decompression software on hand.
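The self-extracting idea can be imitated in a few lines. The sketch below is an analogy only and does not reproduce PKZIP's format or behavior: it uses Python's `zlib` and `base64` modules to build a small script that carries both a compressed payload and the code needed to expand it. The function name `make_self_extracting` is hypothetical.

```python
import base64
import zlib


def make_self_extracting(payload: bytes) -> str:
    """Return Python source that bundles a compressed payload with its own expander."""
    encoded = base64.b64encode(zlib.compress(payload)).decode("ascii")
    return (
        "import base64, zlib\n"
        f"PAYLOAD = {encoded!r}\n"
        "def extract():\n"
        "    return zlib.decompress(base64.b64decode(PAYLOAD))\n"
    )


# Hypothetical file contents to be distributed.
original_file = b"README: installation notes. " * 20
script = make_self_extracting(original_file)

# The recipient needs only the script itself (and something able to run it).
namespace = {}
exec(script, namespace)
restored = namespace["extract"]()
```

As with the examples in the text, the bundle still depends on a known platform being available to run it (here, a Python interpreter), which is the dependency the emulation approach is meant to remove.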

A final example is a collection of images distributed on the Planetary Data Systems CD-ROM from NASA’s Jet Propulsion Laboratory (JPL). The imagery on this CD is designed to be read with Adobe Reader 2, a free program that can display files encoded in Adobe’s popular portable document format (PDF). If the user tries to display these images with a later version of Adobe Reader, the images refuse to display themselves: not only are the images bundled with appropriate software, but they are protected from being accessed by any other software, since there is no guarantee that such software will treat the images appropriately. In this case, not even later versions of the same program (such as Adobe Reader 3) are allowed to read the images; though this may seem restrictive, it is in fact a good approach, since later versions of software do not always treat files created by older versions appropriately.

All of these examples bundle software to be run on a known platform, so none of them provides much longevity for their documents. Nevertheless they do prove that bundling original software with a document is an effective way of making sure that the document can be read.

The second class of natural experiments involves the use of emulation to add longevity to programs and their documents. The first example is a decades-old practice that hardware vendors have used to provide upward compatibility for their customers. Forcing users to rewrite all of their application software (and its attendant databases, documents, and other files) when switching to a new computer would make it hard for vendors to sell new machines. Many vendors (in particular, IBM) have therefore often supplied emulation modes for older machines in their new machines. The IBM 360, for example, included an emulation mode for the older 7090/94 so that old programs could still be run. Apple did something similar when switching from the Motorola 68000 processor series to the PowerPC by including an emulator for 68000 code; not only did this allow users to run all of their old programs on the new machine, but significant pieces of the Macintosh operating system itself were also run under emulation after the switch, to avoid having to rewrite them. Whether emulation is provided by a special mode using microcode or by a separate application program, such examples prove that emulation can be used to keep programs (and their documents) usable long after they would otherwise have become obsolete.

A second example of the use of emulation is in designing new computing platforms. Emulation has long been used as a way of refining new hardware designs, testing and evaluating them, and even beginning to develop software for them before they have been built. Emulators of this kind might be a first step toward producing the emulator specifications needed for the approach proposed here: hardware vendors might be induced to turn their hardware-design emulators into products that could satisfy the emulator scheme’s need for emulator specifications.

A final example of the use of emulation is in the highly active “retro-computing” community, whose members delight in creating emulators for obsolete video game platforms and other old computers. There are numerous World Wide Web sites listing hundreds of free emulators of this kind that have been written to allow old programs to be run on modern computers. A particularly interesting example of this phenomenon is the MAME (Multiple Arcade Machine Emulator) system, which supports emulation of a large number of different platforms, suggesting that emulation can be cost-effective for a wide range of uses.

These three examples consist of emulators that run on existing hardware platforms, so they do not address the problem of specifying an emulator for a future, unknown computer; but they prove that emulation is an effective way of running otherwise obsolete software.

FOOTNOTES

⁵ While transliteration need not be tied to refresh cycles, doing so minimizes the number of passes that must be made through a collection of digital material. If a single annotation standard is selected for all documents in a given corpus or repository during a given epoch to simplify document management, transliteration could be performed for all documents in a collection in lock-step, just as media refreshing is done in lock-step. Though transliteration does not necessarily have to be done at the same time as refreshing, doing so would be more efficient (though potentially riskier) than performing transliteration and refreshing at different times.

⁶ The emulation environment must be designed to allow such extraction in order to facilitate the generation of vernacular versions of obsolete documents.

⁷ It is not clear whether digital preservation needs to include the retention of attributes of the original medium on which a digital document was stored. It can be argued that digital documents are (or should be) logically independent of their storage media and therefore need not preserve the behavior of these media; however, a counterargument to this might be that since some digital documents are tailored to specific media (such as CD-ROM), they should retain at least some attributes of those media (such as speed). The approach described here is neutral with respect to this issue: it allows attributes of storage media to be emulated when desired but does not require them to be.