Avoiding Technological Quicksand: Sections 9-10 • CLIR

9. Research Required for the Emulation Approach

In order to prove the feasibility of the emulation approach, research is required in three areas: (1) techniques must be developed for specifying emulators that will run on unknown, future computers; (2) techniques must be developed for keeping annotations and explanations human-readable in the future; and (3) techniques must be developed for encapsulating documents, software, emulator specifications, and associated annotations and metadata to ensure their mutual cohesion and prevent their corruption.

9.1 Emulator specification formalism

An emulator specification formalism must be developed that captures all relevant attributes of a hardware platform, including interaction modes, speed (of execution, display, access, and so forth), display attributes (pixel size and shape, color, dimensionality, and so forth), time and calendar representations, device and peripheral characteristics, distribution and networking features, multiuser aspects, version and configuration information, and other attributes. The formalism must be extensible so that future attributes can be added when needed (for example, for compliance with future Y10K standards). The set of attributes needed to ensure that a future emulation precisely reproduces an obsolete platform in all possible aspects is unbounded, but the scheme assumes that a certain degree of variance in the behavior of emulators will be acceptable. (This variance corresponds to that in the original program’s behavior when executed on different contemporary systems and configurations, using different monitors, keyboards, disk drives, and other peripheral devices.)

Emulator specifications must be saved so that they can be used effectively to produce emulators in the future. There are several possible ways of doing this. First, an abstract, formal description could be saved, which could be interpreted by a human or program in the future to enable construction of the desired emulator. Second, an executable description (that is, an emulator program written in a high-level language) could be saved, which would be designed to run on some simple abstract machine that could easily be implemented in the future; instructions for implementing that abstract machine and/or an abstract formal description of it would be saved along with the emulator to allow it to run on future computers. Alternatively, the second approach could be transformed into an instance of the first by making the executable description abstract and formal, thereby allowing it either to be interpreted by a human or program in the future or to be run on a future implementation of an abstract machine. All of these alternatives require storing formal and/or explanatory descriptions that can be used to bootstrap the process of creating a future emulator. To ensure that these descriptions remain human-readable in the future, they must be stored in the annotation form discussed next.
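The second option can be illustrated with a toy. The sketch below assumes a hypothetical stack machine with four instructions (PUSH, ADD, MUL, HALT); the instruction set and the function name run are invented for illustration and are not part of any proposed formalism. The point is only that a machine this simple can be described completely in a few lines of explanatory text, so a future implementer could rebuild it from the saved description.

```python
from typing import List, Optional, Tuple

# One instruction is an (opcode, operand) pair; the operand is None when unused.
Instruction = Tuple[str, Optional[int]]


def run(program: List[Instruction]) -> List[int]:
    """Interpret a program for the hypothetical stack machine and return the final stack."""
    stack: List[int] = []
    pc = 0  # index of the next instruction to execute
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack


# (2 + 3) * 4, written for the abstract machine.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None), ("HALT", None)]
print(run(program))  # prints [20]
```

A real emulator written for such a machine would be far larger, but it would depend only on the saved definition of the machine, not on any particular future hardware.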

9.2 Human-readable annotations and explanations

Ideally, the emulation scheme would be self-describing: that is, a suitable program running on a future computer, when asked to access an obsolete document saved using this scheme, would automatically interpret the saved explanations to find out how to open the encapsulation, generate the required emulator (or find one that has already been generated for this type of computer), and run the document’s saved software under this emulator to access the document itself. Alternatively, a user could interpret the saved explanations to perform the same steps. In either case, the key to successfully accessing a document saved using the emulation approach lies in the saved explanations that accompany the document, including explanations of how to use the encapsulation itself, user documentation, version and configuration information for all the software that is to be run under emulation (and for the emulated hardware), and the emulator specification itself. Whether or not these saved explanations can be automatically interpreted by future computer programs, they must remain readable by future humans, to ensure that saved documents are not lost.

The emulation approach requires the development of an annotation scheme that can save these explanations in a form that will remain human-readable, along with metadata which provide the historical, evidential, and administrative context for preserving digital documents. There has been considerable work in the library, archives, scientific data, and records communities on identifying such metadata (Cox 1994 and 1996, IEEE 1997, NRC 1995, Rothenberg 1996).

Future users of digital documents preserved using the emulation approach will be faced with an encapsulated collection of components that need to be used in a particular way in order to read the desired document. First and foremost, users must be able to read some intelligible explanation that tells them how to proceed. This explanation must itself be a digital document (if only to guarantee that it accompanies the other components of the encapsulation), but it must be human-readable if it is to serve its purpose. It will generally be of the same vintage as the encapsulated digital document whose exhumation it explains, but it cannot be stored in the same way as that document, or it will be equally unreadable. The solution to this conundrum lies in restricting the form of this explanatory documentation, for example, to simple text (or possibly text plus simple line drawings).

Even if the encoding of this explanatory material is standardized, however, whatever standard is chosen will eventually become obsolete, which is why the emulation strategy allows annotations and explanations to be translated (transliterated) whenever necessary. In order to guarantee that this translation is performed without loss, we must develop subset-translatable encodings, which I define as having the property that if some encoding Y is subset-translatable into another encoding Z, then anything expressed in Y can be translated into a subset Zy of Z, and anything in the resulting subset Zy can be translated back into Y without loss. This allows Z to be a proper superset of Y (not limited to Y’s expressivity) while ensuring that anything that is expressed in Y can be translated into Z and back into Y again without loss. A sequence of such encodings, evolving as necessary over time, will solve the readability problem for annotations: each encoding in this sequence serves as an annotation standard during a given epoch.
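The property can be made concrete with two existing encodings. The sketch below takes ASCII as Y and UTF-16 (big-endian) as Z; this pairing is chosen here only as an illustration, and the function names y_to_z, z_to_y, and in_subset_zy are hypothetical. Every ASCII text lands in a subset Zy of UTF-16 and comes back unchanged, while UTF-16 text outside Zy has no Y-form.

```python
def y_to_z(y_bytes: bytes) -> bytes:
    """Translate ASCII (Y) into UTF-16-BE (Z); the result always lies in the subset Zy."""
    return y_bytes.decode("ascii").encode("utf-16-be")


def z_to_y(z_bytes: bytes) -> bytes:
    """Translate UTF-16-BE back into ASCII; raises UnicodeError for input outside Zy."""
    return z_bytes.decode("utf-16-be").encode("ascii")


def in_subset_zy(z_bytes: bytes) -> bool:
    """Report whether a Z-encoded text belongs to Zy, i.e. can be translated back into Y."""
    try:
        z_to_y(z_bytes)
        return True
    except UnicodeError:
        return False


original = b"How to open this encapsulation"
translated = y_to_z(original)

assert z_to_y(translated) == original                   # round trip without loss
assert in_subset_zy(translated)                         # Y maps into Zy
assert not in_subset_zy("résumé".encode("utf-16-be"))   # Z is a proper superset of Y
```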

Although it is logically sufficient (having asserted that an encoding Y is subset-translatable into encoding Z) to translate a document from Y to Z and discard the original Y-form of the document, this is unlikely to convince skeptical future users. It is therefore also important to develop the concept of a subset-translator (consisting in each case of a table or a process, depending on the complexity of the translation) that shows how to translate Y into Z and back again. If this translator is saved, along with definitions of encodings Y and Z, and the Y and Z forms of all translated information, then any future user can verify that Y is indeed subset-translatable into Z, that the information was correctly translated from Y to Z, and that nothing was lost in this translation (by verifying that the reverse translation reproduces the original, saved Y-form of the information).⁸ In order for all of this saved information (encodings, translators, history of translations that have been performed, and so forth) to remain readable in the future, it must be stored using this same transliteration scheme, that is, it must be encoded in a current annotation standard, to be subset-translated as needed in the future.
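The verification a skeptical future user might perform can be sketched for the simplest case, a table-based translator. The code below is a minimal illustration under stated assumptions: the translator is a plain symbol-to-code table, the saved Y-form is a string of symbols, the saved Z-form is a list of integer codes, and the name verify_translation is hypothetical.

```python
from typing import Dict, List


def verify_translation(table: Dict[str, int], y_form: str, z_form: List[int]) -> bool:
    """Check a saved Y-to-Z translation against the saved translator table."""
    # The table must be one-to-one, otherwise no reverse translation exists.
    inverse = {z: y for y, z in table.items()}
    if len(inverse) != len(table):
        return False
    # Every symbol of the saved Y-form must be covered by the table.
    if any(symbol not in table for symbol in y_form):
        return False
    # The forward translation must reproduce the saved Z-form.
    if [table[symbol] for symbol in y_form] != list(z_form):
        return False
    # The reverse translation must reproduce the original, saved Y-form.
    return [inverse[code] for code in z_form] == list(y_form)


# Hypothetical three-symbol translator table and one saved translation.
table = {"a": 0x0061, "b": 0x0062, " ": 0x0020}
saved_y = "ab ba"
saved_z = [0x0061, 0x0062, 0x0020, 0x0062, 0x0061]

print(verify_translation(table, saved_y, saved_z))       # prints True
print(verify_translation(table, saved_y, saved_z[:-1]))  # prints False
```

A translation that requires a process rather than a table would need a correspondingly richer check, but the structure of the argument is the same: translate forward, compare, translate back, compare.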

9.3 Encapsulation techniques

One final piece of the puzzle is required to make the emulation approach work: how do we encapsulate all of the required items so that they do not become separated or corrupted and so that they can be handled as a single unit for purposes of data management, copying to new media, and the like? While encapsulation is one of the core concepts of computer science, the term carries a misleading connotation of safety and permanence in the current context. An encapsulation is, after all, nothing more than a logical grouping of items. For example, whether these are stored contiguously depends on the details of the storage medium in use at any given time. The logical shell implied by the term encapsulation has no physical reality (unless it is implemented as a hardened physical storage device). And while it is easy to mark certain bit streams as inviolate, it may be impossible to prevent them from being corrupted in the face of arbitrary digital manipulation, copying, and transformation.

Techniques must therefore be developed for protecting encapsulated documents and detecting and reporting (or correcting) any violations of their encapsulation. In addition, criteria must be defined for the explanatory information that must be visible outside an encapsulation to allow the encapsulation to be interpreted properly.
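One familiar technique for the detection half of this requirement is a manifest of cryptographic digests. The sketch below uses SHA-256 from Python's standard library; the component names and the functions seal and find_violations are hypothetical. It detects and reports violations but does not correct them, and it assumes that the manifest itself is stored safely and that the digest algorithm is still understood when the check is made.

```python
import hashlib
from typing import Dict, List, Tuple


def seal(components: Dict[str, bytes]) -> Dict[str, str]:
    """Record a SHA-256 digest for every component of an encapsulation."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in components.items()}


def find_violations(components: Dict[str, bytes],
                    manifest: Dict[str, str]) -> List[Tuple[str, str]]:
    """Compare components against the manifest and report each discrepancy found."""
    problems = []
    for name, digest in manifest.items():
        if name not in components:
            problems.append((name, "missing"))
        elif hashlib.sha256(components[name]).hexdigest() != digest:
            problems.append((name, "corrupted"))
    for name in components:
        if name not in manifest:
            problems.append((name, "unexpected"))
    return problems


# Hypothetical encapsulation contents.
components = {
    "document.bin": b"original bit stream",
    "emulator-spec.txt": b"specification of the obsolete platform",
}
manifest = seal(components)

components["document.bin"] = b"original bit strean"  # simulate a corrupted copy
print(find_violations(components, manifest))  # prints [('document.bin', 'corrupted')]
```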

Many encapsulated digital documents from a given epoch will logically contain common items, including emulator specifications for common hardware platforms, common operating system and application code files, software and hardware documentation, and specifications of common annotation standards and their translators. Physically copying all of these common elements into each encapsulation would be highly redundant and wasteful of storage. If trustworthy repositories for such items can be established (by libraries, archives, government agencies, commercial consortia, or other organizations), then each encapsulation could simply contain a pointer to the required item (or its name and identifying information, along with a list of alternative places where it might be found). Different alternatives for storing common items may appeal to different institutions in different situations, so a range of such alternatives should be identified and analyzed.
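A pointer of this kind might carry the item's name, identifying information, and a list of alternative places to look. The sketch below is one possible shape, offered as an assumption rather than a proposal: the identifying information is a SHA-256 digest, repositories are modeled as in-memory dictionaries, and the names SharedItemRef and resolve, along with the repository names, are hypothetical.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SharedItemRef:
    """Pointer stored inside an encapsulation in place of a copy of a common item."""
    name: str                  # name of the shared item
    sha256: str                # identifying information: digest of the expected content
    locations: Tuple[str, ...] # alternative repositories, in order of preference


def resolve(ref: SharedItemRef,
            repositories: Dict[str, Dict[str, bytes]]) -> Optional[bytes]:
    """Try each listed repository in turn; accept only content that matches the digest."""
    for location in ref.locations:
        data = repositories.get(location, {}).get(ref.name)
        if data is not None and hashlib.sha256(data).hexdigest() == ref.sha256:
            return data
    return None  # not found, or every copy found failed verification


spec = b"emulator specification for a common platform"
ref = SharedItemRef(
    name="platform-x-emulator-spec",
    sha256=hashlib.sha256(spec).hexdigest(),
    locations=("national-archive", "library-consortium"),
)

# The first repository holds a damaged copy; the second holds a good one.
repositories = {
    "national-archive": {"platform-x-emulator-spec": b"damaged copy"},
    "library-consortium": {"platform-x-emulator-spec": spec},
}
assert resolve(ref, repositories) == spec
```

Verifying the digest on retrieval means that an encapsulation does not have to trust any single repository, only that at least one of the listed alternatives still holds an intact copy.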

There is also the question of what should go inside an encapsulation versus what should be presented at its surface to allow it to be manipulated effectively and efficiently. In principle, the surface of an encapsulation should present indexing and cataloging information to aid in storing and finding the encapsulated document, a description of the form and content of the encapsulated document and its associated items to allow the encapsulation to be opened, contextual and historical information to help a potential user (or document manager) evaluate the relevance and validity of the document, and management information to help track usage and facilitate retention and other management decisions. All of this information should be readable without opening the encapsulation, since none of it actually requires reading the encapsulated document itself.
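The division between surface and interior can be pictured as a simple data structure. The sketch below is illustrative only: the four surface categories follow the list above, but the class names, field names, sample values, and the function find_by_catalog are assumptions. The interior is held as an opaque bit stream that the search never reads.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Surface:
    """Information readable without opening the encapsulation."""
    catalog: Dict[str, str]      # indexing and cataloging information
    description: Dict[str, str]  # form and content, enough to open the encapsulation
    context: Dict[str, str]      # contextual and historical information
    management: Dict[str, str]   # usage tracking and retention information


@dataclass
class Encapsulation:
    surface: Surface
    interior: bytes  # document, software, emulator specification; opaque at this level


def find_by_catalog(items: List[Encapsulation], field: str, value: str) -> List[Encapsulation]:
    """Select encapsulations using surface information only; the interior is never read."""
    return [item for item in items if item.surface.catalog.get(field) == value]


# Hypothetical holdings.
holdings = [
    Encapsulation(
        surface=Surface(
            catalog={"title": "Annual report", "year": "1998"},
            description={"annotation-standard": "plain text"},
            context={"creator": "records office"},
            management={"retention": "permanent"},
        ),
        interior=b"...encapsulated bit streams...",
    ),
]
print(len(find_by_catalog(holdings, "year", "1998")))  # prints 1
```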

It is logically necessary only that the tip of this information protrude through the encapsulation: there must be some explanatory annotation on the surface that tells a reader how to open at least enough of the encapsulation to access further explanatory information inside the encapsulation. Even this surface annotation will generally not be immediately human-readable, if the encapsulation is stored digitally. If it happens to be stored on a physical medium that is easily accessible by humans (such as a disk), then this surface annotation might be rendered as a human-readable label on the physical exterior of the storage unit, but this may not be feasible. For example, if a large number of encapsulations are stored on a single unit, it may be impossible to squeeze all of their surface annotations onto the label of the unit. So in general, even this surface annotation will be on a purely logical surface that has no physical correlate. The reader of this surface annotation will therefore be a program rather than a human, though it may quickly deliver what it reads to a human. It must therefore be decided how such surface annotations should be encoded, for example, whether the annotation standards described above are sufficient for this purpose or whether a hierarchy of such standards (corresponding to different levels of immediate human-readability) should be developed.

10. Summary

The long-term digital preservation problem calls for a long-lived solution that does not require continual heroic effort or repeated invention of new approaches every time formats, software or hardware paradigms, document types, or recordkeeping practices change. This approach must be extensible, since we cannot predict future changes, and it must not require labor-intensive translation or examination of individual documents. It must handle current and future documents of unknown type in a uniform way, while being capable of evolving as necessary. Furthermore, it should allow flexible choices and tradeoffs among priorities such as access, fidelity, and ease of document management.

Most approaches that have been suggested as solutions to this problem, including reliance on standards and the migration of digital material into new forms as required, suffer from serious inadequacies. In contrast, the emulation strategy as elaborated above, though it requires further research and proof of feasibility, appears to have many conceptual advantages over the other approaches suggested and is offered as a promising candidate for a solution to the problem of preserving digital material far into the future.

FOOTNOTE

⁸ Although a sequence of such translations may be needed over time, all that is really required is to save the sequence of encodings and translators: future custodians of this explanatory information could then safely defer translating a particular annotation until it is needed, so long as its encoding is not lost.