Avoiding Technological Quicksand: Section 6 • CLIR

6. The Inadequacy of Most Proposed Approaches

Most approaches that have been proposed fall into one of four categories: (1) reliance on hard copy, (2) reliance on standards, (3) reliance on computer museums, or (4) reliance on migration. Though some of these may play a role in an ultimate solution, none of them comes close to providing a solution by itself, nor does their combination.

6.1 Reliance on hard copy

It is sometimes suggested that digital documents be printed and saved as hard copy. This is not a true solution to the problem, since many documents (especially those that are inherently digital, such as hypermedia) cannot meaningfully be printed at all, or would lose many of their uniquely digital attributes and capabilities if they were printed. Even digital renditions of traditional documents (such as linear text) lose their core digital attributes by being printed; that is, they sacrifice direct machine-readability, which means they can no longer be copied perfectly, transmitted digitally, searched or processed by computer programs, and so forth. Similarly, attempting to save digital documents by printing the 0s and 1s of their bit streams on paper (or engraving them in metal) sacrifices their machine-readability and the core digital attributes that it enables (Bearman 1993, U. S. District Court for the District of Columbia 1993).

Moreover, any such scheme destroys whatever interactive or dynamic functionality an inherently digital document (or a digital rendition of a traditional document) may have, since the document’s original software can no longer be run. (While a human reader might be able to ignore the errors introduced by the optical scanning and character recognition of a printed version of a digital document, computer programs are far less tolerant of the errors that this process would introduce when scanning a printed sequence of 0s and 1s.) For all of these reasons, saving digital documents by printing them (whether rendering their content or printing their bit streams) does not offer a solution to the true problem of digital preservation.
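The fragility of a printed bit stream can be illustrated with a minimal Python sketch. The message text, the helper names `to_bits` and `from_bits`, and the positions of the simulated scanning errors are all assumptions chosen for illustration; the sketch models only the effect of a misread or dropped digit, not any real scanning process.

```python
def to_bits(text: str) -> str:
    """Render ASCII text as the string of 0s and 1s that would be printed."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def from_bits(bits: str) -> str:
    """Rebuild text from a scanned string of 0s and 1s, eight digits per byte."""
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("ascii", errors="replace")


original = "PAY 100"
bits = to_bits(original)

# Hypothetical scanning error 1: a single digit is misread.
position = 37
misread = bits[:position] + ("1" if bits[position] == "0" else "0") + bits[position + 1:]

# Hypothetical scanning error 2: a single digit is skipped entirely,
# which shifts every later digit into the wrong byte.
dropped = bits[:10] + bits[11:]

print(from_bits(bits))     # the intact bit stream decodes to the original text
print(from_bits(misread))  # one wrong digit silently changes the amount
print(from_bits(dropped))  # one missing digit garbles everything after it
```

A human proofreader would probably notice neither error in a page of digits, yet the decoded result is either plausibly wrong or unreadable.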

6.2 Reliance on standards

On the face of it, reliance on standards appears to offer a solution by allowing digital documents to be represented in forms that will endure into the future and for which future software will always provide accessibility. One version of this argument offers the relational database (RDB) as a paradigmatic example of how this might work (NARA 1991, Thibodeau 1991). This argument contends that since all relational database management systems (RDBMSs) are based on the same mathematical foundation (Codd 1982), any RDB that is accessioned by an institution can be translated without loss into the specific RDB form recognized by the RDBMS used by that institution. Even if the institution later changes RDBMSs, all of its RDBs should be able to migrate to the new RDBMS without loss, since all RDBMSs support the same baseline functionality (UNACCIS 1990, 1992).

While this argument appears convincing, it fails in several significant ways. First, precisely because the relational model legislates a standard baseline of functionality, real RDBMSs must distinguish themselves in the marketplace by introducing proprietary features that extend the relational model (such as “outer” joins, support for views, unique diagramming methods for data modeling, and the like). Any RDB that makes use of such proprietary features becomes at least somewhat nonstandard and will lose some of its functionality if translated into some other RDBMS (Bikson and Frinking 1993). In this way standardization sows the seeds of its own destruction by encouraging vendors to implement nonstandard features in order to secure market share. Users are motivated to use such features because they provide enhanced functionality, but using these features produces nonstandard databases that are likely to be orphaned by reliance on standards, since standards enforce strict limitations on what they can represent, and thereby preserve.

In addition, far from being a paradigmatic example, the relational database is actually unique: no other kind of digital document rests on an underlying formal mathematical foundation that can serve as the basis for its standardization. Word-processing documents, spreadsheets, graphics and image files, hypermedia, animation, and audio and video formats are still evolving so rapidly that it is unrealistic to expect definitive standards for any of these forms to emerge in the near future. Though standards continue to be developed for many kinds of digital documents, they are almost always informal, ad hoc, and relatively short-lived. Moreover, since they lack a formal underpinning, they often compete with rival standards produced for different purposes or by different groups. This leads to the sad but true statement, attributed to Andrew S. Tanenbaum, “One of the great things about standards is that there are so many different ones to choose from!”

Finally, the relational database example demonstrates a fundamental flaw in the standards-based approach. Just as the relational paradigm replaced earlier network and hierarchical database paradigms, it is currently under attack by the new object-oriented database (OODB) paradigm, which may well replace it or at least relegate it to the role of a low-level storage mechanism hidden beneath future object-oriented database management systems (OODBMSs). As was the case with previous database paradigm shifts, the transition from relational to object-oriented databases cannot be made simply by automatically translating RDBs into OODBs. The paradigms are so different that such translation is typically meaningless: even when it is possible, the result is likely to possess neither the formal rigor of the original relational form nor the enhanced semantic expressiveness of the new object-oriented form. This illustrates that even the best standards are often bypassed and made irrelevant by the inevitable paradigm shifts that characterize information science and will continue to do so.

Proponents of standards often argue that the way to deal with the problem of paradigm shifts is to force digital documents into current standard forms (even if this sacrifices some of their functionality) and then translate them, when current standards become obsolete, into whatever standards supplant the obsolete ones.³ This is analogous to translating Homer into modern English by way of every intervening language that has existed during the past 2,500 years. The fact that scholars do not do this (but instead find the earliest original they can, which they then translate directly into the current vernacular) is indicative of the fact that something is always lost in translation. Rarely is it possible to recover the original by retranslating the translated version back into the original language.
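The cumulative loss caused by chained translation can be sketched in a few lines of Python. The format names and feature sets below are hypothetical, and the model is deliberately crude: it assumes that a document is just a set of features and that each conversion keeps only the features the target format supports.

```python
from functools import reduce

# Hypothetical formats and the document features each one can represent.
FORMATS = {
    "format_1985": {"text", "fonts", "footnotes", "macros"},
    "format_1992": {"text", "fonts", "footnotes"},
    "format_1999": {"text", "fonts", "hyperlinks"},
}


def migrate(features: set, target: str) -> set:
    """Convert a document to a target format, dropping unsupported features."""
    return features & FORMATS[target]


original = {"text", "fonts", "footnotes", "macros"}

# Translate forward through two successor formats, then back to the original one.
chain = ["format_1992", "format_1999", "format_1985"]
round_trip = reduce(migrate, chain, original)

print(sorted(round_trip))         # only what every format in the chain could hold
print(sorted(original - round_trip))  # features that retranslation cannot restore
```

Under these assumptions the document that returns to its original format retains only the features common to every format in the chain, which is the sense in which retranslation rarely recovers the original.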

Not even character encodings last forever: ASCII (the venerable 7-bit American Standard Code for Information Interchange) is slowly giving way to Unicode (a newer character set, originally designed around 16-bit codes). Furthermore, the history of these encodings shows that successful standards do not always subsume their competitors, as exemplified by the ascendance of ASCII over EBCDIC (the 8-bit Extended Binary Coded Decimal Interchange Code long used by IBM) and the APL character set (designed for Iverson’s “A Programming Language”), despite the fact that ASCII cannot represent all of their characters.
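The point that ASCII cannot represent every EBCDIC character can be checked with Python's standard codecs. This sketch assumes the `cp037` codec (one common EBCDIC code page among several) as a stand-in for EBCDIC in general, and the sample text is invented for illustration.

```python
text = "5¢"  # the cent sign exists in EBCDIC code page 037 but not in ASCII

# The text survives a round trip through this EBCDIC code page.
ebcdic_bytes = text.encode("cp037")
ebcdic_round_trip = ebcdic_bytes.decode("cp037")

# Strict conversion to ASCII fails outright.
try:
    text.encode("ascii")
    ascii_can_represent = True
except UnicodeEncodeError:
    ascii_can_represent = False

# Forcing the conversion substitutes a placeholder, and the original
# character cannot be recovered from the result.
lossy = text.encode("ascii", errors="replace")
recovered = lossy.decode("ascii")

print(ebcdic_round_trip == text)  # True
print(ascii_can_represent)        # False
print(recovered)                  # 5?
```

The forced conversion succeeds without complaint, which is exactly how a translation into a narrower standard loses information silently.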

Nevertheless, standards should not be dismissed. Some standards (notably the Standard Generalized Markup Language, SGML, and its offspring) have proven highly extensible and worthwhile within their limited scope. Since text is likely always to be a part of most documents, SGML provides a useful capability (Coleman and Willis 1997), even though it does not by itself solve the problems of nontextual representation or of representing dynamic, interactive documents. In fact, if SGML had been adopted as a common interlingua (a commonly translatable intermediate form) among word processing programs, it would have greatly relieved the daily conversion problems that plague most computer users; yet this has not occurred, implying that even well-designed standards do not necessarily sweep the marketplace (Bikson 1997). Even so, converting digital documents into standard forms, and migrating to new standards if necessary, may be a useful interim approach while a true long-term solution is being developed. I also suggest below that standards may play a minor role in a long-term solution by providing a way to keep metadata and annotations readable.

6.3 Reliance on computer museums

To avoid the dual problems of corruption via translation and abandonment at paradigm shifts, some have suggested that computer museums be established, where old machines would run original software to access obsolete documents (Swade 1998). While this approach exudes a certain technological bravado, it is flawed in a number of fundamental ways. It is unlikely that old machines could be kept running indefinitely at any reasonable cost, and even if they were, this would limit true access to the original forms of old digital documents to a very few sites in the world, thereby again sacrificing many of these documents’ core digital attributes.

Furthermore, this approach ignores the fact that old digital documents (and the original software needed to access them) will rarely survive on their original digital media. If an obsolete digital document and its software survive into the future, this will probably be because their bit streams have been copied onto new media that did not exist when the document’s original computer was current. For example, an old word processing file from a 1970s personal computer system will not still exist on the 8-inch floppy disk that was native to that system but will instead have migrated onto a 3.5-inch floppy, a CD-ROM, or perhaps a DVD. The obsolete document would therefore have to be read by an obsolete machine from a new medium for which that machine has no physical drive, no interface, and no device software. The museum approach would therefore require building unique new device interfaces between every new medium and every obsolete computer in the museum as new storage media evolve, as well as coding driver software for these devices, which would demand maintaining programming skills for each obsolete machine. This seems hopelessly labor-intensive and ultimately infeasible.

Finally, computer chips themselves have limited physical lifetimes. Integrated circuits decay due to processes such as metal migration (the traces that define circuit connections on the chips migrate through the substrate over time) and dopant diffusion (the atoms that make semiconductors semiconduct diffuse away over time). Even if obsolete computers were stored carefully, maintained religiously, and never used, aging processes such as these would eventually render them inoperative; using them routinely to access obsolete digital documents would undoubtedly accelerate their demise.

One role that computer museums might play in preservation is to perform heroic efforts to retrieve digital information from old storage media. If an old disk or tape is found that may indeed still have readable information on it, an obsolete machine in the museum (which would presumably have a drive and software for the medium in question) could be used in a last-ditch attempt to tease the bits off the medium, as an alternative to electron microscopy or other equally extreme measures. A second role for computer museums might be in verifying the behavior of emulators, as discussed below. Beyond these limited roles, however, computer museums do not appear to be a serious option for the long-term preservation of digital documents.

6.4 Reliance on migration

The approach that most institutions are adopting (if only by default) is to expect digital documents to become unreadable or inaccessible as their original software becomes obsolete and to translate them into new forms as needed whenever this occurs (Bikson and Frinking 1993, Dollar 1992). This is the traditional migration approach of computer science. While it may be better than nothing (better than having no strategy at all or denying that there is a problem), it has little to recommend it.⁴

Migration is by no means a new approach: computer scientists, data administrators and data processing personnel have spent decades performing migration of data, documents, records, and programs to keep valuable information alive and usable. Though it has been employed widely (in the absence of any alternative), the nearly universal experience has been that migration is labor-intensive, time-consuming, expensive, error-prone, and fraught with the danger of losing or corrupting information. Migration requires a unique new solution for each new format or paradigm and each type of document that is to be converted into that new form. Since every paradigm shift entails a new set of problems, there is not necessarily much to be learned from previous migration efforts, making each migration cycle just as difficult, expensive, and problematic as the last. Automatic conversion is rarely possible, and whether conversion is performed automatically, semiautomatically, or by hand, it is very likely to result in at least some loss or corruption, as documents are forced to fit into new forms.

As has been proven repeatedly during the short history of computer science, formats, encodings, and software paradigms change often and in surprising ways. Of the many dynamic aspects of information science, document paradigms, computing paradigms, and software paradigms are among the most volatile, and their evolution routinely eludes prediction. Relational and object-oriented databases, spreadsheets, Web-based hypermedia documents, e-mail attachments, and many other paradigms have appeared on the scene with relatively little warning, at least from the point of view of most computer users. Each new paradigm of this kind requires considerable conversion of programs, documents, and work styles, whether performed by users themselves or by programmers, data administrators, or data processing personnel.

Even though some new paradigms subsume the ones they replace, they often still require a significant conversion effort. For example, the spreadsheet paradigm subsumes simple textual tables, but converting an existing table into a meaningful spreadsheet requires defining the formulas that link the entries in the table, although these relationships are likely to have been merely implicit in the original textual form (and long since forgotten). Similarly, word processing subsumes simple text editing, but conversion of a document from a simple textual form into a specific word processing format requires that fonts, paragraph types, indentation, highlighting, and so forth, be specified, in order to make use of the new medium and avoid producing a result that would otherwise be unacceptably old-fashioned, if not illegible.
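The difficulty of recovering implicit relationships from a textual table can be made concrete with a small Python sketch. The table rows, the column labels A and B, and the short list of candidate formulas are all hypothetical; a real converter would face a far larger space of possible relationships.

```python
import operator

# Candidate formulas a converter might try for a third column C = f(A, B).
CANDIDATES = {
    "A+B": operator.add,
    "A*B": operator.mul,
    "A**B": operator.pow,
    "A-B": operator.sub,
}


def consistent_formulas(rows):
    """Return the names of candidate formulas that match every (A, B, C) row."""
    return [
        name
        for name, formula in CANDIDATES.items()
        if all(formula(a, b) == c for a, b, c in rows)
    ]


# A one-row textual table gives no way to tell the candidates apart.
one_row = [(2, 2, 4)]
print(consistent_formulas(one_row))   # ['A+B', 'A*B', 'A**B']

# More rows narrow the field, but only among formulas someone thought to try.
two_rows = [(2, 2, 4), (3, 3, 9)]
print(consistent_formulas(two_rows))  # ['A*B']
```

Even when the data happen to single out one candidate, the choice remains a guess about the author's intent, which the textual table never recorded.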

One of the worst aspects of migration is that it is impossible to predict what it will entail. Since paradigm shifts cannot be predicted, they may necessitate arbitrarily complex conversion for some or all digital documents in a collection. In reality, of course, particularly complex conversions are unlikely to be affordable in all cases, leading to the abandonment of individual documents or entire corpora when conversion would be prohibitively expensive.

In addition, as when refreshing media, there is a degree of urgency involved in migration. If a given document is not converted when a new paradigm first appears, even if the document is saved in its original form (and refreshed by being copied onto new media), the software required to access its now-obsolete form may be lost or become unusable due to the obsolescence of the required hardware, making future conversion difficult or impossible. Though this urgency is driven by the obsolescence of software and hardware, rather than by the physical decay and obsolescence of the media on which digital documents are stored, it is potentially just as crucial. Therefore migration cannot generally be postponed without incurring the risk that it may become impossible in the future, and that the documents may be irretrievably lost. Worse yet, this problem does not occur just once for a given document (when its original form becomes obsolete) but recurs throughout the future, as each form into which the document has migrated becomes obsolete in turn.

Furthermore, because the cycles of migration that must be performed are determined by the emergence of new formats or paradigms, which cannot be controlled or predicted, it is essentially impossible to estimate when migration will have to be performed for a given type of document; the only reliable prediction is that any given type of document is very likely to require conversion into some unforeseeable new form within some random (but probably small) number of years. Since different format changes and paradigm shifts affect different (and unpredictable) types of documents, it is likely that some of the documents within a given corpus will require migration before others, unless the corpus consists entirely of a single type of document (which becomes less likely as documents make increasing use of hypermedia, since a single hypermedia document consists of component subdocuments of various types). This implies that any given corpus is likely to require migration on an arbitrarily (and uncontrollably) short cycle, determined by whichever of the component types of any of its documents is the next to be affected by a new format or paradigm shift.
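The way the shortest-lived component type governs an entire corpus can be shown with a short Python sketch. All of the figures and names below are hypothetical; in practice the years until a format becomes obsolete cannot be known in advance, so the numbers stand for one arbitrary outcome rather than a forecast.

```python
# Hypothetical years until each component type is hit by a format or paradigm shift.
years_until_obsolete = {"text": 12, "spreadsheet": 9, "image": 7, "audio": 4}

# Hypothetical corpus: each document lists the component types it contains.
corpus = {
    "annual_report": {"text", "spreadsheet"},
    "photo_archive": {"image"},
    "hypermedia_exhibit": {"text", "image", "audio"},
}


def years_until_migration(component_types):
    """A document must migrate as soon as any one of its components is affected."""
    return min(years_until_obsolete[t] for t in component_types)


per_document = {name: years_until_migration(types) for name, types in corpus.items()}
corpus_deadline = min(per_document.values())

print(per_document)     # each document is bound by its shortest-lived component
print(corpus_deadline)  # the corpus is bound by its shortest-lived document
```

Because the deadline is a minimum over every component of every document, adding more document types to a corpus can only shorten the cycle, never lengthen it.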

Finally, migration does not scale well. Because it is labor-intensive and highly dependent on the particular characteristics of individual document formats and paradigms, migration will derive little benefit from increased computing power. It is unlikely that general-purpose automated or semiautomated migration techniques will emerge, and if they do, they should be regarded with great suspicion because of their potential for silently corrupting entire corpora of digital documents by performing inadvertently destructive conversions on them. As the volume of our digital holdings increases over time, each migration cycle will face a greater challenge than the last, making the essentially manual methods available for performing migration increasingly inadequate to the task.

In summary, migration is essentially an approach based on wishful thinking. Because we cannot know how things will change in the future, we cannot predict what we will have to do to keep a given digital document (or type of document) accessible and readable. We can merely predict that document formats, software and hardware will become obsolete, that paradigms will shift in unpredictable ways (unpredictability being the essence of a paradigm shift), and that we will often have to do something. Since different changes may apply to different kinds of documents and records, we must expect such changes to force the migration of different kinds of documents on independent, unrelated schedules, with each document type (and perhaps even individual documents or their components) requiring specialized, labor-intensive handling. We cannot predict how much effort, time, or expense migration will require, how successful it will be in each case, how much will be lost in each conversion, nor how many of our documents will be corrupted, orphaned, or lost in each migration cycle. Furthermore, we can expect each cycle to be a unique experience that derives little or no benefit or cost savings from previous cycles, since each migration will pose a new set of unique problems.

In the absence of any alternative, a migration strategy may be better than no strategy at all; however, to the extent that it provides merely the illusion of a solution, it may in some cases actually be worse than nothing. In the long run, migration promises to be expensive, unscalable, error-prone, at most partially successful, and ultimately infeasible.

FOOTNOTES

³ This approach is the standards-based version of migration, described below.

⁴ The migration approach is often linked to the use of standards, but standards are not intrinsically a part of migration.