Preserving Authentic Digital Information

by Jeff Rothenberg


Introduction

This paper argues that to better understand what is required to meaningfully preserve digital information, we should attempt to create a foundation for the concept of the authenticity of informational entities that transcends the multiple disciplines in which this concept arises. Whenever informational entities are used and for whatever purpose, their suitability relies on their authenticity. Yet archivists, librarians, museum curators, historians, scholars, and researchers in various fields define authenticity in distinct, though often overlapping, ways. They combine legal, ethical, historical, and artistic perspectives such as the desire to provide accountability, the desire to ensure proper attribution, or the desire to recreate, contextualize, or interpret the original meaning, function, impact, effect, or aesthetic character of an artifact. Each discipline may have its own explicit definition of authenticity; however, in interdisciplinary discussions of authenticity, the dependence of a given definition on its discipline is often manifested only implicitly.

The technological issues surrounding the preservation of digital informational entities interact with authenticity in novel and profound ways. We are far more likely to achieve meaningful insights into the implications of these interactions if we develop a unified, coherent, discipline-transcendent view of authenticity. Such a view would

  • improve communication across disciplines;
  • provide a better basis for understanding what preservation requirements are implied by the need for authenticity; and
  • facilitate the development of common preservation strategies that would work for as many different disciplines as possible and thereby effect technological economies of scale.

Developing a preservation strategy that economically transcends disciplines would free preservationists from the need for discipline-specific definitions of authenticity. In this paper, I will suggest that there is at least one preservation strategy, based on the notion of a digital-original, that makes the details of how we define authenticity all but irrelevant from the perspective of preservation. However, to derive this conclusion, it is necessary to examine authenticity in some depth.

Although a discipline-transcendent view of authenticity would be the ideal, it may turn out to be impractical. If so, we may need to settle for a multidisciplinary perspective. This means establishing either a unified concept of authenticity as it is used in a subgroup of disciplines (such as archives, libraries, and museums) or a set of variant concepts of authenticity, each of which addresses the specific needs of a different discipline yet retains as much in common with the other concepts as possible.

Basic Definitions

The term informational entity, as used here, refers to an entity whose purpose or role is informational. By definition, any informational entity is entirely characterized by information, which may include contextual and descriptive information as well as the core entity. Examples of informational entities include digital books, records, multimedia objects, Web pages, e-mail messages, audio or video material, and works of art, whether they are “born digital” or digitized from analog forms.

It is not easy for computer scientists to agree on a definition of the word digital.1 In the current context, it generally denotes any means of representing sequences of discrete symbolic values (each value having two or more unambiguously distinguishable states) so that these sequences can, at least in principle, be accessed, manipulated, copied, stored, and transmitted entirely by mechanical means with a high degree of reliability (Rothenberg 1999). Digital informational entities are defined in the next section.

The term authenticity is even harder to define, but the term is used here in its broadest sense. Its meaning is not restricted to authentication, as in verifying authorship, but is intended to include issues of integrity, completeness, correctness, validity, faithfulness to an original, meaningfulness, and suitability for an intended purpose. I leave to specialists in various scholarly disciplines the task of elaborating the dimensions of authenticity in those disciplines. The focus of this paper is the interplay between those dimensions and the technological issues involved in preserving digital informational entities. The dimensions of authenticity have a profound effect on the technical requirements of any preservation scheme, digital or otherwise.

The remainder of this paper discusses the importance of understanding authenticity as a prerequisite to defining meaningful digital preservation.

Digital Informational Entities are Executable Programs

The distinguishing characteristic of a digital informational entity is that it is essentially a program that must be interpreted to be made intelligible to a human: it cannot simply be held up to the light to be read. A program is a sequence of commands in a formal language that is intended to be read by an interpreter that understands that language.2 An interpreter is a process that knows how to perform the commands specified in the formal language in which the program is written. Even a simple text document consisting of a stream of ASCII character codes is a program, i.e., it is a sequence of commands in a formal language (each command specifying a character to be rendered) that must be interpreted before it can be read by a human. More elaborate digital formats, such as distributed hypermedia documents, may, in addition to requiring interpretation for navigation and rendering, embed macros, scripts, animation processes, or other active components, any of which may require arbitrarily complex interpretation.
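The claim that even plain text is a program can be illustrated with a minimal sketch. The function name `interpret` and the sample bytes are assumptions introduced here for illustration; Python's built-in "ascii" and "cp037" (an EBCDIC variant) codecs stand in for two different interpreters of the same bit stream.

```python
def interpret(bit_stream: bytes, encoding: str) -> str:
    """Render a bit stream by treating each byte as one 'render this character' command."""
    rendered = []
    for code in bit_stream:
        # The interpreter (here, a single-byte codec) decides which character
        # each code stands for; the bits themselves carry no such meaning.
        rendered.append(bytes([code]).decode(encoding))
    return "".join(rendered)


# Hypothetical five-byte document.
bit_stream = bytes([0x48, 0x45, 0x4C, 0x4C, 0x4F])

as_ascii = interpret(bit_stream, "ascii")   # the intended interpreter
as_ebcdic = interpret(bit_stream, "cp037")  # a different interpreter, same bits

print(repr(as_ascii))
print(repr(as_ebcdic))
```

The same five bytes yield readable text under the intended interpreter and unrelated characters under the other, which is why the bit stream alone does not determine what a reader sees.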

Some programs are interpreted directly by hardware (for example, a printer may render ASCII characters from their codes), but the interpreters of most digital informational entities are software (i.e., application programs). Any software interpreter must itself be interpreted by another hardware or software interpreter, but any sequence of software interpretations must ultimately result in some lowest level (“machine language”) expression that is interpreted (“executed”) by hardware.

It follows that it is not sufficient to save the bit stream of a digital informational entity without also saving the intended interpreter of that bit stream. Saving the bit stream alone would be analogous to saving hieroglyphics without saving a Rosetta Stone.3

In light of this discussion, it is useful to define a digital informational entity as consisting of a single, composite bit stream4 that includes the following:

  • the bit stream representing the core content of the entity (that is, the encoding of a document, data, or a record), including all structural information required to constitute the entity from its various components, wherever and however they may be represented;
  • component bit streams representing all necessary contextual or ancillary information or metadata needed to make the entity meaningful and usable; and
  • one or more component bit streams representing a perpetually executable interpreter capable of rendering the core content of the entity from its bit stream, in the manner intended.5

If we define a digital informational entity in this way, as including both any necessary contextual information and any required interpreter, we can see that preserving such an entity requires preserving all of these components.6 Given this definition, one of the key technical issues in preserving digital informational entities becomes how to devise mechanisms for ensuring that interpreters can be made perpetually executable.
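The three-part definition above can be sketched as a simple data structure. This is only an illustration under stated assumptions: the class name `DigitalInformationalEntity`, its field names, and the `missing_components` helper are hypothetical, and treating an empty context list as "missing" is a simplification (an entity might genuinely need no ancillary information).

```python
from dataclasses import dataclass, field


@dataclass
class DigitalInformationalEntity:
    # Bit stream encoding the core content (document, data, or record).
    core_content: bytes = b""
    # Component bit streams holding contextual or ancillary information and metadata.
    context: list[bytes] = field(default_factory=list)
    # Component bit streams holding one or more interpreters for the core content.
    interpreters: list[bytes] = field(default_factory=list)

    def missing_components(self) -> list[str]:
        """Name the components that are absent from the composite bit stream."""
        missing = []
        if not self.core_content:
            missing.append("core content")
        if not self.context:  # simplification: assumes some context is always needed
            missing.append("contextual information")
        if not self.interpreters:
            missing.append("interpreter")
        return missing


# A "preserved" entity for which only the raw bits were saved (hypothetical data).
bits_only = DigitalInformationalEntity(core_content=b"\x48\x45\x4c\x4c\x4f")
print(bits_only.missing_components())
```

Note that the sketch can only record whether an interpreter component is present; it says nothing about whether that interpreter remains executable, which is the harder problem identified above.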

Preservation Implies Meaningful Usability

The relationship between digital preservation and authenticity stems from the fact that meaningful preservation implies the usability of that which is preserved. That is, the goal of preservation is to allow future users to retrieve, access, decipher, view, interpret, understand, and experience documents, data, and records in meaningful and valid (that is, authentic) ways. An informational entity that is “preserved” without being usable in a meaningful and valid way has not been meaningfully preserved, i.e., has not been preserved at all.

As a growing proportion of the informational entities that we create and use become digital, it has become increasingly clear that we do not have effective mechanisms for preserving digital entities. As I have summarized this problem elsewhere: “There is as yet no viable long-term strategy to ensure that digital information will be readable in the future. Digital documents are vulnerable to loss via the decay and obsolescence of the media on which they are stored, and they become inaccessible and unreadable when the software needed to interpret them, or the hardware on which that software runs, becomes obsolete and is lost” (Rothenberg 1999).

The difficulty of defining a viable digital preservation strategy is partly the result of our failing to understand and appreciate the authenticity issues surrounding digital informational entities and the implications of these issues for potential technical solutions to the digital preservation problem. The following discussion argues that the impact of authenticity on preservation is manifested in terms of usability, namely that a preserved informational entity can serve its intended or required uses if and only if it is preserved authentically.

For traditional, analog informational entities, the connection between preservation and usability is obvious. If a paper document is “preserved” in such a way that the ink on its pages fades into illegibility, it probably has not been meaningfully preserved. Yet even in the traditional realm, it is at least implicitly recognized that informational entities have a number of distinct attributes that may be preserved differentially. For example, stone tablets bearing hieroglyphics that were physically preserved before the discovery of the Rosetta Stone were nevertheless unreadable because the ability to read the language of their text had been lost. Similarly, although the original Declaration of Independence has been preserved, most of its signatures have faded into illegibility. Many statues, frescos, tapestries, illuminated manuscripts, and similar works are preserved except for the fact that their pigments have faded, often beyond recognition. Although it is not always possible to fully preserve an informational entity, it may be worth preserving whichever attributes can be preserved if doing so enables the entity to be used in a meaningful way. In other words, if preserving certain attributes of an informational entity may allow it to fulfill some desired future use, then we are likely to consider those attributes worth preserving and to consider that we have at least partially preserved the entity by preserving those attributes. Generalizing from this, the meaningful preservation of any informational entity is ultimately defined in terms of which of its attributes can and must be preserved to ensure that it will fulfill its future use, whether originally intended, subsequently expected, or unanticipated.

Deciding which attributes of traditional informational entities to preserve involves little discretion. Because a traditional informational entity is a physical artifact, saving it in its entirety preserves (to the extent possible) all aspects of the entity that are inherent in its physical being, which is to say all of its attributes. Decisions may still have to be made, for example, about what technological measures should be used to attempt to preserve attributes such as color. For the most part, however, saving any aspect of a traditional informational entity saves every aspect, because all of its aspects are embodied in its physicality.

For digital informational entities, the situation is quite different. There is no accepted definition of digital preservation that ensures saving all aspects of such entities. By choosing a particular digital preservation method, we determine which aspects of such entities will be preserved and which ones will be sacrificed. We can save the physical artifact that corresponds to a traditional informational entity in its entirety; however, there is no equivalent option for a digital entity.7 The choice of any particular digital preservation technology therefore has inescapable implications for what will and will not be preserved. In the digital case (so far, at least), we must choose what to lose (Rothenberg and Bikson 1999).

This situation is complicated by the fact that we currently have no definitive taxonomies either of the attributes of digital informational entities or of the uses to which they may be put in the future. Traditional informational entities have been around long enough that we can feel some confidence in understanding their attributes, as well as the ways in which we use them. Anyone who claims to have a corresponding understanding of digital informational entities lacks imagination. Society has barely begun to tap the rich lode of digital capabilities waiting to be mined. The attributes of future digital informational entities, the functional capabilities that these attributes will enable, and the uses to which they may be put defy prediction.

Strategies for Defining Authenticity

It is instructive to consider several strategies that can be used to define authenticity. Each strategy may lead to a number of different ways of defining the concept and may, in turn, involve a number of alternative tactics that enable its implementation.

One strategy is to focus on the originality of an informational entity, that is, on whether it is unaltered from its original state. This strategy works reasonably well for traditional, physical informational entities but is problematic for digital informational entities. The originality strategy can be implemented by means of several tactics. One such tactic is to focus on the intrinsic properties of an informational entity by providing criteria for whether each property is present in its proper, original form. For example, one can demand that the paper and ink of a traditional document be original and devise chemical, radiological, or other tests of these physical properties.8

A second tactic for implementing the originality strategy is to focus on the process by which an entity is saved, relying on its provenance or history of custodianship to warrant that the entity has not been modified, replaced, or corrupted and must therefore be original. For example, from an archival perspective, a record is an informational artifact that provides evidence of some event or decision that was performed as part of the function of some organization or agency. The form and content of the record convey this evidence, but the legitimacy of the evidence rests on being able to prove that the record is what it purports to be and has not been altered or corrupted in such a way as to invalidate its evidential meaning. The archival principle of provenance seeks to establish the authenticity of archival records by providing evidence of their origin, authorship, and context of generation, and then by proving that the records have been maintained by an unbroken chain of custodianship in which they have not been corrupted.

Relying on this tactic to ensure the authenticity of records involves two conditions: first, that an unbroken chain of custodianship has been maintained; and second, that no inappropriate modifications have been made to the records during that custodianship. The first of these conditions is only a way of supplying indirect evidence for the second, which is the one that really matters. An unbroken chain of custodianship does not in itself prove that records have not been corrupted, whereas if we could prove that records had not been corrupted, there would be no logical need to establish that custodianship had been maintained. However, since it is difficult to obtain direct proof that records have not been corrupted, evidence of an unbroken chain of custodianship serves, at least for traditional records, as a surrogate for such proof.
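The logical relationship between the two conditions can be made concrete with a small sketch. Everything named here is an assumption for illustration: the `CustodyPeriod` record, the helper functions, and in particular the use of a SHA-256 digest as a stand-in for "direct proof" of non-corruption, which presupposes a trusted digest recorded when the record was created. The paper itself does not propose this mechanism.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class CustodyPeriod:
    custodian: str
    start: date
    end: date


def chain_is_unbroken(periods: list[CustodyPeriod]) -> bool:
    """Condition 1: each custodian's period begins exactly when the previous one ends."""
    ordered = sorted(periods, key=lambda p: p.start)
    return all(prev.end == nxt.start for prev, nxt in zip(ordered, ordered[1:]))


def content_is_unchanged(record: bytes, trusted_digest: str) -> bool:
    """Condition 2, tested directly: the record still matches a trusted digest."""
    return hashlib.sha256(record).hexdigest() == trusted_digest


# Hypothetical custody history and record.
periods = [
    CustodyPeriod("Originating agency", date(1990, 1, 1), date(2000, 1, 1)),
    CustodyPeriod("National archive", date(2000, 1, 1), date(2020, 1, 1)),
]
record = b"minutes of the board meeting"
trusted_digest = hashlib.sha256(record).hexdigest()
tampered = record + b" (amended)"

print(chain_is_unbroken(periods), content_is_unchanged(tampered, trusted_digest))
```

In this example the chain is unbroken yet the altered record fails the direct test, mirroring the point that custodianship is only indirect evidence for the condition that really matters.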

This tactic may completely ignore the intrinsic properties of the entity: it relies on the authenticity of the documentation of the process by which the entity has been preserved as a surrogate for the intrinsic authenticity of the entity. This has a somewhat recursive aspect, since the authenticity of this documentation must in turn be established; however, in many cases, this is easier than establishing the authenticity of the entity itself.

Alternatively, an intrinsic properties strategy can be based solely on the intrinsic properties tactic discussed above. This involves identifying certain properties of an informational entity that define authenticity, regardless of whether they imply the originality of the entity. For example, one might define an authentic impressionistic painting as one that conforms to the style and methods of Impressionism, regardless of when it was painted or by whom. A less controversial example might be a jade artifact that is considered “authentic” merely by virtue of being truly composed of jade.9 Whether this strategy is viable for a given discipline depends on whether the demands that the discipline places on informational entities can be met by ensuring that certain properties of those entities meet specified criteria, regardless of their origin.

Although there are undoubtedly other strategies, the final one I will consider here is to define authenticity in terms of whether an informational entity is suitable for some purpose. This suitability strategy would use various tactics to specify and test whether an informational entity fulfills a given range of purposes or uses. This may be logically independent of whether the entity is original. Similarly, although the suitability of an entity for some purpose is presumably related to whether certain of its properties meet prescribed criteria, under this strategy both the specific properties involved and the criteria for their presence are derived entirely from the purpose that the entity is to serve. Since a given purpose may be satisfiable by means of a number of different properties of an entity, the functional orientation of this strategy makes it both less demanding and more meaningful than the alternatives.10 The range of uses that an entity must satisfy to be considered authentic under this strategy may be anticipated in advance or allowed to evolve over time.11

Authenticity as Suitability for a Purpose

In the context of preservation, authenticity is inherently related to time. A piece of jade may be authentic, irrespective of its origin or provenance; however, a specific preserved jade artifact has additional requirements for being authentic in the historical sense.12 The alternative strategies and tactics presented above for defining authenticity suggest the range of meanings that may be attributed to the concept, but all of these imply the retention of some essential properties or functional capabilities over time.13

Authenticity seems inextricably bound to the notion of suitability for a purpose. A possible exception is the case where originality per se serves as the criterion for authenticity. Such is the case, for example, for venerated artifacts such as the Declaration of Independence. Even if such an entity ultimately becomes unsuitable for its normal purpose (for example, if it becomes unreadable), it continues to serve some purpose (in this example, veneration). In all cases, therefore, authenticity implies some future purpose or use, such as the ability to obtain factual information, prove legal accountability, derive aesthetic appreciation, or support veneration.

While recognizing that it is likely to be a contentious position, I will assume in the remainder of this paper that the authenticity of preserved informational entities in any domain is ultimately bound to their suitability for specific purposes that are of interest within that domain.

At any point in time, it is generally considered preferable to be able to articulate a relatively stable, a priori set of principles for any discipline. For this reason, a posteriori criteria for authenticity may generate a degree of intellectual anxiety among theoreticians. Some archivists, for example, argue that archival theory specifies a precise, fixed set of suitability requirements for authentically preserved records, namely that future users should be able to understand the roles that the records played in the business processes of the organizations that generated and used them, and that users should be able to continue to use the records in any future business processes that may require them (e.g., for determining past accountability). Similarly, some libraries of deposit may require, to the extent possible, that future users be able to see and use authentically preserved publications exactly as their original audiences did. On the other hand, a data warehouse might require that authentic preservation allow future users to explore implicit relationships in data that the original users were unable to see or define.

In different ways, all these examples attempt to allow for unanticipated future uses of preserved informational entities. They also reveal a tension between the desire to articulate fixed, a priori criteria for authenticity and the need to define criteria that are general enough to satisfy unanticipated future needs. This suggests that we distinguish between a priori suitability criteria, which specify in advance the full range of uses that authentically preserved informational entities must support, and a posteriori suitability criteria, which require such entities to support unanticipated future uses. The a priori approach will work only in a discipline that carefully articulates its preservation mandate and successfully (for all time) proscribes any attempt to expand that mandate retroactively.14 In contrast, an evolutionary, a posteriori approach to defining suitability criteria should be adopted by disciplines that are less confident of their ability to ward off all future attempts to expand their suitability requirements or those whose preservation mandates are intentionally dynamic and designed to adapt to future user needs and demands as they arise.

Authenticity Principles and Criteria

Because it is so difficult to define authenticity abstractly, it is useful to try to develop authenticity principles for various domains or disciplines that will make it possible to define authenticity in functional terms. An authenticity principle encapsulates the overall intent of authentic preservation from a given legal, ethical, historical, artistic, or other perspective (for example, to assess accountability or to recreate the original function, impact, or effect of preserved entities). Ideally, an authenticity principle should be a succinct, functional statement of what constitutes authentic preservation from a specific, stated perspective. Requiring that these principles be stated functionally allows them to be used in verifying whether a given preservation approach satisfies a given principle. For example, one possible archival authenticity principle was proposed above, namely, to enable future users to understand the roles that preserved records played in the business processes of the organizations that generated and used them, and to continue to use those records in future business processes that may require them. Alternative authenticity principles might be proposed for archives as well as for other disciplines. It would be desirable to devise a relatively small number of alternative authenticity principles that collectively capture the perspectives of most disciplines concerned with the preservation of informational entities.

Next, from each authenticity principle, it is useful to derive a set of authenticity criteria to serve both as generators for specific preservation requirements and as conceptual and practical tests of the success of specific preservation techniques. For example, to implement the authenticity principle described previously, authenticity criteria would be derived that specify which aspects of records and their context must be preserved to satisfy that principle. These criteria would then provide a basis for developing preservation requirements, such as the need to retain metadata describing provenance, as well as tests of whether and how well alternative preservation techniques satisfy those requirements.
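One way to picture criteria serving as tests is to express each criterion as a named predicate over a description of what a preservation technique actually retained. The sketch below is illustrative only: the criterion names other than provenance metadata, the dictionary keys, and the `failed_criteria` helper are all hypothetical and are not drawn from archival theory.

```python
from typing import Callable

# A description of what survived preservation (hypothetical structure).
PreservedRecord = dict[str, object]
Criterion = Callable[[PreservedRecord], bool]

# Criteria that might be derived from the archival principle discussed above.
archival_criteria: dict[str, Criterion] = {
    "provenance metadata retained": lambda r: bool(r.get("provenance")),
    "business context retained": lambda r: bool(r.get("business_context")),
    "content can still be rendered": lambda r: r.get("renderable") is True,
}


def failed_criteria(record: PreservedRecord, criteria: dict[str, Criterion]) -> list[str]:
    """Return the names of the criteria that the preserved record does not meet."""
    return [name for name, test in criteria.items() if not test(record)]


# Outcome of a hypothetical technique that kept the bits and provenance only.
outcome = {"provenance": "Agency X, series 12", "renderable": True}
print(failed_criteria(outcome, archival_criteria))
```

Stating criteria in this testable form is what lets the same list generate requirements (every key must be retained) and evaluate competing techniques (which keys each one loses).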

The a priori/a posteriori dichotomy mentioned previously arises again in connection with authenticity principles. From a theoretical perspective, it is more attractive to derive such principles a priori, without the need to consider any future, unanticipated uses to which informational entities may be put. If authenticity principles are derived a posteriori, then they may evolve in unexpected ways as unanticipated uses arise. This situation is unappealing to many disciplines. In either case, if authenticity is logically determined by suitability for some purpose, then an authenticity principle for a given domain will generally be derived, explicitly or implicitly, from the expected range of uses of informational entities within that domain. It may, therefore, be helpful to discuss ways of characterizing such expected ranges of use before returning to the subject of authenticity principles and criteria.

Describing Expected Ranges of Use of Preserved Informational Entities

If expected use is to serve as a basis from which to derive authenticity criteria for a given discipline or organization, then it is important to describe the range of expected uses of informational entities that is relevant to that discipline or organization. This description should consist of a set of premises, constraints, and expectations for how particular kinds of informational entities are likely to be used. It should include the ways in which entities may be initially generated or captured (in digital form, for digital informational entities). It should include the ways in which they may be annotated, amended, revised, organized, and structured into collections or series; published or disseminated; managed; and administered. It should describe how the informational entities will be accessed and used, whether by the organization that generates them or by organizations or individuals who wish to use them in the future for informational, historical, legal, cultural, aesthetic, or other purposes. The description should also include any legal mandates or other exogenous requirements for preservation, access, or management throughout the life of the entities, and it should ideally include estimates of the expected relative and absolute frequencies of each type of access, manipulation, and use.15 Additional aspects of a given range of expected uses may be added as appropriate.

Any attempt to enunciate comprehensive descriptions of ranges of expected uses of this kind for digital informational entities, especially in the near future before much experience with such entities has been accumulated, will necessarily be speculative. In all likelihood, it will be over-constrained in some aspects and under-constrained in others. Yet, it is important to try, however tentative the results, if suitability is to serve as a basis for deriving authenticity criteria.

Deriving Authenticity Principles from Expected Ranges of Use

The purpose of describing an expected range of use for informational entities is to provide a basis from which to derive a specific authenticity principle. Any authenticity principle is an ideal and may not be fully achievable under a particular set of technological and pragmatic constraints. Nevertheless, stating an authenticity principle defines a set of criteria to which any preservation approach must aspire.

Different ranges of expected use may result in different authenticity principles. One extreme is that a given range of expected uses might imply the need for a digital informational entity to retain as much as possible of the function, form, appearance, look, and feel that the entity presented to its author. Such a need might exist, for example, if future researchers wish to evaluate the range of alternatives that were available to the author and, thereby, the degree to which the resulting form of the entity may have been determined by constraint versus choice or chance.

A different range of expected uses might imply the need for a digital informational entity to retain the function, form, appearance, look, and feel that it presented to its original intended audience or readership. This would enable future researchers to reconstruct the range of insights or inferences that the original users would have been able to draw from the entity. Whereas retaining all the capabilities that authors would have had in creating a digital informational entity requires preserving the ability to modify and reformat that entity using whatever tools were available at the time, retaining the capabilities of readers merely requires preserving the ability to display, or render, the entity as it would have been seen originally.

Finally, a given range of expected uses may delineate precise and constrained capabilities that future users are to be given in accessing a given set of digital informational entities, regardless of the capabilities that the original authors or readers of those entities may have had. Such delineated capabilities might range from simple extraction of content to more elaborate viewing, rendering, or analysis, without considering the capabilities of original authors or readers. As in the data warehouse example cited previously, it might be important to enable future users to draw new inferences from old data, using tools that may not have been available to the data’s original users.

As these examples suggest, it is possible to identify alternative authenticity principles that levy different demands against preservation. For example, the following sequence of decreasingly stringent principles is stated in terms of the relationship between a preserved digital informational entity and its original instantiation:

  • same for all intents and purposes
  • same functionality and relationships to other informational entities
  • same “look and feel”
  • same content (for any definition of the term)
  • same description 16
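The ordering in the list above can be sketched as an ordered enumeration. This assumes, as a simplification not asserted by the paper, that the principles are strictly nested, so that meeting a more stringent principle also meets every less stringent one. The names `AuthenticityPrinciple` and `satisfies` are hypothetical.

```python
from enum import IntEnum


class AuthenticityPrinciple(IntEnum):
    # Higher value = more stringent, following the list in reverse order.
    SAME_DESCRIPTION = 1
    SAME_CONTENT = 2
    SAME_LOOK_AND_FEEL = 3
    SAME_FUNCTIONALITY_AND_RELATIONSHIPS = 4
    SAME_FOR_ALL_INTENTS_AND_PURPOSES = 5


def satisfies(achieved: AuthenticityPrinciple, required: AuthenticityPrinciple) -> bool:
    """True if what a preservation approach achieves is at least as stringent as required."""
    return achieved >= required


# A hypothetical approach that preserves content but not appearance.
achieved = AuthenticityPrinciple.SAME_CONTENT
print(satisfies(achieved, AuthenticityPrinciple.SAME_DESCRIPTION))
print(satisfies(achieved, AuthenticityPrinciple.SAME_LOOK_AND_FEEL))
```

Whether the levels really nest this cleanly depends on how terms such as "content" are defined in a given discipline, so the linear ordering should be read as a convenience rather than a claim.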

An authenticity principle must also specify requirements for the preservation of certain meta-attributes, such as authentication and privacy or security. For example, although a signature (whether digital or otherwise) in a record may normally be of no further interest once the record has been accepted into a recordkeeping system (whose custodianship thereafter substitutes its own authentication for that of the original), the original signature in a digital informational entity may on occasion be of historical, cultural, or technical interest, making it worth preserving as part of the “content” of the entity, as opposed to an active aspect of its authentication. Similarly, although the privacy and security capabilities of whatever system is used to preserve an informational entity may be sufficient to ensure the privacy and security of the entity, there may be cases in which the original privacy or security scheme of a digital informational entity may be of interest in its own right. An authenticity principle should determine a complete, albeit abstract, specification of all such aspects of a digital informational entity that must be preserved.

Since an authenticity principle encapsulates the preservation implications of a range of expected uses, it should always be derived from a specific range of this sort. Simply inventing an authenticity principle, rather than deriving it in this way, is methodologically unsound. The range of expected uses grounds the authenticity principle in reality and allows its derivation to be validated or questioned. Nevertheless, as discussed previously, since the range of expected uses for digital informational entities is speculative, the formal derivation of an authenticity principle may remain problematic for some time.

Different types of digital informational entities that fall under a given authenticity principle (within a given domain of use) may have different specific authenticity criteria. For example, authenticity criteria for databases or compound multimedia entities may differ from those for simple textual entities. Furthermore, digital informational entities may embody various behavioral attributes that may, in some cases, be important to retain. In particular, these entities may exhibit dynamic or interactive behavior that is an essential aspect of their content, they may include active (possibly dynamic) linkages to other entities, and they may possess a distinctive look and feel that affects their interpretation. To preserve such digital entities, specific authenticity criteria must be developed to ensure that the entities retain their original behavior, as well as their appearance, content, structure, and context.

Originality Revisited

As discussed earlier, the authenticity of traditional informational entities is often implicitly identified with ensuring that original entities have been retained. Both the notion of custodianship and the other component concepts of the archival principle of provenance (such as le respect des fonds and le respect de l’ordre intérieur) focus on the sanctity of the original (Horsman 1994). Although it may not be realistic to retain every aspect of an original entity, the intent is to retain all of its meaningful and relevant aspects.

Beyond the appropriate respect for the original, there is often a deeper fascination, sometimes called a fetish, for the original when it is valued as a historical or quasi-religious artifact. While fetishism may be understandable, its legitimacy as a motivator for preservation seems questionable. Moreover, fetishism notwithstanding, the main motivation for preserving original informational entities is the presumption that an original entity retains the maximum possible degree of authenticity. Though this may at first glance appear to be tautological, the tautology applies only to traditional, physical informational entities.

Retaining an original physical artifact without modifying it in any way would seem almost by definition to imply its authenticity. However, it is generally impossible to guarantee that a physical artifact can be retained without changing in any way (for example, by aging). Therefore, a more realistic statement would be that retaining an original without modifying it in any way that is meaningful and relevant (from some appropriate perspective) implies its authenticity. The archival emphasis on custodianship and provenance is at least partly a tactic for ensuring the retention of original records to maximize the likelihood of retaining their meaningful and relevant aspects, thereby ensuring their authenticity. Tautologically, an unmodified original is as authentic as a traditional, physical informational entity can be.

If we consider informational entities as abstractions rather than as physical artifacts, however, this tautology disappears. Although the informational aspects of such an entity may be represented in some particular physical form, they are logically independent of that representation, just as the Pythagorean Formula is independent of any particular physical embodiment or expression of that formula. An informational entity can be thought of as having a number of attributes, some of which are relevant and meaningful from a given perspective and some of which are not. For example, it might be relevant from one perspective that a given document was written on parchment but irrelevant that it was signed in red ink; from a different perspective, it might be relevant that it was signed in red yet irrelevant that it was written on parchment. The specific set of attributes of a given informational entity that is relevant and meaningful from one perspective may be difficult to define precisely. The full range of all such attributes that might be relevant from all possible perspectives may be open-ended. In all cases, however, some set of relevant logical attributes must exist, whether or not we can list them.

This implies that retaining the original physical artifact that represents an informational entity is at most sufficient (in the case of a traditional informational entity) but is never logically necessary to ensure its authenticity. If the relevant and meaningful attributes of the entity were retained independently of its original physical embodiment, they would by definition serve the same purpose as the original. Furthermore, since it is impossible to retain all attributes of a physical artifact in the real world because of aging, retaining the original physical artifact for an informational entity may not be sufficient, since it may lose attributes that are relevant and meaningful for a given purpose. (For example, the color of a signature may fade beyond recognition.) Retaining an original physical artifact is therefore neither necessary nor sufficient to ensure the authenticity of an informational entity.
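The argument above can be made concrete with a small sketch. It assumes, purely for illustration, that an informational entity is modeled as a dictionary of named attributes and that a perspective is the set of attribute names it treats as relevant and meaningful. The function name retains_relevant and the attribute names and values are hypothetical, not part of any established preservation vocabulary.

```python
Attributes = dict[str, str]


def retains_relevant(original: Attributes, candidate: Attributes, relevant: set[str]) -> bool:
    """Return True if `candidate` matches `original` on every attribute in `relevant`.

    Assumes every name in `relevant` is an attribute of `original`.
    """
    return all(candidate.get(name) == original[name] for name in relevant)


# The entity as it was created (illustrative values only).
as_created = {"text": "I hereby grant...", "medium": "parchment", "signature_color": "red"}

# The same physical artifact after aging: the signature has faded.
aged_artifact = dict(as_created, signature_color="faded")

# A reproduction on a different medium that keeps the signature color.
reproduction = dict(as_created, medium="paper")

# Two perspectives that care about different attributes.
perspectives = {
    "medium matters": {"text", "medium"},
    "signature matters": {"text", "signature_color"},
}

if __name__ == "__main__":
    for label, relevant in perspectives.items():
        print(label,
              "| aged artifact:", retains_relevant(as_created, aged_artifact, relevant),
              "| reproduction:", retains_relevant(as_created, reproduction, relevant))
```

Under the "signature matters" perspective, the aged physical artifact fails while the reproduction succeeds, so in this toy model retaining the physical original is neither sufficient nor necessary. Under the "medium matters" perspective the results are reversed, which shows why the outcome depends on the perspective chosen.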

Digital Informational Entities and the Concept of an Original

The preceding argument applies a fortiori to digital informational entities. It is well accepted that the physical storage media that hold digital entities have regrettably short lifetimes, especially when obsolescence is taken into account. Preserving these physical storage media as a way of retaining the informational entities they hold is not a viable option. Rather, it is almost universally acknowledged that meaningful retention of such entities requires copying them onto new media as old media become physically unreadable or otherwise inaccessible.

Fortunately, the nature of digital information makes this process far less problematic than it would be for traditional informational entities. For one thing, digital information is completely characterized by simple sequences of symbols (zero and one bits in the common, binary case). All of the information in a digital informational entity lies in its bit stream (if, as argued earlier, this is taken to include all necessary context, interpreter software, etc.). Although this bit stream may be stored on many different kinds of recording media, the digital entity itself is independent of the medium on which it is stored. One of the most fundamental aspects of digital entities is that they can be stored in program memory; on a removable disk, hard disk, or CD-ROM; or on any future storage medium that preserves bit streams, without affecting the entities themselves. [17]

One unique aspect of digital information is that it can be copied perfectly and that the perfection of a copy can be verified without human effort or intervention. This means that, at least in principle, copying digital informational entities to new media can be relied upon to result in no loss of information. (In practice, perfection cannot be guaranteed, but increasingly strong assurances of perfection can be attained at relatively affordable cost.)
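The automatic verification described above can be sketched in a few lines. This is a minimal illustration, assuming that a SHA-256 digest (from Python's standard hashlib module) is used to compare the bit streams; the helper names digest_of and copy_and_verify and the file names are hypothetical and are not drawn from any particular preservation system.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def digest_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of the bit stream stored at `path`."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so that large files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def copy_and_verify(source: Path, target: Path) -> bool:
    """Copy `source` to `target`, then check that both hold the same bit stream."""
    shutil.copyfile(source, target)
    return digest_of(source) == digest_of(target)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-ins for an aging medium and its replacement.
        old_medium = Path(tmp) / "old_medium.bin"
        new_medium = Path(tmp) / "new_medium.bin"
        old_medium.write_bytes(b"\x00\x01" * 1000)
        print(copy_and_verify(old_medium, new_medium))
```

A matching digest gives very strong, though not absolute, assurance that the copy is perfect, which is consistent with the caveat that perfection cannot be guaranteed in practice. Where a stronger check is wanted, the two files can also be compared byte for byte.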

The combination of these two facts (that digital informational entities consist entirely of bit streams, and that bit streams can be copied perfectly onto new media) makes such entities logically independent of the physical media on which they happen to be stored. This is fortunate since, as pointed out above, it is not feasible to save the original physical storage artifact (e.g., a disk or tape) that contains a digital informational entity.

The deeper implication of the logical independence of digital informational entities from the media on which they are stored is that it is meaningless to speak of an original digital entity as if it were a unique, identifiable thing. A digital document may be drafted in program memory and saved simultaneously on a variety of storage media during its creation. The finished document may be represented by multiple, identical, equivalent copies, no one of which is any more “original” than any other. Furthermore, copying a digital entity may produce multiple instances of the entity that are logically indistinguishable from each other. [18]

Defining Digital-Original Informational Entities

It is meaningless to rely on physical properties of storage media as a basis for distinguishing original digital informational entities. It is likewise meaningless to speak of an original digital entity as a single, unique thing. Nevertheless, the concept of an “original” is so pervasive in our culture and jurisprudence that it seems worth trying to salvage some vestige of its traditional meaning. It appears that the true significance (in the preservation context) of an original traditional informational entity is that it has the maximum possible likelihood of retaining all meaningful and relevant aspects of the entity, thereby ensuring its authenticity. By analogy, we therefore define a digital-original as any representation of a digital informational entity that has the maximum possible likelihood of retaining all meaningful and relevant aspects of the entity.

This definition does not imply a single, unique digital-original for a given digital informational entity. All equivalent digital representations that share the defining property of having the maximum likelihood of retaining all meaningful and relevant aspects of the entity can equally be considered digital-originals of that entity. This lack of uniqueness implies that a digital-original of a given entity (not just a copy) may occur in multiple collections and contexts. This appears to be an inescapable aspect of digital informational entities and is analogous to the traditional case of a book that is an instance of a given edition: it is an original but not the original, since no single, unique original exists. [19]

It is tempting to try to eliminate the uncertainty implied by the phrase maximum possible likelihood, but it is not easy to do so. This uncertainty has two distinct dimensions. First, it is difficult enough to specify precisely which aspects of a particular informational entity are meaningful and relevant for a given purpose, let alone which aspects of any such entity might be meaningful and relevant for any possible purpose. Since we cannot in general enumerate the set of such meaningful, relevant aspects of an informational entity, we cannot guarantee, or even evaluate, their retention. Second, physical and logical constraints may make it impossible to guarantee that any digital-original will be able to retain all such aspects, any more than we can guarantee that a physical original will retain all relevant aspects of a traditional informational entity as it ages and wears. The uncertainty in our definition of digital-original therefore seems irreducible; however, its impact is no more damaging than the corresponding uncertainty for physical originals of traditional informational entities.

Although the definition used here does not imply any particular technical approach, the concept appears to have at least one possible implementation, based on emulation (Michelson and Rothenberg 1992; Rothenberg 1995, 1999; Erlandsson 1996; Swade 1998). In any case, any implementation of this approach must ensure that the interpreters of digital informational entities, themselves saved as bit streams, can be made perpetually executable. If this can be achieved, it should enable us to preserve digital-original informational entities that maintain their authenticity across all disciplines, by retaining as many of their attributes as possible.

Conclusion

If a single, uniform technological approach can be devised that authentically preserves all digital informational entities for the purposes of all disciplines, the resulting economies of scale will yield tremendous benefits. To pave the way for this possibility, I have proposed a foundation for a universal, transdisciplinary concept of authenticity based on the notion of suitability. This foundation allows the specific uses that an entity must fulfill to be considered authentic to vary across disciplines; however, it also provides a common vocabulary for expressing authenticity principles and criteria, as well as a common basis for evaluating the success of any preservation approach.

I have also tried to show that many alternative strategies for determining authenticity ultimately rely on the preservation of relevant, meaningful aspects or attributes of informational entities. By creating digital-original informational entities that have the maximum possible likelihood of retaining all such attributes, we should be able to develop a single preservation strategy that will work across the full spectrum of disciplines, regardless of their individual definitions of authenticity.


FOOTNOTES

1. One could argue that if the key terms of any discipline are not susceptible to multiple interpretations and endless analysis, then that discipline has little depth.

2. Many programs are compiled or translated into some simpler formal language first, but the result must still ultimately be interpreted. The distinction between compilation and interpretation will therefore be ignored here.

3. Although this analogy is suggestive, it is simplistic, since the interpreter of a digital informational entity is itself usually an executable application program, not simply another document.

4. Any number of component bit streams can be represented as a single, composite bit stream.

5. The word rendering is used here as a generalization of its use in computer graphics, namely, the process of turning a data stream into something a human can see, hear, or otherwise experience.

6. Metadata and interpreter bit streams can be shared among many digital informational entities. Although they must be logical components of each such entity, they need not be redundantly represented.

7. In particular, saving the bit stream corresponding to the core content of such an entity is insufficient without saving some way of interpreting that bit stream, for example, by saving appropriate software (another bit stream) in a way that enables running that software in the future, despite the fact that it, and the hardware on which it was designed to run, may be obsolete.

8. New criteria based on newly recognized properties of informational entities may be added over time, as is the case when evaluating radiological properties of artifacts whose origins predate the discovery of radioactivity.

9. Here authenticity refers to a specific attribute of an entity (i.e., its chemical composition) rather than to the entity as a whole that is of concern for preservation purposes.

10. Although it is tempting to consider the suitability of an informational entity to be constrained only by technical factors, legal, social and economic factors often override technical considerations. For example, the suitability of an informational entity for a given purpose may be facilitated or impeded by factors such as the way it is controlled and made available to potential users. Therefore, if it is to serve as a criterion for authenticity, suitability must be understood to mean the potential suitability of an entity for some purpose, i.e., that which can be realized in the absence of arbitrary external constraints.

11. Because the strategy potentially leads to dynamic, evolving definitions of authenticity, it has a decidedly a posteriori flavor, which may be inescapable.

12. In the remainder of this paper, authenticity will be used exclusively in the context of preservation.

13. Whereas the originality strategy entails no explicit property or capability conditions (though some tactics for evaluating originality may rely on such conditions), it nevertheless implicitly assumes that simply by virtue of being original, an entity will retain as many of its properties and capabilities as possible.

14. Non-retroactive expansion can be accommodated by revising the corresponding suitability criteria for all informational entities to be preserved henceforth.

15. Future patterns of access for digital records may be quite different from historical or current patterns of access for traditional records, making it difficult to obtain meaningful information of this kind in the near future. Nevertheless, any preservation strategy is likely to depend at least to some extent on assumptions about such access patterns. The library community has performed considerable user research on the design of online public catalogs that may be helpful in this endeavor. For example, see M. Ongering, Evaluation of the Dutch national OPAC: the userfriendliness of PC3, Leiden 1992; Common approaches to a user interface for CD-ROM: Survey of user reactions to three national bibliographies on CD-ROM, British Library and The Royal Library, Denmark, Copenhagen, April 1992; V. Laursen and A. Salomonsen, National Bibliographies on CD-ROM: Definition of User-dialogues Documentation of Criteria Used, The Royal Library, Denmark, Copenhagen, March 1991.

16. This requires preserving only a description of the entity (i.e., metadata). The entity itself can in effect be discarded if this principle is chosen.

17. In some cases, the original storage medium used to hold a digital informational entity may have some significance, just as it may be significant that a traditional document was written on parchment rather than paper. For example, the fact that a digital entity was published on CD-ROM might imply that it was intended to be widely distributed (although the increasing use of CD-ROM as a back-up medium serves as an example of the need for caution in drawing such conclusions from purely technical aspects of digital entities, such as how they are stored). However, even in the rare cases where such physical attributes of digital informational entities are in fact meaningful, that meaning can be captured by metadata. Operational implications of storage media—for example, whether an informational entity would have been randomly and quickly accessible, unalterable, or constrained in various ways, such as by the size or physical format of storage volumes—are similarly best captured by metadata to eliminate dependence on the arcane properties of these quickly obsolescent media. To the extent that operational attributes such as speed of access may have constrained the original functional behavior of a digital informational entity that was stored on a particular medium, these attributes may be relevant to preservation.

18. Even time stamps that purportedly indicate which copy was written first may be an arbitrary result of file synchronization processes, network or device delays, or similar phenomena that have no semantic significance.

19. Moreover, since there is no digital equivalent to a traditional manuscript, there can be no unique prepublication version of a digital informational entity.


REFERENCES

Erlandsson, A. 1996. Electronic Records Management: A Literature Review. International Council on Archives’ (ICA) Study. Available from http://www.archives.ca/ica.

Horsman, P. 1994. Taming the Elephant: An Orthodox Approach to the Principle of Provenance. In The Principle of Provenance, edited by Kerstin Abukhanfusa and Jan Sydbeck. Stockholm: Swedish National Archives.

Michelson, A., and J. Rothenberg. 1992. Scholarly Communication and Information Technology: Exploring the Impact of Changes in the Research Process on Archives. American Archivist 55(2):236-315.

Rothenberg, J. 1995. Ensuring the Longevity of Digital Documents. Scientific American 272(1):42-7 (international edition, pp. 24-9).

_____. 1999. Avoiding Technological Quicksand: Finding a Viable Technical Foundation for Digital Preservation: A Report to the Council on Library and Information Resources. Washington, D.C.: Council on Library and Information Resources. Available from https://www.clir.org/pubs/reports/rothenberg/pub77.pdf.

Rothenberg, J., and T. Bikson. 1999. Carrying Authentic, Understandable and Usable Digital Records Through Time. RAND-Europe. Available from http://www.archief.nl/digiduur/final-report.4.pdf.

Swade, D. 1998. Preserving Software in an Object-Centred Culture. In History and Electronic Artefacts, edited by Edward Higgs. Oxford: Clarendon Press.