Authenticity and Integrity
in the Digital Environment:
An Exploratory Analysis of the
Central Role of Trust
by Clifford Lynch
This paper has been modestly revised based on discussion
at the workshop and a reading of the other papers presented there.
All of the papers, but particularly those of David Levy and Peter
Hirtle, raise important issues that are relevant to the topic of
this article. From Hirtle's paper, I had the opportunity to learn
something of the science of diplomatics, and at the workshop, I had
the opportunity to learn much more from Luciana Duranti. Her book, Diplomatics:
New Uses for an Old Science (1998), offers valuable and fresh
insights on the topics discussed here. These other works provide
important additional viewpoints that are not fully integrated into
this paper and I urge the reader to explore them. My thanks also
to the participants in the Buckland/Lynch Friday Seminar at the School
of Information Management and Systems at the University of California,
Berkeley, for their comments on an earlier version of this paper.
Introduction
This paper seeks to illuminate several issues surrounding the ideas
of authenticity, integrity, and provenance in the networked information
environment. Its perspective is pragmatic and computational, rather
than philosophical. Authenticity and integrity are in fact deep and
controversial philosophical ideas that are linked in complex ways
to our conceptual views of documents and artifacts and their legal,
social, cultural, and historical contexts and roles. (See Bearman
and Trant [1998] for an excellent introduction to these issues.)
In the digital environment, as Larry Lessig (1999) has recently
emphasized, computer code is operationalizing and codifying ideas
and principles that, historically, have been fuzzy or subjective,
or that have been based on situational legal or social constructs.
Authenticity and integrity are two of the key arenas where computational
technology connects with philosophy and social constructs. One goal
of this paper is to help distinguish between what can be done in
code and what must be left for human and social judgment in areas
related to authenticity and integrity.
Gustavus Simmons wrote a paper in the 1980s with the memorable title "Secure
Communications in the Presence of Pervasive Deceit." The contents
of the paper are not relevant here, but the phrase "pervasive
deceit" has stuck in my mind because I believe it perfectly
captures the concerns and fears that many people are voicing about
information on the Internet. There seems to be a sense that digital
information needs to be held to a higher standard for authenticity
and integrity than has printed information. In other words, many
people feel that in an environment characterized by pervasive deceit,
it will be necessary to provide verifiable proof for claims related
to authorship and integrity that would usually be taken at face value
in the physical world. For example, although forgeries are always
a concern in the art world, one seldom hears concerns about (apparently)
mass-produced physical goods (books, journal issues, audio CDs) being
undetected and undetectable fakes.1
This distrust of the immaterial world of digital information has
forced us to closely and rigorously examine definitions of authenticity
and integrity, definitions that we have historically been rather
glib about, using the requirements for verifiable proofs as a
benchmark. As this paper will demonstrate, authenticity and integrity,
when held to this standard, are elusive properties. It is much easier
to devise abstract definitions than testable ones. When we try to
define integrity and authenticity with precision and rigor, the definitions
recurse into a wilderness of mirrors, of questions about trust and
identity in the networked information world.
While there is widespread distrust of the digital environment, there
also seems to be considerable faith and optimism about the potential
for information technology to address concerns about authenticity
and integrity. Those unfamiliar with the details of cryptographic
technology assume the magical arsenal of this technology has solved
the problems of certifying authorship and integrity. Moreover, there
seems to be an assumption that the solutions are not deployed yet
because of some perverse reluctance to implement the necessary tools
and infrastructure.2 This
paper will take a critical view of these cryptographic technologies.
It will try to distinguish between the problems that cryptographic
technologies can and cannot solve and how they relate to the development
of infrastructure services. There seems to have been surprisingly
little examination of these questions.
Before attempting to define integrity or authenticity, it is worth
trying to gain an intuitive sense of how the digital environment
differs from the physical world of information-bearing artifacts
("meatspace," as some now call it). The archetypal situation
is this: We have an object and a collection of assertions about it.
The assertions may be internal, as in a claim of authorship or date
and place of publication on the title page of a book, or external,
represented in metadata that accompany the object, perhaps provided
by third parties. We want to ask questions about the integrity of
the object: Has the object been changed since its creation, and,
if so, has this altered the fundamental essence of the object? (This
can include asking these questions about accompanying assertions,
either embedded in the object or embodied in accompanying metadata.)
Further, we want to ask questions about the authenticity of the object:
If its integrity is intact, are the assertions that cluster around
the object (including those embedded within it, if any) true or false?
How do we begin to answer these questions in meatspace? There are
only a few fundamental approaches.
- We examine the provenance of the object (for example, the documentation
of the chain of custody) and the extent to which we trust and believe
this documentation as well as the extent to which we trust the
custodians themselves.
- We perform a forensic and diplomatic examination of the object
(both its content and its artifactual form) to ensure that its
characteristics and content are consistent with the claims made
about it and the record of its provenance.
- We rely on signatures and seals that are attached to the object
or the claims that come with it, or both, and evaluate their forensics
and diplomatics and their consistency with claims and provenance.
- For mass-produced and distributed (i.e., published) objects,
we compare the object in hand with other versions (copies) of the
object that may be available (which, in turn, means also assessing
the integrity and provenance of these other versions or copies).
In the digital environment, there are few forensics or diplomatics,3 other
than the forensics and diplomatics of content itself. We cannot evaluate
inks, papers, binding technology, and similar physical characteristics.4 We
can note, just as with a physical work, that an essay allegedly written
in 1997 that makes detailed references to events and publications
from 1999 is either remarkably prescient or incorrectly dated. There
are limited forensics of availability, and they mainly provide negative
information. For example, if a document claims to have been written
in 1998 and we have copies of it that were deposited on various servers
in 1997 (and we trust the claims of the servers that the material
was in fact deposited in 1997), we can build a case that it was first
distributed no later than 1997, regardless of the date contained
in the object. Nevertheless, this does not tell us when the document
was written.
The fundamental concept of publication in the digital environment (the
dissemination of a large number of copies to arbitrary interested
parties that are subsequently autonomously managed and maintained) has
come under great stress from numerous factors in the networked information
environment. These factors include, for example, the move from sale
to licensing, limited distribution, making copies public for viewing
without giving viewers permission to maintain the copies, and technical
protection systems (National Research Council 2000). While the basic
principle of broad distribution and subsequent autonomous management
of copies remains valid and useful as a base of evidence against
which to test the authenticity of documents in question, the availability
of relevant and trustworthy copies may be limited in the digital
environment, and assessing the copies is likely to be more difficult.
Moreover, the forensics and diplomatics of evaluating seals and signatures,
and documentation of provenance, become much more formal and computational.
It is difficult to say whether seals and signatures are more
or less compelling in the digital world than in the analog world,
but their characters unquestionably change. Finally, provenance and
chains of custody in the digital world begin to reflect our evaluation
of archives and custodians as implementers and operators of "trusted
systems" that enforce the integrity and provenance records of
objects entrusted to them.
At some level, authenticity and integrity are mechanical characteristics
of digital objects; they do not speak to deeper questions of whether
the contents of a digital document are accurate or truthful when
judged objectively. An authentic document may faithfully transmit
complete falsehoods. There is a hierarchy of assessment in operation:
forensics, diplomatics, intellectual analyses of consistency and
plausibility, and evaluations of truthfulness and accuracy. Our concern
here is with the lower levels of this hierarchy (i.e., forensics
and diplomatics as they are reconceived in the digital environment)
but we must recognize that conclusive evaluations at the higher levels
may also provide evidence that is relevant to lower-level assessment.
Exploring Definitions and Defining Terms:
Digital Objects, Integrity, and Authenticity
The Nature of Digital Information Objects
Before we can discuss integrity and authenticity, we must examine
the objects to which we apply these characterizations.
Most commonly, computer scientists are concerned with digital objects
that are defined as a set of sequences of bits. One can then ask
computationally based questions about whether one has the correct
set of sequences of bits, such as whether the digital object in one's
possession is the same as that which some entity published under
a specific identifier at a specific point in time. However, this
is a simplistic notion. There are additional factors to consider.
Bits are not directly apprehended by the human sensory apparatus; they
are never truly artifacts. Instead, they are rendered, executed,
performed, and presented to people by hardware and software systems
that interpret them. The question is how sophisticated these environmental
hardware and software systems are and how integral they are to the
understanding of the bits. In some cases, the focus is purely on
the bits: numeric data files, or sensor outputs, for example, that
are manipulated by computational or visualization programs. Documentary
objects are characterized primarily by their bits (think of simple
ASCII text), but the craft of publishing begins to make a sensory
presentation of this collection of bits, to turn content into
experience. Text, marked up in HTML and displayed through a Web browser,
takes on a sensory dimension; the words that make up the text being
rendered no longer tell the whole story. Digital objects that are
performed (music, video, images that are rendered on screen) incorporate
a stronger sensory component. Issues of interaction with the human
sensory system (psychoacoustics, quality of reproduction, visual
artifacts, and the like) become more important. The bits may
be the same across space and time, but because of differences in
the hardware and software used by recipients, the experience
of viewing them may vary substantially. This raises questions
about how to define and measure authenticity and integrity. In the
most extreme case, we have objects that are rendered experientially (video
games, virtual reality walk-throughs, and similar interactive works), where
the focus shifts from the bits that constitute the digital object
to the behavior of the rendering system, or at least to the interaction
between the digital object and the rendering system.
Thus, we might think about a hierarchy of digital objects that could
be expressed as follows:
(Interactive) experiential works
Sensory presentations
Documents
Data
As we move up the hierarchy, from data to experiential works, the
questions about the integrity and authenticity of the digital objects
become more complex and perhaps more subjective; they address experience
rather than documentary content (Lynch 2000). This paper will focus
on the lower part of the digital object hierarchy. The upper part
is poorly understood and today is addressed only in a limited way;
for example, through discussions about emulation as a preservation
strategy (Rothenberg 1999, 1995). It seems conceivable that one could
extend some of the observations and assertions discussed later in
this paper to the more experiential works by performing computations
on the output of the renderings rather than on the objects themselves.
However, this approach is fraught with problems involving canonical
representations of the user interface (which, in the most complex
cases, involves interaction and not just presentation) and agreeing
on what constitutes the authentic experience of the work.
In meatspace, we cheerfully extend the notion of authenticity to
much more than objects; in fact, we explicitly apply it to the
experiential sphere, speaking of an "authentic" performance
of a baroque concerto or an "authentic" Hawaiian luau.
To the extent that we can make the extension and expansion of the
use of authenticity as a characteristic precise within the framework
and terminology of this paper, these statements seem to parallel
statements about integrity of what in the digital environment could
be viewed as experiential works, or performance.
Even as we struggle with definitions and tests of integrity and
authenticity for intellectual works in the digital environment, we
are seeing new classes of digital objects (for example, e-cash
and digital bearer bonds) that explicitly involve and rely upon
stylized and precise manipulation of provenance, authenticity, identity
and anonymity, and integrity within a specific trust framework and
infrastructure. While these fit somewhere between data and documents
in the digital object hierarchy, they are interesting because they
derive their meaning and significance from their explicit interaction
with frameworks of integrity, authenticity, provenance, and trust.
Canonicalization and (Computational) Essence
Often, we seek to discuss the essence of a work rather than
the exact set of sequences of bits that may represent it in a specific
context; we are concerned with integrity and authenticity as they
apply to this essence, rather than to the literal bits. Discussions
of essence become more problematic as we move up the digital object
hierarchy. However, even at the lower levels of data and documents,
we encounter a troublesome imprecision that is a barrier to making
definitions operational computationally when we move beyond the literal
definition of precisely equivalent sets of sequences of bits. Those
approaching the question from a literary or documentary perspective
cast the issue in a palette of grays: there is a series (not necessarily
a strict hierarchy; at best a partial ordering) of intellectual abstractions
of a document that capture its essence at various levels, and the
key problem is whether this abstract essence is retained. The abstraction
may involve words, layout, typography, or even the feel of the pages.
Are hardcover and paperback editions of a book equivalent? Does equivalence
depend on whether the pagination is identical? Elsewhere, I have
proposed canonicalization as a method of making such abstractions
precise (Lynch 1999). The fundamental point of canonicalization as
an organizing principle is that it defines computational algorithms (called "canonicalizations")
that can be used to extract the "essence" of documents
according to various definitions of what constitutes that essence.
If we have such computational procedures for extracting the essence
of digital objects, we can then compare digital objects through the
prism of that definition of essence. We can also make assertions
that involve abstract representations of this essence, rather than
more specific (and presumably haphazard) representations that incorporate
extraneous characteristics.
The hard problem, of course, is precisely defining and achieving
a consensus about the right canonicalization algorithm, or algorithms,
for a given context.
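
As a minimal sketch of the idea, the following Python fragment defines one possible canonicalization and compares two documents through it. The function canonicalize_words, its rule that only the sequence of words matters, and the sample texts are all assumptions made for illustration; they are not the algorithm proposed in Lynch (1999), and a different notion of essence would require a different function.

```python
import hashlib
import re
import unicodedata


def canonicalize_words(text: str) -> bytes:
    """Hypothetical canonicalization: reduce a text to its sequence of words.

    Line breaks, spacing, punctuation, letter case, and Unicode
    compatibility variants are treated as extraneous characteristics.
    """
    normalized = unicodedata.normalize("NFKC", text).casefold()
    words = re.findall(r"\w+", normalized)
    return " ".join(words).encode("utf-8")


def digest(data: bytes) -> str:
    """Message digest of a sequence of bits (SHA-256 chosen for the sketch)."""
    return hashlib.sha256(data).hexdigest()


def equivalent_under(canonicalize, a: str, b: str) -> bool:
    """Compare two documents through the prism of one canonicalization."""
    return digest(canonicalize(a)) == digest(canonicalize(b))


# Two layouts of the same words, and one version with a changed word.
hardcover = "It was a dark and stormy night;\nthe rain fell in torrents."
paperback = "It was a dark and  stormy night; the rain\nfell in torrents."
revised = "It was a dark and stormy evening; the rain fell in torrents."

literal_match = digest(hardcover.encode("utf-8")) == digest(paperback.encode("utf-8"))
same_words = equivalent_under(canonicalize_words, hardcover, paperback)
changed_words = equivalent_under(canonicalize_words, hardcover, revised)
```

Here the two layouts differ as literal bits but are equal under the word-level canonicalization, while the revised text is unequal under both. The code makes the comparison mechanical; it does nothing to settle whether word-level equality is the right definition of essence for a given context.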
Integrity
When we say that a digital object has "integrity," we
mean that it has not been corrupted over time or in transit; in other
words, that we have in hand the same set of sequences of bits that
came into existence when the object was created. The introduction
of appropriate canonicalization algorithms allows us to consider
the integrity of various abstractions of the object, rather than
of the literal bits that make it up, and to operationalize this discussion
of abstractions into equality of sets of sequences of bits produced
by the canonicalization algorithm.
When we seek to test the integrity of an object, however, we encounter
paradoxes and puzzles. One way to test integrity is to compare the
object in hand with a copy that is known to be "true."5 Yet,
if we have a secure channel to a known true copy, we can simply take
a duplicate of the known true copy. We do not need to worry about
the accuracy of the copy in hand, unless the point of the exercise
is to ensure that the copy in hand is correct: for example, to
detect an attempt at fraud, rather than to be sure that we have a
correct copy. These are subtly different questions.6
If we do not have secure access to an independently maintained,
known true copy of the object (or at least a digest surrogate), then
our testing of integrity is limited to internal consistency checking.
If the object is accompanied by an authenticated ("digitally
signed") digest, we can check whether the object is consistent
with the digest (and thus whether its integrity has been maintained)
by recomputing the digest from the object in hand and then comparing
it with the authenticated digest. But our confidence in the integrity
of the object is only as good as our confidence in the authenticity
and integrity of the digest. We have only changed the locus of the
question to say that if the digest is authentic and accurate,
then we can trust the integrity of the object. Verifying integrity
is no different from verifying the authenticity of a claim that "the
correct message digest for this object is M" without assigning
a name to the object. The linkage between claim and object is done
by association and context, by keeping the claim bound with the
object, perhaps within the scope of a trusted processing system such
as an object repository.
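
The consistency check described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a recommended design: SHA-256 stands in for the message digest, and an HMAC tag computed with a shared secret stands in for a digital signature, because the Python standard library offers no public-key signatures. Unlike a true signature, the HMAC requires the verifier to hold the same key as the signer. All names and the key are hypothetical.

```python
import hashlib
import hmac


def message_digest(obj: bytes) -> str:
    """Recompute the digest from the object in hand."""
    return hashlib.sha256(obj).hexdigest()


def authenticate_digest(digest_hex: str, key: bytes) -> str:
    """Stand-in for signing a digest: an HMAC tag over the digest value."""
    return hmac.new(key, digest_hex.encode("ascii"), hashlib.sha256).hexdigest()


def check_integrity(obj: bytes, claimed_digest: str, tag: str, key: bytes) -> str:
    # First question: is the digest itself authentic?
    expected_tag = authenticate_digest(claimed_digest, key)
    if not hmac.compare_digest(expected_tag, tag):
        return "digest not authenticated"
    # Second question: is the object consistent with that digest?
    if not hmac.compare_digest(message_digest(obj), claimed_digest):
        return "object does not match digest"
    return "object consistent with authenticated digest"


key = b"hypothetical shared secret"
original = b"The quick brown fox."
claimed = message_digest(original)
tag = authenticate_digest(claimed, key)

intact = check_integrity(original, claimed, tag, key)
damaged = check_integrity(b"The quick brown fix.", claimed, tag, key)

# Someone who swaps in a new object and a matching new digest, but cannot
# produce a valid tag for it, is caught at the first question.
forged_object = b"A forged replacement."
forged = check_integrity(forged_object, message_digest(forged_object), tag, key)
```

The sketch shows where the question moves: every positive result depends on the key, that is, on our confidence in whoever authenticated the digest.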
In the digital environment, we also commonly encounter the issue
of what might be termed "situational" integrity, i.e.,
the integrity of derivative works. Consider questions such as "Is
this an accurate transcript?", "Is this a correct translation?",
or "Is this the best possible version given a specific set of
constraints on display capability?" Here we are raising a pair
of questions: one about the integrity of a base object, and another
about the correctness of a computation or other transformation applied
to the object. (To be comprehensive, we must also consider the integrity
of the result of the computation or transformation after it has been
produced.) This usually boils down to trust in the source or provider
of the computation or transformation, and thus to a question of authentication
of source or of validity, integrity, and correctness of code.
Authenticity
Validating authenticity entails verifying claims that are associated
with an object; in effect, verifying that an object is indeed
what it claims to be, or what it is claimed to be (by external metadata).
For example, an object may claim to be created on a given date, to
be authored by a specific person, or to be the object that corresponds
with a name or identifier assigned by some organization. Some claims
may be more mechanistic and indirect than others. For example, a
claim that "This object was deposited in a given repository
by an entity holding this public/private key pair at this time" might
be used as evidence to support authorship or precedence in discovery.
Typically, claims are linked to an object in such a way that they
include, at least implicitly, a verification of integrity of the
object about which claims are made. Rather than simply speaking of
the (implied) object accompanying the claim (under the assumption
that the correct object will be kept with the claims, and that the
object management environment will ensure the integrity of the object)
one may include a message digest (and any necessary information about
canonicalization algorithms to be applied prior to computing the
digest) as part of the metadata assertion that embodies the claim.
It is important to note that tests of authenticity deal only with
specific claims (for example, "did X author this document?")
and not with open-ended inquiry ("Who wrote it?"). Validating
the authenticity of an object is more limited than is an open-ended
inquiry into its nature and provenance.
There are two basic strategies for testing a claim. The first is
to believe the claim because we can verify its integrity and authenticate
its source, and because we choose to trust the source. In other words,
we validate the claim that "A is the author of the object with
digest X" by first verifying the integrity of the object relative
to the claim (that it has digest X), and then by checking that the
claim is authenticated (i.e., digitally signed) by a trusted entity
(T). The heart of the problem is ensuring that we are certain who
T really is, and that T really makes or warrants the claim. The second
strategy is what we might call "independent verification" of
the claim. For example, if there is a national author registry that
we trust, we might verify that the data in the author registry are
consistent with the claim of authorship. In both cases, however,
validating a claim that is associated with an object ultimately means
nothing more or less than making the decision to trust some entity
that makes or warrants the claim.
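
The two strategies can be sketched as follows. The Claim record, the TRUSTED_KEYS table, and the dictionary standing in for an author registry are hypothetical, and an HMAC tag again stands in for a digital signature (so the verifier shares the warrantor's key, which a real signature scheme would not require).

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    statement: str      # e.g. "A is the author"
    object_digest: str  # digest X of the object the claim is about
    warrantor: str      # name of the entity T that warrants the claim
    tag: str            # stand-in for T's digital signature


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def tag_for(statement: str, object_digest: str, key: bytes) -> str:
    message = f"{statement}|{object_digest}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def validate_by_trusted_source(obj: bytes, claim: Claim, trusted_keys: dict) -> bool:
    """First strategy: believe the claim because a trusted entity warrants it."""
    # Verify the integrity of the object relative to the claim.
    if sha256_hex(obj) != claim.object_digest:
        return False
    # Have we chosen to trust this warrantor at all?
    key = trusted_keys.get(claim.warrantor)
    if key is None:
        return False
    # Did that warrantor really make the claim?
    expected = tag_for(claim.statement, claim.object_digest, key)
    return hmac.compare_digest(expected, claim.tag)


def validate_by_registry(obj: bytes, author: str, registry: dict) -> bool:
    """Second strategy: independent verification against a trusted registry."""
    return registry.get(sha256_hex(obj)) == author


TRUSTED_KEYS = {"T": b"hypothetical key held by T"}
document = b"An essay on trust."
x = sha256_hex(document)
claim = Claim("A is the author", x, "T", tag_for("A is the author", x, TRUSTED_KEYS["T"]))
author_registry = {x: "A"}  # hypothetical national author registry

accepted = validate_by_trusted_source(document, claim, TRUSTED_KEYS)
not_trusted = validate_by_trusted_source(document, claim, {})
registry_agrees = validate_by_registry(document, "A", author_registry)
```

In both functions the computation ends at a lookup in a table that someone had to decide to trust: the set of keys in one case, the registry in the other.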
Several final points about authenticity merit attention. First,
trust in the maker or warrantor of a claim is not necessarily binary;
in the real world, we deal with levels of confidence or degrees of
trust. Second, many claims may accompany an object; in evaluating
different claims, we may assign them differing degrees of confidence
or trust. Thus, it does not necessarily make sense to speak about
checking the authenticity of an object as if it were a simple true-or-false
test, a computation that produces a one or a zero. It may be
more constructive to think about checking authenticity as a process
of examining and assigning confidence to a collection of claims.
Finally, claims may be interdependent. For example, an object may
be accompanied by claims that "This is the object with identifier
N," and "The object with identifier N was authored by A" (the
second claim, of course, is independent of the document itself, in
some sense). Perhaps more interesting, in an archival context, would
be claims that "This object was derived from the object with
message digest M by a specific reformatting process" and "The
object with message digest M was authored by A." (See Lynch
1999 for a more detailed discussion of this case.)
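
One way to picture non-binary, interdependent claims is to attach a confidence value to each claim and combine the values along a chain. The sketch below multiplies the confidences, which treats them as independent probabilities. That rule is purely an assumption chosen for illustration; nothing in this paper prescribes it, and the numeric values are invented.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Claim:
    text: str
    confidence: float  # degree of trust in the claim's warrantor, from 0 to 1


def chain_confidence(claims: Iterable[Claim]) -> float:
    """Confidence in a conclusion that needs every claim in the chain.

    Assumption: confidences combine by multiplication, as if independent.
    """
    result = 1.0
    for claim in claims:
        if not 0.0 <= claim.confidence <= 1.0:
            raise ValueError(f"confidence out of range: {claim.confidence}")
        result *= claim.confidence
    return result


# The archival example: authorship of the object in hand follows only if
# both the derivation claim and the authorship claim are accepted.
chain = [
    Claim("This object was derived from the object with message digest M "
          "by a specific reformatting process", 0.9),
    Claim("The object with message digest M was authored by A", 0.8),
]
authored_by_a = chain_confidence(chain)
```

Under this rule the conclusion can never be held with more confidence than the weakest claim it depends on, which matches the intuition that checking authenticity assigns confidence to a collection of claims rather than producing a one or a zero.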
Comparing Integrity and Authenticity
It is an interesting, and possibly surprising, conclusion that in
the digital environment, tests of integrity can be viewed as just
special cases and byproducts of evaluations of authenticity. Part
of this comes from the perspective of the environment of "pervasive
deceit" and the idea that checking integrity of an object means
comparing it with some precisely identified and rigorously vetted "original
version" or "authoritative copy." In fact, much of
the checking for integrity in the physical world is not about ferreting
out pervasive deceit and malice, but rather about accepting artifacts
for roughly what they seem to be on face value and then looking for
evidence of damage or corruption (e.g., torn-out pages or redacted
text). For this kind of integrity checking, a message digest that
accompanies a digital object as metadata serves as an effective mechanism
to ensure that the object has not been damaged or corrupted. This
is true even if the message digest is not supported by an elaborate
signature chain and trust assessment, but only by a general level
of confidence in the computational context in which the objects are
being stored and transmitted. In the digital environment, there is
a tendency to downplay the need for this kind of integrity checking
in favor of stronger measures that combine authenticity claims with
integrity checks.
The Role of Copies
David Levy argues that all digital objects are copies; this echoes
the findings of the National Research Council Committee on Intellectual
Property in the Emerging Information Infrastructure that use (reading,
for example) implies the making of copies (National Research
Council 2000). If we accept this view, authenticity can be viewed
as an assessment that we make about something in the present (something
that we have in hand) relative to claims about the past (predecessor
copies). The persistent question is whether a given object X has
the same properties as object Y. There is no "original." This
is particularly relevant when we are dealing with dynamic objects
such as databases, where an economy of copies is meaningless. In
such cases, there is no question of authenticity through comparison
with other copies; there is only trust or lack of trust in
the location and delivery processes and, perhaps, in the archival
custodial chain.
Provenance
The term provenance comes up often in discussions of authenticity
and integrity. Provenance, broadly speaking, is documentation about
the origin, characteristics, and history of an object; its chain
of custody; and its relationship to other objects. The final point
is particularly important. There are two ways to think about a digital
object that is created by changing the format of an older object
that has been validated according to some specific canonicalization
algorithm. We might think about a single object the provenance of
which includes a particular transformation, or we might think about
multiple objects that are related through provenance documentation.
Thus, provenance is not simply metadata about an object; it can
also be metadata that describe the relationships between objects.
Because provenance also includes claims about objects, it is part
of the authentication and trust infrastructures and frameworks.
I do not believe that we have a clear understanding of (and surely
not consensus about) where provenance data should be maintained in
the digital environment, or by what agencies. Indeed, it is not clear
to what extent the record of provenance exists independently and
permanently, as opposed to being assembled when needed from various
pools of metadata that may be maintained by various systems in association
with the digital objects that they manage. We also lack well-developed
metadata element sets and interchange structures for documenting
provenance. It seems possible that the Dublin Core, augmented by
semantics for signing metadata assertions, might form a foundation
for this, although attributes such as relationship would need to
be extended to allow for very precise vocabularies to describe algorithmically
based derivations of objects from other objects (or transformations
of objects). We would probably also need to incorporate metadata
assertions that allow an entity to record claims such as "Object
X is equivalent to object Y under canonicalization C."
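
To make the shape of such an assertion concrete, here is a minimal sketch of a record for the claim "Object X is equivalent to object Y under canonicalization C." The field names, the registry of named canonicalizations, and the archive name are hypothetical; this is not a Dublin Core element set or any existing interchange structure.

```python
import hashlib
from dataclasses import dataclass


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def collapse_whitespace(data: bytes) -> bytes:
    """Hypothetical canonicalization: runs of ASCII whitespace become one space."""
    return b" ".join(data.split())


# Named canonicalizations that an assertion can refer to.
CANONICALIZATIONS = {
    "identity": lambda data: data,
    "collapse-whitespace": collapse_whitespace,
}


@dataclass(frozen=True)
class EquivalenceAssertion:
    digest_x: str          # literal digest identifying object X
    digest_y: str          # literal digest identifying object Y
    canonicalization: str  # name of canonicalization C
    asserted_by: str       # entity recording the claim


def assertion_holds(assertion: EquivalenceAssertion, x: bytes, y: bytes) -> bool:
    """Recompute the relationship that the assertion records."""
    # Confirm that the objects in hand are the ones the assertion names.
    if sha256_hex(x) != assertion.digest_x or sha256_hex(y) != assertion.digest_y:
        return False
    canonicalize = CANONICALIZATIONS[assertion.canonicalization]
    return sha256_hex(canonicalize(x)) == sha256_hex(canonicalize(y))


x = b"Authenticity  and\nintegrity"
y = b"Authenticity and integrity"
assertion = EquivalenceAssertion(
    sha256_hex(x), sha256_hex(y), "collapse-whitespace", "hypothetical-archive"
)
literal = EquivalenceAssertion(
    sha256_hex(x), sha256_hex(y), "identity", "hypothetical-archive"
)

holds = assertion_holds(assertion, x, y)
holds_literally = assertion_holds(literal, x, y)
```

The record relates two objects rather than describing one, and it can be rechecked by anyone who holds both objects. When only one of the objects survives, the assertion cannot be recomputed, and its value rests on trust in the entity named in asserted_by.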
Watermarks, Authenticity, and Integrity
In the most general sense, watermarking can be viewed as an attempt
to ensure that a set of claims is inseparably bound to a digital
object and thus can be assumed to travel with the object; one does
not have to trust transport and storage systems to correctly perform
this function. The most common use of watermarks today is to help
protect intellectual property by attaching a copyright claim (and
possibly an object-specific serial number to allow tracing of individual
copies) to an object. Software exists to scan public Web sites for
objects that contain watermarks and to notify the rights holders
about where these objects have been found. A serial number, if present,
helps the rights holder not only identify the presence of a possibly
illegal copy but also determine where it came from. Various trusted
system-based architectures for the control of copyrighted works have
also been proposed that use watermarking (for example, the Secure
Digital Music Initiative [2000]). The idea is that devices will refuse
to play, print, or otherwise process digital objects if the appropriate
watermarks are not present.7 The
desirable properties of watermarks include being very hard to remove
computationally (at least without knowledge of the private key as
well as the algorithm used to generate the watermark) and being resilient
under various alterations that may be applied to the watermarked
file (lossy compression, for example, or image cropping). The development
of effective watermarking systems is currently a very active area
of research.8
From the perspective of authenticity and integrity, watermarks present
several problems. First, they deliberately and systematically corrupt
the objects to which they are applied, in much the same way that
techniques such as lossy compression do. Fingerprints (individualized
watermarks) are particularly bad in this regard since they defeat
comparisons among copies as a way of establishing authenticity; indeed,
this is exactly what they are designed to do, to make each copy unique
and traceable. Applying a watermark to a digital object means changing
bits within the object, but in such a way that they change the perception
of the object only slightly. Thus, finding and verifying a watermark
in a digital object give us only weak evidence of its integrity.
In fact, the very presence of the watermark means that integrity
has been compromised at some level, unless we are willing to accept
the watermarked version of the object as the actual authoritative
one: an image or sound recording that includes some data that
allegedly does not much change our perception of the object. If a
watermark can easily be stripped out of an object (a bad watermark
design, but perhaps characteristic of watermarking systems that try
to minimize corruption), then the absence of such a watermark does
not tell us much about the possible corruption of other parts of
the object.
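
A toy example shows why a watermark and a bit-level integrity check pull in opposite directions. The sketch below hides a mark in the least significant bit of each sample. This scheme is an assumption chosen for brevity: it is trivially removable and has none of the resilience sought in real watermarking systems. The sample values and serial numbers are invented.

```python
import hashlib
from typing import List


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def embed_watermark(samples: bytes, mark_bits: List[int]) -> bytes:
    """Toy watermark: overwrite the lowest bit of the first samples."""
    if len(mark_bits) > len(samples):
        raise ValueError("mark is longer than the object")
    out = bytearray(samples)
    for i, bit in enumerate(mark_bits):
        out[i] = (out[i] & 0xFE) | (bit & 1)
    return bytes(out)


def extract_watermark(samples: bytes, length: int) -> List[int]:
    return [value & 1 for value in samples[:length]]


# Sixteen stand-in sample values (say, pixel intensities).
original = bytes(range(100, 116))

# Two fingerprinted copies carrying different serial numbers.
serial_a = [1, 0, 1, 1, 0, 0, 0, 1]
serial_b = [0, 1, 1, 0, 1, 0, 1, 0]
copy_a = embed_watermark(original, serial_a)
copy_b = embed_watermark(original, serial_b)

recovered = extract_watermark(copy_a, len(serial_a))
largest_change = max(abs(p - q) for p, q in zip(original, copy_a))
differs_from_original = sha256_hex(copy_a) != sha256_hex(original)
copies_differ = sha256_hex(copy_a) != sha256_hex(copy_b)
```

No sample moves by more than one unit, so perception is barely affected, yet each copy has a different digest from the original and from the other copy. Comparing digests among fingerprinted copies therefore establishes nothing unless the comparison is made under a canonicalization that ignores the marked bits.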
A second problem is that some watermarking systems do not emphasize
preventing the creation of fake watermarks; they are concerned primarily
with the preservation of legitimate watermarks as evidence of ownership
or status of the watermarked object. To use watermarking to address
authenticity issues, it seems likely that one would need to use it
simply as a means of embedding a claim in an object, under the assumption
that the claim would then have to be separately verifiable (for example,
by being digitally signed).
To summarize: If one obtains a digital object that contains a watermark,
particularly if that watermark contains separately verifiable claims,
it can provide useful evidence about the provenance and characteristics
of the object, including good reasons to assume that it is a systematically
and deliberately corrupted version of a predecessor digital object
that one may or may not have access to or be able to locate. The
watermark may have some value in forensic examination of digital
objects, but it does not seem to be a good tool for the management of
digital objects within a controlled environment such as an archive
or repository system that is concerned with object integrity. It
seems more appropriate to require that the environment take responsibility
for maintaining linkages and associations between metadata (claims)
and the objects themselves. Watermarks are more appropriate for an
uncontrolled public distribution environment where integrity is just
one variable in a complex set of trade-offs about the management
and protection of content.
Semantics of Digital Signatures
One serious shortcoming of current cryptographic technology has
to do with the semantics of digital signatures or, more precisely,
the lack thereof. In fairness, many cryptographers are not concerned
with replicating the higher levels of semantics that accompany the
use of signatures in the physical world. They regard these issues
as the responsibility of an applications environment that uses digital
signatures as a tool or supporting mechanism. But wherever we assign
responsibility for establishing a system of semantics, the need for
such semantics is very real, and I believe that many people outside
the cryptographic community have been misled by their assumptions
about the word "signature." They do not understand that the
semantics problem is still largely unaddressed.
At its core, a digital signature is a mechanical, computational
process. Some entity in possession of a public/private key pair was
willing to perform a computation on a set of data using this key
pair, which permits someone who knows the public key of the key pair
to verify that the data were known to and computed upon by an entity
that held the key pair. A digital signature amounts to nothing more
than this. Notice that any digital data can be signed: not just
documents or their digests, but also assertions about documents.
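To make the mechanical nature of the computation concrete, here is a minimal sketch using textbook RSA over a SHA-256 digest. The parameters are assumptions chosen only so the example runs with the Python standard library: the modulus is built from two well-known Mersenne primes, there is no padding, and nothing here is secure enough for real use.

```python
import hashlib
import math

# Toy key pair (assumption for illustration only; far too weak for real use).
P = 2**127 - 1          # a Mersenne prime
Q = 2**89 - 1           # another Mersenne prime
N = P * Q               # public modulus
E = 65537               # public exponent
PHI = (P - 1) * (Q - 1)
assert math.gcd(E, PHI) == 1
D = pow(E, -1, PHI)     # private exponent (requires Python 3.8+)


def _digest_as_int(data: bytes) -> int:
    """Reduce the SHA-256 digest of the data to an integer below N."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(data: bytes) -> int:
    """Computation that only the holder of the private exponent D can perform."""
    return pow(_digest_as_int(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Check, using only the public values (N, E), that the key holder computed on data."""
    return pow(signature, E, N) == _digest_as_int(data)


# Any bytes can be signed: a document, or an assertion about a document.
document = b"Minutes of the board meeting"
assertion = (
    b"sha256:" + hashlib.sha256(document).hexdigest().encode("ascii")
    + b" was received on deposit"
)

sig_document = sign(document)
sig_assertion = sign(assertion)
```

A successful `verify` call shows only that the holder of the private key computed on those bytes. It says nothing about whether the key holder authored, approved, or merely received them.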
The interface between digital signature processing and documents
is extremely complex, questions about the semantics of signatures
aside. The reader is invited to explore the work of the joint World Wide
Web Consortium/Internet Engineering Task Force on digital signatures
for XML documents (1998) to get a sense of how issues such as canonicalization
come into play here.
The use of digital signatures in conjunction with a public key infrastructure
(PKI) offers a little more.9 People
can choose to trust the procedures of a PKI to do the following kinds
of things:
- To verify, according to published policies, a user's right to
an "identity" and to subsequently document the binding
between that identity and a public/private key pair. Verification
policies vary widely, from taking someone's word in an e-mail message
to demanding witnesses, extensive documentation such as passports
and birth certificates, personal interviews, and other proof. In
essence, one can trust the PKI service to provide the public key
that corresponds to an identity. The identity can be either a name
("John Smith") or a role ("Chief Financial Officer
of X Corporation"). Attributes can also be bound to the identity.
- To provide a means for determining when a key pair/identity binding
has been compromised, expired, or revoked and should no longer
be considered valid.
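The two services above can be pictured as a lookup of a binding record followed by a status check. The sketch below is an assumption about how such a record might look; the field names and status strings are hypothetical and do not follow any particular certificate format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class KeyBinding:
    """A PKI's record that an identity is bound to a public key (hypothetical layout)."""
    identity: str                 # a name or a role
    public_key_id: str            # stand-in for the actual public key
    not_before: datetime
    not_after: datetime
    revoked_at: Optional[datetime] = None   # set if compromised or withdrawn


def binding_status(binding: KeyBinding, when: datetime) -> str:
    """Report whether the binding should be considered valid at a given moment."""
    if binding.revoked_at is not None and when >= binding.revoked_at:
        return "revoked"
    if when < binding.not_before:
        return "not yet valid"
    if when > binding.not_after:
        return "expired"
    return "valid"


cfo_binding = KeyBinding(
    identity="Chief Financial Officer of X Corporation",
    public_key_id="key-01",
    not_before=datetime(2000, 1, 1),
    not_after=datetime(2002, 1, 1),
    revoked_at=datetime(2001, 6, 1),
)
```

Note that the answer depends on the moment being asked about, which matters when a signature is evaluated long after it was made.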
Compare this mechanistic view of signatures with the rich and diverse
semantics of signatures in the real world. A signature might mean
that the signer
- authored the document;
- witnessed the document and other signatures on it;
- believes that the document is correct;
- has seen, or received, the document;
- approves the actions proposed in the document; or
- agrees to the document.
There are questions not only about the meaning of signatures but
also about their scope. In some situations, for example, documents
are signed or initialed on every page; in others, a signature witnesses
only another signature, not the entire document. Questions of scope
become complex in a digital world, particularly as signed objects
undergo transformations over time (because of reformatting, for example).
Considerable research is needed in these areas.
Digital signatures alone can neither differentiate among the possible
semantics outlined earlier, nor provide direct evidence of any one
of them. In other words, there is no reasonable "default" meaning
that can be given to a signature computation. Such signatures can
tell us that a set of bits has been computed upon, and, in conjunction
with a PKI, they can tell us who performed that computation. We clearly
need a mechanism for expressing semantics of signatures that can
be used in conjunction with the actual computational signature mechanism: a
vocabulary for expressing the meaning of a signature in relationship
to a digital object (or, in fact, a set of digital objects that might
include other signed assertions).
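One way to picture such a mechanism is a small claim record that names the intended meaning explicitly and is itself the thing that gets signed. The sketch below is purely illustrative: the vocabulary terms simply mirror the list of real-world meanings given earlier, and the record layout and function name are assumptions, not an existing standard.

```python
import hashlib
import json
from enum import Enum


class SignatureMeaning(Enum):
    """Hypothetical vocabulary mirroring the real-world meanings listed earlier."""
    AUTHORED = "authored"
    WITNESSED = "witnessed"
    BELIEVES_CORRECT = "believes-correct"
    RECEIVED = "received"
    APPROVES_ACTIONS = "approves-actions"
    AGREES = "agrees"


def build_claim(obj: bytes, signer: str, meaning: SignatureMeaning) -> bytes:
    """Serialize a claim deterministically; these bytes are what would be signed."""
    claim = {
        "meaning": meaning.value,
        "object_sha256": hashlib.sha256(obj).hexdigest(),
        "signer": signer,
    }
    # Sorted keys and fixed separators give one canonical byte string per claim.
    return json.dumps(claim, sort_keys=True, separators=(",", ":")).encode("utf-8")


report = b"Annual report, final text"
claim_received = build_claim(report, "Jane Archivist", SignatureMeaning.RECEIVED)
claim_agrees = build_claim(report, "Jane Archivist", SignatureMeaning.AGREES)
```

Because the meaning is part of the signed bytes, a signature over `claim_received` cannot later be presented as a signature over `claim_agrees`. The claim refers to the object by digest, so it is bound to one exact bit sequence and would need to be reissued if the object were reformatted.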
One can imagine defining such a vocabulary and interchange syntax
for the management and preservation of digital objects, for a
community of archives and cultural heritage organizations, for example.
But there is another problem that has not been well explored, to
my knowledge. It is likely that we will see the development of one
or more "public" vocabularies for commerce and contracting,
and perhaps additional ones for the registry and management of intellectual
property. These vocabularies might vary among nations, or even among
states in a nation such as the United States, where much contracting
is governed by state law.10 In
addition, we will almost certainly see the development of organization-specific "internal" vocabularies
in support of institutional processes. Many of the initial claims
about objects will likely be expressed in one of these other vocabularies
rather than the vocabularies of the cultural heritage communities;
consequently, we will face complex problems of mapping and interpreting
vocabularies. We will also face the problems of trying to interpret
vocabularies that may belong to organizations that no longer exist
or vocabularies in which usage has changed over time, perhaps in
poorly documented ways.
The Roles of Identity and Trust
Virtually all determination of authenticity or integrity in the
digital environment ultimately depends on trust. We verify the source
of claims about digital objects or, more generally, claims about
sets of digital objects and other claims, and, on the basis of that
source, assign a level of belief or trust to the claims. As a second,
more intellectual form of analysis, we can consider the consistency
of claims, and then further consider these claims in light of other
contextual knowledge and common sense. For example, an object that
claims to have been authored in 2003 by someone who died in 2001
would reasonably raise questions, even if all of the signatures verify.
We can draw precious few conclusions from objects standing alone,
except by applying this kind of broader intellectual analysis. As
we have seen, ensuring the validity of linkages between claims and
the objects about which those claims make assertions is an important
question. The question becomes even more difficult when we recognize
that both objects and sets of claims evolve independently and at
different rates, because of maintenance processes such as reformatting
or the expiration of key pairs and the issuance of new ones.
Ultimately, trust plays a central role, yet it is elusive. Signatures
can allow us to trust a claim if we trust the holder of a key pair,
and a public key infrastructure can allow us to know the identity
(name) of the holder of a key pair if we trust the operator of the
PKI. If we know the name of the entity we trust, we can thus use
the PKI to determine its public key and use that to verify signatures
that the entity has made. We can establish the link between identity
and keys directly (we can directly obtain, through some secure method,
the public key from a trusted entity) or through informal intermediaries
(we can securely obtain the key from someone we know and trust, as
is done in the Pretty Good Privacy [PGP] system) (Zimmermann 1995).
It is important to recognize that trust is not necessarily an absolute,
but often a subjective probability that we assign case by case. The
probability of trustworthiness may be higher for some PKIs than for
others, because of their policies for establishing identity. Moreover,
we may establish higher levels of trust based on identities that
we have directly confirmed ourselves than on those confirmed by others.
Considerable research is being done on methods that people could
use to define rules about how they assign trust and belief. These
rules can drive computations for a calculus of trust in evaluating
claims within the context of a set of known keys and identities and
PKI services that maintain identities. An interesting question, which
I do not think we are close to being able to answer, is whether there
will be a community consensus on trust assignment rules within the
cultural heritage community, or whether we will see many, wildly
differing, choices about when to establish trust.
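As a rough illustration of what a rule-driven trust computation might look like, the sketch below multiplies two subjective probabilities: trust in the source that vouches for the signer's identity, and trust in the signer. Treating the two as independent and multiplying them is a naive assumption made for the example, and every name and number in it is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SignedClaim:
    text: str
    signer: str
    identity_source: str   # "direct" if we confirmed the key ourselves, else a PKI name


# Subjective probabilities assigned by one evaluator (hypothetical values).
TRUST_IN_IDENTITY_SOURCE = {
    "direct": 0.99,        # identities we confirmed ourselves
    "StrictPKI": 0.95,     # demands documents and interviews
    "EmailOnlyPKI": 0.50,  # takes someone's word in an e-mail message
}
TRUST_IN_SIGNER = {
    "National Archive": 0.90,
    "Unknown Press": 0.40,
}


def belief_in(claim: SignedClaim) -> float:
    """Naive rule: belief = trust in the identity binding x trust in the signer."""
    source = TRUST_IN_IDENTITY_SOURCE.get(claim.identity_source, 0.0)
    signer = TRUST_IN_SIGNER.get(claim.signer, 0.0)
    return source * signer


claim_a = SignedClaim("Object 17 was deposited in 2000", "National Archive", "StrictPKI")
claim_b = SignedClaim("Object 17 was deposited in 2000", "National Archive", "EmailOnlyPKI")
```

The same claim from the same signer earns different levels of belief depending on how the identity was established. Two communities with different tables would reach different conclusions from identical signatures.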
We also need an extensive inquiry into the nature of identity in
the digital world as it relates to authenticity questions such as
claims of authorship. Consider just a few points here. Identity in
the digital world means that someone has agreed to trust an association
between a name and a key pair, because he or she has directly verified
it or trusts an intermediary, such as a PKI, that records such an
association. Control of an identity, however, can be mechanically
transferred or shared by the simple act of the owner of a key pair
sharing that key pair with some other entity. We have to trust not
only the identity but also the behavior of the owner of that identity.
If we are to trust a claim of authorship, whom do we expect to sign
it? The author? The publisher? A registry such as the copyright office,
which would more likely sign a claim stating that the author has
registered the object and claimed authorship?
Identity is more than simply a name. We frequently find anonymous
or pseudonymous authorship; how are these identities created and
named? We have works of corporate authorship, including the notion
of "official" works that are created through deliberate
corporate acts and that represent policy or statements with legal
implications. In this case, the signatory may be someone with a specific
role or office within a corporation (an officer of the corporation
or the corporate secretary, for example). These may be very volatile
in an era of endless mergers and acquisitions, as well as occasional
bankruptcies. Finally, we have various ad-hoc groups that come together
to author works; these groups may be unwilling or unable to create
digital identities within the trust and identity infrastructure (consider,
for example, artistic, revolutionary, or terrorist manifestos).
We know little about how identity management systems operate over
very long periods. Imagine a digital object that is released from
an archive in 2100 for the first time, an object that had been
sealed since its deposit in 2000. A group of experts is trying to
assess the claims associated with the object. One scenario is that
all claims were verified upon deposit, and the archive has recorded
that verification; the experts then trust the archive to have correctly
maintained the object since its deposit and to have appropriately
verified the claims. A second scenario is that the group of experts
chooses to re-verify the claims. This may take them into an elaborate
exploration of the historical evolution of policies of certificate
authorities and public key infrastructure operators that have long
since vanished, of histories of key assignment and expiration, and
perhaps even of the evolution of our understanding of the vulnerabilities
of cryptographic algorithms themselves. This suggests that our ability
to manage and understand authenticity and integrity over long periods
of time will require us to manage and preserve documentation about
the evolution of the trust and identity management infrastructure
that supports the assertions and evaluation of authenticity and integrity.
This, in turn, raises the concern that relying on services and infrastructure
that are being established primarily to support relatively short-term
commercial activities may be problematic. At a minimum, it suggests
that we may need to begin a discussion about the archival requirements
for such services if they are to support the long-term management
of our cultural and intellectual heritage.
Authorship is just one example of the difficulties involved in "literary" signature
semantics. Consider the problem of assigning publication dates as
another example. Every publisher has different standards and thus
different semantics.
Conclusions
In an attempt to explore the central roles of trust and identity
in addressing authenticity and integrity for digital objects, this
paper points to a wide-ranging series of research questions. It identifies
the need to begin considering standardization efforts in areas such
as signing metadata claims and the semantics of digital signatures
to support authenticity and integrity.
But a set of more basic issues about infrastructure development
and large-scale deployment also needs to be carefully considered.
A great deal of technology and infrastructure now being deployed
will be useful in managing integrity and authenticity over time.
However, these developments are being driven by commercial requirements
with short time horizons in areas such as authentication, electronic
commerce, electronic contracting, and management and control of digital
intellectual property. The good news is that there is a huge economic
base in these areas that will underwrite the development of infrastructure
and drive deployment. To the extent that we can share this work to
manage cultural and intellectual heritage, we need to worry only
about how to pay to use it for these applications, not about how
to underwrite its development. Even there, however, we need to think
about who will pay to establish the necessary identities and key
pairs and to apply them to create the appropriate claims that will
accompany digital objects. The less-good news is that we need to
be sure that the infrastructure and deployed technology base actually
meet the needs of very long-term management of digital objects. To
take one example, knowing the authorship of a work is still important,
even after all the rights to the work have entered the public domain.
It is essential that institutions concerned with the management and
preservation of cultural and intellectual heritage engage, participate
in, and continue to critically analyze the development of the evolving
systems for implementing trust, identity, and attribution in the
digital environment.
Footnotes
1. Confusingly,
however, we have the appearance of perfect forgeries (at least in
terms of content; the packaging is often substandard) of digital
goods in the form of pirate audio CDs, DVDs, and software CD-ROMs.
In these cases, the purpose is not usually intellectual fraud so
much as commercial fraud through piracy. One might argue that these
copies have integrity (they are, after all, bitwise equivalent);
however, their authenticity is dubious, or at least needs to be proved
by comparison with copies that have a provenance that can be documented.
Another case that bears consideration and helps refine our thinking
is the bootleg or "gray-market" recording, perhaps
an audio CD of a live performance of a well-known band, released
without the authorization of the performers and not on their usual
record label. This does not stop the recording from being authentic
and accurate, albeit unauthorized. The performers may or may not
be willing to vouch for the authenticity of the recording; alternatively,
one may have to rely on the evidence of the content (i.e., nobody
else sounds like that) and, possibly, metadata provided by a third
party that potentially has its own provenance.
2. It
would be useful to better understand why there has not been a greater
effort to deploy these capabilities, even though they have substantial
limitations. Contributing factors undoubtedly include export controls
and other government regulations on cryptography, both in the United
States and elsewhere; legal and liability issues involved in an infrastructure
that addresses authentication and identity; and social and cultural
concerns about privacy, accountability, and related topics. Patent
issues are a particular problem. It is hard to develop infrastructure,
widely deployed standards, and critical mass when key elements are
tied up by patents. With the recent insane proliferation of patents
on software methods, algorithms, business models, and the like, uncertainty
about patent issues is also a serious barrier to deployment. All
of these have been well covered in the literature and the press.
What has been less well examined is the lack of clear, well-established
economic models to support systems of authentication and integrity
management. To put it bluntly, it is not clear who is willing to
pay for the substantial development, deployment, and operation of
such a system. While many people say they are worried about authenticity
and integrity in a digital environment, it is not clear that they
are willing to pay the increased costs to effectively address these
concerns.
3. It
is worth carefully examining the forensic clues available when evaluating
a digital object as an artifact. Today, many of them seem trivial,
but as our history with digital technology grows longer, understanding
them will likely become a specialized body of expertise. Examples
include character codes, file formats, and formats of embedded fonts,
all of which can help at least place the earliest time that a digital
object could be created, and perhaps even provide evidence to argue
that it was unlikely to have been created after a certain time. For
an object that has undergone format conversions over time as part
of its preservation, these forensic clues help only in the evaluation
of the record of provenance.
4. For
digital objects created by digitizing physical artifacts, if we can
identify and obtain access to the source physical artifact, we can
apply well-established forensic and diplomatic analysis practices
to the source object.
5. As
soon as we begin to speak of copies, however, we need to be very
careful. Unless we know the location of the copy through some external
(contextual) information, we run the risk of confusing authenticity
and integrity. For example, if we have an object that includes a
claim that "the identifier of this object is N" and we
simply go looking for copies of objects with identifier N on a server
that we trust, and then securely compare the object in hand with
one of these copies, what we have really done is simply to trust
the server to make statements about the assignment of the identifier
N and then confirmed we had an accurate copy of the object with that
identifier in hand. The key difference is between trusting the server
to keep a true copy of an object in a known place and trusting the
server to vouch for the assignment of an identifier to an object.
6. One
thing that we can do with cryptographic technology (specifically,
digest algorithms) is to test whether two copies of an object
are identical without actually exchanging the object. This is important
in contexts where economics and intellectual property come into play.
For example, a publisher that is offering copies of a digital document
for license can also offer a verification service, where the holder
of a copy of a digital object can verify its integrity without having
to purchase access to a new copy. Or, two institutions, each of which
holds a copy of a digital object but does not have the rights to share
it with another institution, can verify that they hold the same object.
Digest algorithms are also useful for efficiency purposes, because
they avoid the need to transmit copies of what may be very large
objects in order to test integrity. We should note, however, that the
assurance digest algorithms provide is probabilistic; the algorithms are
designed to make it very unlikely that two different objects (particularly
two similar but distinct documents) will have the same digest.
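A minimal sketch of such a comparison follows, assuming SHA-256 as the digest algorithm and in-memory byte streams as stand-ins for the two institutions' holdings. The function names are hypothetical.

```python
import hashlib
import hmac
import io


def digest_of(stream, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so a large object never has to fit in memory."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


def same_object(digest_a: str, digest_b: str) -> bool:
    """Compare two digests; only these short strings are exchanged, never the objects."""
    return hmac.compare_digest(digest_a, digest_b)


text = b"Chapter 1. It was a dark and stormy night. " * 1000

# Each institution computes a digest over its own copy, locally.
d_library = digest_of(io.BytesIO(text))
d_archive = digest_of(io.BytesIO(text))

# A copy that differs in a single character yields a different digest.
d_altered = digest_of(io.BytesIO(text[:-7] + b"nite. "))
```

Each digest is 64 hexadecimal characters regardless of the size of the object, which is what makes the exchange cheap.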
7. This
is not a universally accepted definition of a digital watermark.
The term is also used to refer to other things, such as modifications
to images that allow them to be viewed on-screen with only moderate
degradation but that produce very visible and unsightly artifacts
when the image is printed. The description here characterizes what
I believe to be the most commonly used definition of the technology.
Sometimes "watermark" is reserved for a "universal" encoding
hidden in all copies of a digital object that are distributed by
a given source (for example, containing an object identifier) and
the term "fingerprint" is reserved for watermarks that
are copy-specific, that is, personalized to given recipients (containing
a serial number or the recipient's identifier). The fingerprint individualizes
an object to a version associated with a specific recipient.
8. See,
for example, the proceedings of the series of conferences on Information
Hiding (Anderson 1996, Aucsmith 1998, Pfitzmann 2000). See also proceedings
from the first, second, and third international conferences on financial
cryptography (Hirschfeld 1997, Hirschfeld 1998, Franklin 1999).
9. See,
for example, Ford and Baum 1997; Feghhi, Feghhi, and Williams 1999.
10. In
the United States, some of this is likely to be determined by how
quickly federal law regarding digital signatures is established and
by the extent to which federal law preempts developing state laws.
Changes to the Uniform Commercial Code will likely play a role. See
http://washofc.epic.org/crypto/dss/ for information on a variety
of material on current legislative and standards developments related
to digital signatures.
REFERENCES
Anderson, Ross, ed. 1996. Information Hiding: First International
Workshop, Cambridge, U.K., May 30-June 1, 1996, proceedings. Lecture
Notes in Computer Science, vol. 1174. Berlin and New York: Springer.
Aucsmith, David, ed. 1998. Information Hiding: Second International
Workshop, Portland, Oregon, U.S.A., April 14-17, 1998, proceedings. Lecture
Notes in Computer Science, vol. 1525. Berlin and New York: Springer.
Bearman, David, and Jennifer Trant. 1998. Authenticity of Digital
Resources: Towards a Statement of Requirements in the Research Process, D-Lib
Magazine (June). Available from http://www.dlib.org/dlib/june98/06bearman.html.
Duranti, Luciana. 1998. Diplomatics: New Uses for an Old Science.
Lanham, Md.: Scarecrow Press.
Hirschfeld, Rafael, ed. 1997. Financial Cryptography: First International
Conference, Anguilla, British West Indies, February 24-28, 1997,
proceedings. Lecture Notes in Computer Science, vol. 1318.
Berlin and New York: Springer.
Hirschfeld, Rafael, ed. 1998. Financial Cryptography: Second International
Conference, Anguilla, British West Indies, February 23-25, 1998,
proceedings. Lecture Notes in Computer Science, vol. 1465.
Berlin and New York: Springer.
Feghhi, Jalal, Jalil Feghhi, and Peter Williams. 1999. Digital
Certificates: Applied Internet Security. Reading, Mass.: Addison
Wesley.
Ford, Warwick, and Michael S. Baum. 1997. Secure Electronic Commerce:
Building the Infrastructure for Digital Signatures and Encryption.
Upper Saddle River, N.J.: Prentice Hall.
Franklin, Matthew, ed. 1999. Financial Cryptography: Third International
Conference, Anguilla, British West Indies, February 22-25, 1999,
proceedings. Lecture Notes in Computer Science, vol. 1648.
Berlin and New York: Springer.
Lessig, Lawrence. 1999. Code and Other Laws of Cyberspace.
New York: Basic Books.
Lynch, Clifford. 2000. "Experiential Documents and the Technologies
of Remembrance," in I in the Sky: Visions of the Information
Future, edited by Alison Scammell. London: Library Association
Publishing.
Lynch, Clifford. 1999. Canonicalization: A Fundamental Tool to Facilitate
Preservation and Management of Digital Information, D-Lib Magazine 5(9)
(September). Available from http://www.dlib.org/dlib/september99/09lynch.html.
National Research Council. 2000. The Digital Dilemma: Intellectual
Property in the Information Infrastructure. Washington, D.C.:
National Academy Press.
Pfitzmann, Andreas, ed. 2000. Information Hiding: Third International
Workshop, Dresden, Germany, September 29-October 1, 1999, proceedings. Lecture
Notes in Computer Science, vol. 1768. Berlin and New York: Springer.
Rothenberg, Jeff. 1999. Avoiding Technological Quicksand: Finding
a Viable Technical Foundation for Digital Preservation. Washington,
D.C.: Council on Library and Information Resources. Available from http://www.clir.org.
Rothenberg, Jeff. 1995. Ensuring the Longevity of Digital Documents. Scientific
American 272(1):24-9.
Secure Digital Music Initiative. 2000. Available from http://www.sdmi.org.
World Wide Web Consortium/Internet Engineering Task Force on Digital
Signatures for XML Documents. 1998. Digital Signature Initiative.
Available from http://www.w3.org/DSig.
Zimmermann, Philip R. 1995. The Official PGP User's Guide.
Cambridge, Mass.: MIT Press.