APPENDIX 1
Organizational Models for Digital Archiving
Dale Flecker
Harvard University Library
March 2002
Background
Before considering some of the organizational models
that have emerged for the archiving of digital scholarly information,
it is useful to step back and look at some factors that influence
which organizations are likely to become active in this arena. These
factors include the following:
What organizations believe that digital archiving
is their role?
Digital archiving is complex and costly, and it requires a long-term
institutional commitment. Traditionally, few institutions have assumed
the role of preserving resources over long periods of time: such
institutions have included research and national libraries, records
and manuscript archives, museums, and entities interested in documenting
their own history. Archiving is not a responsibility assumed lightly.
In general, preservation is undertaken by institutions for which
it is an explicit part of their social role and a need or expectation
of the population they serve.
What organizations have the infrastructure needed
for digital archiving?
Digital archiving is a technically complex task and requires a fair
amount of infrastructure: appropriate hardware and software, a sound
and secure environment, and skilled staff. The increasing capacity
and ease of use of desktop or office-level technology may at first
glance make infrastructure seem to be less important than it once
was; however, the increasing scale of material to be archived, and
continual technological change, make the need for a robust and professional
infrastructure ever more important. The need for infrastructure has
been a factor to date in keeping many smaller, less technically sophisticated
institutions from extending their collections to include digital
resources. Over time, however, the need for technical infrastructure
may become the least important of the factors listed here, because
commercial service bureaus such as the incipient OCLC digital archive
may be available to handle the technical aspects of archiving.
Who has the right to archive digital information?
Most digital information is owned by someone. The ease with which
we daily access an enormous range of resources over the Internet
masks the core question of intellectual property rights. Some materials
we access are under explicit licenses (libraries now have experts
who spend their days negotiating such licenses), and most of these
licenses clearly state what rights an institution has to locally
store and manipulate the resource. Archiving as we generally think
of it would not be permitted under most contemporary use licenses.
Other materials are provided over the Internet without
explicit license. However, the fact that there is no access barrier
does not mean there is no archiving barrier. The "free" material
on the Internet may be even more challenging for archiving than licensed
resources are. Because there is no explicit negotiation over these
materials, there is no opportunity for an archive to negotiate the
necessary rights.
In most cases, legal archiving requires an explicit,
voluntary relationship between the archive and the intellectual property
owner. A possibly important exception is the legal provision for
copyright deposit in many countries. National legislation varies
as to whether digital materials are covered under mandatory copyright
deposit, but there is a growing awareness of the need to provide
such coverage. As time passes, national copyright libraries may have
a legal advantage in building archival collections.
Who can afford archiving?
Archiving requires significant resources. Institutions that assumed
an archival role in the paper era may not have the resources to
do so in the digital domain. More than in the physical environment,
digital collections require continual resource spending to keep
them vital. Many physical collections have persisted despite years
of neglect. The digital realm, however, is characterized by continual,
rapid technological change. Unless investments are made regularly
to move materials from platform to platform, and from format to
format, older resources will become unreadable or unusable.
One important economic factor is whether much of the
costly infrastructure required for archiving is already in place
and supported as part of an institution's general operating environment
and required by the institution's mission. Building archiving over
such existing infrastructure can significantly reduce costs.
Whom does the affected community trust?
Archiving is not a disconnected activity; it is intended to support
specific purposes for specific audiences. The questions of who
should be an archive and what intellectual property gets deposited
in an archive are frequently influenced by whom the target user
community trusts. If the user community does not trust the competence,
values, and viability of the archive, the necessary social support
for the archiving activity may be missing.
Organizational Models
At least five organizational models for archives of
scholarly digital materials are commonly in use today.
Discipline-Based Models
A specific discipline often has the primary interest
and motivation to preserve research resources. For this reason, it
is natural that archives are sometimes created within discipline-based
organizations. Two examples of such discipline-based archives are
- The Inter-university Consortium for Political and Social Research
(ICPSR): Housed at the Institute for Social Research at the University
of Michigan, the ICPSR collects survey and economic data sets for
use by social scientists. Its primary sources of data are the Bureau
of the Census and other government agencies and individual scholars
or research projects. The consortium takes responsibility for preserving
deposited data sets and, depending on the likely importance of
a data set, may invest in documentation and reformatting data for
ease of use. The current collection contains about 3,500 studies.
Access to the collection is generally limited to member institutions.
- The Astrophysics Data System (ADS): ADS collects and indexes
the literature of astronomy. It is housed at the Harvard-Smithsonian
Center for Astrophysics. The collection includes both retrospective
literature (much of it digitized by the ADS back to volume 1 of
any collected periodical) and prospective publications. An extensive
system of links connects the ADS to other online information resources.
The ADS has indexed about 2.5 million records. The scanned literature
archive contains about 260,000 articles with a total of 1.9 million
pages. The indexes and much of the collection of ADS are available
to the public, but some of the recent materials can be accessed
only by persons with subscriptions through the original publishers.
Both of these archives were purposely built within
their respective disciplines using significant government funding.
Both have become core resources within their disciplines: most researchers
know about these archives and use their collections regularly, and
it is widely expected that these collections will persist and grow.
This expectation is a key strength of the discipline-based model;
it encourages participation and provides the validation important
to funding sources. In the case of ICPSR, any respectable scholar
is expected to deposit data sets when his or her research on a given
subject is finished; in fact, some funders make eventual deposit
of data sets in ICPSR a condition of funding. This practice allows
others to replicate analyses as part of the normal scholarly process
of validation and to reuse the data for other analyses. Astronomers
commonly expect that all journals in the field will cooperate with
ADS, so that researchers can count on finding the relevant literature
by searching one system. All relevant journals do cooperate, although
some insist that users be connected to the journal's own site to
access articles, rather than have the content served from the ADS.
ICPSR and ADS are funded differently. As a membership organization,
ICPSR receives much of its core operational funding through member
institution subscriptions. If an institution subscribes, its researchers
and students can get copies of all data sets and associated documentation.
ICPSR continues to receive federal funding for some of its activities.
ADS is largely supported by the National Aeronautics and Space Administration
(NASA).
Commercial Services
There are domains where resources important to scholars
are viable as commercial products. Examples are JSTOR and LexisNexis.
- JSTOR is a nonprofit company that provides access to digitized
versions of major journals in several topic areas. It is licensed
by nearly 1,300 colleges and universities, two-thirds of which
are in the United States. In some disciplines (particularly the
social sciences), JSTOR has become a core resource that is heavily
used by scholars and students.
- LexisNexis has built an enormous collection of digital materials,
mainly in law, business, and contemporary affairs. It is largely
oriented toward use by law firms and businesses and derives most
of its income from those markets, although it is also heavily used
by universities. Essentially all the materials used for the study
of contemporary American law are available from LexisNexis, and
it is the single most widely used digital resource provided by
academic libraries.
The advantage of commercial collections is that they
answer the key question of how to financially support digital collections.
It is the willingness of the commercial and legal communities to
pay substantial fees for information access that makes LexisNexis
viable; sales to the academic and research community could never
generate enough income to support this costly collection. Another
advantage of the commercial model is that, because the services must
compete in the marketplace, they have a significant incentive to
continue to add new content and functionality to their products.
Both JSTOR and LexisNexis provide high functionality and attractive
services. The downside of this need for added value is that the
companies require significant capital investment. (In the case of
JSTOR, this came from The Andrew W. Mellon Foundation.)
Because commercial services generally require payment
for access, they are to some degree based on a model of scarcity:
not everyone has access, because not everyone can pay. For scholarly
purposes this is unfortunate, because it is in the interest of scholarship
to have materials as widely available as possible.
Another issue central to the commercial model is that
the intellectual property issues inherent in almost any collection
of digital resources become more pronounced than they are in other
models. When an organization is going to make money by use of someone
else's intellectual property, licensing negotiations become a core
activity. JSTOR and LexisNexis show the effect of such issues. The
LexisNexis collection has experienced continual turmoil in nonlegal
materials, as content owners regularly change their minds about whether
to allow distribution through the LexisNexis system. JSTOR has also
had difficult issues in licensing content, and the publishers of
many journals for which JSTOR provides retrospective content will
not allow the inclusion of more recent digital materials, which the
journals themselves are providing online.
An important issue associated with commercially supported
research collections is continuity. What happens to the collection
if the marketplace changes and the supporting service is no longer
economically viable? LexisNexis is so central to the contemporary
law community that this seems an unlikely possibility, at least at
this point. In the case of JSTOR, however, the issue is real enough
that an endowment has been established to provide for ongoing preservation
of and access to the collection in the event of commercial failure.
Government Agencies
Governments, particularly national governments, frequently
support significant digital collections. National libraries, national
archives, and scientific arms of government are most commonly the
agencies involved. Two examples are
- PubMed Central: PubMed Central is a service of the National Library
of Medicine. It provides access to and archiving for a variety
of electronic journals in medicine. One of the aims of this system
is to make new biomedical literature openly accessible to all less
than a year after its publication.
- PANDORA (Preserving and Accessing Networked Documentary Resources
of Australia): The aim of PANDORA, a project of the National Library
of Australia, is to collect, preserve, and give public access to
Internet resources created in Australia. It is intended to fulfill
the Library's traditional role of ensuring the continuing availability
of "a comprehensive record of Australian history and creative endeavour" in
the age of the Internet.
Although government agencies can be subject to cycles
of funding growth and contraction, they also can command a level
of resources not readily available to nonprofit institutions in the
private sector. Archiving and providing access to resources is frequently
a core mission for government agencies, particularly in documenting
national history and accomplishments in science, culture, and technology.
Because of their prestige, social role, and credibility,
governments can provide a comparatively stable base for archiving.
National libraries are uniquely able to attract content contributions
from a wide variety of corporate and noncorporate entities. Many
national libraries also expect national copyright laws to evolve
to cover the required deposit of digital materials, providing them
with a tool for acquiring content that might otherwise be unavailable
because of concerns about intellectual property rights.
One potential concern about government-based collections
is that they may have an ideological or political bias. Governments
frequently have specific views of history or culture that they wish
either to promote or to suppress, and these views can influence what
is collected. Sensitivities to political influence can also affect
the collecting of unpopular or "unacceptable" materials (for example,
pornography, neo-Nazi or other hate literature, or documents relating
to pedophilia or euthanasia).
Research Libraries
Research libraries are expanding their traditional
role of collection building into digital materials. Two interesting
examples of digital research collections in libraries, both of which
are available at no charge to the public, are
- DSpace: This is a project of the Massachusetts Institute of Technology
(MIT) Libraries that was developed with support from Hewlett-Packard.
Described as a "digital archive to capture and distribute the intellectual
output of MIT faculty," DSpace was originally envisioned as a collection
of electronic preprints and journal articles. Today, the scope
of this archive is widening to encompass research data and course-related
materials.
- arXiv: arXiv is a large collection of digital preprints and journal
articles, mainly in areas of physics and mathematics. Created by
a physicist at the Los Alamos National Laboratory a decade ago,
it has become a basic working tool and communication channel in
some areas of physics. Responsibility for arXiv recently moved
from Los Alamos to the Cornell University Library.
Collecting and providing access to research materials
is core to the mission of research libraries. The question of mission
was part of the motivation for transferring arXiv from the Los Alamos
National Laboratory to the Cornell University Library: Los Alamos
did not consider the support of a collection of research materials
for the general physics and mathematics community central to its
mission; Cornell did.
Research libraries provide the stable home that is
appropriate for materials of persistent value. These libraries have
expertise in collection building, access, and preservation. Most
are beginning to build local infrastructures for housing and preserving
digital resources; for instance, MIT is assuming that the DSpace
infrastructure will serve as a base for other digital resources.
Libraries also frequently have good relationships with the scholars
who create many research resources. Because the libraries have a
high level of credibility, scholars do not hesitate to trust them
to protect and preserve materials.
DSpace is a leading example of what is likely to be
a growing role for libraries in collecting and preserving digital
resources created within their universities. There is growing awareness
among scholars about the inherent fragility of digital materials.
As scholars and their universities seek a locus for the maintenance
of their digital assets, libraries are a natural choice.
The Passionate Individual
Many great collections, particularly those of rare
and ephemeral materials, have been the creation of individuals with
a passionate interest in an area. To some degree, such collecting
has continued in the digital era. Two current archives, both of which
are freely available to the public, are the following:
- The Internet Archive: This archive was conceived and built by
Brewster Kahle, a computer scientist. It gathers and stores Web
pages, mainly through cyclical "crawls" of the entire Internet.
The collection, composed primarily of textual Web pages, already
includes more than 100 terabytes of data and is growing at a rate
of about 100 gigabytes a day.
- The David Rumsey Historical Map Collection: This is a collection
of eighteenth-, nineteenth-, and twentieth-century North and South
American cartographic materials digitized from the collection of
businessman David Rumsey. It includes about 6,500 items from Rumsey's
collection of 150,000. Rumsey collaborated with a specialized software
firm to expand the ability of its software to handle cartographic
materials.
It is extremely difficult to generalize about initiatives
created by one individual. Each project reflects its initiator's topical passion,
financial resources, technical skills and environment, and ability
to inspire others to help in the effort. Rumsey
is working slowly, on a relatively small scale, with the technology
vendor Luna Imaging. The Internet Archive has attracted much interest
and support among technology companies, libraries, collectors, and
other individuals intrigued by Kahle's vision, and it is growing
at a dramatic rate. The archive is based on its founder's technical
knowledge and expertise and on a cooperative arrangement with a technology
company also owned by Kahle.
The sort of Web page collecting being done by the
Internet Archive was widely discussed by others before this service
began. The need to act fast to save some of the ephemeral documentation
of our time that lived only on the Web was widely recognized, but
institutions were reluctant to get involved because of their concern
about intellectual property issues. The scale of the issue immobilized
most; others, such as PANDORA, collected slowly because of the costs
associated with clearing rights. The Internet Archive was
willing to plunge ahead and assume the risk of copyright violation
to ensure that the materials would not be lost.
Personally based digital archives are still new; it
is not possible to predict how they will fare with time. It is possible
that they will follow the path of many parallel collections of the
paper era and, as time passes and those who started them grow older,
will begin to look for institutional homes that can provide stable
environments. On the other hand, the Internet Archive has attracted
considerable outside support and might well represent a new type
of specialized player in the archiving environment: one with
a particular technological and resource-type niche that suits a given
domain of materials. The Internet Archive has begun to provide project
support to the Library of Congress, and the idea of making it an
agent of the Library and assigning it responsibility for Web archiving
in its area of expertise has been discussed.
Summary
The examples of digital archives given in this paper vary enormously
in the scope of their ambitions and collections, their motivations,
the impetus for their creation, and their institutional settings, intended
audiences, and funding sources. This is not surprising; traditional
collecting institutions also varied a fair amount. There may well be
more types of players in the digital arena than in the traditional one: there are few commercial
or discipline-based traditional collections analogous to LexisNexis
or ADS. As digital information grows ever more central to various communities,
the opportunity and need for archiving activities become more obvious,
and the field attracts new players. Because we are only at the beginning
of the digital era, this heterogeneity is likely to grow.
Web Site References
arXiv: http://arxiv.org/
Astrophysics Data System: http://adswww.harvard.edu/
David Rumsey Historical Map Collection: http://www.davidrumsey.com/
DSpace: http://www.dspace.org/
ICPSR: http://www.icpsr.umich.edu/
Internet Archive: http://www.archive.org/
JSTOR: http://www.jstor.org/
LexisNexis: http://www.lexisnexis.com/
PANDORA: http://pandora.nla.gov.au/
PubMed Central: http://www.pubmedcentral.nih.gov/