APPENDIX 1 Organizational Models for Digital Archiving • CLIR

Dale Flecker
Harvard University Library
March 2002

Background

Before considering some of the organizational models that have emerged for the archiving of digital scholarly information, it is useful to step back and look at some factors that influence which organizations are likely to become active in this arena. These factors include the following:

What organizations believe that digital archiving is their role?
Digital archiving is complex and costly, and it requires a long-term institutional commitment. Traditionally, few institutions have assumed the role of preserving resources over long periods of time: such institutions have included research and national libraries, records and manuscript archives, museums, and entities interested in documenting their own history. Archiving is not a responsibility assumed lightly. In general, preservation is undertaken by institutions for which it is an explicit part of their social role and a need or expectation of the population they serve.

What organizations have the infrastructure needed for digital archiving?
Digital archiving is a technically complex task and requires a fair amount of infrastructure: appropriate hardware and software, a sound and secure environment, and skilled staff. The increasing capacity and ease of use of desktop or office-level technology may at first glance make infrastructure seem to be less important than it once was; however, the increasing scale of material to be archived, and continual technological change, make the need for a robust and professional infrastructure ever more important. The need for infrastructure has been a factor to date in keeping many smaller, less technically sophisticated institutions from extending their collections to include digital resources. Over time, however, the need for technical infrastructure may become the least important of the factors listed here, because commercial service bureaus such as the incipient OCLC digital archive may be available to handle the technical aspects of archiving.

Who has the right to archive digital information?
Most digital information is owned by someone. The ease with which we daily access an enormous range of resources over the Internet masks the core question of intellectual property rights. Some materials we access are under explicit licenses (libraries now have experts who spend their days negotiating such licenses), and most of these licenses clearly state what rights an institution has to locally store and manipulate the resource. Archiving as we generally think of it would not be permitted under most contemporary use licenses.

Other materials are provided over the Internet without explicit license. However, the fact that there is no access barrier does not mean there is no archiving barrier. The “free” material on the Internet may be even more challenging for archiving than licensed resources are. Because no license is negotiated for these materials, an archive has no opportunity to secure the necessary rights.

In most cases, legal archiving requires an explicit, voluntary relationship between the archive and the intellectual property owner. A possibly important exception is the legal provision for copyright deposit in many countries. National legislation varies as to whether digital materials are covered under mandatory copyright deposit, but there is a growing awareness of the need to provide such coverage. As time passes, national copyright libraries may have a legal advantage in building archival collections.

Who can afford archiving?
Archiving requires significant resources. Institutions that assumed an archival role in the paper era may not have the resources to do so in the digital domain. More than physical collections, digital collections require continual spending to remain usable. Many physical collections have persisted despite years of neglect. The digital realm, however, is characterized by continual, rapid technological change. Unless investments are made regularly to move materials from platform to platform, and from format to format, older resources will become unreadable or unusable.

One important economic factor is whether much of the costly infrastructure required for archiving is already in place and supported as part of an institution’s general operating environment and required by the institution’s mission. Building archiving over such existing infrastructure can significantly reduce costs.

Whom does the affected community trust?
Archiving is not a disconnected activity; it is intended to support specific purposes for specific audiences. The questions of who should be an archive and what intellectual property gets deposited in an archive are frequently influenced by whom the target user community trusts. If the user community does not trust the competence, values, and viability of the archive, the necessary social support for the archiving activity may be missing.

Organizational Models

At least five organizational models for archives of scholarly digital materials are commonly in use today.

Discipline-Based Models

A specific discipline often has the primary interest and motivation to preserve research resources. For this reason, it is natural that archives are sometimes created within discipline-based organizations. Two examples of such discipline-based archives are

The Inter-university Consortium for Political and Social Research (ICPSR): Housed at the Institute for Social Research at the University of Michigan, the ICPSR collects survey and economic data sets for use by social scientists. Its primary sources of data are the Bureau of the Census and other government agencies and individual scholars or research projects. The consortium takes responsibility for preserving deposited data sets and, depending on the likely importance of a data set, may invest in documentation and reformatting data for ease of use. The current collection contains about 3,500 studies. Access to the collection is generally limited to member institutions.
The Astrophysics Data System (ADS): ADS collects and indexes the literature of astronomy. It is housed at the Harvard-Smithsonian Center for Astrophysics. The collection includes both retrospective literature (much of it digitized by the ADS back to volume 1 of any collected periodical) and prospective publications. An extensive system of links connects the ADS to other online information resources. The ADS has indexed about 2.5 million records. The scanned literature archive contains about 260,000 articles with a total of 1.9 million pages. The indexes and much of the collection of ADS are available to the public, but some of the recent materials can be accessed only by persons with subscriptions through the original publishers.

Both of these archives were purposely built within their respective disciplines using significant government funding. Both have become core resources within their disciplines: most researchers know about these archives and use their collections regularly, and it is widely expected that these collections will persist and grow. This expectation is a key strength of the discipline-based model; it encourages participation and provides the validation important to funding sources. In the case of ICPSR, any respectable scholar is expected to deposit data sets when his or her research on a given subject is finished; in fact, some funders make eventual deposit of data sets in ICPSR a condition of funding. This practice allows others to replicate analyses as part of the normal scholarly process of validation and to reuse the data for other analyses. Astronomers commonly expect that all journals in the field will cooperate with ADS, so that researchers can count on finding the relevant literature by searching one system. All relevant journals do cooperate, although some insist that users be connected to the journal’s own site to access articles, rather than have the content served from the ADS.

ICPSR and ADS are funded differently. As a membership organization, ICPSR receives much of its core operational funding through member institution subscriptions. If an institution subscribes, its researchers and students can get copies of all data sets and associated documentation. ICPSR continues to receive federal funding for some of its activities. ADS is largely supported by the National Aeronautics and Space Administration (NASA).

Commercial Services

There are domains where resources important to scholars are viable as commercial products. Examples are JSTOR and LexisNexis.

JSTOR is a nonprofit company that provides access to digitized versions of major journals in several topic areas. It is licensed by nearly 1,300 colleges and universities, two-thirds of which are in the United States. In some disciplines (particularly the social sciences), JSTOR has become a core resource that is heavily used by scholars and students.
LexisNexis has built an enormous collection of digital materials, mainly in law, business, and contemporary affairs. It is largely oriented toward use by law firms and businesses and derives most of its income from those markets, although it is also heavily used by universities. Essentially all the materials used for the study of contemporary American law are available from LexisNexis, and it is the single most widely used digital resource provided by academic libraries.

The advantage of commercial collections is that they answer the key question of how to financially support digital collections. It is the willingness of the commercial and legal communities to pay substantial fees for information access that makes LexisNexis viable; sales to the academic and research community could never generate enough income to support this costly collection. Another advantage of the commercial model is that, because the services must compete in the marketplace, they have a significant incentive to continue to add new content and functionality to their products. Both JSTOR and LexisNexis provide high functionality and attractive services. The downside of this need for added value is that the companies require significant capital investment. (In the case of JSTOR, this came from The Andrew W. Mellon Foundation.)

Because commercial services generally require payment for access, they are to some degree based on a model of scarcity: not everyone has access, because not everyone can pay. For scholarly purposes this is unfortunate, because it is in the interest of scholarship to have materials as widely available as possible.

Another issue central to the commercial model is that the intellectual property issues inherent in almost any collection of digital resources become more pronounced than they are in other models. When an organization is going to make money by use of someone else’s intellectual property, licensing negotiations become a core activity. JSTOR and LexisNexis show the effect of such issues. The LexisNexis collection has experienced continual turmoil in nonlegal materials, as content owners regularly change their minds about whether to allow distribution through the LexisNexis system. JSTOR has also had difficult issues in licensing content, and the publishers of many journals for which JSTOR provides retrospective content will not allow the inclusion of more recent digital materials, which the journals themselves are providing online.

An important issue associated with commercially supported research collections is continuity. What happens to the collection if the marketplace changes and the supporting service is no longer economically viable? LexisNexis is so central to the contemporary law community that this seems an unlikely possibility, at least at this point. In the case of JSTOR, however, the issue is real enough that an endowment has been established to provide for ongoing preservation of and access to the collection in the event of commercial failure.

Government Agencies

Governments, particularly national governments, frequently support significant digital collections. National libraries, national archives, and scientific arms of government are most commonly the agencies involved. Two examples are

PubMed Central: PubMed Central is a service of the National Library of Medicine. It provides access to and archiving for a variety of electronic journals in medicine. One of the aims of this system is to make new biomedical literature openly accessible to all within a year of its publication.
PANDORA (Preserving and Accessing Networked Documentary Resources of Australia): The aim of PANDORA, a project of the National Library of Australia, is to collect, preserve, and give public access to Internet resources created in Australia. It is intended to fulfill the Library’s traditional role of ensuring the continuing availability of “a comprehensive record of Australian history and creative endeavour” in the age of the Internet.

Although government agencies can be subject to cycles of funding growth and contraction, they also can command a level of resources not readily available to nonprofit institutions in the private sector. Archiving and providing access to resources is frequently a core mission for government agencies, particularly in documenting national history and accomplishments in science, culture, and technology.

Because of their prestige, social role, and credibility, governments can provide a comparatively stable base for archiving. National libraries are uniquely able to attract content contributions from a wide variety of corporate and noncorporate entities. Many national libraries also expect national copyright laws to evolve to cover the required deposit of digital materials, providing them with a tool for acquiring content that might otherwise be unavailable because of concerns about intellectual property rights.

One potential concern about government-based collections is that they may have an ideological or political bias. Governments frequently have specific views of history or culture that they wish either to promote or to suppress, and these views can influence what is collected. Sensitivities to political influence can also affect the collecting of unpopular or “unacceptable” materials (for example, pornography, neo-Nazi or other hate literature, or documents relating to pedophilia or euthanasia).

Research Libraries

Research libraries are expanding their traditional role of collection building into digital materials. Two interesting examples of digital research collections in libraries, both of which are available at no charge to the public, are

DSpace: This is a project of the Massachusetts Institute of Technology (MIT) Libraries that was developed with support from Hewlett-Packard. Described as a “digital archive to capture and distribute the intellectual output of MIT faculty,” DSpace was originally envisioned as a collection of electronic preprints and journal articles. Today, the scope of this archive is widening to encompass research data and course-related materials.
arXiv: arXiv is a large collection of digital preprints and journal articles, mainly in areas of physics and mathematics. Created by a physicist at the Los Alamos National Laboratory a decade ago, it has become a basic working tool and communication channel in some areas of physics. Responsibility for arXiv recently moved from Los Alamos to the Cornell University Library.

Collecting and providing access to research materials is core to the mission of research libraries. The question of mission was part of the motivation for transferring arXiv from the Los Alamos National Laboratory to the Cornell University Library: Los Alamos did not consider the support of a collection of research materials for the general physics and mathematics community central to its mission; Cornell did.

Research libraries provide the stable home that is appropriate for materials of persistent value. These libraries have expertise in collection building, access, and preservation. Most are beginning to build local infrastructures for housing and preserving digital resources; for instance, MIT is assuming that the DSpace infrastructure will serve as a base for other digital resources. Libraries also frequently have good relationships with the scholars who create many research resources. Because the libraries have a high level of credibility, scholars do not hesitate to trust them to protect and preserve materials.

DSpace is a leading example of what is likely to be a growing role for libraries in collecting and preserving digital resources created within their universities. There is growing awareness among scholars about the inherent fragility of digital materials. As scholars and their universities seek a locus for the maintenance of their digital assets, libraries are a natural choice.

The Passionate Individual

Many great collections, particularly those of rare and ephemeral materials, have been the creation of individuals with a passionate interest in an area. To some degree, such collecting has continued in the digital era. Current archives, both of which are freely available to the public, include the following:

The Internet Archive: This archive was conceived and built by Brewster Kahle, a computer scientist. It gathers and stores Web pages, mainly through cyclical “crawls” of the entire Internet. The collection, composed primarily of textual Web pages, already includes more than 100 terabytes of data and is growing at a rate of about 100 gigabytes a day.
The David Rumsey Historical Map Collection: This is a collection of eighteenth-, nineteenth-, and twentieth-century North and South American cartographic materials digitized from the collection of businessman David Rumsey. It includes about 6,500 items from Rumsey’s collection of 150,000. Rumsey collaborated with a specialized software firm to expand the ability of its software to handle cartographic materials.

It is extremely difficult to generalize about initiatives created by one individual. Each project reflects its initiator’s topical passion, financial resources, technical skills and environment, and ability to inspire others to help in the effort. Rumsey is working slowly, on a relatively small scale, with the technology vendor Luna Imaging. The Internet Archive has attracted much interest and support among technology companies, libraries, collectors, and other individuals intrigued by Kahle’s vision, and it is growing at a dramatic rate. The archive is based on its founder’s technical knowledge and expertise and on a cooperative arrangement with a technology company also owned by Kahle.

The sort of Web page collecting being done by the Internet Archive was widely discussed by others before this service began. The need to act fast to save some of the ephemeral documentation of our time that lived only on the Web was widely recognized, but institutions were reluctant to get involved because of their concern about intellectual property issues. The scale of the issue immobilized most; others, such as PANDORA, collected slowly because of the costs associated with clearing rights. The Internet Archive was willing to plunge ahead and assume the risk of copyright violation to ensure that the materials would not be lost.

Personally based digital archives are still new; it is not possible to predict how they will fare with time. It is possible that they will follow the path of many parallel collections of the paper era and, as time passes and those who started them grow older, will begin to look for institutional homes that can provide stable environments. On the other hand, the Internet Archive has attracted considerable outside support and might well represent a new type of specialized player in the archiving environment: one with a particular technological and resource-type niche that suits a given domain of materials. The Internet Archive has begun to provide project support to the Library of Congress, and the idea of making it an agent of the Library and assigning it responsibility for Web archiving in its area of expertise has been discussed.

Summary

The examples of digital archives given in this paper vary enormously in the scope of their ambitions and collections, their motivations, the impetus for their creation, and their institutional settings, intended audiences, and funding sources. This is not surprising; traditional collecting institutions also varied a fair amount. There may well be other types of players in the digital arena. There are few commercial or discipline-based traditional collections analogous to LexisNexis or ADS. As digital information grows ever more central to various communities, the opportunity and need for archiving activities become more obvious, and the field attracts new players. Because we are only at the beginning of the digital era, this heterogeneity is likely to grow.

Web Site References

arXiv: http://arxiv.org/
Astrophysics Data System: http://adswww.harvard.edu/
David Rumsey Historical Map Collection: http://www.davidrumsey.com/
DSpace: http://www.dspace.org/
ICPSR: http://www.icpsr.umich.edu/
Internet Archive: http://www.archive.org/
JSTOR: http://www.jstor.org/
LexisNexis: http://www.lexisnexis.com/
PANDORA: http://pandora.nla.gov.au/
PubMed Central: http://www.pubmedcentral.nih.gov/