
Organizational Approaches to Preserving Digital Content


Digital preservation only begins with capturing and storing digital files; to ensure ongoing access to those files, someone must manage them continually. Media degradation and hardware/software dependencies pose risks to data over time. A critical first step is to consider the technical factors involved in managing these risks.5 But preservation also requires developing business models for sustainable repository services; addressing intellectual property constraints that hamper archiving; creating standards for metadata; and training creators, curators, and users in appropriate technologies, among other things.6

Each community of creators and users of digital information has a stake in keeping digital files accessible. Each community must consider its responsibilities for ensuring the longevity of information it deems important. Many in the research community expect that libraries and archives (and, by extension, museums and historical societies) should bear the responsibility for preservation and access in the digital realm, just as they have in the analog realm. However, evidence abounds that these institutions, crucial as they are, cannot fulfill this responsibility alone.

It has been decades since scholars in the humanities held significant responsibility for developing or managing library collections. The post-World War II professionalization of librarianship, together with the increasing specialization of academic disciplines, has tended to distance faculty from the stewardship of information resources on campus. There are notable exceptions, including the oral historians and ethnographers who document behaviors, gather evidence, and create collections. There are also a host of social science disciplines that depend on their practitioners to gather data and deposit them in community archives; one example of such an archive is the Inter-university Consortium for Political and Social Research (ICPSR). The digital transformation is quickly eroding the distance between scholar and custodian, and faculty members are being asked once again to assume roles in the creation, preservation, and dissemination of scholarly resources.

Faculty members are not alone in redefining the scope of their responsibilities in the digital realm. Libraries, publishers, and academic associations that are seriously engaging the challenges of digital preservation are finding themselves in roles that are in some respects unfamiliar. Significant experiments in digital preservation are under way in many arenas, as Greenstein, Smith, and Flecker point out in their essays (see Appendixes 1 and 2). Each effort is bounded by the interests of the participants, the constraints of present technologies, and a dearth of tested models for sustainability. Nonetheless, we have much to learn from them. Looking at the range of institutions and individuals engaged in digital preservation, it is perhaps most instructive to divide these models into two groups: those developed by institutions or enterprises to address their own preservation needs, and those developed by enterprises whose communities of participants cross institutional boundaries.

Enterprise-Based Preservation Services

Research Libraries

Several research libraries are preserving university-created digital assets. Most of the 124 libraries belonging to the Association of Research Libraries (ARL) create digital content, chiefly by converting analog texts and images they already hold in their collections. Of these, a smaller number say that they are managing or intend to manage those collections for long-term access (Library of Congress 2003). The caution captured by the word “intend” reflects the consensus in the library world that there is no way to guarantee ongoing access to digital assets in the same way we can for analog ones. In the analog realm there is agreement about the best way to preserve print-on-paper sources; the challenge is not how to preserve, but how to do so cost-effectively. In the digital realm, however, no such agreed-upon standards exist, at least for the complex objects generated by humanities scholars.

Most relevant to the preservation of Web-based resources are the actions of a few large universities that are building repositories to preserve faculty output and, in some cases, student output.

University of California Libraries (www.cdlib.org and lib.berkeley.edu). The University of California (UC) Libraries are developing a digital preservation program under the aegis of the California Digital Library (CDL) that will serve the entire system of libraries and be distributed across all nine campuses. CDL is shaping itself to be the central node in the UC digital preservation network. Under this scenario, local nodes of the network (in campus libraries, research institutes and laboratories, and museums) can offer specific preservation services for local clientele while relying on the system-wide infrastructure to support common digital preservation needs, from metadata standards to linking services and persistent identifiers. Each campus can customize delivery of centrally held materials to its own users.

The university library system is active in areas that target specific user needs. Through its eScholarship program, it has begun taking in data created by the faculty to manage over time. This program is not prescriptive about what it will take; it sees itself as the place where faculty can deposit data sets, preprints, and other materials that fall outside the purview of a campus library. The CDL is also partnering with other universities and with the San Diego Supercomputer Center to develop models for managing journal literature, government documents, museum objects, and other complex digital objects over time.

DSpace at the Massachusetts Institute of Technology (MIT) (www.dspace.org). Recently inaugurated at the MIT Libraries, DSpace is an institutional repository that will provide basic bit storage, digital object management, and delivery, primarily for MIT faculty research materials. It will develop a format registry specifying documentation and best practices for metadata. Faculty members may contribute content using DSpace’s workflow submission system or contract with the library to process the intake. In collaboration with several research libraries across the country, DSpace is testing its software by conducting interoperability experiments. MIT plans to develop a federation of research libraries using the DSpace concept. It began to take in content at the end of 2002.

DSpace results from a collaboration between MIT and Hewlett-Packard (HP) to develop an open-source turnkey system for digital asset management. HP is interested in developing and bringing to market a digital asset management system that works as easily with unstructured data as it does with structured data. DSpace and HP are investigating ways to standardize content upon “ingest” (deposit into the repository) in ways that can be readily adapted by their federated partners. The team hopes to develop techniques for “personal archiving” so that faculty members can easily address issues of digital preservation when they create materials. Library managers believe that MIT faculty members are generally aware of the risks that threaten the longevity of their intellectual output and that they have an interest in working on a solution. Faculty members may be less aware of the complexity of finding that solution and the cost of implementing it over time. The MIT Libraries managers see DSpace as a way not only to address the preservation problem but also to engage faculty as partners in preservation.

MIT Libraries staff members are developing a business plan that includes a record of what it costs to run DSpace, and they hope to present possible revenue models. They expect the resulting models to be a combination of institutional funding and revenue-generating services, such as reformatting and creation of metadata. DSpace managers plan to accept whatever data faculty members are willing to deposit. At a minimum, they will be able to return to the depositor the bits that were deposited. They are also specifying the file formats whose migration and management DSpace will support. Beyond that, they will do the best they can and hope to learn as much as they can from the experience. (Unlike UVA’s IATH, the MIT Libraries are not committed to preserving software applications, only the content.)

Harvard University Libraries (hul.harvard.edu/ldi). Harvard University Libraries take another approach to establishing a preservation repository. The libraries view digital preservation as the responsibility of many in the campus community, not just a few designated experts in the libraries. They believe that effective preservation will depend on technical and curatorial expertise found throughout the university. At Harvard, individual curators and libraries are responsible for selecting the material to be preserved. (The university has more than 100 libraries among its schools.) Digital content created outside the library system poses problems because standard documentation cannot be enforced at the time of creation, and normalizing all the content for deposit into the archive, when feasible, is usually too labor-intensive. In response to the problems presented by such heterogeneous file formats, the Harvard repository will provide three levels of service (illustrated in the sketch after this list):

  • service for normative formats that the repository will keep renderable
  • service for formats for which the repository will keep the bits intact but will not take responsibility for keeping the files renderable
  • service for all other, more complex, formats
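
These tiers amount to a format registry that maps each file type to a promise of service. The following Python sketch shows the kind of lookup such a registry implies; the format assignments and names are hypothetical, not Harvard’s actual registry or software:

    # A toy format registry illustrating tiered preservation service:
    # normative formats get full renderability service; known but complex
    # formats get bit-level service; everything else gets best effort.
    import os

    FULL = "keep renderable"
    BITS_ONLY = "keep bits intact; renderability not guaranteed"
    BEST_EFFORT = "best-effort service"

    REGISTRY = {
        ".xml": FULL, ".txt": FULL, ".tif": FULL,   # assumed normative formats
        ".pdf": BITS_ONLY, ".jpg": BITS_ONLY,       # assumed bit-level formats
    }

    def service_level(filename: str) -> str:
        ext = os.path.splitext(filename)[1].lower()
        return REGISTRY.get(ext, BEST_EFFORT)

    print(service_level("page.tif"))     # keep renderable
    print(service_level("model.blend"))  # best-effort service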

Stanford University Libraries (http://www-sul.stanford.edu/). Stanford University Libraries (SUL) are building a digital repository that will take in any digital content that Stanford deems worthy of permanent preservation. The content will come from several sources: new digital conversion projects; so-called “legacy” digital content already owned by SUL; digital content purchased from external sources, including e-journals and e-texts; donations (such as archives); ongoing submissions, such as the Stanford Scholarly Communications Service, CourseWork (Stanford’s OKI-based course management system); and metadata related to other SUL initiatives. Over time, Stanford hopes to offer preservation services to the publishers with whom it works through HighWire Press as well as to other off-campus partners.

Another digital preservation endeavor under development is LOCKSS (Lots of Copies Keep Stuff Safe), a system based on the tried-and-true analog preservation strategy of redundancy. LOCKSS ensures that many digital systems in different locations across the continent and abroad retain caches of identical digital content. Now in beta test, LOCKSS enables institutions to create low-cost, persistent digital caches of authoritative versions of HTTP-delivered content. With specially created software, institutions locally collect, store, preserve, and archive authorized content, thus safeguarding their communities’ access to that content while doing no harm to publishers’ business models. Although LOCKSS is now restricted to electronic journals, application to other genres is being explored. The software is distributed as open source.
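
The redundancy principle is simple to illustrate. The sketch below is a minimal caricature of redundancy plus integrity auditing, not LOCKSS’s actual peer-polling protocol; the cache names and contents are invented. Each cache’s copy is fingerprinted, the majority fingerprint is treated as the authoritative version, and dissenting caches are flagged for repair from a peer:

    import hashlib
    from collections import Counter

    def digest(content: bytes) -> str:
        # Fingerprint one cache's copy of a document.
        return hashlib.sha256(content).hexdigest()

    def audit(copies: dict) -> dict:
        # Compare fingerprints across caches; the majority version wins,
        # and any disagreeing cache is flagged for repair from a peer.
        votes = Counter(digest(c) for c in copies.values())
        consensus = votes.most_common(1)[0][0]
        damaged = [site for site, c in copies.items()
                   if digest(c) != consensus]
        return {"consensus": consensus, "repair_needed": damaged}

    # Hypothetical caches holding the same journal issue; one has decayed.
    copies = {
        "cache-a": b"Vol. 12, No. 3 ...",
        "cache-b": b"Vol. 12, No. 3 ...",
        "cache-c": b"Vol. 12, No. 3 ..?",   # bit rot
    }
    print(audit(copies))   # flags cache-c for repair

With many independent caches, the odds that a majority fail in the same way at the same time become vanishingly small; that is the wager the name "Lots of Copies Keep Stuff Safe" announces.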

In the period of transition to whatever digital preservation infrastructure emerges, a major problem is that many scholars experimenting with the most innovative digital technology for research and teaching are not affiliated with major universities such as MIT or UC. On most campuses, if scholars were to turn to the library for help in preparing digital content so that the library could later acquire and preserve it, they might find the staff willing to help but without the means or the infrastructure to support the creation or preservation of digital scholarship.

Academic Disciplines

Scholars who are not affiliated with a major university may be spared the prospect of major data loss because there are discipline-based approaches to archiving data. Examples include the Astrophysics Data System (ADS) and the ICPSR. What characterizes these fields is the quantity and type of data created and used. The data stored by the ICPSR, for example, are in fairly standard and highly structured formats. Information assets that are deemed important are normalized to improve the chances of their persistence. In fields that rely on massive data gathering and computer manipulation, such as genomics, researchers are required by their funders, as well as by their own need for access, to deposit their data into a common database. The data must be submitted in formats upon which the community of depositors and users has agreed.

The contrast between these disciplines and those in the humanities is obvious. Humanistic inquiry is not characterized by teams of scholars, large grant support, or the creation of masses of new data for common use by the field. Even the largest and most robust of the humanities disciplines have learned societies that are not preserving, or planning to preserve, the born-digital resources that their own members create or rely on.

Publishers

Publishers represent another group that is planning for the preservation of digital content. Some publishers that market to the academic community, such as Reed Elsevier, Oxford University Press, the American Physical Society (APS), and the American Geophysical Union (AGU), have committed to delivering digital access services to their core journals. Elsevier Science, for example, guarantees access to back publications for a certain period. This service is a business proposition for them. They offer their authors and subscribers a publication “of record” that can be cited and accessed in the future without users finding broken links. Such a warrant of future citability is important for the academic system of reward and tenure. Other presses that aspire to manage their digital assets for long periods of time are developing in-house systems, though digital asset management is not the same as preservation. Publishers do not design their asset management systems to ensure the preservation of their digital publications in perpetuity, as illustrated by the recent controversy over Elsevier’s deletion of some articles from its database (Foster 2003). Publishers that do wish to provide for such longevity have turned in some cases to libraries (APS is one) to host a mirror site or serve as a dark repository for fail-safe backup.

The AGU is an interesting model. It has complete control over the format in which it publishes and preserves, and it saves everything it publishes. The archive is set up as an independent legal entity, with an endowment that is empowered to manage the archival collection if AGU were to become defunct. The costs of preservation are borne by the readers, who pay a “tax” built into the subscription, and by authors, who pay per-page charges. The archive is not searchable, and it offers no user services. For security reasons, the collections will be copied worldwide. This will be done through arrangements with AGU’s European counterpart and similar organizations. Significantly, AGU does not preserve any of the underlying data that provide the evidence cited in reports. Should scientists of the next century wish to view the data supporting a particular interpretation of, for example, a seismic event discussed in a journal article, they would not be able to do so. The article would be available, but its links to source data would no doubt be broken.

Over time, the loss of the source data may pose a more serious threat than loss of the interpretation of such data in the secondary literature, yet little attention is paid to this problem. As long as the secondary literature plays a decisive role in the promotion and tenure of faculty, scholars, their publishers, and their campus libraries will be motivated to find ways to preserve it. But what about preservation as a public trust? Who is responsible for looking beyond a profession’s incentives for preservation to address the larger national and international need to preserve the data and primary sources that underlie scientific, technological, and scholarly advances? This would seem to be a national imperative that government preservation strategies can and should address, given that many of these data result from federally funded research.7

Government-Sponsored Preservation

Government and state agencies have a legal mandate to maintain records and make them accessible to the public. Now that most government agencies are conducting their business electronically, that mandate is in jeopardy. The major collecting agencies of the federal government (the National Archives and Records Administration [NARA], the National Library of Medicine [NLM], and the National Agricultural Library [NAL]) have programs in place to research and develop electronic records creation and preservation. Their research and development agendas are crucially important to all citizens and should also benefit the academic community. NARA’s research work with the San Diego Supercomputer Center (SDSC) holds the promise of ensuring the future legibility of such structured documents as e-mails, though the Archives is just beginning to operationalize the research results, and the value of the SDSC research for building a scalable and sustainable digital archiving system is unknown. Part of the success of this work depends on the degree of control that a repository has over the file upon accessioning. Businesses and agencies are in a position to mandate the form that official documents are to take. Research libraries do not have that type of control over scholars and the other data creators they serve.

Only two government agencies, the Smithsonian Institution (SI) and the Library of Congress (LC), have collecting policies that include a large amount of the heterogeneous digital content under consideration here. (NLM and NAL collect technical and clinical materials that differ significantly from the special collections found in SI and LC.) Through its institutional archives, the Smithsonian has begun a program to preserve electronic records, and in some cases institutional Web sites, across the many entities that make up the SI. However, none of the SI museums, such as the National Museum of American History, which collects important archives in the history of American invention, has begun to acquire Web-based sources as original source material, and none plans to do so.

The Library of Congress, which receives mandatory deposits of copyrighted works through its Copyright Office, has begun to collect contemporary works in digital formats, including Web sites and materials captured from the Web. More important, through a congressional mandate enacted in 2000 that established the National Digital Information Infrastructure and Preservation Program (NDIIPP), LC received an appropriation of up to $100 million to develop, design, and implement a preservation infrastructure that would create the technical, legal, organizational, and economic means to enable a variety of preservation stakeholders to work collaboratively to ensure the persistence of digital heritage (Library of Congress 2003). LC has proposed that such sectors as higher education, science, and other academic and research enterprises take primary responsibility for collecting, curating, and ensuring the preservation of their own information assets, especially those that are not deposited for copyright protection. The national infrastructure would enable preservation among many actors by engendering agreement on standards, ensuring that intellectual property laws encourage rather than deter preservation and access for educational purposes, and facilitating the building and certification of trusted repositories in a networked environment.

As part of this proposed infrastructure, LC has developed a preliminary technical architecture that would be built to serve as the backbone for a national infrastructure for digital preservation. This distributed architecture starts from the premise that the core functions of libraries and archives, from acquisition to user services, should be disaggregated in a networked environment. It does not envision that every collecting institution would assume the burden of building and maintaining digital preservation repositories; rather, it foresees that a handful of trusted repositories in higher education, such as those discussed above, will be certified through some means to assume a national responsibility for preservation. This scenario also envisions that major creators and users of digital information, such as research universities, would have repositories to manage their own digital output, at least for short-term needs. These repositories would differ from archival repositories because their primary purpose would be to facilitate access and dissemination, not to guarantee fail-safe preservation (see pp. 27-28).

For research universities, publishers, academic disciplines, and government agencies, the incentive to preserve digital materials is to protect institutional or proprietary information assets for future use or, in the case of government bodies, to comply with legal requirements. Preservation is central to the core values of each enterprise. The type of preservation each undertakes, be it short-term asset management for publishers, preservation “in perpetuity” for universities whose mission is to further the creation of knowledge, or records management and selective permanent retention of government records by archives, is shaped by the enterprise and its mission.

Community-Based Preservation Services

What happens to the scholarship created and primary source data collected outside the handful of universities and scientific disciplines that commit to preservation and dedicate resources to support it? Most digital resources that scholars create today have no guarantee of surviving long enough to be acquired for long-term preservation and access by libraries, archives, or historical societies. What services are available to such collecting institutions to meet their own mission-driven goals of continuing to acquire and serve materials of research value that are born digital?

There are now no digital preservation service bureaus that can offer the full range of services needed by such libraries and archives (or creators, for that matter). Nonprofit membership organizations that have served libraries for decades, most notably the Online Computer Library Center (OCLC) and the Research Libraries Group, are developing a variety of preservation services for their members while also engaging in research on metadata standards and other topics that benefit the larger library community. Both organizations hope to develop services that their members not only need but also can and will pay for. The Center for Research Libraries, which has been a central repository for collecting, preserving, and providing access to important but little-used research collections, is also contemplating offering similar services to members for certain classes of digital materials.

JSTOR (www.jstor.org). JSTOR is an example of an archiving service with a business model that promises to be sustainable over time. JSTOR preserves and provides access to digital back files of scholarly journals in the humanities, social sciences, and some physical and life sciences. This nonprofit enterprise, which began with a major investment of seed capital from a foundation, offers a service that is in growing demand. As a service organization, JSTOR is an interesting hybrid that reveals much about how various members of the research community perceive the value of preservation and access. JSTOR is a subscription-based enterprise that defines itself first and foremost as an archiving service. It charges a one-time fee to all subscribers to support the costs of digitizing print journals and managing those files. Many libraries subscribe to JSTOR because they want to offer their users electronic access to these journals, and they may place a much higher value on the access than on the preservation function of JSTOR. Because of the ways that library and university budgets work, most libraries probably pay for JSTOR from their acquisitions funds rather than from preservation budgets. This reality has the perhaps regrettable effect of further hiding from plain sight the costs of preserving analog and digital information resources and the crucial dependence of access on preservation.

It is not yet clear how preservation of digital scholarship will be paid for, or even how much it will cost, in the future, but it will be a cost that cannot be deferred or ignored. JSTOR managers have tried to keep this problem in the foreground and have been documenting what JSTOR usage can tell us about how access to digital secondary literature can affect research strategies and agendas. Much work remains, however, for digital service providers to be able to determine what such services cost, how much of a market they can make for such services, and whether any will offer the kinds of retail services needed by data creators working outside large and securely funded libraries.

The Internet Archive (www.archive.org)8

Another model of preservation, the Internet Archive, merits consideration, in part because of its promise to capture passively (or at least in a largely automated manner) much of what is publicly available on the Web, including many scholar-produced sites under discussion. Since 1996, the Internet Archive has been storing crawls of the Web. It now contains about 250 terabytes, and is the largest publicly available collection on the Web. The broad and wide-ranging crawls it regularly conducts represent about 2 billion pages and cover 40 million sites. The Archive also has several targeted collecting programs that focus on one or more specific site profiles and often are designed to go into the so-called Deep Web for retrieval of complex or otherwise inaccessible sites. The Archive plans to make copies of its data to store elsewhere. It aspires, therefore, to secure physical preservation of Web sites. It does not address the logical preservation that may be needed to search and retrieve complex digital objects over time.

Many people who use the Web, scholars included, see the Internet Archive as a “magic-bullet” solution to the archiving problem. They mistakenly believe that the Internet Archive crawls and preserves all parts of the World Wide Web. Although the Archive can harvest much of the publicly available surface Web, most of the Web is closed to the Archive’s crawlers (Lyman 2002). Sites in the Deep Web that cannot be harvested by crawlers include databases (the sorts of materials that generate responses to queries made “on the fly”); password-protected sites, such as those that require subscription for use (The Wall Street Journal); and sites with robot exclusions (The New York Times). Few sites produced by academic institutions are likely to fall into the latter two categories, but many fall into the first. Although a Web crawl does not require the cooperation of the creator or publisher, and thus can capture staggering amounts of material, it does not regularly penetrate the Deep Web and cannot capture interactive features on the Web. (Parts of the Deep Web are accessible to crawling, though, because they are linked to “surface” sites.) These features pose problems for scholarly innovators who create in multimedia or build querying into their sites.
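
The robots-exclusion barrier, in particular, is mechanical: a well-behaved crawler fetches a site’s robots.txt file and skips anything it disallows. The sketch below shows that check using Python’s standard urllib.robotparser module; the URLs are placeholders, and this illustrates the general convention, not the Internet Archive’s own crawler code:

    from urllib.robotparser import RobotFileParser

    def may_crawl(page_url: str, robots_url: str, agent: str = "*") -> bool:
        # Fetch and parse robots.txt, then ask whether this user agent
        # is permitted to retrieve the page in question.
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(agent, page_url)

    # Placeholder example: a page is harvestable only if robots.txt allows it.
    print(may_crawl("https://example.com/articles/1",
                    "https://example.com/robots.txt"))

A site that disallows robots is, for practical purposes, invisible to harvesting archives, no matter how public its pages appear to a human reader.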

The World Wide Web has neither a center nor a periphery: it is decentralized and boundless. As the Web grows, the managers of the Archive are realizing that they must become selective in their acquisition of content. Indeed, the Internet Archive is approaching a stage that is familiar to the most ambitious and wide-ranging of collectors and collecting institutions: the stage at which it becomes necessary to focus on a set, or subset, of the universe of the possible.

Brewster Kahle, the moving spirit behind the Archive, has a special interest in capturing the underdocumented aspects of contemporary life revealed on the Web. He is encouraging national libraries to reach an agreement to collect sites that originate within their borders, to increase coverage worldwide, and to reduce possible redundancies where they are undesirable. The National Library of Australia (NLA) has been collecting Australian Web sites on PANDORA (Preserving and Accessing Networked Documentary Resources of Australia) for some time. Although such collecting has been outside the framework of any international agreement, PANDORA has been closely watched to see how feasible the approach will be. It turns out that, because the NLA selects and checks for copyright clearances, collecting Web sites, even within a single country domain, is very labor-intensive.

Until recently, the Internet Archive focused on collecting sites. With the debut of the Wayback Machine, however, the Archive offers what one staff member calls “retail” access to the Web, allowing individual users to search for specific sites. The Archive sees a need to develop a library-like workbench of research tools that provide technical and programmatic interfaces to the archived collections at a high level of abstraction. Although the Archive sees itself as sharing many values and functions of research libraries in terms of collecting and preserving, it distinguishes itself from them by its special interest in being a center of innovation and experimentation and in operating alongside, but outside, a larger institution such as a university.

The Internet Archive is supported by philanthropy, government grants, and some contracts for specific purposes, but its financial future is not guaranteed. The largest cost component is content acquisition, and the Archive insists that these costs, which are growing exponentially, must be reduced. The high cost of acquisition, incidentally, seems to be a characteristic feature of digital repositories, be they very inclusive, such as the Internet Archive, or relatively exclusive. The Arts and Humanities Data Service (AHDS) in the United Kingdom determined that a hefty 70 percent of its operating costs goes to acquisition, and most of the rest to access services. Preservation of bits (the “spinning of disks,” as one former AHDS manager put it) has been only a small fraction of the total spending.

The Internet Archive’s commitment to being freely accessible diminishes its opportunities for financial support from libraries or commercial entities. It often crawls material that is under copyright protection without seeking permission first. (It scrupulously follows a policy of removing access to sites on the Wayback Machine when asked to do so by a site’s Webmaster, however.) Although some have suggested that libraries can find at least one potent solution to collection and preservation by contracting with the Archive to collect on their behalf, or simply by supporting the Archive in its present activities, libraries must be daunted by the legal implications of the Archive’s approach to capture. The Archive has successfully collected specific types of sites for the Library of Congress (on presidential elections, September 11, and others), but even LC, which Congress has mandated to acquire copyrighted materials through demand deposit, will have to seek a clear ruling about whether acquiring such sites through Web crawling is within the letter, not just the spirit, of copyright law.

What about the data that the Archive has already amassed? It may well share the fate of many an outstanding private collection and be passed, at some point during or after the collector’s life, to an institution that can care for it indefinitely. The role of the private collector, who identifies and secures for posterity materials of great value that others somehow miss, is unlikely to diminish in the digital realm. Indeed, it is likely to increase.

The Role of Funders in Digital Preservation

All the actors familiar in traditional library collecting have now appeared on the stage: the creators and the disciplines that support them, the publishers, the libraries, and the many services that support libraries. All have a stake in digital preservation, and all have distinctly new roles to play in the digital landscape. But what about funders: the foundations, university governing boards, and federal agencies that have played decisive roles in funding the creation and dissemination of scholarship?

Federal agencies have only recently begun to address long-term access to the digital materials whose creation they fund. The National Science Foundation (NSF), which has had a Digital Libraries Initiative (DLI) program in place for several years, has put more dollars behind the digital library research agenda than any other entity. It was not until last year, however, in response to a request by LC, that the NSF made digital preservation a specific feature of its funding. The Digital Government Program, the DLI, and LC convened librarians and archivists, computer scientists, technologists, and government officials to develop a research agenda for digital preservation, and the NSF intends to put out a call for proposals in 2003.

The other, more modestly funded, federal agencies that support digital library and content development (the Institute of Museum and Library Services [IMLS], NEH, and the National Endowment for the Arts [NEA]) all encourage their grant applicants to describe their plans to preserve the digital content they create. In so doing, they present sustainability as a competitive feature of a grant project. A commitment to preserve digital content is unlikely to become a grant requirement unless preservation services are available to chronically underfunded cultural heritage institutions. But encouraging applicants to plan for such preservation activities at least raises awareness of the need to think about the upkeep of digital assets among institutions that have traditionally focused more on the creation than on the maintenance of content.

A handful of private foundations, including The J. Paul Getty Trust, The Andrew W. Mellon Foundation, and the Alfred P. Sloan Foundation, have funded the creation of digital scholarship. Because of their focus on research and scholarship, these foundations have an interest in ensuring that solutions to the digital preservation problem are found sooner rather than later, and they are thus seeking ways to use their influence to help. As long as preservation appears to be mainly a technical problem, foundations may not identify an active role for themselves. But as has been shown, technology is just one of several challenges to preserving digital content. The Mellon Foundation, for example, has funded the development and assessment of business models that would make preservation a sustainable enterprise. The Foundation’s involvement began with JSTOR, but has extended to several other initiatives already under way, such as DSpace, and to partnerships between publishers and libraries to preserve e-journal content. By encouraging innovative and responsible behaviors, all funders that support higher education can help define the crucial role the scholar must play in preserving digital scholarship.

Some funders incorporate preservation and its costs into their grants. For example, the Arts and Humanities Research Board and the Natural Environment Research Council in Great Britain not only require that grantees deposit their data into a central databank but also make the creators “pay” for archiving their materials by incorporating preservation into the data creation grant (usually 2 percent to 6 percent of the grant). The archeological community in the United States follows a similar practice, in which the commercial entities developing a site tend to pay for preservation. This model can be extended to other disciplines and funding agencies wherever a community repository (such as ADS, ICPSR, or GenBank, the NIH genetic sequence database) makes data deposit feasible.

Higher education administrators and governing boards, important sources of funding for the creation and preservation of scholarship, have remained curiously distant from this issue. Some institutions have made funding both digital scholarship and librarianship campus priorities (the University of Virginia and Harvard University come to mind), but these are the noteworthy exceptions. Seldom have campus executives articulated a vision for the stewardship of university information assets, despite the importance of digital information networks on their campuses.

Campus administrators at the California Institute of Technology are an exception. They have spoken out on the institution’s obligation to preserve and make available the output of the faculty, both in the interest of furthering science and to share with taxpayers the fruits of government investment in science. However, although an institutional repository has been up and running for some time, the volume of contributions from faculty members has been small (Young 2002). It is important to identify the barriers to deposit in this case, for they may suggest how incentives for deposit can be created.

Intellectual property issues around access loom large in the sciences, and that may help explain why it is difficult to get scientists to contribute to institutional archives that are publicly available. DSpace may be instructive in this area, as it will have to allow depositors to remove articles if needed to comply with a given publisher’s mandate. DSpace will, however, retain a record of the article having once been a part of the repository.

Workflow issues are, and will remain, a major barrier to the deposit of scholarly resources into preservation repositories, as the example of HRST shows. If there were a frictionless way to create documents in preservation-friendly formats and to send the files to a repository for safekeeping with the click of a mouse, all without distracting creators from their primary focus, we might see different behaviors emerge. The possibility of automating key aspects of creating preservation-friendly formats, genres, and metadata should rise to the top of the research and funding agendas for research-intensive institutions. This could be one result of a commitment by senior campus administrators to the stewardship of digital information.


FOOTNOTES

5 See Appendix 2 for details.

6 See Appendix 1 for a discussion of the organizational issues important for digital archiving.

7 This subject is identified as a vital part of the science and engineering cyberinfrastructure in National Science Foundation 2003 and National Science Board 2002.

8 The author thanks Raymie Stata of the Internet Archive for information about the Archive and its range of activities.

