In the short term, several actions within reach of both data creators and data repositories will advance the preservation agenda. For the creators, these actions include the following:
- Work with libraries when beginning a project
- Use standard and, when possible, nonproprietary formats
- Declare the intended use and audience
- Declare intended longevity
For the repositories, such actions include the following:
- Work with data creators during all phases of the creation
- Declare policies and capabilities for archiving differing formats
- Take materials into custody for preservation experiments
Beyond these actions, digital scholars should think deeply about developing an informatics for their discipline, as has happened in some data-intensive sciences, so that they are able to create digital objects that share vocabularies and descriptive markup, facilitate shared access to information resources, and allow ready repurposing for teaching and scholarship. Teachers should ensure that their students master the skills needed to use the new technologies. Instruction in digital information literacy and research skills should be as vital a part of a student’s training as is teaching how to work in primary sources or cite authorities appropriately. Research divisions of the learned societies can provide leadership in this area.
Libraries can initiate partnerships with scholars on campus and with learned societies and their publishers to share knowledge and agree on common approaches to data creation and preservation. They can develop transparent digital preservation policies and make them accessible on their Web sites. They can develop depository programs that promise not necessarily to preserve flawlessly in perpetuity but rather to partner with data depositors in experiments that take in formats favored by disciplines and knowledge communities, perform risk assessments on those file formats, explore approaches that reduce format vulnerabilities, and share the results of that work with other data communities.
Looking Ahead
The current lack of provision for the responsible creation, curation, and retention of research data is highlighted in the National Science Foundation’s report on the science and engineering information infrastructure, which addresses the promise of computing capabilities to transform even further and more radically the conduct of basic and applied research (NSF 2003). This report has implications not only for scientific and engineering data; a similar argument could be mounted for the creation, curation, and preservation of nonscientific research data. There is no agency in the humanities with a mission, funding, or standing comparable to that of the National Academy of Sciences. The opportunities for articulating the problem of preserving nonscientific research data are therefore fewer, and, even when persuasive arguments are made, there are far fewer resources to commit to finding and funding solutions.
There are many barriers to digital preservation at this early stage in the development of digital information technologies, but they can be summed up in one phrase: lack of infrastructure. In the academy, and especially within humanities faculties, many scholars, teachers, and students will continue to look to libraries and archives to lead preservation efforts and to make information of high research value available now and into the future. The well-known preprint archive for high-energy physics, arXiv.org, moved from its home at a laboratory in Los Alamos to Cornell University because the lab did not see maintaining a historical record for access in the future as part of its mission. Even as the perception of the library’s value for providing access to information is declining among some on campuses, the value that faculty place on the preservation function of libraries remains high (JSTOR 2002).
The research community must begin to grapple seriously with the nature of resource stewardship in the digital age. What worked in the analog realm might not work as well in the future. One perspective in the heated debate on electronic academic publishing holds that the technology allows radical changes in the creation and distribution of scholarship. Others sense that while technology creates opportunities for doing business better (for example, by lowering publishing and distribution costs), it also has many disadvantages (the expense of creating content in standard formats and the expense of preservation are two big ones). Some libraries are trying to become points of dissemination for scholarly literature in a way that differs radically from their role in the distribution system of print resources.
Libraries, particularly their special collections and archives units, have been the traditional custodians of primary sources, and it is natural to expect that they should continue to play that role. However, while libraries and archives have the curatorial expertise needed to fulfill their roles in the digital arena, they generally lack the technical infrastructure to support the key functions of digital preservation. There is some debate about whether it is advisable, or even possible, for every institution in higher education, or even the largest institutions, to develop the full range of services needed for digital preservation. (For commonly agreed-upon minimum standards for long-term repositories, see Appendix 2.) The digital librarians and archivists who are most deeply engaged in building repositories and preservation services agree that repositories are difficult and expensive to build and maintain. They argue cogently that such repositories will be few and will serve many users, including other libraries. In a distributed network, there do not need to be many.
Others argue that every major university can and should have its own digital repository, although the reasons adduced for having one usually relate more to intellectual property matters surrounding publication than to long-term preservation. A white paper commissioned by the Scholarly Publishing and Academic Resources Coalition (SPARC) expands on one type of repository, designed to be “a component in a restructured scholarly publishing model . . . [and] . . . tangible embodiment of institutional quality” (Crow 2002). The paper advocates for institutional repositories to transform scholarly publishing by allowing libraries to compete with commercial publishers online, and to increase the prestige of the university and build brand identity by showcasing the intellectual property of its faculty. The paper suggests that the disaggregation of functions in the networked environment allows libraries to develop consortia to build and maintain repositories for any number of purposes, including preservation. The SPARC model of repository is, however, intended to be complemented by repositories that do stake a claim for preservation. A reliable chain of referencing in scholarly publishing and the promise of scholarship’s persistence into the future are indispensable for the progress of science and humanities.
One challenge that remains is what happens to those scholarly resources created outside the purview of a large, well-funded research institution with a preservation mandate, such as those seated at the Dibner Institute and George Mason University. These resources share many of the characteristics of other noncommercial assets (or commercially produced assets that have exhausted their profitability) that can quickly become orphans in the world. In this way, they share the fate of most special collections.
Regardless of how this debate turns out, it is clear from the viewpoint of systems design that a robust network of repositories and services for long-term preservation of digital library objects favors a disaggregation of functions and does not require that each preserving institution have its own bit repository. The distributed architecture of preservation that LC proposes in its NDIIPP plan is one that will encourage even the smallest preservation and curatorial institutions to participate because it will allow them to bring their particular expertise to bear on some aspect of stewardship but not require that they replicate all aspects of preservation from bit repository to collections and end-user services. Such a system will address one need already apparent in the digital realm: the need to have in place an infrastructure that will allow both an aggressive rescue function to save endangered information assets and the ability to serve individual institutions, no matter the size, that are conscientious custodians of their digital collections.
The Responsibility for Stewardship
How will we pay for such an infrastructure, and how do we move beyond the incentives born of enlightened self-interest that we see in institutions managing their own information assets?
In the long run, digital technology will force all engaged in the research enterprise, from university president to graduate student and from library director to reference librarian, to rethink stewardship. Like all big challenges, the debate about information stewardship in this transformed landscape should begin with a simple proposition: Everyone who has a stake in access to digital information has a stake in the preservation of digital data. In higher education, that means the debate would be joined by all, with discussions taking place across and among campuses.
It is a debate in which university and college administrators and governors must play a visible role. In many ways, the issue of preservation (that is, the long-term care of information assets, whether or not they have commercial potential or are crucial for lucrative or well-funded areas of research) is the dark side of the debate raging on campuses about scholarly communication, or, to be more precise, about publishing. But underlying the integrity and value of published scientific and scholarly literature are the deep and broad expanses of unpublished data and primary sources on which scientific and humanistic inquiry are based. To continue investing heavily in creating digital information assets without shoring up their long-term accessibility is like building castles on sand.
Today, we can expect that institutions will pay more attention to securing their own information assets into the future, even if that means using outside preservation services. We can press learned societies and the scholarly disciplines they represent to declare and act on their responsibilities to the information sources crucial to their own work. We can ask that all members of the research community not only look after their own near-term interests but also take the long view of the resources on which their professions depend. In the end, this debate affects not only research institutions and their constituents but also the public at large. It is the public that supports a vast research enterprise through federal tax structures that subsidize foundations and private as well as public educational institutions. Those tax structures and the stream of funding that goes into research through federal agencies have been created because our country’s Founders believed that the creation and dissemination of information and knowledge would lead to progress in the arts and sciences. It is not just digital information that is at risk if the academy does not act. It is also the compact between the public and the research-and-development infrastructure that the public supports.