Margaret Hedstrom
Research in digital archiving and long-term preservation is an increasingly popular topic for discussion. Although there have been many calls for research on long-term preservation of digital objects over the past decade, the present environment is especially conducive to defining a research agenda and developing effective research programs.
This paper focuses on three aspects of digital preservation research. It begins with a discussion of needs and opportunities that distinguish current efforts from previous attempts to organize research programs on digital preservation. It then describes some potential frameworks for research. The paper concludes with some recommendations for research programs that are methodologically and conceptually sound as well as useful to a broad community.
Current Needs and Opportunities
Two aspects of the current environment, need and opportunity, provide reason for optimism about the prospects for digital preservation research. Whereas those engaged in previous attempts to articulate digital preservation research issues were forced to spend a great deal of effort simply defining the problem, we now have a firm foundation on which to build research programs. During the past decade or so, libraries, archives, scientific data centers, government agencies, corporations, and private individuals have built significant collections of digital content. Many wonderful collections have been built through retrospective conversion, and a growing amount of born-digital content has been captured, in some way, by widely dispersed organizations. The need for preservation research is no longer a hypothetical question based on the premise that if we create valuable digital content, then someone will have to be concerned with its longevity. The need for preservation is real, and the absence of a preservation strategy is increasingly acknowledged as an obstacle to the full realization of a digital future. Numerous research projects, some of which are still under way, provide a foundation for identifying which approaches seem promising, for honing research methodologies, and for demonstrating the benefits of sound research. The collaboration between the National Archives and Records Administration (NARA) and the San Diego Supercomputer Center, and the Mellon Foundation-funded research projects on e-journal archives are two examples of at least a dozen such projects.
Significant opportunities exist for designing research programs that will advance basic knowledge and will also provide practical tools and solutions for libraries and archives, organizations with significant digital assets, and even for private citizens who are concerned about their own personal digital archives. Three such opportunities deserve mention. First, as part of the Library of Congress (LC)-led initiative to develop a National Digital Information Infrastructure and Preservation Program, the National Science Foundation (NSF) and LC have been working together to develop a research agenda for long-term digital archiving and preservation. A workshop held on April 12 and 13, 2002, focused on the research challenges in digital archiving and on building a national infrastructure for long-term preservation of digital information. Attending this workshop were 50 participants from industry, academia, and the government. The session engendered a sophisticated cross-fertilization of ideas among researchers in archives, information science, digital libraries, and computer science. The report from the workshop, which will be published this summer, will present a research agenda that funding agencies will use to mobilize resources for sponsored research at universities.1
A related, and potentially more significant, development is the recent release of the draft report of the NSF Blue Ribbon Advisory Panel on Cyberinfrastructure (NSF 2002). This panel, chaired by Daniel E. Atkins, investigated the types of investments that NSF and other organizations need to make to create an infrastructure for advanced research in science and engineering. A cornerstone of the panel’s vision for a cyberinfrastructure is a network of knowledge-management institutions for collection building and curation of data, information, literature, and digital objects. The draft report recommends support for 50 to 100 data repositories that are grounded in the domain sciences where NSF funds research. The annual cost of this effort is estimated at $140 million.
A third area of opportunity is international collaboration. The presence of several participants from overseas at this symposium and the presentations by Titia van der Werf and Colin Webb indicate the global aspects of this issue. Researchers and institutions in the United States that are building digital repositories benefit tremendously from work under way in Australia, Germany, the Netherlands, the Nordic countries, the United Kingdom, and elsewhere. Several participants in this symposium are members of the joint Working Group on Digital Preservation Research, funded by the NSF and DELOS, a European Union (EU) initiative to promote excellence in digital libraries in the EU. This is one of several opportunities to coordinate research internationally.
Frameworks for Research and Practical Applications
Despite these opportunities, a tension remains between immediate needs for solutions and the potential lag in transferring research results into practical applications. The initial conclusions of the NSF workshop indicate a consensus on some conceptual models for digital archiving. This consensus underlies the models presented by Kenneth Thibodeau of NARA, which separate physical storage from logical interpretation and distinguish data management from knowledge management. The concept of organizing the digital preservation challenge into a series of components, or layers, in a model architecture provides a basis for distributing responsibility among various types of institutions. Moreover, elements of the basic architecture can change over time without requiring that an archival system be redesigned. There is a strong consensus that this framework is an important step forward, even though there is also agreement that there is no single answer to all digital preservation problems. We need a spectrum of solutions in terms of scale, format types, and institutional responsibilities.
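The layering idea can be illustrated with a minimal sketch. It is not the NARA architecture: the class names, the `text/utf-8` format label, and the choice of compression as the alternative storage technique are all assumptions made for illustration. The sketch shows only the property described above, namely that the physical storage layer can be replaced without touching the layer that interprets the stored bits.

```python
import zlib
from typing import Callable, Dict, Protocol


class Storage(Protocol):
    """Physical layer (hypothetical interface): stores and returns raw bytes."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class PlainStorage:
    """Keeps the bytes exactly as received."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data

    def get(self, key: str) -> bytes:
        return self._blobs[key]


class CompressedStorage:
    """A later storage technology: compresses on write, restores on read."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = zlib.compress(data)

    def get(self, key: str) -> bytes:
        return zlib.decompress(self._blobs[key])


class Archive:
    """Logical layer: records each object's format and interprets its bytes."""

    def __init__(self, storage: Storage) -> None:
        self.storage = storage
        self.formats: Dict[str, str] = {}
        # Assumed format registry with a single illustrative entry.
        self.interpreters: Dict[str, Callable[[bytes], str]] = {
            "text/utf-8": lambda raw: raw.decode("utf-8"),
        }

    def ingest(self, key: str, data: bytes, fmt: str) -> None:
        self.formats[key] = fmt
        self.storage.put(key, data)

    def read(self, key: str) -> str:
        interpret = self.interpreters[self.formats[key]]
        return interpret(self.storage.get(key))


if __name__ == "__main__":
    for storage in (PlainStorage(), CompressedStorage()):
        archive = Archive(storage)
        archive.ingest("memo", "Minutes of the meeting".encode("utf-8"), "text/utf-8")
        print(type(storage).__name__, "->", archive.read("memo"))
```

Because `Archive` depends only on the `put` and `get` operations, either storage class can sit underneath it, which is the sense in which one element of the architecture changes while the rest stays as designed.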
There is a clear sense that effective practices exist for certain types of static digital objects. In those cases where there is a strong basis of knowledge that organizations can use to move forward with implementation, we need to teach practitioners about these methods and practices. As a community, we may need to agree that while these practices are not perfect, they are effective enough to serve as a basis for moving forward with implementation. Such implementations could readily occur in the area of reformatting and conversion of traditional materials for certain formats. We have excellent guidelines and best practices for print documents and images that are oriented to building collections with qualities that make them more easily preserved over the long run. Anne Kenney and Oya Rieger address these practices in Moving Theory Into Practice (2000). The Library of Congress provides sound guidance through the technical requirements developed by its American Memory Program.2 Standards and guidelines endorsed by the Digital Library Federation, such as the Framework of Guidance for Building Good Digital Collections developed by the Institute of Museum and Library Services, also represent community-based best practices.3 Any organization that is building digital collections should follow these guidelines, and funding agencies should require that all sponsored projects conform to them. In the area of static born-digital documents, some models are emerging for converting materials from proprietary formats into extensible markup language (XML). On the other hand, complex and dynamic objects present a significant challenge that requires considerable research. We also have much to do before reaching a consensus on best practices for video, film, recorded sound, and multimedia.
Preservation research can draw on related research in computer science and information science. Digital preservation shares many requirements with well-designed information systems, such as security, authentication, robust models for representation, and sophisticated information retrieval mechanisms. By adapting related research to meet some digital preservation challenges, we will be able to focus on the unique problems of long-term preservation. Participants in the April workshop on digital preservation challenges discussed some of the problems that are unique to long-term preservation.
One unique aspect of preservation is its concern with the long term, where “long term” does not necessarily mean generations or centuries. It may simply mean long enough to be concerned about the obsolescence of technology. In this area, preservation requirements may exceed what information technology vendors typically provide. When long-term preservation spans several decades, generations, or centuries, the threat of interrupted management of digital objects becomes critical. Digital objects cannot be left in an obsolete format and then turned over to a repository after a long period of neglect. This challenge is as much a social and institutional problem as it is a technical one, because for long-term preservation, we rely on institutions that go through changes in direction, purpose, management, and funding.
Considerable research is needed to develop funding and business models for repositories that assume preservation responsibilities. Repositories may be expected to preserve digital resources even though their utility may not become apparent until well into the future and even though the future users are not yet born. Over the long term, new communities of users will emerge with needs and expectations that differ from those of the communities that created the digital content. The challenge of developing economic models for the value and costs of archiving over the long term deserves an entire meeting or conference.
Another factor that distinguishes digital preservation research from many other types of research is the difficulty of knowing whether or not we have solved the problems. We may know when we have failed, but we may not be alive to know whether we have succeeded. This problem requires some challenging thinking about success measures and evaluation criteria.
Methodologies for Research and Knowledge Transfer
How we carry out research may be as important as the topics we choose to investigate. There are some frameworks that can move the field of digital preservation research forward. One recommendation is to disaggregate digital preservation research issues into manageable problems. The principles for this disaggregation are not yet established, but one place to start is by distinguishing between preservation of converted materials and born-digital content, between static and dynamic objects, among different formats, and between different producer and user communities. So far, very small investments have been made in research on a very large problem. If we can develop frameworks that allow people to apply their specialized knowledge and skills to specific problems, we can move forward.
One informative concept comes from Pasteur’s Quadrant by Donald Stokes (1997). Pasteur’s Quadrant breaks down the tired dichotomy between basic and applied research. Using a four-cell matrix, Stokes provides a dynamic model that allows for considerations of use to inform a basic quest for understanding. His examples are Niels Bohr, who was seeking basic understanding; Thomas Edison, who was trying to build something useful; and Louis Pasteur, whose work falls in the quadrant where use, demand, and interest intersect with a quest for basic answers.
Fig. 1. Quadrant Model of Scientific Research.
Source: Stokes 1997, 74. Reprinted with permission by the Brookings Institution Press.
To the extent that we can design research projects that fit into that quadrant of “use-inspired basic research,” we will benefit both from what the academic research community has to offer and from the interesting questions that practitioners present.
Potential research methodologies cover a spectrum, from theory building to exploratory research, simulations, and experiments. One difference between digital preservation research and research on preserving physical objects is that we can make copies of bits or objects and experiment with them. We can run digital objects through a number of processes and get observable and measurable results. Such experiments would allow researchers to compare the results of different preservation strategies in terms of effectiveness, cost, and user acceptance. For example, a series of experiments comparing emulation and migration would allow researchers to conclude that for a particular type of digital object, an emulation approach preserves these specific properties, has these complications, and would cost this amount of money, whereas a migration approach to the same material over three format conversions has these specific consequences and costs this much. We need more concrete evidence and an empirical basis for evaluating different preservation strategies and for deciding which strategy is most appropriate for particular types of resources.
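A toy version of such an experiment can be sketched as follows. Everything in it is an assumption made for illustration: the three conversion functions, the choice of "same sequence of words" as the property being measured, and the treatment of emulation as a strategy that leaves the stored bits untouched. Cost and user acceptance are not modeled. The sketch shows only the mechanics of copying an object, running the copy through a chain of processes, and recording measurable results.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, List, Sequence

Conversion = Callable[[bytes], bytes]


@dataclass
class TrialResult:
    strategy: str
    conversions: int
    bits_identical: bool   # does the checksum of the copy match the original?
    words_preserved: bool  # hypothetical property: same sequence of words


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def words(data: bytes, encoding: str) -> List[str]:
    return data.decode(encoding).split()


# Three illustrative "format conversions"; the last one is deliberately lossy.
def latin1_to_utf8(data: bytes) -> bytes:
    return data.decode("latin-1").encode("utf-8")


def crlf_to_lf(data: bytes) -> bytes:
    return data.replace(b"\r\n", b"\n")


def utf8_to_ascii(data: bytes) -> bytes:
    return data.decode("utf-8").encode("ascii", errors="ignore")


def run_trial(strategy: str, original: bytes,
              conversions: Sequence[Conversion],
              final_encoding: str) -> TrialResult:
    """Apply the conversions to a copy and compare it with the original."""
    copy = bytes(original)
    for convert in conversions:
        copy = convert(copy)
    return TrialResult(
        strategy=strategy,
        conversions=len(conversions),
        bits_identical=checksum(copy) == checksum(original),
        words_preserved=words(copy, final_encoding) == words(original, "latin-1"),
    )


if __name__ == "__main__":
    original = "Café menu\r\nSoupe du jour\r\n".encode("latin-1")
    trials = [
        run_trial("emulation (bits untouched)", original, [], "latin-1"),
        run_trial("migration, two steps", original,
                  [latin1_to_utf8, crlf_to_lf], "utf-8"),
        run_trial("migration, three steps", original,
                  [latin1_to_utf8, crlf_to_lf, utf8_to_ascii], "ascii"),
    ]
    for trial in trials:
        print(trial)
```

In this toy case the two-step migration changes the bits but keeps the words, while the third conversion silently drops the accented character. A real experiment would replace these stand-ins with actual format converters and emulators and would add measures of cost and user acceptance.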
A considerable amount of enthusiasm is building around the idea of creating test beds where a designer or researcher (or more likely a large team of researchers) creates a prototype environment that has metrics that will make it possible to measure the effectiveness of different strategies. The work of the San Diego Supercomputer Center falls under the definition of a test bed, where libraries, archives, and organizations can bring real collections and problems as experimental data sources. A test bed also involves a feedback loop among the people with collections to manage and the people designing and running test beds. Knowledge transfer and technology transfer remain significant challenges. Researchers can do wonderful things in the lab or the test-bed environment, but there is often a huge gap in translating that research into products, services, best practices, and guidelines. Use-inspired research, combined with practitioners’ willingness to test research results and implement effective strategies from the research lab, will benefit all of us involved in the challenges and rewards of digital preservation research.
FOOTNOTES
1 Additional information about this workshop is available at: www.si.umich.edu/digarch/. Support for this workshop was provided by National Science Foundation award #021469. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.
2 Technical guidance is available at: http://memory.loc.gov/ammem/ftpfiles.html.
3 These guidelines are available at http://www.diglib.org/standards.htm.
REFERENCES
All URLs were valid as of July 10, 2002.
National Science Foundation. 2002. Revolutionizing Science and Technology through Cyberinfrastructure: Report of the National Science Foundation Blue Ribbon Advisory Panel on Cyberinfrastructure, Draft 1.0 (April 19). Available at: http://worktools.si.umich.edu/workspaces/datkins/001.nsf.
Kenney, Anne R., and Oya Y. Rieger. 2000. Moving Theory Into Practice. Mountain View, Calif.: Research Libraries Group.
Stokes, Donald E. 1997. Pasteur’s Quadrant. Washington, D.C.: Brookings Institution Press.