Digital Preservation: A Many-Layered Thing: Experience at the National Library of Australia
Acknowledgments: The assistance of the Council on Library and Information Resources (CLIR), the Digital Library Federation, and Documentation Abstracts, Inc., in supporting my participation in this symposium is gratefully acknowledged. This paper draws in places on the work of colleagues at the National Library of Australia, including Margaret Phillips, Kevin Bradley, and Gerard Clifton.
The National Library of Australia (NLA) has made a substantial contribution to digital preservation practice, research, and thinking. In 1995, it established one of the world's first library digital preservation sections. Our PANDORA (Preserving and Accessing Networked Documentary Resources of Australia) archive of online publications, in operation since 1996, has developed into a collaborative national collection. The NLA was an early contributor to international discussion and debate through negotiation of a Statement of Principles for the Preservation of and Long-Term Access to Australian Digital Objects (NLA 1997). The Library also provided input to the Task Force on Archiving of Digital Information set up by the Commission on Preservation and Access and the Research Libraries Group (1996). The NLA has played a leading role in raising and discussing digital preservation issues with other national libraries and in looking for areas of collaboration (Fullerton 1998). Finally, the Library's Preserving Access to Digital Information (PADI) Web site is a tool for keeping up to date on digital preservation developments worldwide.
While these and many other initiatives are evidence of the NLA's commitment to promoting digital preservation practice and thinking, their main value in the context of this symposium resides in the lessons and principles that can be drawn from them. The purpose of this paper is to analyze experience at the National Library of Australia and to determine whether it can offer anything of value to other libraries' digital preservation activities.
Unpacking the Digital Preservation Problem
In Australia as elsewhere, library professionals have been engaged in a quest to find effective solutions to an overwhelmingly complex problem: the preservation of digital information. Experience at the NLA suggests that it is profitable to see digital preservation not as a monolithic problem but as a challenge with many layers. For us, approaching digital preservation from this perspective has been enormously productive.
One set of layers concerns different types of digital collections. For NLA, these collections include online publications, physical format digital publications, digital sound files, image files, corporate records, and a number of other discrete collections. While we recognize that all digital data can be handled the same way, it has taken us nearly 10 years even to start implementing systems that will integrate the management of these collections. It will take more years before we achieve full integration; perhaps we will never do so. This situation reflects our collection-oriented approach: for example, we organize publications collected from the Internet in a different manner than we do the oral history audio files we create; and the way we provide access to corporate records is different from the way we handle open-access collections. Our approach also reflects some technical differences; for example, the range of file formats and software dependencies in the material we collect is much more diverse than that of the digital materials we create ourselves.
For pragmatic reasons, we began by setting up what we believed would be the best way to manage each type of collection, knowing we would have opportunities to bring these systems together as we built knowledge and system capabilities over time.
A second set of layers concerns stages of action. We decided to begin by addressing immediate issues that threatened to rob us of the chance of taking longer-term preservation action. We made a conscious decision to take first steps—intelligent first steps if possible—without knowing all the challenges we would face or how to solve them.
Our most pressing demands were to make some decisions about what we should try to preserve and to put those materials in a safe place. We subsequently added layers of description, control, access, preservation planning, and action, and we are gradually integrating processes. However, the ability to look for staged responses and, when necessary, to separate processes, remains key to understanding and tackling problems.
We have formalized these processes into two broad terms: archiving and long-term preservation. For the NLA, archiving refers to the process of bringing material into an archive; long-term preservation refers to the process of ensuring that archived material remains authentic and accessible. Thus, we quite happily have a manager of digital archiving and a manager of digital preservation who work closely together and understand the subtle but important differences between their roles.
A third set of layers concerns levels of action. We found that we could and should distinguish among intentions, commitments, actions being planned, and actions proven and in place. The differences among these levels are easily and dangerously blurred, for example, when we assume material has been preserved simply because it has been saved to an archive. Recognizing the differences among these levels has spurred us forward.
Infrastructure often seems to be the main enabler in moving from one action level to another. At NLA, we define infrastructure as the tools and systems for managing digital collections, the policies that guide what we do, and our means of sharing information, agreements, and accountability measures. Developing infrastructure takes time and resources, but it is a necessary investment. By developing infrastructure in parallel with archiving action, we have allowed for feedback processes so both activities can inform each other. It is not always necessary, or even desirable, to wait until all the infrastructure is in place before beginning the preservation activity.
A final set of layers concerns responsibility. We were impressed with the approach taken by the Library of the University of California at Berkeley and described in its 1996 Digital Library SunSITE Collection and Preservation Policy (Library of the University of California, Berkeley, 1996). This policy makes a clear distinction between the resources for which it will take archiving responsibility and the other materials available from its site. In trying to establish a distributed, collaborative approach to managing a collection of online publications, NLA has explored responsibility principles expressed in down-to-earth terms such as
- "Everyone doesn't have to do everything."
- "We don't have to do everything all at once."
- "Responsibility can be time limited: it doesn't have to be forever for everyone."
These principles will be addressed in greater detail in the section entitled "A National Model."
To summarize, the concept of layers underlies much of the National Library of Australia's progress in responding to the challenges of digital preservation. While we have often been characterized as advocates of a "just do it" approach, we believe that "just doing it" can be carried out in a systematic, intelligent, and learning-oriented way.
The approach just described is one that tries to respond to the real world of digital information within a framework of evolving conceptualization. Under such an approach, it would be inconsistent to expect the experience in digital preservation at one institution to be a sure guide for every other program. The circumstances in which we operate function as constraints and enablers that help define what we want to achieve and how we go about it. Some explanation of the NLA's circumstances may help others understand what we do and identify commonalities and differences with their own experience.
Australia is a large country, similar in size to the United States but with a much smaller population. Roughly 20 million Australians live in a number of large urban sprawls scattered around the fertile edges of the continent, in regional towns and cities, or in remote communities. Almost every political jurisdiction is characterized by a dichotomy between relatively large urban populations and what we call "the bush," whose inhabitants often have very limited access to the information resources accessible in the cities.
To some extent, the Australian library system reflects these geographic realities. The system also reflects our national history and the foundations of pre-existing colonies that carried many of their roles with them into Federation in 1901. Central libraries with deposit functions and public library systems committed to serve the populace wherever they could efficiently do so were a part of Australia's history. Libraries have a proud place in the Australian ideal of a fair, open, and educated society in which there is both equality of opportunity and reward for initiative and excellence. Such an idealized picture has often been undercut by realities, including the fact that many Australians are denied equality of access to information because of distance, income, education, or background.
In such an environment sits the NLA: working with, leading, and serving a library system made up of many autonomous parts; geographically distant from the majority of Australians, who own it through their taxes; and committed to providing effective information services to all Australians.
It is not surprising that Australians have taken up digital technology and that institutions such as the NLA see the exploitation of digital information as critical to their futures. Without embracing digital information, and without managing and preserving digital resources, the NLA would face increasing irrelevance. Thus, in the 1990s, the Library made a deliberate choice to bring digital information resources and services into its core business. From this decision flows virtually all progress the Library has made in building and managing digital collections and in working with others engaged in similar work.
The NLA is established by law and largely funded by annual federal government appropriations to deliver a number of functions including
- developing and maintaining a national collection relating to Australia and the Australian people
- making material from its collections available for use
- cooperating in library matters with others in Australia and elsewhere
Bringing the management of digital information resources into the Library's core business means applying these functions to such resources.
The NLA manages a range of digital collections. While most attention has been paid to one of these—the PANDORA archive—our programs seek to manage all of the collections for which the Library accepts long-term responsibility. These collections include
- Online publications selected for the National Collection of Australian Online Publications managed in the PANDORA archive. Establishment and management of this collection are described in detail in the sections titled "Collection Building" and "Digital Preservation."
- Physical format digital publications (distributed on diskettes or CD-ROMs). Preservation actions for this collection are described under "Digital Preservation."
- Oral history sound recordings. Preservation actions for this collection are also described under "Digital Preservation."
- Both intentionally and unintentionally deposited manuscript materials on digital carriers. Recovery procedures for inaccessible items are briefly described under "Digital Preservation."
- Digital copies of analog collection items produced in our digitization programs.
- "Born-digital" unpublished pictorial works such as photographs.
- Corporate electronic records.
- Bibliographic and other metadata records.
Collection Building: PANDORA as an Example
Most of the NLA's collection-building activity has involved online publications. For this reason, the following discussion focuses on the PANDORA archive. The PANDORA archive of Australian online publications has been described in many papers available from the NLA Web site (Cathro 2001). This discussion is limited to the points that are most relevant to a broad understanding of what PANDORA is and how it works. Although initiated and managed by the NLA, PANDORA has in recent years developed into a collaboration among a number of partners.
The origins of PANDORA lie in the conviction that the Library has a responsibility to collect and preserve the published national heritage, regardless of format. The Library started discussing options for preserving online electronic information resources in the early 1990s. In spite of predictions that it would be technically too hard and that there would be insurmountable copyright obstacles, we decided to take some exploratory steps and see what progress we could make. Thus, in 1995-96 we appointed an electronic preservation specialist in our Preservation Services branch; set up a cross-program committee to develop guidelines for selecting online publications that should be collected; established an Electronic Unit to select and catalog online publications; and began to experiment with capturing (and sometimes losing) selected publications using cobbled-together, public domain software.
From these uncertain beginnings, PANDORA has developed into an operational National Collection of Australian Online Publications. It contains about 2,200 titles, roughly half of which have multiple instances (i.e., they have been gathered more than once). Roughly a third of the titles in the archive are collected on a regular basis; however, the frequency of capture varies, depending on the gathering regime negotiated with the publication owners.
To build a national collection, NLA works with a number of partners, including ScreenSound Australia (the national film and sound archive) and seven of the country's eight State and Territory libraries. The contributions of partners vary from simply selecting material to be archived, to negotiating with publishers, to programming the harvester to initiate a capture. So far, all the gathered material is stored and managed by the NLA. It will be interesting to see how this responsibility develops; there is an argument for sharing responsibility for storing, preserving, and providing access more equally among our partners, but there is also an argument that it is more efficient and reliable to centralize the storage and preservation functions.
In place of the inefficient harvesting and storage tools originally used, the Library has developed its own Digital Archiving System. This suite of tools has increased the efficiency of operations and made it easier for our partners to participate via a Web interface.
From the beginning, the Library has taken a selective approach to archiving. We believe the reasons for having taken this approach still apply. First, by archiving selectively we are able to focus some resources on quality control. We check each title to ensure that it has been copied completely and with full functionality (as far as is currently possible). Because all publications have been selected for their national significance and long-term research value, we consider this investment of time to be justified.
Second, by archiving selectively we can negotiate with publishers for permission to archive their publications (necessary in the absence of legal deposit legislation for digital publications) and make them accessible online or through dedicated onsite PCs.
While the Library recognizes many advantages in taking a more comprehensive approach to Web archiving, we have yet to be convinced that such advantages outweigh the benefits of quality control and accessibility that we have been able to achieve only while collecting selectively. However, we do not see these approaches as mutually exclusive. We would like to be able to pursue high-quality, ongoing capture for a core body of material selected for its research value, complemented by periodic capture of more comprehensive snapshots of the Australian domain.
Last year we engaged a consultant to look at the feasibility of such an approach. While funding difficulties interrupted this work, we are exploring a number of ways of making our national collection both broad and deep.
Although support for the enactment of legal deposit legislation for electronic publications is emerging, we will still need to communicate with many publishers to negotiate periods of restricted access or assistance with formats that are difficult to gather automatically. For this reason, the NLA is working with Australian publishers to establish a code of practice that would guide us, particularly in dealing with commercial online publications.
PANDORA encounters a number of technical problems, even in its collection-building tasks. An early decision that content should take precedence over format means that, in principle, no publication is excluded simply because it is difficult to capture or manage. This is a noble objective but one that has not always been successful in practice. Despite many years' experience in automating our archiving processes, we still have to handcraft some features, such as applets, to make them work reliably. Our greatest difficulty is with publications structured as databases, which we have been unable to harvest. We plan to do more work in this area because we recognize the difficulty with databases as a major deficiency.
The ultimate purpose of all this effort is improved access. The NLA has long been interested in persistent identification that will keep information resources findable while they are still available. We are currently using an in-house system of persistent identifiers and resolution mechanisms.
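The idea behind a persistent identifier and its resolution mechanism can be illustrated with a small sketch. It is not the Library's in-house system, whose design is not described here; the class name, method names, identifier syntax, and example.org locations are all hypothetical. The sketch assumes only that an identifier stays fixed while the location it resolves to may change.

```python
class PersistentIdResolver:
    """Toy resolver: a stable identifier maps to a changeable location.

    Hypothetical illustration only; not the NLA's actual mechanism.
    """

    def __init__(self):
        self._locations = {}  # identifier -> current location
        self._history = {}    # identifier -> list of earlier locations

    def register(self, identifier, location):
        """Assign a new identifier; refuse to reuse an existing one."""
        if identifier in self._locations:
            raise ValueError("identifier already assigned: %s" % identifier)
        self._locations[identifier] = location
        self._history[identifier] = []

    def relocate(self, identifier, new_location):
        """Record that the resource moved; the identifier itself never changes."""
        old_location = self._locations[identifier]  # KeyError if unknown
        self._history[identifier].append(old_location)
        self._locations[identifier] = new_location

    def resolve(self, identifier):
        """Return the current location, or None if the identifier is unknown."""
        return self._locations.get(identifier)

    def history(self, identifier):
        """Return a copy of the earlier locations, oldest first."""
        return list(self._history[identifier])


resolver = PersistentIdResolver()
resolver.register("example-pi-0001", "http://archive.example.org/store/a/index.html")
resolver.relocate("example-pi-0001", "http://archive.example.org/store2/a/index.html")
print(resolver.resolve("example-pi-0001"))
```

Citations that quote the identifier rather than the location keep working after the move, because only the resolver's table has to be updated.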
Rights management is critical to PANDORA's access arrangements. While the archive has been developed to respect and support rights management for all publications, special procedures and controls have been developed for commercial publications.
From this discussion of collection building for PANDORA, it should be evident that our archiving arrangements continue to evolve as we encounter and deal with a wider range of issues.
Digital Preservation Programs
The Library's digital preservation programs have developed more slowly than has our digital collection building. However, some concrete steps are starting to emerge. Our preservation programs are predicated on a concern to protect and maintain the data stream carrying the archived information, and to maintain, and if necessary to recover, a means of accessing the archived information.
Our six-year experience in active archiving has taught us that the browsers that provide access to online material are remarkably tolerant. It is hard to find any material that cannot be accessed once it has been saved to the archive. This will change, especially as dependencies such as plug-in software are superseded and lost from users' PCs.
The most notable step we have taken with PANDORA has been to design and carry out a trial migration of files affected by the superseding of formatting tags in the HTML standard. Our modest migration does not constitute absolute proof that we can preserve access to the entire archive this way. However, it does suggest that we can quite efficiently make consistent, well-documented changes within files in the archive and produce an outcome that meets our standards for preserving the significant properties of HTML files.
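A migration of this kind can be pictured as a consistent, logged substitution applied to each file. The sketch below is an assumption-laden illustration, not the procedure used in the PANDORA trial: the tag mapping, the function name, and the log layout are hypothetical, and only bare tags without attributes are handled. The three elements chosen (center, u, strike) are ones that HTML 4.01 marks as deprecated.

```python
import re

# Hypothetical mapping: deprecated tag -> (replacement tag, inline style).
TAG_MAP = {
    "center": ("div", "text-align: center"),
    "u": ("span", "text-decoration: underline"),
    "strike": ("span", "text-decoration: line-through"),
}


def migrate_html(text):
    """Return (migrated_text, change_log) for the contents of one HTML file.

    Only attribute-free tags such as <center> are rewritten; anything else
    is left untouched so that the change stays narrow and easy to audit.
    """
    change_log = []
    for old, (new, style) in TAG_MAP.items():
        # "\s*>" after the name keeps <u> from matching <ul> and similar tags.
        open_pattern = re.compile(r"<%s\s*>" % old, re.IGNORECASE)
        close_pattern = re.compile(r"</%s\s*>" % old, re.IGNORECASE)
        text, opened = open_pattern.subn('<%s style="%s">' % (new, style), text)
        text, closed = close_pattern.subn("</%s>" % new, text)
        if opened or closed:
            change_log.append(
                {"tag": old, "replacement": new, "opened": opened, "closed": closed}
            )
    return text, change_log


migrated, log = migrate_html("<CENTER><u>Annual report</u></CENTER>")
print(migrated)
print(log)
```

Keeping the change log alongside the migrated file is what makes the change "well documented": a later reviewer can see which substitutions were made and how often, and can compare open and close counts to spot malformed source markup.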
Physical Format Digital Publications
Our collection of physical format digital publications is not large; it comprises only a few thousand titles. It does, however, contain important material. In working with this collection, our most significant step has been to establish an ongoing regime of transferring information from unstable diskettes to more stable CD-Rs. We are about to experiment with transfer to a mass storage system. These are, again, quite minimal preservation steps aimed at enabling future preservation action.
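One common way to confirm that a carrier-to-carrier transfer has not altered the data stream is to compare checksums before and after the copy. The sketch below assumes this technique for illustration; the paper does not state how the Library verifies its transfers, and the function names and record layout are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_with_fixity(source, target_dir):
    """Copy one file to target_dir and verify the copy against the original.

    Returns a small record that could be kept as evidence of the transfer.
    Raises IOError if the checksums differ.
    """
    source = Path(source)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / source.name

    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)

    if before != after:
        raise IOError("checksum mismatch after copying %s" % source.name)
    return {"file": source.name, "sha256": before, "bytes": target.stat().st_size}
```

The returned record can be stored with the copy so that later checks have a reference value to compare against.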
The Library's collection of more than 35,000 hours of recorded sound has been a nursery for developing our thinking about digital preservation. We began digital recording in the early 1990s but did not begin archiving to a digital format until 1996. Our first archival digital carrier was CD-R, chosen for its manageability and expected reliability over the reasonably short time we intended to retain it. While always expecting to lose access to professional analog audio technology, until recently we have archived to both CD-R and analog tape. A few months ago, we finally dropped the analog part of our archiving strategy and moved to a digital mass storage system, managed through extensive metadata. This has been a rapid development over only five to six years, and most of the collection remains on analog tape in a controlled-climate store. We expect to be using our third or fourth mass storage system before we have copied the entire collection to a digital format.
Because the Library has retained control over the file formats we use and the quality of sound archiving work, we expect to use a straightforward migration path to maintain access to this collection.
There is insufficient space in this paper for a detailed description of our data recovery program for the many undocumented diskettes that emerge from the Library's manuscript collections. Our investments in buying format recognition and translation software, and in developing procedures for using it, have been rewarded by regaining access to some important material (and to quite a lot of junk).
While not prepared to rely on data recovery as a means of ensuring ongoing access, we have come to accept it as a satisfactory method of last resort.
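Format recognition for undocumented files often starts from the leading bytes of the file, which many formats reserve for a fixed signature. The sketch below illustrates that general technique only; it is not the commercial software the Library purchased, and the signature table is a small hypothetical selection rather than a complete registry.

```python
# Leading-byte signatures for a few widely used formats (illustrative subset).
SIGNATURES = [
    (b"%PDF", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"PK\x03\x04", "ZIP archive"),
    (b"{\\rtf", "Rich Text Format document"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "OLE2 compound file"),
]


def identify_format(data):
    """Return a format name for the given leading bytes, or "unknown"."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


def identify_file(path, probe_size=16):
    """Read the first few bytes of a file and identify its format."""
    with open(path, "rb") as handle:
        return identify_format(handle.read(probe_size))


print(identify_format(b"%PDF-1.3\n"))
print(identify_format(b"plain text with no signature"))
```

A result of "unknown" does not mean the file is unrecoverable; many older word processor formats have no reliable signature and need the kind of specialist translation tools and procedures described above.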
Within Australia, it is likely that NLA's efforts in building infrastructure to manage all of these digital collections will be seen as more important than the original initiatives themselves.
The following six types of infrastructure have been important for the Library:
- systems infrastructure
- policy frameworks
- resources (including expertise)
- mechanisms for sharing information
- collaborative agreements
- accountability measures
The key systems infrastructure to support all of our digital collections, and to carry and support the National Collection of Online Publications, is what we call our Digital Services Project. A challenge for the Library has been the lack of systems that could be bought off the shelf. Through procurement exercises starting in 1998, we tried to purchase systems for digital archiving, storage, and digital object management. Of the three, we have managed to buy only the storage system; the other two have had to be developed in-house—a slow and resource-intensive process.
Development of our systems was slowly aligned with the Open Archival Information System (OAIS) Reference Model, which is emerging as a standard framework (CCSDS 2001). The Library began modeling its business processes and data structures before we were aware of OAIS, and we continued to do so without feeling the need to fully adopt OAIS terminology. This apparent willfulness on our part does not seem to have caused either the NLA or anyone else much harm. At the right time, we found we could map our processes quite accurately to OAIS, providing something like an independent endorsement of the OAIS Reference Model.
Another essential tool for managing our digital collections is preservation metadata. Because we could find no existing model that met our needs, in 1999 we undertook development of a preservation metadata model to support all our digital collections. That work has contributed to the efforts of Research Libraries Group (RLG) and OCLC to negotiate a consensus metadata model that could be offered to the world (OCLC 2002).
Policy is the second kind of infrastructure we needed to establish. In developing its Digital Services Project, the Library produced various information papers that serve as policy documents for many of our collection management processes.
More recently, we have set down our intentions regarding ongoing maintenance by releasing a digital preservation policy that addresses the way we manage our own collections as well as the way we wish to work with others (NLA 2002). Hindsight will probably see it as an early and rather unsatisfactory draft, but for now it is having a powerful effect in focusing our preservation efforts.
Like all good policies, this one has spawned an action plan that commits the Library to the following steps:
- Documenting our collections so that we understand what we have and what we have to deal with;
- Understanding and auditing the preservation effects of the ways we manage our collections currently;
- Developing mechanisms to monitor threats, preferably in collaboration with others;
- Defining the significant properties of our collections that must be maintained through our preservation processes; and
- Investigating how we can retain access to software required by our collections, and continuing our practical tests of emulation, migration, and other strategies, on the assumption that we will need to apply different approaches to different kinds of material. For example, we are confident that migration will work for our large, homogeneous collections of digital audio and image surrogates from our digitization programs, whereas emulation will probably be needed for parts of our physical formats collection and PANDORA, supported by ongoing access to software archives. We are also looking at the practicalities of using XML as a format simplification approach and at the use of generic document viewers for nonexecutable files, as currently used for our corporate electronic records.
While the NLA has been pleased to discover what it could achieve without outside funding, it would be foolish to deny that digital archiving and preservation programs are resource-intensive. The Library has had to reallocate quite a few million dollars from other work to achieve what it has been able to do so far in digital preservation. This reallocation has not been without pain, as the Library continues to acquire nondigital collections as rapidly as ever and remains as committed as it ever has been to their good management and preservation. The Library is reaching a point where it will be difficult to make further progress with its digital collections without additional resources and a sustainable business model.
With regard to managing workflows, the Library's practice of placing dedicated teams of specialists inside existing organizational units has proved effective in building the expertise we require without losing contact with the broader institutional culture and direction.
Mechanisms for Sharing Information
We see information sharing as a critical enabler of digital preservation. The most visible manifestation of the Library's commitment to information sharing is PADI, the Web-based subject gateway on preserving access to digital information.
PADI was set up by a group of institutions as a place where we could share, compare, and find information about digital preservation. PADI is not the only good place to go looking, but our friends tell us it is their international subject gateway of choice. While managed by NLA, PADI has a number of contributors and partners, and a recent agreement with the Digital Preservation Coalition in the United Kingdom ensures we will be working together to the benefit of both organizations' users.
Support from the Council on Library and Information Resources (CLIR) has allowed the NLA to pursue an experimental program of identifying and protecting the key resources listed in PADI through the Safekeeping Project. This project is based on a model of extremely distributed management of information resources, principally through self-archiving in compliance with a set of guidelines.
The NLA is committed to working with others in libraries, archives, universities, publishers, government agencies, and elsewhere, both in Australia and overseas. We seek to work collaboratively because our own small steps will take us only part of the way we need to go. We have found that collaboration works best where there is concrete action to be taken and clearly defined expectations on all sides.
It has long been recognized that some kind of certification is required to establish whether archiving arrangements can be trusted to provide adequate preservation guarantees (Task Force 1996; RLG 2002). The long history of cooperation between libraries in Australia may well lead us to look for cooperative ways of demonstrating our mutual accountability. It will be fascinating to watch the development of approaches to certification in other countries with different traditions of cooperation.
A National Model
In thinking about how national models for digital archiving may develop, it is helpful to return to the principles of responsibility mentioned earlier and the impact they have had in Australia.
- "Everyone doesn't have to do everything."
This principle has made it possible for partners to come into PANDORA at a modest level of involvement. It has also allowed some people who do not have an identifiable role to opt out of active archiving.
- "We don't have to do everything at once."
This message has enabled us to focus on collection building for the moment and to look for ways of improving how we manage collections later. It has also helped us accept the constraints and compromises along the way without falling into despair.
- "Responsibility can be time constrained."
This principle has been especially powerful in inviting people to play a role for a defined period without implying a long-term obligation. It also helps us bear in mind that all of our roles may be time constrained and that effective exit strategies and succession plans are essential.
These principles have been useful in helping us approach and develop the building of a national model for distributed digital archiving. However, we believe that they are only valid in the context of some other related principles:
- "We may not all have to do everything, but someone has to do something."
- "Someone must be willing to take a lead on almost all steps."
- "In the last resort, someone must be willing to take responsibility for everything, even if it is only responsibility for a final decision that some information will be lost."
So far, building this national collection has worked well in Australia's library sector. That may have something to do with the NLA's leadership and the strong spirit of cooperation engendered by success. Perhaps out of success in individual sectors, it will be possible to achieve success within other sectors and among sectors, so that we can build a truly national model for archiving and preserving digital information.
All URLs were valid as of July 10, 2002.
Cathro, Warwick, Colin Webb, and Julie Whiting. 2001. Archiving the Web: The PANDORA Archive at the National Library of Australia. Available at http://www.nla.gov.au/nla/staffpaper/2001/cathro3.html.
Consultative Committee for Space Data Systems. July 2001. Draft Recommendation for Space Data System Standards: Reference Model for an Open Archival Information System (OAIS). Available at
Fullerton, Jan. 1998. Developing National Collections of Electronic Publications: Issues to Be Considered and Recommendations for Future Collaborative Actions. Available at http://www.nla.gov.au/nla/staffpaper/int_issu.html.
Library of the University of California, Berkeley. 1996. Digital Library SunSITE Collection and Preservation Policy. Available at http://sunsite.berkeley.edu/Admin/collection.html.
National Library of Australia. 1997. Statement of Principles for the Preservation of and Long-Term Access to Australian Digital Objects. Available at http://www.nla.gov.au/preserve/digital/princ.html.
National Library of Australia. 2002. A Digital Preservation Policy for the National Library of Australia. Available at http://www.nla.gov.au/policy/digpres.html.
OCLC Online Computer Library Center. 2002. OCLC/RLG Preservation Metadata Working Group. Available at http://oclc.org/research/pmwg/.
Research Libraries Group and Online Computer Library Center. 2002. Trusted Digital Repositories: Attributes and Responsibilities. Available at http://www.rlg.org/longterm/repositories.pdf.
Task Force on Archiving of Digital Information. 1996. Preserving Digital Information. Report of the Task Force on Archiving of Digital Information. Washington, D.C.: Commission on Preservation and Access, and Mountain View, Calif.: Research Libraries Group. Available at
Web sites noted in paper:
National Library of Australia Digital Services Project: http://nla.gov.au/dsp