Titia van der Werf
Understanding of the issues surrounding digital preservation is growing at a reassuring rate, at least among the nucleus of experts in this field. At the same time, confusion and a sense of being overwhelmed by the complexity of the issues often prevail in the wider community of archives and libraries. The national approaches now being started in the United Kingdom, such as the Digital Preservation Coalition initiative, as well as in countries with national digital strategies, such as the United States, are commendable; at the same time, they can distract institutions from just going ahead and acting. Moreover, some organizations, national archives as well as national libraries, seem to be stuck in the requirements-specification stage and find it difficult to move forward to implementation, perhaps out of fear of making mistakes.
In the Netherlands, we do not have a national strategy yet, but we have advanced quite a bit at the institutional level, especially in my library. The archive community, together with the Ministry of Home Affairs, has been setting up test beds for digital preservation. This paper focuses on two activities: how we are preparing to create a mass storage system and the work we are doing with IBM on long-term digital preservation issues. Like others, we have made mistakes, but we have also made substantial progress. This paper describes what we have done and the lessons learned en route.
Serious Business: Preparing a Mass Storage System
In contrast to most other countries, the Netherlands does not have a legal deposit regime. The national library of the Netherlands makes voluntary deposit agreements with the Dutch Publishers Association. Our agreements with publishers date from about 1974 for printed publications and from 1995 for electronic publications. The latter include offline as well as online publications. To date, we have collected monographs, CD-ROM titles, government reports, and dissertations. Among the serial collections are online journal articles from Elsevier Science and Kluwer Academic Publishers, and we have also received official government publications in digital form.
Altogether, our collection amounts to three terabytes of information. All the CD-ROM information is still in the offline medium; we have not yet transferred it to a storage system. We do not have enough storage capacity in our digital stacks for all digital deposit material. Because we anticipate a great deal of growth and the receipt of all the Elsevier titles in the coming year, getting a reliable, quality-controlled mass storage system in place is one of our priorities.
Different Countries, Common Goals
One of our most important activities has been leading the Networked European Deposit Library (NEDLIB) Project. As this project got under way, the eight participating national libraries entered into discussions about how to set up a digital deposit system. We spent a year talking about our differences. While time-consuming, these discussions were necessary because they gave us a common understanding of the issues at stake. We slowly realized that we needed to identify our common missions, goals, and objectives. We asked “What is common to the digital publications that we receive, and what common solutions can we come up with for this problem?” This exercise laid the foundation for the consensus building that would occur later in the project.
We looked at the deposit process, that is, the workflow for electronic publications. First, a publication gets selected; then it comes in as a deposit and we capture, describe, and identify it. This process continues through the whole workflow, ending at user services. Next, we identified areas where we thought that this process might differ in the digital world from the world of print material. For example, our library has no system in place that can handle the new parts of the process that digital publications require. We have automated cataloging systems and acquisition systems, but even if we use these systems for digital publications as well, we still do not have a storage system or a digital preservation system.
We identified the missing information technology (IT) components. Figure 1 shows how we visualize the existing library systems that support the conventional steps of the workflow. For the steps in this workflow that are not supported by any current system, we realized that we would need to put a new system in place. By following the workflow steps, we could start identifying our requirements for this new system.
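The gap analysis described above can be caricatured in a few lines of code: walk the workflow steps in order and collect the ones no existing library system covers. The step names and system labels below are illustrative, not taken from the NEDLIB documents.

```python
# Hypothetical sketch of the NEDLIB gap analysis: each workflow step is
# paired with the existing library system that supports it, or None where
# a new digital-stacks component is needed.

WORKFLOW = [
    ("select",   "acquisition system"),
    ("capture",  None),                 # no existing system covers this step
    ("describe", "cataloging system"),
    ("identify", "cataloging system"),
    ("store",    None),                 # no storage system for digital objects yet
    ("preserve", None),                 # no digital preservation system yet
    ("serve",    "user services"),
]

def missing_components(workflow):
    """Return the workflow steps not covered by any current system."""
    return [step for step, system in workflow if system is None]

print(missing_components(WORKFLOW))
```

Listing the uncovered steps is exactly how the requirements for the new deposit system were seeded: each `None` entry becomes a module to specify.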
Fig. 1. The NEDLIB workflow for electronic publications
NEDLIB has now developed guidelines for national libraries that want to set up digital deposit collections and systems. The guidelines outline how to design the digital stacks as a separate module in the digital library infrastructure and how to implement the digital stacks as much as possible in conformity with the Open Archival Information System (OAIS) standard. We had been looking extensively at the OAIS standard in the NEDLIB project and were impressed that this model suited our requirements so well.
We also recognized that electronic publications coming in should be transferred to a highly controlled storage environment, and that after they are brought into the digital stacks, all objects should be treated alike. That led to another question: Should we implement separate systems for Web archiving, CD-ROM archiving, and archiving of online journals? We realized that if we did, the process could go on forever; we would have hundreds of systems sitting next to each other. That would not be manageable. Therefore, we decided that we wanted to be able to treat all electronic publications in the same manner, regardless of type.
When the NEDLIB project ended, we returned to our own countries and started implementing local digital stack systems in our libraries. We are all implementing these local deposit systems in different ways, but we hope that through NEDLIB we have gained a common understanding of what we are doing.
IBM’s Implementation Model: A Work in Progress
In the Netherlands we issued a request for information that asked the IT market whether there were products that could provide for system functions according to the NEDLIB/OAIS Model. As a result of the positive reactions from the IT sector, we started a tendering procedure.
IBM Netherlands, which had off-the-shelf products that could support quite a few of the processes that we had identified, was the successful candidate. IBM made it clear from the start that its products would not be able to provide any long-term digital preservation functionality, but it was willing to help us research the issues and look at the requirements of this subsystem for preservation.1
Figure 2 shows the IBM Implementation Model. It depicts the OAIS modules of Ingest, Archival Storage, Access, Data Management and Administration, as well as the NEDLIB-added modules of Preservation, Delivery and Capture, and Packaging and Delivery. The latter two modules interface with existing library systems. These interfacing modules gather everything that is locally defined and variable. For example, at the input side are all the different file formats that publishers use. Because these formats are not generic and change over time, we regard them as external variables instead of internal archive standards. At the output side are different types of customers. The requested archive objects must be tailored to make them fit for use, both now and in the future. Library users’ groups will change over time, and users will become increasingly demanding as technology evolves. In summary, we put everything that is variable into these interfacing modules and everything that is generic into the “black box” that we call our deposit system.
Fig. 2. IBM Implementation Model
IBM has named its system implementation the Digital Archival Information System (DAIS). We have DAIS Version 1.0, whose scope includes a pre-Ingest module for manual loading of submission information packages (SIPs). With this module we can load CD-ROMs and e-books. There is an archival storage unit with backup, disaster recovery facilities, and a data management unit where technical metadata are recorded. Descriptive metadata and structural metadata are put into other existing library systems. The Access module for Dissemination Information Package (DIP) retrieval is complemented with a post-retrieval module for installation of electronic publications on the library workstations.
We are already planning for DAIS Version 2.0 as we anticipate additional functional requirements. We know we need fully automatic batch loading of SIPs, especially for the e-journals that arrive with a certain regularity and frequency. We do not want to do all that by hand, article by article. With DAIS Version 2.0, serials processing and archiving will be automated.
In Version 1.0, we are able to add new SIPs to the system. We know from experience, however, that some submissions need to be replaced or even deleted, even though they are deposit material. For that reason, we also need replace and delete functionality in the DAIS system.
IBM will upgrade the system to Content Manager Version 8.0, thereby adding new functionality as well. The British Library, which is involved in a similar effort with IBM-UK, has expressed its intention to build its deposit system on top of DAIS Version 1.0. Finally, the preservation subsystem still needs to be implemented in future versions of DAIS.
Defining Scope to Avoid Distraction
The IBM implementation has raised many issues outside its immediate scope. Just because we have a system does not mean that it will support the whole workflow. DAIS is only one piece of a large puzzle, and the pieces we still need include, for example, a Uniform Resource Name (URN)-based object identifier system.
We also need batch processing of the associated metadata we receive from the publishers, together with the content files. The metadata should be processed automatically and converted to our own XML DTD format, and they should be loaded into our metadata repository and indexed.
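A minimal sketch of that conversion step, assuming a hypothetical publisher record layout and illustrative element names (our actual DTD is not shown here):

```python
import xml.etree.ElementTree as ET

def publisher_record_to_xml(record):
    """Convert one publisher metadata record (a dict with hypothetical
    field names) into an XML element; the element names here are
    illustrative stand-ins for an in-house DTD."""
    article = ET.Element("article")
    for field in ("title", "journal", "volume", "issue", "doi"):
        if field in record:
            ET.SubElement(article, field).text = str(record[field])
    return article

# Batch conversion: one XML document per incoming record, ready for
# loading into the metadata repository and indexing.
records = [{"title": "An Example Article", "journal": "Example Journal",
            "volume": 12, "issue": 3, "doi": "10.0000/example"}]
docs = [ET.tostring(publisher_record_to_xml(r), encoding="unicode")
        for r in records]
print(docs[0])
```

In production such a step would also validate each document against the DTD and reject malformed records before they reach the repository.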
We need a digital mailroom where publishers can deposit their electronic publications. This should be a controlled area with password protection, virus checking, mirroring, and harvesting capabilities. We also need a preload area for both batch and manual loading. Other needs include user identification and authentication, authorization mechanisms, and collection browsing functionality.
Although we need these functions, we ultimately did not ask IBM to include them in its implementation. We realized that adding new functionality along the way would jeopardize the project budget and schedule. We also wanted a modular architecture, with well-defined functions and supporting technologies. We investigated whether we could implement some things ourselves, or whether there were suitable products on the market. This scoping effort is very important: it ensures that you take only the generic parts into the system and build your digital library infrastructure in a modular way.
What are the digital stacks? Essentially, they are the IBM system and possibly other content management systems as well. We are thinking of implementing a separate system for our digitized collections because we are not going to put those collections in the IBM system.
Fig. 3. Digital Library Infrastructure
Deciding whether to add our digitized collections to the IBM system has been difficult. Our primary goal was the deposit collection. The requirements, therefore, were for managing highly complex and controlled technical metadata to ensure that later on we would be able to migrate, convert, or emulate formats. We had developed several requirements relating to the pre-ingest and post-retrieval processes that were much too heavy for the digitized collections. Digitization is all about providing quick access and putting material on the Web quickly. This contrasts with the requirements for deposit collections. We ultimately decided to create a separate system for the digitized collections.
A Data Model for Long-Term Access
The data model implemented in the IBM system is based on the OAIS model. In one archival information package (AIP), we envision being able to put in either the original version of the electronic publication or a converted version of the electronic publication. Also, we envision keeping the bit-images of installed publications in AIPs. For example, if a CD-ROM publication needs to be installed, we take a clean workstation configuration and install the CD-ROM on it. Then we take a bootable image of the installed CD-ROM and enter that as one image in the AIP. This process allows for access now and, we hope, in the future.
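The AIP idea described above can be sketched as a simple record type: one package may hold the original deposit, a converted version, and/or the bootable image of an installed CD-ROM, plus the technical metadata needed to re-create its environment. The field names below are illustrative, not the DAIS schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArchivalInformationPackage:
    """Toy sketch of an OAIS AIP as described in the text; all names
    are hypothetical stand-ins for the real DAIS data model."""
    identifier: str
    original: Optional[bytes] = None         # files as deposited by the publisher
    converted: Optional[bytes] = None        # migrated/converted version
    installed_image: Optional[bytes] = None  # bootable image of an installed CD-ROM
    technical_metadata: dict = field(default_factory=dict)

# A CD-ROM publication: installed on a clean workstation, imaged,
# and stored with metadata recording the environment it depends on.
aip = ArchivalInformationPackage(
    identifier="example-cdrom-0001",
    installed_image=b"<disk image bytes>",
    technical_metadata={"source_medium": "CD-ROM",
                        "installed_on": "reference workstation, Windows 2000"},
)
print(aip.identifier)
```

Keeping all three payload slots in one package is what lets every object, once inside the digital stacks, be treated alike regardless of type.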
We also envision putting software applications (including platform emulations and virtual machines) in AIPs and operating systems, if need be, as disk images. Of course, hardware platforms cannot be put in an AIP.
In DAIS Version 1.0 we have in place as many long-term preservation hooks as we could think of. The data model, and especially the technical metadata, are important hooks. In terms of technical metadata, we are concentrating on what we really need to record about the technical dependencies between file formats, software applications, operating systems, and hardware platforms. We see this as a prerequisite to ensure access and readability now and in the future.
Fig. 4. Data model of the IBM implementation
A Reference Platform for Manageability
As Ken Thibodeau has said, when you start registering all this technical information and look at each file type for what system it can run on, you find that it can run on up to 100 different systems or configurations. Recording all that will take a good deal of time, and that is what he was visualizing when he created his graph with preservation methods and format types. He put preserving original technology at the “great variety” end of the continuum and contrasted it to preserving the more generic and persistent content of objects.
Recognizing this problem, we considered managing the technical metadata in terms of the concept of a “reference platform.” The reference platform concept tries to freeze the configuration of a workstation or PC for a whole generation of electronic publications, perhaps for five years. The configuration includes the hardware, the operating system (e.g., Microsoft Windows 2000), viewer applications (e.g., an Acrobat reader), and a Web browser, as is shown in Figure 5.
This frozen workstation would cater to a generation of publications: for example, all PDFs published between 1998 and 2002. Everything we receive in the library currently is in PDF format. We hope that with this frozen workstation, we will be able to manage the diversity in configurations that may appear during this period of time. We do not want to support all possible view paths, just the preferred view path. The reference workstation is the preferred view path for a collection of publications for a certain period of time. In this way, we can standardize the view paths we support in our system and make the handling of diverse configurations more manageable.
The technical metadata records the chain of software and hardware dependencies. We create technical metadata to be able to re-create the running environment for access now and in the future. This process will also help us monitor technological obsolescence. The reference platform is a means to make all this more manageable.
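The reference-platform idea reduces to a lookup: map a file format and publication year to one frozen configuration, the preferred view path, instead of enumerating every configuration the file could run on. The example entry echoes the text; the lookup logic and names are our own illustration.

```python
# Sketch of the reference-platform lookup. One frozen workstation
# configuration covers a generation of publications (e.g., PDFs
# published 1998-2002); everything else is out of scope by design.

REFERENCE_PLATFORMS = [
    # (format, first_year, last_year, frozen configuration)
    ("PDF", 1998, 2002, {"os": "Microsoft Windows 2000",
                         "viewer": "Acrobat Reader",
                         "browser": "Web browser"}),
]

def preferred_view_path(file_format, year):
    """Return the single supported configuration for this publication,
    or None if no reference platform has been defined for it."""
    for fmt, first, last, config in REFERENCE_PLATFORMS:
        if fmt == file_format and first <= year <= last:
            return config
    return None  # a gap: the technology watch must define a new platform

print(preferred_view_path("PDF", 2000))
```

A `None` result is itself useful information: it flags a format/period combination for which obsolescence monitoring has not yet produced a supported view path.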
Fig. 5. The Concept of the Reference Platform
New Skills Call for New Jobs
With the workflow for electronic publications, we need new skills and staff. We need staff for the daily operation of the digital stacks, most likely not organized as a new department but as part of our existing processing department where the general cataloging takes place. We need technical catalogers who can handle many different file types and IT configuration management tools. We need technology watch officers who monitor for new formats and trends on the e-publishing market. Further, we need reference platform administrators, digital preservation researchers, and quality control managers.
Assistance from the Computer Scientists
The Koninklijke Bibliotheek has worked with Jeff Rothenberg of The RAND Corporation and IBM’s Raymond Lorie. We began working with Jeff Rothenberg because his emulation theory was new to us and we were interested in solutions. When we asked him to explain his solution, he presented his hypothesis as it stood in 1995 (Rothenberg 1995). As he talked with us and with representatives of other memory institutions, such as the Dutch National Archives, he developed his concept to maturity (Rothenberg 2000). He addressed the problem of needing to build different versions of emulators over time. Consequently, he developed the idea of the virtual machine on which you would be able to extend the life of an emulator.
With Raymond Lorie, we looked at all the PDF files in our digital deposit. He worked on a prototype to demonstrate that you could extract both the data and the logical structure of the document to re-create it at access time, even without the original PDF file. What you would need to keep is the bit-image of the document and a Universal Virtual Computer (UVC) program to interpret the logical data structure of this document, so that you would be able to scroll, search for words, and navigate through the document.
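The data-extraction idea can be caricatured in a few lines: store the document's logical structure in a format-neutral form, and re-create a usable rendering from it at access time. The sketch below is a toy illustration of that principle only, not Lorie's UVC, which interprets an archived program against the stored bit-image.

```python
# Toy illustration of data extraction: the document's logical structure
# is kept as nested, format-neutral data, and small interpreters
# re-create rendering and word search at access time.

logical_document = {
    "title": "Example Report",
    "pages": [
        {"number": 1, "paragraphs": ["First paragraph.", "Second paragraph."]},
        {"number": 2, "paragraphs": ["Final paragraph."]},
    ],
}

def render(doc):
    """Re-create a plain-text view from the stored logical structure."""
    lines = [doc["title"], "=" * len(doc["title"])]
    for page in doc["pages"]:
        lines.append(f"-- page {page['number']} --")
        lines.extend(page["paragraphs"])
    return "\n".join(lines)

def search(doc, word):
    """Word search runs on the logical data; no PDF viewer is needed."""
    return [page["number"] for page in doc["pages"]
            if any(word in para for para in page["paragraphs"])]

print(search(logical_document, "Final"))
```

The point of the demonstration was exactly this separation: once structure and content survive in a neutral form, scrolling, searching, and navigation can be reimplemented on any future platform.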
Lorie’s approach offers a generic way of handling PDFs that could be applicable to other types of data formats, such as JPEG (Lorie 2001). With this approach, you could discard the original PDF, although we have not chosen to do so. We agree with Ken Thibodeau when he says that as long as we can afford it, we should try and keep everything in its original state, even if we know that we will not be able to read it in 10 years’ time. Maybe in 150 years we will be able to read it; who knows?
Talking with computer specialists such as Jeff Rothenberg and Raymond Lorie helped us understand the need to distinguish between two types of electronic publications: executable publications and document-like publications. Executable publications are software programs; they include games and complex publications on CD-ROM. Document-like publications are the simpler data types, such as texts and images, which require only viewers to interpret them. Programs and document-like objects require different technology solutions. We have been investigating virtual machine technology and hardware emulation to solve the problem of hardware dependency, and we have been looking at data extraction for solving the software dependency problem.
Mistakes Bring Wisdom: Lessons Learned
What lessons have we learned in the course of the activities just described? First, the question of scope is important, because otherwise it is easy to get distracted. At one point we mentioned to IBM that our digitized collection needed to be put into this system. When they asked us about our requirements for that type of collection, we answered, “Fast access and fast ingest.” Those requirements, however, contradicted those we had previously specified for our deposit collection. This was confusing, and it took much precious time to clear up the confusion. We decided to forget about the digitized collection and return to our original goal. This shows that you just cannot try to tackle all the problems at once. A step-by-step approach is essential. This does not mean, however, that you can overlook the need for a comprehensive approach, because you always need to keep the big picture in mind.
Modular design is also important, especially the ability to have independent modules so that you use the right technology for the right module and do not create unnecessary dependencies. Each technology (for cataloging, indexing, or storage, for example) changes at its own pace. Technologies should not be too dependent on each other; otherwise, when you change one technology, you will have to upgrade the whole system.
It is important to think about whether you are going to choose IT market solutions or develop your own. We have reorganized our IT department in such a way that we no longer support IT development. Instead, we outsource everything pertaining to development or hire people with the needed development skills. Our policy is to rely as much as possible on IT market products rather than custom-made products, so that we can gain leverage from IT product support and development services.
Technology Problems Need Technology Solutions
Analysis of a list of digital preservation projects recently drawn up by the European Union-National Science Foundation Working Group shows that not all these projects are about long-term digital preservation. Many are about building controlled archives, which is really about storage management rather than long-term preservation. The “Lots of Copies Keep Stuff Safe” (LOCKSS) approach, for example, has often been cited as the answer for digital preservation. But it is not the answer, because it is really nothing more than a very controlled (or maybe uncontrolled!) way of replicating. LOCKSS does not preserve anything in the long term. If a format is obsolete now, it will still be obsolete in the future.
There is an important difference between archiving and long-term access. Archiving is quite straightforward because we are doing it already: selecting, identifying, describing, storing, and managing. Archiving keeps the objects machine-readable and healthy, and it provides access now. The real challenge is long-term access-being able to render the object in a human-understandable form and being able to solve the software and hardware dependencies.
Digital preservation is a technology problem, and it needs a technology solution. Metadata are a means to help us solve the problem, but digital preservation is not a metadata issue. It is the same with access control issues. People talk about dark, dim, and bright archives, but that has to do with access control, not long-term digital preservation.
Spreading the Word for a Shared Problem
Digital preservation is not only the problem of memory institutions. We have new players and potential partners in our midst, such as Warner Brothers, a partner from the film business. The problem we are tackling is shared across our new, e-based society. Businesses, public service and health organizations, schools, and research institutions have a stake in this issue, as do individuals. You have personal files, such as electronic income tax records, digital pictures, and Web pages, that you want to keep for a longer period of time: the tax files because you are required to keep them for at least five years, the digital pictures and Web pages for your grandchildren.
We, as memory organizations, are responsible for raising awareness. Where other sectors in society are not yet fully aware of the need to develop a digital memory, we are the ones that can raise this awareness. The adoption of a Resolution on Preserving our Digital Heritage at the UNESCO general conference in October 2001 has been a very important awareness-raising step.
How can you measure the state of digital preservation? What do we have in place to be able to measure at what development stage we have arrived? It might be useful to look for similarities in another field, such as medicine. The first question is “Do we have patients?” Yes, we have an increasing number of digital archives and collections that are potentially endangered. Do we have doctors? Yes, we have increasing numbers of experts. What are the illnesses? Do we know the symptoms? We do know that there are increasing examples of digital obsolescence, of tapes that cannot be read, file types that are no longer supported by any software, but we do not know how many. Are there research programs? Yes. Research is under way worldwide, and we know that much more needs to be done. We are at the stage of drawing up research agendas. Do we have hospitals? Data recovery centers? I think there are one or two in the world.
Do we know the treatments for the illnesses? Are there medicines and cures? We have advisors and best practices, but not much more than that. Educational programs? Yes, in rising numbers. Do we have emergency kits? These are all questions that beg for answers, and only by raising them can we raise awareness of digital preservation issues across society and across the globe.
All URLs were valid as of July 10, 2002.
Lorie, Raymond A. 2001. Long Term Preservation of Digital Information. In Proceedings of the First ACM/IEEE-CS Joint Conference on Digital Libraries, Roanoke, Virginia, pp. 346-352. New York: ACM Press.
Nieuwenburg, Betty. 2001. Solving Long Term Access for Electronic Publications. D-Lib Magazine 7(11). Available at http://www.dlib.org/dlib/november01/11inbrief.html#NIEUWENBURG.
Rothenberg, Jeff. 1995. Ensuring the Longevity of Digital Documents. Scientific American 272(1): 42-47 (international edition, pp. 24-29).
Rothenberg, Jeff. 2000. An Experiment in Using Emulation to Preserve Digital Publications. RAND-Europe. NEDLIB Report Series 1. Den Haag: Koninklijke Bibliotheek.