Utility of the Archival Paradigm in the Digital Environment • CLIR

Information is not a natural category whose history we can extrapolate. Instead, information is an element of certain professional ideologies . . . and cannot be understood except through the practices within which it is constructed by members of those professions in their work.

-Agre (1995)

The principles and practices discussed in the preceding section demonstrate how the archival community constructs information and why this construction needs to be understood and addressed in the digital environment. These principles and practices, independent of the archival construction of information, can also contribute to the management of digital information. Implementing the archival paradigm in the digital environment encompasses the following:

working with information creators to identify requirements for the long-term management of information;
identifying the roles and responsibilities of those who create, manage, provide access to, and preserve information;
ensuring the creation and preservation of reliable and authentic materials;
understanding that information can be dynamic in terms of form, accumulation, value attribution, and primary and secondary use;
recognizing and exploiting the organic nature of the creation and development of recorded knowledge;
identifying evidence in materials and addressing the evidential needs of materials and their users through archival appraisal, description, and preservation activities; and
using collective and hierarchical description to manage high volumes of nonbibliographic materials, often in multiple media.

The archival community is making significant contributions to research and development in the digital information environment in five areas: integrity of information, metadata, knowledge management, risk management, and knowledge preservation. Each area is discussed below with reference to recent and ongoing projects in which the archival community has played a leading role in setting the agenda or integrating the archival perspective. Many of the projects discussed have in common a concern for evidence in information creation, storage, retrieval, and preservation; cross-community collaboration; strategies that use both technological processes and management procedures; development of best practices and standards; and evaluation.

Integrity of Information

Integrity requires a degree of openness and auditability as well as accessibility of information and records for public inspection, at least within the context of specific review processes. Integrity in an information distribution system facilitates and insures the ability to construct and maintain a history of intellectual dialog and to refer to that history over long periods of time.

-Lynch (1994)

Ensuring the integrity of information over time is a prominent concern in the digital environment because physical and intellectual integrity can easily be compromised, whether consciously or unconsciously, and variant versions can easily be created and distributed. This concern has two aspects: checking and certifying data integrity (associated with technical processes such as integrity checking, certification, digital watermarking, steganography, and user authentication protocols) and identifying the intellectual qualities of information that make it authentic (associated with legal, cultural, and philosophical concepts such as trustworthiness and completeness).
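The first of these aspects, technical integrity checking, can be illustrated with a minimal sketch. It assumes a repository that records a SHA-256 digest when an object is accepted and recomputes it later; the function names and the sample record are hypothetical, and the sketch says nothing about the second aspect, the intellectual qualities that make information authentic.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest to serve as a stored fixity value."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare the current bytes against the digest recorded earlier."""
    return fixity_digest(data) == recorded_digest


# Hypothetical record content, used only for illustration.
original = b"Minutes of the board meeting, 12 March 1999."
recorded = fixity_digest(original)  # stored when the object is accepted

# A variant version that differs by a single character.
altered = original.replace(b"12 March", b"13 March")

print(is_unchanged(original, recorded))
print(is_unchanged(altered, recorded))
```

A matching digest shows only that the bits have not changed since the digest was recorded. It cannot show that the object was reliable or complete when it was created.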

Functional requirements for checking and certifying data integrity are particularly well articulated in highly regulated communities such as the pharmaceutical and bioengineering industries. Less well explored is how to identify and preserve the intellectual integrity of information. The intellectual mechanisms by which we come to trust traditional forms of published information include a consideration of provenance, citation practices, peer review, editorial practices, and an assessment of the intellectual form of the information. In the digital environment, information may not conform to predictable forms or may not have been through traditional publication processes; a more complex understanding of information characteristics and management procedures is required for the intellectual integrity of information to be understood. Attempts are often made to implement digital versions of procedures traditionally used in record keeping and archival administration. Such attempts include establishing trusted servers or repositories that can serve as a witness or notary public; distributing information to multiple servers, thus making it harder to damage or eliminate all copies; developing certified digital archives as trusted third-party repositories; and identifying canonical versions of information resources (Commission on Preservation and Access and Research Libraries Group 1996, Lynch 1994).
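One of the procedures named above, distributing information to multiple servers, can be sketched as a simple audit that compares the copies with one another. The sketch below is illustrative only: the server names, the sample content, and the majority rule are assumptions, not a description of any system cited in this report.

```python
import hashlib
from collections import Counter


def audit_replicas(replicas: dict[str, bytes]) -> tuple[str, list[str]]:
    """Return the most common digest and the servers whose copy disagrees."""
    digests = {
        name: hashlib.sha256(data).hexdigest()
        for name, data in replicas.items()
    }
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    suspect = sorted(
        name for name, digest in digests.items() if digest != majority_digest
    )
    return majority_digest, suspect


# Hypothetical copies of one resource held on three servers.
copies = {
    "server_a": b"finding aid, version 1",
    "server_b": b"finding aid, version 1",
    "server_c": b"finding aid, version 1 (edited)",
}

majority, suspect = audit_replicas(copies)
print(suspect)
```

Agreement among copies flags a divergent version but does not by itself identify the canonical one. That judgment still depends on provenance and on the trust placed in each repository.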

Project Prism

Project Prism at Cornell University is concerned with issues of information integrity within digital libraries. It is a four-year collaborative project involving librarians, archivists, computer scientists, evaluation experts, and international testbed participants. The project was recently funded through the National Science Foundation’s Digital Library Initiative to investigate and develop policies and mechanisms for information integrity in digital libraries. The project will focus on five areas (Project Prism 1999):

preservation: long-term survivability of information in digital form;
reliability: predictable availability of information resources and services;
interoperability: open standards that allow the widest sharing of information among providers and users;
security: attention to the privacy rights of information users and the intellectual property rights of content creators; and
metadata: structured information that ensures information integrity in digital libraries.

International Research on Permanent Authentic Records in Electronic Systems (InterPARES)

International Research on Permanent Authentic Records in Electronic Systems (InterPARES) is a three-year project using archival and diplomatics principles to examine the characteristics inherent in digital information objects created by electronic record-keeping technologies in order to establish their authenticity and determine how that authenticity might be maintained over time. The project is funded by several agencies, including the U.S. National Historical Publications and Records Commission and the Social Sciences and Humanities Research Council of Canada. An interdisciplinary team of researchers drawn from archival science, preservation management, library and information science, computer science, and electrical engineering is working with an industry group (primarily the pharmaceutical and biocomputing industries) and major archival repositories, including the national archives of several countries.

The project builds on previous research conducted at the University of British Columbia that examined the preservation of the integrity of electronic records and theoretically defined the concepts of reliability and authenticity in relation to electronic records. That research also identified the procedural requirements and responsibilities for ensuring the reliability of active records and the authenticity of preserved records. The philosophy underlying InterPARES is that the theories and methodologies necessary to ensure the long-term preservation of authentic electronic records must be centered on the nature and meaning of the records themselves. Despite the new media and formats of electronic records, from the perspective of archival science the integral components that identify and authenticate a record have not changed. By combining the principles of diplomatics and archival science, the project is developing a template that can be used to identify requirements for authenticity for different kinds of electronic records and systems that generate records. To apply this template and to understand the extent to which electronic records resemble traditional records, the project is analyzing a variety of electronic information and record-keeping systems, including large-scale object-oriented databases, geographic information systems, dynamic Web resources, and digital music systems in many national legal and organizational contexts. These analyses will be translated into recommended systems-design requirements and authentication processes, record-keeping policies and procedures, and preservation strategies for different types of records (InterPARES Project 1999). Different preservation processes will also be evaluated to ascertain their ability to maintain the elements of different types of records identified as essential to preserving the records’ authenticity.

Although this project is focused on the authenticity requirement of records rather than on more generic forms of information, its findings will likely be relevant to digital information or information systems that need to retain the integrity of physical and intellectual characteristics over time.

Metadata

I would contend that most objects of culture are . . . embedded within context and those contexts are embedded within other ones as well. So a characteristic of cultural objects is they’re increasingly context-dependent. And they’re increasingly embedded in meta-languages.

-Brian Eno (1999)

The term metadata has different meanings depending on the community using it. The library community frequently uses metadata to refer to cataloging and other forms of descriptive information, but it is also used to refer to information about the administration, preservation, use, and technical functionality of digital information resources (Gilliland-Swetland 1998).

With the increasing diversity of distributed and interactive digital information systems comes a need for a metadata infrastructure that can implement the functional requirements of each information community and promote interoperability. The challenge is not just to identify the areas where it is possible to map between different types of metadata. It is also necessary to identify the tensions between the rich and complex metadata sets that individual communities have developed and the need for simpler metadata sets that are easier for nonspecialists to use and systems designers to maintain. For information communities that work with cultural information there are several important elements in ensuring authenticity and facilitating the use of an information object. They include metadata such as contextual description, indications of relationships between collections of materials, annotations that have accrued around information objects, documentation of intellectual property rights, and documentation of processes that the information objects have undergone, such as reformatting and migration. Rich metadata sets that incorporate aspects such as these are essential if the object is to be used to its fullest potential. However, considerable demand exists for leaner metadata that will enable users to move between information systems that might contain different types of materials on the same subject. Some of the most interesting questions that arise from such considerations include the following:

How much of the metadata needs to exist in time and over time to support the evidential qualities of the information?
Where should the necessary metadata reside (within the digital information system, in paper form, or both)?
To what extent are metadata integral components of the information object? (Where does the information object end and the metadata begin?)
To what extent should information professionals be engaged in the design and creation of metadata for the systems that create information objects to ensure that those objects can be managed and preserved later in life?
How can metadata help to ensure that information objects are used optimally by diverse users?

Two examples that illustrate the contributions that archivists have made in the area of metadata are EAD and a suite of metadata projects that were recently conducted in Australia.

Encoded Archival Description (EAD)

Described earlier in this report, EAD is a new archival descriptive standard adopted in the United States and being developed as a potential international standard. A hierarchical, object-oriented way of describing the context and content of archival collections, EAD can be a flexible metadata infrastructure for integrating descriptions with actual digital and digitized archival materials within an archival information system. It can also be mapped into other metadata structures such as MARC. Perhaps EAD’s greatest potential lies in its ability to be manipulated for information retrieval and display without compromising how it documents the provenance, original order, and organic nature of archival collections. As a result, it moves beyond the static concept of the paper finding aid and can facilitate appropriate access for diverse users at the collection and item levels (Gilliland-Swetland 2000b, Pitti 1999).
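The hierarchical character of EAD can be sketched with a short example. The fragment below is heavily simplified and is not a complete or valid EAD instance: it borrows a handful of EAD element names (archdesc, did, unittitle, dsc, c01, c02), and the collection it describes is invented. The sketch shows one way an item-level display might be generated while the collection-level context is carried along with it.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical finding-aid fragment (not a valid EAD document).
FRAGMENT = """
<archdesc level="collection">
  <did><unittitle>Jane Doe Papers</unittitle></did>
  <dsc>
    <c01 level="series">
      <did><unittitle>Correspondence</unittitle></did>
      <c02 level="file">
        <did><unittitle>Letters, 1920-1925</unittitle></did>
      </c02>
    </c01>
    <c01 level="series">
      <did><unittitle>Photographs</unittitle></did>
    </c01>
  </dsc>
</archdesc>
"""


def describe(node, inherited=()):
    """Yield (level, context path) for a node and its nested components."""
    title = node.findtext("did/unittitle")
    path = inherited + (title,)
    yield node.get("level"), " > ".join(path)
    # Components sit directly under a node, or under <dsc> at the top level.
    children = list(node) + [c for dsc in node.findall("dsc") for c in dsc]
    for child in children:
        if child.tag.startswith("c0"):
            yield from describe(child, path)


root = ET.fromstring(FRAGMENT)
entries = list(describe(root))
for level, path in entries:
    print(f"{level}: {path}")
```

Because each entry carries the titles of its parent levels, a file can be retrieved or displayed on its own without losing the record of the series and collection to which it belongs.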

A measure of the utility and sophistication of EAD is the interest it has created in other professional communities. The Online Archive of California (OAC), now part of the California Digital Library, is an example of a multi-institutional database containing encoded finding aids and digitized content drawn from archives and special collections of the University of California, California State University, and numerous other universities and repositories throughout the state. The size and scope of OAC have enabled it to develop best practices for encoding and model evaluation processes and to examine its own usability not only as a scholarly resource but also as a resource for K-12 education (Gilliland-Swetland 2000a, Online Archive of California 1999). A constituent OAC project, Museums in the Online Archive of California (MOAC), which is being conducted by several museums across California, is applying EAD to the description of museum collections. This development has the potential not only to map between the descriptive practices of two professional communities but to integrate access to intellectually related two- and three-dimensional historical and cultural resources that have often been located in different institutions.

SPIRT Recordkeeping Metadata Standards Project

Over the past five years, several metadata projects conducted in Australia have built on the records continuum model by specifying, standardizing, and integrating into active electronic record-keeping systems the kinds of metadata necessary for effective record keeping and for ensuring the long-term management and archival use of essential evidence. These projects include the Victorian Electronic Records Strategy metadata set and the Australian Government Locator Service. The most recent of these projects is the SPIRT (Strategic Partnership with Industry-Research and Training) Recordkeeping Metadata Standards Project for Managing and Accessing Information Resources in Networked Environments Over Time for Government, Commerce, Social and Cultural Purposes, directed by Monash University in association with the National Archives of Australia. This project builds on the work of previous projects and provides a framework for standardizing sets of interoperable record-keeping metadata that can be associated with records from creation through processes such as embedding, encapsulation, or linking to metadata stores. Metadata elements are classified by purpose and are being mapped against related generic and sector-specific metadata sets such as Dublin Core (Records Continuum Research Group 1999). In this way, archivists build a business case for including archival considerations in the workflow because of the need to manage risk and the role of records in supporting organizational decision making.
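Mapping a rich record-keeping metadata set against a leaner generic set can be sketched as a simple crosswalk. In the example below the record-keeping element names on the left are hypothetical and are not taken from the SPIRT project or any published schema. The targets on the right are elements of the Dublin Core element set.

```python
# Hypothetical record-keeping element names mapped to Dublin Core elements.
CROSSWALK = {
    "record_title": "title",
    "creating_agent": "creator",
    "date_created": "date",
    "record_identifier": "identifier",
    "access_conditions": "rights",
}


def to_dublin_core(record: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    """Map a record to Dublin Core and list elements with no equivalent."""
    mapped = {CROSSWALK[key]: value for key, value in record.items() if key in CROSSWALK}
    unmapped = sorted(key for key in record if key not in CROSSWALK)
    return mapped, unmapped


# Invented record used only to exercise the crosswalk.
record = {
    "record_title": "Contract approval memo",
    "creating_agent": "Procurement Office",
    "date_created": "1999-04-02",
    "business_function": "Purchasing",
    "disposal_authority": "Schedule 12",
}

mapped, unmapped = to_dublin_core(record)
print(mapped)
print(unmapped)
```

The unmapped elements make the tension concrete: the leaner set supports discovery across systems, but the contextual elements that document business function and disposal have no place in it and must be kept with the record itself.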

Knowledge Management

Like the term metadata, the term knowledge management is being widely used, although its meaning and how it differs from information management are less than clear. Knowledge management refers to the practices, skills, and technologies associated with creating, organizing, storing, presenting, retrieving, using, preserving, disposing of, and re-using information resources to help identify, capture, and produce knowledge. Knowledge management is often used to create entrepreneurial opportunities by identifying and exploiting an organization’s knowledge capital. Knowledge management activities can include data and metadata mining as well as digital asset management. In many respects, such activities are a logical extension of records management and archival activities such as those under way in Australia. The rationales for building and sustaining electronic records and other digital information resources are derived not only from abstract concepts of information and research needs but from administrative and legal necessity, the corporate bottom line, and institutional or repository enterprise.

Knowledge management systems are often hybrids of born-digital, digitized, and traditional media in the form of organizational records, nonrecord information, and digital products (such as publications or movies). Such systems include digital images and texts as well as sound, moving images, graphics, and animation. They also contain procedural and administrative information such as rights management for digital assets. Whereas digital libraries are built around assumptions about current and potential uses but with few hard data, digital asset management systems are created organically out of organizational activities and the need for agility sufficient to respond to emerging institutional priorities. This way of looking at information resources (regarding their content and metadata as assets with dynamic values and market demand) is a different mindset for many information professionals. It involves adopting a holistic rather than a piecemeal approach to information systems and shifting from a linear to an organic perspective.

The digital asset management approach has been extensively developed by the media industries, particularly publishing and entertainment, where both the product and the information and records associated with its production are primarily digital. In the entertainment industry, studios are hiring archivists with experience in electronic records management to build digital asset management or metadata management systems for the assets created during production. In some cases, a two-phase approach is adopted whereby digital production is handled in a production management system and its contents are created, described, and organized by the primary users. After production is completed, all associated materials are transferred to the asset management system, where the digital asset manager or digital archivist organizes and describes them for secondary use. Metadata are developed to track levels and types of use and allow maximum flexibility in retrieving and interrelating assets.

This approach has tremendous potential for supporting the vision, relevance, utility, and sustainability of digital library and archives resources. It incorporates the interests of the information creator and makes preservation management integral to creation and retention. It offers a new economic and use-based framework to help institutions prioritize selection of information content and decide what and how much metadata to create; which resources to keep online; and which assets to preserve, purge, or allow to decay gradually.

Risk Management

If archivists are to take their rightful place as regulators of an organization’s documentary requirements, they will have to reach beyond their own professional literature and understand the requirements for recordkeeping imposed by other professions and society in general. Furthermore, they will have to study methods of increasing the acceptance of their message and the impact and power of warrant.

-Duff (1998)

Evaluation practices of library and information retrieval systems have traditionally been based on four factors: effectiveness, benefits, cost-effectiveness, and cost-benefit (Lancaster 1979). Research on electronic archival records has postulated another form of evaluation, risk management, borrowed from professions such as auditing, quality control, insurance, and law. Although this concept has not been applied directly to other information environments, it has implications for assessing risk in terms of ensuring the reliability and authenticity, appropriate elimination, and preservation of digital information.

Archivists seeking to develop blueprints for the management of electronic records have undertaken several important projects in recent years. This research showed that electronic records are likely to endure with their evidential value intact beyond their active life only if functional requirements for record-keeping systems design and policies and procedures for record keeping are addressed during the design and implementation of the system. This increases the likelihood that appropriate software and hardware standards will be used, making the records easier to preserve. Records will also be created in such a way that they can be identified, audited, rendered immutable on completion, physically or intellectually removed, and brought under archival control.

Missing from this approach is the motivation for organizations to invest the resources required to implement expensive archival requirements in their active record-keeping systems. With the digital asset management approach discussed previously, the motivation to preserve usable digital information comes from the organization itself and is intimately tied to enterprise management. The Australian metadata projects apply two other strategies. The first is demonstrating that well-designed record-keeping systems and metadata will enhance organizational decision making. The second is risk management: persuading the organization that the resources invested in electronic record keeping will reduce the organizational risk incurred by not complying with archival and record-keeping requirements. Organizations such as public bodies and regulated industries are generally aware of the penalties for noncompliance. Noncompliance by a public body could result in a costly lawsuit. Noncompliance by a regulated industry could result in not getting regulatory approval to market a new product. The cost of noncompliance with record-keeping requirements may be significantly higher than that of compliance. In other environments the risk analysis may be less straightforward because the risks may be less evident or the costs of noncompliance less tangible.
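The risk argument can be reduced to a simple comparison between the cost of compliance and the expected cost of noncompliance. The sketch below assumes that the expected cost is the probability of an adverse event multiplied by its penalty. All figures are invented for illustration, and the function names are hypothetical.

```python
def expected_noncompliance_cost(probability: float, penalty: float) -> float:
    """Expected loss: likelihood of an adverse event times its cost."""
    return probability * penalty


def compliance_is_justified(compliance_cost: float, probability: float, penalty: float) -> bool:
    """True when the expected loss exceeds what compliance would cost."""
    return expected_noncompliance_cost(probability, penalty) > compliance_cost


# Invented figures: a regulated firm facing a large, fairly likely penalty.
regulated = compliance_is_justified(
    compliance_cost=200_000, probability=0.05, penalty=10_000_000
)

# Invented figures: an organization whose risks are small or hard to see.
unregulated = compliance_is_justified(
    compliance_cost=200_000, probability=0.01, penalty=1_000_000
)

print(regulated, unregulated)
```

In the second case the calculation does not favor investment, which mirrors the observation that the analysis is less straightforward where the risks are less evident or the costs less tangible.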

The risk management approach developed by the Recordkeeping Functional Requirements Project at the University of Pittsburgh between 1993 and 1996 greatly influenced subsequent electronic record-keeping research and development projects, including the Australian metadata projects. The Pittsburgh project was an inductive project based on case studies, expert advice, precedents, and professional standards (Cox 1994). There were four main products of the research:

functional requirements: a list of conditions that must be met to ensure that evidence of business activities is produced when needed;
a methodology for devising a warrant for record keeping derived from external authorities such as statutes, regulations, standards, and professional guidelines;
unambiguous production rules formally defining the conditions necessary to produce evidence so that software can be developed and the conditions tested; and
a metadata set for uniquely identifying and explaining terms for future access and for using and tracking records.
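The idea of production rules that software can test may be illustrated with a brief sketch. The three rules below are hypothetical and far simpler than a formal rule set. They are not the Pittsburgh project's rules, which are not reproduced in this report. Each rule is a named condition that a record's metadata either satisfies or fails.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A named, testable condition on a record's metadata."""
    name: str
    check: Callable[[dict], bool]


# Hypothetical conditions, chosen only to illustrate the approach.
RULES = [
    Rule("has_unique_identifier", lambda r: bool(r.get("identifier"))),
    Rule("has_creator", lambda r: bool(r.get("creator"))),
    Rule(
        "is_locked_on_completion",
        lambda r: r.get("status") != "complete" or r.get("locked") is True,
    ),
]


def failed_rules(record: dict) -> list[str]:
    """Return the names of the rules a record does not satisfy."""
    return [rule.name for rule in RULES if not rule.check(record)]


# Invented record: complete but still editable.
draft = {"identifier": "REC-001", "creator": "Finance", "status": "complete", "locked": False}
print(failed_rules(draft))
```

Expressing conditions this way lets a system report which requirement a record fails, rather than only that it fails.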

The contribution of the Pittsburgh project, beyond the development of the functional requirements and metadata set, was the development of the concept of warrant and a methodology for creating a warrant relevant to the individual circumstances of an organization. Warrant relates to the requirements imposed on an organization by external authorities for creating and keeping reliable records. If organizations understand warrant regarding how they manage their electronic record-keeping systems, they can assess the degree of risk they might incur by not managing their systems appropriately (Duff 1998).

Knowledge Preservation

The digital world transforms traditional preservation concepts from protecting the physical integrity of the object to specifying the creation and maintenance of the object whose intellectual integrity is its primary characteristic.

-Conway (1996)

Preservation is arguably the single biggest challenge facing everyone who creates, maintains, or relies on digital information. Awareness of the immense scope of the potential preservation crisis has brought many groups together to experiment with new preservation strategies and technologies. Preserving knowledge is more complex than preserving only media or content. It is about preserving the intellectual integrity of information objects, including capturing information about the various contexts within which information is created, organized, and used; organic relationships with other information objects; and characteristics that provide meaning and evidential value. Preservation of knowledge also requires appreciating the continuing relationships between digital and nondigital information.

The archival mission of preserving evidence over time has resulted in demanding criteria for measuring the efficacy of the range of strategies now being discussed for digital preservation, including migration, emulation, bundling, and persistent object preservation. Projects using archival testbeds are under way in several countries with the aim of understanding the extent to which different strategies work with a range of materials and what limitations need to be addressed procedurally, through the development of new technological approaches, or both.

The Cedars Project

The Cedars Project is a United Kingdom collaboration of librarians, archivists, publishers, authors, and institutions (libraries, records offices, and universities). Working with digitized and born-digital materials, Cedars is using a two-track approach to evaluate different preservation strategies through demonstration projects at U.K. test sites; develop recommendations and guidelines; and develop practical, robust, and scalable models for establishing distributed digital archives (Cedars Project 1999). Cedars is also examining other issues related to the management of digital information, including rights management and metadata.

The Digital Repository Project

The Digital Repository Project of the National Archives of the Netherlands is concerned with the authenticity, accessibility, and longevity of archival records created by Dutch government agencies. The project brings together two important concepts: the emulation technique devised by Jeff Rothenberg and the reference model for an open archival information system (OAIS) developed by the U.S. National Aeronautics and Space Administration, which is being adopted as an ISO standard. The emulation technique involves creating emulators for future computers to enable them to run the software on which archived material was created and maintained, thus recreating the functionality, look, and feel of the material (Rothenberg 1995 and 1999). The OAIS reference model is a high-level record-keeping model developed to assist in the archiving of high-volume information. It delineates the processes involved in the ingestion, storage, administrative and logistical maintenance, intellectual metadata management, and access and delivery of electronic records (Sawyer and Reich 1999).
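The processes that the OAIS reference model delineates can be pictured as a small pipeline running from ingestion through storage to delivery. The sketch below is a loose illustration and not an implementation of the reference model: the class names, method names, and sample record are hypothetical, and maintenance and metadata management are reduced to a stored fixity value and a metadata dictionary.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivalPackage:
    """Content stored together with its description and a fixity value."""
    identifier: str
    content: bytes
    descriptive_metadata: dict
    fixity: str


class Repository:
    """Toy repository covering ingestion, storage, and delivery only."""

    def __init__(self) -> None:
        self._storage: dict[str, ArchivalPackage] = {}

    def ingest(self, identifier: str, content: bytes, metadata: dict) -> ArchivalPackage:
        """Accept a submission and place a package in storage."""
        package = ArchivalPackage(
            identifier, content, dict(metadata), hashlib.sha256(content).hexdigest()
        )
        self._storage[identifier] = package
        return package

    def deliver(self, identifier: str) -> dict:
        """Verify the stored package and return a copy for the user."""
        package = self._storage[identifier]
        if hashlib.sha256(package.content).hexdigest() != package.fixity:
            raise ValueError(f"fixity check failed for {identifier}")
        return {
            "identifier": package.identifier,
            "content": package.content,
            "metadata": dict(package.descriptive_metadata),
        }


# Invented record used only to exercise the pipeline.
repo = Repository()
repo.ingest("NL-0001", b"Ministerial decision, 1998", {"agency": "Example Ministry"})
delivered = repo.deliver("NL-0001")
print(delivered["metadata"])
```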

The Digital Repository Project is most concerned with determining the functionality of the repository, scope of the metadata, standards to be applied, and differentiation of the intellectual and the physical and technical form of the records. As with the Cedars Project, a two-track approach is being taken. One track will build a small repository to preserve simple records in a stand-alone environment implemented by the National Archives. The other track will develop a testbed and experimental framework for examining preservation strategies such as migration, emulation, and XML on electronic records acquired by applying the OAIS reference model (Hofman 1999).

Persistent Object Preservation

Persistent object preservation is a highly generic technological approach that has been developed jointly by the U.S. National Archives and Records Administration and the San Diego Supercomputer Center. This project is addressing the need of the National Archives to find fast and efficient methods for acquiring and preserving, in context, millions of files. The methods must be applicable to many types of records and must comply with archival principles. The approach focuses on storing the information objects that make up a collection and identifying their metadata attributes and behaviors that can be used to recreate the collection.

Like the Digital Repository Project, persistent object preservation is built around the OAIS reference model. It supports archival processes from accessioning through preservation and use, and it recognizes the importance of collection-based management. Persistent object preservation also exploits inherent hierarchical structures within records, predictable record forms, and dependencies between them. It is designed to be consistent, comprehensive, and independent of infrastructure (Rajasekar et al. 1999, Thibodeau 1999).
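The central idea, storing objects together with the attributes needed to recreate the collection, can be sketched in a few lines. The example below assumes that each object records its parent and that the description is serialized as JSON so that it does not depend on any particular system. The attribute names and the sample collection are hypothetical and are not drawn from the project itself.

```python
import json
from collections import defaultdict

# Hypothetical objects, each carrying the attribute needed to place it
# within the collection hierarchy.
objects = [
    {"id": "series-1", "parent": None, "title": "Correspondence"},
    {"id": "file-1", "parent": "series-1", "title": "Letters, 1990"},
    {"id": "file-2", "parent": "series-1", "title": "Letters, 1991"},
]


def serialize(items: list[dict]) -> str:
    """Write the objects' attributes in a system-independent text form."""
    return json.dumps(items, sort_keys=True)


def rebuild(serialized: str) -> dict:
    """Recreate the parent-to-children structure from stored attributes alone."""
    children = defaultdict(list)
    for item in json.loads(serialized):
        children[item["parent"]].append(item["id"])
    return dict(children)


stored = serialize(objects)
structure = rebuild(stored)
print(structure)
```

Nothing in the rebuilt structure depends on the software that originally held the records, which is the sense in which such an approach aims to be independent of infrastructure.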