Collections, Preservation, and the Changing Resource Base

Anne R. Kenney


Introduction

Libraries in the first decade of the twenty-first century face enormous challenges, including challenges of identity and purpose. As traditional institutions of long standing, libraries manage legacy holdings of inestimable value. As purveyors of information, they are profoundly affected by the dizzying pace of technological change. Libraries’ constituents are more varied and more demanding than ever. Their detractors dismiss libraries as institutions that are no longer necessary in an age of networked information or, even worse, as potential enemies of the state in its fight against terrorism. And as the economy falters, libraries everywhere are on the chopping block. It is no great exaggeration to say that libraries are undergoing a crisis on par with any experienced in the past 100 years. Yet rumors of their demise are greatly exaggerated.

Recent studies characterize libraries as hybrid institutions, straddling the print world and the digital realm. Certainly more attention and resources are devoted to things digital: ARL reports that expenditures for electronic journals jumped 75 percent in the past two years and are up 900 percent since they were first reported in 1994/95 (Association of Research Libraries 2002). Yet reliance on hardcopy books and journals remains strong; they represent more than 80 percent of materials expenditures, according to ARL. In addition, libraries of all types, but especially academic and research libraries, are expanding their collecting scope to include new media and formats, software, data sets, instructional materials, and samizdat Web resources. By and large, these resources complement, rather than substitute for, print resources (Friedlander 2002). As OCLC concluded in its report on five-year information format trends, “the universe of materials that a library must assess, manage and disseminate is not simply shifting to a new set or type of materials, but rather building into a much more complex universe of old and new, commodity and unique, published and unpublished, physical and virtual” (OCLC 2003).

In sum, libraries are expected to support the full gamut of information at the very time they are under pressure to cut costs and develop new services. These pressures have been made acute by the current financial crisis. Most states are facing serious budget deficits, investment income is down, and libraries everywhere are threatened. The American Library Association (ALA) is tracking the budget crisis through its “Campaign to Save America’s Libraries.” ALA reports that 32 states have suffered cutbacks in support to state, local, and academic libraries (American Library Association 2003). The figures are chilling:

  • The budget of the California State Library was cut 29 percent in FY 2003 and the library anticipates an additional 15 percent cut within the fiscal year, followed by an additional 30 percent cut in 2004.
  • The Colorado State Library has sustained a 50 percent cut in state revenue and expects an additional 10 percent cut within the current fiscal year.
  • In 2002, the Seattle Public Library instituted its first-ever two-week shutdown of the entire system to help meet a 5 percent budget cut.
  • Thirty-one of forty-eight respondents to an ARL survey of library directors in February 2003 anticipated significant budget cuts next year.
  • In 2003, the University of Michigan library met a $2 million budget cut by eliminating more than 30 positions (Library Journal Academic News Wire 2003).

Libraries, then, are under tremendous pressure to maintain the old, embrace the new, and do so with declining resources. Given this state, where does preservation fit into the picture? I see two major causes for concern. The first is economic vulnerability. Because preservation programs are relative latecomers in libraries, they may well suffer from the “last-hired, first-fired” syndrome. Recent reports confirm this may already be under way. A second concern is process uncertainty. In the 1980s and 1990s, preservation programs developed to address a serious threat (acid paper), and they could rely on a trusted methodology (microfilming). Today’s threat, digital obsolescence, is even more pressing, but libraries are plagued by a lack of clarity about how and when to do preservation in the digital realm.

Facing the Economic Challenge

The current budget crisis may provide the catalyst for libraries to rethink how they organize themselves to ensure the most effective and efficient way to deliver services. When reductions are small or one-time, the tendency is to absorb them, rather than to reconceptualize. Libraries in the next several years will need to consider dramatically different ways of doing business, including preservation. Four options present themselves here: reengineering, mainstreaming, collaboration, and automation.

Reengineering

Some areas, such as technical services, have reengineered their processes significantly. At Cornell, for instance, the central technical services unit has decreased its workforce by 20 percent in the past seven years, while reducing the backlog and the time from point of receipt to point of use. They have done so by replacing manual processing methods with technology-based methods, eliminating redundancies, streamlining workflows, minimizing handling, and making selective use of outsourcing. They have redefined “quality” as the appropriate balance between processing speed, cost, and fullness of bibliographic treatment. This effort has required considerable consultation with other divisions within the library system to ensure that the elimination of processes in technical services did not result in simply shifting the workload to other services.

Reengineering has been built on establishing priorities and accepting trade-offs in some areas. At the heart of this process are tough choices. Libraries have operated under the assumption that standards and best practices are the mainstay of operations. Quality cataloging in 1990 meant that each institution tweaked its records or would accept copy only from the Library of Congress. By 2000, the notion of acceptable copy had changed, and the need to address growing backlogs forced a shift in practice that embraces not only conformance to bibliographic standards that are “good enough” but also timely and cost-effective processing. Ross Atkinson calls the “demise of the completeness syndrome” one of the key management transformations occurring today (Atkinson 2003).

“Good-enough” practice is beginning to make inroads into preservation programs as well. This can be most clearly seen in binding practices. Next to personnel, binding is the largest expense associated with preservation. In fiscal year 1984–1985, Cornell University spent more than $184,000 on the conventional binding of periodicals and monographs. In 1985, John Dean became the first director of preservation. He introduced two alternatives to conventional binding: (1) the quarter buckram binding of periodicals, which more than halved the unit cost of binding and rendered the item more stable on the shelf and more flexible in use; and (2) the stiffening of paperbacks, which reduced the unit cost of binding a single book to less than $1. Initial resistance to these changes was based on aesthetics, but the savings were considerable. In 2000–2001, Cornell spent only $173,000 on binding, despite handling significantly more volumes than in previous years. That year, Cornell ranked eleventh in volume count among ARL libraries but forty-third in commercial binding expenditures. As funds dry up, institutions are turning to these alternative forms of binding as well as to shrink-wrapping serials, off-site storage, or post-use binding of paperbacks sent directly to the shelves.

Mainstreaming

A second strategy for coping with budget reductions is to mainstream processes. A recent study on the state of preservation programs in academic libraries noted that various definitions of preservation practice prevail among library staff, some of whom would define it very narrowly (Kenney and Stam 2002). When preservation is viewed narrowly, it gets separated from mainstream functions, becomes identified as someone else’s domain, and can be considered a luxury when budget cuts must be made. This tendency is reinforced by the way libraries have measured preservation activity. Research libraries assess preservation capability through statistics such as whether the library has a preservation administrator or the number of staff in a preservation unit. Implicit in these measures is the assumption that preservation is distinct from other activities. This may lead institutions to feel inadequate if they do not have a separate preservation program or to assume that preservation is something they cannot afford. The extent to which preservation can be protected in this economic environment may well depend on the degree to which libraries can develop a more inclusive understanding of preservation: one that infuses the full range of library operations and encompasses all actions and policies designed to prolong the useful life of information. Assisting library staff to develop an appreciation for their roles in preservation can help the library meet its preservation objectives more effectively and economically.

Collaboration

Collaboration has been touted as a critical path for libraries. But the counterforces at work (competition, institutional ranking, self-interest, ownership, user resistance) make putting it into practice problematic. However, the twin pressures of digital access and a deepening economic crisis may force libraries to embrace collaboration more fully.

Libraries point to a long history of cooperation, most successfully in such areas as shared cataloging and interlibrary loan. More recently, they have joined forces to secure more favorable rates for electronic resources. In 2000–2001, ARL libraries spent nearly $15 million on e-resources through centrally funded consortia. This figure was dwarfed, however, by institutional expenditures on electronic resources, which topped $132 million (Association of Research Libraries 2002). Libraries have also established shared storage facilities, but too frequently these are characterized by separate spaces where each institution stores little-used, often duplicated, holdings. The Tri-College Library Consortium concluded that the partners could gain shelving space and maximize purchasing power by eliminating duplicated, low-use materials and building a single research collection. However, faculty members at the three institutions expressed serious reservations about relinquishing institutional collections to build a more integrated collection (Luther et al. 2003).

The movement to cooperate in shared collections and preservation responsibility is gaining ground. The Center for Research Libraries is spearheading an effort to investigate a network of regional print repositories and is collaborating with JSTOR to preserve the paper version of every journal in its stable (JSTOR 2001). Budgetary woes appear to be the catalyst for higher-level collaboration in California. The California Digital Library is building a shared storage facility for the University of California system to ensure the preservation of a print copy of record and enable campuses to eliminate paper subscriptions for journals available electronically. Long-term plans include cooperative collection development and access programs as well as preservation programs for print and electronic materials.

Cooperation as a preservation strategy may indeed be most promising in the near term in the area of print preservation, but it does require a rethinking of preservation principles. In the future, preservation will be decoupled from use, and the strategy of multiple redundancy will be replaced by single-copy archives. Born out of economic necessity and the convenience of network access, true collaboration will depend on the degree to which institutions are willing to relinquish ownership and share control over very long timeframes.

Automation

Three years ago, Bill Arms published a thought-provoking article in which he speculated on the degree to which automated processes can provide a satisfactory substitute for skilled librarians (Arms 2000). He correctly pointed out that the greatest expense in libraries is personnel: at Cornell University Library, for example, salaries and benefits represent 57 percent of the library budget. Arms argued that “brute force computing,” coupled with simple algorithms, can often outperform human intelligence and that the future of digital libraries will depend on making that switch. Although librarians responded negatively to this piece, we are indebted to Bill Arms for provoking such ideas and questioning current assumptions. The field of artificial intelligence is premised on the notion that computers can imitate human cognition; for example, IBM and others predict that the processing power of computers will equal the speed of the human brain within two decades. It may be difficult to pinpoint when a machine will be able to think, act, and emote in the same way as a human being does, but clearly libraries are turning to automation to reduce costs, increase productivity, and enhance decision making.

Libraries have successfully automated in a number of areas, but the impact on traditional preservation programs has been minimal. Digital preservation, however, will be possible on a grand scale only through the automation of archival processes. In recent years, various digital library projects have incorporated automated routines that, while still in the proof-of-concept stage, will become key to the development of sustainable digital preservation programs. Chief among these has been the use of Web harvesters, replication strategies, and automatic extraction of metadata.

Web harvesting is at the root of many archiving initiatives that focus on collecting publicly accessible Web resources. Several of these are fully automated, utilizing powerful Web crawlers to locate and download content. For others, such as Preserving and Accessing Networked Documentary Resources of Australia (PANDORA) and the Paradigma Project in Norway, ingest also includes some manual creation and clean-up of metadata and the establishment of content boundaries. The use of Web crawlers to automatically build synthetic collections on various subjects is an active line of research, which could have tremendous potential in establishing preservation priorities (Bergmark 2002).
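At the core of every such harvester is a simple loop: fetch a page, extract its outgoing links, and enqueue them for the next pass. As a hedged illustration of the link-extraction step (a sketch, not the code of any project named above), using only the Python standard library:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the <a href> tags of one fetched page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html_text):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links

page = '<a href="/about.html">About</a> <a href="http://example.org/x">X</a>'
extract_links("http://example.com/index.html", page)
# ['http://example.com/about.html', 'http://example.org/x']
```

A production harvester would add robots.txt handling, politeness delays, deduplication, and timestamped storage of each fetched page, which are precisely the ingest steps that projects such as PANDORA supplement with manual review.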

Replication. It is perhaps ironic that while paper preservation may be moving away from multiple redundancy as a preservation strategy at the institutional level, replication is very much a piece of the puzzle in the digital world. Current research focuses on how much replication is necessary, the degree to which it promotes repurposing of content, and how automated the process can be made. Projects such as Lots of Copies Keep Stuff Safe (LOCKSS 2003) and the work at Stanford on data-trading networks are beginning to address these questions (Cooper and Garcia-Molina 2002).
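The basic mechanics behind such replication schemes can be shown with a toy audit-and-repair routine. This is a simplified sketch of the general idea, not the actual LOCKSS polling protocol, which compares copies across peers over a network:

```python
import hashlib
from collections import Counter

def audit_replicas(replicas):
    """Given {site_name: content_bytes}, return (majority_digest, divergent_sites).

    A miniature stand-in for peers auditing one another's copies: any replica
    whose digest disagrees with the majority is flagged for repair.
    """
    digests = {site: hashlib.sha256(data).hexdigest()
               for site, data in replicas.items()}
    majority_digest, _count = Counter(digests.values()).most_common(1)[0]
    divergent = sorted(site for site, d in digests.items() if d != majority_digest)
    return majority_digest, divergent

def repair(replicas, divergent, majority_digest):
    """Overwrite damaged replicas with a copy matching the majority digest."""
    good = next(data for data in replicas.values()
                if hashlib.sha256(data).hexdigest() == majority_digest)
    for site in divergent:
        replicas[site] = good

# One of three copies has suffered silent corruption.
replicas = {"site-a": b"journal vol. 1", "site-b": b"journal vol. 1",
            "site-c": b"jour%al vol. 1"}
digest, bad = audit_replicas(replicas)   # bad == ["site-c"]
repair(replicas, bad, digest)            # all three copies now agree
```

Real systems must also decide how many replicas are enough and how to resist corruption of a majority of copies; those are among the questions the LOCKSS and Stanford data-trading work address.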

Metadata extraction. Some digital library projects, such as the National Science Digital Library Project, are focusing on automated metadata extraction, and search engines are employing increasingly sophisticated algorithms to rank search results. But an assessment of trends over the past five years by OCLC revealed that while the use of metadata (including data that are automatically created through HTML editors) is on the rise, such data are not particularly deep or detailed. There is also very slow take-up of formal metadata schemes, including the most basic, Dublin Core, which grew only marginally, from 0.5 percent of public Web site home pages in 1998 to 0.7 percent in 2002 (O’Neill, Lavoie, and Bennett 2003).
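To make the Dublin Core figures above concrete, the following minimal sketch (an illustration, not the crawler used in the OCLC study) shows how embedded Dublin Core tags can be harvested from a page's HTML with the Python standard library:

```python
from html.parser import HTMLParser

class DublinCoreExtractor(HTMLParser):
    """Collects Dublin Core <meta name="DC.xxx" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.dc = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            name = attr.get("name") or ""
            if name.lower().startswith("dc.") and attr.get("content"):
                self.dc[name] = attr["content"]

def extract_dublin_core(html_text):
    parser = DublinCoreExtractor()
    parser.feed(html_text)
    return parser.dc

page = ('<head><meta name="DC.Title" content="Annual Report">'
        '<meta name="DC.Creator" content="A. Librarian"></head>')
extract_dublin_core(page)
# {'DC.Title': 'Annual Report', 'DC.Creator': 'A. Librarian'}
```

The scarcity reported by O'Neill and colleagues means this kind of extractor comes back empty for the overwhelming majority of pages, which is precisely why automated description remains an open problem.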

Automated metadata creation and extraction that may be critical for preservation purposes is even more elusive. For example, most of the 34 major elements identified by the OCLC/RLG Working Group on Preservation Metadata and derived from the Open Archival Information System (OAIS) Information Model would require human intervention to be captured fully (OCLC/RLG 2002). HTTP headers contain many fields that could be useful for preserving Web pages. In analyzing header field use from crawls involving more than seven million documents, Project PRISM observed that only three fields (date, content type, and server) were returned for virtually every page. These fields are useful for long-term as well as current management. But other header fields desirable for preservation purposes are less consistently used: the frequency of use of the “content-length” and “last-modified” headers, for instance, ranged from 35 percent to 85 percent in test sets (McGovern et al. 2003). Efforts to automate processes in other domains, most notably in network security, will be critical for digital preservation, but the focus of current efforts is typically on system performance and does not extend to long-term viability considerations.1
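The PRISM finding can be mirrored in a small header audit. In this hedged sketch, the three near-universal fields come from the study; the extra entries beyond content-length and last-modified (etag, expires) are illustrative additions for this sketch, not drawn from the PRISM analysis:

```python
# The three fields Project PRISM found on virtually every page.
CORE_FIELDS = {"date", "content-type", "server"}
# Fields useful for preservation but less consistently present; "etag" and
# "expires" are illustrative additions, not part of the PRISM analysis.
PRESERVATION_FIELDS = {"content-length", "last-modified", "etag", "expires"}

def header_report(headers):
    """Report which preservation-relevant HTTP header fields are missing.

    `headers` is any mapping of header names to values, such as the
    response headers returned by http.client or urllib.request.
    """
    present = {name.lower() for name in headers}
    return {
        "core_missing": sorted(CORE_FIELDS - present),
        "preservation_missing": sorted(PRESERVATION_FIELDS - present),
    }

sample = {
    "Date": "Mon, 12 May 2003 10:00:00 GMT",
    "Content-Type": "text/html",
    "Server": "Apache/1.3.27",
}
header_report(sample)
# {'core_missing': [],
#  'preservation_missing': ['content-length', 'etag', 'expires', 'last-modified']}
```

Run across a crawl, a report like this quantifies how much preservation metadata would have to be supplied by hand, which is the gap the OCLC/RLG element set makes explicit.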

Automation is necessary but insufficient, at least for now, in meeting the digital preservation challenge. One of the downsides of the focus on automated routines is the tendency to foster a false sense of security that technology alone is the answer. Consider the case of the Internet Archive. The Internet Archive has been sweeping the Web since 1996, saving whatever pages it can find. It currently holds about 100 terabytes (TB) of information and grows at a rate of 12 TB per month. The Internet Archive provides the best view of the early Web as well as a panoramic record of its rapid evolution. Nevertheless, it would be a mistake to conclude that the Internet Archive has solved the Web preservation problem.

The Internet Archive and similar efforts to preserve the Web by copying suffer from common weaknesses (Kenney et al. 2002) such as the following.

  • Snapshots may not capture important changes in content and structure.
  • Technological development did not always keep pace with the growth of the Internet. For instance, crawls in 1999 contain few images because the Internet Archive did not have enough bandwidth for text plus images. There were also months when there was no crawling at all while the crawler was being rewritten.
  • Technology development, including robot exclusions, password protection, JavaScript, and server-side image maps, inhibits full capture.
  • A Web page may serve as the front end to a database, an image repository, or a library management system, and Web crawlers capture none of the material contained in these “deep” Web resources.
  • The volume of material is staggering. The high-speed crawlers used by the Internet Archive take months to traverse the entire Web; even more time would be needed to treat anomalies associated with downloading.
  • Automated approaches to collecting Web data tend to stop short of incorporating the means to manage the risks of content loss.
  • File copying by itself is insufficient: Repositories must commit to continued access through changing file formats, encoding standards, and software technologies.
  • The Internet Archive lacks authorization for its actions, and legal constraints limit the ability of crawlers to copy and preserve the Web.

Despite these drawbacks, there are those who believe that the Internet Archive does preserve the Web, as the recent “Sex Court” trademark trial illustrated. Playboy Enterprises brought suit against Mario Cavalluzzo’s pay-for-porn Web site, sexcourt.com, over use of the trade name. Playboy’s lawyers introduced evidence in court using the Internet Archive’s Wayback Machine that the earliest entry for Cavalluzzo’s sex court Web site was January 1999, four months after Playboy aired the first installment of its cable show of the same name. But attorneys for Cavalluzzo submitted evidence that his page was on the Internet by May 14, 1998. A chagrined Playboy settled out of court.

Addressing the Digital Preservation Challenge: More Than Just Technology

Despite increasing evidence about the fragility and ubiquity of digital content, cultural repositories have been slow to respond to the need to safeguard digital heritage materials. Survey after survey conducted over the past five years provides a bleak picture of institutional readiness and responsiveness. Why this lag in institutional take-up? In part the answer lies in the fact that most of the attention given to digital preservation has focused on technology. This emphasis has led to a reductionist view wherein technology is equated with solution, which in turn is deferred until some time in the future when the technology has matured. Even when the technology solution is purportedly at hand (DSpace, for example, has been characterized as a “sustainable solution for institutional digital asset services”), technology is not the sole solution, but only part of it (Bass et al. 2002).

The focus on technology has mimicked computational methods that reduce things to an on or off status: either you have a solution or you do not. This either/or assessment gives little consideration to the effort required to reach the on stage, to a phased approach for reaching the on stage, or to differences between institutional settings. It is not surprising, then, that organizations are uncertain as to how to proceed. Postponing the development of digital preservation programs because one cannot create a comprehensive program out of whole cloth will ensure that vital digital resources are sacrificed in the interim. Lack of organizational readiness, not technology, is the greatest inhibitor to digital preservation programs (Kenney and McGovern 2003).

In an article on institutional repositories, Cliff Lynch voiced a fear that institutions would establish repositories without committing to them over the long term: “Stewardship is easy and inexpensive to claim; it is expensive and difficult to honor, and perhaps it will prove to be all too easy to later abdicate” (Lynch 2003).

Libraries in the first decade of the twenty-first century face tremendous responsibilities and opportunities. Preserving cultural heritage is more difficult when the path ahead is not clear. It is important, however, that libraries maintain their historic role as flame bearers from one generation to the next. They must find new ways to do so by taking risks and forging new partnerships, not only with other cultural repositories but also with creators, publishers, and ordinary folk. Recently, concerned individuals established AfterLife.org, a not-for-profit organization whose mission is to archive Web sites after their authors die. Their motives are pure, but America’s memory should not be measured by the lives of either creators or volunteers. That is what libraries and archives are for.

References

American Library Association. 2003. Latest Funding Cuts. Available at www.ala.org/Content/NavigationMenu/Our_Association/Offices/Public_Information/Promotions1/Latest_Funding_Cuts.htm.

Arms, William. 2000. Automated Digital Libraries: How Effectively Can Computers Be Used for the Skilled Tasks of Professional Librarianship? D-Lib Magazine 6(7/8). Available at http://www.dlib.org/dlib/july00/arms/07arms.html.

Association of Research Libraries. 2002. Highlights ARL Supplementary Statistics 2000-2001. Available at www.arl.org/stats/pubpdf/sup01.pdf.

Atkinson, Ross. 2003. Uses and Abuses of Cooperation in a Digital Age. Collection Management, in press.

Bergmark, Donna. 2002. Collection Synthesis. In Proceedings of the Second ACM/IEEE-CS Joint Conference on Digital Libraries, July 14-18, 2002, Portland, Oregon. New York: ACM Press.

Bass, M., et al. 2002. DSpace Internal Reference Specification: Technology & Architecture, Version 2002-03-01 (paper version).

Cooper, Brian F., and Hector Garcia-Molina. 2002. Peer-to-Peer Data Trading to Preserve Information. Association for Computing Machinery. Available at http://www-db.stanford.edu/~cooperb/pubs/trading.pdf.

Friedlander, Amy. 2002. Dimensions and Use of the Scholarly Information Environment. Introduction to a Data Set Assembled by the Digital Library Federation and Outsell, Inc. Washington, D.C.: Digital Library Federation and Council on Library and Information Resources. Available at https://clir.wordpress.clir.org/pubs/abstract/pub110abst.html.

JSTOR. 2001. JSTOR and CRL Team Up on Journal Deposit Effort. JSTORNEWS 5(2).

Kenney, Anne R., and Nancy McGovern. 2003. The Five Organizational Stages of Digital Preservation. In Digital Libraries: A Vision for the 21st Century. A Festschrift for Wendy Pratt Lougee on the Occasion of Her Departure from the University of Michigan. In press.

Kenney, Anne R., Nancy Y. McGovern, Peter Botticelli, and Richard Entlich. 2002. Preservation Risk Management for Web Resources. D-Lib Magazine 8(1). Available at http://www.dlib.org/dlib/january02/kenney/01kenney.html.

Kenney, Anne R., and Deirdre C. Stam. 2002. The State of Preservation Programs in American College and Research Libraries: Building a Common Understanding and Action Agenda. Washington, D.C.: Council on Library and Information Resources.

Library Journal Academic News Wire. 2003. Budget Blues and the University of Michigan: Library Loses More than 30 Jobs. April 1.

LOCKSS. 2003. Available at http://lockss.stanford.edu/.

Luther, Judy, Linda Bills, Amy McColl, Norm Medeiros, Amy Morrison, Eric Pumroy, and Peggy Seiden. 2003. Library Buildings and the Building of a Collaborative Research Collection at the Tri-College Library Consortium. Washington, D.C.: Council on Library and Information Resources.

Lynch, Clifford A. 2003. Institutional Repositories: Essential Infrastructure for Scholarship in the Digital Age. ARL Bimonthly Report 226 (February). Available at http://www.arl.org/newsltr/226/ir.html.

McGovern, Nancy Y., William R. Kehoe, Richard Entlich, and Peter Botticelli. 2003. Virtual Remote Control of Web Resources: A Risk Management Approach for Research Libraries. Paper presented at the Digital Library Federation Forum, May 14-16, 2003, New York, N.Y.

OCLC Online Computer Library Center. 2003. Five-Year Information Format Trends. Available at www.oclc.org/info/trends/.

OCLC/RLG Working Group on Preservation Metadata. 2002 (June). Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects. Available at www.oclc.org/research/pmwg/.

O’Neill, Edward T., Brian F. Lavoie, and Rick Bennett. 2003. Trends in the Evolution of the Public Web, 1998-2002. D-Lib Magazine 9(4). Available at http://www.dlib.org/dlib/april03/lavoie/04lavoie.html.


FOOTNOTE

1 Examples include SiteSeer and SiteScope from Mercury Interactive (www.mercuryinteractive.com) and Honeypots from the Honeynet Project (http://project.honeynet.org/).