A survey “by metes and bounds” is a highly descriptive delineation of a plot of land that relies on natural landmarks, such as trees, bodies of water, and large stones, and often-crude measurements of distance and direction. This was accepted practice before more precise instruments and methods were developed—indeed, the original 13 U.S. states were laid out by metes and bounds. More accurate means of measuring were established to overcome the method’s serious shortcomings: streambeds move over time, witness trees are struck by lightning, compass needles do not point true north, and measuring chains and surveyor strides can be of slightly differing lengths. However, the metes and bounds system is still used when it is impossible or impractical to make more precise measurements.

In undertaking our survey of the e-journal archiving landscape, we found that precise measurements and controlled data collection were not always possible. The e-publishing terrain is changing at time-lapse photography speed. Definitions and terms are widely interpreted, and standards are not yet established. These factors, along with our need to rely heavily on self-reporting by the programs, mean that direct comparisons between them may not always be valid. Despite this, we describe in this report the current lay of the land for scholarly e-journal archiving.

This study focuses on the “who, what, when, where, why, and how” of significant archiving programs operated by not-for-profit organizations in the domain of peer-reviewed journal literature published in digital form. Not included are preservation efforts covering digitized versions of print journals, such as JSTOR; library-led digital conversion projects; self-archiving efforts by publishers; and initiatives still being planned.

In preparing this report, our team focused on the following:

  • soliciting library directors’ concerns and perceptions about e-journals;
  • compiling responses from e-journal archiving initiatives taken from written surveys and semistructured interviews; and
  • analyzing the issues and current state of practice in e-journal archiving, and forming recommendations for the future.

Library Directors’ Concerns

We began the study by developing a list of what library decision makers are likely to consider as they assess preservation strategies for e-archiving. The list was informed by our own research, discussions with colleagues, and comments made to staff members of the Center for Research Libraries (CRL) by member library directors.7

During March and April 2006, 15 North American library directors, representing a range of public and private institutions of various sizes as well as consortia, participated in telephone interviews designed to solicit their views on six key areas:

  1. Library motivation (Why should we be concerned about or invest in this?)
  2. Content coverage (Are current approaches covering the subject areas, titles, and journal components in which we are most interested?)
  3. Access (What will we gain access to? When and under what conditions?)
  4. Program viability (What evidence is there that these efforts are sufficiently well-governed and financed to last?)
  5. Library responsibilities and resource requirements (What will this cost our library in staff time, expertise, financial commitment? Would our support save the library money?)
  6. Technical approach (How do we judge whether the approach is rigorous enough to meet its archiving objectives?)

The interviews helped refine the issues to be covered in our survey. They also revealed some interesting opinions on the topic. Three common themes emerged in the interviews: the sense of urgency, resource commitment and competing priorities, and the need for collective response.

Sense of Urgency

These directors were all aware of digital preservation as a major concern, but they differed on whether it was a priority for support and action. Some felt the sense of urgency as a vague concern rather than as an immediate crisis, and several were willing to defer action until a crisis point is reached. Digital preservation is a “just-in-case scenario,” commented one director, “and this is very much a just-in-time operation.” Another noted, “Archiving is the last thing that gets taken care of because it’s the farthest thing out.” One director did assert that she would not want to gamble on what it would take to obtain access later if her institution did not invest now, likening that decision to not buying a book and waiting three years to see whether there was a demand for it. Several directors who have committed to supporting e-journal archiving do so because they have experienced loss. One acknowledged that her institution’s willingness to support digital archiving stemmed from the losses caused by a devastating flood: “Natural disasters make people focus.” Another director indicated that 9/11 raised his level of concern: “Prior to that, I had scoffed at the idea that the Internet would break down and I wouldn’t have access to my journals restored in 24 hours.”

One-third of the directors expressed more concern about the preservation of digital content other than e-journals. Virtually all expressed a lack of trust in publishers providing the solution, but many argued that publishers had to take on more responsibility. They pointed to efforts to include archiving clauses in licensing agreements. One questioned why she should have to pay additionally to support e-archiving initiatives: “We’ve pressured publishers to include archiving, and now we’re giving up on this?” Several pointed to the role that some publishers were already undertaking in collaborating with libraries to share preservation responsibility. One suggested that as the number of publishers decreases because of mergers and acquisitions, those remaining are making money and are not as apt to go under in the short term. Can an effective case be made, some asked, without there being an actual disaster? Another wondered about the future of licensed content in general for reasons other than digital preservation: “If you can’t get [e-journals] on the open public Internet, do they have much value anymore?” Several identified university records, Web sites, and digital content produced within institutions as more immediate concerns and were committing resources to their protection. “How do we sustain our role as the university archives in the digital age?” one asked.

Interviewees from some of the larger ARL libraries expressed the most concern about preserving e-journals. Although they argued that publishers had to bear some responsibility for e-journal archiving, they did not necessarily trust publishers to do this over time. One put it bluntly: “We definitely can’t wait this one out. I have a bias toward action and want to be involved. Until you explore it, you really don’t know what’s going on.” This concern was compounded by a sense of frustration over the options available. Understanding the issues is not the real problem, one noted: a lack of clarity about the solutions is. To date, few have committed real resources to address e-journal archiving, in part because they are unclear about what needs to be done. All directors interviewed acknowledged that a perfect solution is still many years away, and those who were willing to commit resources now stated their goal was to support a “good enough” solution that would be viable until the desired solution came along. One director characterized the decision of whether to commit resources as particularly acute for medium-size libraries. “The large ones will do it and worry about whether they should be doing this for others,” she argued, “and the smaller ones will say they don’t have the money. The ones in the middle with some resources and some sense of obligation are the fence sitters.” A director of an Oberlin Group library argued that leading liberal arts colleges would want to be involved as well.

Of the fifteen directors interviewed for this study, nine have committed or are prepared to commit resources to e-journal archiving, two are not, and four characterize themselves as fence sitters. The two who have decided to do nothing view their positions as managing risks and making hard decisions. Of the four who are undecided, one called himself a fence sitter only because he has not made up his mind about which initiative to support. Another characterized her institution as an “early follower, sitting on a fence by design, not because we wound up on one,” and a third concluded at the end of our discussion “I’m starting to think as we talk that sitting on the fence isn’t helping.” When asked what would provide additional incentives for getting off the fence, several pointed to peer pressure and reaching the “tipping point” of enough institutions participating. One said that he wanted to know where the major ARL libraries were going to put their money and why. One cited the importance of pressure from funding agencies such as The Andrew W. Mellon Foundation or their professional organizations. Another said that she would decide to do something in response to pressure from the administration or faculty members. Another indicated that having transparency in what is being done would be important, as was whether her institution would have a say in future directions. Several wanted to know about the circumstances and effort involved in committing to e-journal archiving, and how long they would have to wait before their institutions could restore access to their users following loss of normal access channels. Others wanted to know the costs involved, including staff effort, and what they would get from their commitment. They wanted to support those whom they could trust the most, whom they would have to pay the least, and who covered the material they care most about. Incentives to be an early subscriber were a big carrot. Knowing the penalties for waiting to join later was a potential big stick.

Resource Commitment and Competing Priorities

A recurring concern among the library directors interviewed was finding resources to commit to e-journal archiving programs. They pointed to competing priorities and the difficulty of identifying ongoing funds to support the effort.8 Many felt that while they might be able to provide resources for the next several years, support would eventually have to be found at the university or college level. Some were concerned that senior administrators would agree that the problem was real and that the library should address it, but that it would be difficult to get additional support. Digital archiving, one noted, is a new kind of expense, which is more difficult to argue for than increases to an existing expense. The directors requested sound bites to use with their provosts, presidents, and chancellors. (One mused that real horror stories would be better.) Several focused on the need to have faculty identify digital preservation as a major concern that directly affects them.

Almost all the directors rejected the argument that the savings in moving to electronic-only could cover the archiving costs. For most of them, that shift has already occurred as a result of lean budget years and dramatic increases in serials subscriptions, and the savings have already been reallocated to other purposes. “We couldn’t wait for the safety net to cancel,” said one. A director from the East Coast noted that many competing demands from new initiatives require ongoing financial support.

The greatest competition, however, lies in providing ongoing access to electronic resources. When a choice has to be made between the two, “broad and deep access at this point trumps more restricted access but a reliable archive,” concluded one director. “I’d rather buy more titles now than pay for something I might never use,” said another. Several directors from state institutions worried about justifying the use of state funds to purchase something “intangible” and questioned whether e-journal archiving could substitute for risk management measures locally. Others expressed more concern about guaranteeing perpetual access to e-journals than archiving them. One pointed out that his main worry was ensuring future access to content “below the trigger threshold” that would not be addressed by e-journal archiving. Another director questioned whether it was counter to his responsibilities to try to “preserve all e-journals when I can’t even get access to many of them because I can’t afford it.” Another commented, “It all comes down to money: present money versus future money.” One even suggested that it would almost seem like throwing money away: “You don’t have anything to show for it, and I’m not even sure that the solution would survive when you do need it.”

Need for Collective Response

All the directors interviewed rejected the notion of creating their own institutional solution. A major finding of the seven e-journal archiving projects supported by The Andrew W. Mellon Foundation in 2001 was the difficulty of developing an institution-specific solution. At the end of that project, the Mellon Foundation decided to provide startup funds for both Portico and the LOCKSS Alliance (Bowen 2005). Several directors called for the creation of a national cooperative venture, saying, “We want to throw our lot in with other libraries.” Some wanted to tie e-journal archiving to their consortial buying and licensing efforts. Others felt that publishers had to be at the table as well, noting that libraries are too prone to seek internal solutions. One mused that libraries can now do with e-journal archiving what they have wanted to do for 40 years with shared print repositories, and that the two could not be handled in isolation.

Although agreeing that a collective response is needed, several directors worried about having too many options. “I have heard others say we need lots of strategies to keep stuff safe,” said one, “but I’m not sure that’s true.” Another worried about ending up with two or three competing models that would be difficult to sustain. He suggested not investing in any of the options until they get together to build “something we can all get behind.” Keeping track of what is archived by whom raised the specter of major management overhead. One director mused that this might represent a new business for Serials Solutions. All agreed that while it was still early, it would be “nice if the market sorted itself out fast.”

Another concern of the directors was the long-term viability of any e-journal archiving initiative. Several wanted reassurance that their investment would be secure for at least 10 to 20 years. Others argued that it was unrealistic to expect assurances up front, noting that all the options are still experimental and that there is no right solution. Several suggested that it was important for institutions to support different options because it is not clear “which model will win out.” The right answer, one stated, “is that more people must participate in order to uncover the problems and workable solutions.” One director argued that instead of focusing on the existing options, libraries should collectively define what the solution should look like.

Cornell Survey of 12 E-Journal Archiving Initiatives

The directors’ concerns helped shape a questionnaire that our team used to survey e-journal archiving programs. The survey covered six areas: organizational issues, stakeholders and designated communities, content, access and triggers, technology, and resources. The form went through several iterations in response to reviewer feedback and was pilot-tested with one digital archiving entity before being finalized. A version of the final survey form is located in Appendix 1. Project staff sent surveys to 12 e-journal archiving programs in March and held hour-long interviews with key principals (and subsequent follow-up) between April and June 2006.

Several criteria guided the selection of electronic journal archiving initiatives to include in this study. First, each initiative had to have an explicit commitment to digital archiving for scholarly peer-reviewed electronic journals. Second, it had to maintain formal relationships with publishers that include the right to ingest and manage a significant number of journal titles over time. Third, work addressing long-term accessibility had to be under way. Fourth, the efforts had to be by not-for-profit organizations independent of the publishers. Finally, the work had to be of current or potential benefit to academic libraries that have a preservation mandate.

The following 12 e-journal archiving programs met these criteria. Appendix 2 includes longer descriptions of these programs.

Canada Institute for Scientific and Technical Information (CISTI Csi)
The National Research Council of Canada (NRC), Canada’s governmental organization for research and development, was mandated by the National Research Council Act (August 1989) to establish, operate, and maintain a national science library. In that capacity, the NRC hosts CISTI to provide universal, seamless, and permanent access to information for Canadian research and innovation in all areas of science, engineering, and medicine for Canadians, the NRC, and researchers worldwide. To achieve its mission as Canada’s national science library, CISTI has established a three-year program called Canada’s scientific infostructure (Csi) and is partnering with Library and Archives Canada (LAC) to ensure business continuity. This program is creating a national information infrastructure in collaboration with partners to provide long-term access to digital content loaded at CISTI and to support research and educational activities. In 2003, CISTI began loading e-journal content from three publishers and now has loaded close to 5 million articles. Additional content from other publishers in the sciences is planned.

LOCKSS Alliance and CLOCKSS
The Lots of Copies Keep Stuff Safe (LOCKSS) program, based at Stanford University, launched the beta version of its open-source software between 2000 and 2002. LOCKSS intended the software to allow libraries to collect, store, preserve, and provide access to their own, local copies of authorized content. Some 100 participating institutions in more than 20 countries use the LOCKSS software to capture content. About 25 publishers of commercial and open-access content (including large aggregators) participate in the LOCKSS program. In 2005, the LOCKSS Alliance was launched as a membership organization built on the LOCKSS software. The purpose of the alliance is to develop a governance structure and to address sustainability issues. The Controlled LOCKSS (CLOCKSS) initiative, added to the LOCKSS program in 2006, brings together six libraries and twelve publishers to establish a dark archive for e-journals.

Koninklijke Bibliotheek e-Depot (KB e-Depot)
As the national deposit library for the Netherlands, the Koninklijke Bibliotheek (KB) is responsible for preserving and providing long-term access to Dutch electronic publications. To meet that responsibility, the KB started planning for e-journal archiving in 1993 and began to implement an archiving system between 1998 and 2000. It was initially intended as a system in which Dutch publishers would voluntarily deposit their publications for archiving. The KB’s current goal is to include journals from the 20 to 25 largest publishing companies, which produce almost 90% of the world’s electronic STM literature. The KB e-Depot currently offers digital archiving services for eight major publishers.

Kooperativer Aufbau eines Langzeitarchivs Digitaler Informationen (kopal/DDB)
Funded by the German Federal Ministry of Education and Research, kopal/DDB is a cooperative project begun in July 2004. A main impetus for kopal has been the need for the national library of Germany, Die Deutsche Bibliothek (DDB), to manage the legal deposit of electronic publications. DDB had been experimenting with electronic journals since 2000; in 2006, Germany enacted legal deposit legislation for electronic publications, making the implementation of a system a priority. Through voluntary agreements with publishers, DDB has acquired a variety of electronic content, including e-journal titles from Springer, Wiley-VCH, and Thieme. Under legal deposit, DDB will start acquiring and adding to kopal all electronic journals published in Germany. In the future, kopal/DDB intends to offer other institutions data archiving services.

Los Alamos National Laboratory Research Library (LANL-RL)
Los Alamos National Laboratory is one of three U.S. national laboratories operated under the National Nuclear Security Administration of the U.S. Department of Energy. LANL-RL has been locally loading licensed backfiles from several commercial and society publishers since 1995. Focusing on titles in the physical sciences, the library maintains content from 10 publishers primarily for the use of the LANL-RL staff, but it also serves a group of external clients who pay for access (LANL charges on a cost-recovery basis). LANL-RL has done substantial research and development work on repository and digital object architecture for long-term maintenance of electronic journal contents. A major focus of the research and development work has been the creation of the aDORe repository.

National Library of Australia PANDORA (NLA PANDORA)
The NLA selects e-journals from its Australian Journals Online database for preservation in PANDORA (Preserving and Accessing Networked Documentary Resources of Australia), which was established in 1996. E-journals is one of six categories of online publications included in PANDORA, which lists 1,983 journals published in Australia. Of these, 150 are commercial titles. The NLA released the first version of the PANDORA Digital Archiving System (PANDAS) in 2001.

OCLC Electronic Collections Online (OCLC ECO)
OCLC launched ECO in June 1997 to support the efforts of libraries and consortia to acquire, circulate, and manage large collections of electronic academic and professional journals. It provides Web access through the OCLC FirstSearch interface to a growing collection of more than 5,000 titles in a wide range of subject areas from more than 40 publishers of academic and professional journals. Libraries, after paying an access fee to OCLC, can select the journals to which they would like to have electronic access. OCLC has negotiated with publishers to secure for subscribers perpetual rights to journal content. In addition, OCLC has reserved the right to migrate journal backfiles to new data formats as they become available.

OhioLINK Electronic Journal Center (OhioLINK EJC)
The Ohio Library and Information Network is a consortium of Ohio’s college and university libraries, comprising 85 institutions of higher education and the State Library of Ohio. OhioLINK’s electronic services include a multipublisher Electronic Journal Center (EJC), launched in 1998, which contains more than 6,900 scholarly journal titles from nearly 40 publishers across a wide range of disciplines. OhioLINK has declared its intention to maintain the EJC content as a permanent archive and has acquired perpetual archival rights in its licenses from all but one publisher.

Ontario Scholars Portal
Launched in 2001, the Ontario Scholars Portal serves the 20 university libraries in the Ontario Council of University Libraries (OCUL). The portal includes more than 6,900 e-journals from 13 publishers and metadata for the content of an additional 3 publishers. The primary purpose of the portal is access, but the consortium has made an explicit commitment to the long-term preservation of the e-journal content it loads locally. The initiative began with grant funding but as of 2006 became self-funded through tiered membership fees.

Portico
Publicly launched in 2006, Portico is a third-party electronic archiving service for e-journals, and serves as a permanent dark archive. E-journal availability (other than for verification purposes) is governed by specific “trigger events” resulting from substantial disruption to access from the publishers themselves. A membership organization, Portico is open to all libraries and scholarly publishers, which support the effort through annual contributions. As of July 1, 2006, 13 publishers and 100 libraries participated in Portico.

PubMed Central
Launched in February 2000, PubMed Central is NIH’s free digital archive of biomedical and life sciences journal literature, run by the National Center for Biotechnology Information of the National Library of Medicine (NLM). PubMed Central encompasses about 250 titles from more than 50 publishers. It prefers that the complete contents for participating titles be submitted, but it will accept at minimum the primary research content, and it allows publishers to delay deposit by a year or more after initial publication. PubMed Central retains perpetual rights to archive all submitted materials and has committed to maintaining the long-term integrity and accuracy of the archive’s contents.

General Characteristics

Three organizational types are represented among the twelve programs, as presented in Figure 1. The largest category comprises six government-supported efforts, five of them sponsored by a national library (CISTI Csi, KB e-Depot, kopal/DDB, NLA PANDORA, PubMed Central). The sixth, LANL-RL, receives funding from the U.S. Department of Energy and the U.S. Department of Defense. Two (OhioLINK EJC and the Ontario Scholars Portal) represent consortia that aggregate content primarily for access but have assumed archiving responsibility. Four (CLOCKSS, LOCKSS Alliance, OCLC ECO, and Portico) are member or subscriber initiatives, with all except ECO launched specifically to address digital archiving issues.

Fig. 1. Types of organizations included in survey

These programs are of recent origin. The oldest (LANL-RL) began in 1995, and four were launched within the past two years. Seven of the programs provide ongoing access to content: five limit access to current subscribers or members, and two (PubMed Central and NLA PANDORA) are open to all, although access to some material may not occur immediately following publication (this waiting period creates a “moving wall” for access). Five provide current access only for auditing purposes and for checking the integrity and security of systems and content; otherwise, access will be given after a trigger event occurs. A trigger event may occur, for example, when a publication ceases to be available online because of publisher failure or lack of support, a major disaster, or technological obsolescence.

Table 1 compares major attributes for the group, including year of inception, organizational type, access mechanisms, and designated users (i.e., those who receive access whenever it is provided).

Table 1. Major attributes of programs surveyed

Note: For the purposes of this report, the abbreviations listed in the left-hand column above will be used for all figures and tables. CLOCKSS was not considered as a separate entity from LOCKSS during the initial round of surveys and interviews and, therefore, will not be listed separately in many tables.

Assessing E-Journal Archiving Programs

Our team compiled and analyzed the survey responses in May and June 2006, freezing the addition of new information on July 1. A set of indicators for assessing the e-journal archiving programs was derived, in part, from two statements. The first is the Minimum Criteria for an Archival Repository of Digital Scholarly Journals, issued in May 2000 by the DLF. The second is the minimal set of services for an archiving program represented in the “Urgent Action” statement noted above.

As a result of this work, we identified seven indicators of a program’s viability. In meeting its obligations to archive e-journals, the repository should

  1. have both an explicit mission and the necessary mandate to perform long-term e-journal archiving;
  2. negotiate all rights and responsibilities necessary to fulfill its obligations over long periods;
  3. be explicit about which scholarly publications it is archiving and for whom;
  4. offer a minimal set of well-defined archiving services;
  5. make preserved information available to libraries under certain conditions;
  6. be organizationally viable; and
  7. work as part of a network.

Fig. 2. Measuring e-journal archiving programs against seven indicators

Figure 2 shows our estimate of the current state of program viability for the 12 e-journal archives under review based on the seven indicators. These programs have secured their mandates and defined access conditions, and they are making good progress toward obtaining necessary rights and organizational viability. Room for improvement is apparent, however, in three key areas: content coverage, meeting minimal services, and establishing a network of interdependency.

A discussion of the seven indicators follows.

Indicator 1: Mission and Mandate

The repository should have both an explicit mission and the necessary mandate to perform long-term e-journal archiving.

All 12 programs confirmed that their missions explicitly committed them to long-term e-journal archiving, and each has negotiated with publishers to secure the archival rights to manage journal content. Many publishers are willing to participate in these programs in part to protect their digital assets and in response to increasing demand from their principal customers. For example, the five largest STM publishers—Blackwell, Elsevier, Springer, Taylor & Francis, and Wiley—are all engaged in more than one of the e-journal archiving efforts reviewed in this report. Their participation, however, is voluntary, and at least one other publisher refused to grant OhioLINK EJC archival rights as part of its license agreement. E-journal archiving efforts could be strengthened considerably if publishers were required by legislative mandate or as a precondition in license arrangements to deposit their content in suitable e-journal archives.

The Role of Legal Deposit in E-Journal Archiving

More and more nations are requiring the deposit of electronic publications, including electronic journals, in their national libraries. Both the British Library and Library and Archives Canada, for example, are designing electronic-deposit repositories, and Germany recently passed a law that mandates the deposit of German publications, a move that will strengthen kopal/DDB’s program.9 Other nations are expected to follow suit.

While legal deposit is often implemented as a requirement for copyright protection, in practice it can also become an important component of a digital preservation program. Legal deposit laws provide the designated deposit libraries with both an explicit mission and a mandate to preserve a nation’s publications. Once a journal has been deposited, the repository library is responsible for its preservation.

One question is whether legal deposit requirements will obviate the need to establish other e-journal archiving programs. We suggest that they will not, for at least four reasons. First, and most important, while most of the laws are intended to ensure that the journals will be preserved, there is less clarity as to how one can gain access to those journals. In almost all cases, one can visit the national library and consult an electronic publication onsite. It is unlikely, however, that the national libraries will be able to provide online access to remote users in the event of changes in subscription models, changed market environments, or possibly even publisher failure. The recently revised “Statement on the Development and Establishment of Voluntary Deposit Schemes for Electronic Publications,” endorsed by both the Committee of the Federation of European Publishers (FEP) and the Conference of European National Librarians (CENL) and intended to serve as a model for national deposit initiatives, makes no mention of access beyond the confines of the national legal deposit library, leaving such issues to separate contractual arrangements with the publishers (CENL/FEP 2005). None of the national deposit programs we surveyed currently has the capability to serve as a distributor of otherwise unavailable archived journals.

Second, because legal deposit requirements are so new, the ability of the national libraries to preserve content is largely untested. Spurred by the requirements of legal mandates to acquire and preserve digital information, the national libraries have made tremendous strides in developing digital preservation programs. Many advances in our understanding of digital preservation have come through the work of the KB, the NLA, and other pioneering national libraries and archives working in this area. None of these libraries, however, would claim that it has developed the perfect, or only, solution to digital preservation. At this early stage in our knowledge, it is important to have competing digital preservation solutions that can, over time, help us develop a consensus as to what constitutes best practice.

Third, while the movement for national digital deposit legislation seems to be spreading, major gaps remain. In many cases, such as in the Netherlands, the deposit program is a voluntary agreement between the library and the publishers. Publishers are encouraged, but not required, to deposit electronic material. In other cases, most notably the United States, there is neither mandatory legal deposit for electronic publications nor clear evidence that the Copyright Office could demand the deposit of electronic publications (Besek 2003). At a minimum, the United States will need to adopt strong mandatory digital deposit legislation if legal deposit is ever to replace library-initiated preservation.

Finally, and somewhat paradoxically, the concept of national publications is becoming problematic, especially when dealing with electronic journals. Elsevier, for example, may be headquartered in the Netherlands, but does that make all its publications Dutch and subject to any future deposit laws in the Netherlands—even when those journals may have a primarily U.S.-based editorial board and may be delivered from servers based in a third country?

Although legal deposit may not be the silver-bullet solution to archiving e-journals, it is clearly an important component of the preservation matrix. If nothing else, a legal requirement that would force publishers to deposit e-journals in several national deposit systems (because of the international nature of publishing) would create pressure for standard submission formats and manifests for e-journal content. In addition, once material is preserved, it may be possible to revisit the trigger events that allow access to the content and even to permit remote access in narrow circumstances. The national libraries are also well positioned to develop technical expertise related to digital preservation and to share that expertise. For these reasons, we hope that efforts to develop more e-journal deposit laws will continue. It would be particularly beneficial if the U.S. Copyright Office started requiring deposit of electronic journals for copyright protection and the Library of Congress (LC) assumed responsibility for the preservation of those journals.

The Role of Open-Access Research Repositories in E-Journal Archiving

A development closely related to mandatory legal copyright deposit is the mandatory deposit of funded research into an open-access research repository, such as PubMed Central or arXiv. To date, participation in such repositories has been voluntary, and the results have been mixed. NIH, for example, estimates that only 4% of eligible research is making its way into the PubMed Central online digital archive as a result of the voluntary provisions of NIH’s Policy on Enhancing Public Access to Archival Publications Resulting from NIH-Funded Research, implemented in May 2005 (DHHS 2006). Indeed, member publishers of the DC Principles Coalition fiercely contested the idea of a “mandated central government-run repository” (AAP, AMPA, DCPC 2004).

Several initiatives now under way could alter the voluntary nature of most agreements. In the United Kingdom, the Wellcome Trust and the Medical Research Council have ordered that the final copies of all research they fund be deposited in the UK PubMed Central, and the Biotechnology and Biological Sciences Research Council has mandated that publications from research it funds after October 1, 2006, will be deposited “in an appropriate e-print repository” (BBSRC 2006). Research Councils UK (RCUK) has encouraged the other U.K. research councils to consider deposit of funded research in an open-access repository.10 In the United States, a recent NIH appropriations bill was modified in committee to mandate the deposit of copies of all NIH-funded research in an open-access repository within 12 months of publication (Russo 2006). In addition, Senators John Cornyn (R–TX) and Joe Lieberman (D–CT) have introduced the Federal Research Public Access Act of 2006 (FRPAA), which would require that research funded by the largest federal research agencies and published in peer-reviewed journals be deposited and made openly accessible in digital repositories within six months of publication. Publishers oppose this proposed legislation.11

Given that more and more funded research is going to find its way into open-access repositories, an obvious question is whether libraries can rely on those repositories to preserve that information. There are at least two reasons why we would not recommend relying solely on open-access repositories for an archiving solution at this time.

First, while much research that appears in journals is funded by major U.S. or U.K. funding sources, many articles are not so funded. Consequently, much information will remain outside open-access repositories for the foreseeable future. Open-access article repositories are unlikely to function as substitutes for electronic journals.

Second, open-access repositories are not necessarily digital preservation solutions, although sometimes their names suggest otherwise. For example, one of the oldest open-access repositories, arXiv, suggests by its name that it is involved with preservation, yet there is nothing in the repository software that will ensure the preservation of deposited digital objects. Similarly, the protocol that links many preprint servers was named the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), suggesting that its activities are related to the Open Archival Information System (OAIS) framework. In reality, OAI and OAIS have nothing to do with each other (Hirtle 2001). Open “archives” are primarily concerned with providing open access to current information and not with long-term preservation of the contents.

In its draft position statement on access to research outputs, issued June 28, 2005, RCUK noted the distinction:

RCUK recognises the distinction between (a) making published material quickly and easily available, free of charge to users at the point of use (which is the main purpose of open access repositories), and (b) long-term preservation and curation, which need not necessarily be in such repositories. . . . [I]t should not be presumed that every e-print repository through which published material is made available in the short or medium term should also take upon itself the responsibility for long-term preservation.

RCUK’s proposed solution was not to assume that the open-access repositories would perform preservation, but instead to work with the British Library and its partners to ensure the preservation of research publications and related data in digital formats.

Similarly, the Cornyn/Lieberman bill does not assume that institutional or subject-based repositories will be able to preserve research articles. Instead, it requires that their long-term preservation be done either in a “stable digital repository maintained by a Federal agency” or in a third-party repository that meets agency requirements for “free public access, interoperability, and long-term preservation.”

In sum, the existing open-access research repositories (other than PubMed Central) are unlikely to qualify at this time as stable digital repositories. Libraries should therefore not presume that the scholarly record has been preserved just because it has been deposited in such a repository. At the same time, initiatives such as those from RCUK and in FRPAA could be important to the development of digital preservation because they would force agencies either to develop digital preservation solutions themselves or to define the requirements for third-party solutions.

Recommendations

  1. More effort needs to go into extending the legal mandate for preserving e-journals through legal deposit of electronic publications around the world, to formalize preservation responsibility at the national level.
  2. As part of their license negotiations, libraries and consortia should strongly urge publishers to enter into e-journal archiving relationships with bona fide programs.
  3. Publishers should be overt about their digital archiving efforts and their relationships with various digital archiving programs. The five largest STM publishers are all engaged in more than one of the e-journal archiving efforts reviewed in this report, but only one (Elsevier) presents its digital archiving program on its Web site. Several others have announced their archiving policies in newsletters or press releases—which may still be included on their Web sites as part of a publicity archive—but it can be difficult to locate this information.12
  4. Programs with responsibility to provide current access and archiving should publicize their digital archiving responsibilities both to publishers and to the research library community. Our discussions with library directors revealed that several of them were unaware of PubMed Central’s archiving responsibility or that it could serve as part of their preservation safety net.
  5. As the “Urgent Action” statement stipulates, research libraries should not sign licenses for access to electronic journals unless there are provisions for the effective archiving of those journals. The archiving program should offer at least the minimal level of services defined in the “Urgent Action” statement. In addition, the programs should be open to audit, and, when certification of trusted digital repositories is available, they should be certified. Unless e-journal content is preserved in such a repository, research libraries should not license access.

Indicator 2: Rights and Responsibilities

Rights and responsibilities associated with preserving e-journals should be clearly enumerated and remain viable over long periods.

Closely related to mission and mandate is the need for clarity of a repository’s rights and responsibilities vis-à-vis publishers, distributors, and content creators. Although a publisher may grant archiving rights to a repository, the circumstances surrounding the exercise of these rights may not be uniform or clearly enumerated—or even fully understood when the contract is written. Including input from research libraries and publishers in the governance or operation of the repository would be a useful way to monitor policies as circumstances change (Table 2).

Table 2. Responses to question: “Do publishers have any voice in the governance/operation of your e-journal archiving program?” (P = publishers; L = libraries)

The following three questions should be carefully considered in laying the foundation for digital archiving responsibility:

First, do the contracts consider all intellectual property rights held by publishers, creators, and technology companies that pertain to the content, and do they convey to the repository the right to perform necessary archiving functions to prolong the life of the content? Such rights can include basic permission to copy or reformat material, or both. They extend to bypassing copy and access restrictions, expiration, and other embedded technological controls. If not granted explicit permission, the repository may be unable to provide ongoing access through copying, migration, or reproduction.

Second, does the publisher or its successor reserve the right to remove or alter content from the archival institution under certain circumstances? If so, the archived content could be placed at risk. When asked whether agreements with publishers allow the repository to continue to archive content if the publisher is sold or merges with another company, seven programs answered “yes,” one answered “no,” and two were unsure. PubMed Central reported an instance when a publisher acquired one of the journals previously included and decided not to participate further, so new content has not been added. The content already in the repository remained. OhioLINK EJC’s publisher agreements make no mention of exceptions caused by future changes in ownership. Could their rights under these conditions be only indirectly protected? The KB e-Depot and kopal/DDB recommend that publishers continue to ensure compliance with archiving agreements in the event of mergers, buyouts, or discontinuation of publishing operations, but these recommendations are not legally binding. Elsevier reserves the right to remove content from the KB e-Depot if there is a breach of contract; the LANL-RL indicated that material received could be kept indefinitely, “as long as previously agreed-upon usage restrictions are adhered to.” CISTI Csi will seek to obtain a new agreement in the case of a merger or title transfer to a new publisher.13

Finally, are agreements with publishers regarding archival rights of limited duration? If so, the circumstances governing preservation responsibilities may be subject to change. Four of the twelve repositories reported that their contracts are of fixed, limited duration. These contracts are reviewed regularly, at which time they may be either renewed or canceled. The remaining contracts are of indefinite duration or automatically renewable; all have cancellation options.

Recommendations

  1. Once ingested into the digital archive repository, e-journal content should become the repository’s property and should not be subject to removal or modification by a publisher or its successor.
  2. In case of alleged breach of contract, there should be a process for dispute mediation to protect the longevity and integrity of the e-journal content.
  3. Contracts need to be reviewed periodically, because changes in publishers, acquisitions, mergers, content creation and dissemination, and technology can affect archiving rights and responsibilities. Continuity of preservation responsibility is essential.
  4. A study should be conducted to identify all necessary rights and responsibilities to ensure adequate protection for digital archiving actions, so that these rights are accurately reflected in contracts and widely publicized.
  5. Research libraries and consortia should pressure publishers to convey all necessary rights and responsibilities for digital archiving to e-journal archiving programs (i.e., the same rights should be conveyed in all archiving arrangements).

Indicator 3: Content Coverage

The repository should be explicit about which scholarly publications it is archiving and for whom.

Although this indicator seems to be straightforward, it is surprisingly difficult to identify what publications are being preserved and by whom. Six of the programs make public their list of publishers (OhioLINK EJC, PubMed Central, CLOCKSS, OCLC ECO, LOCKSS Alliance, Portico), three do so indirectly (KB e-Depot, CISTI Csi, Ontario Scholars Portal), and three do not (LANL-RL, NLA PANDORA, kopal/DDB). Even when the publishers are known, one should not assume that all journals owned by that publisher are included in the archiving programs. For instance, PubMed Central reported the largest number of publishers represented in its holdings, but the smallest number of titles of the 12 programs surveyed.

Locating a list of specific titles included is even more difficult. When asked whether they made an up-to-date, definitive list of titles available to the public, five programs responded “yes” (NLA PANDORA intersperses the list of journal titles with other content, with no ability to sort on e-journals only; the LOCKSS Alliance is building its list alphabetically by journal title). Five said “no” (the KB e-Depot and kopal/DDB indicated that they will archive all publications published in their respective countries). The remaining two programs plan to make such a list available. Further, even when the publications are listed, it is difficult to determine what date spans are included (only four repositories list this information) and how complete the contents of the publication are. For instance, the LANL-RL purchased backfiles of the Royal Society of Chemistry journals from their inception to 2004, but is not receiving current content for local loading and archiving and does not intend to purchase it. Table 3 shows the availability of title lists and date spans by e-journal archiving repository. Maintaining content currency is a moving target; all repositories indicated they expect to add new titles and, indeed, during the course of our investigation new titles and publishers were being added frequently.

Table 3. Responses to question “Do you make information about journal titles and date spans included in your program available to the public?” ( • = yes; P = plan to within six months)

The pace of consolidation within scholarly publishing also creates dilemmas for those attempting to chronicle the state of the industry at any one time. Ownership of publishing houses, imprints, and individual titles is in constant flux, making it difficult to accurately associate large lists of titles with the correct publisher. In recent years, large companies with no name recognition as publishers have swallowed up a number of venerable publishing houses. Should these titles continue to be listed under the familiar, original publisher or by the new owner? Particularly complex are cases wherein a publisher has sold a portion of its titles or entire imprints but held on to others.

When evaluating data from e-journal archiving initiatives, it is sometimes impossible to tell whether lists of participating publishers or the names of publishers associated with particular titles reflect current status or are based on legacy metadata. For example, some initiatives still list Academic Press as a separate entity, while others have incorporated its titles under the current owner, Elsevier. When an initiative lists titles from Kluwer, is it referring to Kluwer Academic Publishers, which was purchased by Springer from Wolters Kluwer in 2004, or to Kluwer Health, which is still part of the original firm and includes labels such as Adis International and Lippincott, Williams & Wilkins? If complete title listings are available, it may be possible (though onerous) to make such a distinction, but lists are not always available.

Thus, the publisher listings presented here should be viewed as nothing more than a fuzzy snapshot of circumstances on July 1, 2006. The kind of precision that would allow us to determine the archived status of specific titles and publishers is not possible given the market’s volatility and ambiguity in the current data.

Adding to the confusion about which titles and publishers are included in archiving initiatives is the fact that not all the “publishers” listed are truly publishers. Some are really aggregators—essentially republishers that provide electronic publication, marketing, and dissemination services for (usually) small scholarly societies that produce only one or a few titles and therefore benefit from aggregation to achieve visibility, critical mass, and state-of-the-art electronic publishing services.

Two prominent aggregators that turned up many times in our surveys are BioOne and Project MUSE. BioOne is a nonprofit aggregator that disseminates noncommercial titles in the biological, ecological, and environmental sciences. Most of the original publishers contracting with BioOne are scholarly societies and associations. As of July 1, 2006, BioOne handled 84 titles from 66 publishers. Even though none of the e-journal archiving initiatives we surveyed listed the American Association of Stratigraphic Palynologists as a publisher, its lone journal, Palynology, is included in LOCKSS Alliance, OhioLINK EJC, and Portico, by virtue of its contract with BioOne.

Project MUSE fills a similar niche for small publishers in the humanities, arts, and social sciences. Incorporating more than 300 journals from 62 publishers, predominantly university presses, as of July 1, 2006, Project MUSE provides a portal and search facility that brings together many related titles. But MUSE also boasts that it provides a “stable archive.” The overview on its Web site states the following:

It is a MUSE policy that once content goes online, it stays online. As the back issues of journals increase annually, they remain electronically archived and accessible. We also have a permanent archiving and preservation strategy, including participation in LOCKSS, maintenance of several off-site mirror servers, and deposition of MUSE content into third-party archives.

MUSE participates in LOCKSS Alliance, OhioLINK EJC, and OCLC ECO. So, despite the absence of the George Washington University Institute for Ethnographic Research on the publisher listings of any of the e-journal archiving initiatives included here, its journal, Anthropological Quarterly, is being archived.

Other aggregators that are participating in at least one of the archives include HighWire Press (which hosts nearly 1,000 titles from large and small publishers and is affiliated with LOCKSS Alliance), the LOCKSS Humanities Project, the History Cooperative, and ScholarOne, Inc.

With all these caveats in mind, the number of titles included in these 12 programs is impressive, exceeding 34,000, as shown in Figure 3.

Fig. 3. Approximate number of titles included in e-journal archiving programs

Because there is no definitive list of titles covered in all these programs, the degree of overlap in content coverage is unknown. We were able to identify 220 publishers mentioned as participating in one or more of the e-journal archiving programs under review. We omitted PANDORA because the NLA preserves only Australian publications and does not maintain e-journal publisher data separately. Figure 4 provides the total publisher count for each e-journal archiving program. Appendix 3 lists the publishers in each archiving program.

Fig. 4. Number of publishers included in the 12 e-journal archiving programs surveyed

The number of unique publishers in this pool is 128 (58% of the total). Of those, 91 (71%) are participating in only 1 program; 20 (16%) are involved in 2 programs. The major publishers are well represented in multiple arrangements. As Figure 5 reveals, 17 of them (13%) are involved in 3 or more programs and 6 of them (5%) are involved in 7 or more programs. Appendix 4 identifies the publishers included in more than one e-journal archiving arrangement.
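The percentages quoted above can be checked with a few lines of arithmetic. The sketch below is only a reader's cross-check, not part of the survey method; it assumes that the 220 figure counts publisher listings across all programs and that the per-program shares are computed against the 128 unique publishers. The variable names are illustrative.

```python
# Cross-check of the publisher-overlap figures quoted in the text.
# Assumption: 220 is the count of publisher listings across programs,
# and the remaining shares are taken against the 128 unique publishers.
total_listings = 220
unique_publishers = 128
in_one_program = 91
in_two_programs = 20
in_three_or_more = 17
in_seven_or_more = 6  # a subset of the "three or more" group


def pct(part, whole):
    """Return part/whole as a percentage rounded to the nearest integer."""
    return round(100 * part / whole)


shares = {
    "unique / listings": pct(unique_publishers, total_listings),
    "one program": pct(in_one_program, unique_publishers),
    "two programs": pct(in_two_programs, unique_publishers),
    "three or more": pct(in_three_or_more, unique_publishers),
    "seven or more": pct(in_seven_or_more, unique_publishers),
}

# The three disjoint groups should account for every unique publisher.
groups_total = in_one_program + in_two_programs + in_three_or_more

for label, value in shares.items():
    print(f"{label}: {value}%")
print("groups sum to unique publishers:", groups_total == unique_publishers)
```

Under these assumptions the rounded shares agree with the figures given in the text, and the one-, two-, and three-or-more-program groups together account for all 128 unique publishers.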

Although there may not be complete overlap in content in each program, it appears that there is much redundancy for the major publishers of STM e-journals, especially those in English, many of which have their own archiving programs. Other disciplines, smaller publishers (especially independent Web publications of a dynamic nature), and most material published in non-Roman alphabets are less represented in general and particularly in multiple arrangements. They are also less likely to have developed a full-fledged archiving program in-house.

Fig. 5. Publisher overlap

It is unclear what the trend toward amalgamation of smaller presses into larger entities will mean for digital archiving, but it might prove beneficial. Recognizing the extent of at-risk e-journals in the humanities, LOCKSS launched its Humanities Project in 2004. Selectors at a dozen research libraries are participating in the project to identify significant content in the humanities for preservation, and programmers at those institutions are developing the plug-ins needed to capture the content, once the relevant publishers sign on.14

In addition to being transparent about the list of journals included and the date spans covered for each journal, archiving programs should be explicit about the content captured at the journal level (see next section). Content captured can vary by publisher as well as by journal. Given the differing archiving approaches used, it is likely that the extent of content captured for a particular journal held by more than one archive will vary among archives.

Recommendations

  1. E-journal archive repositories need to be more overt about the publishers, titles, date spans, and content included in their programs. This information should be easily accessible from their respective Web sites.
  2. A registry of archived scholarly publications should be developed that indicates which programs preserve them, following such models as the Registry of Open Access Repositories (ROAR), which lists 667 open-access e-print archives around the world, and ROARMAP, which tracks the growth of institutional self-archiving policies.
  3. Research libraries should lobby smaller online publishers to participate in archiving programs and encourage e-journal programs to include the underrepresented presses; ideally, e-journal programs would cooperate to ensure that they share the responsibility to include these journals. (Only the LOCKSS Alliance allows a library to choose which publications to include.)
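To illustrate the kind of registry proposed in recommendation 2, the following is a minimal sketch, assuming a simple in-memory mapping from journal title to the programs that archive it. The class and method names are hypothetical and do not describe ROAR or any existing system. The two real entries reflect the aggregator examples given earlier in this section; the third entry is invented purely to show how thinly covered titles could be flagged.

```python
class ArchiveRegistry:
    """Hypothetical registry: which archiving programs preserve which titles."""

    def __init__(self):
        # Maps a journal title to the set of programs known to archive it.
        self._programs_by_title = {}

    def record(self, title, program):
        """Note that `program` archives `title`."""
        self._programs_by_title.setdefault(title, set()).add(program)

    def programs_for(self, title):
        """Return the programs archiving `title`, sorted by name."""
        return sorted(self._programs_by_title.get(title, set()))

    def titles_with_fewer_than(self, minimum):
        """Return titles archived by fewer than `minimum` programs."""
        return sorted(
            title
            for title, programs in self._programs_by_title.items()
            if len(programs) < minimum
        )


registry = ArchiveRegistry()

# Entries taken from the BioOne and Project MUSE examples in the text.
for program in ("LOCKSS Alliance", "OhioLINK EJC", "Portico"):
    registry.record("Palynology", program)
for program in ("LOCKSS Alliance", "OhioLINK EJC", "OCLC ECO"):
    registry.record("Anthropological Quarterly", program)

# Invented entry, for illustration only.
registry.record("Example Journal (hypothetical)", "Portico")

print(registry.programs_for("Palynology"))
print(registry.titles_with_fewer_than(2))
```

A production registry would also need to record date spans and the content captured for each title, since, as noted above, coverage can differ among archives holding the same journal.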

Indicator 4: Minimal Services

E-journal archiving programs should be assessed on the basis of their ability to offer a minimal set of well-defined services.

This indicator is among the most elusive to assess because there is no universally agreed-on set of requirements for digital preservation, no mechanism to qualify (or disqualify) archiving services, and no organized community pressure to require it, although promising work is under way.

In 2003, RLG and NARA established the RLG-NARA Digital Repository Certification Task Force to develop the criteria and means for verifying that digital repositories are able to meet evolving digital preservation requirements effectively. The task force built on the earlier work of the OAIS working groups, especially the Archival Workshop on Ingest, Identification, and Certification Standards. In September 2005, RLG issued the task force’s draft Audit Checklist for Certifying Digital Repositories for public comment. The checklist provides a four-part self-assessment tool for evaluating the digital preservation readiness of digital repositories. A revised version of the checklist is planned for release by the end of 2006.

To further the digital preservation community’s certification efforts, The Andrew W. Mellon Foundation awarded a grant to fund the Certification of Digital Archives project at CRL. This project used the draft RLG audit checklist as a starting point for conducting test audits for four archival programs: Portico, LOCKSS Alliance, the Inter-University Consortium for Political and Social Research, and the KB e-Depot. The results of these test audits are informing the revision of the checklist. The project’s final report, also scheduled for release by the end of 2006, will include recommendations for future developments in the audit and certification of digital repositories.

The Digital Curation Centre in the United Kingdom is conducting test audits of three digital repositories. It has a particular interest in and focus on the nature and characteristics of evidence to be provided by an organization during an audit to demonstrate compliance with the specified metrics. An interesting aspect of its approach is the value and use of evidence provided by observation and testimonials (Ross and McHugh 2005, 2006).

Germany is developing a two-track program for certification. DINI (Deutsche Initiative für Netzwerkinformation), a German coalition of libraries, computing centers, media centers, and scientists, encourages institutions to adopt good repository management practices without being overly prescriptive—steps that would lead to soft certification. The aim of soft certification is to motivate institutions to improve interoperability and gain a basic level of recognition and visibility for their repositories. The nestor project (Network of Expertise In Long-term STOrage of Digital Resources) is investigating the standards and methodologies for the evaluation and certification of trusted digital repositories and embodies rigorous adherence to requirements, leading to hard certification. The principles embraced by the nestor team include appropriate documentation, operational transparency, and adequate strategies to achieve the stated mission. DINI focuses on document and publication repositories at universities for scientific and scholarly communication and had issued 19 certifications as of July 2006. Nestor’s scope goes beyond the realm of higher education and also targets repositories in national and state libraries and archives, museums, and data centers. Nestor is finalizing its certification criteria and has not yet issued any certificates (Dobratz and Schoger 2005; Dobratz, Schoger, and Strathmann 2006).15

It is not now possible for digital archiving programs to be certified, but when asked whether they would seek to become certified once such a process is in place, five of the e-journal archiving programs indicated they would, one indicated it would not, and five were uncertain or unaware of the certification effort. Table 4 reports their responses.

Table 4. Responses to question: “Will you seek to become a certified repository?” ( • = yes)

In the absence of a certification process, adherence to digital preservation standards is a potential gauge of the technical viability of a program. Some existing digital preservation standards and best practices provide pieces of the puzzle.16 We asked the surveyed repositories whether they were adhering to or planning to follow some of the key standards in the next six months. Table 5 lists these standards and best practices and provides the repositories’ responses. Of interest is that only 5 of 11 programs report adherence to OAIS, an International Organization for Standardization (ISO) standard that is gaining strong purchase in the digital preservation community. NLA PANDORA sees compliance with standards as a long-term goal and aligns with them as much as possible.

Table 5. Responses to question: “Do you follow any of the following standards and best community practices for archiving?” ( • = yes; P = plan to within six months)

Despite the lack of a means to certify the operation of digital repositories, enough conceptual work has been done to identify minimal expectations of best practices for a less rigorous standard—that of a well-managed collection. Measures such as an effective ingest process with minimal (even manual) quality control, acquiring or generating minimal metadata for digital objects in collections, maintaining secure storage with some level of redundancy, establishing protocols for monitoring and responding to changes in file format and media standards, and creating basic policies and procedural documentation—all acknowledge and address fundamental threats to digital document longevity.

There is widespread agreement about the nature of those threats—information technology (IT) infrastructure failure (hardware, media, software, and networking), built environment failures (plumbing, electricity, and heating, ventilation, and air conditioning), natural disaster, technological obsolescence, human-induced data loss (whether accidental or intentional, internal or external in origin), and various forms of organizational collapse (financial, legal, managerial, societal). There is far less uniformity of thought about the best means to confront each threat, or even which approaches should be considered effective to provide minimal protection.

Not surprisingly, therefore, the programs we surveyed, despite claiming a similar mandate, have chosen a variety of ways to carry it out. The diversity of approaches is healthy and useful, since only time and experience will tell us which techniques are effective. It is critical, however, that existing programs honestly and accurately document their successes and failures. The need for a risk-free mechanism to report negative results was noted in a previous CLIR report, which recommended “establishing a ‘problems anonymous’ database that allows institutions to share experiences and concerns without fear of reprisal or embarrassment” (Kenney and Stam 2002). The recommendation to establish such a system arose again in a more recent paper, which suggested the National Aeronautics and Space Administration’s Aviation Safety Reporting System as a possible model (Rosenthal et al. 2005b). We heartily endorse these recommendations and believe that the community should place high priority on creating such a reporting system soon. The only way we will learn about the efficacy (or lack thereof) of various approaches is by having truthful reporting of experiences.

Short List of Minimal Services

As a starting point for documenting the digital preservation services being executed by the programs under review, we chose to assess them by five technical requirements laid out in the “Urgent Call to Action” statement, plus an additional requirement that we believe qualifies for the “short list” of minimal services:

  • receive files that constitute a journal publication in a standard form, either from a participating library or directly from the publisher;
  • store the files in nonproprietary formats that could be easily transferred and used should the participating library decide to change its archives of record;
  • use a standard means of verifying the integrity of ingoing and outgoing files, and provide continuing integrity checks for files stored internally;
  • limit the processing of received files to contain costs, but provide enough processing so that the archives could locate and adequately render files for participating libraries in the event of loss;
  • guard against loss from physical threats through redundant storage and other well-documented security measures; and
  • offer an open, transparent means of auditing these practices.

Our discussion of these services presumes that programs should address not only what the services consist of but also how they intend to implement them.

Receive files that constitute a journal publication in a standard form, either from a participating library or directly from the publisher. This ingest-focused requirement encompasses at least two major elements. The first deals with the standard form that received files take. Before delving into specific standards, it is necessary to distinguish two basic approaches that e-journal archiving programs can use to receive the files that constitute a journal publication from the publisher. The most common approach is often referred to as “source-file archiving.” In it, the archival agency receives from the publisher the files that constitute the electronic journal. These could be the standard generalized markup language (SGML) files used to produce the printed volumes or the word processing or extensible markup language (XML) files used by the publisher to produce both printed and online products, such as portable document format (PDF) files. Graphic files and supporting material can also be included. In some cases, the files sent to an archival agency can be more complete than what is actually published. For example, a high-resolution image could be preserved even though a lower-resolution image is used on an online access site. PubMed Central and Portico are focused on preserving the source files received from the publishers.

A second approach is to receive the files that constitute the journal as published electronically. We call this approach “rendition archiving,” since it focuses on preserving the journal in the form made available to the public. PDF files are the most common format for displaying journals as published, although some programs also receive the HTML and image files that are used to display a journal to readers. All the programs we surveyed welcome the submission of rendition files, and some, such as OCLC ECO, NLA PANDORA, and the LOCKSS Alliance, are based entirely on preserving and delivering the content as published. The LOCKSS Alliance and NLA PANDORA are special cases of rendition archiving. Rather than relying on rendition files provided by the publisher, they harvest (with the permission of the publishers) files from the publishers’ Web sites.

Each of these approaches has advantages and disadvantages. With source archiving, the most complete version of the e-journal content is preserved. Furthermore, as is discussed in detail below, source-file content is often either delivered in or converted to a few normalized formats, on the assumption that it will be easier to ensure the long-term accessibility of standardized and normalized files. One disadvantage to source archiving is that it requires a large up-front investment, with no assurance that the archive will ever actually be needed. In addition, the presentation of the e-journal content will almost certainly differ from that of the publisher; the “look and feel” of the journal will be lost.

Rendition archiving can maintain the look and feel of the journal, but it may be harder to preserve the content. No one knows, for example, what an effective migration strategy for PDF documents might be. In addition, it may be difficult to preserve the functionality of a dynamic e-journal if harvesting screen “scrapes” of static hypertext markup language (HTML) pages is the preferred ingest solution. On the plus side, the initial costs associated with preserving rendition files are likely to be lower (and, in the case of the harvesting projects, much lower). Migration, normalization, and other preservation activities need take place only when actually needed.

At this point, it is impossible to say which of these two approaches is the better solution to archiving. Those programs that solicit both source files and rendition copies of e-journal content (PubMed Central, Portico, KB e-Depot, kopal/DDB) probably offer the safest archiving solution, but at a potentially greater cost.

Since text structure is the aspect of journal publishing that has been subject to the greatest standardization effort, source files are the type most commonly produced in a standard form. Several SGML and XML DTDs (document type definitions) have been devised specifically to support publishing of scholarly journal articles. One of the most popular is the NLM/NCBI (National Library of Medicine/National Center for Biotechnology Information) Journal Archiving and Interchange DTD. The full Journal Archiving and Interchange DTD Suite also includes modules that describe the graphical content of journal articles and certain nonarticle text, including letters, editorials, and book and product reviews. Acceptance of the Journal Archiving and Interchange DTD received a major boost in April 2006 when LC and the British Library announced support for the migration of electronic journal content to the NLM DTD standard, “where practicable” (Library of Congress 2006).17 Four of the programs we surveyed currently use the NLM DTD.

Use of XML and SGML with DTDs designed for journal articles and other components has implications for “standard form” of structure and interchange capability at the lowest levels. The definition of a character in the XML specification is based on the Unicode set. We queried the programs about the Unicode compatibility of their systems and found that at least some components of legacy systems (ScienceServer sites in particular) lacked it. With many publishers now supplying both journal content and metadata in XML, this has caused problems, particularly with the display of bibliographic data for some access-driven programs. We heard complaints that publishers had made the switch to Unicode compliance without giving the archive enough time to adjust its ingest procedures, resulting in incompatibilities. Two archives (PubMed Central and Portico) mentioned that despite being fully Unicode compliant, they could not support non-English metadata because of limitations in their ability to perform quality control and, in PubMed Central’s case, because the search-and-retrieval system is based on English-language indexing and text matching.

Given that many of the programs profiled here are research driven, it is not surprising that they are trying to break new ground in repository development. Consequently, some of the “standard forms” used in the programs are unique to them. In LANL-RL’s new aDORe repository, digital objects are represented using MPEG-21 DID (digital item declaration) and stored in an XML tape, while kopal/DDB has developed a Universal Object Format (Steinke 2006) for archiving and exchange of digital objects. Unfortunately, nothing yet qualifies as “universal” when it comes to digital objects. (As a cynic once said, “The nice thing about standards is that there are so many to choose from.”) Until digital repository design matures and stabilizes, exchange of complex digital objects (i.e., archival information packages, or AIPs) among repositories will be less than transparent. However, proposals are emerging for facilitating the exchange of complex digital objects between repositories and archives.18 Experimentation with a variety of approaches is appropriate at this stage of archive development. We also recommend that e-journal archives using different standards begin examining interoperability issues for digital objects and metadata, with an eye on maximizing compatibility.

There is as yet no standard form for rendition files. Although many programs prefer, and some require, files to be delivered as PDFs, no specific version of PDF is required. No program requires that PDFs adhere to ISO 19005-1 (PDF/A-1), and we are not aware of any major publishers that offer their files in that format.

Asked about the existence of file-format requirements (or preferences) for ingest, eight programs said they have such requirements, and half of them provided us with technical documentation describing them. Four do not (LOCKSS Alliance, Ontario Scholars Portal, NLA PANDORA, Portico). LOCKSS Alliance and NLA PANDORA harvest files from the Web and take whatever content can be delivered through Web protocols.

The second major element of this minimal service is the receipt of “files that constitute a journal publication.” Identifying the entirety of a journal publication in print is a straightforward matter, but the components of e-journals are more varied both in form and content and are far less tightly bound together. The lack of an established standard for what constitutes the essential parts of an e-journal was made abundantly clear by the nonuniform responses to our questions about which journal content types and features each archiving program includes (see Table 6).

Table 6. Journal content types and features

All said they include research articles and errata, but beyond that there was no consistency. Although most said they maintain “whatever the publisher sends,” many do not include advertisements (which are often generated on-the-fly in a user-dependent manner) and certain other non-editorial content. Some do not capture supplemental materials, and even fewer are able to capture external features associated with publisher Web sites, such as discussion forums and other interactive content. Although it encourages the deposit of all journal components, PubMed Central, for example, requires only that research articles be provided; the presence of other kinds of content may vary among publishers, and even among titles.

The programs are aware that different publishers send different kinds and numbers of files for each title, but they seem less aware of what those components are. Survey comments made it clear that some responses to this question were guesses. Particularly for the access-driven programs, the focus is primarily research articles. Several respondents said that although they keep everything they receive, they are not necessarily able to provide access to all components.

There is likewise considerable variability within programs, because publishers have different definitions of what constitutes a complete e-journal. With no means to standardize journal components, and given that publishers are generally unable to provide manifests of how many files of what type the archive is supposed to be receiving, uncertainty at the receiving end is inevitable. Several programs noted that the lack of publisher manifests was a big problem. There is less ambiguity with programs that harvest content from publisher Web sites (NLA PANDORA and LOCKSS Alliance). Since the content is coming directly from the publisher’s officially disseminated version, the only potential for missing components is if the harvesting itself is incomplete.

Users read and access the content of e-journals very differently than they do print journals (Olsen 1994). As more scholarly publishers eliminate print versions of their titles, it is possible that certain once-common features, such as advertisements or conference announcements, will be dropped or disseminated by different means (e.g., blogs or RSS feeds). The scholarly publishing landscape is not stable enough to prescribe what components (at minimum) constitute a journal publication in electronic form. But publishers need to do a better job of specifying exactly what they call a complete issue, and archiving programs need to pay more attention to exactly what they are receiving.

Store the files in nonproprietary formats that could be easily transferred and used should the participating library decide to change its archives of record. Use of nonproprietary formats has long been recognized as a strategy to fight obsolescence and improve the portability of digital objects. Depending on the ingest and archive approach of a particular program, the role of nonproprietary formats may be to

  • take everything and store it in the supplied format (e.g., OhioLINK EJC, Ontario Scholars Portal, LOCKSS Alliance);
  • take everything (or nearly so), preserve the original, but normalize it on ingest (e.g., Portico); or
  • require use of a particular format or formats for deposit (e.g., PubMed Central, KB e-Depot, OCLC ECO).

The choice of preferred formats varies. Some require a form of XML (PubMed Central) or one that can be converted to XML (Portico), for articles, metadata, or both. Others accept PDF as the primary deposit format (OCLC ECO, KB e-Depot, OhioLINK EJC, CISTI Csi) or as an optional secondary format (PubMed Central). PDF is widely regarded as so open a specification that it is deemed nonproprietary. The lack of any credible competitor has made PDF seem a safe choice for long-term archiving, as evidenced by the work on PDF/A-1 and now PDF/A-2. However, the PDF specification is owned by Adobe, and recent events have slightly clouded the picture around it. Microsoft has announced the development of a competing product called XPS (XML paper specification), an XML-based document format with many similarities to PDF. In June 2006, Microsoft reported that Adobe had threatened a lawsuit if plans to incorporate the ability to save as PDF into Office 2007 were carried out. Adobe denied making such a threat and said that its primary concern was that Microsoft would produce PDFs that strayed from its specification. Regardless of whom one believes, the bottom line is that no file format, no matter how open or popular, can be deemed permanently “safe.”

The survey addressed the ability of programs to archive a variety of text, still image, and multimedia (sound and moving image) file formats (Tables 7–9). The gamut ranged from format-agnostic initiatives such as LOCKSS Alliance, which archives any format a publisher can make available through Web protocols, to prescriptive operations, such as PubMed Central, which requires submitted content to be in either XML or SGML. Just because a program says it accepts a format in its archive does not mean that it has the ability to provide access to it. For example, programs using an older version of ScienceServer software (three programs, at the time of our survey) are largely limited to displaying PDF, Tagged Image File Format (TIFF), and some XML files.

Table 7. Text formats and page description languages accepted (P = plan to accept within six months)

Table 8. Still-image formats accepted

Table 9. Other formats accepted

Effective transfer of archives content between programs requires more than simply using nonproprietary file formats. XML comes in many different flavors, with an external specification (the DTD) determining how the content should be interpreted. Metadata are moving toward standardization of both content and format, but metadata standards still vary widely among e-journal archives. Thus, even if we achieved universal adoption of nonproprietary file formats, easy transfer will be possible only with greater standardization of externalities and the containers that surround the basic digital objects.

Use a standard means of verifying the integrity of ingoing and outgoing files, and provide continuing integrity checks for files stored internally. This specification presumes that there is a standard means of determining and maintaining integrity, but our survey suggests that this area is ill-defined. Procedures for integrity testing differ greatly across the programs. Completeness testing can be automated or manual, and no two programs do it exactly the same way. Some test at the volume level, some at the issue level, and some at the article and article-component level. Some use byte counts while others use markup callouts. Only LOCKSS/CLOCKSS appears to have a system that incorporates a publisher’s manifest for each transaction. Integrity testing at ingest is similarly nonstandard. Some programs use checksum comparisons or network transfer protocols that employ checksums (e.g., ftp). Others rely on random sampling with visual inspection or validation. LOCKSS boxes can do comparisons with both publisher sites and other LOCKSS boxes containing the same content.
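As an illustration of the checksum comparisons mentioned above, the following is a minimal sketch of a fixity check that could be run both at ingest and periodically afterward. The function names and the choice of SHA-256 are assumptions made for this example; the survey did not record which algorithms or tools each program uses.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path: Path, recorded_digest: str) -> bool:
    """True if the file's current digest matches the digest recorded earlier."""
    return sha256_of(path) == recorded_digest.lower()
```

In such a scheme the digest would be recorded once, when the file is received, and every later call to `verify_fixity` would compare the stored file against that recorded value.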


Table 10. Responses to question: “Do you conduct validation/testing?” ( • = yes; N/S = not sure; P = plan to within six months)

Even though there are considerable differences in conducting completeness and integrity tests at ingest, ongoing integrity testing reveals the greatest divisions among the programs (see Table 10). Some lack any means for doing ongoing integrity testing. Several programs do periodic integrity checks using checksums. Although some access-driven programs conduct automated integrity checks, a prevailing view of those programs is that daily use by the constituency is the most effective way to uncover problems with individual files. At the same time, operators of access-driven programs are skeptical that a dark archive can be properly maintained and ready for active use at any time simply by testing static properties of the content. They argue that usage patterns are ever-evolving and are themselves an essential part of curation. PubMed Central articulated this view most clearly:

PMC operates on the philosophy that the best way to ensure the integrity of archived content is to use it directly, actively and continuously. Effective use of the content by humans and by automated processes proves the integrity and continued usability of the content. Therefore, the archive is made freely available to all users, encouraging repeated use—by between 50,000 and 90,000 different users each day and an estimated 1.5 million or more users a month. HTML views of articles are generated dynamically, directly from the archival XML copy, thus proving its integrity.

Changing usage modalities reveal incremental problems in the data and allow them to be addressed before becoming massive and insurmountable. The bottom line is that there is a continuously ongoing process of archive curation.

Writing from a LOCKSS perspective, Rosenthal et al. (2005b) counter that relying on access alone as a means of integrity testing is inadequate because most items in an e-journal repository are infrequently used. The reliability of this approach is further called into question by the fact that one of the access-driven programs had a known problem (involving Unicode compatibility) that caused some bibliographic data to display as gibberish and yet logged no complaints from users. To obtain the greatest benefit from use testing, access systems should be designed to encourage and facilitate the reporting of integrity problems by users (Marty and Twidale 2000). Preservation-driven programs, however, can face resistance from publishers who may oppose regular use-based testing that does not derive from a trigger event (Honey 2005). Ultimately, both access-driven and preservation-driven programs need a combination of routine automated checks and regular review by a variety of users to maximize the benefits of integrity testing.

Limit the processing of received files to contain costs, but provide enough processing so that the archives could locate and adequately render files for participating libraries in the event of loss. Data are not yet widely available on the relative cost of file processing within digital repositories and the impact of various procedures on long-term renderability of files. Consequently, it is impossible to identify which programs have found the best balance between cost savings through minimizing file processing, and sufficient investment in metadata creation, integrity testing, and techniques to fight obsolescence. We can, however, look at examples of different approaches to limiting file processing and speculate about their impact on efficiency of operations. Three approaches stand out:

  • automating manual processes,
  • offloading tasks to parties outside the archive, and
  • making architectural decisions (e.g., about repository design, normalization, digital preservation strategy).

In operating and maintaining an e-journal archive, there are several steps with the potential to require large amounts of file processing. These include integrity and completeness validation at ingest, metadata creation at ingest, ongoing integrity testing, and responding to file-format obsolescence. The following paragraphs look at each of these activities in relation to the efficiency strategies mentioned above.

Integrity testing and completeness validation at ingest. These procedures are still conducted manually at many of the archives, even by programs with otherwise high levels of automation. Maintaining quality control at the point of ingest is sufficiently complex and important to warrant the time and expense of manual labor. If the completeness and integrity of content are not established at this point, the archive’s ability to “locate and adequately render files for participating libraries” is substantially compromised. Tools for automating validation, such as JHOVE, are becoming available, and some archives are using them; Portico and the KB e-Depot both report using JHOVE in their workflows. However, there are limits to what automated validation can do, and a file deemed by JHOVE to be valid and well formed is not necessarily error-free.

Survey comments indicated that archives want more help from publishers in facilitating ingest. Archives would like publishers to provide a detailed manifest of the contents of each issue so that they have something against which to gauge completeness. The LOCKSS Alliance and CLOCKSS use an automated procedure to validate that everything the publisher made available has been collected. But that automated process would not be possible without the cooperation of the publisher (which creates a manifest page) and without the design of an architecture that supports this kind of testing as well as recovery from an error situation. So, LOCKSS/CLOCKSS combines all three approaches for maximizing the efficiency of completeness testing at ingest.
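The manifest-based completeness test described above can be sketched in a few lines. This is a simplified illustration, not the LOCKSS/CLOCKSS implementation: the manifest is assumed to be a plain list of file names, and the `CompletenessReport` type and `check_against_manifest` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletenessReport:
    missing: frozenset      # listed in the manifest but not received
    unexpected: frozenset   # received but not listed in the manifest

    @property
    def complete(self) -> bool:
        """An issue is complete when nothing in the manifest is missing."""
        return not self.missing


def check_against_manifest(manifest_entries, received_files) -> CompletenessReport:
    """Compare what the publisher says it sent with what actually arrived."""
    expected = frozenset(manifest_entries)
    received = frozenset(received_files)
    return CompletenessReport(
        missing=expected - received,
        unexpected=received - expected,
    )
```

Without a manifest there is no `expected` set to compare against, which is why several programs described its absence as a major obstacle to automating this step.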

Metadata creation. Many see metadata creation as the most onerous step in digital repository management. There is a temptation to generate a lot of metadata (a tendency not discouraged by the size of the PREMIS data dictionary), on the presumption that “more is better” when it comes to managing digital files. However, there are significant costs in creating metadata, as well as ongoing costs for its maintenance and preservation. Some argue forcefully that hand-generated format and bibliographic metadata do not add enough value to merit the effort they require, relative to automated capture of the same class of data (Rosenthal et al. 2005b). LOCKSS uses completely automated metadata collection and believes that what it gets is good enough (although it notes that others disagree) and that the savings from forgoing a more aggressive metadata-creation policy are better used in preserving additional content.

Automation is clearly an option for increasing the efficiency of metadata creation. Tools such as DROID, JHOVE, and the National Library of New Zealand Metadata Extraction Tool can aid in file-format identification as well as in extraction of deeper technical characteristics. Thus far, automated characterization is limited to a few popular file formats, but for most collections, that is probably adequate to deal with a distribution model in which 80% of the files are represented by a few common formats. Considerably more testing and experience with these tools are needed to improve their efficiency, learn their limitations, and develop best-practice guidelines for their deployment.
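To make the idea of automated format identification concrete, the sketch below recognizes a handful of common formats by their leading "magic" bytes and reports how a collection is distributed across them. It is a toy illustration only and does not reproduce the behavior or interfaces of DROID, JHOVE, or the National Library of New Zealand tool; the signature table is deliberately short, and the function names are assumptions.

```python
from collections import Counter

# Leading byte signatures for a few widely used formats.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"II*\x00", "TIFF"),            # little-endian TIFF
    (b"MM\x00*", "TIFF"),            # big-endian TIFF
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"<?xml", "XML"),
]


def identify_format(header: bytes) -> str:
    """Guess a format name from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def format_shares(headers) -> dict:
    """Map each identified format to its share of the collection."""
    counts = Counter(identify_format(h) for h in headers)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.most_common()}
```

A report of this kind shows at a glance whether a few formats really do account for most of a collection, and how large the "unknown" remainder is that would need manual attention.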

Since truly reliable automated means for extracting bibliographic and other forms of nontechnical metadata have yet to be perfected, such information should ideally be provided by the data submitter. If the publisher can be convinced to provide metadata in a standard format, so much the better.

Ongoing integrity testing. Several aspects of ongoing integrity testing, especially fixity verification, are routinely automated. KB e-Depot, Portico, kopal/DDB, and NLA PANDORA reported using checksums. The LOCKSS architecture uses a more robust system in which checksums are regularly generated and compared with newly generated checksums on peer LOCKSS boxes with the same content. If a discrepancy arises, a voting system is used to determine which box has the corrupted file, and that file is then replaced with a copy deemed “good.” The entire process is automated (Maniatis et al. 2003).
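The general idea of comparing checksums across peers and repairing the outlier can be sketched as a simple majority vote. This is a deliberately simplified illustration and not the LOCKSS polling protocol described by Maniatis et al., which involves considerably more machinery; the function names and the in-memory `replicas` mapping (box name to content bytes) are assumptions for the example.

```python
import hashlib
from collections import Counter


def digest(content: bytes) -> str:
    """Hex SHA-256 digest of one replica's content."""
    return hashlib.sha256(content).hexdigest()


def audit_replicas(replicas: dict):
    """Return (winning digest, sorted names of boxes that disagree with it)."""
    if not replicas:
        raise ValueError("no replicas to audit")
    digests = {box: digest(content) for box, content in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        # Without a strict majority there is no basis for choosing a good copy.
        raise ValueError("no strict majority; cannot decide which copy is good")
    damaged = sorted(box for box, value in digests.items() if value != winner)
    return winner, damaged


def repair_replicas(replicas: dict) -> dict:
    """Return a new mapping in which dissenting boxes hold the majority copy."""
    winner, damaged = audit_replicas(replicas)
    good_copy = next(c for c in replicas.values() if digest(c) == winner)
    repaired = dict(replicas)
    for box in damaged:
        repaired[box] = good_copy
    return repaired
```

The sketch refuses to repair anything when no strict majority exists, since in that case it cannot tell which copy is the damaged one.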

Some programs (OhioLINK EJC, Ontario Scholars Portal, CISTI Csi) have, in effect, offloaded the task of ongoing integrity testing to their users. Such an approach reduces costs by eliminating the programming and processing needed to implement and carry out automated checks, but it may leave large portions of a repository’s content vulnerable to undetected corruption or loss. This is the case because standard usage patterns suggest that most articles will be infrequently accessed and because users tend to be unreliable at reporting data integrity problems unless empowered to do so (Marty 2005). Thus, opting to maintain data integrity by relying primarily on user feedback rather than other techniques may not be a good trade-off between cost savings and maintenance of long-term renderability.

Responding to file-format obsolescence. The role of repository architecture in streamlining operations comes to the fore in the design of procedures to respond to file format obsolescence. The options include the following:

  • offloading some normalization responsibilities to the publisher (PubMed Central, KB e-Depot, OCLC ECO, OhioLINK EJC);
  • normalization on ingest (Portico, PubMed Central, Ontario Scholars Portal);
  • migration on-the-fly/just-in-time migration (LOCKSS Alliance, LANL-RL);
  • batch migration/just-in-case migration (OhioLINK EJC, PubMed Central, OCLC ECO); and
  • emulation (KB e-Depot, kopal/DDB, and NLA PANDORA).

The differences are even finer than these options suggest. For example, both PubMed Central and OhioLINK EJC request publisher normalization before ingest, but their strategies are very different. PubMed Central asks for partial normalization (publisher files delivered as XML or SGML based on an accepted journal publishing DTD), which it then fully normalizes to the NLM DTD. OhioLINK EJC, because its access software can handle only a limited range of file formats, requests that publishers normalize to one of those formats (typically PDF or XML) so that it can display the files to users. It does no internal normalization but assumes it will eventually have to do a batch migration of its currently used formats to more-modern formats. Thus, in the short term, PubMed Central has to process any file not already using the NLM DTD; later, it will have to batch-migrate its entire collection each time there is a significant change in the NLM DTD. OhioLINK EJC has essentially no up-front overhead for file-format management, but will eventually face multiple batch-migration operations when its prenormalized formats are no longer supported.

Strategies that envision doing on-the-fly migration also differ in implementation details. LOCKSS anticipates maintaining a suite of converters that will be called as needed, depending on whether an HTTP query indicates that the browser can handle the existing file format or not (Rosenthal et al. 2005a). LANL-RL, on the other hand, uses changes in the metadata envelope to indicate how a file should be decoded. Which technique will be judged more efficient and effective remains to be seen, since neither has had sufficient use in operational repositories to prove itself.
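The first of these strategies, choosing a converter according to what the requesting browser says it can handle, can be illustrated with a short sketch based on the HTTP Accept header. It is an assumption-laden simplification rather than the LOCKSS implementation: the `converters` registry, keyed by (stored type, target type), and all function names are hypothetical, and the parsing treats every media type with a nonzero q-value as acceptable.

```python
def accepted_types(accept_header: str) -> set:
    """Media types listed in an Accept header with a nonzero q-value."""
    types = set()
    for item in accept_header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            if param.lower().startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0  # treat an unparsable q-value as "not acceptable"
        if q > 0:
            types.add(media_type)
    return types


def browser_accepts(media_type: str, accepted: set) -> bool:
    """True if the type is listed explicitly or covered by a wildcard."""
    major = media_type.split("/")[0]
    return media_type in accepted or f"{major}/*" in accepted or "*/*" in accepted


def choose_delivery(stored_type: str, accept_header: str, converters: dict):
    """Return (type to deliver, converter or None if no migration is needed)."""
    accepted = accepted_types(accept_header)
    if browser_accepts(stored_type, accepted):
        return stored_type, None
    for (source, target), convert in converters.items():
        if source == stored_type and browser_accepts(target, accepted):
            return target, convert
    return stored_type, None  # nothing suitable; fall back to the stored file
```

Because many browsers include wildcards such as `*/*` in their Accept headers, a production system would need a more careful policy than this sketch for deciding when a stored format should be considered unsupported.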

There are prospects for automating portions of the process of coping with file format obsolescence. XENA (XML Electronic Normalizing of Archives), a tool from the National Archives of Australia that facilitates normalization to XML-based formats, is now in its third postproduction release. None of the programs surveyed use XENA, which is not surprising since it is geared toward normalizing office-type documents rather than e-journal articles. However, one could imagine its utility for normalizing image files or supplemental data files that accompany some journal articles.

Another potential means for automation is the preservation-planning component of PRONOM 5b from the U.K. National Archives, slated for release in December 2006. According to the description, “The system will . . . focus on the development of migration pathways for the automatic conversion of electronic records to new formats as required for preservation or presentation purposes” (PRONOM 2006).

Three programs (KB e-Depot, kopal/DDB, and NLA PANDORA) said they would use emulation as a means of coping with file-format obsolescence, though not to the exclusion of other techniques. A pair of studies published in RLG DigiNews deals directly with the competing interests represented by this minimal service: long-term usability versus cost of maintenance. Hedstrom and Lampe (2001) compared migration and emulation in terms of renderability; Oltmans and Kol (2005) compared them in terms of cost, providing some insight into the potential trade-offs between the two approaches.

Hedstrom and Lampe measured user satisfaction in response to both a migrated and an emulated form of a computer game. They found no statistical difference between users’ perceptions of how well each approach preserved the game’s look and feel. However, the authors concluded:

Further research on the effectiveness of emulation and migration needs to account for the quality of the emulator, the impact of specific approaches to migration on document attributes and behaviors, and on numerous aspects of the original computing environment that may affect authenticity and user experience.

Studies making similar comparisons between migrated and emulated components of scholarly e-journal articles, as well as user response to the repositories employing the different strategies, should help sort this out.

The Oltmans and Kol study, conducted as part of the KB e-Depot’s research-and-development efforts, compared the projected costs of maintaining renderability of a large collection of digital objects over 50 years through either migration or emulation. The authors’ model presumes higher up-front costs for emulation (mostly for emulator development), but cost savings from eliminating the need to periodically migrate every file soon thereafter tilt the advantage significantly toward emulation. At the end of 50 years, depending on the archive’s size and other parameters, the authors predict that migration will be up to twice as expensive as emulation.
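The shape of this argument, a high initial outlay for emulation set against recurring migration costs, can be sketched with a toy calculation. Every parameter below, including the annual emulator upkeep, is an assumption, and the numbers in the usage note are invented units chosen only to show a crossover. None of them is taken from the Oltmans and Kol model.

```python
def cumulative_costs(years, storage_per_year, migration_every, migration_cost,
                     emulator_upfront, emulator_upkeep_per_year):
    """Return (migration_total, emulation_total) after `years` years.

    All inputs are hypothetical cost units, not figures from the study.
    """
    migration_total = 0.0
    emulation_total = float(emulator_upfront)  # emulator is built up front
    for year in range(1, years + 1):
        # Both strategies pay for storage every year.
        migration_total += storage_per_year
        emulation_total += storage_per_year + emulator_upkeep_per_year
        # Migration adds a collection-wide conversion at fixed intervals.
        if year % migration_every == 0:
            migration_total += migration_cost
    return migration_total, emulation_total
```

With the invented inputs `cumulative_costs(5, 10, 5, 100, 200, 5)`, emulation is the more expensive strategy (275 units against 150); with the same inputs over 50 years the order reverses (950 against 1,500). Different assumptions would move or remove the crossover point.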

Regardless of the conclusions of these early studies, considerably more time and experience with large collections is needed before the relative merits of the different approaches to file-format obsolescence can be determined with any certainty. Most of the programs have only done small-scale testing or proof-of-concept exercises, particularly with regard to migration and emulation. Table 11 summarizes the programs’ responses about the archiving strategies they use now or will adopt, when necessary.

Table 11. Responses to question: “What type of archiving strategies do you use or plan to use?”

Whether we will learn which of these strategies best balances production efficiencies with protection of users’ interests in the integrity of stored files depends heavily on how open the repositories are willing to be about their operations. Some archives are ingesting files that they currently have no means to render or disseminate or have no plan to migrate to more-manageable formats. Careful scrutiny and diligent reporting will be needed to ensure that such files are not forgotten or marginalized.

Guard against loss from physical threats through redundant storage and other well-documented security measures. Potential loss from physical threats is easily the best-understood and most widely appreciated aspect of digital preservation. Since the advent of digital-storage technology, IT professionals and casual computer users alike have maintained backup copies as a bulwark against the ephemeral nature of digital information and its vulnerability to a raft of destructive forces.

Redundancy provides an important hedge against immediate, large-scale data loss. In practice, redundancy can take many forms. Although local backups provide a convenient second source in cases of media or hardware failure, they are of limited value in cases of natural disaster, infrastructure failure, or any other widespread destruction. Awareness of the need for off-site storage (at a sufficient distance to preclude loss of primary and secondary copies in the same disaster) has noticeably increased in the aftermath of recent natural disasters (hurricanes, tsunamis, earthquakes) and political upheaval (Entlich 2005). An additional level of redundant security is the use of mirror sites, which not only hold an off-site copy of primary data (sometimes updated in real time) but also replicate the entire IT infrastructure so that they can substitute for the primary site should it become unavailable. Mirror sites are particularly important for those programs providing current access, since restoration of data from backup copies can be extremely time-consuming. Ontario Scholars Portal reported that it would take months to restore its terabyte-size primary online data store from backup tapes.

We asked each program about its use of local backups, off-site storage, and mirror sites, and about the total number of redundant copies of the journal data maintained (Table 12). Other than the LOCKSS Alliance, all programs currently maintain or shortly plan to implement both local backups and off-site storage. The preferred mechanism for backing up LOCKSS boxes is the LOCKSS system itself. LOCKSS boxes are designed to be “self-healing” and to detect and correct corruption on the basis of comparisons with and downloads from other LOCKSS boxes carrying the same content. However, for very large collections, rebuilding an entire LOCKSS box in that manner could be time-consuming and incur substantial network traffic charges. Even though it might be faster and cheaper in some cases to restore a LOCKSS box from a local, offline backup, most installations have opted to forgo such backups. In fact, LOCKSS content licenses do not authorize them, so their legality, at least under U.S. copyright law, is unclear. An alternative for institutions with very large storage caches would be to establish a second complete LOCKSS box within the same network domain.
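The idea of detecting and correcting corruption by comparison with other copies can be sketched as follows. This is a deliberate simplification that assumes a plain majority vote over SHA-256 digests; it is not the actual LOCKSS polling protocol, and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of a stored file as a hex string."""
    return hashlib.sha256(data).hexdigest()

def repair_from_peers(local: bytes, peers: Dict[str, bytes]) -> bytes:
    """Return the local copy, or a peer's copy if most peers disagree with it."""
    if not peers:
        return local
    votes = Counter(digest(copy) for copy in peers.values())
    winning_digest, count = votes.most_common(1)[0]
    # Require a strict majority of peers before trusting their version.
    if count * 2 <= len(peers):
        return local
    if digest(local) == winning_digest:
        return local  # local copy agrees with the majority
    # Local copy is in the minority: fetch a copy that matches the majority.
    for copy in peers.values():
        if digest(copy) == winning_digest:
            return copy
    return local
```

A box whose copy differs from the one held by most of its peers replaces it with theirs; when the peers do not agree among themselves, the sketch leaves the local copy untouched rather than guess.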

Table 12. Responses to questions: “Do you use any of the following redundancy procedures?” and “How many copies of your content do you maintain?” ( • = yes; P = plan to within six months)

Two initiatives—OCLC ECO and CISTI Csi—have established mirror sites. Portico, the KB e-Depot, and PubMed Central all have them in the planning stages. PubMed Central is in different stages of negotiation to establish mirrors in at least five countries; U.K. PubMed Central is expected to be the first to go live, possibly as early as January 2007 (UKPMC 2006). The concept of a mirror site has a different meaning in the context of LOCKSS; in a sense, all the content is mirrored, because every LOCKSS box has the complete LOCKSS software. Although no two LOCKSS boxes necessarily carry exactly the same content, any particular content should be available on a minimum number of other boxes.

There are not only different techniques for carrying out redundancy but also varying degrees of practice for each technique, as evidenced by differences in the number of redundant copies each program maintains. However, it is the operational details behind the numbers that determine the degree of protection provided. For example, a program that keeps five copies of only its data files, all on the same kind of media and in the same location, is more vulnerable to loss than is a program that maintains a single mirror site with both applications software and data that are in a geographically distinct location, on a different power grid, in a different network, and operated by different personnel. LOCKSS proponents claim that one strength of its architecture is that distinct systems personnel operate every site, increasing the protection of the content against loss by human error or deliberate attack from a determined insider. In fact, they assert that “unified system administration should be an unacceptable feature of digital preservation” (Rosenthal 2005b). We agree.
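The point that operational details matter more than the raw number of copies can be made concrete with a small sketch. It assumes that each copy is described by the five attributes named in the example above; the `Copy` class and `shared_risks` function are hypothetical illustrations, not part of any surveyed program's tooling.

```python
from dataclasses import dataclass, fields
from typing import List

@dataclass(frozen=True)
class Copy:
    location: str
    media: str
    power_grid: str
    network: str
    operator: str

def shared_risks(copies: List[Copy]) -> List[str]:
    """Name every attribute on which all copies have the same value.

    Each name returned marks a single point of failure common to all copies.
    """
    return [f.name for f in fields(Copy)
            if len({getattr(c, f.name) for c in copies}) == 1]

# Five identical copies share every risk; a primary plus one mirror that
# differs in location, grid, network, and staff shares only the media type.
five_copies = [Copy("site A", "tape", "grid 1", "net 1", "team 1")] * 5
mirrored = [Copy("site A", "disk", "grid 1", "net 1", "team 1"),
            Copy("site B", "disk", "grid 2", "net 2", "team 2")]
```

Here `shared_risks(five_copies)` lists all five attributes, whereas `shared_risks(mirrored)` lists only `media`, even though the mirrored arrangement holds fewer copies.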

Different levels of redundancy may be appropriate for different types of archiving programs. Preservation-driven programs have less need for real-time mirroring, because they do not provide current access and typically do not promise immediate access to their subscribers or members in the case of a trigger event. Furthermore, the publisher can usually resupply content that has been processed, but not yet backed up. However, over time, it can be expected that publisher failures, expiration of copyright, and other kinds of trigger events will eventually turn preservation-driven programs into content providers, thereby changing the nature of their responsibilities and, presumably, their redundancy planning.

Redundancy should be seen for what it is—a stopgap measure designed to restore data integrity or operations following a loss of primary systems. It is always preferable to prevent data loss in the first place. The need to rely on redundant storage, which can mean considerable expense and downtime, can be reduced through disaster planning. We asked each program whether it had established written procedures and protocols for dealing with three major classes of physical threats: malicious attacks, natural disasters, and infrastructure failure. As shown in Table 13, most programs have policies to address all three.

Table 13. Responses to question: “Do you have written procedures and protocols to minimize vulnerability to various threats?”

A written plan shows that a program takes its data-security obligations seriously. To be effective, disaster plans have to be comprehensive, detailed, widely disseminated to relevant personnel, and regularly tested and updated. Programs could enhance members’ and subscribers’ confidence in their preparedness for disasters by making disaster-planning documents public.19 Public versions of these documents should be edited to exclude information that might compromise security, such as the precise location of off-site storage facilities, the identity of security personnel, and details about the operation of antihacking and anti-intrusion systems.

Offer an open, transparent means of auditing practices. This requirement addresses two questions: are practices audited and is the audit process open and transparent? At this early stage, there appears to be little agreement about the appropriate means and level of openness and transparency needed to gain the trust of potential participants. Our survey included a question about the conduct of technical audits. Seven programs indicated that they conduct technical audits (OhioLINK EJC, LANL-RL, LOCKSS, NLA PANDORA, Portico, OCLC ECO, CISTI Csi), two do not (Ontario Scholars Portal, kopal/DDB), and one (KB e-Depot) plans to conduct a technical audit within the next six months.

We also asked about the existence of written documentation covering many aspects of the programs’ e-journal archiving functions. There is as yet no standard expectation for a minimal set of documentation, and as Table 14 indicates, no one type of document that all programs have created. In most cases, only some of the documentation is publicly available.

Table 14. Responses to question: “Do you have the following written documentation that explicitly refers to e-journal archiving?” ( • = yes; P = plan to within six months)

We believe that to earn the trust of the user community, archives must have written policies in all major areas of operations that are available for public review. Table 14 does not even address public availability, but it does point to an absence of written documentation in several critical areas, particularly quality control, disaster planning and recovery, and preservation planning.

During the thaw in relationships between the Soviet Union and the United States that took place in the 1980s, a number of Russian terms became well known to English speakers in the United States. These included perestroika (economic restructuring) and glasnost (openness), which referred to policy changes within the Soviet Union. On the U.S. side, the cautious response from then President Reagan often took the form of “Doveryay, no proveryay,” usually translated as “Trust, but verify.” That expression is especially appropriate for tentative relationships, where there is insufficient history and experience for trust to be automatic and unequivocal. Relationships between libraries and commercial publishers, in particular, have been strained, if not adversarial, for many years. Consequently, even with trusted nonprofit entities, including national libraries and university libraries playing a major role in facilitating e-journal archiving, there is much that libraries want to scrutinize and evaluate before they can feel comfortable investing in a particular solution. Especially in these early stages, programs and initiatives should be prepared to demonstrate an extraordinary level of openness and transparency if they expect to gain the trust and support of the user community.

Recommendations

  1. Publishers, research libraries, and archiving entities must all be involved in defining requirements and the processes associated with certification. Although it is important to consider what future requirements will be, it is equally important to do things now and to document what works and what does not.
  2. Digital repositories should be overt about their ability to meet minimal requirements for well-managed collections and, ultimately, for certification. As the “Urgent Action” statement noted, “Certifying agencies might recognize qualified preservation archives that provide these services with a publicly visible symbol of compliance.” Figure 6 shows examples of such symbols that are already in use: the NLA PANDORA’s use of Safekept for materials on digital preservation that are preserved by Preserving Access to Digital Information (PADI), the National Archives of Australia’s e-permanence program, and the server-certification program in Germany sponsored by DINI.

Fig. 6. Examples of logos symbolizing compliance

  3. Research libraries should probe e-journal archiving programs for details on their ability to meet base-level requirements for responsible stewardship of journal content.
  4. An anonymous reporting service should be established so that e-journal archiving programs and others in the community can share negative experiences with digital preservation procedures and tools without embarrassment or loss of credibility.
  5. To achieve maximal feedback on the state of an archive’s content, e-journal archiving programs should use a combination of automated integrity testing and active usage. Systems providing current access should be designed to encourage and facilitate reporting of data quality problems. Publishers should relax usage restrictions on dark archives to boost confidence that the content is “user ready” at all times.20
  6. Programs should practice openness and transparency by making policy statements, model contracts, and technical procedure documentation publicly available.
  7. E-journal archiving programs should begin examining interoperability issues for digital objects and metadata with an eye on maximizing the ability to exchange data among them.
  8. E-journal archiving programs should implement redundancy policies that maximize the survivability of data against the wide variety of potential threats. System administration responsibilities should be decentralized to reduce vulnerability to loss from a determined insider.

Indicator 5: Access Rights

A repository should negotiate with publishers to ensure that the digital archiving program has the right, and is expected, to make preserved information available to libraries under certain conditions.

The sine qua non of an effective e-journal digital archiving program is the ability to provide effective access to journals over time. If e-journals cannot be made available, there is little reason to preserve them. The conditions under which e-journal archiving programs can make preserved information available, and to whom, are two of the most important defining characteristics of the programs.

“Current Access” versus “Archiving”

One of the major distinctions in the surveyed initiatives is between those that provide immediate access to content, and promise to do so on a continuing basis, and programs whose primary responsibility is to ensure future availability of material, but which do not address current demand.

Tying digital preservation directly to current user access has pros and cons. On the plus side, it keeps preservation in the forefront. If a reader cannot currently access journals, either because of format changes or renderability problems, the provider will need to address the issue in relatively short order. Of the 12 initiatives we surveyed, 5 (CISTI Csi, OCLC ECO, LANL-RL, OhioLINK EJC, and the Ontario Scholars Portal) are focused primarily on making electronic journals available immediately to their authorized communities.

Two initiatives—PubMed Central and NLA PANDORA—offer online access to commercial publications after the expiration of a moving wall, normally six months to three years from date of publication.21 In theory, one could substitute free access through PubMed Central or NLA PANDORA for a subscription, but in practice for most titles behind the moving wall, archival access is a supplement to, rather than a replacement for, current access from other sources.
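The moving-wall rule amounts to a simple date comparison, sketched below. The sketch assumes that the wall is expressed in whole months from the date of publication; the function names are hypothetical, and actual agreements may define the wall differently (for example, by volume or by calendar year).

```python
import calendar
from datetime import date

def add_months(start: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    index = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))

def is_open(published: date, wall_months: int, today: date) -> bool:
    """Return True once the moving wall for an article has expired."""
    return today >= add_months(published, wall_months)
```

Under this rule an article published on 15 January 2006 with a six-month wall becomes available on 15 July 2006; with a three-year wall it becomes available on 15 January 2009.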

The drawback to programs that tie digital preservation to current user access is that they may be more motivated to perform functions supporting current, rather than future, access needs. One program providing immediate access commented on its use of standards and community practice: “As an access-oriented system, we struggled here. What we use is based on the current system for access. We would choose to use one or more [of these standards] if we were just archiving, or we may use them as we evolve to a new access system.” Because proper preservation management embodies so many different and specialized responsibilities, the DLF Minimum Criteria for an Archival Repository of Digital Scholarly Journals document recommends against combining access and preservation in one system. Criterion six states that the limited-access services an archival repository provides “should not replace the normal operating services through which digital scholarly publications are typically made accessible to end users” (DLF 2000). Similarly, the authors of the “Urgent Action” statement suggested that digital archiving may best be viewed as a “kind of insurance” and not a form of access. They split archiving into two issues: mitigating risk of permanent loss and avoiding access disruptions for a protracted period.

The determination of whether a current e-journal access and delivery system can also effectively serve as an archival repository will ultimately rest upon a careful examination of all the program viability factors outlined in this report. Unlike the authors of the DLF Minimum Criteria, we do not reject out of hand the possibility that a program with a primary focus on current access could also serve as an archival repository.

“Dark Archive” versus “Light Archive”

A repository that preserves material for future use but does not provide current access is often referred to as a dark archive (Pearce-Moses 2005). In theory it might be possible to have a true dark archive that stores, maintains, and manages a sequence of bits without necessarily knowing what those bits contain. In reality, however, even the darkest of archives must permit some access by repository staff. The level of public access to the system can further distinguish dark archives. Some dark archives stress that they are dark because the system itself has no public interface and allows no public access. Only the person who deposits data into the dark archive can get it out, and it is the depositor’s responsibility to provide access to the data. Other dark archives have public interfaces but allow no public access until a trigger event occurs. That trigger event could be negotiated with the content contributor (i.e., immediate onsite access to the files) or it could be related to an external event (such as the unavailability of the content owner’s own Web site). People often refer to these archives as “dim,” even “light,” archives.

Librarians by and large have not been thrilled with the idea of pure dark archives. There are at least three reasons for this antipathy. The first is that for librarians, preservation and access have always been intimately linked. As Brian Lavoie and Lorcan Dempsey noted in their 2004 article, “Thirteen Ways of Looking at . . . Digital Preservation”:

The notion of “dark archives,” supporting little or no access to archived materials, has met with scant enthusiasm in the library community. This suggests that digital repositories will function not just as guarantors of the long-term viability of materials in their custody, but also as access gateways. Fulfilling this dual mission requires that preservation processes operate seamlessly alongside access services.

Don Waters made this same point in his paper “Good Archives Make Good Scholars: Reflections on Recent Steps Toward the Archiving of Digital Information”:

Access is the key. Over and over again, we have found that one special privilege that would likely induce investment in digital archiving would be for the archive to bundle specific and limited forms of access with its larger and primary responsibility for preservation (Waters 2002).

The second objection to dark archives concerns the funding mechanisms. As Sadie Honey (2005) noted:

. . . the dark archive approach appears least likely to address long-term preservation needs. . . . The dark archive approach is weak in terms of equitable sharing of costs and long-term sustainability and does not score well against any of the criteria. The biggest obstacle for the dark archive approach is funding—who pays for it and how.

The third objection librarians have to dark archives is technical. It is far from certain that digital files stored in a system that is not accessible to the public can be safely managed. Don Waters, in the essay cited above, notes that, “User access in some form is needed in any case for an archive to certify that its content is viable.” Harvard and others assert that they can safely audit and test a digital repository even when it is not open to public use, but this contention has not been proved. Cornell’s experience with offline storage of digital masters has not been good and, in one case, a heroic rescue of digital files was necessary.

What librarians really want, in short, is at least a dim archive—though the level of dimness can vary. Fortunately, all the primarily preservation-oriented programs in our survey require staff access to content, with many assuming some level of public access. PubMed Central and NLA PANDORA, as noted above, are current publishers for some content and make other content available after a set period of time. The KB e-Depot and the kopal/DDB allow immediate onsite access to preserved content, with the possibility that online access can occur after certain trigger events. LOCKSS prefers that the publisher provide access to the reader, but when the publisher’s copy is not available, the LOCKSS cached copy can be used for current access. To date, members of the LOCKSS Alliance have not experienced much need to initiate local access from their LOCKSS boxes. Recently, however, when the journal Communication Theory moved from Oxford University Press to Blackwell Publishing, some LOCKSS Alliance libraries that do not subscribe through Blackwell began to provide local backfile access to their Oxford University Press content. As each institution’s LOCKSS box serves only its own readers, the inexpensive machines used are more than adequate for a single institution’s access load. Only Portico and CLOCKSS eschew some level of current access beyond audit, and both of them can become delivery mechanisms of choice under certain conditions. Portico plans to use the JSTOR access system to provide access in response to triggers or to secure perpetual access rights, if participating publishers choose to designate Portico as a provider of post-cancellation access. In addition, select librarians at participating libraries are granted password-controlled access for verification purposes.

Trigger Events

In a world of dim archives, the three key questions are who can have access to preserved content, how they can have access, and when they can have access. The conditions that can lead to a change in access to preserved content are usually called trigger events (Flecker 2001). A trigger event would occur when something goes wrong and a library could file a claim. We identified six trigger events that could change access conditions:

  • a publisher ceases operation;
  • a publisher no longer offers back issues;
  • copyright in the journal expires;
  • a journal ceases publication;
  • the publisher or distributor experiences catastrophic system failure; or
  • the publisher or distributor experiences temporary system failure.

Trigger events and the authorized community. We surveyed the archiving initiatives to see how a trigger event might change access for their authorized community. The results are presented in Table 15.

Table 15. Trigger events that spark changes in access for the authorized community

The programs that provide current access to content (OhioLINK EJC, LANL-RL, Ontario Scholars Portal, OCLC ECO, and CISTI Csi) would continue to provide such access even after a trigger event. As one of the providers noted, “Our partner model does not involve the idea of a ‘trigger event.’ Our repository is always available.” Similarly, the moving-wall agreements that PubMed Central and NLA PANDORA have with publishers control access, regardless of trigger events. If either has received permission to make material available immediately or after a fixed period of time, that permission continues, regardless of the status of the publisher or the journal. LANL-RL is developing agreements with several scholarly societies, most notably the American Physical Society, to become a fallback provider if the primary servers fail completely.

Trigger events are more important for the other five repositories and can potentially alter the type and amount of access that each can provide. For example, if a publisher ceases operations, no longer offers access to back issues, ceases publication, or has a catastrophic failure of its delivery mechanism, LOCKSS and Portico would be able to make content available to authorized users. With LOCKSS, local access to the material preserved on a local LOCKSS box would be instantaneous, whereas with Portico it could take from 90 to 120 days to provide authorized user access to preserved material.22

In addition to the trigger events listed above, LOCKSS can provide access in the event of a temporary disruption in the publisher’s distribution mechanism. Portico can in some cases provide ongoing access to subscribed content even after a library has terminated its license with the publisher. In these cases, the publisher will have decided that Portico, and not the publisher, will meet any perpetual access obligations of the original license.
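The relationship between trigger events and access for these two programs can be summarized as a small lookup table. The sketch encodes only what the two preceding paragraphs state; copyright expiration is omitted for both programs because the text does not say how they treat it, and Table 15 remains the authoritative source. The identifiers are hypothetical.

```python
from enum import Enum
from typing import Dict, FrozenSet

class Trigger(Enum):
    PUBLISHER_CEASES_OPERATION = 1
    BACK_ISSUES_WITHDRAWN = 2
    COPYRIGHT_EXPIRES = 3
    JOURNAL_CEASES_PUBLICATION = 4
    CATASTROPHIC_SYSTEM_FAILURE = 5
    TEMPORARY_SYSTEM_FAILURE = 6

# Triggers the text names for both LOCKSS and Portico.
_COMMON: FrozenSet[Trigger] = frozenset({
    Trigger.PUBLISHER_CEASES_OPERATION,
    Trigger.BACK_ISSUES_WITHDRAWN,
    Trigger.JOURNAL_CEASES_PUBLICATION,
    Trigger.CATASTROPHIC_SYSTEM_FAILURE,
})

ACCESS_TRIGGERS: Dict[str, FrozenSet[Trigger]] = {
    # LOCKSS can also serve content during a temporary disruption.
    "LOCKSS": _COMMON | {Trigger.TEMPORARY_SYSTEM_FAILURE},
    "Portico": _COMMON,
}

def grants_access(program: str, trigger: Trigger) -> bool:
    """Return True if the text says the program opens access on this trigger."""
    return trigger in ACCESS_TRIGGERS.get(program, frozenset())
```

The table says nothing about timing, which also differs: local LOCKSS access would be immediate, whereas Portico could take from 90 to 120 days.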

Reactions to expiration of copyright as a trigger event were quite interesting. In theory, once copyright in a journal expires, the repository should be able to make it freely available to anyone. In practice, few repositories seem to have considered this possibility during their negotiations with publishers. If the negotiated agreements with the publishers limit access to a subset of users during the copyright term of the material, those restrictions would often still apply, even after the copyright has expired. As one interviewee somewhat sheepishly admitted, “Given the increasingly long duration of copyright terms, it is difficult to remember that copyright will eventually expire.” Some of the initiatives (for example, PubMed Central, KB e-Depot, and kopal/DDB) are eager to make open-access material available to the world. Other initiatives appear to be concerned about the costs of giving nonmembers or nonsubscribers access to preserved open content. The benefit to society of providing ready access to public domain or otherwise open content can be great (Hamma 2005), and those programs providing current access to users should be urged to open access to the most material that the law, license agreements, and business plans allow.

Trigger events beyond the authorized community. The “Urgent Action” statement argued that access in response to a trigger event should be limited to designated member or subscriber communities. For those outside this group, access should come at a premium: “Potential participants who might choose initially to withhold support would pay their full fair share, should they eventually need access to preserved materials.”

We therefore asked the e-journal archiving programs that restrict current or future access to a designated community whether, if one of the trigger events occurred, the repository would be able to provide access to those beyond their designated member or subscriber communities. Take, for example, an Elsevier journal that was no longer available electronically through the publisher. Would a library that subscribed to that journal and was not part of one of the archiving initiatives be able to turn to one of the e-journal archives to retain electronic access to the journal? And what about libraries that do not even have a current subscription? Would they ever be able to gain access to the preserved content?

Two of the initiatives, PubMed Central and NLA PANDORA, already make their content available to all after a publisher-specified waiting period. Of the remaining initiatives, only CLOCKSS said that it would be able to provide access to nonmembers if a trigger event occurred. A presumed trigger event would initiate collaboration among publishers, librarians, and representing societies to determine whether the trigger event had actually taken place and what the appropriate response should be: e.g., whether materials would be made generally available to all and whether such access would be for a limited or an indefinite period. Assuming general public access was authorized, the process of moving material from CLOCKSS’s restricted storage environment into a public-access system would begin, and material would be available within six months.

The KB e-Depot, in principle, could also serve as a general delivery system for content in the event of a catastrophic collapse of the publisher’s system, but some additional negotiations with publishers might be required, and the ramp-up time for the development of an online access system would likely be high, with no assurance that funding to develop such a system would be available. As yet, kopal/DDB has not negotiated the right to make material generally available after a catastrophic failure, though again this might be possible with the publishers’ agreement and an appropriate ramp-up time.

Of the remaining seven initiatives, none opposed providing nonmembers access to preserved content at some time in the future, but all stressed that there would be myriad conditions and costs associated with doing so. As the respondent from the Ontario Scholars Portal noted, “Providing access outside the defined membership would be a problem financially and possibly ethically.”

The reasons for the hesitation varied. In some cases, repositories did not know whether they would have the technical and financial resources necessary to make a general open portal to the preserved content. In other cases, agreements with publishers do not cover such contingencies. In all cases, it was presumed that a nonmember would have to become a member to access the preserved content—presumably at a higher fee than if it had participated from the start. A library, for example, could join the LOCKSS Alliance, establish a LOCKSS box in the library, and then secure access to all content that it had previously licensed or that was freely available under a Creative Commons license. Alternatively, a library could join OCLC ECO or Portico to gain access to content to which it had once subscribed. The terms of the library’s subscription and the archiving initiative’s agreement with the publisher may limit what can be made available.

In short, it does not appear that there is a ready mechanism that can provide broad public access to currently access-restricted content should a triggering event occur. Subscribers to one of the current access services that also promise enduring access should be unaffected by any trigger event, assuming that the services can effectively preserve content. Participants in the LOCKSS Alliance and Portico should be able to “call in their insurance policy” and get ready access from these providers. CLOCKSS intends to make its preserved content freely available to everyone should a trigger event occur. The e-Depot at the KB and DDB’s implementation of kopal would also like to provide worldwide, online access to content in the event of a publisher’s failure, but for now the only certainty is that they will be able to continue to provide onsite access. Providers such as OCLC ECO and Portico may be willing to sign up new members when the need arises, but the costs are unclear.

The bottom line is this: unless electronic journals are available through the open-access portions of different repositories, the only certain method of access to preserved content for someone from outside a designated community is to fly to Amsterdam or Frankfurt to work with the preserved content onsite. The initiatives we examined have secured the necessary permissions to make material available to their designated community (e.g., subscribers, participants, onsite users). Few options, however, are available to users from outside the designated communities.

Recommendations

  1. The only way a library can ensure that it will have continued access to subscribed (non-open access) content is through membership or participation in at least one of the e-journal archiving initiatives described in this report. This information should be conveyed to key library stakeholders to help them decide whether to support an e-journal archiving program at the local level.
  2. National preservation projects should be encouraged to negotiate for broad access rights to copyrighted content should a trigger event occur. Increased access may lead to increased preservation.
  3. The preservation capabilities of any initiative whose primary purpose is the delivery of current journal literature should be carefully assessed. Access and preservation are not automatically at odds, but a focus on the former could be to the detriment of the latter.
  4. All preservation initiatives should give more thought to the possibility that some of the content they store may eventually pass into the public domain and should negotiate all agreements with publishers accordingly.

Indicator 6: Organizational Viability

Repositories must be organizationally viable.

A digital preservation program exists within an organizational context and as such must fit the needs, priorities, and resources of the relevant stakeholders (e.g., publishers, the repository itself, members/subscribers/underwriters, users, and beneficiaries). Trusted Digital Repositories: Attributes and Responsibilities, produced by RLG and OCLC in 2002, defines the organizational context for a digital preservation program. Three attributes in particular relate to the viability of any e-journal archiving effort: administrative responsibility, organizational viability, and financial sustainability.

Administrative responsibility includes a commitment to implement community-agreed-upon standards and best practices, collect and share data measurements with depositors, regularly validate or certify processes and procedures, and maintain transparency and accountability in all actions. Organizational viability is reflected in a commitment to long-term retention and management in mission statements, legal status, business-practice transparency, staffing, the development and review of policies and procedures, testing, and contingency/escrow arrangements. Financial sustainability can be reflected in good business practices, business plans, annual reviews, standard accounting procedures, and short- and long-term financial-planning cycles.

What evidence exists that e-journal archiving programs are administratively responsible, organizationally viable, and financially sustainable? Our survey included questions on a range of issues, from organizational commitment, to documentation and standards adherence, to succession planning, to resources and cost models. The various programs’ responses suggest that all have the potential for long-term viability. Each has an explicit mission committing it to long-term e-journal archiving and the legal right to do so. All have formal arrangements with publishers that spell out archiving and access requirements and show evidence of continued growth in publications covered. All are embedded in an organizational structure, and all except the government-supported programs have or plan to have a governance board that includes input from key stakeholders—libraries and publishers. Most make use of external advisers or are planning to do so within the next six months. All maintain Web sites and other publicity materials; many have contributed to the profession through participation in conferences, standards bodies, or digital preservation efforts, or through publication.

But these programs are of recent vintage and have limited track records in terms of digital preservation responsibility and practical experience. Except for the National Library of Australia, those with a primary preservation focus are less than four years old; three have become operational since last year. Most are still building their digital preservation programs, and this is reflected in the fact that policies and practices are not as well documented as they might be. Well-defined service requirements are not fully met by all the repositories, and there appears to be little agreement regarding the appropriate means and level of openness and transparency needed to gain the trust of potential participants. Few have considered succession planning; none reported having a formal arrangement in place. That only half of them indicated a commitment to seek certification could also be a red flag for an institution that is relying on them for its preservation needs.

As shown in Table 16, only half of the programs reported that they have business and financial auditing processes in place or planned. However, the detailed comments accompanying these responses indicate that very few seem to conform to the standard set by the securities industry for a formal, externally conducted, and publicly released audit. Financial reports and publisher agreements, almost without exception, are not publicly available.

Table 16. Responses to question: “Do you have the following audit processes in place?” ( • = yes; P = plan to within six months)

Economic issues related to digital preservation have been scrutinized in recent years, but the absence of any standard mechanism for accounting for all of the associated costs of e-journal archive management, and the early developmental stage of most of the programs, make meaningful comparisons of operating costs impossible—even if the programs surveyed had shared detailed budget documents with us. Perhaps the CRL report forthcoming by the end of 2006 will shed more light in this area.

We did look at two potential indicators of financial sustainability: sources of funding and stakeholder buy-in.

Sources of Funding

Programs with a government mandate may have an edge in terms of ongoing commitment and funding appropriations, although an exclusive dependence on government largesse could be detrimental in lean economic times. The KB, for example, has reallocated funding within its own budget to support e-Depot and since 2003, it has received an additional €1.1 million annually from the Ministry of Education, Culture, and Science for system maintenance and operations staff. In 2005, the ministry provided an additional €900,000 to be used exclusively in digital preservation research (Oltmans and van Wijngaarden 2006). Funding for PubMed Central is based on appropriations from the federal government for the NIH. In 2004, NLM’s annual operating cost for PubMed Central was $2.3 million.23 The Bundesministerium für Bildung und Forschung funded the three-year development of kopal/DDB with over €4 million in August 2004. To support the implementation of electronic legal deposit in Germany this year, kopal/DDB is getting a funding increase of about €2 million. Los Alamos National Laboratory receives appropriations from the U.S. Department of Energy, the U.S. Department of Defense, and elsewhere. The library receives funding from the institutional overhead in those appropriations or from grants and work for others that is done at the laboratory. The library charges external customers for access on a cost-recovery basis.

Programs with a primary mission to provide access may also be at a financial advantage, because the costs of archiving are tied directly to current use and subscriptions. Between 2001 and 2005, the Ontario Scholars Portal was supported by a grant and provincial matching funds as part of the Canadian National Site Licensing Program. The portal is now self-funded through a membership pricing model that adjusts for the varying size of consortium members and factors in usage, and includes tiered membership fees. Members have made a financial commitment through 2009–2010. OCLC ECO has been an online service provider for nearly 30 years and has the power of OCLC behind it. For OhioLINK EJC, all technical infrastructure costs, as well as about 20% of content-acquisition costs, are centrally funded through legislative appropriations. The remaining funding for content comes from member libraries, based on an institution’s rate of expenditure on journals from publishers represented in EJC, including both print and electronic subscriptions. Most Ohio higher education institutions participate. Fluctuations in state appropriations, however, have resulted in discontinuation of some titles. EJC’s contracts stipulate a nonpunitive approach to obtaining missing content if it resubscribes to a canceled title.

The three programs that are not funded by the government and are primarily intended for preservation may be the most vulnerable. All three have started within the past year or so; each has benefited from generous startup support from well-respected sources. The Andrew W. Mellon Foundation has supported both Portico and LOCKSS, and LC supports both Portico and CLOCKSS. In addition, LOCKSS received funding from the National Science Foundation, Sun Microsystems, and Stanford University libraries, and in-kind support from Sun, Intel Research Berkeley, HP Labs, and the computer science departments of Stanford and Harvard. Portico received heavy initial support from Ithaka and JSTOR, in addition to Mellon and LC.

Stakeholder Buy-in

Long-term sustainability for these efforts will depend on their ability to secure ongoing support from a number of quarters. The LOCKSS Alliance is an open-membership organization that began in 2005 to introduce governance for the program and to address sustainability issues. Its goal is self-sufficiency through membership fees, which are based on an institution’s Carnegie Classification.24 There is a 5% discount for consortia and library systems. Because some of the participating publishers make available for preservation only current content to current subscribers, the earlier a library joins the LOCKSS Alliance, the more complete its coverage is. Portico looks to a diversified revenue portfolio to fund ongoing operations, with major support coming from publishers and libraries. Publishers are asked to provide electronic journal source files and to make annual contributions, which are tiered according to the size of their annual revenue from journal subscriptions and advertising. Libraries are asked to support the lion’s share of expenses. Those that join pay an annual archive support payment, which is tiered according to a library’s self-reported total library materials expenditure. Library systems and consortia are offered modest discounts. Published rates are available on the Portico Web site. To encourage early adoption, libraries that join in 2006 and 2007 will be designated “Portico Archive Founders.” Those joining in 2006 receive a 25% savings on their payments for the next five years; those joining in 2007 will receive a 10% discount for the next five years.

CLOCKSS is in an initial two-year phase, and it is difficult to judge what will happen next. In the minds of many library directors, the e-journal–preservation issue comes down to two choices: LOCKSS Alliance or Portico. The long-term viability of these programs will be determined largely by how successful they are in signing up e-journal publishers as well as library members. The LOCKSS Alliance reported arrangements with more publishers than Portico, but Portico lists more titles covered. As of July 1, 2006, 13 publishers had committed more than 3,500 journals to Portico; 25 publishers had committed 1,500 titles to the LOCKSS Alliance.25 Both continue to add new publishers and content.

More than 90 libraries worldwide joined the LOCKSS Alliance (157 institutions maintain LOCKSS boxes) in the first year it recruited members. In June 2006, the Alliance got a major boost when OCLC announced it had joined (OCLC 2006). According to the survey response from LOCKSS Alliance Director Vicky Reich, the LOCKSS Alliance “has reached an impressive level of sustainability.” Eileen Fenton, Portico’s executive director, reported that as of July 1, 2006, 100 libraries had committed to supporting the archive. “Steadily growing participation from U.S. academic libraries and significant international expressions of interest suggest a broad base is building in support of Portico’s efforts,” she noted.

Both the LOCKSS Alliance and Portico have their supporters—and their detractors. Those who prefer to invest in an archiving solution by writing checks see Portico as the better choice and the annual fees a “bargain,” especially given the early incentives and consortial discounts. The JSTOR imprimatur brings with it a sense of confidence in the approach. Some Portico supporters are also concerned by the technical requirements and staff time at the local level to participate in LOCKSS. Last February, the California Digital Library (CDL) estimated the impact of the Portico service on its systemwide e-journal preservation activities. They compared the journals then covered in Portico with CDL’s 2005 journal packages, including nonprofit and for-profit publishers. The number of Tier 1 journals licensed was 4,593 for all 10 University of California (UC) campuses (9 campuses if the content is nonmedical and UC San Francisco is excluded). CDL negotiates the license, and all UC users have access to this material. It may be funded, in whole or in part, by CDL. CDL discovered that 45% of the journals were covered by Portico, representing 57% of the funds spent by CDL to license the journals.26

Those who favor the LOCKSS approach see it as the low-cost, technically proven, and organized way to go about archiving. “Any time someone asks us to write a check, we disappear,” commented one director. They conceded that participating in the LOCKSS Alliance did require resources beyond the membership fee, but said that the hardware and staff costs were negligible.27 Others commented on the value of participating in collection development activities—choosing which publications to archive. They also valued the access to documentation, prerelease software, training, and involvement in planning efforts. Some expressed concern about the up-front efforts required by Portico to normalize data from the publishers, being one step removed from publishers by the participation of a third party, and the need to buy in before a full set of publishers was covered.

A few directors wondered whether the profession could financially sustain both the LOCKSS Alliance and Portico. Others valued the opportunity to participate in more than one program. As of July 1, 2006, 32 institutions had joined or were participating in both LOCKSS and Portico. Several members of OhioLINK EJC and the Ontario Scholars Portal are also participating in LOCKSS. Close to 300 institutions in the United States and Canada are covered by one or more e-journal archiving programs—a good beginning, but representing only a fraction of all higher education institutions in the two countries.

Cornell University Library is participating in both Portico and the LOCKSS Alliance. Approximately 2,200 titles licensed by Cornell are covered in Portico (about 63% of Portico’s total). As a LOCKSS Alliance member, Cornell’s coverage includes 188 journals, 66 of which are also represented in Portico. Beyond the Alliance itself, Cornell subscribes to 618 titles from publishers in the LOCKSS program. Of these, 442 are also being archived through Portico.28 It was surprisingly hard to determine the number of scholarly e-journals Cornell maintains that are not covered by these two options.29 The cost to Cornell of participating in both Portico and the LOCKSS Alliance in 2006 is about $24,000, of which membership in the LOCKSS Alliance is $10,800 and participation in Portico is $13,125 (after the 25% early adopter discount). The LOCKSS box is running on a five-year-old Dell machine whose memory was upgraded twice, for a total of $125. The programmer responsible for managing the box estimates it took less than a day to set up the system and that he spends about 15 minutes a month to keep it running. With a three-year effort to move to electronic-only subscriptions in the sciences, social sciences, and the humanities, where possible, Cornell considers this money well spent, averaging approximately $10 per title and a little over one-tenth of 1% of total library materials expenditures. The money to support the memberships is coming from an account previously used for preservation microfilming.
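The figures in the preceding paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this report; the variable names are illustrative, and the two title counts are assumptions about how overlapping titles might be de-duplicated, not a statement of how the per-title average was actually derived.

```python
# Figures quoted in this report for 2006; variable names are illustrative.
lockss_fee = 10_800               # LOCKSS Alliance membership, USD
portico_fee = 13_125              # Portico payment after the 25% discount, USD
portico_titles_cornell = 2_200    # Cornell-licensed titles in Portico (approximate)
portico_titles_total = 3_500      # titles committed to Portico ("more than 3,500")
alliance_titles = 188             # Cornell's LOCKSS Alliance coverage
alliance_in_portico = 66          # of those, also in Portico
lockss_publisher_titles = 618     # subscribed titles from LOCKSS-program publishers
lockss_publisher_in_portico = 442 # of those, also archived through Portico

total_cost = lockss_fee + portico_fee
share_of_portico = portico_titles_cornell / portico_titles_total

# Assumption: the overlaps are exactly as quoted, with no other double counting.
# Narrow count: Portico titles plus Alliance titles not already in Portico.
narrow_union = portico_titles_cornell + (alliance_titles - alliance_in_portico)
# Broad count: also add LOCKSS-publisher titles not already in Portico.
broad_union = narrow_union + (lockss_publisher_titles - lockss_publisher_in_portico)

print(f"Total 2006 cost: ${total_cost:,}")
print(f"Share of Portico's titles licensed by Cornell: {share_of_portico:.0%}")
print(f"Cost per title (narrow count, {narrow_union:,} titles): "
      f"${total_cost / narrow_union:.2f}")
print(f"Cost per title (broad count, {broad_union:,} titles): "
      f"${total_cost / broad_union:.2f}")
```

Under either counting assumption the result lands close to the "approximately $10 per title" reported above, and the two fees sum to just under the $24,000 total.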

Recommendations

  1. Academic libraries should assess how much of their licensed content is protected in one of the e-journal archiving programs as a measure of the value of participation.
  2. Academic libraries should share information with each other about what they are doing in terms of e-journal archiving, including their internal assessment process for decision making.
  3. Mainstreaming commitment in terms of requisite resources and organizational support is essential. Participation in more than one program can ensure that different approaches and strategies are tried and assessed.
  4. Academic libraries should press e-journal archiving programs for particulars on their business plans but not expect them to offer absolute guarantees of economic viability. Support should be viewed as an investment in developing viable models and an interim means for protecting vulnerable content.
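The assessment suggested in the first recommendation can be approached as a simple set comparison. The following is a minimal sketch, assuming a library can export its licensed titles with annual spend and obtain a title list from each archiving program; the function name, the identifiers, and the sample data are all hypothetical. It reports coverage both by title count and by spend, since (as the CDL comparison shows) the two shares can differ.

```python
from typing import Dict, Set


def coverage_report(licensed: Dict[str, float],
                    archives: Dict[str, Set[str]]) -> dict:
    """Summarize how much licensed content is held by archiving programs.

    licensed maps a title identifier (e.g., an ISSN) to annual spend.
    archives maps a program name to the identifiers it preserves.
    """
    if not licensed:
        raise ValueError("licensed must contain at least one title")
    titles = set(licensed)
    archived = set().union(*archives.values())
    covered = titles & archived
    total_spend = sum(licensed.values())
    covered_spend = sum(licensed[t] for t in covered)
    return {
        "title_share": len(covered) / len(titles),
        "spend_share": covered_spend / total_spend if total_spend else 0.0,
        "uncovered": sorted(titles - archived),
        # Titles held by more than one program (redundant coverage).
        "in_multiple": sorted(
            t for t in covered
            if sum(t in held for held in archives.values()) > 1
        ),
    }


# Hypothetical sample data, for illustration only.
sample_licensed = {"title-A": 1000.0, "title-B": 3000.0,
                   "title-C": 500.0, "title-D": 500.0}
sample_archives = {"Program X": {"title-A", "title-B"},
                   "Program Y": {"title-B", "title-Z"}}
report = coverage_report(sample_licensed, sample_archives)
print(report)
```

In this made-up example half of the titles are covered but they account for a larger share of spend, which is the kind of gap a library might want to surface before deciding which program to join.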

Indicator 7: Network

Repositories will work as part of a network.

The DLF Minimum Criteria lay out the advantages to creating a network: establishing a “satisfactory” degree of redundancy of their holdings; developing common finding aids, access mechanisms and registry services; and potentially reducing costs. In response to an evaluation by outside experts last year, the KB agreed that e-Depot should become part of a “larger international programme for preserving scientific literature.” Yet what evidence exists that repositories are working toward this goal? Certainly they are holding common, often redundant, content and have common problems.

We asked the group whether they had any relationships with other archiving organizations in a number of categories. Table 17 summarizes their responses. Good collaboration is occurring in exchanging ideas and strategies (75%), sharing software (75%), and sharing planning documents (58%). The LANL-RL has shared its customized version of access software with both OhioLINK EJC and the Ontario Scholars Portal, and kopal/DDB and KB e-Depot are collaborating on the further implementation of IBM’s DIAS software. Kopal is part of nestor, the alliance for Germany’s digital memory; Portico and JSTOR have an agreement to use JSTOR’s content-delivery infrastructure. The LOCKSS Alliance and CLOCKSS are using the same software. CISTI Csi and the Ontario Scholars Portal are having informal conversations on ways to collaborate. CISTI Csi has implemented business continuity facilities with Library and Archives Canada. OCLC ECO plans to work with OCLC’s digital archives program in the future. And, as noted earlier, LC and the British Library intend to support the migration of electronic journal content to the NLM DTD standard.

Table 17. Responses to question: “Do you have any relationships with other archiving organizations involving the following activities?” ( • = yes; P = plan to within six months)

Coordinating content selection and providing secondary archiving responsibilities are under-represented forms of collaboration. Only two repositories indicated that they coordinate content selection, but both are doing it in the context of their own consortial arrangements rather than with the other digital archiving programs. Very few respondents have or are even thinking about succession plans or dependencies, as indicated by Tables 18 and 19, and only Portico has the contractual rights to pass on content and rights to another nonprofit organization. What may be more disturbing is that some may not even see the need to consider this option. One respondent wrote, “As a national library, we do not envisage that we would not continue.” Another responded, “As a legal deposit repository, the need for succession is unlikely (if not unthinkable).” Although several respondents expressed a willingness to consider serving as a successor archive if another archive failed, in reality little formal commitment has occurred.

Table 18. Responses to question: “Do you have a succession plan in the event you are not able to continue your program?” ( • = yes)

Table 19. Responses to question: “Do you or would you be willing/able to serve as a successor archive if another archive failed?” ( • = yes)

Recommendations

  1. Agree on the need for common rights to protect digital content and facilitate collaboration.
  2. Investigate models for collaborative digital preservation action, such as Data-PASS (Data Preservation Alliance for the Social Sciences), a broad-based partnership of leading data repositories in the United States, to ensure the preservation of materials within and beyond current repository holdings. Supported by an award from LC through its National Digital Information Infrastructure and Preservation Program, Data-PASS is working in such areas as selection, appraisal, acquisition, and metadata and has developed the concept of partner-to-partner protocols for conveying content if an archive fails.
  3. Fund a meeting of these programs’ principals to identify areas of collaboration.
 

Getting and Keeping Informed

At a time when there is a great deal of activity related to e-journal archiving, there is unfortunately no comprehensive clearinghouse or gateway to all the relevant developments. The sources listed here cover at least a portion of the landscape.

Bibliographies

Discussion Forums

Blogs

What’s New and News Listings

Online Journals and Newsletters

Web Sites

 

Promising E-Journal Archiving Programs Not Included in This Report

The 12 programs discussed in this report were selected on the basis of criteria presented earlier. One of those criteria was that the program had to already be archiving content. In the course of our research, we encountered references to additional programs that are still being planned or tested, or that have not yet devised a preservation strategy. Some of these programs are noteworthy because they will be archiving content that is not included in any of the 12 programs reviewed in this report, particularly e-journals using non-Roman alphabets. National libraries, through their legal deposit frameworks, are coordinating almost all this activity.

British Library (BL)
Subsequent to the passage of new legal deposit legislation in 2003, the British Library has been working with the Joint Committee on Legal Deposit to establish guidelines and procedures for deposit of materials not authorized for legal deposit in prior legislation (The British Library n.d.). To facilitate this work, three subcommittees were formed, including one to address issues relating to deposit of e-journals. The e-journals subcommittee has formed a working group that is conducting a pilot deposit project at the BL with more than 20 commercial, university, society, and small presses participating, representing more than 200 titles (Joint Committee on Legal Deposit 2004). The working group’s first report, issued in June 2005, emphasizes technical issues, especially file formats and metadata (Inger 2005).

Det Kongelige Bibliotek (The Royal Library, Denmark)
Legal deposit legislation in Denmark that went into effect July 1, 2005, includes a new section that covers “materials made public via electronic communication network.” It permits harvesting of public content on Danish Internet domains, as well as of materials intended for a Danish audience but made public on non-Danish Internet domains. A repository with preservation and access functions is being designed with the Royal Library’s partner, the Statsbiblioteket (State and University Library), and the two locations will provide reciprocal backup capability. Danish law allows online access to content provided under legal deposit only for material that is not commercially available and, even then, only to meet strictly defined research needs. Most e-journals will be available only onsite at the Royal Library.

Library and Archives Canada (LAC)
The bulk of scholarly journal publishing in Canada is from university presses, trade associations, and individual academic departments. The National Research Council Research Press is the largest publisher of electronic journals in Canada, with 15 titles. Other e-journal publishers of note are the University of Toronto Press and the Canadian Medical Association (McDonald and Shearer 2006). The most recent change to Canada’s legal deposit laws, passed in 2004, includes a mandate for deposit of electronic publications that goes into effect in January 2007. According to its 2005–2006 Report on Plans and Priorities (Frulla n.d.), LAC is planning to develop a system to “facilitate the acquisition, management, preservation and accessibility” of Canadian digital content, in concert with the new legal deposit requirements.

National Diet Library (Japan)
Though amended in 2000 to include CD-ROM and other packaged digital publications, Japan’s legal deposit legislation still does not cover online publications. Research preparatory to further amendments governing online publications has been under way at the National Diet Library, and revised legislation is expected soon. As part of its Digital Library Medium-Term Plan for 2004 (Mutoh 2005), NDL is conducting a digital library initiative that includes among its objectives the construction of a digital repository, Web archiving, and digital deposit for e-journals. Since 2002, NDL has been pursuing an experiment called the “Web Archiving Project” (WARP), to preserve Japanese Web sites, including digital editions of periodicals on the Internet and born digital periodicals (NDL n.d.). By 2004, WARP had made available 1,496 e-journals harvested from the Japanese Web, although it is unknown how many of these are scholarly (Mutoh 2005). Mechanisms for long-term preservation are being discussed.

National Library of China (NLC)
The NLC is developing a digital repository that includes both access and long-term preservation as part of its mission. NLC recognizes the importance of e-journals and is working on a strategy for their preservation, with an emphasis on STM titles (Zhang, Zhang, and Wan 2005). The current NLC digital collection includes e-journals in Chinese and in Western languages. In May 2005, NLC launched a portal to its digital collections, including 16,000 periodicals in Chinese and other languages. Because of copyright restrictions, the portal is available only within the NLC building. It is not clear how many of the 16,000 periodicals are scholarly titles. Preservation activities are still in the planning stages.

Others
A recent report by the International Federation of Library Associations and Institutions describes the digital preservation activities and plans of 15 national libraries (Verheul 2006). Besides those mentioned above, several others are working on repositories that are expected to incorporate e-journals and will merit attention over the next few years.


FOOTNOTES

6 See, for example, “Minimum Criteria for an Archival Repository of Digital Scholarly Journals,” Digital Library Federation, May 15, 2000, http://www.diglib.org/preserve/criteria.htm. In 2001, The Mellon Foundation funded seven institutions to research archiving options. The results of these studies pointed to the need for collective action.

7 “Digital Repositories: Some Concerns and Interests Voiced in the CRL Directors’ Conversation,” January 21–22, 2006 [at ALA midwinter] as distributed on the CRL Member Directors’ listserv, February 3, 2006, by CRL President Bernard F. Reilly. See also Digital Archives and Repositories Update, FOCUS 25(2). Available at http://www.crl.edu/PDF/pdfFocus/Winter2005-06.pdf.

8 Small and medium-size libraries expressed this concern in a study on the state of preservation programs by Kenney and Stam (2002).

9 See the “Gesetz über die Deutsche Nationalbibliothek (DBNG),” signed into law June 22, 2006, and available at http://www.d-nb.de/wir/pdf/dnbg.pdf.

10 See “RCUK Position on Issue of Improved Access to Research Outputs” Web page at http://www.rcuk.ac.uk/access/.

11 See Van Orsdel and Born 2006; see also letter to Senators Cornyn, Lieberman, and Collins from signatories of the Washington D.C. Principles for Free Access to Science, June 7, 2006, available at http://www.dcprinciples.org/LiebermanLetter.pdf. The D.C. Principles, released on March 16, 2004 (see http://www.dcprinciples.org/), lay out seven principles constituting “commitment to innovative and independent publishing practices and to promoting the wide dissemination of information in our journals” by dozens of nonprofit scholarly journal publishers that oppose government-mandated public release of scholarly research articles. One of the seven principles is, “We will continue to work to develop long-term preservation solutions for online journals to ensure the ongoing availability of the scientific literature.” As of August 1, 2006, only about half of the 75 scholarly society publishers who have signed the D.C. Principles had committed to one of the twelve e-journal archiving programs profiled in this report. Most are users of HighWire Press, which is in the process of including all its titles in LOCKSS.

12 A study of publishers’ archiving policies conducted in 2002 produced similarly disappointing results, indicating little progress in this area in the past four years. See Hughes 2002. Elsevier’s home page offers a link to a set of resources for librarians that includes Elsevier’s archiving policy: http://www.elsevier.com/wps/find/librariansinfo.librarians/libr_policies#sdarchiving. Publishers that have issued press releases announcing their participation in archiving programs have advertised only those most closely associated with archiving (Portico, LOCKSS, CLOCKSS, and KB e-Depot). If the others are noted (e.g., OhioLINK EJC and Ontario Scholars Portal), the announcements say nothing about archiving but focus on their roles in providing access. Other publisher sites checked were Oxford University Press, Kluwer, Sage, and Cambridge University Press. A few e-journal publishers and providers do refer prominently to their archiving efforts. Project MUSE has a link to Archiving and Preservation at http://muse.jhu.edu/about/index.html, and the journals home page of the American Institute of Physics (http://journals.aip.org) links directly to its archives and use policy at http://www.aip.org/journals/archive/arch&use.html.

13 An interesting glimpse at the perspective of publishers of journals for small scholarly societies regarding perpetual access responsibilities during title transfers appears in a publication of a British publishers’ association. “If an unequivocal contractual commitment to provide ‘perpetual’ access was made by the transferring publisher, then strictly speaking it should bear the cost of whatever solution is adopted (be careful of this when drawing up your own journal licenses for journals you do not own!).” See ALPSP 2002.

14 See http://www.lockss.org/lockss/Related_Projects.

15 A list of institutions that have received DINI certificates is available at http://www.dini.de/dini/zertifikat/zertifiziert.php.

16 Relevant standards include the OAIS (Open Archival Information System) Reference Model, ISO 14721:2002; PREMIS (PREservation Metadata: Implementation Strategies); METS (Metadata Encoding and Transmission Standard); NISO MIX (NISO Metadata for Images in XML), NISO Z39.87; MPEG-21; PDF/A-1 (Portable Document Format/Archival), ISO 19005-1:2005(E); OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting); Journal Archiving and Interchange DTD (Document Type Definition); and Journal Publishing DTD.

17 Even in the case of those programs that are using the NLM DTD, none requires the publisher to submit its material in that form. PubMed Central requires participating publishers to submit research articles in SGML or XML, based on an established journal article DTD. Although it does impose certain minimum coding requirements, it does not insist on use of the NLM DTD. More and more publishers are moving to XML-based production systems, and consider the XML version (not PDF or HTML) to be the official version. Nevertheless, there is still a considerable lack of publisher consistency regarding the “standard form” for journal articles.

18 See, for example, Bell and Lewis 2006, which examines interchange of electronic theses between a DSpace- and a Fedora-based repository; and Bekaert and Van de Sompel 2006.

19 Some do so now, e.g., OhioLINK; see http://www.ohiolink.edu/ostaff/it/docs/DisasterPlan.doc.

20 Ken Orr proposes six data-quality “rules” of potential relevance to maintainers of and contributors to dark e-journal archives. Among these are (1) unused data cannot remain correct for very long; (2) data quality will, ultimately, be no better than its most stringent use; (3) data-quality problems tend to become worse as the system ages; and (4) laws of data quality apply equally to data and metadata (Orr 1998).

21 kopal/DDB hopes to negotiate moving-wall access to preserved content with some publishers as well, but it cannot currently offer that service.

22 The other archiving initiatives (CLOCKSS, KB e-Depot, and kopal/DDB) would prefer to make content available to everyone after a trigger event, rather than manage authentication systems that control access to a select group of authorized users. These programs are discussed below.

23 E-mail message from Ed Sequeira to Rich Entlich, April 14, 2006. “The last time we tallied the cost of PMC, in October 2004, we came up with an annual operating cost of $2.3 million.”

24 See http://www.lockss.org/locksswiki/files/a/ad/AllianceInvoice.pdf. For a description of the Carnegie Classification system, see http://www.carnegiefoundation.org/classifications/. Equivalent measures are used for non–U.S. libraries.

25 More publishers and titles are represented as being included in programs employing LOCKSS boxes, and the publishers’ title listings on the Web site seem to be a work in progress. See http://www.lockss.org/lockss/Publishers_and_Titles.

26 E-mail, Patricia Cruse, Director, Digital Preservation Program, California Digital Library, to Anne R. Kenney, July 11, 2006.

27 Libraries buying new hardware to support the LOCKSS box can be expected to spend approximately $1,000. Total staff time, including technical support and collection development, averages several hours per month.

28 Information supplied by William Kara, e-resources and serials librarian, to Ellie Buckley, July 14, 2006.

29 Cornell has about 42,000 unique bibliographic IDs for e-journals, so a little over 5% of the e-journal content Cornell makes available is covered in LOCKSS and Portico.