APPENDIX 2 • CLIR

Profiles of the 12 E-Journal Archiving Initiatives

All data in the following summaries were current as of July 1, 2006.

Canada Institute for Scientific and Technical Information

The National Research Council of Canada (NRC), Canada’s governmental organization for research and development, hosts the Canada Institute for Scientific and Technical Information (CISTI), a major source for information in all areas of science, technology, engineering, and medicine. CISTI became the National Science Library in 1957.

CISTI has a key role as leader and catalyst in building universal, seamless, and permanent access to information for Canadian research and innovation. To help achieve this vision for Canada, CISTI has established a three-year program called Canada’s scientific infostructure (Csi). This program will create a national information infrastructure and opportunities for collaborations with partners to support research and educational activities.

Using a leading-edge architectural approach, CISTI has built a reliable technology platform with expandable storage capacity that ensures long-term access to digital content loaded at CISTI. CISTI is partnering with Library and Archives Canada (LAC) to ensure business continuity for the infrastructure. With the infrastructure in place, CISTI has loaded close to 5 million articles from publishers NRC Research Press, Springer, and Elsevier. New content from the Institute of Physics, Oxford University Press, the American Society for Microbiology, Mary Ann Liebert, and Emerald will be added to increase the depth and breadth of the repository.

As part of the Csi program, CISTI is negotiating with publishers for rights to make content accessible to customers and partners. To ensure that access is as seamless as possible, CISTI is implementing SFX to support bibliographic linking and is investigating best options to support authentication and authorization in a digital environment. CISTI is also conducting research in the areas of text and data mining and text analyses for future implementation.

LOCKSS Alliance

The Lots of Copies Keep Stuff Safe (LOCKSS) program began in 1999 as a research project based at Stanford University Library. LOCKSS launched the beta version of its open-source software to 50 libraries between 2000 and 2002. LOCKSS developed its software to allow libraries to collect, store, preserve, and provide access to their own, local copies of authorized content they purchase. The LOCKSS Web site¹ lists about 100 participating institutions in more than 20 countries that are using the LOCKSS appliance to capture content. About 25 publishers of commercial and open-access content are participating in LOCKSS, not counting the individual publishers represented by aggregators such as HighWire Press and Project MUSE, and LOCKSS’s own Humanities Project.²

In 2005, LOCKSS launched the LOCKSS Alliance as a membership organization that is built on the LOCKSS software to introduce governance for the program and to address sustainability issues. The LOCKSS Alliance is an open membership organization. Members have equal rights and responsibilities, though membership fees are based on an institution’s Carnegie Classification. LOCKSS Alliance membership benefits include participation in collection-development activities (including publisher briefings); early access to LOCKSS documents, documentation, and prerelease software; access to implementation, collection, and technology workshops; involvement in community planning efforts; and access to the LOCKSS program staff.

The LOCKSS Alliance assures its members of access to participating publisher content, if the member has licensed or purchased that content. Libraries manage their LOCKSS boxes to include all the licensed content to which they wish to ensure long-term access. Libraries can also negotiate with publishers that are not participating in LOCKSS. Participating publishers may choose to prevent the collection of new content, but they cannot withdraw content that was previously ingested.

The LOCKSS appliance, an open-source software application, is the core of the LOCKSS program and the foundation for the LOCKSS Alliance. The appliance uses Web harvesting to capture content from participating publisher Web sites. To participate in LOCKSS, a publisher grants access to libraries to collect, preserve, and provide access to the content and grants access to the LOCKSS software to crawl, collect, and preserve the content by adding a Web page called a LOCKSS publisher manifest. LOCKSS appliances continuously poll one another to compare their copies of the content, and each appliance has rules for monitoring, mediating, and repairing its copy on the basis of the results of this polling.
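
The polling-and-repair idea can be illustrated with a minimal sketch. It is not the actual LOCKSS protocol: it assumes a simple majority vote over SHA-256 digests, and the box names and the function names `digest` and `poll_and_repair` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict


def digest(content: bytes) -> str:
    """Return a SHA-256 hex digest of one stored copy."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(copies: Dict[str, bytes]) -> Dict[str, bytes]:
    """Compare every box's copy and repair those that disagree with the majority.

    `copies` maps a hypothetical box name to the bytes it holds for one item.
    """
    votes = Counter(digest(data) for data in copies.values())
    winning_digest, count = votes.most_common(1)[0]
    if count <= len(copies) / 2:
        # No strict majority: leave every copy untouched for later review.
        return dict(copies)
    # Pick any copy that matches the majority digest as the repair source.
    good = next(d for d in copies.values() if digest(d) == winning_digest)
    return {
        box: (data if digest(data) == winning_digest else good)
        for box, data in copies.items()
    }
```

In this simplified model, a box whose copy has been damaged is overwritten with the version held by most of its peers, which is why a larger number of independent copies makes silent corruption easier to detect.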

CLOCKSS

The CLOCKSS (Controlled LOCKSS) initiative is a 2006 addition to the LOCKSS program that brings together 6 libraries (Edinburgh University, Indiana University, New York Public Library, Rice University, Stanford University, and University of Virginia) and 12 publishers and learned societies (American Chemical Society, American Medical Association, American Physiological Society, Blackwell Publishing, Elsevier, Institute of Physics, Nature Publishing Group, Oxford University Press, Sage Publications, Springer, Taylor & Francis, John Wiley & Sons, Inc.) to establish a large-scale, dark archive for e-journals. The libraries participating in CLOCKSS are also participants in the LOCKSS Alliance. Each library will host two servers, creating a network of 12 dark repositories.

CLOCKSS is a limited-membership organization that is holding assets on behalf of the broader community. CLOCKSS systems will harvest content by Web crawling and ingest source files provided by publishers. Access to CLOCKSS content will be made available to the community following an access trigger event. The CLOCKSS system will automatically detect the cessation of online access from the publisher and, if the content remains unavailable for six months, the governing board (made up of libraries and publishers) will work collaboratively to determine whether content will be made available to the community for a limited or indefinite time. “It’s like a barn raising,” Gordon Tibbitts, president of Blackwell Publishing’s American division, said of CLOCKSS. “We all know we have to have the barn, so we’re calling everyone together to build it” (Kiernan 2006).
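
The six-month condition in the trigger process described above can be sketched as a simple date check. This is an illustration only: the function name `needs_board_review` is hypothetical, and "six months" is approximated here as 183 days, which is an assumption and not a CLOCKSS rule.

```python
from datetime import date, timedelta

# Assumption: "six months" is approximated as 183 days for this sketch.
REVIEW_THRESHOLD = timedelta(days=183)


def needs_board_review(unavailable_since: date, today: date) -> bool:
    """Return True once content has been unavailable for the threshold period."""
    return today - unavailable_since >= REVIEW_THRESHOLD
```

The check only flags content for the governing board; the decision to release it, and for how long, remains with the board.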

During the two-year developmental phase, the CLOCKSS initiative will also test the responsiveness of this distributed test bed of content to various potential disasters and share the results of these tests to contribute to the development of global strategies for preservation.

Koninklijke Bibliotheek e-Depot

As the national deposit library for the Netherlands, the Koninklijke Bibliotheek (KB) has the responsibility for preserving and providing long-term access to Dutch electronic publications. At first, the KB focused on Dutch publishers, but more recently it has come to recognize that multinational publishers produce academic literature, and, as a consequence, there is often no longer a national library that is the natural repository for the content the publishers produce. The KB, therefore, has assumed the responsibility to acquire and preserve, in conjunction with other repositories, the published scientific output of the world, regardless of where it was formally published.

To meet that responsibility, the KB began planning for e-journal archiving in 1993, started experimenting with e-journal archiving systems in 1995, and conducted research and implementation of an e-journal archiving system as part of the NEDLIB project from 1998 to 2000. The current e-Depot was delivered in 2002 and is now fully operational: a fully automated system, dedicated to long-term storage and large-scale archiving. The e-Depot system has been made part of the general budget of the KB. In addition, since at least 2003, the KB has been receiving earmarked funds for the operation of the e-Depot system as well as monies for research and development in long-term preservation. Currently, those funds amount to €2 million a year.

The growth of content in e-Depot has been dramatic. As of March 2006, the e-Depot contained more than 6 million digital objects in about 6 terabytes of storage space. More than 3,500 e-journal titles are represented in the repository. Among the prominent publishers that have signed archiving agreements with the KB are

Elsevier (1996, 2002)
BioMed Central (2003)
Kluwer Academic Publishers (now part of Springer) (2003)
Blackwell Publishing (2004)
Taylor & Francis (2004)
Oxford University Press (2004)
Sage Publications (2005)
Brill Academic Publishers (2005)
Springer (2005)

The KB’s goal is to include in the e-Depot the journals from the 20 to 25 largest publishing companies, which produce almost 90% of the world’s electronic STM literature.

Because there is no legal deposit requirement in the Netherlands, the deposit of material into e-Depot is managed through negotiations between the KB and individual publishers. At a minimum, the KB stipulates that there must be on-site access to all authorized library users. The archiving agreement with BioMed Central allows the KB to provide free remote access to more than 100 open-access journals. For non-open-access journals, the agreement with publishers stipulates that in the event that a publisher cannot deliver content for a long period of time, the KB could deliver the journals on an interim basis to subscribers. If a publisher should decide to stop providing electronic access, the KB could, if it so chooses, provide access to the world. Thus, while the e-Depot system is not primarily an access system, in an emergency the e-Depot could in theory provide access to users around the world—assuming sufficient funds to do so were available.

After receipt, ingest, and storage of electronic files from the publishers, the KB follows two technical approaches to long-term digital preservation. The first is migration: the KB plans to transform digital objects to keep them readable. The KB is also interested in emulation and has several projects under way to see whether it can be used both to lower the cost of preservation and to preserve the look and feel of the original object. The KB continues to work with IBM, the vendor for the e-Depot system, as well as partners from around the world, to create the technical tools required for digital preservation.

Perhaps the most important component of the KB’s approach to digital preservation, however, has been the articulation of the need for what it has called the “Safe Places Network.” The Safe Places Network will consist of a limited number of places that make a substantial investment in the equipment, skills, and expertise necessary to manage digital archiving programs. Sharing the risks inherent in a digital archiving system with a limited number of committed partners, it is hoped, will reduce the cost of digital preservation.

kopal/Die Deutsche Bibliothek

The Kooperativer Aufbau eines Langzeitarchivs digitaler Informationen (kopal) is a cooperative project funded by the German Federal Ministry of Education and Research. It began in July 2004. Its goal is to develop an innovative technical solution to the problem of long-term accessibility of digital documents. Project partners Die Deutsche Bibliothek (DDB, the National Library of Germany) and the Lower Saxon State and University Library (SUB Göttingen) are storing a variety of digital materials in a repository based on DIAS, the Digital Information and Archiving System, developed by IBM and the National Library of the Netherlands, the Koninklijke Bibliotheek, in The Hague. The Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG) is in charge of the archive’s technical operation, with software support provided by IBM Deutschland GmbH.

One of the driving forces behind kopal has been the need of DDB for a system for managing the legal deposit of electronic publications. DDB had been experimenting with electronic journals since 2000; in 2006, legal deposit legislation for electronic publications was enacted in Germany, making the implementation of a system a priority. Fortunately, as part of the initiation of electronic legal deposit, DDB is receiving a funding increase of about €2 million to implement it.

As part of its preliminary investigations, DDB had, through voluntary agreements with publishers, acquired a variety of electronic content, including 455 e-journal titles from Springer and many other e-journals from Wiley-VCH and Thieme. Under legal deposit, DDB will start acquiring and adding to kopal all electronic journals published in Germany.

DDB requires that publishers send to it compressed archive files that contain the journal contents plus some rudimentary metadata. At present, the intention is to maintain the readability of the archived file; when necessary, the content will be migrated into new formats. DDB has used emulation for some preservation activities and will continue to do so.

Voluntary agreements with publishers in the past have allowed for public access to the e-journals in the event of publisher failure. This “access of last resort” may also be possible with journals received via legal deposit. As yet, kopal has not built public-access systems, and so it is likely that there would be a significant delay between the collapse of a publisher’s delivery system and remote access to content in kopal. Nevertheless, kopal/DDB is likely to serve as an important guarantor of the long-term availability of e-journals published in Germany.

Los Alamos National Laboratory Research Library

Los Alamos National Laboratory (LANL) is one of three U.S. national laboratories (the other two being Sandia and Lawrence Livermore) operated under the National Nuclear Security Administration of the U.S. Department of Energy. The Research Library at Los Alamos National Laboratory (LANL-RL) has been locally loading licensed backfiles from several commercial and society publishers since 1995. Focusing on titles in the physical sciences, the library maintains the content primarily for the use of LANL staff, but it also serves a group of external cost-recovery clients. These include five U.S. Department of Energy laboratories, nine members of the U.S. Air Force Library Consortium, Sandia National Laboratories, Santa Fe Institute, and five universities located in the western United States. LANL-RL’s locally loaded e-journals are also available to members of the public who are on-site at the library during its regular hours. The titles come from the following publishers:

American Chemical Society
American Institute of Physics
American Physical Society
Elsevier
Institution of Electrical Engineers
Institute of Electrical and Electronics Engineers
Institute of Physics
John Wiley & Sons, Inc.
Royal Society of Chemistry (backfiles through 2004 only)
Springer

Through its digital library initiative, the Library Without Walls, LANL-RL has done substantial research and development work on repository and digital object architecture for long-term maintenance of electronic journal contents. In November 2004, LANL-RL received a $750,000 grant from the U.S. Library of Congress’s National Digital Information Infrastructure and Preservation Program “to support research and development of tools that will help address complex problems related to collecting, storing and accessing digital materials.”

A major focus of the research-and-development (R&D) work at LANL-RL has been the aDORe repository. aDORe uses a modular architecture, and is based on the following standards (Bekaert, Liu, and Van de Sompel 2005):

MPEG-21 DID (Digital Item Declaration) to represent digital objects
MPEG-21 DII (Digital Item Identification) to identify digital objects
XMLtapes and Internet Archive ARC files to store digital objects and constituent data streams
OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) to harvest resources
The OpenURL Framework to convey context-sensitive dissemination requests
Info URI to facilitate the referencing of information assets under the URI allocation
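
As an illustration of the harvesting standard in the list above, the following minimal sketch builds an OAI-PMH `ListRecords` request URL. The verb and the `metadataPrefix` and `from` arguments come from the OAI-PMH specification; the base URL in the usage note and the function name `list_records_url` are hypothetical and do not describe aDORe's actual endpoints.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     from_date: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"
```

For example, calling `list_records_url("https://repository.example.org/oai", from_date="2006-07-01")` yields a URL that a harvester could fetch to retrieve records added or changed since that date.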

LANL-RL is moving its main e-journal repository from ScienceServer to aDORe and expects to complete the transfer by the first quarter of 2007. Until then, it has to live with some of the limitations of ScienceServer, including the inability to display certain formats and partial lack of Unicode compliance. The new architecture will be considerably more flexible and was built with long-term preservation of digital objects in mind. In particular, it provides an application-neutral, XML-based means to store a wide variety of file formats while maintaining a record of the infrastructure and tools needed to decode the files through evolving digital environments.

Despite the emphasis on preservation in its R&D work, LANL-RL does not offer e-journal archiving services to its external cost-recovery clients. The fees paid by clients cover only the cost of current access and do not provide for subsequent access, even to backfiles, in the event of termination. However, even beyond its digital repository development contributions, LANL-RL’s e-journal preservation efforts have important implications, both for the LANL community and for the scholarly community at large.

First, LANL-RL has ensured through contractual negotiation that all acquired e-journal content can be perpetually archived. Second, it has extended its R&D work into the area of trustworthy and high-integrity transfer of e-journal content from publishers. Since 2003, LANL-RL has been working with the American Physical Society (APS) on a multiphase project that may lead to the establishment of a fully synchronized dark-mirror site for all APS publications wherein LANL-RL would become the worldwide source for APS content in the event of catastrophic failure of APS’s primary servers. LANL is in various stages of negotiation with other publishers to offer similar mirror and fallback services.

LANL receives appropriations from the U.S. Departments of Energy and of Defense, among other sources. The Research Library receives funding out of the institutional overhead in those appropriations. Researchers receiving grants are taxed for institutional support, and a portion of those funds goes to support the RL. Therefore, part of the RL’s funding comes indirectly from appropriations, though there is no explicit budget line for RL operations, let alone for e-journal archiving or other specific tasks.

This creates a certain amount of uncertainty regarding ongoing commitments to e-journal archiving. LANL-RL’s primary concern is that the scholarly journal literature needed by its staff continue to be available via an affordable and trustworthy mechanism. If another source that provided sufficient functionality emerged, LANL-RL could decide to contract for those services instead. On the other hand, LANL-RL was one of the earliest local loaders of e-journals, and as a result of ongoing R&D, has continued to offer LANL staff functionality not available elsewhere.

Another potential source of uncertainty is that LANL is undergoing a major restructuring that could affect priorities and funding. LANL is currently managed by the University of California (UC) under contract to the U.S. Department of Energy, but over the next year, operation of the laboratory will shift to a limited liability corporation called Los Alamos National Security that includes UC along with Bechtel National, Inc., BWX Technologies, Inc., and the Washington Group International, Inc. How the shift in management will affect the RL’s operation is not yet known.

National Library of Australia PANDORA

The National Library of Australia (NLA) established PANDORA in 1996. PANDORA is an acronym for Preserving and Accessing Networked Documentary Resources of Australia. PANDORA serves “all Australians, present and future, and anyone with a research interest in Australia.” In addition to the NLA, the PANDORA program includes nine national and state collecting agencies across Australia that partner to populate and maintain PANDORA. The NLA covers the infrastructure and support costs for PANDORA through appropriations.

PANDORA contains six priority categories of online publications, including Commonwealth and Australian Capital Territory government publications, publications of tertiary education institutions, conference proceedings, e-journals, titles referred by indexing and abstracting agencies, and topical Web sites. There are 1,983 journals represented in PANDORA, although not all are scholarly or peer reviewed. The PANDORA Web site groups the content into a broad range of subjects covering academic, cultural, social, political, and technical topics. Apart from approximately 150 commercial titles, PANDORA contains publicly accessible content. The commercial content of PANDORA is typically restricted for one to three years.

The first version of the PANDORA Archiving System (PANDAS) was released in 2001. The members of PANDORA use PANDAS to gather content, which is stored on NLA servers using proprietary storage software called DOSS. The NLA developed the PANDAS software to support these workflows: identifying, selecting, and registering candidate titles; seeking and recording permission to archive titles; setting harvest regimes appropriate to the content; gathering (harvesting) files; undertaking quality assurance checking; initiating archiving processes; and organizing access, display, and discovery routes to, and metadata for, the archived resources. The PANDAS software manages administrative metadata about titles that have been selected for archiving, rejected, or are being monitored pending a decision; manages access restrictions; schedules and initiates the harvesting of titles; manages the quality checking and assurance process; prepares and organizes harvested content for public display through title entry pages and title and subject listings; and provides operational reports. The PANDAS software that the NLA developed to gather content will be made available as open-source software soon.

OCLC Electronic Collections Online

OCLC launched Electronic Collections Online (ECO) in June 1997 to support the efforts of libraries and consortia to acquire, circulate, and manage large collections of electronic academic and professional journals. It provides Web access via the OCLC FirstSearch interface to a growing collection of more than 5,000 titles in a wide range of subject areas, from more than 40 publishers of academic and professional journals. Libraries, after paying an access fee to OCLC, can select the journals to which they would like to have electronic access.

An important component of the ECO offering is its promise of long-term accessibility to subscribed content. OCLC’s agreement with publishers ensures that it can continue to provide libraries with access to any content to which the libraries may have subscribed as long as the library continues to pay the access fee. Even if a user discontinues an ECO access account, OCLC will maintain the user’s subscription profile for five years, and if a user renews an access account before five years have passed, the user can regain access to all the journals covered by the previous subscription.
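
The five-year retention rule described above amounts to a date comparison, sketched below. The function name `can_restore_profile` is hypothetical, and treating the window as ending on the fifth anniversary of discontinuation is an assumption; OCLC's actual terms may define the period differently.

```python
from datetime import date

RETENTION_YEARS = 5  # from the text: profiles are kept for five years


def can_restore_profile(discontinued_on: date, renewed_on: date) -> bool:
    """Return True if renewal happens before the retention period has passed."""
    target_year = discontinued_on.year + RETENTION_YEARS
    try:
        expiry = discontinued_on.replace(year=target_year)
    except ValueError:
        # February 29 has no counterpart in a non-leap year: use February 28.
        expiry = discontinued_on.replace(year=target_year, day=28)
    return renewed_on < expiry
```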

Although ECO has not established the “minimal set of well-defined services” that would make it a “qualified preservation archives” (Waters 2005), it has undertaken a number of steps that increase the likelihood that it will be able to provide continued access to the content it offers. For example, OCLC maintains a copy of all journal content and the associated abstract and index data in an off-site storage facility. It has also secured the right to migrate journal backfiles to new data formats as current formats such as PDF, which form the vast bulk of ECO content, become outmoded. (OCLC has not as yet, however, had to migrate any file formats.) ECO is not part of OCLC’s Digital Archive service and has no immediate plans to take advantage of OCLC’s “real-world solutions for the challenges of archiving and preservation in the virtual world.”

In the event of publisher failure or some other trigger event that would prevent a publisher from delivering content to subscribers, it is possible that subscribers might be able to shift their subscriptions to ECO in order to secure access. This would have to be worked out in negotiations with the publishers. Should OCLC decide to stop offering the ECO service, it can provide libraries with copies, on tape or CD/DVD, of any content to which they had subscribed. It would then be the library’s responsibility to mount that material and make it available.

OhioLINK Electronic Journal Center

The Ohio Library and Information Network (OhioLINK) is a consortium of Ohio’s college and university libraries, comprising 85 institutions of higher education and the State Library of Ohio. OhioLINK’s electronic services include a multipublisher Electronic Journal Center (EJC), launched in 1998, which contains more than 6,900 scholarly journal titles from close to 40 publishers across a wide range of disciplines. Although several OhioLINK resources are available to all Ohio residents (with some open to all on the Internet), the content of EJC is available only to students, faculty, and staff members at OhioLINK-affiliated institutions. At this time, OhioLINK has neither the resources nor the legal right to make the contents of EJC available outside of the state of Ohio.

EJC is an optional service of OhioLINK, though the vast majority of Ohio higher education institutions have chosen to participate. The cost of joining EJC is determined by the institution’s current spending on journals from the publishers who are represented in EJC, including print and electronic subscriptions. Most institutions wind up getting electronic access to far more titles than they previously were subscribing to for a similar outlay of funds. The access mechanism is shifted from a campus-based one through publishers and aggregators to one based on EJC.

EJC accepts most content as it is supplied by the publisher, but is limited in the formats that can be displayed by its main repository software, ScienceServer. The current version of ScienceServer can display only PDF, TIFF, and some types of XML. EJC intends shortly either to upgrade to a new version of ScienceServer or move to different repository software. Goals for the new software include expansion of the range of file formats that can be displayed and resolving existing display limitations caused by the lack of Unicode compliance in the old ScienceServer.

OhioLINK has declared its intention to maintain the EJC content as a permanent archive and has acquired perpetual archival rights in its licenses from all publishers but one (the American Chemical Society). Furthermore, in May 2006 the OhioLINK Governing Board approved a series of recommendations that included a commitment to seek the addition of a clause to all EJC contracts that would extend liberal self-archiving and access rights to all personnel of Ohio higher education institutions.

EJC relies on regular and heavy use by subscribers to help maintain the integrity of its archive and reveal problems. Though it anticipates having to perform file migrations in the future, it has not done any yet. It does not normalize incoming files. Instead, EJC relies on publishers to supply files in one of the standard formats that ScienceServer is capable of displaying. Content received from publishers in other formats is retained, but will not be displayable until the next-generation repository software is in place.

All technical infrastructure costs, as well as about 20% of content-acquisition costs, are centrally funded through legislative appropriations. The remaining funding for content comes from member libraries. Fluctuations in state appropriations have resulted in discontinuation of some titles. EJC’s contracts stipulate a nonpunitive approach to obtaining missing content if EJC resubscribes to a canceled title.

EJC has been extremely popular and continues to experience growth in usage. OhioLINK would like to expand EJC to include publishers such as Sage, Taylor & Francis, Cell Press, the Institute of Electrical and Electronics Engineers, GeoScienceWorld, and titles from a number of scholarly societies. Some of these acquisitions would fill gaps in disciplines such as nursing and the biosciences that OhioLINK officials feel are currently underserved. If funding can be found, OhioLINK also wants to purchase backfiles for many titles as a means to increase access and save member libraries money by reducing the need to store print copies at multiple sites.

Plans include development of a Digital Resource Commons (DRC),³ with which OhioLINK hopes to accomplish with a shared repository environment what EJC and other OhioLINK components have done with shared content. Instead of member institutions investing the resources to create and manage their own repositories, DRC would provide a centrally managed repository (based on Fedora) with locally controlled infrastructure for ingest, and a sophisticated, multilevel access rights management system. According to OhioLINK, DRC “ingests, preserves, presents, and mediates administration of the educational and research materials of participating institutions.” Capabilities envisioned include an institutional repository for research portfolios such as preprints, postprints, and working papers, electronic thesis and dissertation management, and Web-mediated peer-reviewed electronic journals with open access, self-archiving, and publishing.

Ontario Scholars Portal

Launched in 2001, the Ontario Scholars Portal (OSP) serves all 20 university libraries in the Ontario Council of University Libraries (OCUL) consortium.⁴ The Portal includes more than 6,900 e-journals from 13 publishers and metadata for the content of an additional 3 publishers. The publishers currently represented include Elsevier, John Wiley & Sons, Inc., Springer, Kluwer Law International, Blackwell, Oxford University Press, Cambridge University Press, American Psychological Association, Emerald, Berkeley Electronic Press, Sage, Institute of Electrical and Electronics Engineers, and the Royal Society of Chemistry.

The Portal uses a combination of “push and pull” to gather content: publishers provide source files, and the Portal harvests content from publisher Web sites. The Portal stores all the content from publishers, but the current system cannot render all the formats that have been stored, e.g., video files and numeric data. Most of the content is in PDF or XML format.

The primary purpose of the Portal is access, but the consortium has made an explicit commitment to the long-term preservation of the e-journal content that it loads locally. The Portal provides online access to the content that consortium members have licensed or purchased. Members of the consortium are required to pay membership fees and are represented on the executive board of the Portal. Preservation is included in the e-journal service to members.

Between 2001 and 2005, OSP was supported by a grant and provincial matching funds as part of the Canadian National Site Licensing Program.⁵ Ongoing support for OSP relies upon a membership cost model that adjusts for the varying size of consortium members and usage factors and that includes tiered membership fees.

Portico

Portico is one of the newest of the archiving programs, having just gone “live” in 2006 (although planning began in 2004, and the preservation obligation was assumed in 2005). The mission of Portico is to “preserve scholarly literature published in electronic form and to ensure that these materials remain accessible to future scholars, researchers, and students.” Specifically designed as a third-party electronic-preservation service, Portico serves as a permanent dark archives. E-journal availability (other than for verification purposes) is governed by trigger events resulting from substantial disruption to access via the publishers themselves.

The program’s archival approach begins with the receipt, directly from the publishers, of source files that comprise the intellectual content of electronic scholarly journals. These diverse files are then transformed, or normalized, to a standard archival format that can be managed over time through the preservation strategy of migration.

Portico boasts a strong pedigree, with startup funding provided by The Andrew W. Mellon Foundation, Ithaka, JSTOR, and the Library of Congress. A membership organization, it is open to all libraries and scholarly publishers, both of which are asked to support the effort through annual contributions. Thirteen publishers are participating in Portico:

American Anthropological Association
American Mathematical Society
Annual Reviews
Berkeley Electronic Press
BioOne
Elsevier (including Cell Press and The Lancet)
John Wiley & Sons
Oxford University Press
Sage Publications, Inc.
Society for Industrial and Applied Mathematics (SIAM)
Symposium Journals (Oxford UK)
United Kingdom Serials Group
University of Chicago Press

Recently announced library fees, ranging from $1,500 to $24,000 per year, are based on the total library materials expenditures for an individual institution. To encourage early adopters, libraries that subscribe to this service in 2006 and 2007 will be designated “Portico Archive Founders” and will receive substantial savings on their annual archive support payment for five years. Library systems and consortia that facilitate support for the archive among their member institutions will be offered modest savings in their annual payments. According to Eileen Fenton, executive director, Portico is aiming to attract additional libraries from across the Carnegie Classification of Institutions of Higher Education.⁶

PubMed Central

PubMed Central (PMC) is a free, publicly accessible digital archive of English-language biomedical and life sciences journal literature, run by the National Center for Biotechnology Information (NCBI) of the U.S. National Library of Medicine (NLM). Launched in February 2000 with content from the Proceedings of the National Academy of Sciences and Molecular Biology of the Cell, PMC has grown to include hundreds of thousands of articles from about 250 titles and 50 publishers.

Like the similarly named PubMed, PMC is an integral component of NCBI’s Entrez life sciences search engine. While PubMed contains citations, abstracts, and links to full-text articles, PMC consists of full-text research articles and other content from peer-reviewed life sciences journals. The two services are separate and not entirely complementary. PubMed points to numerous articles that are not in PMC, while some content in PMC (mostly nonarticle journal content) is not indexed in PubMed.

PMC’s mandate to preserve the journal literature of biomedicine comes from the Congressional act that created NLM, which authorizes it to “acquire, organize, disseminate and preserve books, periodicals, . . . and other library materials pertinent to medicine.” At the moment, NLM cannot compel researchers to deposit their publications in PMC, but authors of life science research sponsored by the U.S. National Institutes of Health (NIH) are asked to voluntarily deposit final manuscripts of their articles in PMC within a year of publication.

That situation may change, however. Legislation entitled the Federal Research Public Access Act of 2006 (introduced in the U.S. Senate on May 2, 2006) would require each U.S. government agency with annual extramural research expenditures of more than $100 million to make journal articles based on research it funds publicly available via the Internet within six months of publication. If the bill is passed, agencies in the U.S. Department of Health and Human Services, e.g., NIH and the Centers for Disease Control and Prevention, would presumably use PMC, since the bill requires that manuscripts be preserved in a digital archive that supports free public access, interoperability, and long-term preservation.

Other content comes into PubMed Central by a variety of mechanisms. Some open-access journal publications (most notably the entire set of BioMed Central journals) use PMC as their archiving solution. Some commercial publishers that do not otherwise have agreements with PMC allow authors to designate their articles as open access and to deposit these articles in PMC. Finally, a growing number of publishers have reached contractual agreements with PMC to deposit all their journal contents with PMC.

To participate in PMC, a publication must be covered by a major abstracting/indexing service, or have three editorial board members with current grants from major nonprofit funding agencies. Publishers are required to supply source files (via FTP or on CD/DVD or tape) in either SGML or XML, conforming to the NLM Journal Archiving XML DTD or another full-text article DTD that is widely used in the life sciences. The original high-resolution digital image files must be provided for all figures. PMC prefers (but does not require) that publishers also include a PDF version of their articles in the archive. Publishers are encouraged to deposit the entire contents of their journals for archiving, but must at minimum provide all research articles. For display purposes, PMC performs an on-the-fly conversion of stored XML to HTML.
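The on-the-fly conversion of stored XML to HTML mentioned above can be illustrated with a minimal sketch. This is not PMC's actual rendering pipeline, which the profile does not describe. The sample document, the element names chosen for it (`article-title`, `sec`, `title`, `p`), and the function names are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, heavily simplified article; real archived files are far richer.
SAMPLE_XML = """<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>Sample title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <sec>
      <title>Introduction</title>
      <p>First paragraph &amp; more.</p>
    </sec>
  </body>
</article>"""


def text_of(elem):
    """Return the escaped text content of an element, ignoring inline markup."""
    return escape("".join(elem.itertext()).strip())


def article_to_html(xml_text):
    """Convert the assumed article structure to a small HTML fragment."""
    root = ET.fromstring(xml_text)
    parts = []
    title = root.find("./front/article-meta/title-group/article-title")
    if title is not None:
        parts.append(f"<h1>{text_of(title)}</h1>")
    body = root.find("./body")
    if body is not None:
        # Walk the body in document order; map section titles and paragraphs.
        for elem in body.iter():
            if elem.tag == "title":
                parts.append(f"<h2>{text_of(elem)}</h2>")
            elif elem.tag == "p":
                parts.append(f"<p>{text_of(elem)}</p>")
    return "\n".join(parts)


if __name__ == "__main__":
    print(article_to_html(SAMPLE_XML))
```

Because the HTML is generated at display time, the stored XML remains the single archival copy, and the presentation can change without touching the archived files.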

PMC has a flexible deposit policy designed to accommodate the desire of many publishers to delay appearance of journal content in PMC for a period of time following publication. Although publishers are encouraged to make content available via PMC as soon as possible after publication, they may request a delay of up to one year for research articles, and up to three years for other content, such as letters and reviews.
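The delay limits above amount to simple date arithmetic. The following sketch computes the latest date by which content would have to appear in PMC under the stated maximums (one year for research articles, three years for other content). The category labels and function names are hypothetical and are not part of any PMC system.

```python
from datetime import date

# Maximum delays stated in the text, in years; the category labels are hypothetical.
MAX_DELAY_YEARS = {"research_article": 1, "other": 3}


def add_years(d, years):
    """Add whole years to a date, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)


def latest_release_date(published, content_type):
    """Latest permitted appearance in PMC for content published on a given date."""
    years = MAX_DELAY_YEARS.get(content_type, MAX_DELAY_YEARS["other"])
    return add_years(published, years)


def delay_is_allowed(published, requested_release, content_type):
    """True if a requested release date falls within the permitted delay."""
    return published <= requested_release <= latest_release_date(published, content_type)
```

For example, under these assumptions a research article published on July 1, 2006 would have to appear no later than July 1, 2007, while a letter published the same day could be delayed until July 1, 2009.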

NLM is committed to long-term stewardship of the content in PMC. All contracts must include a clause granting PMC perpetual archiving rights for any deposited material. Two operational policies dominate PMC’s approach to content longevity. One is an emphasis on standardized XML, which is portable, maintains document structure, and lends itself to intelligent processing without sacrificing human readability. NLM is continuing its work on the Journal Archiving and Interchange DTD, from which the Journal Publishing DTD was derived and for which the Library of Congress and the British Library recently announced support. The other is free, open access to all content, which, in concert with automated processes, helps ensure the integrity of archived content through direct, active, and continuous use.

NLM is also committed to expanding PMC. New publishers and titles are being added regularly, and NLM has embarked on a program of back-issue digitization for the titles that are routinely depositing current content in PMC.

PMC is not identified specifically as a line item in the NIH or NLM budgets. In October 2004, a review of personnel, contract, and system (hardware/software) costs noted an annual cost of $2.3 million. This included most operating costs for staff, contract work, equipment, and software other than the cost of digitization of journal back issues.

FOOTNOTES

¹ http://lockss.stanford.edu/about/users.htm.

² http://lockss.stanford.edu/about/titles.htm.

³ About the Digital Resource Commons, http://drc-dev.ohiolink.edu/.

⁴ http://www.ocul.on.ca/.

⁵ http://library.queensu.ca/libdocs/news/2001apr09.htm.

⁶ http://www.carnegiefoundation.org/classifications/.