Libraries face several issues as a consequence of undertaking digital conversion. Some of these issues are constraints that may limit the potential of digitization to enhance research and teaching. They must be identified and explored to assess accurately the costs and benefits of digitization. Other issues may not be constraining as such, but may put new pressures on libraries to expend further resources. They, too, must be scrutinized and managed if digital programs are to be developed cost-effectively and with the greatest possible benefit to the collections and their users.
3.1. Treatment and Disposition of Source Materials
Selecting for scanning must include an assessment of the source’s physical condition and readiness for the camera. For those items that are rare, unique, fragile, or otherwise of artifactual value, preparation for scanning usually demands the attention of conservators, if only to confirm that the item will not be harmed by the camera.
Some type of prescanning treatment may also be required, from strengthening or mending paper to removing environmental soil. In the case of certain media, such as medieval manuscripts or daguerreotypes, the institution may need to call upon the services of a consultant or vendor with special expertise. This involves more time and resources (e.g., for writing contracts and checking items that return from a vendor) and may divert resources normally dedicated to other conservation work.
The disposition of scanned source materials that are not unique or rare is a challenging subject that most libraries are just beginning to confront. When digitization becomes an acceptable, if not preferred, alternative to microfilm for preservation reformatting, and items that can be scanned are made available over the network, what criteria will libraries use to decide what to keep on campus, what to send to remote storage, and what to discard? For materials that are rare or unique, that question should not arise. But what about back journals that will be available from a database such as JSTOR, or American imprints that the University of Michigan and Cornell have scanned and made available without restrictions on their MOA site? The library community never reached a consensus about this issue for microfilming. The historical inattention to originals, on the assumption that some institution would retain an original, has led to avoidable material losses and to a serious public relations problem.
Recent trends in scholarship have forced libraries to reexamine old assumptions about the value of original serials and monographs, items that are not traditionally prized for their artifactual value (Council on Library and Information Resources 2001). It is important that research libraries find cost-effective means of preserving originals and making them readily accessible. This does not mean that all libraries should keep multiple copies of items that researchers prefer to use electronically. But 20 years from now, when many scholars may well prefer accessing these materials electronically over retrieving them from library shelves, will the library community have developed a collective strategy for preserving a defined number of originals for access purposes and reducing the redundancy of print collections? How will researchers who wish to have access to originals be able to find out where they are and how they can be viewed? Current plans to register digitized items in a nationally coordinated database include consideration of noting the location of a hard copy available for access, and the fate of originals is beginning to get the attention that researchers demand (Digital Library Federation 2001).
For certain very fragile media, such as lacquer sound discs or nitrate film, the original or master should seldom or never be used for access purposes. Service of those collections should always be done on reformatted (access) media. However, that is an expensive proposition for any library, and there is great resistance to pushing the costs of preservation transfer onto the user. Regrettably, a great number of recorded sound and moving image resources continue to be played back using the original. Until digitization is an affordable option for access to these media, their preservation will remain at very high risk.
Digitizing either general or special collections presents challenges regarding size: How many items from any given collection will be sufficient to create added value? “Critical mass” is one selection criterion that shows up in nearly all the written guidelines for selection and is commonly noted in conversation. The magic of critical mass, in theory, is that if you get enough related items up in a commonly searchable database, then you have created a collection that is richer in its digital instantiation than in its analog original. This is premised on the notion that the technology has a transformative power: it can not only re-create a collection online but also give it new functionality, allow for new purposes, and ultimately create new audiences who put it to novel queries. It does this by, for example, turning static pages of text or numbers into a database. Monographs are no longer limited by the linear layout of the bound volume or microfilm reader. By making texts searchable, librarians can create new resources from old ones and transform items that have had little or no use into something that receives hundreds or thousands of hits.
But how much is enough? A critical mass is enough to allow meaningful queries through curious juxtapositions and comparisons of phenomena, be it the occurrence of the word “chemise” in a run of Victorian novels or the U.S. Census returns from 1900. A large and comprehensive collection is valuable because it provides a context for interpretation. But in the digital realm, critical mass means something quite new and as yet ill-defined. The most salient example of this phenomenon is the MOA database at the University of Michigan, which contains thousands of nineteenth-century imprints at risk of embrittlement or already embrittled. Staff report that although the books themselves were seldom called from the stacks, the MOA database is heavily used, and not exclusively by students and teachers at Michigan. Members of the University of Michigan community use MOA most heavily, but one of its largest users is Oxford University Press, which mines the database for etymological and lexical research. Is this database heavily used because it is easily searched and the books were not? Because one can get access to it from any computer in any time zone, while the books were available to only a small number of credentialed users? Were the books of no research value when they languished in remote storage? And what is their research value now? It is hard to isolate a single factor that is decisive. Attempts to create equally valuable critical masses must address not only content but also searchability, ease of use, ranking on various search engines, and so forth.
In the case of general collections, or imprints, one must select enough that they, taken together, create a coherent corpus. In one case, time periods may set the parameters; in another, it might be genre or subject. In many ways, what makes a digitized collection of print materials singularly useful is the ability to search across titles and within subjects. The more items in the collection, the more serendipitous the searching. And as JSTOR has shown, incremental increases in the number of titles in the corpus are possible because new titles enter an already meaningful context and are available through a nimble search-and-retrieval protocol.
Critical mass could more accurately be thought of as “contextual mass,” that is, a variable quantity of materials that provides a context for evaluation and interpretation. In the analog realm, searching within a so-called critical mass has always been very labor-intensive. It has taken great human effort and patience to identify the relationships in and among items in a collection, and it has been possible only within collections that are physically located together. But once those items are online, in a form that is word-searchable, one has a mass that is accessible to machine searching, not the more arduous human researching. Theoretically, when many related collections exist together on the Web, they create a significantly more meaningful source than these same collections would if not linked electronically. In reality, achieving such a contextual mass across institutional boundaries will remain elusive as long as the collections are not interoperable.
For archival-type collections, which are not necessarily text based and are usually under looser bibliographical control than are published works, the amount of material needed to get a critical mass can defy the imagination, or at least challenge the budget. If a collection is too large to digitize (for example, a photo morgue or institutional records), staff may choose to digitize a portion that represents the strengths of the collection. But what portion? How much is enough? These are subjective decisions, and they are answered differently by different libraries. In the public libraries, with no faculty to provide advice, the decisions have been made by the curatorial staff. The Library of Congress has called in scholarly consultants and educational experts from time to time to aid in selection decisions, but the curatorial staff always makes the actual decisions. The NYPL relies on a curatorial staff that is expert in a number of fields. Like most cultural heritage institutions, it has long corporate experience in selecting for exhibitions. Curatorial staff in academic special collections libraries often have the opportunity to work with faculty or visiting researchers who collaborate in shaping a digital collection and even add descriptive and interpretive text to accompany items.
But many curators see digitizing anything less than a complete collection as “cherry picking,” which results in a collection that does not support the research mission of the institution. Others are less severe and cheerfully admit that for most researchers, a little bit is better than nothing at all, and very few researchers mine any single collection to the depth that we are talking about. Those who do, they assert, would end up seeing the collection on site at some point in any event. These judgments are generalized from anecdotal experiences and are not based on objective data. When asked, for example, about how research techniques in special collections may be affected by digitization, some librarians said that research will be pursued by radically different strategies within a decade. Others think that research strategies for special collections materials will not change, even with the technology. The important thing, in their view, is not to get the resources online but to make tools for searching what is available in libraries, such as finding aids, readily accessible on the Web. The NYPL has secured money to do long-term studies of users of digitized special collections to gather information about use and to test assumptions. More needs to be done. A significant portion of grant-funded digitization, especially that supported by federal and state funds, should include some meaningful form of user analysis.
The California Digital Library (CDL) has inaugurated a project, called California Cultures, designed to make accessible a “critical mass of source materials to support research and teaching. Much of this documentation will reflect the social life, culture, and commerce of ethnic groups in California” (CDL 2000). The collection will comprise about 18,000 images. The California Digital Library sees collaboration as a key element in scalability. Because of funding and governance issues, the CDL believes that it must foster a sense of ownership and responsibility for these collections among creators statewide, locality by locality.
The role of scholars in selecting a defined set of contextually meaningful sources often works well for published items in certain disciplines. Agriculture and mathematics are examples where scholars have been able to come up with a list of so-called core literature that is amenable to comprehensive digitization. By way of contrast, curators may do a better job in selecting from special collections of unpublished materials (musical manuscripts, photo archives, personal papers) than scholars do. These are materials with which only curatorial staff members are sufficiently familiar to make selection decisions. While there are exceptions to this rule, the sheer quantity of materials from which to select often makes the involvement of scholars in all decisions impractical and hence unscalable.
Scholars tend to have a different concept of the critical mass than do librarians. Projects such as the Blake Archive and the Digital Scriptorium are built with the achievement of a critical mass for teaching and research in mind. A collection-driven, text-based program such as MOA can convert massive amounts of text and make it searchable, but it can also put up materials without an interpretive framework. Other projects, such as APIS and, to a large degree, American Memory, invest time and money in creating interpretive frameworks and item-level descriptions that never existed when the items were analog, confined to the reading room, and served by knowledgeable staff. In many ways, this type of access is really a new form of publishing, not library service as it has been traditionally conceived.
3.3. Intellectual Control and Data Management
The scarcity of cataloging or description that can be quickly and cheaply converted into metadata is often a decisive factor in excluding a collection from digitization. Given that creating metadata is more expensive than is the actual scanning, it is necessary to take advantage of existing metadata-that is, cataloging. Often, money to digitize comes with promises by library directors that they will put up several thousand, even several million, images. This is a daunting pledge. To mount five million images in five years, as LC pledged to do, has meant giving priority to large collections that already have extensive bibliographical controls. The NYPL is likewise giving selection preference to special collections that already have some form of cataloging that can be converted into metadata to meet production goals. In this way, expedience can theoretically be happily married to previous institutional investments. These libraries have put enormous resources into creating descriptions, exhibitions, finding aids, and published catalogs of prized institutional holdings. One can assume that a collection that has been exhibited or made the subject of a published illustrated catalog has demonstrated research and cultural value.
Some collections that are supported by endowments can make the transition to digital access more easily than can others, because funds may be available for this within the terms of the gift. The Wallach Division of Art, Prints, and Photographs, for example, at the NYPL will be put online as the Digital Wallach Gallery. There are a number of grant applications that not only build the cost of metadata creation into the digitization project but also appear to be driven in part by a long-standing desire on the part of a library finally to get certain special collections under bibliographical control.
It can be quite difficult, however, to harmonize descriptive practices that were prevalent 40 years ago with what is required today. The expansive bibliographical essays that once were standard for describing special collections need quite a bit of editing to make them into useful metadata. It is not simply a question of standards, which have always been problematic in special collections. The problem is that people research and read differently on the Web than they do when sitting with an illustrated catalog or finding aid at a reading desk. Descriptive practices need to be reconceptualized for presentation online. This reconceptualization process is several years off, since we have as yet no basis for understanding how people use special collections online.
For monographs and serials, the genres for which the MARC record was originally devised and for which it remains a well-understood standard, retooling catalog records need not be complicated or expensive. For materials that are published but not primarily text based, such as photographs, posters, recorded speech, or musical interpretations, the MARC record has noted limitations, and those tend to be accentuated in the online environment. Unpublished materials share this dichotomy of descriptive practice between textual and nontextual. For institutions that have chosen to put their special collections online, tough decisions must be made about how much information can be created in the most cost-effective way. In some cases, rekeying or OCR can be used to produce a searchable text in lieu of creating subject access. For handwritten documents, non-Roman scripts, and audio and visual resources, searching remains a problem.
The context for interpretation needs to be far more explicit in the online environment than in the analog realm. It is interesting to think about why that is so. Are librarians creating too much descriptive material for online presentation of those collections that have successfully been served in reading rooms with no such level of description, or too little? Are librarians assuming that the level of sophistication or patience of the online user is far lower than that of the onsite researcher? It is commonly assumed that an online patron will not use a source, no matter how valuable, if it is accompanied by minimal-level description. This assumption may be well founded in principle, and it is certainly true that the deeper and more structured the description, the likelier it is that the item will be found through the various searching protocols most in use. But by removing research collections from the context in which they have traditionally been used, the reading room, one also removes the reference staff who can guide the patron through the maze of retrieval and advise about related sources. Materials that the public has had little experience using are now readily available online. If patrons lack certain research skills, the resources will remain inaccessible to them.
The ease of finding digitized items on library Web sites varies. A few sites are constructed in a way that makes finding digitized collections almost impossible for people who do not already know they exist. Other sites have integrated the surrogates into the online catalog and on OCLC or RLIN or both. Those DLF members whose primary purpose in digitization is to increase access to special collections and rare items have expressed willingness to expose the metadata for these collections to a harvester using a technical framework established for the Open Archives Initiative.
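Exposing metadata to a harvester under the Open Archives Initiative framework means answering simple HTTP requests (OAI-PMH) and returning records, typically in unqualified Dublin Core. The sketch below illustrates the shape of such an exchange; the repository URL is hypothetical, and the sample record is illustrative only.

```python
# Minimal sketch of an OAI-PMH exchange: building a ListRecords request
# and reading a Dublin Core record, as a harvester of a library's
# digitized-collection metadata might. The endpoint URL is hypothetical.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "https://library.example.edu/oai"  # hypothetical repository endpoint

def list_records_url(metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (verb and metadataPrefix
    are defined by the OAI-PMH specification)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec  # optional selective-harvesting parameter
    return BASE_URL + "?" + urlencode(params)

# A minimal unqualified Dublin Core record, as a response might carry it:
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Nineteenth-century imprint (illustrative record)</dc:title>
  <dc:type>text</dc:type>
</oai_dc:dc>"""

def dc_titles(xml_text):
    """Extract dc:title values from an oai_dc record."""
    root = ET.fromstring(xml_text)
    ns = {"dc": "http://purl.org/dc/elements/1.1/"}
    return [el.text for el in root.findall("dc:title", ns)]
```

Because the protocol is this simple, the cost of exposing metadata is borne almost entirely by the quality of the underlying cataloging, not by the technical layer.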
3.4. Coordinated Collection Development
The idea of coordinated collection development of digital objects is a powerful one. It motivated the Berkeley, Michigan, and Cornell libraries to work together to mount several collections from their own holdings that together could be termed one collection, the Making of America. Given the resources that must be dedicated to creating digital collections and the resources it takes to build the infrastructure that allows access to them, it would seem that the only way to build truly scalable collections is through some cooperative effort.
But for all the talk of building federated collections that will aggregate into a digital library with depth and breadth (that is, critical mass), the principle of “states’ rights” remains the standard. Each institution decides on its own what to digitize, and usually does so with little or no consultation with other libraries. There are funding sources that require collaboration in some circumstances (the Library of Congress’s Ameritech grant is an example), but the extent of collaboration usually has to do with using the same standards for scanning and, sometimes, description. Selection is not truly collaborative; it could more properly be characterized as “harmonized thematically.” Institutions usually make decisions based on their particular needs rather than on community consensus about priorities.
How do we know that we are not duplicating efforts in digitization, even when the content comes from special collections? Two sheet music projects, the Levy Sheet Music Collection at Johns Hopkins University and the Historical American Sheet Music Collections at Duke University, may or may not have significant overlap, for example. There may be sound reasons to scan each collection in full, even if there is overlap. But we are not able to make that kind of decision at present without a comprehensive database about such projects. Some will argue that duplication is good and that dreams of standardization are premature. Nonetheless, while duplication may have some benefits at these early stages of research and development, we are unable to take advantage of these benefits unless we know what others are doing. A registry that would make core pieces of information available, not only about content but also about technical specifications, disposition of originals, and cataloging preferences, is a critical component of infrastructure that would allow each institution to make informed decisions on a series of matters (Digital Library Federation 2001).
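The core fields such a registry record might carry can be sketched concretely. The field names and values below are hypothetical, drawn only from the categories named above (content, technical specifications, disposition of originals, cataloging preferences); they do not describe any actual registry's schema.

```python
# Hypothetical sketch of a digitization-registry record. Field names and
# the sample values (resolution, formats, etc.) are illustrative, not the
# schema of any real registry.
from dataclasses import dataclass, asdict

@dataclass
class RegistryRecord:
    institution: str               # who digitized the collection
    collection_title: str          # what was digitized
    technical_specs: str           # e.g., resolution, bit depth, file formats
    disposition_of_originals: str  # retained, remote storage, or discarded
    cataloging_preference: str     # e.g., MARC, EAD finding aid, Dublin Core

record = RegistryRecord(
    institution="Johns Hopkins University",
    collection_title="Levy Sheet Music Collection",
    technical_specs="600 dpi, 24-bit TIFF masters",  # illustrative values
    disposition_of_originals="retained on site",
    cataloging_preference="item-level Dublin Core",
)
```

With records like this published and searchable, a library weighing a sheet music project could discover overlap with an existing effort, and its technical choices, before committing scarce scanning funds.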
Some library staff worry that institutional concerns, such as fund raising, public relations, and special projects, divert too many resources from more academically defensible projects or from the core mission of the library. When asked to take on projects conceived to advance the home institution’s mission, rather than the research and teaching mission of the library per se, most library administrators seem willing to accept this “good citizen” role, and some use it to the library’s distinct advantage. Even a vanity project, if managed properly, will bring money into the library for digitization and provide the kind of training and hands-on experience that is necessary to develop digital library infrastructure and expertise. The key to building on such a project is to be sure that all the library’s costs, not only scanning but also creating metadata, migrating files, and so forth, are covered. Such projects, done willingly and well, usually enhance the status of the library within the community and seldom do long-term harm. Such work becomes a problem only when libraries deliberately seek funding for things that are not core to their mission or when staff and management are diverted to support low-priority projects. Looked at from this perspective, outreach can properly be considered part of the library’s mission.
Academic libraries such as those at Virginia and Cornell, which are funded by a mix of private and public monies, are liable to face pressures to serve not only research and teaching needs but also state and regional interests. These need not be mutually exclusive, and even Harvard University has demonstrated its good citizenship by contributing items of interest to Cantabrigians on its publicly accessible Web site. The key lies in achieving a balance and, if possible, a synergy between the two.
For public institutions, digital programs offer a new and unique way to serve collections to taxpayers. For example, online distribution is the only way LC can provide access to its holdings in all congressional districts. For the primary funders and governors, that is, members of the U.S. Congress, who have built and sustained the library on behalf of their constituents, this rationale is compelling. Similarly, while the NYPL is not fully supported by public funds, its choices are nonetheless strongly influenced by the priorities of state and local governments.
Academic libraries with dual funding streams, private and public, are most vulnerable when the state in which they are located expresses some expectation that the university library will mount materials from its collections that are aimed at the kindergarten-through-twelfth-grade user group. There is much talk of how access to primary source materials held in research institutions can transform K-12 education. This hypothesis needs to be tested. The fact that much grant funding is tied to K-12 interests has made it necessary for universities to try to shape research-level materials into a K-12 mold to secure funding, or at least to present a research collection as one that is also suited for younger audiences.
There is no doubt that public institutions are seen as holding a promise to improve the quality of civic life if they provide greater access to their holdings. The fact that the NYPL and LC have been so successful in securing funds from private citizens is a clear indication of the public esteem that these institutions enjoy and the desire to “get the treasures out.” This level of philanthropy would be unthinkable in any other country. While these libraries may be accused of pandering to donors on occasion, or of not paying enough attention to the academic community by first digitizing materials that are not in demand primarily by scholars, the fact is that they are public libraries. Unlike the libraries in state universities, they are not designed to serve exclusively, or even primarily, the scholarly community. This obligation to serve the public, however, does not skew selection for digitization as drastically as some assert. Donors may express an interest in a particular type of material, but in the end, they choose from a set of candidate collections that have been proposed by curatorial divisions and vetted by preservation and digital library staff for technical fitness. In terms of both process and result, these candidate collections differ little from their private academic counterparts.
In both public and private libraries, some curators who are active in special collection development advocate for digitization because they see it as a way to induce further donations. For them, the promise of access is a useful collection-development tool because digital access advertises what the library collects and demonstrates a commitment to access.
Much less has been written about how to plan for the access to and preservation of digitally reformatted collections over time than about how to select materials for digitization. This is partly because we know nothing certain about maintaining digital assets over the long haul. We have learned a great deal as a result of failed or deeply flawed efforts-those of the “We’ll never do that again!” variety told of some projects to reformat information onto CDs, for example-but such lessons tend to be only informally communicated. Exceptions include the University of Michigan, where a library with a clear view of the role digitization plays (collection management and preservation) has developed and published preservation policies that support those goals. The CDL is also an exception, perhaps because, as a central repository, establishing standards and best practices to which its contributors must adhere is paramount to building confidence as well as collections. Harvard University has published information about its plans for a digital repository, and Cornell has adopted policies for “perpetual care” of digital assets. The Library of Congress has also put online much about its planned audiovisual repository in Culpeper, and has announced a plan to develop a national digital preservation infrastructure (LC 2001). General information about the preservation of digital files can be found on the Web sites of the Cedars Project, CLIR, and Preserving Access to Digital Information (PADI).
Nearly every library declares its intention to preserve the digital surrogates that it creates. The Library of Congress has also pledged to preserve those surrogates created by other libraries under the auspices of the National Digital Library Program (Arms 2000). In reality, however, many libraries have created digital surrogates for access purposes and have no strategic interest in maintaining them as carefully as they would have if they had created those files to serve as replacements. Nonetheless, at this point libraries are uncomfortable admitting that they have a limited commitment to many of their surrogates, should push come to shove. Those who are creating surrogates for access purposes alone still declare an interest in maintaining those surrogates as long as they can because the original investment in the creation of digital files has created something of enormous value to their patrons. Moreover, the cost of having to re-create those surrogates and the physical stress it might impose on the source materials argue for maintaining those files as long as possible.
The mechanism for long-term management of digital surrogates is theoretically no different from that for management of born-digital assets. While refreshment and migration of digital collections have occurred in many libraries, the protocols and policies for preservation are clearly still under development. Many libraries have been sensitized to the fact that loss can be simple and catastrophic, beginning with the wrong choice of (proprietary) hardware, software, or medium on which to encode information and ending with negligent management of metadata. The Y2K threat that libraries faced in 1999 has led to systemic improvements in many cases. Not only did institutions become aware of how deleterious it is to allow different software to proliferate, but they also developed disaster-preparedness plans. Many received funds for infrastructure upgrades that might have been awarded much later, or not at all, were it not for the sense of urgency that the coming of Y2K provided.
Libraries are anticipating the day when they must develop strategies for handling digital objects that faculty are creating without the involvement of the library. These are the often-elaborate constructions done by individual scholars or groups of collaborators that the library hears about only after the critical choices of hardware, software, and metadata have been made, often by people wholly unaware of the problems of long-term access to digital media. An increasing number of library managers have expressed concern about the digitized materials created by faculty that are “more than a Web site” yet less, often far less, than what the library would choose to accession and preserve. While libraries acknowledge that this is a growing problem, none has been forced to do much about it yet, and thoughts about how to deal with faculty projects are just now evolving. Predictably, those that are collection-driven in approach are working to build a system for selecting what the library wishes to accession to its permanent collections. Cornell is developing criteria that individuals must meet if they expect the library to provide “perpetual care” (Cornell 2001). CDL already has such guidelines in place. Michigan has a well-articulated preservation policy, one that is detailed enough to support the university’s vision of digital reformatting as a reliable long-term solution to the brittle book problem.
3.7. Support of Users
There is little understanding of how research library patrons use what has been created for them. Most libraries recognize that the collections they now offer online require a different type of user support than that which they have traditionally given visitors to their reading rooms. In many cases, user support has been developed for “digital collections” or “digital resources,” terms that almost invariably denote born-digital (licensed) materials. The Library of Congress, which targets a K-12 audience, has three reference librarians for its National Digital Library Program Learning Center. As a rule, however, libraries have not been reallocating staff to deal specifically with digitized collections. Hit rates and analyses of Web transactions have yielded a great amount of quantitative data about access to digital surrogates, and those data have been mined for a number of internal purposes, from “demonstrating” how popular sites are to making general statements about how users are dialing in and from where. Qualitative analysis is harder to derive from these raw data, and there have been few in-depth analyses of how patrons are reacting to the added functionality and convenience of materials now online. Libraries have been keeping careful track of gate counts, for example, but when these counts go up or go down, what conclusion can one draw about the effect of online resources on use of on-site resources?
One exception to this rule is the journal archiving service, JSTOR, which rigorously tracks the use of its resources. JSTOR analyzes its users’ behavior because it needs to recover costs and hence must stay closely attuned to demand, within the constraints of copyright issues. Looking ahead, close analysis of how researchers use specific online resources, especially how they do or do not contribute to the productivity of faculty and students, will be a prime interest of libraries. Such analysis must now be extended to the free Web resources mounted by libraries.
Most libraries report having classes and other instructional options available for students and faculty. Some librarians report that instruction is not really necessary for undergraduates, who are quite used to looking first online, but that general orientations to library collections are needed more than ever.
Good Web site design, that is, creating sites that are easily found, easily navigated, and readily comprehensible, is an often overlooked aspect of access. Within the first-generation libraries, there is an astonishing variety in the quality of Web site design. Even for fairly sophisticated users, finding a library’s digital collections can involve going through a half-dozen screens. Having a professional design team that keeps a site up-to-date and constantly reviews it for improvements does cost money, but considering that the Web site is the front door to the collections, it would seem penny wise and pound foolish to ignore design and marketing.