2. Identification, Evaluation, and Selection • CLIR

2.1. Policies, Guidelines, and Best Practices

A great deal has been written on the subject of selection for digitization and on the management of conversion projects. Much of this literature is published on the Web and has become de facto “best practice,” to the extent that many institutions applying for digitization grants use it to plan their projects and develop selection criteria. In addition to these guidelines, there are a number of reports about selection for digitization that range from project management handbooks and technical guides to imaging, to broad, nontechnical articles aimed at those outside the library community who fund such programs.²

Very few libraries have developed their own formal written policies for conversion criteria. Those that do have such documents tend to refer to them as “guidelines.” These documents tend to focus on technical aspects of selection and, even more, on project planning. When asked why they do not have a policy, most institutions reply that it is too early to formulate policies, that they have not gotten around to formulating them, or that the institution does not have written collection-development policies for other materials so it is unlikely to write them for digitized collection development.

These documents almost always focus on the planning of digital projects or of various elements of a larger program, rather than on the rationale for digitization. The University of Michigan, for example, has a written policy that clearly aims to fit digitization into the context of traditional collection development. It states that “core questions” underlying digitization should be familiar to any research library collection specialist (University of Michigan 1999). These questions are as follows:

Is the content original and of substantial intellectual quality?
Is it useful in the short and/or long term for research and instruction?
Does it match campus programmatic priorities and library collecting interests?
Is the cost in line with the anticipated value?
Does the format match the research styles of anticipated users?
Does it advance the development of a meaningful organic collection?

These are fundamental collection-development criteria that assert the importance of the research value of source materials over technical considerations; however, they are quite general. The rest of Michigan’s policy focuses not on how to select items for conversion but on how anticipated use of the digital surrogates should affect decisions about technical aspects of the conversion, markup, and presentation online.

Harvard’s selection criteria offer far more detailed considerations than do those of Michigan (Hazen, Horrell, and Merrill-Oldham 1998). In common with the Michigan criteria, the Harvard criteria focus largely on questions that come after the larger, “why-bother-to-digitize-this-rather-than-that” issues have already been answered. Creation of digital surrogates for preservation purposes is cited as one legitimate reason for selection, as are a number of considerations aimed not at preservation but solely at increasing access. (Sometimes digitization does both at once, as in the case of rare books or manuscripts.)

The Harvard guidelines have been useful to many beyond Harvard who are engaged in planning conversion projects because they present a matrix of decisions that face selectors and are available on the Web (Brancolini 2000). The authors begin with the issue of copyright: whether or not the library has the right to reformat items and distribute them in limited or unlimited forms. They then ask a series of questions derived from essentially two points of departure:

Source material: Does it have sufficient intellectual value to warrant the costs? Can it withstand the scanning process? Would digitization be likely to increase its use? Would the potential to link to other digitized sources create a deeper intellectual resource? Would the materials be easier to use?

Audience: Who is the potential audience? How are they likely to use the surrogates? What metadata should be created to enhance use?

The answers to these and similar questions should guide nearly all the technical questions related to scanning technique, navigational tools and networking potential, preservation strategy, and user support.

The primary nontechnical criterion, research value, is a subjective one and relies on many contingencies for interpretation. What does it mean to say that something has intrinsic research value? Do research libraries collect any items that do not have such value? Should we give priority to items that have research value today or to those that may have it tomorrow? What relationship does current demand have to intrinsic value? Because the answers to these questions are subjective, the only things excluded under these selection criteria are items that are difficult to scan (for example, oversized maps) or things that are very boring or out of intellectual fashion. Interestingly, foreign-language materials are nearly always excluded from consideration, even if they are of high research value, because of the limitations of optical character recognition (OCR) software and because they often have a limited number of users. There are digital projects that have converted valuable historical sources, from Egyptian papyri to medieval manuscripts, into image files (such as the Advanced Papyrological Information System [APIS] and the Digital Scriptorium). In general, however, the conversion of non-English language sources into searchable text continues to be rare.

This high-level criterion of research value is also an intrinsic part of traditional collection-development policies. The difference is that in most libraries, the acquisition of monographs, to take an example, fits into a longstanding activity that has been well defined by prior practice. This practice governs the acquisition of new materials, that is, those that the library does not already hold. (The issue of how many copies is secondary to the decision to acquire the title.) Selection of an item for digitization is reselection, and the criteria for its digitization, or repurposing, will be different from those for its acquisition. The meaning of research value will also differ, because the methods of research used for digital materials differ from those used for analog, and the types of materials that are mined, and how they are mined, are also fundamentally different. Several large digitization programs today are grounded in the belief that it is the nature of research itself that is “repurposed” by this technology, and it is often surprising to see which source material yields the greatest return when digitized.

As one librarian said, the guidelines addressing selection that are used routinely, whether official or not, are by and large “project oriented.” It would be a mistake to confuse what libraries are doing now with what libraries should and would do if “we understood what higher purpose digitization serves.” While guidelines for technical matters such as image capture and legal rights management are extremely useful and should be codified, formal collection-development policies are still a long way off.

2.2. Rationales for Digitization

Libraries usually identify two reasons for digitization: to preserve analog collections, and to extend the reach of those collections. Most individual projects and full-scale programs serve a mix of both purposes. As librarians have learned from tackling the brittle book problem through deacidification and reformatting, it is difficult and often pointless to separate preservation and access. When a library is seeking outside funding for digital conversion (apparently still the primary source of funding for many libraries), it tends to cite as many benefits of conversion as possible. For this reason, preservation and access are usually mentioned in the same breath. Nonetheless, because it has been generally conceded that digital conversion is not as reliable for preservation purposes as is microfilm reformatting, it is worthwhile to consider what institutions are doing in terms of preservation per se.

2.2.1. Preservation

2.2.1.1. Surrogates

The use of scans made of rare, fragile, and unique materials, from prints and photographs to recorded sound and moving image, is universally acclaimed as an effective tool of preventive preservation. For materials that cannot withstand frequent handling or, because of their value or content, pose security risks, digitization has proved to be a boon.

2.2.1.2. Replacements

For paper-based items, librarians generally agree that digital scans are the preferred type of preservation surrogates. They are widely embraced by scholars and are preferred over microfilm. However, most librarians also assert that scanning in lieu of filming does not serve preservation purposes, because the expectation that we can migrate those scans into the future is simply not as great as is our conviction that we can manage preservation microfilm over decades. There is a general hope that the problem of digital longevity will soon be resolved. In anticipation of that day, most libraries are creating “preservation-quality” digital masters: scans that are rich enough to use for several different purposes and are created to obviate the need to rescan the original. These masters are sometimes created together with preservation master microfilm.

Only one institution, the University of Michigan, has a policy to scan brittle books and use the scans as replacements rather than as surrogates. The university has created a policy for the selection and treatment of these books, and it explicitly talks of digital replacements as a crucial strategy for collection management (University of Michigan 1999). This policy is based on the premise that books printed on acid paper have a limited life span and that, for those items with insignificant artifactual value, the library is not only rescuing the imperiled information but also making it more accessible by scanning in lieu of filming. (The preservation staff continues to microfilm items identified by selectors for filming as well as to deacidify volumes that are at risk but not yet embrittled.) The focus of Michigan’s digital program is the printed record, not special collections such as rare books and photographs, and digitization has been made a key collection-management tool for these holdings. Cornell also has incorporated digitization into collection management (that is, it is not for access alone), although its efforts are not as systematic as are those of Michigan. At Cornell there is a preference for digital replacements of brittle materials with backup to computer output microfilm (COM) or replacement hard copies made from digital scans. The Library of Congress has also begun implementing preservation strategies based on digitization in its project to digitize the nineteenth-century journal, Garden and Forest. Most libraries, though, elide the issue of digitally replacing brittle materials because they scan chiefly items from special collections.

For audiovisual materials, digital replacements appear to be inevitable, although standards for archival-quality re-recording have yet to be established. Because the recording media used for sound and moving image demand regular, frequent, and ultimately destructive reformatting, migrating onto digital media for preservation, as well as access, is acknowledged to be the only course to pursue for long-term maintenance. The University of California at Los Angeles (UCLA) and LC, both deeply engaged in audiovisual preservation, intend to digitize analog materials to provide long-term access. This does not mean that these institutions will dispose of the original analog source materials, only that the preservation strategy for these items will not be based on routine use of that analog source material.

2.2.2. Access

In nearly all research libraries, digitization is viewed as service of collections in another guise, one that provides enhanced functionality, convenience, some measure of preservation, aggregation of collections that are physically dispersed, and greatly expanded reach. Among all the strands of digitization activities at major research institutions, there are essentially three models of collection development based on access: one that serves as outreach to various communities; one that is designed to build collections; and one that is driven by a specific need, such as demand by a user or preservation surrogates, or that is part of a larger effort to develop core infrastructure. All libraries engage in the first kind of access to one degree or another. Significant strategic differences are evident, however, in their approaches to the choice between mounting large bodies of materials in the expectation of use versus collaborating with identified users to facilitate their data creation.

2.2.2.1. Access for Outreach and Community Goals

There will continue to be times when academic libraries create digital surrogates of their analog holdings for reasons that are important to the home institution yet not directly related to teaching and research. Libraries will continue to be parts of larger communities that look to them for purposes that transcend the educational mission of the library per se. As custodians of invaluable institutional intellectual and cultural assets, libraries will always play crucial roles in fund raising, cultivating alumni allegiance, and public relations.

Occasions for selective digitization projects include exhibitions, anniversaries (when archives or annual reports often get into the queue), a funding appeal (digitization as a condition of donation), and efforts to build institutional identity. Careful consideration needs to be given to what goes online for whatever purpose because, once a collection is online, it becomes part of the institutional identity. Image building is a critical and often undervalued part of ensuring the survival of the library and its host institution. As custodians of the intellectual and cultural treasures of a university, libraries have an obligation to share that public good to the advantage of the institution.

2.2.2.2. Building Collections

The most common approach to digitization hinges on a collection-driven selection process in which a library decides to scan a set of materials identified by staff as having great research potential online. The terms “collection driven” and “use driven” are familiar from the preservation microfilming projects of the 1980s and 1990s. At that time, brittle books were conceptualized as a “national collection” that was held by a group of libraries, not just one. Each library could help preserve the national collection by filming a set of its holdings that were particularly strong, thereby avoiding duplication of effort by registering its filming activities and enhancing access to the endangered materials through the loan or duplication of microfilm. Librarians would select books not on the basis of their documented use (by checking circulation statistics or going with items that cross the circulation desk), but on the basis of whether they constituted a coherent set of monographs (and occasionally serials) arranged by subject matter or date of publication, or both.

This method, so well-known to preservation experts and subject specialists from the many grant-supported microfilming projects of the past two decades, has been transferred to the digital realm with one interesting twist: It has been extended primarily not to the general collections (monographs and serials) that are the heart of microfilming projects, but to special collections: materials that are rare or archival in nature or that are in nonprint format. This means that a library will scan items that exist as a defined collection, either by format (incunabula, daguerreotypes) or by genre of literature (antislavery pamphlets, travel literature gathered as a discrete group and held as such in the rare book department, photographs of the Reconstruction-era South given to the library by a donor and known under the donor’s name, Sanborn fire insurance maps, and so forth).

Within each group, libraries may attempt to be as comprehensive as possible in putting items from the collection online to simulate the comprehensive or coherent nature of the source collection. Examples of such collection-driven digital collections are the Making of America (a subject and time period), Saganet (a set of special collection items held by certain repositories that relate to Icelandic sagas), the Sam Nunn Papers project at Emory University, the Hoagy Carmichael site at Indiana University, and the Scenery Collection at the University of Minnesota.

In a survey of selection strategies at 25 research libraries, Paula de Stefano found that “the most popular approach to selecting collections for digital conversion is a subject-and-date parameter approach applied, by and large, to special collections, with little regard for use, faculty recommendations, scholarly input, editorial boards, or curriculum” (de Stefano 2001, 67). A recent analysis of 99 research libraries and their special collections done by the Association of Research Libraries (ARL) revealed that virtually all have been digitizing some of their special collections. The list of digitized collections submitted to ARL during this survey reveals just how eclectic these scanning projects are, a result fully consonant with de Stefano’s findings about selection policies (Panitch 2001, 99, 116-123). This approach has often slyly been referred to as the “field of dreams” method of collection development (“build it and they will come”), implying a certain naïve hopefulness on the part of the selectors but also hinting at the elements of surprise and serendipity we see in the digital realm.

2.2.2.3. Meeting Use and Infrastructure Needs

Some libraries have decided that they will digitize collections only in response to explicit user-driven needs. As a state-supported institution, the University of Virginia (UVA) has developed access projects that serve state and regional needs. These projects are based primarily on UVA’s special collections holdings and are similar in scope and purpose to the type described above. UVA also has several digital conversion initiatives that are explicitly user driven, and these programs exist both in the library and elsewhere on campus. At the Institute for Advanced Technology in the Humanities (IATH), an academic center located in the library but administratively separate, scholars develop deep and deeply interpreted and edited digital objects that are, by any other name, publications. Examples include projects on the writers William Blake, Dante Gabriel Rossetti, and Mark Twain, as well as the Valley of the Shadow Civil War site. Within the library is the Electronic Text Center, where staff members choose to encode humanities texts that they put up without the interpretive apparatus of the IATH objects. These are analogous to traditional library materials that are made available for others to interpret; the difference is that encoded text is far more complicated a creature than is the OCR text that other libraries are creating. The Electronic Text Center is not so much responding to faculty or student demand as it is being driven by a technology. Exploring the potential of various encoding schemes is part of its agenda.

Under its Digital Library Initiative and with funds provided by the university, Harvard University libraries are concentrating on building an infrastructure to support born-digital materials first and foremost, rather than on building collections of digital surrogates of existing collections. Where the libraries have converted items, the criteria for selection have to do with user needs, not general collection building. And while the holdings of the more than 100 repositories in the university certainly comprise a rich collection of cultural heritage, Harvard will attempt to serve the Harvard community, not the larger community (Flecker 2000). “While in many instances the digital conversion of retrospective materials already in the University’s collections can increase accessibility and add functionality and value to existing scholarly resources, it is strategically much more important that the library begin to deal with the increasing flood of materials created and delivered solely in digital format.” Although $5 million of the $12 million allocated by the university is for content development, so far the majority of content development comprises conversion-for-access purposes. Slated for review are the collections that have been mounted so far. “One specific issue being discussed is the randomness of the areas covered by the content projects. Since these depend upon the initiative of individuals, it is no surprise that the inventory of projects undertaken is spotty, and that there are notable gaps . . . . It is also possible that specific projects will be commissioned to address strategic topics” (Flecker 2000). However, the gaps Flecker refers to are not gaps in content per se (specific subjects that would complement one another) but gaps in content that demands different types of digital format, for example, encoded text, video, or sound recordings. This is a technical criterion, of course, independent of collection development, and is fully concordant with the purposes that Flecker says the initiative is to serve.

At New York University (NYU), the focus is on the user as part of a plan that allocates relatively modest resources for digitization. New York University presents collections of cultural significance through online exhibitions and other modes of Web outreach rather than engaging in full collection conversion. This library has decided to concentrate on working with faculty and graduate students to develop digital objects designed to enhance teaching and research through its Studio for Digital Projects and Research and its Faculty Technology Center. Like Harvard, NYU is giving priority to the development of an infrastructure to deal with born-digital materials and, in an institution with extensive programs and collections in the arts and performance studies, to multimedia archives converted to digital form for presentation. New York University plans to give grants to faculty members to develop teaching and research tools. However, the library staff is now putting much of its effort toward preparing for the time, seen to be imminent, when the demands of born-digital materials will obviate any initiative to create large collections of digital surrogates.

The Cornell libraries have tried both collection- and user-driven approaches to selection. In several instances, staff members have begun with expressed interests of faculty, say for teaching, and have developed digital collections based on those interests. In each case, however, library staffs have expanded their brief and have augmented faculty choices with related materials. A faculty member’s interests are usually fairly circumscribed, and librarians select a good deal of additional materials on a topic, such as Renaissance art, to add depth to a selection. As a result, a selection of materials becomes a collection and has a wider scope of content. Research librarians are used to thinking of collections as being useful to the extent that they offer comprehensiveness or depth. Scholars, on the other hand, take comprehensiveness for granted and concentrate on making choices and discriminations among collection items in order to build a case for an interpretation. These two views of collections are complementary; however, when it comes to selection for digitization, they create the most difficult choices facing libraries in digitization programs. Selection is an “either/or” proposition. It seldom tolerates “both/and” solutions. Those historians who are working on Gutenberg-e projects sponsored by the American Historical Association are beginning to encounter the limitations that librarians live with every day. When faced with the opportunity not only to write their text for electronic distribution but also to present their sources through digital surrogates, the historians find themselves facing dilemmas familiar to digital collection builders everywhere. How much of the source material is enough to represent the base from which an argument was built? How can one select materials that give a sense of the scope of the original from which the scholar made his or her choices? And why is digitization of even a few core files so expensive?

Many of the scholar-designed projects may be coherent digital objects in themselves, but they would fail the librarian’s test of comprehensiveness as a collection. Indeed, one could say that the value added by the scholar lies precisely in this selectivity. Some of those projects, most notably the Valley of the Shadow Civil War site, attempt to bring together materials that complement and enrich each other but do not try to comprehend the great universe of materials that could be considered complementary. These new digital collections are somewhat analogous to published anthologies of primary sources, carefully selected by an individual or an editorial team to serve heuristic purposes or to provide supporting evidence for an interpretation. Other projects driven by scholar selection, such as the “Fantastic” collection of witchcraft and French Revolutionary source materials at Cornell or the Women Writers’ Resources Project at Emory University, do not claim to be comprehensive, but serve as pointers to the collection by presenting a representative sampling of it. Yet others, such as the Blake Archive or APIS, serve primarily to collocate items to form a new virtual collection that then serves as a new paradigm of critical edition.

General Collections. To date, very few libraries have digitized significant series of books and periodicals, whereas, as the ARL special collections survey shows, a great many libraries are digitizing their special collections. Several reasons for this selection strategy are commonly given, and others can be inferred.

There has been a preference to digitize visual resources over textual sources, in part because they work so well online and in part because visual resources do not require the additional expense of OCR or text encoding that adds value to textual materials. (Creating metadata for visual resources that are not well indexed, however, often ends up being more expensive.) Printed sources do not require additional features, of course, but simple page images of non-rare texts do not provide the enhanced access that most researchers want from digital text. Nearly all selection criteria call for a specific additional functionality, such as browsing and searching, from text conversions.

A number of commercial interests are working with publishers or libraries to provide digitized versions of texts that have a potential market; one example is Early English Books Online (EEBO). For the core retrospective scholarly literature that is in high demand, there are commercial and nonprofit providers, such as Questia and JSTOR, ready to run the copyright gauntlet that libraries are ill equipped to handle efficiently. ArtSTOR, a production-scale digitization program initiated in the summer of 2001 and modeled on JSTOR, will develop a database of art and architecture images based on analysis of curricular needs for higher education. It is the copyright issue associated with these sources that has largely precluded individual institutions from digitizing their slide libraries and other visual resources and contributing them to a database that would build contextual mass.

In addition to these considerations, there is a sense that a library can help build institutional identity by digitizing materials that are unpublished or not commonly held. This can be important in encouraging alumni loyalty or in recruiting students. This assumption (that special collections build institutional identity and general collections do not) is actually challenged by the success of the Making of America (MOA) projects at Michigan and Cornell and by the texts encoded at UVA. These institutions have achieved considerable renown for their collections of monographs and periodicals. Yet it is reasonable to assume that such massive digitization programs are not easily replicated by institutions with smaller digital infrastructure. For those institutions, special collections allow a smaller-scale approach to developing a Web presence.

But, as the MOA projects highlight, two of the challenges faced by libraries mounting print publications are how much is too little and how much is perhaps too much. The sense that textual items need to exist in a significant or critical mass online stems in part from the fact that books and magazines do not have quite the same cultural frisson as do Jefferson holographs or Brady daguerreotypes. Few libraries are in the position to mount the kind of large-scale digitization projects that can result in a critical mass of text online, and for reasons to be discussed below, they do not enter into the kind of cooperative arrangement that is at the heart of MOA.

Special Collections. In 1995, when several academic libraries were working together to mount the text-based Americana in the Making of America project, LC inaugurated a digitization program, American Memory, based on its Americana special collections. The program was ambitious (it targeted five million images in five years) and has been influential largely because of the extensive and easily adapted documentation that the library has mounted on its Web site and the well-publicized redistribution grants that it gave under its LC/Ameritech funding. The requirements for those grants were based on Library of Congress experience, and they have significantly influenced the requirements of other funding agencies, including the Institute for Museum and Library Services (IMLS). The only other library that has similar collecting policies and a similar governance and funding structure is the New York Public Library (NYPL), and the digital program it plans to implement over the next few years bears remarkable resemblance to that of LC: both have an ambitious time frame, focus on special collections, and intend to make access to the general public as high a priority as service to scholars. They share, in other words, the same strategic view of digitization, one that is well in line with the realities of their roles as public institutions and with their audiences, collection strengths, and governance structures.

Many libraries that are not similar to LC or NYPL have also used this strategy. In the early stages of its digitization program, Indiana University reports that it used the LC/Ameritech Competition proposal outline to assess the merits of collections for digitization. This led to a canvass of the university’s libraries for “their most significant collections, preferably ones in the public domain or with Indiana University-held copyrights.” Then, with (special collection) candidates in hand, the library examined them for what it identified as the basic criteria: “the copyright status of the collection; its size; its popularity; its use; its physical condition; the formats included in the collection . . . and the existence of electronic finding aids” (Brancolini 2000).

Libraries that depend on outside funding (the great majority of libraries digitizing collections) often assert that it is easier to raise funds if they propose to digitize special collections because they are more interesting and have greater appeal to the funding agencies. This hypothesis is untested. Yet MOA, which comprises general collections, has received major grant funds, so perhaps the hypothesis has indeed been tested and proven invalid. Nevertheless, this notion of the funding bodies’ predilection for special collections continues to be persuasive.

While academic libraries have many reasons for deciding to digitize special collections, the rationales of the two public institutions merit special consideration, in large part because they are so different from those of academic libraries. (Because of these differences, however, taking them as a model should be done with eyes wide open.) The NYPL and LC base their selection decisions on their understanding that they are not libraries within a specific academic community, with faculty and students to set priorities. Rather, they serve a broad and often faceless community: the public. Their goal is to make available things that both scholars and a broader audience will find interesting. They also endeavor to make their collections accessible to those with modem connections and low bandwidth, often limiting factors for the delivery of cartographic and audio materials, among others. Because their primary audience is not academic, they have no curricular or educational demands to meet. They can focus exclusively on their mission as cultural institutions. Moreover, as libraries that have rich cultural heritage collections held in the public trust, they feel obligated to make those unique, rare, or fragile materials that do not circulate available to patrons who are unable to come to their reading rooms. Their strategic goal is cultural enrichment of the public.

None of the research libraries with comparable deep collections claims cultural enrichment of the public as an explicit goal. And yet, as de Stefano points out, there are academic libraries that are mounting special collections of broad public appeal that is not matched to curricular needs (de Stefano 2001, 67). She cautions that, “It is only a matter of time until the question emerges as to how long the parent institutions will be satisfied with supporting the costly conversion of their library’s materials to improve access for narrowly defined audiences that may not even be their primary local constituents.” This strategy may be supportable as long as the overhead of serving a secondary audience beyond the campus is low. But anxiety about this issue was repeatedly expressed in the conversations held during research for this paper.

FOOTNOTE

²See Research Libraries Group 1996; Digital Library Federation and Research Libraries Group 2000; Sitts 2000; Smith 1999; Gertz 2000; de Stefano 2001; Kenney and Rieger 2000; Library of Congress National Digital Library Program 1997.