1.1 One Culture
I have, of course, intimate friends among both scientists and writers. It was simply through living among these groups, and much more, I think, through moving regularly from one to the other and back again that I got occupied with the problem of what … I christened to myself as the “two cultures.”
C. P. Snow thus differentiated two distinct intellectual communities (what we would today call humanists and scientists) that had lost the ability to communicate across their disciplinary boundaries and, for all the similarities in intellect, background, and social standing, lived and worked in worlds that could not be bridged. Encounters between these two societies were often hostile and dismissive.
Interestingly, one of the topics that Snow chose to highlight in his description of the divergent worldviews of the “two cultures” was the Industrial Revolution. Snow claimed that the revolution was largely a product of science and engineering and that writers and humanists had largely ignored it. The crux of this observation is an assumption that science embraces technology and the humanities do not, or do so more slowly and reluctantly.2
Such an assumption is no longer valid, as this report shows. The Digging into Data projects are successful collaborations, and through their success give evidence of shared intellectual values, rigorous methodological approaches, and common ground across scientific and humanistic disciplines. Researchers from these disciplines rely deeply upon one another for insight and discovery when confronted with very large-scale, complex challenges.
Nevertheless, these researchers still work in environments that, at least implicitly, admit the residual truth of Snow’s argument. Their academic departments, scholarly and professional societies, colleges, and universities are set apart, clustered, and structured according to the traditional “two-culture” perspective Snow describes. The grant programs and funding agencies that have traditionally supported their work are similarly focused. The Digging into Data program has challenged this bifurcation by insisting on collaboration across disciplines and by funding the projects through an amalgam of sources that cross these borders.
While it is not the intent of this report to dredge up again the merits and failings of Snow’s now famous declamations, the projects it describes suggest a very different academic landscape supporting one culture in the pursuit of knowledge. The eight teams of researchers have built collaborations that are neither contrived nor strained. In assessing the project teams’ work, we have come to understand that the one culture of e-research, encompassing what have been called the e-sciences as well as the digital humanities, involves not a choice between the scientific and humanistic visions of the world, but an imperative that people and organizations fully embrace both. In these projects, highly organized teamwork, such as might characterize a scientific laboratory, is as significant as more free-form contemplation. It is in working together and apart3 that we will see digital scholarship flourish.
1.2 Background: “What Do You Do with a Million Books?”
In 2006, Tufts University professor Gregory Crane posed the “million book” question in an article4 exploring the potential for doing large-scale investigations of text corpora. Crane identified several problems facing computationally intensive research with texts, including insufficient funding for digital text repositories, variable quality and granularity of repository content, inaccuracies arising from errors made in optical character recognition (OCR) and metadata generation, research plans that are too narrowly defined to appeal to a broad audience, and access restrictions imposed by pay walls and copyright laws.
The launch of the Digging into Data Challenge in 2009 was in part a response to the “million book” prospectus. Very large data sets are susceptible to scholarly inquiry but are dependent on computational tools and equipment for execution and analysis. What are the intellectual benefits, and what are the risks? How does this new research align within the traditional context of scholarship and how might it be distinct? Mass-digitization projects such as Google Books, which had by then prompted widespread excitement and speculation about its use for research,5 and HathiTrust, a not-for-profit library-based alternative, made the challenge more intuitively feasible. “Reading” large text corpora by machine, encompassing an amount of information exponentially greater than would be possible for any individual to take in and process in a lifetime, was then, as now, a subject at once intriguing, daunting, and unsettling.
Under the leadership of Brett Bobley, chief information officer and director of the Office of Digital Humanities, National Endowment for the Humanities (NEH-ODH), the Digging into Data Challenge was framed broadly, to encompass any type of digital or digitized content used by researchers in the social sciences and humanities. In discussions before the program’s launch, leading researchers and other funders had stressed that establishing reliable methodologies for analyzing large quantities of non-text digital content, including audio, image, and audiovisual data, was as important as learning to machine-read large bodies of texts. These advisers proposed that the Challenge be supported by a group of funders rather than adopted as the responsibility of a single agency. The United States National Science Foundation (NSF), the Joint Information Systems Committee in the United Kingdom (JISC), and the Canadian Social Sciences and Humanities Research Council (SSHRC) joined the NEH in preparations to coordinate grant calendars, guidelines, and a review process for the new program. By requiring international collaboration, the four agencies hoped to fund projects that would have high visibility and broad appeal; by actively recruiting the managers of significant data repositories to signal support for the program through making their holdings accessible, the agencies hoped to encourage openness. Eight proposals were funded for the first round; they are the focus of this report.
| Date | Event |
|---|---|
| November 2007 | The National Endowment for the Humanities Office of Digital Humanities (NEH-ODH) begins exploring the idea of a new funding program focused upon computationally intensive humanities research |
| May 8, 2008 | NEH convenes the “Million Book Challenge” planning meeting with scholars and other funding agencies |
| January 16, 2009 | Four cooperating funders announce first Digging into Data Challenge6 |
| September 10–11, 2009 | Joint review panels determine first award recipients |
| December 3, 2009 | First awards announced7 |
| March 16, 2011 | Eight cooperating funders announce second Digging into Data Challenge8 |
| June 9–10, 2011 | Digging into Data Challenge conference held9 |
| December 2011 | Second round of awards announced10 |
Table 1. Digging into Data Chronology
1.3 The Context of this Study
At its inception, this study posed two fundamental questions to the eight research teams:
- Why do you as a scholar need a computer to do your work?
- What kinds of new research can be done when computer algorithms are applied to large data corpora?
The questions imply a distinction between “new” computer-based and “traditional” non-computer-based research in the humanities and social sciences. Early on, that distinction became problematic. While it was natural and perhaps necessary to pose the old and new in opposition to one another to better understand the changing landscape of scholarship and the transformative potential of new technologies, there was never a clear separation between past and present, traditional and digital, or other bounded concepts, and such oppositions very quickly felt artificial and unhelpful. Many of the researchers interviewed for this study assiduously avoided making such distinctions.
The framing questions thus quickly and unintentionally exposed an important aspect of collaborative, computationally intensive research initiatives. The eight projects that are the subject of this report reflect more complex, iterative interactions between human- and machine-mediated methods than are implied by our second question. Rather than being a combination of fixed, clearly defined entities (the researcher’s question, the algorithm, and the corpus), the projects are structures built with continually moving parts. Certain research questions require major investments of human labor in amending corpora; others require intense testing and reworking of algorithms to adapt to new and varied data. It is the combination of algorithmic analysis and human curation of data that helps humanists and social scientists refine their existing questions and articulate new ones. Furthermore, many of these projects show collaborators making significant advances in the field of computer science as well as within the relevant subject domain. Conducting research “at scale,” especially across the unstructured and heterogeneous data upon which humanists depend, can inspire new and more nuanced applications of computer tools, which can in turn lead to new questions.
1.4 The Eight Projects
The web-based version of this report includes individual case studies that describe key findings as well as some of the challenges each project team encountered. This printed report describes the cases in aggregate, extrapolating their commonly shared characteristics. Table 2 notes the represented disciplines, numbers of researchers, data types, methodologies, and tools used for each project.
Table 2. Digging into Data Projects
2 Snow, C. P. The Two Cultures (Cambridge: Cambridge University Press, 1998), p. 2. Based upon a talk given by Snow at Cambridge University on May 7, 1959, first published in the same year by Cambridge University Press. Cited in Patricia Waugh, Review of The Two Cultures Controversy: Science, Literature and Cultural Politics in Postwar Britain, by Guy Ortolano (Cambridge: Cambridge University Press, 2009).
3 A 2009 CLIR report titled Working Together or Apart: Promoting the Next Generation of Digital Scholarship, which was the outcome of a symposium planned by former CLIR Director of Programs Amy Friedlander, provided the foundation for the study upon which this report is based. The insights of its authors, who write from specific disciplinary perspectives, resonate well with the findings here.
4 Crane, Gregory. “What Do You Do With a Million Books?” D-Lib Magazine 12.3 (March 2006). Available at http://www.dlib.org/dlib/march06/crane/03crane.html.
5 This speculation was in addition to the class action lawsuit filed by the Authors Guild and the Association of American Publishers, still in litigation after a proposed settlement agreement was rejected by the New York Southern District Court in March 2011.
20 See Robertson, Bruce. “Optical Character Recognition of 19th-Century Polytonic Greek Texts: Results of a Preliminary Survey.” Perseus Digital Library (Jan. 19, 2012). Available at http://www.perseus.tufts.edu/publications/dve/RobertsonGreekOCR/