How many lifetimes? This question often arose when the authors of this report pondered the extraordinary scale and complexity of research conducted in the Digging into Data Challenge program. Analyzing and extrapolating patterns of meaning from tens of thousands of audio files; nearly 200,000 trial transcripts; millions of spoken words, recorded over many years; and hundreds of thousands of primary and secondary texts in ancient languages would, if undertaken using printed resources and analog materials, have required the lifetimes of generations of scholars. Because the resources in question were digital, the time of analysis and discovery was compressed into months, not decades. By choosing to work with very large quantities of digital data and to use the assistance of machines, the Digging into Data Challenge investigators have demarcated a new era, one with the promise of revelatory explorations of our cultural heritage that will lead us to new insights and knowledge, and to a more nuanced and expansive understanding of the human condition.
As articulated in section one, the Digging into Data projects are built on collaborations that are neither contrived nor strained. These collaborations include humanists, social scientists, computer scientists, and other specialists working together toward shared goals that also meet their individual research aspirations. Rather than working in silos bounded by disciplinary methods, participants in this project have created a single culture of e-research that encompasses what have been called the e-sciences as well as the digital humanities: not a choice between the scientific and humanistic visions of the world, but a coherent amalgam of people and organizations embracing both.
Within this one culture are many important differences and distinctions (think of a magnifying lens adjusting to expose increasing levels of granularity). Regardless of their disciplinary significance, at the lowest level all data in a digital environment are zeros and ones, a flattening of information that, while necessary for its storage within a computer’s architecture, is not particularly meaningful to humans. At an intermediate level, the human user can appreciate the diversity of digital resources. Data for the humanities and social sciences comprise many media and formats; among the types examined by the Digging into Data investigators are digital images of American quilts, fifteenth-century manuscripts, and seventeenth-century maps; conversations recorded in kitchens; news broadcasts; court transcripts; digitized music; and thousands upon thousands of digitized texts in many languages. Text, speech, music, image, and linguistic data offer rich opportunities for close, careful examination as well as rapid, large-scale computational analysis.
Research at these scales, speeds, and levels of complexity encourages new methodological approaches and intellectual strategies. As recently as 20 years ago, social science researchers typically used analog resources and some computational analysis of data collected in a laboratory or in the field, while humanists worked predominantly with library and archival materials. The Digging into Data Challenge presents us with a new paradigm: a digital ecology of data, algorithms, metadata, analytical and visualization tools, and new forms of scholarly expression that result from this research. The implications of these projects and their digital milieu for the economics and management of higher education, as well as for the practices of research, teaching, and learning, are profound, not only for researchers engaged in computationally intensive work but also for college and university administrations, scholarly societies, funding agencies, research libraries, academic publishers, and students.
This report results from a study of eight international projects that have uncovered previously unimagined correlations between social and historical phenomena through computational analysis of large, complex data sets. The following recommendations are based on this study; they are urgent, pointed, and even disruptive. To address them, we must recognize the impediments of tradition that hinder the contemporary university’s ability to adapt to, support, or sustain this emerging research over time. Traditional organizations and funding patterns reflect a much more strictly delineated intellectual landscape. It is time to question which among these boundaries remain useful, which should be more porous, and which no longer serve a useful purpose.
1. Expand our concept of research
To realize the benefits of data-intensive social sciences and humanities, institutions and scholarly societies must expand their notions of what kinds of activities constitute research and reconsider how these activities are supported, assessed, and rewarded. Computationally intensive research projects rely upon four diverse kinds of expertise, each described in detail in section two of this report: domain (or subject) expertise, analytical expertise, data expertise, and project management expertise. The active engagement of each of these kinds of experts in the research enterprise is essential. Hiring practices, job requirements, and criteria for promotion must be re-evaluated accordingly.
2. Expand our concept of research data and accept the challenges that digital research data present
The digital raw materials upon which today’s humanists and social scientists rely are every bit as heterogeneous, complex, and massive as “big data” in the sciences.[1] Not only do humanists and social scientists work with big data, their research also produces large data corpora. In fact, some scholars engaged in computationally intensive research see the new data they create as their most significant research outcome. The academy risks losing valuable data unless someone takes steps to care for them in an intelligent manner; to test them with an appropriate degree of skepticism; and, where needed, to correct, enhance, and integrate them with other data in ways that make them meaningful, reliable, and useful to others.
3. Embrace interdisciplinarity
The scholars participating in the first eight Digging into Data projects are active members of multiple academic communities that cross traditionally bounded fields. Their need to work across disciplines mirrors a larger need for organizational flexibility and possible restructuring of institutions of higher learning to promote successful working partnerships between differently trained scholars and academic professionals. Interdisciplinary collaboration benefits not only researchers but also students. Today’s colleges and universities must equip students with skills appropriate for a rapidly changing and diverse workforce: the intellectual flexibility that an interdisciplinary perspective cultivates is an excellent foundation for developing these skills.
4. Take a more inclusive approach to collaboration
As the subjects of this report attest, humanists and social scientists engaged in computationally intensive work benefit intellectually and professionally from sustained collaborations with others outside their academic departments and institutions. Library, information technology (IT), and other academic staff; graduate and postdoctoral fellows; undergraduates; and even citizen scholars have roles to play in such research projects. These roles need to be articulated and supported. Section three of this report explores this challenge and other challenges arising from collaborative, multidisciplinary research.
5. Address major gaps in training
The complexity of digital research requires an ongoing commitment to professional development in order to maintain expertise in rapidly accruing resources and tools. Faculty, staff, and students need strong, reliable training programs that correlate sound methodological strategies with appropriate new technologies.
6. Adopt models for sharing credit among collaborators
Institutions of higher learning can more forcefully encourage engagement across disciplinary, institutional, and professional divides by noting and appropriately rewarding their faculty, staff, and students for making substantial contributions to collaborative efforts. Few large-scale digital projects can succeed if individual researchers remain solely responsible for them. If collaborative credit sharing enhances, rather than detracts from, the assessment of an individual’s work, more scholars will be willing to work collaboratively, and, ultimately, both the quality and the long-term impact of digital projects in the humanities and social sciences will grow.
7. Adopt models for sharing resources among institutions
The level of investment required to support computationally intensive research is large and growing. It makes no sense to replicate resources, skills, and services at all colleges and universities. Instead, institutions have an opportunity to establish explicit, long-term agreements to work with one another for mutual benefit. There will be serious challenges to overcome (including maintaining appropriate controls over network security, data privacy, and intellectual property), but these challenges must be met to sustain digital research efficiently and affordably.
8. Re-envision scholarly publication
Institutions, scholarly societies, libraries, and funding agencies are all positioned to expand the range of available publication outlets for scholars. Many meaningful outcomes of computationally intensive research, such as data-rich visualizations, cannot be distilled into conference presentations, journal articles, or monographs. Taking advantage of current web technologies, leaders in the academic sector can create new models for publication that incorporate rigorous review processes while at the same time inviting diverse data-rich and multimedia contributions to the academic record.
9. Make greater, sustained institutional investments in human infrastructure and cyberinfrastructure
Computationally intensive research demands a sustainable, redundant network for the preservation of information, as well as trained research professionals to manage this network intelligently. The network’s infrastructure should facilitate sophisticated knowledge management and extraction for both anticipated and unanticipated future research. Gateways into that infrastructure will need continual refinement. With investments in innovation and the refinement of user tools, researchers will be able to engage a broader public in their work. Maintaining a digital infrastructure in which collaborative research can flourish will require major commitments from individuals, institutions, governments, and other funders of higher education. It is time for each of these stakeholders to make these commitments.
Recommendations by Stakeholder Group
For researchers:
• Look for opportunities to develop expertise in areas beyond a single discipline, including other related disciplines, data management, data analysis, and project management.
• Create opportunities for students to develop these kinds of expertise.
• Be willing to collaborate both within and outside your discipline, particularly in cases where researchers in other disciplines use similar methodologies.
• Be willing to collaborate both within and outside your institution.
• Be willing to share credit for collaborative work and to recognize others’ collaborative efforts.
• Cite digital resources, including tools and data, that you use just as consistently as you cite published articles, conference papers, or monographs.
• Contribute to new forms of digital publication as authors, editors, and peer reviewers.
For college and university administrators:
• Commit to investing in the long-term management and preservation of data.
• Create opportunities for humanities and social science faculty, adjunct faculty, staff, and students to develop skills in the management, analysis, and interpretation of these data.
• Offer incentives for engagement in collaborative research initiatives.
• Develop models for the assessment of collaborative work.
• Develop partnerships with institutions with complementary strengths.
• Adopt clear policies for sharing hardware, software, and data resources among on- and off-campus researchers that maximize openness yet protect privacy and intellectual property.
For scholarly societies:
• Cultivate and critically assess new research methodologies with potential benefits for your discipline.
• Promote the value of computationally intensive research methodologies within your discipline to researchers outside the discipline and to the wider public.
• Create opportunities for members to develop skills in data management and analysis.
• Encourage cross-disciplinary engagement among members as well as non-members with relevant expertise.
• Build alliances with other societies with similar needs and interests.
• Support new models for scholarly communication and peer review.
• Commit to supporting the long-term preservation of key digital resources in your discipline.
For academic publishers:
• Seek to publish content that crosses disciplinary boundaries and embraces newer computationally intensive methodologies.
• Encourage the submission of work by multiple authors and ensure that publications give credit to all contributors to this work.
• Seek ways to incorporate digital data and multimedia into online publications, and adopt models for assessing such work.
• Commit to supporting the long-term preservation of and access to your publications.
• Deepen partnerships with academic institutions and scholarly societies in the service of preservation and access.
• Where possible, increase transparency in your business practice so that other academic stakeholders understand the true costs of publication and how these costs are changing over time.
For research libraries:
• Recruit and develop staff prepared to engage as active partners in computationally intensive research initiatives, particularly by offering expertise in data management, data analysis, or the management of collaborative projects.
• Recruit and develop staff capable of contributing to the peer review of new forms of online scholarship.
• Offer consultation services to researchers that help them manage, maintain, and, if warranted, transfer responsibility for valuable research data to library repositories.
• Offer consultation services to researchers that help them identify appropriate publication venues for non-traditional forms of scholarship.
• Encourage cross-disciplinary engagement among researchers and students at your library, such as through public programs or workshops related to data-intensive research tools.
• Establish partnerships with other institutions to promote the long-term preservation of and access to scholarly publications and the digital data upon which they rely.
For funding agencies:
• Acknowledge the high costs of curating reliable large-scale digital data sets for the humanities and social sciences and create incentives for researchers, institutions, and scholarly societies to accept responsibility for these costs.
• Support robust, thoughtful approaches to computationally intensive research in the humanities and social sciences that incorporate disciplinary rigor as well as sound data management, analytical, and project management practices.
• Support training and professional development opportunities related to computationally intensive research for students, staff, and faculty.
• Support new models for academic publication and peer review.
• Encourage cross-disciplinary and multi-institutional research initiatives that take advantage of academic professionals’ and institutions’ complementary strengths.
[1] John Coleman, Mark Liberman, Greg Kochanski, and colleagues make compelling comparisons between the sizes of major data corpora in the sciences and humanities on page 3 of their white paper Mining Years and Years of Speech. See http://www.phon.ox.ac.uk/files/pdfs/MiningaYearofSpeechWhitePaper.pdf.