It became clear in our work that humanists, who are often exceptional experts in their fields, often have a difficult time describing how they go about their work and analyses. Having humanists work in teams and with computer scientists required them to explain and detail their processes. The work we have done has made us more sensitive to this issue and opens up many new areas of research: How can we develop better collaborative models that help humanists explicate their processes? Can we build tools to help capture the way humanists work? How can we enhance digital archives to facilitate the ways humanists work with objects?
–Dean Rehberger, Digging into Image Data to Answer
2.1 Structural Commonalities and Notable Differences
A broad range of topics and methodologies are represented in the eight inaugural Digging into Data initiatives. Nevertheless, when considered as exercises in research practice, the initiatives reveal some shared characteristics. All projects:
- Engage with data corpora that are much larger than what might be read, seen, heard, or experienced by a single individual. These corpora range from the highly structured, uniform, and topically specific to the completely unstructured and heterogeneous.
- Apply some form of computational analysis (whether described as a tool, an application, or merely an algorithm) to these corpora. These tools, applications, and algorithms vary from the highly specific to the more general, and from the most experimental to the mature. Some are widely accessible, while others require the expertise of computer specialists.
- Require continual refinements to tools and data, which in turn requires collaboration and coordination of multiple project participants with varied backgrounds and skills.
- Conduct a research process that incorporates most or all of seven stages:
a. hypothesis and/or question formation;
b. selection of a corpus or corpora;
c. exploration of a corpus or corpora;
d. querying and correcting, modifying, or amending the data as needed;
e. pulling together subsets of data relevant to a given question;
f. making observations about those data; and
g. drawing conclusions from and/or interpreting those data.
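The seven stages above can be sketched as a toy pipeline. This is a hypothetical illustration only; the corpus, function names, and keyword are invented and do not come from any of the funded projects.

```python
# A minimal, invented sketch of the seven-stage research process.
# Stage (a), question formation, frames the whole run: how often do
# these documents mention trials?

def select_corpus():
    # (b) selection: a toy corpus standing in for a large collection
    return ["the trial of john doe", "teh trial of jane roe", "a quiet day"]

def explore(corpus):
    # (c) exploration: simple profiling of the corpus
    return {"documents": len(corpus)}

def correct(corpus):
    # (d) querying and correcting: fix an obvious OCR-style error
    return [doc.replace("teh", "the") for doc in corpus]

def subset_for(corpus, keyword):
    # (e) pulling together the subset relevant to the question
    return [doc for doc in corpus if keyword in doc]

def observe(subset):
    # (f) making observations about those data
    return len(subset)

def interpret(count, total):
    # (g) drawing a (toy) conclusion from the observations
    return f"{count} of {total} documents mention the keyword"

corpus = select_corpus()
profile = explore(corpus)
corpus = correct(corpus)
relevant = subset_for(corpus, "trial")
print(interpret(observe(relevant), profile["documents"]))
```

In practice stages (c) through (f) loop many times as exploration prompts further correction and refinement, rather than running once in sequence as here.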
The case studies show that computationally intensive research mirrors other kinds of inquiry, although they suggest that a dependency on digital tools and resources requires more explicit documentation and communication about methodology than has typically been the case in the humanities and qualitative social sciences. Because digital research methodologies are still maturing, it is important to consider carefully the rationale for the investigators’ choices of analytical tools and the evidence produced; that is, to reflect upon the significance of a specific tool applied to a specific corpus. To borrow the words of one project team, who worked together on Using Zotero and TAPOR on the Old Bailey Proceedings: Data Mining with Criminal Intent (DMCI),31 “The methodology is part of the message.”
Social scientists are generally comfortable foregrounding explanations of methodology in discussions of their research; humanists, by contrast, tend to foreground the argument or interpretation resulting from scholarly investigation rather than the research methods. Asserting the value of one’s approach to research as a model for others is a more comfortable position for social scientists than for humanists. Humanists often see greater understanding of the subject matter with which they are concerned as their primary contribution to their fields, or at least a more important contribution than the preparatory work necessary to describe new findings and support new claims.
Crossing disciplinary boundaries often increases the impact of computationally intensive scholarship by exposing it to greater numbers of researchers, students, and the public. At the same time, it complicates project management: traditions, concepts, and research vocabularies must be adapted to accommodate other points of view. When the common ground for a collaboration is methodological (the “how”) rather than driven by a shared desire for a particular discovery or outcome (the “why”), partners must be prepared to work in ways that do not neatly fit the models they have been trained to emulate. This results in products for which partners cannot take sole credit, some of which defy traditional kinds of peer review. The level of stress this transformation may create for the researcher varies by discipline, by institution, and by individual, but acceptance of this change is obligatory.
These projects point to new avenues for investigation more often than they provide conclusive answers to their original framing questions. This is not surprising, given that topics as complex as patterns of human creativity, authorship, and the continuity of culture over time often elude conclusive explanation. But many practitioners of computer-assisted investigation contend that in time, with enough attention to the curation of valid data, the formation of suitably complex and replicable methods of analysis, and the framing of increasingly precise questions, it may be possible to combine computer-based analysis of large data corpora with the creativity and critical power of the human researcher to promote a greater understanding of our society and culture than has ever been possible. The prospects of new discovery at such a scale seem achievable only through continued collaboration across disciplines.
2.3 The Spectrum of Data and Its Consequences
The quality, quantity, and utility of data are unquestionably the most complex determining aspects of these projects. Amid many shared characteristics, important differences surfaced, an effect not only of differing disciplinary traditions but also of the choice of collaborators deemed most suitable for the media, scale, and organization of the targeted data sets; the proportion of manual to automated work; the need for continual adaptation of analytical tools; and the likelihood of achieving major outcomes in a brief (15-month) grant period. In other words, it is not just the specificity of the question or the maturity of a tool that determines what computationally intensive research might achieve, but also the state of the raw material from which it is produced.
In discussions and subsequent exchanges at the Digging into Data program’s culminating meeting in June 2011, University of Portsmouth Professor Richard Healey, co-principal investigator of Railroads and the Making of Modern America, suggested that the framing of the original Challenge oversimplified the kinds of work that computationally intensive research encompasses. He writes, “I think there may have been something of an implicit original assumption behind the initiative, at a broad level, that since there were multiple millions of digital text/image/data files ‘out there’ …all the focus would be on the use of data mining and other algorithms to tease out new signals from the noise.” Rather than a “one-size-fits-all” model for data-intensive humanities and social sciences, Healey proposes many different levels of data-related operations. He describes these levels as a “data hierarchy” and characterizes them as follows:
Level 0: Data so riddled with error that they should come with a serious intellectual health warning … ! (We have much more of this than most people seem willing to admit. … ).
Level 1: Raw data sets … corrected for obvious errors.
Level 2: Value-added data sets (i.e., those that have been standardised/coded, etc., in a consistent fashion according to some recognised scheme or procedure, which may require significant domain expertise/training and the exercise of judgement. …).
Level 3: Integrated data resources … these will contain value-added data sets but the important additional aspect of these resources is that explicit linkages have been made between multiple related data sets (or have been coded/tagged in such a way that the linkages can be made by software. … ).
Level 4: “Digging Enabler” or “Digging Key” data/classificatory resources … these require extensive domain expertise and use/analysis of multiple sources/relevant literature to create. They facilitate extensive additional types of digging activity to be undertaken on substantive projects beyond those of the investigators who created them, i.e., they become “authority files” for the wider research community. Gazetteers, structured occupational coding systems, data cross-classifiers, etc., fit into this category. … There are important questions also about how such resources acquire authority status (e.g., through quality of referencing back to original sources, through collaborative work by leading research groups in the field, by peer review, by crowd sourcing from citizen scholars).32
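Healey’s lower levels can be illustrated with a small, invented example: promoting toy occupational records from raw (Levels 0–1) to value-added (Level 2) form. The records, the misspelling fix, and the coding scheme are all hypothetical, not drawn from any actual gazetteer or classification system.

```python
# Invented records: Level 0/1 raw data, one with an obvious error.
raw = [
    {"name": "J. Smith", "occupation": "blacksmth"},
    {"name": "A. Jones", "occupation": "Blacksmith"},
    {"name": "M. Brown", "occupation": "farmer"},
]

def correct(records):
    # Level 1: correct obvious errors (here, a known misspelling)
    fixes = {"blacksmth": "blacksmith"}
    return [{**r, "occupation": fixes.get(r["occupation"], r["occupation"])}
            for r in records]

# Level 2: a (hypothetical) recognised coding scheme for occupations
OCCUPATION_CODES = {"blacksmith": "METAL-01", "farmer": "AGRI-01"}

def standardise(records):
    # Attach a consistent code; real coding demands domain judgement
    return [{**r, "code": OCCUPATION_CODES[r["occupation"].lower()]}
            for r in records]

coded = standardise(correct(raw))
# Level 3 would then link these coded records to other data sets
# (e.g., census or parish registers) via the shared codes.
```

The step from Level 1 to Level 2 is where Healey locates the need for domain expertise: deciding that “Blacksmith” and “blacksmth” belong to the same category is judgement, not mechanics.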
These distinctions make clear that to realize the benefits of data-intensive social sciences and humanities, institutions and scholarly societies must expand their notions of what kinds of activities constitute research, and must reconsider how these different activities are supported, assessed, and rewarded.
Investigators frequently stressed that the research they pursued would not have been possible without extensive collaboration with partners who contributed many kinds of expertise, working in what Peter Ainsworth (Digging into Image Data) called a “transformative, symbiotic partnership.” Collaborators’ expertise and training overlapped more in some cases (such as Mining a Year of Speech and Data Mining with Criminal Intent) than in others (such as Digging into the Enlightenment and Digging into Image Data). When the teams included experts with complementary, rather than overlapping, strengths, the coordination and management of the project, including communication among the partners and the division of responsibility for shared resources, was especially vital, as were significant investments of time in planning for and framing the project.
Four generic kinds of expertise were represented among the partners in each project: domain expertise, data management expertise, analytical expertise, and project management expertise. Participants in all the projects shared an appreciation for each of these kinds of skill. While not always present in the same proportions, each area was represented in all eight projects by one or more individuals. These categories of expertise seemed important counterbalances to one another, as if they were the four supporting legs of a table (Figure 1).
Fig. 1: Expertise represented among project partners
Although the investigators agreed that the four categories were equally important, some observed that the contributions of researchers with more than one of these kinds of expertise were most critical to project success. Dan Edelstein, who worked on Digging into the Enlightenment, put it this way: “What made our project possible was that we had these hybrid people with more than one leg of the ‘table’. Those people are very hard to find. They don’t do well naturally in a university setting.” Students, short-term project staff, and junior faculty all played crucial roles, often in a “hybrid” capacity.
2.4.1 Domain Expertise
Domain expertise incorporates a theoretical as well as a factual understanding of the humanities or social science research traditions relevant to the project. It was usually represented in the projects at the principal investigator level, an indicator of its critical importance. Beyond this, outside contributors also played important roles; for example, in the Data Mining with Criminal Intent project, a number of outside experts tested and evaluated the project’s tools and methodology. The theoretical component of Structural Analysis of Large Amounts of Music Information is an ontology contributed by experts at the universities of Oxford and Southampton, and was fundamental to shaping the direction of that project. Domain expertise requires familiarity with the kinds of data to be examined, the ways in which disciplinary specialists have interpreted them in the past, and the ability to identify key knowledge gaps and questions to which computationally intensive methodologies can be applied. Other relevant skills include an understanding of the provenance and materiality of digitized evidence and the imagination to make connections between research concerns and the concerns and practices of related disciplines. Familiarity with the relevant disciplinary literature, its conventions of citation and publication, and an ability to teach others to appreciate the importance of each are included in this skill set. Critically, these experts must be comfortable teaching others from diverse educational backgrounds, including students at all levels; computer scientists, programmers, and developers; and members of the general public.
Domain experts have:
• a deep theoretical and factual knowledge of relevant field(s)
• familiarity with types of data to be examined, their provenance, and their significance to the relevant field(s)
• the ability to identify knowledge gaps
• familiarity with disciplinary literature and conventions
• the ability to teach others from different backgrounds to appreciate all of the above
2.4.2 Data Expertise
As a team we noticed an interesting interaction where we had to accept each other’s approaches. This was particularly important in that those in the Old Bailey who had come in with an appreciation for their structured data had to come to understand how the Old Bailey could be seen as a mass of unstructured data for text mining. The text miners in the group in turn had to look more closely at what could be done with structured data. This was a fruitful exchange.
–Data Mining with Criminal Intent team
Data expertise is defined by an understanding of how data have been collected and curated, the relationships between material objects and digital representations of those objects, relevant data models and conventions for description, and storage systems and how they affect the way in which data are accessed and preserved. An understanding of information-seeking behaviors across diverse disciplines and an ability to predict future or alternate uses for data consumed or produced by the project are also relevant. Devising ways to manage the hand-correction of erroneous data efficiently is another important contribution of data experts, since such correction, when necessary, can consume a major share of the labor on a digital project.
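One way to manage hand-correction efficiently, sketched here under invented assumptions, is to triage records automatically so that human effort is spent only on suspect entries. The vocabulary check below is a crude stand-in for a real validation or OCR-confidence test; the records and word list are hypothetical.

```python
# Hypothetical triage for manual correction: flag only records that
# fail an automatic check, rather than proofreading everything.

# Invented stand-in for a domain vocabulary or validation rule set
KNOWN_VOCABULARY = {"guilty", "not", "acquitted", "theft", "murder"}

def needs_review(record):
    # Flag a record if any token falls outside the known vocabulary,
    # a crude proxy for an OCR-confidence or schema-validation check.
    return any(tok not in KNOWN_VOCABULARY for tok in record.split())

records = ["guilty theft", "acqu1tted murd3r", "not guilty"]
review_queue = [r for r in records if needs_review(r)]
# Only the suspect record reaches a human corrector; the rest pass.
```

The design choice here is the data expert’s contribution: deciding which automatic checks are trustworthy enough to let the bulk of the corpus bypass expensive human review.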
A data expert must have sufficient technical knowledge of storage systems to help others comprehend how they might affect compatibility or interoperability with other systems and standards. She or he must also understand new forms of publication that can integrate data resources with narrative and interpretation. These experts make important contributions in teaching and advising other participants to adopt research practices that maximize readiness of relevant data for publication and reuse.
Data expertise was often represented in the projects by the managers or curators of the corpus or corpora to be investigated. The British Library partners who manage the British National Corpus that is part of the basis for Mining a Year of Speech, the University of Oxford–based creators of the Electronic Enlightenment for Digging into the Enlightenment, the Tufts University managers of the Perseus Digital Library for Towards Dynamic Variorum Editions, and the multiple partners who share responsibility for creating the Old Bailey Online (Data Mining with Criminal Intent) and the Quilt Index (Digging into Image Data) are examples. The level of engagement of these partners in the day-to-day operations of each project varied according to how structured or accessible their data were initially and how closely project activities aligned with the priorities of the institution responsible for maintaining the corpus.
Data experts have:
• an understanding of how data have been collected and curated and of relationships between material objects and digital representations of those objects (if applicable)
• familiarity with data models and/or conventions of description
• an understanding of how relevant data are accessed and stored
• the ability to facilitate data sharing and manual error correction, both during and after the project
• the ability to predict future or alternate uses for data
• an understanding of new forms of publication that can incorporate data
2.4.3 Analytical Expertise
We realized that we needed to be better about opening up the black boxes of algorithms. For many humanists, they remain a mystery in which one feeds things in one end and an “answer” comes out the other end. But algorithms are more like recipes, and it is important to have humanists be part of every stage of the process. We need to determine the ingredients (features) that will be used in the process. We need to make it clear that the actual “cooking” process of the algorithm can be changed or tweaked depending on the input and output. And finally, the output is not an answer but another kind of “text” or “visualization” that needs to be interpreted or analyzed. Algorithmic literacy means not only learning how to interpret results but to understand the whole “cooking” process of algorithm development.
–Dean Rehberger, Digging into Image Data to Answer
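Rehberger’s “recipe” framing can be illustrated with a toy classifier in which the ingredients (features) and the cooking process (a threshold) are explicit and tweakable, and the output is a structured object that still requires interpretation. Everything here is invented for illustration and corresponds to no actual project tool.

```python
# A toy "recipe" algorithm: features and parameters are choices,
# not givens, and the output is a new text to be interpreted.

def classify(document, features, threshold=2):
    # "Ingredients": which features to look for is itself a decision
    # humanists should help make.
    hits = sum(1 for f in features if f in document)
    # "Cooking": the threshold can be tweaked depending on how the
    # inputs and outputs behave.
    label = "relevant" if hits >= threshold else "uncertain"
    # Output: not an answer but another kind of "text" that the
    # researcher must still analyze.
    return {"document": document, "feature_hits": hits, "label": label}

result = classify("the court found the prisoner guilty of theft",
                  features=["court", "guilty", "theft"])
```

Making the feature list and threshold visible arguments, rather than burying them inside the function, is the point of the recipe metaphor: every ingredient is open to inspection and revision.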
Technologists, scholar-technologists, information scientists, and computer scientists contributed analytical expertise. While the role of the analytical expert was important in all of the projects, it was paramount in those relying on high-performance computing infrastructures and in those developing cutting-edge methodologies (such as visual analytics for Digging into the Enlightenment, computer-aided structural analysis of music for the SALAMI project, and adaptive image analysis for Digging into Image Data). Analytical expertise is not limited to specialized programming and computation; for data-intensive work, it often requires a much broader understanding of research methodologies than is common among programmers or developers. Gregory Crane, one of the investigators who led Towards Dynamic Variorum Editions, emphasized the importance of this distinction: “We often do not get access to people working at a sufficient level of expertise [in computational analysis] to get real work done.”
Analytical expertise includes understanding the strengths and weaknesses of an array of research tools relevant to a project. These may include generic statistical, visualization, geographic information, and optical character recognition tools as well as the numerous specialized algorithms used by the Challenge investigators. Analytical experts select the most appropriate tools and are able to customize and improve them for specific research tasks. These experts can test the efficacy of an analysis, validate results, and teach less-experienced partners to read and interpret visualizations, charts, and statistics. Measuring the performance of new methods against traditionally collected “ground truth” data in order to validate those methods was a key component of the SALAMI project and Harvesting Speech Datasets for Linguistic Research on the Web.
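The ground-truth validation mentioned above can be sketched minimally as agreement between a method’s output and human annotations. The labels below are invented; real evaluations (such as those of music segmentation in SALAMI) use richer metrics, for example tolerant matching of segment boundaries.

```python
# A minimal sketch of validating a method against "ground truth":
# compare automatic labels with human annotations, item by item.

def accuracy(predicted, ground_truth):
    # Fraction of items where the method agrees with the annotator
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)

# Invented structural labels for four sections of a piece of music
ground_truth = ["verse", "chorus", "verse", "bridge"]
predicted = ["verse", "chorus", "chorus", "bridge"]
print(accuracy(predicted, ground_truth))  # 0.75
```

The analytical expert’s role is not only computing such scores but judging whether the metric itself is appropriate: per-item agreement, as here, can badly misrepresent methods whose errors cluster in structurally important places.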
Analytical experts have the ability to:
• understand the strengths and weaknesses of individual research tools
• select and customize appropriate tools to support research goals
• predict problems that might arise with using the selected tool to perform project tasks
• predict and detect error rates in data and data analysis algorithms and to choose statistical methods that account for these errors when appropriate
• teach others to interpret results of analysis
2.4.4 Project Management Expertise
This project has offered me the unique opportunity, as both a junior faculty member and a female in digital humanities, to simultaneously develop leadership and research skills. The project PIs have given me and other junior faculty the opportunity to integrate ourselves fully into the project not just as a source of labor (be it intellectual or task-based) but also as a participant in shaping the future of our research. They have mentored us throughout the project, created specific pathways to publications and presentations, and allowed us equal ownership of the project.
–Jennifer Guiliano, Digging into Image Data to Answer
Without project management expertise, none of the inaugural Digging into Data projects could have succeeded. The inherently experimental nature of these projects made coordinating parallel work streams complicated. Project managers had to track the achievements of their collaborating partners on an almost daily basis, especially in cases where large numbers of people were involved. Thorough and consistent project documentation, so necessary for the products of such initiatives to be useful to other scholars, is an additional component that requires a skilled manager’s coordination. The projects funded through the Challenge bore the additional burden of reporting on the same work to several funding agencies, and the effort involved in compiling such reports is significant.
Project management responsibilities were often distributed to several members of the team, most commonly to principal investigators. Occasionally one of the collaborating institutions assumed leadership in this area. For one of the larger initiatives, Digging into the Enlightenment, the team chose to assign major project coordination tasks to the in-house academic technology specialist at the Stanford Humanities Center; having access to a professional with expertise in grant management and coordination was invaluable. For Digging into Image Data, another large collaboration, the experience and support of staff at the Institute for Computing in Humanities, Arts, and Social Science (I-CHASS) at the University of Illinois’ National Center for Supercomputing Applications (NCSA) were fundamental. Here, where sharing hardware, software, and data among distant collaborators was critical to project success, the partners crafted a formal memorandum of understanding for the project that eliminated confusion about participants’ individual roles and responsibilities, freeing partners to focus on their work.33 The agreement also addressed legal and ethical issues, including setting standards for citation and credit sharing among participants in postproject presentations and publications as well as for respecting intellectual property restrictions. In addition, the agreement prescribed methods for communication and documentation for the project and a policy for licensing of any software deliverables.
Project managers have:
• an ability to frame project parameters
• an ability to set appropriate goals and deadlines and to coordinate parallel work streams if necessary
• an ability to select the most appropriate communication and documentation strategies for the project
• a mastery of collaborative research tools
• a strong desire to work toward outcomes that benefit all team members
32 E-mail from Richard Healey to Christa Williford, June 11, 2011.
33 Simeone, Michael, Jennifer Guiliano, Rob Kooper, and Peter Bajcsy. “Digging into Data Using New Collaborative Infrastructures Supporting Humanities-Based Computer Science Research.” First Monday 16.5 (May 2, 2011). Available at http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3372/2950. Section 3.2 of this article, titled “Legal and Ethical Aspects of Scholarly Collaborations,” is especially salient here.