Using Zotero and TAPoR on the Old Bailey Proceedings: Data Mining with Criminal Intent (DMCI)
[Note: Additional details about this project are contained in the main body of this report (PDF).]
Using Zotero and TAPoR on the Old Bailey Proceedings: Data Mining with Criminal Intent serves as a classic example of expanding the impact of prior investment in constructing a reliable, highly structured digital repository. The Old Bailey Online was first launched in 2003 and has been in development for over a decade. It now includes complete trial records, including transcripts, of 197,000 trials held at the Central Criminal Court in London between 1674 and 1913. In size and scope, the Old Bailey corpus is broadly useful for answering a wide range of questions relating to social and cultural history; worldwide, Old Bailey Online users number in the hundreds of thousands.3
The scholars involved in Data Mining with Criminal Intent (DMCI) represent seven different institutions. Their joint goal was to expand the utility of the Old Bailey corpus by integrating it with online research tools for collecting and exploring textual data. These tools were Zotero, a personal research environment that helps users harvest and manage texts, and Voyeur Tools (now called Voyant Tools), a suite of visualization applications for text analysis, part of TAPoR. Zotero is based at the Roy Rosenzweig Center for History and New Media, George Mason University, and Voyant Tools have been developed by Canadian scholars Geoffrey Rockwell (University of Alberta) and Stéfan Sinclair (McGill University). Like the Old Bailey Online, both Zotero and Voyant Tools were already relatively mature initiatives at the start of the joint effort funded through Digging Into Data.
Before DMCI, scholars could search the Proceedings in Old Bailey Online by keyword or by predetermined categories such as ‘offence type,’ ‘verdict,’ or ‘punishment.’ By bringing groups of trial texts into Zotero and/or Voyant Tools, scholars could break free from the “restricted research pathways” built into the Old Bailey Online architecture, opening up possibilities of exploring and analyzing the texts in alternate ways. To achieve the integration of the three projects, however, each of the teams had to make significant additions or changes to the software upon which their individual projects depended. They (1) developed a new application programming interface (API) for Old Bailey Online (OBAPI) so that other applications could call up, aggregate, and (if desired) export trial data, (2) created a “translator” for Zotero so that individual trial records could be described and managed along with other texts, and (3) simplified and extended the functionality of the user interface for Voyant Tools to make the exploration, visualization, and analysis of those texts more intuitive and powerful for scholars. Once the newly developed pieces of each project were in place, the team demonstrated the value of the integration using example searches.
The DMCI project cohered around a cluster of historical questions about crime in London society, as well as a broader scope of methodological questions about the application of data mining technologies to textual corpora. The UK, US, and Canadian teams each also had their own goals for strengthening and expanding the impact of their individual initiatives. This posed some challenges, since each of the three teams had to continue to serve the needs of their existing user bases while collaborating on making their work interoperable. Despite these challenges, team members observed that the collaboration was successful in both anticipated and unanticipated ways. It increased the range of expertise to which the participants had regular access, strengthened trust among the partners, and kept individual morale high in the process of meeting project goals. Teams in each country, and individuals within each of these teams, each followed independent work plans. Since the goals of the project required adherence to a common timeline, however, they held joint weekly conference calls during which all three groups shared progress reports. Participants also met in person on several occasions throughout the project, scheduling meetings around conferences and workshops of mutual interest.
One of these workshops explored the application of machine learning techniques that calculate degrees of similarity between records or texts, familiar to most users of search engines and e-commerce websites as links labeled “more like this.” The teams immediately saw the potential of implementing machine learning on the Old Bailey site: as a researcher explores the Old Bailey trial texts and selects relevant records, machine learning technologies would yield increasingly accurate predictions of the materials of interest to that researcher, all the while suggesting trials that he or she might not otherwise have found by searching by keyword, category, or date. Excited by the potential of this kind of machine learning to facilitate serendipitous discovery (characteristic of so much scholarly inquiry), the Canadian team is now preparing to implement these technologies within Voyant Tools.
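To make the “more like this” idea concrete, here is a minimal sketch of one common way to rank texts by similarity: build a word-count profile from the records a researcher has selected and score the remaining records by cosine similarity. This is an illustration under stated assumptions, not the method the DMCI teams implemented. The function names and the sample texts are hypothetical, and the sample texts are invented placeholders rather than actual Old Bailey transcripts.

```python
import math
import re
from collections import Counter

# Invented placeholder texts, NOT actual Old Bailey transcripts.
SAMPLE_TRIALS = {
    "t1": "the prisoner put poison in the coffee and the deceased drank it",
    "t2": "she drank the coffee and was taken ill from the poison",
    "t3": "the prisoner stole a silver watch from the shop",
    "t4": "he was indicted for stealing a horse",
}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def more_like_this(selected_ids, corpus, top_n=3):
    """Rank unselected texts by similarity to the texts already selected."""
    # Merge the selected texts into a single word-count profile.
    profile = Counter()
    for doc_id in selected_ids:
        profile.update(tokenize(corpus[doc_id]))
    # Score every text the researcher has not yet selected.
    scored = [
        (cosine_similarity(profile, Counter(tokenize(text))), doc_id)
        for doc_id, text in corpus.items()
        if doc_id not in selected_ids
    ]
    scored.sort(reverse=True)
    return [doc_id for _score, doc_id in scored[:top_n]]


print(more_like_this({"t1"}, SAMPLE_TRIALS, top_n=2))
```

Each additional selection enlarges the profile, which is one simple way suggestions could sharpen as the researcher works. Raw word counts let common words such as “the” dominate the score, so a practical version would probably weight terms (for example with TF-IDF) or remove stop words.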
We realized that to pursue intellectual agendas such as the differing crime patterns of women and men we needed multiple points of entry rather than a single massive visualization. We thus followed several tracks at the same time, including data warehousing, mathematical models, and small-to-large visualizations. – DMCI team (white paper)
Throughout their work together and in their joint white paper, the DMCI group stresses the importance of communicating the value and utility of the research methods they facilitate and employ to “ordinary working historians,” as well as the importance of optimizing the usability of research tools to broaden their impact. To this end, they recruited an advisory team of a dozen professional historians with expertise in eighteenth- and nineteenth-century London and the history of crime to test the use of Old Bailey texts with Zotero and Voyant Tools. While collecting and managing trial records within Zotero came naturally to the “ordinary” user group, textual analysis and visualization were less intuitive. Reactions from this focus group prompted the creation of a new, simplified user interface for Voyant Tools, along with documentation and tutorials about the project tailored specifically to the needs of historians.
What has become self-evident to me, is that “big data,” and even “pretty big data” inevitably creates a different and generically distinct form of historical analysis, and fundamentally changes the character of the historical agenda that is otherwise in place. – Tim Hitchcock
At the conclusion of the grant term, participants began pursuing publication opportunities within their own individual areas of historical expertise that demonstrate how “textual analysis of the infinite archive” can complement and transform existing practices and suggest new interpretations. In their white paper, the group explains how extracting records containing the term “poison” from the Old Bailey Archive with Zotero and importing these records into Voyant Tools helped project participant and developer Fred Gibbs identify frequent co-occurrences of “poison” with “drank” and “coffee,” while verifying that co-occurrences with “ate” or “eaten” are relatively rare, suggesting to him that coffee was the poisoner’s medium of choice in eighteenth- and nineteenth-century London. Cohen used Criminal Intent methods to identify and extract trial transcripts with references to bigamy and demonstrated a significant decline in the severity of punishments for female bigamists in the latter part of the nineteenth century. Tim Hitchcock and Bill Turkel took the extracted Old Bailey corpus into the sophisticated computation and visualization tool Mathematica. After numerous iterations of visualization and interpretation, they arrived at the graph below, which plots trial transcript length across the two hundred years of legal history represented in the corpus, color-coding the transcripts by offence. The results show a significant increase in short trial reports, both for serious offences such as forms of “killing” and for trials overall. Through this process, Turkel and Hitchcock were able to identify an important phenomenon in the history of the criminal trial and to evidence the rise of ‘plea bargaining’ as a common characteristic of the British criminal justice system from the second quarter of the nineteenth century.
In the process they were able to bring the results of data mining to bear on a current historical debate on the timing and character of the “modern ‘adversarial trial.’”
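The co-occurrence analysis described above for “poison” can be sketched in a few lines. The following is a simplified illustration, not the Voyant Tools implementation: it counts the words that appear within a fixed window of tokens around each occurrence of a keyword. The function name, the window size, and the sample sentences are assumptions made for the example, and the sentences are invented rather than drawn from the Proceedings.

```python
import re
from collections import Counter

# Invented placeholder sentences, NOT actual Old Bailey transcripts.
SAMPLE_TEXTS = [
    "She put poison in the coffee and he drank it",
    "The poison was in the coffee he drank",
    "He ate the bread",
]


def cooccurrence_counts(texts, keyword, window=6):
    """Count words found within `window` tokens of each `keyword` occurrence."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            # Collect neighbours on both sides, excluding the keyword itself.
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


counts = cooccurrence_counts(SAMPLE_TEXTS, "poison")
for word in ("coffee", "drank", "ate", "eaten"):
    print(word, counts[word])
```

Comparing the counts for “drank” and “coffee” against those for “ate” and “eaten” mirrors the comparison reported in the white paper, though on a real corpus the counts would need to be normalized against how often each word occurs overall.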
Data Mining With Criminal Intent is an ambitious initiative that required high investments of time and talent during a relatively short period. Participants also reported higher than anticipated expenses that had to be absorbed by the institutional budgets of the main partners, and that the computational demands of the project stretched the limits of the resources investigators had at hand: this was a pattern that we saw in every Digging Into Data project. Even with the advantage of a reliable, highly structured, and relatively uniform data set, the integration of textual analysis and visualization methodologies into historians’ search and discovery experience has not been a straightforward process, nor is it complete. Software development is ongoing and will continue to be informed by active historical research; by their own testimony, participants’ scholarship has also been richly informed by the development process. The pattern of collaboration between partner developers and scholars, including many partners who play both roles, is one that re-appears consistently in the other Digging into Data projects.
3 Statistics kept for The Old Bailey Online suggest 23 million visits to the site since 2003.
- Dan Cohen (George Mason University, USA) served as Principal Investigator for the NEH-funded portion of the project and managed the workflow and partnership at GMU.
- Fred Gibbs (George Mason University, USA) wrote the Zotero plugin that extracts trial transcripts from The Proceedings of the Old Bailey Online, organizes them, and sends their text to mining services. He also conducted research using the project’s tools.
- Tim Hitchcock (University of Hertfordshire, UK) served as Principal Investigator for the JISC-funded portion of the project and as liaison between the Old Bailey team and other project partners, ensuring that data was available in the right form. He also worked with Turkel on detailed textual analysis, and on organizing the stakeholders’ engagement with the project.
- Geoffrey Rockwell (University of Alberta, Canada) served as co-Principal Investigator for the SSHRC-funded portion of the project and worked with Sander and John Simpson to implement the data warehouse model for data from The Proceedings of the Old Bailey Online in preparation for the Old Bailey Application Programming Interface (OBAPI).
- Jörg Sander (University of Alberta, Canada) worked with Rockwell and John Simpson to select and then implement the data warehouse model for data from The Proceedings of the Old Bailey Online in preparation for the Old Bailey Application Programming Interface (OBAPI).
- Robert Shoemaker (University of Sheffield, UK) managed the implementation of the Old Bailey Application Programming Interface (OBAPI) at Sheffield.
- Stéfan Sinclair (McGill University, Canada, previously McMaster University) served as co-Principal Investigator for the SSHRC-funded portion of the project and designed a new, simplified skin (a combination of tools) to optimize Voyeur/Voyant Tools’ visual ease-of-use.
- Sean Takats (George Mason University, USA) worked with Cohen and Gibbs on the incorporation of the plugin that extracts trial transcripts from The Proceedings of the Old Bailey Online and imports them into the Zotero research management tool.
- William Turkel (University of Western Ontario, Canada) imported project data into Mathematica to create visualizations for the project.
Other contributors and stakeholders
- Cyril Briquet (McMaster University, Canada)
- Hugh Couchman (SHARCNET, Canada)
- Clive Emsley (Open University, UK)
- Margaret Hunt (Amherst College, USA)
- Jamie McLaughlin (University of Sheffield, UK)
- Michael Pidd (University of Sheffield, UK)
- Milena Radzikowska (Mount Royal University, Canada)
- Kevin Siena (Trent University, Canada)
- John Simpson (University of Alberta, Canada)
- Kirsten C. Uszkalo (Independent Scholar)
Papers and publications
Other writings and media
- Hitchcock, Tim. Academic History Writing and the Headache of Big Data. Historyonics (blog). 30 January 2012.
Presentations, lecture notes and slides
- Cohen, Dan. The Future of History
- Hitchcock, Tim. Textmining the Old Bailey Proceedings. Digital History Seminar, Institute of Historical Research. 13 June 2011.
- Hitchcock, Tim. Using Zotero and TAPoR on the Old Bailey Proceedings: Data Mining with Criminal Intent. American Historical Association Annual Meeting. 5 January 2012.