Towards Dynamic Variorum Editions (DVE)
The final Digging into Data initiative that focuses primarily on textual data is Towards Dynamic Variorum Editions (DVE), a collaboration of scholars from Tufts University, Mount Allison University, and Imperial College London. This project also represents a continuation of decades of prior work that includes digitizing and organizing primary and secondary resources pertinent to Classical Studies and making them available to researchers. The thirteen participants named in the project include scholars, computer scientists, scholar technologists, and one specialist librarian. Since all participants had a long history of working together on numerous grant projects, collaboration at an international level was a well-established practice for this group.
Much of the inspiration for the DVE project comes from partners’ experiences building the Perseus Digital Library, which is based at Tufts and is now one of the oldest and most widely used online research environments. The creation of Perseus was motivated by basic, longstanding concerns shared by a variety of disciplines that study the history and culture of the Classical world, including documenting changes in language use and meaning throughout human history. Until recently, answering such questions required most scholars to devote lifetimes of effort to working with centuries of editions, lexica, and commentaries, and the product of these scholars’ work has been new works built upon these same publication models.
In the past few decades, optical character recognition, text mining, and machine learning technologies have matured to a point at which researchers might discover word usages that would be impossible to find by other means, holding the potential to revolutionize our understanding of classical languages and, in turn, the many other languages they have influenced over time. The aspiration of the scholars working together on the DVE project is audacious: to create an online “edition of editions” that would make freely available and in one environment every edition and scholarly work ever published pertaining to Greek, Latin, or Arabic classical texts.
The deliberate choice to call the project Towards Dynamic Variorum Editions is telling. As in the examples of the Criminal Intent and Mapping the Republic of Letters projects, much work remains to be done in adapting already available technologies so that they are suitable to the purposes of classicists. Among the project’s technical concerns is the customization of current optical character recognition algorithms for classical languages, in particular for Greek. Secondly, the DVE partners have developed an infrastructure capable of handling the volume of text data now available that fits into their scope: rather than describing this corpus as “a million books,” they have chosen to reframe their corpus as “a billion words,” and are currently explicating the scale and implications of this corpus for classical scholarship in various publications. Finally, many hundreds of works, many of which are inaccessible to all but the most determined researchers, still remain to be digitized. These include the vast majority of works that survive in Greek and Latin, since the challenging nature of scholarship in an “analog” world has precluded most scholars from venturing outside the established canon. The task the partners have set themselves is to create an environment in which the rich array of translations and editions of canonical sources can be queried and intellectually accessed, while continuing to build upon this canon with less well studied works made accessible at the lexical level through machine-assisted techniques. Fulfilling both goals is critical to the team, since texts in classical languages span many more centuries than have been sufficiently explored. Texts that have drifted into obscurity often hold critical clues to shifts in meaning and use over time. Without these clues, scholars are liable to misinterpret what they read.
Unlike the Criminal Intent and Mapping the Republic of Letters projects, the DVE project depends upon data that is largely unstructured. This data includes page images of 28,000 works identified as being primarily in Latin, selected from a corpus of 1.2 million books freely available through the Internet Archive (collected by the DVE team’s collaborators at the University of Massachusetts Amherst); 1,000 scanned books from Google’s collection to which optical character recognition has been applied; and 8 million words of Greek and Latin that have been carefully marked up in XML for the Perseus Digital Library. The heterogeneity of the content available presents its own challenges, not the least of which is the flawed quality of much of the metadata. Using language detection software, the team found 4,000 works in the Internet Archive collection that had been incorrectly identified as Latin, and a similar number of Latin works within the corpus of 1.2 million that had been misidentified as being in another language. The date information provided for most of the Internet Archive works was also unhelpful to the team, since the date of publication for the scanned volumes was most often drastically different from the original dates of the works contained within the editions. To correct these errors, the partners employed a team of students to hand-correct metadata for 9,000 of the Latin volumes in their corpus. The resulting 385-million-word corpus of dated Latin was by far the largest amount of text the collaborators had worked with to date. Rather than the first 500 years of Classical Latin represented in the contents of the Perseus project, their new corpus contained the writings of two millennia. To conduct semantic analysis of the Latin corpus, the team used a morphological analyzer for Latin and a 2.9-million-word collection of Latin words and their English translations.
To these tools, Bamman, Babeu, and Crane (2010) have added a methodology for aligning multiple editions of the same text so that the scholarly markup of one edition may be mapped onto another.
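The general idea of mapping markup from one edition onto another can be illustrated with a minimal sketch. It is not the method published by Bamman, Babeu, and Crane; it assumes two same-language editions that differ only slightly and uses the generic sequence matcher from Python's standard library. The function name `project_markup`, the sample tokens, and the citation labels are all hypothetical.

```python
from difflib import SequenceMatcher


def project_markup(source_tokens, source_tags, target_tokens):
    """Copy per-token tags from one edition onto another.

    Tokens that the matcher cannot align receive None, so they can be
    reviewed by hand instead of being labeled incorrectly.
    """
    projected = [None] * len(target_tokens)
    matcher = SequenceMatcher(a=source_tokens, b=target_tokens, autojunk=False)
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            projected[block.b + offset] = source_tags[block.a + offset]
    return projected


# Illustrative data: one line in two editions with a spelling variant.
edition_a = "arma virumque cano Troiae qui primus ab oris".split()
tags_a = ["Aen. 1.1"] * len(edition_a)  # hypothetical citation markup
edition_b = "arma uirumque cano Troiae qui primus ab oris".split()

projected = project_markup(edition_a, tags_a, edition_b)
for token, tag in zip(edition_b, projected):
    print(token, tag)
```

Only the variant spelling is left untagged here. A realistic alignment would also need to handle orthographic normalization, reordering, and translation, none of which this sketch attempts.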
Building and aligning a similar corpus of Greek texts is more difficult because of the limitations of current optical character recognition technology for the Greek language. While current systems performed reasonably well for small quantities of Greek-only texts, the team was interested in performing OCR on large-scale corpora. Furthermore, a large number of works of interest to Greek scholars include quotations in Greek embedded in works in other languages, so a methodology had to be developed to recognize Greek in these contexts as well. To this end, building upon work Federico Boschetti had done while a resident fellow at Perseus, Mount Allison University professor Bruce Robertson led a team of students in customizing open source OCR engines for Greek, introducing facilities to detect page layouts and to identify and classify Greek characters in multiple font families. Their customizations have improved the accuracy of OCR for pre-20th century Greek texts significantly. They have also begun work on a new graphical user interface (GUI) to enable easy correction of OCR.
As with the other text-based projects, working with data at a large scale has required access to infrastructure much larger and more powerful than the average desktop computer provides. The task of the Imperial College London team was to create and implement such an infrastructure, one that would not only process the mass quantities of text data and page images available today, but also work with the even larger quantities of texts relevant to Classical studies that have not yet been scanned, as well as those not yet written. This infrastructure had to privilege speed of character analysis and retrieval of comprehensive results across a large corpus over the precision and accuracy only possible using human-mediated OCR. The DVE project supported the system’s design, which is now undergoing testing. Preliminary results show a processing speed for Greek of approximately two minutes per page. DVE scholars believe that the distributed system, which takes advantage of parallel processing and cloud computing, can serve as a model for managing intensive processing of other types of research data in the future.
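To give a sense of why parallel processing matters at roughly two minutes per page, the following minimal sketch distributes pages across a pool of workers and estimates total processing time. It does not describe the Imperial College London system. The function `recognize_page` is a hypothetical stand-in for a real OCR call, the page identifiers and counts are invented, and the estimate assumes ideal linear speedup with no coordination overhead.

```python
from concurrent.futures import ThreadPoolExecutor


def recognize_page(page_id):
    """Hypothetical placeholder for an OCR call on one page image."""
    return f"text of {page_id}"


def process_pages(page_ids, workers=4):
    """Run the placeholder OCR step on many pages concurrently.

    Results come back in the same order as the input identifiers.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(recognize_page, page_ids))


def estimated_hours(pages, minutes_per_page=2.0, workers=1):
    """Estimate wall-clock hours, assuming ideal linear speedup."""
    return pages * minutes_per_page / workers / 60


results = process_pages(["page-001", "page-002", "page-003"], workers=2)
print(results)
print(estimated_hours(300))              # one worker
print(estimated_hours(300, workers=10))  # ten workers
```

Under these assumptions, a 300-page volume that would occupy a single worker for ten hours finishes in about one hour on ten workers, which is why the work is spread across many machines.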
While it represents just one step in a much larger effort, the Dynamic Variorum Editions project has much wider implications for the future of humanities research and education, which are laid out clearly in the project white paper. By breaking away from the boundaries of the traditional classical canon, its scholars have begun to expose the deeper ties between Classical Studies, itself an interdisciplinary field, and the humanities writ large, encompassing all studies of human history and culture since the Classical period. At the same time, they have exposed the limitations of the current scholarly labor force to cope with the scale of interrelated, interlinked knowledge now at our disposal online. The deep relationships between the Greco-Roman and early Islamic worlds, for example, have gone largely unexplored due to restrictions on physical and intellectual access. As technologies loosen these restrictions by making more accurate text mining, semantic analysis, and machine translation possible, scholars, students, and ordinary citizens can begin to make meaningful contributions to our knowledge and understanding of the past.
- Alison Babeu (Tufts University, US) is the Digital Librarian for the Perseus Digital Library and contributed both data and subject expertise to the project.
- David Bamman (Tufts University, US) is a computational linguist who contributed both technical and subject expertise to the project.
- Federico Boschetti worked with Robertson on customizing optical character recognition (OCR) engines for ancient Greek source texts.
- Lisa Cerrato (Tufts University, US) is the Managing Editor of the Perseus Project and contributed both data and subject expertise.
- Gregory Crane (Tufts University, US) served as Principal Investigator for the NEH-funded portion of the project.
- John Darlington (Imperial College London, UK) served as Principal Investigator for the JISC-funded portion of the project.
- Brian Fuchs (Imperial College London, UK) designed and implemented a scalable computer infrastructure for processing large datasets of page images from books.
- David Mimno (University of Massachusetts Amherst, US) is a computer scientist who contributed both technical and analytical expertise to the project.
- Bruce Robertson (Mount Allison University, Canada) served as Principal Investigator for the SSHRC-funded portion of the project and worked with Boschetti and a team of undergraduate students on producing classifiers suitable for the OCR of ancient Greek source texts.
- Rashmi Singhal (Tufts University, US) is lead programmer for the Perseus Project and contributed technical expertise.
- David Smith (University of Massachusetts Amherst, US) is a computer scientist who contributed both technical and analytical expertise to the project.
Bamman, David and David Smith. 2012. “Extracting Two Thousand Years of Latin from a Million Book Library.” In Journal of Computing and Cultural Heritage, 5 (1). http://doi.acm.org/10.1145/2160165.2160167. Open access preprint available at: http://www.perseus.tufts.edu/publications/01-jocch-bamman.pdf
Bamman, David and Gregory Crane. 2011. “Measuring Historical Word Sense Variation.” In Proceedings of the 11th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL 2011), pp 1-10. http://dx.doi.org/10.1145/1998076.1998078. Open access preprint available at: http://hdl.handle.net/10427/75561
Bamman, David, Alison Babeu, and Gregory Crane. 2010. “Transferring Structural Markup Across Translations Using Multilingual Alignment and Projection.” In Proceedings of the Tenth ACM/IEEE-CS Joint Conference on Digital Libraries, Gold Coast, Australia, June 21-25, pp. 11-20. http://dx.doi.org/10.1145/1816123.1816126. Open access preprint available at: http://hdl.handle.net/10427/70398
Crane, Gregory, Bridget Almas, Alison Babeu, Lisa Cerrato, Matthew Harrington, David Bamman and Harry Diakoff. 2012 (To appear). “Student Researchers, Citizen Scholars and the Trillion Word Library.” Paper accepted to JCDL 2012. Available from Tufts Digital Library, Digital Collections and Archives, Medford, MA. Open access preprint available at: http://hdl.handle.net/10427/75559
Other writings and media
Robertson, Bruce. “Optical Character Recognition of 19th Century Polytonic Greek Texts: Results of A Preliminary Survey.” 19 January 2012.