Two of the inaugural Digging into Data projects were based in the field of computational linguistics, a discipline with well-established methodologies honed over decades by theorists and practitioners from multiple academic and corporate research environments. However, gathering the audio data to support research in this domain has, until recently, been performed more often in laboratory settings than in real life. Audio recordings collected in the laboratory suffer from obvious limitations: the artificiality of the laboratory setting, the framing of data collection around the specific interests of the researcher, and the “observer effect” being just a few. Mining a Year of Speech is a project that seeks to move beyond the laboratory to investigate very large-scale corpora of English recorded “in the wild.” These corpora include the recently digitized British National Corpus (BNC) from the British Library Sound Archive, and diverse recorded speech collections at the Linguistic Data Consortium (LDC) at the University of Pennsylvania. For the purposes of this project, the team dealt only with recorded language collections for which there were pre-existing transcriptions and a reasonable amount of descriptive information, and which would present relatively few legal difficulties for publication.
The 10 million-word spoken part of the BNC includes 4.2 million words of everyday speech recorded by volunteers throughout the United Kingdom during the 1990s. The LDC’s collections include political speeches, news broadcasts, Supreme Court oral arguments, anonymous telephone and face-to-face conversations, audio book recordings, and interviews. Together the corpora analyzed for this project total over 5,000 hours of speech, far more than would be possible for any one researcher to listen to in a lifetime. In data terms, the aggregated corpora for this project compare with some of the largest “Big Science” research data sets in use today.
Relative size of “Big Science” and “Big Humanities” corpora, by John Coleman.4
Rather than framing their project around a single question, the eight researchers working on Mining a Year of Speech focused their attention on establishing a methodology for making their corpora searchable by the scholarly community and wider public; in other words, creating a “speech search engine.” When they reach this goal, their corpora will be able to support investigations of a wide variety of subjects across many disciplines. Naturally, the potential of using such a resource to test theories of word use, pitch, stress, and dialectal differences would be vast, and the team provides several interesting examples of these in their white paper, but a “speech search engine” could have impacts on other disciplines that would be equally wide-ranging. Students of English as a second language could retrieve multiple use cases and pronunciations for the vocabulary they are studying; scholars in media studies could identify and examine changes in the way certain words and phrases are stressed across multiple news outlets over time; given sufficient data about the gender, geographic origin, and socioeconomic backgrounds of speakers featured in large-scale audio corpora, psychologists or anthropologists could collect and study examples of recorded speech to explore theories of human behavior.
Graphical user interface for aligning transcripts to audio files used for the Mining a Year of Speech project.
The team’s central challenge is this: while speech transcriptions can already be easily searched, searchers cannot retrieve the snippets of audio data that match these transcriptions until those transcriptions are precisely time-coded to the appropriate places in the recordings. Manual time-coding is a time-consuming endeavor. For corpora at the scale of the Mining project, such a task would be impossible. Fortunately, recent developments in speech technology offer a solution: forced alignment. Roughly, forced alignment requires automatically converting textual transcriptions into their phonetic equivalents and then using a phonetic dictionary to match the phonetic transcriptions to the recorded speech.
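The idea behind forced alignment can be illustrated with a deliberately small sketch. It is not the Penn Phonetics Lab Forced Aligner or the team's code: the lexicon, the function names, and the per-frame scores are all hypothetical, and the scores are made-up stand-ins for what a trained acoustic model would supply. The sketch looks up each word's phones in a toy dictionary, then uses dynamic programming to find the best left-to-right assignment of audio frames to those phones, which yields a start and end frame for every phone.

```python
from typing import Dict, List, Tuple

# Hypothetical toy pronunciation dictionary (ARPABET-style labels).
TOY_LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
}


def transcript_to_phones(words: List[str], lexicon: Dict[str, List[str]]) -> List[str]:
    """Look up each word and concatenate the resulting phone sequences."""
    phones: List[str] = []
    for word in words:
        phones.extend(lexicon[word.lower()])
    return phones


def forced_align(
    phones: List[str], frame_scores: List[Dict[str, float]]
) -> List[Tuple[str, int, int]]:
    """Assign every frame to one phone, in order, maximizing the total score.

    frame_scores[t][p] is an assumed log-score that frame t sounds like phone p.
    Returns (phone, start_frame, end_frame_exclusive) for each phone.
    """
    n_frames, n_phones = len(frame_scores), len(phones)
    if n_phones == 0 or n_frames < n_phones:
        raise ValueError("need at least one frame per phone")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    stayed = [[False] * n_phones for _ in range(n_frames)]  # back-pointers
    best[0][0] = frame_scores[0][phones[0]]
    for t in range(1, n_frames):
        for i in range(n_phones):
            stay = best[t - 1][i]  # remain in the same phone
            advance = best[t - 1][i - 1] if i > 0 else neg_inf  # move to next phone
            prev = max(stay, advance)
            if prev == neg_inf:
                continue  # this cell cannot be reached
            best[t][i] = prev + frame_scores[t][phones[i]]
            stayed[t][i] = stay >= advance
    # Trace back from the last frame, which must belong to the last phone.
    path = [0] * n_frames
    i = n_phones - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = i
        if t > 0 and not stayed[t][i]:
            i -= 1
    # Collapse runs of identical phone indices into time-coded segments.
    segments: List[Tuple[str, int, int]] = []
    start = 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            segments.append((phones[path[start]], start, t))
            start = t
    return segments


# Toy usage: eight frames whose "heard" labels are invented for illustration.
phones = transcript_to_phones(["The", "cat"], TOY_LEXICON)
inventory = sorted(set(phones))
heard = ["DH", "AH", "AH", "K", "AE", "AE", "AE", "T"]
frame_scores = [
    {p: (0.0 if p == label else -5.0) for p in inventory} for label in heard
]
segments = forced_align(phones, frame_scores)
for phone, start, end in segments:
    print(phone, start, end)
```

In a real system each frame would cover a fixed slice of audio (often around ten milliseconds), so frame indices convert directly into the time codes a search engine needs.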
Performing forced alignment of mass quantities of transcribed speech has required the Mining team to deal with a number of complicating factors, including incomplete transcriptions, the existence of transcriptions for which the corresponding audio has been lost, varied dialects, varied qualities of audio data, the presence of background noise, moments when multiple speakers speak simultaneously, the use of “non-standard” words that cannot be found in a phonetic dictionary, and the presence of “disfluencies” such as “um” or “uh.” Each of these factors affects the accuracy of forced alignment. As they describe in their white paper for their ongoing project, the Mining team has already managed to address disfluencies and dialect differences sufficiently to allow reasonably accurate alignments. Producing phonetic transcriptions of non-standard words for the diverse corpora included in the project was also a greater challenge than anticipated, requiring the generation of multiple word and phoneme pairings and training the alignment software to select the most likely one. Once they addressed these issues, it was possible to align transcripts and audio for hundreds of hours of news broadcasts, two-thirds of the U.S. Supreme Court arguments and the BNC, two hundred fifty hours of two-party telephone conversations, and more.
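Selecting among several candidate pronunciations for a non-standard word can be sketched in the same spirit. The following is a minimal illustration under stated assumptions, not the team's trained procedure: the candidate pronunciations for "gonna", the function names, and the frame scores are all hypothetical. Each candidate is scored by its best left-to-right alignment against the audio frames, and the highest-scoring candidate is kept.

```python
from typing import Dict, List


def alignment_score(phones: List[str], frame_scores: List[Dict[str, float]]) -> float:
    """Best total log-score of aligning phones, in order, to all frames."""
    neg_inf = float("-inf")
    if not phones or len(frame_scores) < len(phones):
        return neg_inf  # cannot give every phone at least one frame
    # prev[i] = best score so far with the latest frame assigned to phone i
    prev = [neg_inf] * len(phones)
    prev[0] = frame_scores[0][phones[0]]
    for frame in frame_scores[1:]:
        cur = [neg_inf] * len(phones)
        for i, phone in enumerate(phones):
            best_prev = max(prev[i], prev[i - 1] if i > 0 else neg_inf)
            if best_prev > neg_inf:
                cur[i] = best_prev + frame[phone]
        prev = cur
    return prev[-1]


def pick_pronunciation(
    candidates: List[List[str]], frame_scores: List[Dict[str, float]]
) -> List[str]:
    """Return the candidate whose best alignment scores highest."""
    return max(candidates, key=lambda c: alignment_score(c, frame_scores))


# Hypothetical candidates for the non-standard word "gonna".
candidates = [["G", "AA", "N", "AH"], ["G", "AH", "N", "AH"]]
inventory = sorted({p for c in candidates for p in c})
heard = ["G", "AH", "AH", "N", "AH"]  # invented frame labels for illustration
frame_scores = [
    {p: (0.0 if p == label else -5.0) for p in inventory} for label in heard
]
chosen = pick_pronunciation(candidates, frame_scores)
print(chosen)
```

A production aligner would draw its scores from acoustic models and might also weight candidates by how common each pronunciation is; neither refinement is shown here.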
Other issues have proved more obstreperous. For example, adapting the alignment software to cope with missing portions of audio was more important for this project than researchers initially anticipated, so the team has had to divert some attention to finding ways to identify portions of transcripts that have no corresponding audio. Another challenging problem arises from the need to preserve the anonymity of the people whose speech is included in the BNC: many of the instances in which speakers disclose personally identifiable information are marked in the transcripts, but it will be necessary for those portions of the audio files to be cut or altered (“bleeped out”) before the corpus is made available in searchable form. Still another challenge that continues to occupy the researchers is how to devise a way to measure the accuracy of alignment across large corpora, so that future users of the audio search engine can tailor their search methods to retrieve the audio they need as precisely and comprehensively as possible.
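One simple way to quantify alignment accuracy, when a hand-labelled reference exists for a sample of the data, is to count how many automatically placed boundaries fall within a small tolerance of the reference boundaries. The sketch below shows that measure only as an illustration; it is not the method the team adopted, and the function name, the 20-millisecond tolerance, and the example times are assumptions.

```python
from typing import List


def boundary_agreement(
    predicted: List[float], reference: List[float], tolerance: float = 0.02
) -> float:
    """Fraction of predicted boundary times (in seconds) that fall within
    `tolerance` seconds of the corresponding reference boundary."""
    if len(predicted) != len(reference):
        raise ValueError("boundary lists must be the same length")
    if not reference:
        raise ValueError("need at least one boundary")
    hits = sum(1 for p, r in zip(predicted, reference) if abs(p - r) <= tolerance)
    return hits / len(reference)


# Invented example: four boundaries, one of which is badly misplaced.
predicted = [0.10, 0.31, 0.58, 0.90]
reference = [0.10, 0.30, 0.50, 0.91]
agreement = boundary_agreement(predicted, reference)
print(agreement)
```

A measure like this depends on having manually checked samples, which is precisely what is scarce at the scale of thousands of hours, so it can only estimate accuracy for the corpus as a whole.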
Work on Mining a Year of Speech continues,5 and it is clear that refinement of the forced alignment algorithms used by this team will be required before the thornier issues affecting its accuracy are resolved. Some sources of error will never be eliminated; instead, as the team argues compellingly in their white paper, scholars will need to devise statistically valid ways to account for them as they use large-scale data resources for their work. Nevertheless, the team has already demonstrated that automatic alignment of large-scale transcribed audio is possible, and that soon web-based searching of automatically aligned corpora will be a reality. With more and more digital audio and video becoming freely available over the web, not to mention the vast quantities of analog data held in libraries throughout the world awaiting digitization, this project will make it possible for future scholars to use this information in ways we can only begin to imagine.
4 John Coleman comments, “I was astonished that the quantities of data involved in audio collections worldwide are on the same order of magnitude as ‘big science’ projects from the Human Genome project through the Hubble Space Telescope data right up to the scale of data collected over several years by the Large Hadron Collider…this insight will be useful for libraries, archives and humanities scholars in making the case for financial support for research needs that are in some respects comparable to ‘big science’ projects.”
5 There was a nine-month delay between the times the participating partners in this project received their funding from their respective agencies. This meant that the research timeline extended beyond other projects funded through the Digging into Data program, and beyond the time allowed for composition of this report.
- Lou Burnard (University of Oxford, UK) is Assistant Director of Oxford Computing Services and has long been responsible for the distribution and maintenance of the British National Corpus, the dataset at the heart of the JISC-funded portion of the project.
- Christopher Cieri (University of Pennsylvania, US) is the Executive Director of the Linguistic Data Consortium at the University of Pennsylvania and contributed administratively and substantively to the project. He is an expert in corpus-based phonetics.
- John Coleman (University of Oxford, UK) is Professor of Phonetics and served as Principal Investigator for the JISC-funded portion of the project, based at Oxford’s Phonetics Laboratory, which he directs.
- Sergio Grau (University of Oxford, UK) is Research Fellow at University of Oxford and performed most of the analysis on the British National Corpus data for the project.
- Gregory Kochanski (University of Oxford, UK) is Senior Research Fellow at Oxford’s Phonetics Laboratory and contributed subject and analytical expertise to the project.
- Mark Liberman (University of Pennsylvania, US) served as Principal Investigator for the NSF-funded portion of the project, based at the Linguistic Data Consortium.
- Ladan Baghai-Ravary (University of Oxford, UK) is Research Fellow at Oxford’s Phonetics Laboratory and an expert in the engineering of speech recognition and alignment technologies.
- Jonathan Robinson (British Library, UK) is Lead Content Specialist in Sociolinguistics and Education at the Social Sciences Collections and Research department of the British Library and contributed technical, managerial, and subject expertise to the project.
- Joanne Sweeney (British Library, UK) is Content Specialist in the Social Sciences Collections and Research department of the British Library and contributed technical expertise and support to the project.
- Jiahong Yuan (University of Pennsylvania, US) is Assistant Professor of Linguistics and the developer of the Penn Phonetics Lab Forced Aligner, a tool that was adapted and used extensively for the project.
Lectures and talks
“Mining a Year of Speech,” at “New Tools and Methods for Very-Large-Scale Phonetics Research,” a workshop at the University of Pennsylvania organized by our collaborators Jiahong Yuan and Mark Liberman. http://www.phon.ox.ac.uk/jcoleman/MiningVLSP.pdf
Baghai-Ravary, Ladan, Sergio Grau and Greg Kochanski. “Detecting gross alignment errors in the Spoken British National Corpus” (poster). 10/19/2010.
Coleman, John. “Large-scale computational research in Arts and Humanities, using mostly unwritten (audio/visual) media” (invited lecture), at the Universities UK-sponsored “Future of Research” conference.
The British National Corpus Spoken Audio Sampler continues to be actively developed and extended; in 2012 the team will be releasing a further substantial portion of the Spoken BNC. Even more extensive sections of the American English corpora used in Mining a Year of Speech have been published by the Linguistic Data Consortium.
Tools and documentation
A tool vital to all the work of this project is the Penn Phonetics Lab Forced Aligner, which was developed prior to this project.