Harvesting Speech Datasets for Linguistic Research on the Web

The second audio-centric project among the Digging into Data pioneer initiatives is another methodologically focused linguistics project, Harvesting Speech Datasets for Linguistic Research on the Web, led by researchers at Cornell and McGill universities. Rather than devising a method for conducting analysis of specific transcribed speech corpora, however, the Harvesting partners focused on demonstrating a way to explore linguistic theories at “web scale,” effectively opening up all web-based audio corpora for scholarly investigation. To handle what is for all practical purposes a limitless (and ever-changing) resource, the Harvesting team uses the power of computing to collect and organize data for testing targeted, specific theories.

The Harvesting team’s expertise is primarily in the area of prosody, or the combination of rhythm, stress, and intonation that speakers use when saying the same word sequences in different contexts. To use an example from their project proposal, a theory of prosody called “contrastive prominence” would explain the difference between the stress patterns for the phrase “than I did” in the following two contexts:

a. She did more than I did.
b. I wish I had done more than I did.

To study these kinds of linguistic theories, researchers need many different examples (or “tokens”) of the same short word sequence. Traditional methods involve producing these tokens in a laboratory, but audio data available on the web offers researchers an alternative: retrieving them from “the wild.”

Just as with the team from Mining a Year of Speech, the partners on the Harvesting project have necessarily chosen to work with audio recordings for which there are pre-existing transcriptions. However, rather than relying on human-made transcriptions, Harvesting researchers relied on transcriptions produced through automatic speech recognition (ASR). Since they are reliant on still-maturing technologies, automatically generated transcriptions contain frequent errors, but the accuracy of ASR transcriptions “is often better than 50% at the level of short, common word sequences.” For this reason, automatic transcriptions are suitable for the kinds of research that most interest the Harvesting partners, when the goal is not to obtain an exhaustive listing of every possible use of a given phrase but simply to assemble enough tokens with which to test key theories about prosody.

The partners’ first goal was to develop and demonstrate a reliable technique for harvesting and analyzing audio snippets from the web. As the team explains in detail in the white paper, their harvesting technique is a multi-step process that combines computationally intensive and manual work. First, the researchers use command-line programs to create a list of potential “hits” for a particular target word or phrase in a web-based audio repository, including portions of the transcript that both precede and follow those hits. Using a second script, they retrieve the relevant audio files. The two scripts that identify and retrieve audio data must be tailored to work with each online resource from which data is harvested. The researchers use a third script to cut segments from the harvested audio that surround the target word or phrase. Using an alignment method that is essentially an earlier version of the forced aligner employed in Mining a Year of Speech, the team then matches the retrieved transcriptions to their corresponding sounds in the audio snippets. Finally, they use two different sets of machine learning techniques to analyze and classify each snippet into two pre-defined categories according to its specific prosodic features.6
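The first and third steps described above (listing hits with their surrounding transcript, then choosing the stretch of audio to cut) can be illustrated with a minimal sketch. This is not the team's code: it assumes a transcript that already carries a rough start and end time for each word, and the names `Word`, `Hit`, `find_hits`, `context`, and `padding` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One transcript word with rough timing in seconds (assumed input format)."""
    text: str
    start: float
    end: float


@dataclass
class Hit:
    """A candidate token: the target phrase, its context, and a clip window."""
    left_context: str
    target: str
    right_context: str
    clip_start: float
    clip_end: float


def find_hits(words, phrase, context=3, padding=0.5):
    """Return every occurrence of `phrase` in `words`.

    `context` is how many transcript words to keep on each side of the hit;
    `padding` is how many seconds of audio to keep around the phrase.
    Both are illustrative parameters, not values from the project.
    """
    target = phrase.lower().split()
    n = len(target)
    hits = []
    for i in range(len(words) - n + 1):
        # Compare case-insensitively and ignore trailing punctuation.
        window = [w.text.lower().strip(".,?!") for w in words[i:i + n]]
        if window != target:
            continue
        left = words[max(0, i - context):i]
        right = words[i + n:i + n + context]
        hits.append(Hit(
            left_context=" ".join(w.text for w in left),
            target=" ".join(w.text for w in words[i:i + n]),
            right_context=" ".join(w.text for w in right),
            clip_start=max(0.0, words[i].start - padding),
            clip_end=words[i + n - 1].end + padding,
        ))
    return hits


# Example transcript for sentence (a), with made-up timings.
transcript = [
    Word("She", 0.0, 0.2), Word("did", 0.2, 0.4), Word("more", 0.4, 0.7),
    Word("than", 0.7, 0.9), Word("I", 0.9, 1.0), Word("did.", 1.0, 1.3),
]
hits = find_hits(transcript, "than I did", context=2)
```

Each resulting `Hit` carries enough information for a later step to cut the clip from the downloaded audio and pass it, with its words, to an aligner. The padding exists because rough transcript timings are unlikely to match word boundaries exactly.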

The partners divided the responsibility for each step in the process according to their own areas of expertise: Mats Rooth at Cornell handled the programming required for the initial harvest of target files; Michael Wagner at McGill handled the alignment of the transcription and audio data; and Jonathan Howell, a recent Cornell graduate who worked under Rooth and who now holds a postdoctoral fellowship at McGill, handled the machine classification of the retrieved, aligned audio files. This division of labor allowed each researcher to work somewhat independently, although they communicated frequently throughout the project. In a group interview with CLIR representatives, the team observed that having one member with experience working at both partner institutions was a key advantage.

The group’s second goal was to compare the implications of web-harvested data for specific linguistic theories to the implications derived from data produced in a traditional manner, in a laboratory. The McGill-based partners assumed responsibility for producing lab data that corresponded to the harvested data sets. The team was pleased to find that the lab data supported most of the same conclusions as the web-harvested data. The team took this as an affirmation of the validity of traditional data collection methods. Said Michael Wagner, “This is a little bit surprising, and a nice result because the main acoustic cues are working both in the lab and in nature. Our lab data are almost as good [as data produced ‘in the wild.’]”

To meet the final goal for their project, the team will harvest datasets from multiple audio repositories corresponding to a variety of different theories of prosody, such as the aforementioned theory of contrastive prominence (the two stress patterns of “than I did”). At the time of writing, work on producing these datasets is still underway. It deserves mention that these research outcomes will be as important for facilitating future research as the team’s publications and conference presentations, if not more so. Also among their most significant contributions is the series of computer programs they have created, which they have made freely available to other researchers on their personal and institutional websites.

One major challenge to their research, completely outside the researchers’ control, deserves mention: the web-based audio repositories with which the team has chosen to work were not always reliably accessible, with at least one key repository vanishing from the web without warning during the course of their work. For this reason, researchers working with “web scale” data must be prepared to adapt to the ever-changing nature of that data. Nevertheless, as ASR technologies continue to improve, including automatic “word spotting” techniques that can perform searches of audio data without recourse to transcriptions, it is easy to imagine the method demonstrated in this project being employed at greater and greater scales. Both separately and together, the harvesting and machine analysis techniques now under development will have immediate applications in related academic disciplines such as speech recognition and artificial intelligence, as well as more long-term value in the humanities and social sciences broadly construed. In time, these methods will make the ever-expanding masses of audio and audiovisual data accessible to the general public in ways that were previously impossible.

6 These techniques are much more thoroughly explained in the project white paper, to be released after the conclusion of this project’s term in July 2012.

Project Participants

  • Mats Rooth (Cornell University, US) served as Principal Investigator for the NSF-funded portion of the project. A computational linguist, he was responsible for working with graduate and undergraduate students at Cornell to design and implement the harvesting methodology used for the project.
  • Michael Wagner (McGill University, Canada), a linguist, served as Principal Investigator for the SSHRC-funded portion of the project and was responsible for leading the analysis of data harvested during the course of the project, which included the comparison of results of computational statistical analysis with analysis using traditional formal-linguistics methodologies.
  • Jonathan Anthony Howell (McGill University, Canada) is a postdoctoral fellow who specializes in statistical and machine learning methodologies for phonetic analysis. His doctoral dissertation project formed the basis for the collaboration funded through the Digging into Data program.

Project Outcomes

Related Publications

Gorman, Kyle, Jonathan Howell and Michael Wagner (forthcoming). “Prosodylab-Aligner: A tool for forced alignment of laboratory speech.” Proceedings of Acoustics Week, the annual conference of the Canadian Acoustical Association (draft here).

Howell, Jonathan and Mats Rooth. 2009. “Web Harvest of Minimal Intonational Pairs.” Open access pre-print.

Howell, Jonathan, Mats Rooth, and Michael Wagner. 2011. “Acoustic classification of focus in a web corpus of comparatives.” Presented at New Tools and Methods for Very-Large-Scale Phonetics Research, University of Pennsylvania, January 28-31. Open access pre-print available on Howell’s website: Penn_presentation.pdf

Related Posters and Presentations

Howell, Jonathan and Mats Rooth. “A corpus search methodology for focus realization” (poster). 157th Meeting of the Acoustical Society of America. Abstract appears in J. Acoust. Soc. Am. Volume 125, Issue 4, pp. 2573-2573.

Prosody Datasets
