By Charles Henry
The term big data is popular today, as our world accumulates unprecedented amounts of information: approximately 2.5 quintillion bytes per day (2.5 × 10^18). Fields of study that contribute to this enormous burgeoning of data include astronomy, genomics, climatology, and medicine. A favorite example that helps conceptualize the scale of big data production is the Large Hadron Collider (LHC). It has 150 million sensors that detect and capture the information generated by particle collisions; as many as 40 million collisions can occur per second in LHC experiments. The amount of data thus generated is impossible to analyze in full, so physicists sift and filter out 99.999% of the information generated and analyze only the remaining 0.001%. Even at this tiny fraction of capture, each large experiment generates enough data to fill 100,000 DVDs.
The ocean of data we find ourselves in is often defined as a unique aspect of contemporary life, with important implications for us regarding our judgment, intuitions, and biases: how we interpret the world and how we make decisions. CLIR is building new programs to address the challenge of data and ways to best curate, preserve, and reuse data in support of research and teaching. Several books and articles underscore the increasing complexity and depth of this challenge.
Ian Ayres, in Super Crunchers, makes an urgent case for all of us to become statistically literate in order to succeed in this new world. The subtitle of his work succinctly summarizes his main point: Why Thinking-by-Numbers Is the New Way to Be Smart. Similarly, in Big Data, authors Viktor Mayer-Schönberger and Kenneth Cukier contend that our traditional reliance on gut feelings, instinct, and past experience is often insufficient and can even be misleading. Our expertise, based on these traditional faculties, can be upended and supplanted by statistical analyses of big data.
Big Data’s authors cite the film Moneyball as a classic example. In a memorable scene, older baseball scouts review possible candidates for the team based on the candidates’ facial appeal (and in one instance the lack of appeal of a girlfriend), the sound of the bat as it hits a ball, physical stature, and the like. These opinions and judgments are summarily swept aside with a rigorous and objective analysis of pertinent statistics of the players in question, a mathematical analysis that produces a championship team. Other interesting examples in Big Data that underscore the shift from the visceral to machine logic include Amazon’s reliance on algorithms to rate books, which replaced the previous use of rankings garnered from living, breathing readers; the use of The-numbers.com to not only follow the box office success of movies but also to predict profitability before a movie is made; and Zynga’s ability to modify online games as they are played, based on data analysis, and to modify games according to the ability and style of individual players. The older process of building a game, then gathering the written/spoken responses of the players to assist in designing new versions, has been replaced by analyzing data that is generated directly by the players’ actions. The “middle voice,” as it were, has been eliminated.
Another deeply interesting book on data is Paul Edwards’ A Vast Machine, which traces the history of meteorological data gathering from the 1800s to the global capture of weather data in the 20th century through an immense array of satellites, sensors, balloons, radar, air stations, weather ships, and planes around the world. With the development of supercomputers in the later 20th century and into the 21st, this globally captured data was modeled, using powerful simulation and analysis tools, to produce scenarios that give us insight into broader climate patterns and climate change.
In an earlier period of meteorological forecasting, data and human interpretive skills were in synch: “Analyzing the charts required heuristic principles and experience-based intuition as well as algorithms and calculation” (257). Edwards’ main point is that no one, no team or small army of human analysts, can cogently and effectively visualize the amount of meteorological data that has been gathered, and is being gathered, into predictive, trusted climate models. Only machines can do this.
And in so doing they build for us “a stable, reliable knowledge of climate change” (398). This modeling of data is essentially a knowledge production process. Those who deny climate change often insist on a scientific methodology that is anchored in data and experiment, not in models, which they deem theoretical and therefore unrealistic and untrustworthy. The dichotomy is false: the models are all we have, the only means of accurately visualizing and predicting what lies ahead. Human cognition and rigorous heuristic skills are nonetheless integral to our knowledge of climate: the formal interpretation of these models is conducted by a small army of scientists from around the world, the Intergovernmental Panel on Climate Change (IPCC). The most recent report (2013) states with very high confidence that the world is warming more quickly than previously predicted, and that there is no doubt that much of the cause is anthropogenic.
Data and its modeling do not supplant human judgment, but they certainly instigate (or should) a rethinking and re-evaluation of how we understand and respond to our world. Our actions are in this way tied directly to a vast preponderance of data, so vast that only super powerful machines can begin to make sense of it. But it is just a beginning.