The Predictive Power of Big Data

The following is an excerpt from Uncharted: Big Data as a Lens on Human Culture, by Erez Aiden and Jean-Baptiste Michel.

In computer science, the unit used to measure information is the bit, short for “binary digit.” You can think about a single bit as the answer to a yes-or-no question, where 1 is yes and 0 is no. Eight bits is called a byte.

Right now, the average person’s data footprint—the annual amount of data produced worldwide, per capita—is just a little short of one terabyte. That’s equivalent to about eight trillion yes-or-no questions. As a collective, that means humanity produces five zettabytes of data every year: 40,000,000,000,000,000,000,000 (forty sextillion) bits.
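The conversions above are plain arithmetic, and can be checked directly. The sketch below uses decimal units (1 terabyte = 10^12 bytes), matching the round figures in the text:

```python
# Back-of-the-envelope check of the figures above.
TERABYTE_BYTES = 10**12   # decimal units, as in the text
ZETTABYTE_BYTES = 10**21
BITS_PER_BYTE = 8

# One terabyte, expressed as yes-or-no questions (bits):
tb_bits = TERABYTE_BYTES * BITS_PER_BYTE
print(f"{tb_bits:,} bits per terabyte")      # 8,000,000,000,000 -> "eight trillion"

# Humanity's annual output of five zettabytes, in bits:
annual_bits = 5 * ZETTABYTE_BYTES * BITS_PER_BYTE
print(f"{annual_bits:,} bits per year")      # forty sextillion
```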

Such large numbers are hard to fathom, so let’s try to make things a bit more concrete. If you wrote out the information contained in one megabyte by hand, the resulting line of 1s and 0s would be more than five times as tall as Mount Everest. If you wrote out one gigabyte by hand, it would circumnavigate the globe at the equator. If you wrote out one terabyte by hand, it would extend to Saturn and back twenty-five times. If you wrote out one petabyte by hand, you could make a round trip to the Voyager 1 probe, the most distant man-made object in the universe. If you wrote out one exabyte by hand, you would reach the star Alpha Centauri. If you wrote out all five zettabytes that humans produce each year by hand, you would reach the galactic core of the Milky Way. If instead of sending e-mails and streaming movies, you used your five zettabytes as an ancient shepherd might have—to count sheep—you could easily count a flock that filled the entire universe, leaving no empty space at all.
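The first two distances can be sanity-checked if we pick a width for each handwritten digit; the 5 mm figure below is an assumption of ours, not something the book states:

```python
# Rough check of the handwritten-bit distances, assuming each written
# 0 or 1 is about 5 mm wide (our assumption; the text gives no width).
DIGIT_M = 0.005
BITS_PER_BYTE = 8

def line_length_m(n_bytes):
    """Length in metres of a handwritten line of bits for n_bytes of data."""
    return n_bytes * BITS_PER_BYTE * DIGIT_M

everest_m = 8_849        # height of Mount Everest, metres
equator_m = 40_075_000   # Earth's equatorial circumference, metres

print(line_length_m(10**6) / everest_m)   # one megabyte: roughly 4-5 Everests
print(line_length_m(10**9) / equator_m)   # one gigabyte: about one lap of the equator
```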

This is why people call these sorts of records big data. And today’s big data is just the tip of the iceberg. The total data footprint of Homo sapiens is doubling every two years, as data storage technology improves, bandwidth increases, and our lives gradually migrate onto the Internet. Big data just gets bigger and bigger and bigger.

THE DIGITAL LENS

Arguably the most crucial difference between the cultural records of today and those of years gone by is that today’s big data exists in digital form. Like an optic lens, which makes it possible to reliably transform and manipulate light, digital media make it possible to reliably transform and manipulate information. Given enough digital records and enough computing power, a new vantage point on human culture becomes possible, one that has the potential to make awe-inspiring contributions to how we understand the world and our place in it.

Consider the following question: Which would help you more if your quest were to learn about contemporary human society—unfettered access to a leading university’s department of sociology, packed with experts on how societies function, or unfettered access to Facebook, a company whose goal is to help mediate human social relationships online?

On the one hand, the members of the sociology faculty benefit from brilliant insights culled from many lifetimes dedicated to learning and study. On the other hand, Facebook is part of the day-to-day social lives of a billion people. It knows where they live and work, where they play and with whom, what they like, when they get sick, and what they talk about with their friends. So the answer to our question may very well be Facebook. And if it isn’t—yet—then what about a world twenty years down the line, when Facebook or some other site like it stores ten thousand times as much information, about every single person on the planet?

These kinds of ruminations are starting to cause scientists and even scholars of the humanities to do something unfamiliar: to step out of the ivory tower and strike up collaborations with major companies. Despite their radical differences in outlook and inspiration, these strange bedfellows are conducting the types of studies that their predecessors could hardly have imagined, using datasets whose sheer magnitude has no precedent in the history of human scholarship.

Jon Levin, an economist at Stanford, teamed up with eBay to examine how prices are established in real-world markets. Levin exploited the fact that eBay vendors often perform miniature experiments in order to decide what to charge for their goods. By studying hundreds of thousands of such pricing experiments at once, Levin and his co-workers shed a great deal of light on the theory of prices, a well-developed but largely theoretical subfield of economics. Levin showed that the existing literature was often right—but that it sometimes made significant errors. His work was extremely influential. It even helped him win a John Bates Clark Medal—the highest award given to an economist under forty and one that often presages the Nobel Prize.

A research group led by UC San Diego’s James Fowler partnered with Facebook to perform an experiment on sixty-one million Facebook members. The experiment showed that a person was much more likely to register to vote after being informed that a close friend had registered. The closer the friend, the greater the influence. Aside from its fascinating results, this experiment—which was featured on the cover of the prestigious scientific journal Nature—ended up increasing voter turnout in 2010 by more than three hundred thousand people. That’s enough votes to swing an election.

Albert-László Barabási, a physicist at Northeastern, worked with several large phone companies to track the movements of millions of people by analyzing the digital trail left behind by their cell phones. The result was a novel mathematical analysis of ordinary human movement, executed at the scale of whole cities. Barabási and his team got so good at analyzing movement histories that, occasionally, they could even predict where someone was going to go next.

Inside Google, a team led by software engineer Jeremy Ginsberg observed that people are much more likely to search for influenza symptoms, complications, and remedies during an epidemic.

They made use of this rather unsurprising fact to do something deeply important: to create a system that looks at what people in a particular region are Googling, in real time, and identifies emerging flu epidemics. Their early warning system was able to identify new epidemics much faster than the U.S. Centers for Disease Control could, despite the fact that the CDC maintains a vast and costly infrastructure for exactly this purpose.

Raj Chetty, an economist at Harvard, reached out to the Internal Revenue Service. He persuaded the IRS to share information about millions of students who had gone to school in a particular urban district. He and his collaborators then combined this information with a second database, from the school district itself, which recorded classroom assignments. Thus, Chetty’s team knew which students had studied with which teachers. Putting it all together, the team was able to execute a breathtaking series of studies on the long-term impact of having a good teacher, as well as a range of other policy interventions. They found that a good teacher can have a discernible influence on students’ likelihood of going to college, on their income for many years after graduation, and even on their likelihood of ending up in a good neighborhood later in life. The team then used its findings to help improve measures of teacher effectiveness. In 2013, Chetty, too, won the John Bates Clark Medal.

And over at the incendiary FiveThirtyEight blog, a former baseball analyst named Nate Silver has been exploring whether a big data approach might be used to predict the winners of national elections. Silver collected data from a vast number of presidential polls, drawn from Gallup, Rasmussen, RAND, Mellman, CNN, and many others. Using this data, he correctly predicted that Obama would win the 2008 election, and accurately forecast the winner of the Electoral College in forty-nine states and the District of Columbia. The only state he got wrong was Indiana. That doesn’t leave much room for improvement, but the next time around, improve he did. On the morning of Election Day 2012, Silver announced that Obama had a 90.9 percent chance of beating Romney, and correctly predicted the winner of the District of Columbia and of every single state—Indiana, too.
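The flavor of poll aggregation can be shown in miniature: average several polls of one race, treat their spread as uncertainty, and simulate outcomes to turn a polling margin into a win probability. This toy is our own illustration, not Silver's model, which weights polls by sample size, recency, and house effects, and models correlations across states.

```python
# Toy poll aggregation (hypothetical numbers): average the polls, then
# simulate outcomes to convert the margin into a win probability.
import random
import statistics

polls = [52.1, 50.8, 51.5, 49.9, 51.2]   # hypothetical candidate share (%)
mean = statistics.mean(polls)
stdev = statistics.stdev(polls)

random.seed(0)                            # reproducible simulation
TRIALS = 100_000
wins = sum(random.gauss(mean, stdev) > 50.0 for _ in range(TRIALS))
print(f"estimated win probability: {wins / TRIALS:.1%}")
```

With a mean share just above 50 percent and a modest spread, the simulated probability lands around ninety percent: a comfortable favorite, yet nothing like a certainty, which is exactly the kind of statement a forecast of this sort produces.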

The list goes on and on. Using big data, the researchers of today are doing experiments that their forebears could not have dreamed of.
