Science's Reproducibility Problem: 100 Psych Studies Were Tested and Only Half Held Up

A team of researchers looking at published scientific studies found that over half of the studies could not be replicated to produce the same results as those found in the original reports. The new findings raise questions about the research and reporting methods used and accountability for published studies.

Most rational humans hold some faith in science. We trust that scientists, through their hard-won expertise, are well equipped to conduct studies that provide proof of why things are the way they are, help solve problems and explain mysteries. Much of that trust is built on an implicit belief that research findings are concrete truths—in other words, that if given the same set of parameters, they would be easy to reproduce. This type of replication is essential to science because it validates key discoveries and helps scientists make progress in their fields of research.

But it turns out many study findings are nearly impossible to replicate.

"The challenges of reproducibility are pervasive across all scientific disciplines," says Brian Nosek, a psychology professor at the University of Virginia and coordinator of a recent study that sought to replicate the findings of 100 already published studies. The enormous project, which involved 270 researchers on five continents, is part of an effort from the Center for Open Science, a nonprofit technology company directed by Nosek that aims to increase transparency and reproducibility in scientific research.

For the project, researchers who had not worked on the original studies selected papers to test out that were published in three prominent journals: Psychological Science, the Journal of Personality and Social Psychology and the Journal of Experimental Psychology: Learning, Memory, and Cognition. The first is a premier outlet for all psychological research; the others are leading journals for social psychology and cognitive psychology.

The researchers discovered they could replicate less than half of the original findings, which raises the question of how the original researchers arrived at their conclusions (and formed their hypotheses) in the first place. The researchers also examined statistical significance, a measure of how unlikely it is that a result arose from random chance alone. Scientists assess this by calculating the p-value: the probability of observing results at least as extreme as those found, assuming there is no real effect. Before a study is performed, researchers select a threshold value, or the significance level of the test (usually 5 percent or 1 percent).

If the p-value ends up being equal to or smaller than the threshold value, the findings are deemed statistically significant. Among the original studies, 97 percent reported statistically significant results, but when the researchers tried to reproduce them, just 36 percent had statistically significant results.
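The threshold logic described above can be sketched with a small permutation test, one simple, simulation-based way to estimate a p-value. The data and function names below are hypothetical illustrations, not drawn from any of the studies discussed:

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a p-value for the difference in means between two groups.

    The p-value is the fraction of random relabelings of the pooled data
    that produce a mean difference at least as extreme as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly relabel which scores go in which group
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / len(group_b))
        if diff >= observed:
            count += 1
    return count / n_permutations

# Hypothetical scores from two experimental conditions.
treatment = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
control = [4.2, 4.8, 4.5, 4.0, 4.7, 4.4]

p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}")
print("significant at the 5% level" if p <= 0.05 else "not significant")
```

With these made-up numbers the two groups barely overlap, so very few relabelings match the observed difference and the estimated p-value falls well below the usual 5 percent threshold. Identical groups, by contrast, would yield a p-value of 1.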

It's difficult to tell exactly why so many study findings appear to be irreproducible, Nosek says. "One possibility was that I was wrong; we had a false positive. The second possibility is the replication team had a false negative. The third possibility is both are correct, but we did not recognize in doing the replication that there is an important methodological difference. It's very difficult to distinguish between these possibilities."

Nosek suggests false positives and negatives occur when researchers feel compelled to tell a powerful narrative, the kind that can make a career. Some might argue that flibanserin, the drug approved to treat low libido in women, passed scrutiny by the Food and Drug Administration because researchers wove a memorable story from tenuous science. Under this pressure, many researchers want to produce work that is memorable, with a happy or surprise ending ("positive or novel results") and a neat and purposeful narrative arc.

Nosek says he struggles with this in his own work. All scientists need to reference the studies that led them to the hypothesis they are testing, and it can be tempting to leave out another researcher's paper if omitting it makes one's own paper look stronger and more persuasive.

"What's best for researchers isn't necessarily what's best for science," says Nosek, who noted that scientists' careers depend upon publication in reputable and high-profile journals. "The incentives are pushing us toward clean, beautiful stories. That is a wonderful thing when it's achieved," but it's not exactly easy when you are working through difficult problems, perhaps for the first time in history.

The topics and methodologies of the 100 studies in question varied widely. One study, for example, examined the hypothesis that people are more willing to share their opinions with a group if they perceive that the opinions of others closely match their own. The researchers tested this hypothesis during the 2004 presidential election campaign by tallying the number of political bumper stickers with liberal or conservative messages in counties known to vote reliably for Democratic or Republican candidates. The results of that study—it's true!—turned out to be reproducible.

Another published study tested out by the researchers explored whether men have a harder time than women distinguishing between sexual cues versus mere friendliness while interacting with a person of the opposite sex. That study used photos of women with facial expressions that depicted certain emotions or feelings, such as friendly, sexually interested, sad or rejected. The study concluded that, indeed, men weren't as well equipped to distinguish these subtle differences. However, researchers who attempted to replicate the study didn't reach the same conclusion.

"An explanation for this lack of replication may be due to cultural differences of participants based in the USA versus the UK," the researchers explain in their paper that summed up their attempt at replication. "Alternatively, the different time periods during which the data were collected may have driven the differential findings, due to the images appearing dated in the current study."

The peer review process at scientific journals allows for some outside scrutiny. But the problem with the peer review process, says Nosek, is that it occurs too late in the game. "When I get new data sets and I analyze them, I have many different choices for how to analyze them," he says. "The peer reviewer only gets to look at the results when they're known."

Journals are well aware of the problem. Last year, Psychological Science began making study materials and data openly available to other researchers. The Center for Open Science is working on a number of ways to increase transparency and reproducibility in scientific research. One example is looking for ways to change the peer review process so it happens earlier in the game—a review of the study's design and methodologies before the researchers actually begin the work.

The problem can also be frustrating to the general public, which wants to rely on the latest research to make decisions about health. For just about every study that makes the claim that something can harm you, there's another that says that same thing will help you live longer. But this isn't necessarily a bad thing. "It can mean the research process has improved," says Nosek, adding that it creates more precise conclusions. For example: "Wine may be good for you in some circumstances but not in other circumstances."