A Blog by Ed Yong

Welcome To The Era of Big Replication

Psychologists have been sailing through some pretty troubled waters of late. They’ve faced several cases of fraud, high-profile failures to repeat the results of classic experiments, and debates about commonly used methods that are recipes for sexy but misleading results. The critics, many of whom are psychologists themselves, say that these lines of evidence point towards a “replicability crisis”, where an unknown proportion of the field’s results simply aren’t true.

To address these concerns, a team of scientists from 36 different labs joined together, like some sort of verification Voltron, to replicate 13 experiments from past psychological studies. They chose experiments that were simple and quick to do, and merged them into a single online package that volunteers could finish in just 15 minutes.

They then delivered their experimental smorgasbord to 6,344 people from 36 different groups and 12 different countries.

This is Big Replication—scientific self-correction on a massive scale.

I’ve written about this “Many Labs Replication Project” over at Nature News, so head over there for more details and viewpoints from the psychological community. The project was coordinated by Richard Klein and Kate Ratliff from the University of Florida, Michelangelo Vianello from the University of Padua, and Brian Nosek from the Center for Open Science.

Here are the main results.

First, 10 of the 13 effects replicated. That’s certainly encouraging after months of battering.

One of the 13 was on the fence—the “imagined contact” effect, where imagining contact with people from other ethnic groups reduces prejudice towards them. It’s hard to say whether this is real or not.

And two of the 13 effects outright failed to replicate. Both were recent studies involving social priming, the field in which subtle and subconscious cues supposedly influence our later behaviour. In one, exposure to a US flag increases conservatism among Americans; in the other, exposure to money increases endorsement of the current social system.

For Nosek, personally, the results are a mixed bag. Two of his own effects were in the mix and they both checked out. Many classics in the field are robust. This is all good. But a lot of Nosek’s own work involves social priming, and the fact that this sub-field regularly (but not always) stumbles in the replication gauntlet is troubling to him. “This has been difficult for me personally because it’s an area that’s important for my research,” he says. “But I choose the red pill. That’s what doing science is.”

But he and others I spoke to also urge caution. This is neither a “te absolvo” for the field, nor a final damnation of social priming. The team chose the 13 effects arbitrarily, to represent a range of different psychological studies from different eras. It doesn’t mean that 10 out of every 13 effects will replicate, nor that 2 out of every 2 social priming ones will flunk. It’s not systematic. (Nosek, incidentally, is also leading a systematic check of reproducibility in psychology, in which more than 150 scientists are repeating every study published in four journals in 2008.  The man is front and centre in this debate.)

To focus too much on the results would miss the point. The critical thing about the Many Labs Project is its approach.

Replications are really important, and there aren’t enough of them in psychology. But single, one-off replications can add more heat than light. If you can’t replicate an earlier study, the knee-jerk reaction is to say that the original was flawed. Alternatively, you could be incompetent. Or you could have changed the original experiment in important ways. Or your new study may be too small. Or you might have studied a completely different group of people. The authors of the original study can always hit back with these objections, and they would not be wrong to.

So, some scientists run meta-analyses—big mega-studies where they look at the results of past experiments and tease out the overall picture. If one replication attempt is inconclusive, what do all of them say together? But meta-analyses also have flaws. If scientists don’t publish their failed replications (and until recently, it was really hard to), the meta-analysis will be badly skewed. And if everyone used slightly different methods, the results will still be inconclusive.

The Many Labs project has none of these problems because, as Daniel Simons told me, it is a planned meta-analysis. They did many checks all at once, and nothing was hidden away if it didn’t “work”. They consulted with the original authors where possible. They ran the exact same experiment on all of their different samples. They tested a far larger group of people than any of the original experiments (and replication attempts generally need to be bigger than the original studies that they’re checking). And they pre-registered their methods: everything was agreed before a single volunteer was recruited, leaving no room for sneaky data-massaging.
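For readers curious about the statistics, the pooling at the heart of a planned meta-analysis like this can be sketched in a few lines. This is not the Many Labs team’s actual code, and the numbers are made up; it just illustrates the standard fixed-effect, inverse-variance approach to combining per-lab effect sizes into one overall estimate, where bigger labs (smaller sampling variance) get more weight.

```python
import math

def pool_effects(effects, variances):
    """Combine per-lab effect estimates into one pooled estimate.

    effects   -- effect size from each lab (hypothetical numbers here)
    variances -- sampling variance of each lab's estimate
    Returns (pooled_effect, pooled_standard_error).
    """
    weights = [1.0 / v for v in variances]   # precision weights: 1/variance
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))       # standard error of the pooled estimate
    return pooled, se

# Three imaginary labs measuring the same effect; the middle lab is the
# largest, so its estimate dominates the pooled value.
effects = [0.30, 0.22, 0.35]
variances = [0.010, 0.004, 0.020]
pooled, se = pool_effects(effects, variances)
print(f"pooled effect = {pooled:.3f} +/- {1.96 * se:.3f}")
# prints: pooled effect = 0.256 +/- 0.098
```

The point of pre-registration is that everything in this calculation—which labs contribute, which effects count—is fixed in advance, so nothing can be quietly dropped from the sums after the results come in.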

The result is a definitive assessment of the 13 effects. The priming ones didn’t check out. At the other extreme, Nobel laureate Daniel Kahneman comes out of this very well. His classic anchoring effect, in which the first piece of information we get can bias our later decisions, turns out to be much stronger than he estimated in his original experiments.

The Many Labs sample was also diverse, which tells us whether the effects being scrutinised are delicate flowers that only blossom in certain situations, or robust blooms that grow everywhere. This is important, because some psychologists like Joe Cesario from Michigan State University have argued that effects like social priming ought to vary in different contexts, or across different individuals.

I contacted Cesario, and he clarified: “At no point did I make the claim that all effects, or even all priming effects, will vary by laboratory, region, etc. The point was to appreciate the possibility that some priming effects might vary by underappreciated context variables… Absent cross-lab replication, priming researchers cannot make extreme claims about the widespread nature of priming effects.”

In the Many Labs Project, none of the 13 effects varied according to the nationality of the participants, or whether they did the experiments online or in a lab. Kahneman’s work checked out everywhere, and the priming studies failed everywhere. Cesario adds, “The ManyLabs project correctly tells us that [the two effects that did not replicate] aren’t really effects that we as a discipline should care about because they have no generalizability beyond that unique situation.”

It is very telling that everyone I spoke to praised the initiative, including the authors whose work did not replicate. There was none of the acrimony that has stained past debates. When something is done this well, it’s pretty churlish to not accept the results.

This is a harbinger of things to come.

Simons is coordinating a similar multi-lab replication attempt of Jonathan Schooler’s verbal overshadowing effect, in which verbally describing something like a face impairs our recognition of that thing. The effect has been famously tricky to repeat, and Simons says “Our goal is to measure the actual effect size as accurately as possible by conducting the same study in many laboratories.” The results will be published in the journal Perspectives on Psychological Science next spring. “This multi-lab paper provides a preview of what I hope will become a standard approach in psychology.”

3 thoughts on “Welcome To The Era of Big Replication”

  1. Very thoughtful as always, Ed. It’s really good to hear that everyone was so positive about the findings — that’s as it should be, but it’s obviously not always the case, as you’ve found out first hand.

    Brian and his collaborators have really been doing a fantastic job with this, not just because it’s important research that takes a tremendous amount of care and work, but because it’s been a really intelligent approach. As you mention in the post, any single study can be interpreted as a refutation of previous research or a flawed replication, and usually you get both, and not a lot of progress, just acrimony. The Many Labs approach starts with research that is very basic (doesn’t require a lot of things to go right to replicate correctly), and executes it carefully so that everyone can get on board.

    There’s a lot more work to be done, but I think the field is really lucky to have some very reasonable people leading the charge. Thanks for covering it so well.

  2. In the tangentially-related field of psychiatry (i.e. better living through chemistry), a conceptually similar problem in studying effectiveness of medications surfaces routinely. There, the only objectively meaningful diagnosis is often “condition responds to medication X”.

    This can make double-blind trials nearly useless, because the criteria for inclusion are externally perceptible symptoms that correlate very poorly with any causative brain chemistry. Most participants may actually suffer from radically different underlying pathologies. In trials, a few respond positively to the treatment, and the rest see no, or even harmful, effects — not because the treatment is ineffective, but because their problem is not one that it treats. The summary seems to demonstrate only statistically insignificant effects.

    Ed probably sees (saw?) a lot of this in his work, where apparently identical tumors differ only in which treatments work. For fields where diagnosis is still in the dark ages (probably most, for at least some conditions) the double-blind “gold standard” amounts to superstition. The only solution is to devise better methods for such cases, and standards for when trials must resort to these methods.

    Is anyone working on this?

  3. So…these people who replicated the experiments–and ended up validating most of them (despite the situation having been previously described as a “replicability crisis”)–are from the same field (psychology)?

    I think we have to agree that one must be a kind soul to trust their results, their vindication of their field. If, for some reason, many past experiments gave, ahem, let’s just say “inaccurate” results, why should this hodgepodge of experiments–somewhat like one muddled Mega-Experiment–be any different?! That these scientists hold 12 different passports does little to increase their credibility–after all, this didn’t prevent them from collaborating on the same project, having as an unstated objective: “Okay, folks…we’re here to save the very reputability of our field.” Don’t forget, these scientists probably have memberships in the same associations, attend the same conferences, study from the same books…in short, they think alike. And, excuse the expression, if they could collaborate, they could conspire.

    Perhaps part of the problem is that the “samples” in these experiments are *human beings*?! Perhaps psychology is as much art as science? This could be far more positive than the frantic attempts to ‘mass-produce’ experiments. It could make many people more receptive to the field’s findings when communicated as “highly probable hypotheses” or “highly trustworthy advice” than the Gospel-truths many psychologists like to present them as.

    Back to the issue of inaccuracies and falsehoods, I think it’s a plague with nearly *all* branches of science, not just psychology. Instead of making more of the same, scientists should now put their own methodologies, motives, even themselves, under the microscope.
