Neuroscience Cannae Do It Cap’n, It Doesn’t Have the Power

The brain has hit the big time. Barack Obama has just announced $100 million of funding for the BRAIN Initiative—an ambitious attempt to map the activity of every neuron in the brain. On the other side of the Atlantic, the Human Brain Project will try to simulate those neurons with a billion euros of funding from the European Commission. And news about neuroscience, from dream-decoding to mind-melding to memory-building, regularly dominates the headlines.

But while the field’s star seems to be rising, a new study casts a disquieting shadow upon the reliability of its results. A team of scientists led by Marcus Munafo from the University of Bristol analysed a broad range of neuroscience studies and found them plagued by low statistical power.

Statistical power refers to the odds that a study will detect an effect—say, whether antipsychotic drugs affect schizophrenia symptoms, or whether impulsivity is linked to addiction—assuming that effect actually exists. Most scientists regard a power of 80 percent as adequate: that gives you a 4 in 5 chance of finding an effect if there’s one to be found. But the studies that Munafo’s team examined tended to be so small that they had an average (median) power of just 21 percent. At that level, if you ran the same experiment five times, you’d detect the effect in only one of them. The other four attempts would be wasted.
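To make this concrete, here’s a back-of-the-envelope simulation (my own illustration in Python, not code from the paper): estimate the power of a standard two-sample t-test by running thousands of simulated experiments with a medium-sized true effect and a small sample, and counting how often significance is reached.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, effect_size, n_sims=10_000, alpha=0.05):
    """Estimate power by brute force: simulate many experiments and
    count how often a two-sample t-test reaches significance."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)          # control group
        b = rng.normal(effect_size, 1.0, n_per_group)  # the effect is real
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A medium effect (Cohen's d = 0.5) with only 15 subjects per group
# gives power in the mid-20s percent -- close to the median the paper
# reports, and far below the 80 percent convention.
print(estimated_power(15, 0.5))
```

To reach 80 percent power for the same effect, the simulation needs roughly 64 subjects per group—more than four times as many.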

But if studies are generally underpowered, there are more worrying implications beyond missed opportunities. It means that when scientists do claim to have found effects—that is, when experiments seem to “work”—those results are less likely to be real. And it means that even when the results are real, their reported sizes are probably inflated. As the team writes, this so-called “winner’s curse” means that “a ‘lucky’ scientist who makes the discovery in a small study is cursed by finding an inflated effect.”
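The winner’s curse is easy to demonstrate with the same kind of simulation (again my own sketch, not the paper’s code): give many small studies a modest true effect, keep only the ones that cross the significance threshold—the ones that would get published—and look at the effect sizes they report.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_effect, n_per_group = 0.3, 20   # a modest real effect, small samples
significant_estimates = []
for _ in range(5_000):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(true_effect, 1.0, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        # only the "successful" studies make it into the literature
        significant_estimates.append(b.mean() - a.mean())

# The average published estimate is substantially larger than the
# true effect of 0.3, because only the lucky overestimates pass
# the significance filter.
print(np.mean(significant_estimates))
```

Nothing dishonest has happened in any single study here; the inflation comes purely from filtering underpowered results by statistical significance.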

So, a field that is rife with low power is one that is also rife with wasted effort, false alarms and exaggerated results.

Across the sciences

These problems are far from unique to neuroscience. They exist in medicine, where corporate teams have only managed to reproduce a minority of basic studies in cancer, heart disease and other conditions. They exist in psychology—a field that I have written about extensively, and that is now taking a lead in wrestling with issues of replicability. They exist in genetics, which used to be flooded with tiny studies purporting to show links between some genetic variant and a trait or disease—links that were later disproved in larger studies. Now, geneticists are increasingly working with larger samples by pooling their recruits in big collaborations, and verifying their results in an independent group of people before publishing. “Different fields have learned from similar lessons in the recent past,” says Munafo.

Munafo himself, who studies addictive behaviour, works in the intersection between genetics and brain sciences. Over the past decade, he has published several meta-analyses—overviews of existing studies—looking at links between genetic variants and mental health, attention and drug cravings, brain activity and depression, and more. And he kept on seeing the same thing. “These studies were all coming up with the same average power of around 20%,” he says. “The convergence was really striking given the diversity of fields that we studied.”

He decided to take a more thorough look at neuroscience’s power, and enlisted a team of scientists from a wide range of fields. They included psychologist Kate Button, a postdoc in Munafo’s team, and Claire Mokrysz, a former Masters student now studying mental health at UCL. John Ioannidis also signed up—his now-classic paper “Why Most Published Research Findings Are False” has made him a figurehead among scientists looking at the reliability of their discoveries. Another partner, psychologist Brian Nosek, is at the forefront of efforts to make science more open and reliable and leads the newly opened Center for Open Science. “We were trying to present a range of perspectives rather than come across as one particular interest group trying to criticise one another,” says Munafo. “We want to be constructive rather than critical.”

Together, the team looked at every neuroscience meta-analysis published in 2011—49 in total, including over 730 individual studies between them. Their average power was just 21 percent. Their analysis is published in Nature Reviews Neuroscience.

“I think this is a really important paper for the field,” says Jon Simons, a neuroscientist from the University of Cambridge. “Much of neuroscience is still relatively young, so the best and most robust methods are still being established. I think it’s a sign of a healthy, thriving scientific discipline that these developments are being published in such a prominent flagship journal.”

Ioannidis agrees. “As the neuroscience community is expanding its reach towards more ambitious projects, I think it will be essential to ensure that not only more sophisticated technologies are used, but also larger sample sizes are involved in these studies,” he says.

If there’s a problem with the team’s approach, it’s that most of the meta-analyses the team considered looked at genetic associations with mental traits, or the effect of drugs and treatments on mental health. One could argue that these only reflect a small proportion of neuroscience studies, and are already “covered” by fields like genetics and medicine where issues of replicability have been discussed.

“This is the main limitation and a fair criticism,” says Munafo. To address it, his team looked at two other types of experiment. In an earlier study, Ioannidis showed that brain-scanning studies, which looked at brain volume in people with mental health conditions, reported a surfeit of positive results—a sign that negative studies were not being published. By analysing these studies again—all 461 of them—he showed that they have a median statistical power of just 8 percent.

The team also looked at 40 studies where rats were put in mazes to test their learning and memory. Again, these were typically so small that they only had a median power of 18 to 31 percent.

“Neuroscience is so broad that it’s hard to generalise,” says Munafo, “but across a diverse range of research questions and methods—genetics, imaging, animal studies, human studies—a consistent picture emerges that the studies are endemically underpowered.”

Fixing the problem

Unfortunately, raising power is easier said than done. It costs time and money, and there are many reasons why studies are currently underpowered. Partly, it’s just the nature of science. Power depends not just on sample size but on the strength of the effect you’re looking at—subtler effects demand larger samples to get the same power. But when you’re the first to study a phenomenon, you don’t know the size of the effect you’re looking for. You’re working off educated guesses or, perhaps more likely, what you have the time and money to do. “We’ve all done this where we’ve got a little bit of resource available to test a novel question and we might find some results,” says Munafo. Without such forays, science would grind to a halt.

The problem isn’t in the existence of such exploratory studies, but in how they are described in papers—poorly. “We need to be clear that when we’re having a punt and running a study that’s only as big as we can afford, it’ll probably be underpowered,” says Munafo.

Transparency is vital. In their paper, the team outlines several ways of addressing the problems of poorly powered studies, which all revolve around this theme. They include: pre-registering experimental plans before the results are in to reduce the odds of tweaking or selectively reporting data; making methods and raw data openly available so they can be easily checked by other scientists and pooled together in large samples; and working together to boost sample sizes.

But ultimately, the problem of underpowered studies ties into a recurring lament—that scientists face incentives that aren’t geared towards producing reliable results. Small, underpowered studies are great at producing what individuals and journals need—lots of new, interesting, significant and publishable results—but poor at producing what science as a whole needs—lots of true results. As long as these incentives continue to be poorly aligned, underpowered studies will remain a regular presence. “It would take a brave soul to do a tenth of the studies they were planning to do and just do a really big adequately powered one unless they’re secure enough in their career,” says Munafo.

This is why the team is especially keen that people who make decisions about funding in science will pay attention to his analysis. “If you have lots of people running studies that are too small to get a clear answer, that’s more wasteful in the long-term,” Munafo says. And if those studies involve animals, there is a clear ethical problem. “You end up sacrificing more animals than if you’d just run a single, large authoritative study in the first place. Paradoxically, I know people who’ve submitted animal grants that are powered to 95 percent but been told: ‘This is too much. You’re using too many animals.’”

“I’m thrilled they’ve written this review,” says David Eagleman, a neuroscientist at Baylor College of Medicine. “Hopefully, this sort of exposure can build towards a reduction of wastefulness in research, not only in terms of taxpayer dollars but in terms of scientific man-hours.”

Reference: Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson & Munafo. 2013. Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience. http://dx.doi.org/10.1038/nrn3475

17 thoughts on “Neuroscience Cannae Do It Cap’n, It Doesn’t Have the Power”

  1. Another argument for parallel recording. Traditional, one-neuron-at-a-time neurophysiological papers study 10s of neurons. Multi-electrode studies have 100s or 1000s of neurons. Enough power? Maybe not, but way more power than single neuron recording.

  2. Great piece on a really important article. Funding bodies must look at study designs to ensure power to detect hypothesised effects. Reporting must be transparent and reviewers and editors must be aware of the dangerous allure of cool, novel results from small studies. And someone has to come up with a statistical method that takes these effects into account – clearly the existing tests looking just at the data of individual expts after the fact don’t give p-values that mean what they purport to mean. Maybe Bayesian analysis incorporating a prior likelihood of seeing a significant result? A kind of meta-statistic that would correct for false positives across many small studies.

  3. Now that behavioral epigenetics has detailed gene x environment interactions at the molecular level of adaptive evolution in species from microbes to man, it’s time to look at study design and results in the context of what we’ve already learned from animal models. For example, nutrient-dependent amino acid substitutions show up in phenotypical traits of reproductive fitness. Selection is for pheromones that signal hormone-dependent reproductive fitness in vertebrates and invertebrates.

    Although statistical analyses can be used to link mutations to nutrient-dependent pheromone-controlled adaptive evolution, the statistical analyses do not address Darwin’s “conditions of existence,” which are nutrient-dependent and pheromone-controlled. Thus, mutations theory tells us about missense mutations and nonsense mutations that somehow result in adaptive variations or disease, with no mention of how the epigenetic landscape becomes the physical landscape of DNA (e.g., via chromatin remodeling). Consistent use of a model for how olfactory/pheromonal input epigenetically effects phenotypic traits that include behavior might provide a framework for evaluation of study results from different disciplines and bring us closer to placing what seem like disparate findings into their ‘proper’ context. The proper context has not changed from ‘conditions of existence,’ which are nutrient-dependent and pheromone-controlled. Proper context is not statistically determined; it’s the result of adaptive evolution.

  4. As you touch on, I think association studies (quant genetics) and drug studies (psychiatry) in humans are quite different from mainstream neuroscience. Not that there aren’t likely statistical problems in many subfields (looking at you, fMRI), but the problems here aren’t “neuroscience” problems, they are genetic association and drug trial problems.

    [I do touch on this, but I also think that objections like this assume that the null hypothesis should be “Problems don’t exist in my sub-field.” Given that problems of low power, poor replicability, publication bias and so on have been found in every field where people have actually collected objective data, I think the null hypothesis should be that there ARE problems, and it’s down to the field to disprove that by collecting data. – Ed]

  5. I agree Ed, but the two fields discussed here are themselves small subfields of neuroscience. Or not even subfields… they are methods used in all association studies and all drug trials, neuro or not.

    I don’t doubt all subfields have their stat probs, but THESE stat probs are neither specific to nor particularly prevalent in neuroscience.

  6. I don’t completely agree. You say “waste of taxpayer money …”. The argument can go the other way. To perform an experiment with more power costs much more: more subjects, replications, etc. The question is cost-benefit. What is the cost of underpowered studies versus the cost of over-powered studies? My sense is that underpowered studies should be published with strong caveats. When conclusions are important, they should be followed by appropriately powered studies. Underpowered studies should be considered publishable, but pilot studies.

    [John, I don’t think anyone’s saying underpowered studies should never be published. As I wrote, that would spell the end for exploratory research and harm science. You and I said the same thing: clearly label these as pilot exploratory studies. And then carry out replications that are billed as such and adequately powered. – E]

  7. I should be clear, I am not saying underpowered experimental designs are not prevalent… they may well be. These authors have an interest in clinical studies in humans, but they are conflating a basic research field (neuroscience) with a couple of methods related to its medical application in humans (neurology and related specialities).

    The equivalent might be identifying shortcomings in clinical cancer trials, and then declaring there are widespread statistical problems in the field of cell biology.

  8. The disincentive in academic publishing to include caveats or to clearly label pilot studies is a major fault here. Preliminary studies and negative results do not play well in major journals, even though they could in principle play important roles in understanding the big scientific picture.

    If preliminary or negative results do not get rewarded with tenure, then what is the incentive for framing these studies accurately?

    The popular media are at fault here as well. They place even more importance on the big splashes, often with almost no consideration of including the proper context.

  9. Thanks for another clear and compelling piece to bring these problems to the forefront, Ed.

    One point that you raise that should be particularly compelling to researchers is that running studies with low power means that you may be testing good ideas that are true, but discarding them as false.

    The flip side, that you turn to, is the increased chance of false positives. This makes sense to someone who follows the argument, but I’m not sure it’s intuitively compelling yet in the story as it’s written. I would argue that, on an intuitive level, if you go looking for something, with a small chance of finding it, then it seems reasonable that if you find it then it’s real — and it’s particularly impressive that you found it since you had such a small chance of doing so.

    What the intuition fails to grasp is that when you “find” something statistically, it’s not the same thing as finding a sunken treasure that you can hold in your hand. So what remains to be explained — and I recognize this is difficult — is why it is that low power makes a finding less likely to be true (or, if true, then probably inflated).

    The authors cover this in the paper, but (understandably) it’s conveyed in a way that assumes an acquaintance with statistical distributions. I could imagine using an example to illustrate the issue, but I can see that it’s hard to fit that into an article like this.

    p.s. just saw that @soozaphone’s Guardian piece does a pretty nice job with this.

  10. If chronically underpowered studies typically cannot be replicated, then using more animals in an attempt to replicate small exploratory studies will typically sacrifice a lot of animals to no great purpose. As an alternative, I suggest designing experiments and using evolutionary meta-strategies that produce sharper results. Consider the history of discoveries which have verifiably led to genuine advances. For example, cancer treatments certainly are keeping patients alive longer than was possible 30 years ago. In that field, unfortunately, many subjects have had to die; nevertheless, somehow we now have better treatment protocols. Ask how that happened.

  11. In regards to your last point concerning grant committees deciding the sample size is too large, the same thing goes for clinical ethics committees. Since there are currently no accurate tests that say “for x Power test x Participants”, ethics committees have nothing to go on. So they insist on the least number of participants possible, which could explain why a lot of fMRI studies in clinical populations have n<10 (in combination with time, money and difficulties getting participants of course).

  12. On the other hand, Javier Gonzalez-Castillo’s results show that increasing the power of an fMRI study leads to the entire gray matter of the brain being “active” = significantly correlated with the task timing. This is hard to interpret — which is what he’s up to now. And it is hard and confusing.
