Nice Results, But What Did You Expect?

In 2008, a team of psychologists from the University of Michigan apparently found a simple memory task that could boost intelligence. They asked volunteers to watch a sequence of symbols while listening to a series of letters. Holding both streams of information in their heads, they had to say if the current symbol or letter matched the one from a few cycles back. This memory-based “dual n-back” task seemed to improve the volunteers’ fluid intelligence—a general ability to solve problems that goes well beyond mere memory. The team said that their study opened up “a wide range of applications”.
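To make the task concrete, here is a minimal sketch of the matching logic behind a dual n-back task. This is my own illustration, not the Michigan team's actual software, and the function names are invented:

```python
# Illustrative sketch of n-back target detection. For each item in a
# stream, the volunteer must say whether it matches the item presented
# n steps earlier. A "dual" n-back runs two such streams at once
# (e.g. symbols on screen and spoken letters), judged independently.

def nback_targets(stream, n):
    """True at each position where the item matches the one n steps back."""
    return [i >= n and stream[i] == stream[i - n] for i in range(len(stream))]

def dual_nback_targets(visual, audio, n):
    """Target positions for both streams of a dual n-back trial."""
    return nback_targets(visual, n), nback_targets(audio, n)

visual = ["A", "B", "A", "C", "A"]
audio  = ["k", "k", "m", "k", "m"]
vis_hits, aud_hits = dual_nback_targets(visual, audio, 2)
# With n = 2, positions 2 and 4 of the visual stream are targets
# ("A" matches the "A" shown two steps earlier).
```

As n rises, the volunteer has to hold more of each stream in memory at once, which is what makes the task a plausible workout for working memory.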

Walter Boot from Florida State University and Daniel Simons from the University of Illinois at Urbana-Champaign disagree. They think the study had a critical weakness: it compared the people who did the n-back task with a control group who did nothing. Those who did the memory training may have expected to gain a temporary boost in intelligence, memory or mental abilities. Those who sat and waited wouldn’t have expected anything.

For decades, we’ve known that our expectations can wield a huge influence over our behaviour and our bodies. This is why new medicines are tested in double-blind randomised trials, where neither doctors nor patients know who’s getting a drug and who’s just getting a placebo. If the trials weren’t blinded, the patients who got the drug might get better simply because they expected to get better—the infamous placebo effect.

Running double-blind studies is easy enough when your medicine looks the same as a saline drip, but it’s usually impossible in psychology. “Psychology interventions aren’t like pills,” says Simons. “If you’re receiving an experimental treatment for depression, you know that you’re receiving treatment.” Or if you’re doing the dual n-back task, you know you’re doing a memory task.

That’s the problem: it’s hard to run these interventions without revealing your hand to volunteers. The even bigger problem, according to Boot and Simons, is that psychologists have largely dealt with this issue by sweeping it under the rug. “There’s a standard that’s really flawed,” says Simons. “Everyone knows the reason for double-blind designs. If you don’t have those, you have to control for the things they control for, and we’re not. We need to do better.”

Now, together with colleagues Cary Stothart and Cassie Stutts, Boot and Simons have published a paper—more a manifesto, really—that outlines their gripes. (You can read the full details of their project here, including an FAQ and a blog post.)

Great expectations

They first realised the scale of this problem when they looked at a long line of studies on the benefits of video games. Since 2003, many studies have shown that people who play action video games—mostly shooters like Unreal Tournament—have better attention and visual skills than people who play more sedate games like Tetris or The Sims. (This TED talk summarises the results.)

Boot and Simons failed to replicate a few of these classic results but, in doing so, they also noticed that none of these studies examined the volunteers’ expectations. Do the action gamers expect to do better in the tasks that they eventually do better in, compared to the slow-paced gamers? It’s a simple question, but no one was asking it, much less answering it.

The team collected some preliminary data of their own. They surveyed 400 people who watched a video of a game (Unreal Tournament or Tetris) and a second video of a mental test, of the sort used in earlier gaming studies. Did the volunteers think they would do better at the test if they had played the game? Yes, but selectively so. People who watched the action game expected to do better at vision and attention tasks, while those who watched Tetris expected to improve at mental rotation. And that’s exactly what previous studies found.

“We’ve only shown that expectations line up with the improvements people have seen, but not that they drive these effects,” says Simons. “But this shows the original results are inconclusive. The point is that we just don’t know.”

The team picked on video game studies not because they’re weak, but because they’re some of the best ones around. Their control groups at least did something comparable, unlike those in the n-back study, who sat around and did nothing. But that’s still not good enough, says Simons. “People assume that all active control groups are placebo controls, which is nonsense.”

The same problems emerged when the team looked at a broader range of psychological studies, from psychotherapy to brain-training. “I’ve been collecting a list of every intervention done since the start of 2013, and have yet to find a single one in psychology or education that adequately controls for expectations,” says Simons. “The great irony of this is that psychologists are the ones responsible for demonstrating the power of expectancy effects!”

“This is a very important point that tends to be systematically shoved aside even in carefully designed studies,” says Axel Cleeremans from the Free University of Brussels. “Participants in any study will always attempt to consciously figure out what one wants from them and how they should behave.”

“It’s something the field desperately needed to be reminded of,” says Randall Engle from Georgia Institute of Technology. It’s not just subjects either. Without double-blind studies, experimenters can skew the results of experiments because they expect their subjects to behave in a certain way. “We’re all vulnerable to expectancy effects, because we have a strong vested interest in finding something interesting,” says Engle.

But Torsten Schubert from the Humboldt University of Berlin, who studies video games, thinks the problem is overrated. He also says the team’s views are inconsistent with their own past research. For example, in 2008, they showed that an action game (Medal of Honor: Allied Assault) doesn’t improve short-term memory or attention-switching compared to a puzzle game (Tetris). Fair enough, but they didn’t control for expectancy either. How could they find no effect “if expectancy is a strong factor influencing the results of training studies?” asks Schubert.

Aboard the brain train

Boot and Simons aren’t saying that studies are rubbish if they don’t account for expectations. What bothers them is the mismatch between the methods being used and the claims that are made on the back of those methods.

Consider the growing number of ‘brain-training’ companies, which purport to improve general mental abilities through simple tasks. Some of these cite studies that support their causal claims, including the Michigan dual n-back experiment. But Simons says, “None of these do an adequate job of backing up the claims. The control is either doing nothing or doing a crossword, which is inadequate. These studies are being published in top journals and affecting public discourse.”

Engle agrees. Customers for brain-training programmes range from schools to intelligence agencies, and he doubts that they will get any braininess for their buck. “If this was some ivory tower effect, I wouldn’t worry so much about it, but it’s something that has real societal importance,” he says.

Dealing with the problem is easier said than done, especially since the gold standard of a double-blind trial is unreachable. But the team says that some “silver-standard” options might do. Psychologists could actually measure expectations, as Boot and Simons did in their quick survey. Then, at least, they could adjust their final results for any differences in expectation between groups. Better still, they could use expectation surveys to help design studies in the first place. For example, in the dual n-back study, the ideal control task would be something that people would also expect to improve general intelligence, but that doesn’t rely on memory in the same way.

Psychologists can also deliberately manipulate the expectations of their volunteers—something that doctors would struggle to do ethically. They could tell some volunteers (in both the intervention and control groups) to expect a benefit, while telling the rest that nothing should happen. They could tell people that they’d only see benefits after a certain amount of training, and test them before and after that point.

“It is of no doubt that expectancy can play a role in training studies but every one of the methods proposed by [the team] has its minuses,” says Schubert. A mix of techniques might be best, but that would greatly increase the money and time needed for a study. Why go to such extremes while the field is still young, and researchers are still arguing about whether the effects they’re seeing are real at all? “They’re hanging a heavy stone on a new promising research area, the potential of which is currently not yet known,” says Schubert.

The team is aware of the realities of cost, time, and the high bar that they’ve set. “I know we’ve struck a fairly negative tone because we want to alert people to this issue,” says Simons. “We want to make sure that reviewers and editors ask the right questions, and encourage people to take steps to remedy these issues. Psychology has always led the way in controlling these sorts of problems.”

Reference: Boot, Simons, Stothart & Stutts (2013). The Pervasive Problem With Placebos in Psychology: Why Active Control Groups Are Not Sufficient to Rule Out Placebo Effects. Perspectives on Psychological Science.

6 thoughts on “Nice Results, But What Did You Expect?”

  1. I’m sick to death of this “double-blind = Gold Standard” assertion. Where’s the quadruple-blind Iridium-Standard evidence for its perfection?

    We see frequent announcements that this or that promising new treatment is no better than placebo, or actually harmful. Certainly something is being tested, but how often does what was demonstrated really match what was concluded? Especially troubling are treatments for ill-defined conditions such as depression, autism, or cancer. Each of these appears to be a constellation of illnesses that share some common symptoms. It would be astonishing if one treatment fixed them all, or more than one, yet this is the Gold Standard test of usefulness. Imagine the first double-blind gold-standard trial of penicillin as a treatment for “ague”. A few patients might improve dramatically, but probably few more, and maybe fewer, than in the control group, depending on how the patients were sorted out.

    Failures of the putative Gold Standard make us uncomfortable, but that discomfort is a hint that there is something worth investigating. We might learn something that makes us change the Gold Standard, and distrust the results of older trials. That would be unfortunate, but not nearly so unfortunate as continuing to trust flawed trials.

  2. Are there any scientifically valid, replicated, controlled clinical trials comparing talking to a psychotherapist versus to talking to friends, volunteers, peers, pets, life coaches, motivational speakers, gurus, psychics/palm readers, cult leaders, mystics, or other vendors that promise emotional health and healing? It would be interesting to see the results of trials in which the subject does not know whether they are speaking to a licensed “professional” versus talking to an unlicensed empathic person, and equally expected to be helped in the latter case. It would also be interesting to see trials comparing talking to a licensed psychotherapist to other active control groups, such as writing in a journal, exercising, meditating, reading self-help books, getting massages, dancing, listening to music, taking a nature walk with a companion, etc. while equally expecting to be helped in all of the control groups. (It would help to manage and equalize expectations by making it clear that the person conducting the trial is a neutral third-party, not a psychologist, so as not to bias subjects in that direction.) If it turns out that a licensed psychotherapist does no better than another scenario, then people in need (as well as tax-payers and those who pay insurance premiums) could save huge amounts of money by not wasting it on unnecessary (and sometimes harmful) therapists and embracing cheaper healthy alternatives.

  3. Clearly the solution is to run these same experiments using the psychologists as subjects. Divide them into “training-credulous” and “training incredulous” based on a standardized survey and see if which category they’re in has an effect on their results.

  4. I’ve seen all the studies that report, say, Prozac is no better than placebos for treating depression. But I’ve seen and known of many people who had to try 2, 3, or even 5 different medicines before hitting on one that made a huge difference for them. Or, who needed combinations of medicines. Also, many times psychotherapies deliver no change UNTIL the patient is stabilized or improved in the short term by some medicine(s). For discussion, let’s say you go through 3 SSRIs before one that works is tried. Will ‘gold standard testing’ remove one or more useful medicines from the toolkit?

    I am very concerned that many treatments highly valuable to large parts of a given population will never reach them, because so many questions are tested ONLY in isolation. That is perhaps a bigger problem than the one discussed.

    As an aside I’d like to say that if the lesson in current discussion is that we should all be friendly and encouraging to everybody, that’s OK.

  5. I agree with the sentiment of Nathan Myers about double blind trials, but nonetheless the blog raises core issues. It is a pity that scientists (and journal reviewers) often do not think beyond methodology much of the time. The core issues are not scientific but real. Tools and tasks are different….
