Why null ain't necessarily dull
Something slightly unusual happened this week. In a paper in the journal Vision Research, Simon Baron-Cohen and colleagues reported that they had failed to find any statistically significant difference between the visual acuity of individuals with and without autism. The study was a follow-up to a 2009 paper that claimed to show enhanced (or "eagle-eyed") visual acuity in autism. Following two particularly damning commentaries by experts in vision science, the Baron-Cohen group got together with the critics, fixed up the problems with the study, and tried to replicate their original findings. They failed.
While it's slightly concerning that the original study ever made it to publication, it's heartening that the authors took the criticism seriously, the concerns were addressed, and the scientific record was set straight fairly quickly. This is how science is supposed to work. But it's something that happens all too rarely.
In a brilliant piece in last weekend's New York Times, Carl Zimmer highlighted the difficulty science has in correcting itself. Wrong hypotheses are, in principle, there to be disproven, but in reality it's not always that straightforward. In particular, as Zimmer points out, scientists are under various pressures to investigate new hypotheses and report novel findings rather than revisit their own or other people's old studies and replicate (or not) their results. And many journals have a policy of not publishing replication studies, even if the outcomes should lead to a complete reassessment of the original study's conclusions.
There is, however, a deeper problem that Zimmer doesn’t really go into.
Most of the time, at least in the fields of science I'm familiar with, we’re in the business of null hypothesis testing. We're looking for an effect - a difference between two conditions of an experiment or two populations of people, or a correlation between two variables. But we test this effect statistically by asking how likely it is that we would have made the observations we did if our hypothesis were wrong and there wasn’t an effect at all. If the tests suggest that the null hypothesis is unlikely to account for the data, we conclude that there was an effect.
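To make that concrete, here's a minimal sketch in Python (with made-up numbers, not data from any of the studies discussed): simulate scores for two groups and run an independent-samples t-test; the p-value tells you how surprising data like these would be if the null hypothesis were true.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical scores for two groups of 25 people on the same measure.
group_a = rng.normal(loc=102, scale=15, size=25)
group_b = rng.normal(loc=100, scale=15, size=25)

# Independent-samples t-test of the null hypothesis that the group means are equal.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# Convention: reject the null hypothesis only if p < .05.
```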
The criteria are deliberately strict. By convention, there has to be less than a 5% chance of obtaining data like yours if the null hypothesis were true before you can confidently conclude that an effect exists. This is supposed to minimize the chances of people making grand claims based on small effects that could easily have come about purely by chance. But the problem is that it doesn’t work in reverse. If you don’t find a statistically significant effect, you can’t be confident that there isn’t one. Reviewers know this. Editors know this. Researchers know that reviewers and editors know this. Rather than being conservative, null hypothesis testing actually biases the whole scientific process towards spurious effects entering the literature and against the publication of follow-up studies that don't show such an effect. Failure to reject the null hypothesis is seen as just that - a failure.
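A toy simulation (my own illustration, assuming a modest true effect and small groups) shows why: even when a difference really exists, an underpowered study will usually fail to reach significance, so a null result on its own tells you very little.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assume a real but modest effect (half a standard deviation) and 15 people per group.
n_per_group, true_effect, n_studies = 15, 0.5, 5000
significant = 0
for _ in range(n_studies):
    group_a = rng.normal(true_effect, 1.0, n_per_group)
    group_b = rng.normal(0.0, 1.0, n_per_group)
    if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
        significant += 1

# Only around a quarter of these simulated studies come out significant,
# even though the effect is genuinely there.
print(f"Proportion significant: {significant / n_studies:.2f}")
```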
This is something with which I'm well acquainted. My PhD was essentially a series of failures to replicate. To cut a very long story very short, a bunch of studies in the mid-90s had apparently shown that, during memory tasks, people with Williams syndrome rely less on the meanings of words and more on their sounds. I identified a number of alternative explanations for these results and, like a good little scientist, designed some experiments to rule them out. Lo and behold, all the group differences disappeared.
Perhaps not surprisingly, publishing these studies turned out to be a major challenge. One paper was rejected four times before being finally accepted. By this time, I'd finished my PhD, completed a post-doc on similar issues in Down syndrome, and published two papers arising from that study. In some ways, they were much less interesting than the Williams syndrome studies because they really just confirmed what we already knew about Down syndrome. But they contained significant group differences and were both accepted first time.
So, the big question: how do you get a null result published?
One helpful suggestion comes from Chris Aberson in the brilliantly titled Journal of Articles in Support of the Null Hypothesis. He points out that you can never really say that an effect doesn’t exist. What you can do, however, is report confidence intervals on the effect size. In other words, you can say that, if an effect exists, it’s almost certainly going to be very small.
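Aberson's article sets out how to do this properly; as a rough sketch of the general idea (my own example with made-up data, not his procedure), you can bootstrap a confidence interval around a standardized effect size such as Cohen's d and report how big the effect could plausibly be.

```python
import numpy as np

rng = np.random.default_rng(7)

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical scores for the two groups being compared.
group_a = rng.normal(100, 15, 30)
group_b = rng.normal(100, 15, 30)

# Bootstrap: resample each group with replacement and recompute d many times.
boot_d = [cohens_d(rng.choice(group_a, len(group_a), replace=True),
                   rng.choice(group_b, len(group_b), replace=True))
          for _ in range(10_000)]
low, high = np.percentile(boot_d, [2.5, 97.5])

print(f"d = {cohens_d(group_a, group_b):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# If the whole interval sits close to zero, any effect that exists is at most small.
```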
Another possibility is to go Bayesian. Rather than simply telling you that there is not enough evidence to reject the null hypothesis, Bayesian statistics can tell you how strongly the observed data support the null hypothesis relative to the experimental hypothesis. I haven't attempted this yet myself, so I'd be interested to hear from anyone who has.
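For anyone tempted to try, one quick back-of-the-envelope option (sketched below with made-up data) is the BIC approximation to the Bayes factor described by Wagenmakers (2007), which expresses the relative evidence for the null versus the experimental hypothesis in an ordinary two-group comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scores for the two groups being compared.
group_a = rng.normal(100, 15, 30)
group_b = rng.normal(100, 15, 30)

scores = np.concatenate([group_a, group_b])
n = len(scores)

# Residual sum of squares under each hypothesis.
rss_null = np.sum((scores - scores.mean()) ** 2)            # H0: one common mean
rss_alt = (np.sum((group_a - group_a.mean()) ** 2)
           + np.sum((group_b - group_b.mean()) ** 2))        # H1: separate group means

# BIC for each model (the error-variance parameter is shared, so it cancels below).
bic_null = n * np.log(rss_null / n) + 1 * np.log(n)
bic_alt = n * np.log(rss_alt / n) + 2 * np.log(n)

# Wagenmakers (2007) approximation: BF01 = evidence for the null relative to the alternative.
bf01 = np.exp((bic_alt - bic_null) / 2)
print(f"BF01 = {bf01:.2f} (values above 3 are conventionally read as support for the null)")
```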
The strategy I've found really helpful is to look at factors that contribute to the size of the effect you're interested in. For example, in one study on context effects in language comprehension in autism, we were concerned that group differences in previous studies were really down to confounding group differences in language skills. Sure enough, when we selected our control group to have similar language skills to our autism group, we found no difference between the two groups. But more importantly, within each group, we were able to show that an individual's language level predicted the size of their context effect. This gave us a significant result to report and in itself is quite an interesting finding.
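In practice that analysis is just a within-group correlation or regression. A hypothetical sketch (the variable names and numbers below are invented for illustration, not our actual data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical within-group data: each participant's language score and the size
# of their context effect (e.g. a difference between condition means).
n = 30
language_score = rng.normal(100, 15, n)
context_effect = 0.4 * (language_score - 100) + rng.normal(0, 10, n)

# Does language level predict the size of the context effect within the group?
fit = stats.linregress(language_score, context_effect)
print(f"slope = {fit.slope:.2f}, r = {fit.rvalue:.2f}, p = {fit.pvalue:.4f}")
```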
This brings me neatly to my final point. In research on disorders such as autism or Williams syndrome, a significant group difference is considered to be the holy grail. In terms of getting the study published, it certainly makes life easier. But there is another way of looking at it. If you find a group difference, you’ve failed to control for whatever it is that has caused the group difference in the first place. A significant effect should really only be the beginning of the story.