The New Yorker has an article so interesting that it’s already generating discussion here, so it deserves a thread of its own. It describes a true modern paradox: so many good studies show interesting “significant” results, yet very few of these turn out to be genuine, repeatable findings. Frustrated researchers struggle to get similar results, and it’s almost as if the harder they try, the worse it gets. Researchers across disparate fields are noticing an odd trend: the effect they thought was so solid appears to mysteriously “wear off” as the years and the repeat trials go on.
It’s a sober warning to all of us to search hard for the truth hidden behind variables we are not even able to name yet, let alone measure, and to be ever vigilant about variables we can name, like “publication bias” and “selective reporting”.
Annals of Science
The Truth Wears Off
Is there something wrong with the scientific method?
by Jonah Lehrer December 13, 2010
These are quick quotes from a five-page article. It’s well written and worth reading in full.
But now all sorts of well-established, multiply confirmed findings have started to look increasingly uncertain. It’s as if our facts were losing their truth: claims that have been enshrined in textbooks are suddenly unprovable. This phenomenon doesn’t yet have an official name, but it’s occurring across a wide range of fields, from psychology to ecology. In the field of medicine, the phenomenon seems extremely widespread, affecting not only antipsychotics but also therapies ranging from cardiac stents to Vitamin E and antidepressants: Davis has a forthcoming analysis demonstrating that the efficacy of antidepressants has gone down as much as threefold in recent decades.
It’s becoming known as the decline effect.
Jennions, similarly, argues that the decline effect is largely a product of publication bias, or the tendency of scientists and scientific journals to prefer positive data over null results, which is what happens when no effect is found. The bias was first identified by the statistician Theodore Sterling, in 1959, after he noticed that ninety-seven per cent of all published psychological studies with statistically significant data found the effect they were looking for. A “significant” result is defined as any data point that would be produced by chance less than five per cent of the time. This ubiquitous test was invented in 1922 by the English mathematician Ronald Fisher, who picked five per cent as the boundary line, somewhat arbitrarily, because it made pencil and slide-rule calculations easier. Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments. In recent years, publication bias has mostly been seen as a problem for clinical trials, since pharmaceutical companies are less interested in publishing results that aren’t favorable. But it’s becoming increasingly clear that publication bias also produces major distortions in fields without large corporate incentives, such as psychology and ecology.
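To see how publication bias alone could produce something that looks like a decline effect, here is a minimal simulation sketch. It is not from the article. The true effect size, the group size, the number of studies and the normal-approximation test are all assumptions chosen for illustration. Many small studies measure the same modest real effect, only the "significant" positive ones get "published", and each published finding is then repeated once with no filter.

```python
import math
import random
import statistics


def run_study(true_effect, n, rng):
    """Simulate one two-group study; return (observed effect, passed the 5% test)."""
    treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    # Standard error of the difference between the two group means.
    se = math.sqrt(statistics.variance(treated) / n + statistics.variance(control) / n)
    # Normal approximation to a two-sided 5% test, keeping positive results only.
    return diff, diff / se > 1.96


def simulate(true_effect=0.2, n=20, studies=5000, seed=1):
    """Compare 'published' (significant-only) effects with unfiltered replications."""
    rng = random.Random(seed)
    results = [run_study(true_effect, n, rng) for _ in range(studies)]
    published = [diff for diff, significant in results if significant]
    # One fresh, unfiltered replication for every published finding.
    replications = [run_study(true_effect, n, rng)[0] for _ in published]
    return {
        "true_effect": true_effect,
        "share_published": len(published) / studies,
        "mean_published": statistics.fmean(published),
        "mean_replication": statistics.fmean(replications),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this toy setup the published studies report an effect several times larger than the real one, while the replications land back near the true value. Nothing "wore off". The first batch was simply filtered.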
Could it be selective reporting?
[Palmer] noticed that the distribution of results with smaller sample sizes wasn’t random at all but instead skewed heavily toward positive results. Palmer has since documented a similar problem in several other contested subject areas. “Once I realized that selective reporting is everywhere in science, I got quite depressed,” Palmer told me. “As a researcher, you’re always aware that there might be some nonrandom patterns, but I had no idea how widespread it is.” In a recent review article, Palmer summarized the impact of selective reporting on his field: “We cannot escape the troubling conclusion that some—perhaps many—cherished generalities are at best exaggerated in their biological significance and at worst a collective illusion nurtured by strong a-priori beliefs often repeated.”
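The pattern Palmer describes, where small samples skew toward positive results, can be reproduced in a rough sketch. This is an illustration under assumed numbers (the true effect, the sample sizes, the study counts and the significance filter are all hypothetical), not Palmer's method. If only results that clear the significance bar get reported, small studies can only clear it by overshooting, so the reported effect is most exaggerated where the sample is smallest.

```python
import math
import random
import statistics


def reported_effects(true_effect, n, studies, rng):
    """Observed effects from studies of group size n that cleared the 5% bar."""
    kept = []
    for _ in range(studies):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        se = math.sqrt(statistics.variance(treated) / n + statistics.variance(control) / n)
        # Normal approximation; only positive, "significant" results are reported.
        if diff / se > 1.96:
            kept.append(diff)
    return kept


def mean_reported_by_size(true_effect=0.3, sizes=(10, 40, 160), studies=3000, seed=7):
    """Mean reported effect for each sample size, after selective reporting."""
    rng = random.Random(seed)
    table = {}
    for n in sizes:
        kept = reported_effects(true_effect, n, studies, rng)
        table[n] = statistics.fmean(kept) if kept else float("nan")
    return table


if __name__ == "__main__":
    for n, mean_effect in mean_reported_by_size().items():
        print(f"n per group = {n:4d}  mean reported effect = {mean_effect:.2f}")
```

With these assumed settings, the smallest studies report the biggest effects and the largest studies sit closest to the true value, which is the lopsided funnel shape that gives selective reporting away.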
The Ioannidis study “Why Most Published Research Findings Are False” was a major eye-opener about the flaws in peer-reviewed papers.
John Ioannidis, an epidemiologist at Stanford University, argues that such distortions are a serious issue in biomedical research. “These exaggerations are why the decline has become so common,” he says. “It’d be really great if the initial studies gave us an accurate summary of things. But they don’t. And so what happens is we waste a lot of money treating millions of patients and doing lots of follow-up studies on other themes based on results that are misleading.” In 2005, Ioannidis published an article in the Journal of the American Medical Association that looked at the forty-nine most cited clinical-research studies in three major medical journals. Forty-five of these studies reported positive results, suggesting that the intervention being tested was effective. Because most of these studies were randomized controlled trials—the “gold standard” of medical evidence—they tended to have a significant impact on clinical practice, and led to the spread of treatments such as hormone replacement therapy for menopausal women and daily low-dose aspirin to prevent heart attacks and strokes. Nevertheless, the data Ioannidis found were disturbing: of the thirty-four claims that had been subject to replication, forty-one per cent had either been directly contradicted or had their effect sizes significantly downgraded.
According to Ioannidis, the main problem is that too many researchers engage in what he calls “significance chasing,” or finding ways to interpret the data so that it passes the statistical test of significance—the ninety-five-per-cent boundary invented by Ronald Fisher.
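A quick sketch of why "significance chasing" works so well: if a researcher can look at the data several different ways, the odds that at least one way passes the five per cent test by luck alone climb quickly. The code below is an illustration only, and it assumes the different looks at the data are independent, which real analyses usually are not. It computes the chance of at least one false positive among k tests when there is no real effect, and checks the formula against a simple simulation.

```python
import random


def chance_of_false_positive(k, alpha=0.05):
    """Probability that at least one of k independent null tests passes at level alpha."""
    return 1 - (1 - alpha) ** k


def simulate_chasing(k, trials=20000, seed=3):
    """Estimate the same probability by simulation, with no real effect present."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Under the null, each test statistic is a standard normal draw;
        # |z| > 1.96 corresponds to a two-sided 5% test.
        if any(abs(rng.gauss(0.0, 1.0)) > 1.96 for _ in range(k)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 10, 20):
        print(f"{k:2d} tests: formula {chance_of_false_positive(k):.3f}, "
              f"simulated {simulate_chasing(k):.3f}")
```

With ten independent tries the formula gives 1 - 0.95^10, or roughly a 40 per cent chance of a "significant" result from pure noise.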
This would seem to be the obvious, why-didn’t-we-do-it-ten-years-ago idea:
In a forthcoming paper, Schooler recommends the establishment of an open-source database, in which researchers are required to outline their planned investigations and document all their results.
I found this study quite gripping:
John Crabbe, a neuroscientist at the Oregon Health and Science University, conducted an experiment that showed how unknowable chance events can skew tests of replicability. He performed a series of experiments on mouse behavior in three different science labs: in Albany, New York; Edmonton, Alberta; and Portland, Oregon. Before he conducted the experiments, he tried to standardize every variable he could think of. The same strains of mice were used in each lab, shipped on the same day from the same supplier. The animals were raised in the same kind of enclosure, with the same brand of sawdust bedding. They had been exposed to the same amount of incandescent light, were living with the same number of littermates, and were fed the exact same type of chow pellets. When the mice were handled, it was with the same kind of surgical glove, and when they were tested it was on the same equipment, at the same time in the morning.
The premise of this test of replicability, of course, is that each of the labs should have generated the same pattern of results. “If any set of experiments should have passed the test, it should have been ours,” Crabbe says. “But that’s not the way it turned out.”
Under seemingly identical conditions, mice were injected with cocaine. In Portland the mice moved an average of an extra 600 cm a day; in Albany, 700 cm extra; in Edmonton, 5,000 cm extra.
The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise. The hyperactivity of those coked-up Edmonton mice wasn’t an interesting new fact—it was a meaningless outlier, a by-product of invisible variables we don’t understand.
Roy Spencer is also getting with the theme:
January 3rd, 2011 by Roy W. Spencer, Ph. D.
Those aren’t my words — it’s the title of a 2005 article, brought to my attention by Cal Beisner, which uses probability theory to “prove” that “…most claimed research findings are false”. While the article comes from the medical research field, it is sufficiently general that some of what it discusses can be applied to global warming research as well.
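The probability argument in that 2005 article can be sketched in a few lines. This is the simplest form of it, leaving out the paper's extra terms for bias and for multiple competing teams. The function name and the example numbers are my own assumptions for illustration. The idea is that the chance a "significant" finding is actually true depends on the odds that the hypothesis was true before the study (R), the study's power (1 - beta), and the significance threshold (alpha): PPV = (1 - beta) R / ((1 - beta) R + alpha).

```python
def positive_predictive_value(prior_odds, power=0.8, alpha=0.05):
    """Chance that a 'significant' finding is true, in the simplest no-bias case.

    prior_odds: pre-study odds that the tested relationship is real (R).
    power:      probability the study detects a real effect (1 - beta).
    alpha:      significance threshold, i.e. the false-positive rate.
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenarios, chosen only to show the range of outcomes.
    scenarios = {
        "well-powered test of a 50:50 hypothesis": (1.0, 0.8),
        "underpowered test of a 1-in-10 long shot": (0.1, 0.2),
    }
    for label, (odds, power) in scenarios.items():
        print(f"{label}: PPV = {positive_predictive_value(odds, power):.2f}")
```

In the first hypothetical case a significant result is true about 94 per cent of the time. In the second it is true less than 30 per cent of the time, even though both passed the same five per cent test.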
Biased funding sets up a problem before anyone even gets to biased reporting or selective publishing:
Twice I have testified in congress that unbiased funding on the subject of the causes of warming would be much closer to a reality if 50% of that money was devoted to finding natural reasons for climate change. Currently, that kind of research is almost non-existent.
Thanks to Pat for pointing me at the New Yorker article, and Jaymez for the Roy Spencer post.
David Burgess points out a good article on statistical significance: Odds Are It’s Wrong.