- JoNova - https://www.joannenova.com.au -

Failure of Peer Review, meaningless statistical significance, needs fixing says Doctor & journal

Most of the results reported in peer reviewed literature in medicine are mere artefacts of poor methodology, despite being done to more exacting standards than climate studies. There are calls in the medical literature for all data to be made public and for stricter (lower) p-value thresholds to be required. (Yes please, say skeptics everywhere.) Miller and Young recommend that observational studies not be taken at all seriously until they are replicated at least once. That would have ruled out the original HockeyStick two times over.

Even the absolute best medical papers are wrong 20% of the time, but mere observational studies (like climate research) failed 80 – 100% of the time. These studies of papers demonstrate why anyone who waves the “Peer Review” red flag is in denial of the evidence — “Peer Review” is not part of the scientific method. It’s a form of argument from authority. A fallacy of reasoning is still a fallacy, no matter how many times it is repeated. Those who claim it is essential or rigorous are not scientists, no matter what their government-given title says.

 GEN, Genetic Engineering and Biotechnology News, May 1, 2014, Point of View

Are Medical Articles True on Health, Disease?

Sadly, Not as Often as You Might Think

S. Stanley Young  |  Henry I. Miller, M.D.

How many ways can a reported result fail?

Science works only when experiments are reproducible. If an experiment cannot be replicated, both the scientific enterprise and those who depend upon its results are in trouble. Driven by the realization that experiments surprisingly often do not replicate, the issue of claims in scientific papers is receiving increasing scrutiny. Given that biomedical research is one of the most important goals of the scientific enterprise, it is especially important to know how well the claims that result from clinical studies hold up.

Observational studies are mere “data mining”, they say, while RCTs (randomized controlled trials) are the gold standard. Neither was producing very useful results, but observational studies were especially poor. By their nature, most climate studies are observational.

Observational studies could be replicated 0% of the time.

Young and Karr [1] found 12 articles in prominent journals in which 52 claims coming from observational studies were tested in randomized clinical trials. Many of the RCTs were quite large, and most were run in factorial designs, e.g., vitamin D and calcium individually and together, along with a placebo group. Remarkably, none of the claims replicated in the direction claimed in the observational studies; in five instances there was actually statistical significance in the opposite direction.

Ioannidis looked at highly cited papers (supposedly the most important ones) and found that RCTs were replicated about 68% of the time (which is still a 32% failure rate), but observational studies only replicated one time in six (16.6%).

Young and Miller remark that this is not good enough:

Replication rates of 0.0% [1], 67.9%, or 16.6% [4] are unacceptable. Clearly the standard p-value of <0.05 as a measure of statistical significance is not a reliable indicator that a result will replicate.

They discuss problems with design of experiments, data mining, and modeling, all of which apply to climate studies (and then some):

Items 4, 5, 10 and 11: Journals generally require a p-value <0.05 to merit consideration for publication, but because they do not require investigators to make datasets available, there may be an incentive to manipulate the process to get a p-value that “qualifies.” There is general agreement that fabricating data is fraud, but is it legitimate to ask hundreds of questions and/or look at thousands of models and not show how these choices affect the resulting p-values and claims?
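To see why asking hundreds of questions matters, here is a rough sketch (mine, not from the GEN article) of the chance that at least one test slips under p < 0.05 by luck alone. It assumes every test is independent and that there is no real effect anywhere, which is a simplification; the function name `chance_of_false_positive` is made up for illustration.

```python
def chance_of_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_tests` independent tests
    gives p < alpha when no real effect exists (assumed independence)."""
    # Each test stays above alpha with probability (1 - alpha);
    # all of them do so with probability (1 - alpha) ** num_tests.
    return 1.0 - (1.0 - alpha) ** num_tests


for n in (1, 10, 100):
    print(n, round(chance_of_false_positive(n), 3))
# 1 0.05
# 10 0.401
# 100 0.994
```

Under those assumptions, a hundred unreported questions all but guarantee a "significant" result somewhere.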

The experimental “results” can be created through adjustments but that process is hidden from the reader — ain’t that just the way with climate models and then some?

Item 5: “Multiple modeling” provides the analyst with the flexibility to manipulate the data in order to get a p-value <0.05. For example, the analysis can be adjusted using linear models to make treatment groups more similar; with 10 covariates there are 1024 possible ways to perform this adjustment. Patient matching can be done in a number of ways. But such subtleties are largely hidden from the reader.
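The 1024 figure looks like plain 2 to the power of 10, on the assumption that each covariate is simply either included in the adjustment or left out. A minimal sketch of that count follows; the covariate names `x1` to `x10` and the helper `covariate_subsets` are hypothetical.

```python
from itertools import combinations


def covariate_subsets(covariates):
    """Return every subset of the covariates, from none to all of them."""
    subsets = []
    for size in range(len(covariates) + 1):
        subsets.extend(combinations(covariates, size))
    return subsets


# Ten hypothetical covariates, each either in or out of the model.
names = [f"x{i}" for i in range(1, 11)]
all_models = covariate_subsets(names)
print(len(all_models))  # 1024
```

Each of those 1024 subsets is a different adjusted model, and each can produce a different p-value for the same data.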

Statistical significance of “p <0.05” doesn’t mean much:

Even in the best randomized controlled trials (curiously, those run by industry), about 20% of the papers are probably wrong. Presumably non-industry means “government funded”, where things are twice as bad: about 40% of those were wrong.

There have been claims that using a p-value cutoff at 0.05 is not sufficiently stringent [6]. RCTs used to support drug approval require two studies with p-value <0.05 for an effective p-value ~0.0025. Industry-funded RCTs replicate with a frequency of about 78.6%, while other RCTs, typically using a single p-value <0.05, replicate 57.1% of the time.
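The ~0.0025 figure appears to be the two thresholds multiplied together, which holds only if the two studies are independent (my assumption; the article does not spell out the calculation). A small sketch, with a made-up function name:

```python
def joint_false_positive_rate(alpha: float, num_studies: int) -> float:
    """Chance that `num_studies` independent studies of a null effect
    all fall below the threshold `alpha` by luck alone."""
    return alpha ** num_studies


# Two independent studies, each required to reach p < 0.05.
print(round(joint_false_positive_rate(0.05, 2), 4))  # 0.0025
```

In other words, demanding two passes at 0.05 is roughly twenty times stricter than demanding one.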

Observational Studies – where super tiny p values still mean very little

There are so many ways an observational study can go wrong.

There appear to be systemic problems with the way that observational studies are commonly conducted. Virtually all of the problems listed in the Table can plague observational studies and, of course, any one alone or a combination of them could wreck a study. In light of multiple testing and multiple modeling, a p-value <0.05 is not nearly rigorous enough [6]. Five of the six observational studies in Ioannidis [4] reported p-values of 0.0001, 0.001, 0.003, 0.008, and 0.015 (the 6th study was a case series and reported no p-value). Most of these p-values are small enough that they would be considered either “strong” or even “decisive” evidence by statistics professor Valen E. Johnson [6], but in all these cases, well-designed RCTs failed to confirm the claims made in these observational studies.

The journals and the funding agencies are part of the problem

It is popular to blame investigators for these problems, but the culpability must be shared by the managers of the scientific process: funding agencies and journal editors. At a minimum, funding agencies should require that datasets used in papers be deposited so that the normal scientific peer oversight can occur. Journal editors need to reexamine their policy of being satisfied with a p-value <0.05, unadjusted for multiple testing or multiple modeling. Editors are using “quality by inspection” (p-value <0.05) rather than the more modern “quality by design.”

By essentially requiring a p-value <0.05, editors are directly responsible for publication bias, because most negative studies are not published.


1 Young SS, Karr A. Deming, data and observational studies: A process out of control and needing fixing. Significance 2011;September:122–126.

2 Prinz F, Schlange T, Asadullah K. Believe it or not: how much can we rely on published data on potential drug targets? Nature Rev. Drug Discov. 2011;10:712-713.

3 Begley CG, Ellis LM. Raise standards for preclinical cancer research. Nature 2012;483:531-533.

4 Ioannidis JPA. Contradicted and initially stronger effects in highly cited clinical research. JAMA 2005;294:218–229.

5 Madigan D, Ryan PB, Schuemie M. Does design matter? Systematic evaluation of the impact of analytical choices on effect estimates in observational studies. Therapeutic Advances in Drug Safety. 2013;4:53-62.

6 Johnson VE. Revised standards for statistical evidence. PNAS 2013;110:19313-19317.

7  Young and Miller (2014) Are Medical Articles True on Health, Disease?  GEN, Genetic Engineering and Biotechnology News, May 1, 2014, Point of View, (Vol. 34, No. 9)
