Saturday, April 8, 2017

Statistical Errors in the Medical Literature

As Doug Altman famously wrote in his Scandal of Poor Medical Research in BMJ in 1994, the quality of how statistical principles and analysis methods are applied in medical research is quite poor.  According to Doug and to many others such as Richard Smith, the problems have only gotten worse.  The purpose of this blog article is to contain a running list of new papers in major medical journals that are statistically problematic, based on my random encounters with the literature.

One of the most pervasive problems in the medical literature (and in other subject areas) is misuse and misinterpretation of p-values as detailed here, and chief among these issues is perhaps the "absence of evidence is not evidence of absence" error written about so clearly by Altman and Bland.   The following thought will likely rattle many biomedical researchers but I've concluded that most of the gross misinterpretation of large p-values by falsely inferring that a treatment is not effective is caused by (1) the investigators not being brave enough to conclude "We haven't learned anything from this study", i.e., they feel compelled to believe that their investments of time and money must be worth something, and (2) journals accepting such papers without demanding a proper statistical interpretation in the conclusion.  One example of proper wording would be "This study rules out, with 0.95 confidence, a reduction in the odds of death by more than a factor of 2."  Ronald Fisher, when asked how to interpret a large p-value, said "Get more data."

Adoption of Bayesian methods would solve many problems including this one.  Whether a p-value is small or large, a Bayesian can compute the posterior probability of similarity of outcomes of two treatments (e.g., Prob(0.85 < odds ratio < 1/0.85)), and the researcher will often find that this probability is not large enough to draw a conclusion of similarity.  On the other hand, what if even under a skeptical prior distribution the Bayesian posterior probability of efficacy were 0.8 in a "negative" trial?  Would you choose for yourself the standard therapy when it had a 0.2 chance of being better than the new drug? [Note: I am not talking here about regulatory decisions.]  Imagine a Bayesian world where it is standard to report the results for the primary endpoint using language such as:

  • The probability of any efficacy is 0.94 (so the probability of non-efficacy is 0.06).
  • The probability of efficacy greater than a factor of 1.2 is 0.78 (odds ratio < 1/1.2).
  • The probability of similarity to within a factor of 1.2 is 0.3.
  • The probability that the true odds ratio is in [0.6, 0.99] is 0.95 (credible interval; doesn't rely on the long-run tendency of confidence intervals to include the true value for 0.95 of confidence intervals computed).
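
As a rough sketch of how statements like these could be produced, assume the estimated log odds ratio is approximately normal and combine it with a normal skeptical prior centered on no effect. The estimate, standard error, and prior width below are hypothetical values chosen for illustration; they do not come from any trial discussed here, and a real analysis would use the full likelihood rather than this normal approximation.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def posterior_log_or(est, se, prior_mean, prior_sd):
    """Normal-normal conjugate update on the log odds ratio scale."""
    w_prior = 1.0 / prior_sd ** 2   # prior precision
    w_data = 1.0 / se ** 2          # precision of the trial estimate
    var = 1.0 / (w_prior + w_data)
    mean = var * (w_prior * prior_mean + w_data * est)
    return NormalDist(mean, sqrt(var))


# Hypothetical trial summary (illustration only): OR estimate 0.75, SE 0.15
est = log(0.75)
se = 0.15

# Hypothetical skeptical prior: centered on OR = 1 with P(OR < 0.5) = 0.025
prior_sd = log(2.0) / NormalDist().inv_cdf(0.975)

post = posterior_log_or(est, se, 0.0, prior_sd)

# OR < 1 means the new treatment is better
p_any_efficacy = post.cdf(0.0)
p_efficacy_1_2 = post.cdf(log(1 / 1.2))                     # OR < 1/1.2
p_similarity = post.cdf(log(1.2)) - post.cdf(log(1 / 1.2))  # 1/1.2 < OR < 1.2
cred_int = (exp(post.inv_cdf(0.025)), exp(post.inv_cdf(0.975)))

print(f"P(any efficacy)          = {p_any_efficacy:.2f}")
print(f"P(OR < 1/1.2)            = {p_efficacy_1_2:.2f}")
print(f"P(similar within 1.2)    = {p_similarity:.2f}")
print(f"0.95 credible interval   = [{cred_int[0]:.2f}, {cred_int[1]:.2f}]")
```

Each quantity is a direct probability statement about the unknown odds ratio, so none of them depends on whether the p-value happened to fall above or below 0.05.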

In a so-called "negative" trial we frequently see the phrase "treatment B was not significantly different from treatment A" used without any thought about how little information that carries.  Was the power really adequate? Is the author talking about an observed statistic (probably yes) or the true unknown treatment effect?  Why should we care more about statistical significance than clinical significance?  The phrase "was not significantly different" seems to be a way to avoid the real issues of interpretation of large p-values.

Since my #1 area of study is statistical modeling, especially predictive modeling, I pay a lot of attention to model development and model validation as done in the medical literature, and I routinely encounter published papers where the authors do not have basic understanding of the statistical principles involved.  This seems to be especially true when a statistician is not among the paper's authors.  I'll be commenting on papers in which I encounter statistical modeling, validation, or interpretation problems.

Misinterpretation of P-values and of Main Study Results

One of the most problematic examples I've seen is in the March 2017 paper Levosimendan in Patients with Left Ventricular Dysfunction Undergoing Cardiac Surgery by Rajendra Mehta in the New England Journal of Medicine.  The study was designed to detect a miracle - a 35% relative odds reduction with drug compared to placebo, and used a power requirement of only 0.8 (type II error a whopping 0.2).  [The study also used some questionable alpha-spending that Bayesians would find quite odd.]  For the primary endpoint, the adjusted odds ratio was 1.00 with 0.99 confidence interval [0.66, 1.54] and p=0.98.  Yet the authors concluded "Levosimendan was not associated with a rate of the composite of death, renal-replacement therapy, perioperative myocardial infarction, or use of a mechanical cardiac assist device that was lower than the rate with placebo among high-risk patients undergoing cardiac surgery with the use of cardiopulmonary bypass."   Their own data are consistent with a 34% reduction (as well as a 54% increase)!  Almost nothing was learned from this underpowered study.  It may have been too disconcerting for the authors and the journal editor to have written "We were only able to rule out a massive benefit of drug."  [Note: two treatments can have agreement in outcome probabilities by chance just as they can have differences by chance.]  It would be interesting to see the Bayesian posterior probability that the true unknown odds ratio is in [0.85, 1/0.85].
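
A back-of-the-envelope version of that last calculation can be sketched from the published summary alone. This is only an approximation under two assumptions that are mine, not the paper's: the log odds ratio estimate is treated as normal with a standard error recovered from the reported 0.99 interval, and the prior is flat, so the posterior is centered at the reported estimate. A skeptical prior or a full-data analysis would give somewhat different numbers.

```python
from math import log
from statistics import NormalDist


def se_from_ci(lower, upper, level):
    """Recover the standard error of a log ratio from its confidence limits."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return (log(upper) - log(lower)) / (2.0 * z)


# Reported in the paper: adjusted OR 1.00, 0.99 confidence interval [0.66, 1.54]
or_hat = 1.00
se = se_from_ci(0.66, 1.54, level=0.99)

# Assumption: flat prior + normal approximation, so the posterior for the
# log odds ratio is Normal(log(or_hat), se)
post = NormalDist(log(or_hat), se)

# Probability that the treatments are similar: 0.85 < OR < 1/0.85
p_similar = post.cdf(log(1 / 0.85)) - post.cdf(log(0.85))

# Probability of the "miracle" the study was designed for: OR < 0.65
p_benefit_35 = post.cdf(log(0.65))

print(f"SE of log OR             = {se:.3f}")
print(f"P(0.85 < OR < 1/0.85)    = {p_similar:.2f}")
print(f"P(OR < 0.65)             = {p_benefit_35:.3f}")
```

Under these assumptions the probability of similarity comes out at roughly two-thirds, which is nowhere near enough to conclude that levosimendan and placebo have the same effect, while the probability of a 35% odds reduction is well under 0.01. That matches the wording "We were only able to rule out a massive benefit of drug."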

The primary endpoint is the union of death, dialysis, MI, or use of a cardiac assist device.  This counts these four endpoints as equally bad.  An ordinal response variable would have yielded more statistical information/precision and perhaps increased power.  And instead of dealing with multiplicity issues and alpha-spending, the multiple endpoints could have been dealt with more elegantly with a Bayesian analysis.  For example, one could easily compute the joint probability that the odds ratio for the primary endpoint is less than 0.8 and the odds ratio for the secondary endpoint is less than 1 [the secondary endpoint was death or assist device and is harder to demonstrate because of its lower incidence, and is perhaps more of a "hard endpoint"].  In the Bayesian world of forward, directly relevant probabilities there is no need to consider multiplicity.  There is only a need to state the assertions for which one wants to compute current probabilities.
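
To make the joint-probability idea concrete, here is a minimal sketch that assumes posterior draws of the two log odds ratios are already available. The draws below are simulated from a hypothetical correlated bivariate normal distribution (the means, standard deviations, and correlation are invented for illustration); in a real analysis they would come from the fitted Bayesian model.

```python
import random
from math import log, sqrt


def joint_probability(draws, condition):
    """Fraction of posterior draws for which the assertion holds."""
    return sum(1 for d in draws if condition(*d)) / len(draws)


rng = random.Random(1)

# Hypothetical posterior for (log OR primary, log OR secondary); illustration only
m1, s1 = log(0.85), 0.15
m2, s2 = log(0.90), 0.25
rho = 0.6  # assumed posterior correlation between the two effects

draws = []
for _ in range(20000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = m1 + s1 * z1
    x2 = m2 + s2 * (rho * z1 + sqrt(1.0 - rho ** 2) * z2)
    draws.append((x1, x2))

# Assertion of interest: OR primary < 0.8 AND OR secondary < 1
p_joint = joint_probability(draws, lambda a, b: a < log(0.8) and b < 0.0)
p_primary = joint_probability(draws, lambda a, b: a < log(0.8))
p_secondary = joint_probability(draws, lambda a, b: b < 0.0)

print(f"P(OR1 < 0.8)               = {p_primary:.2f}")
print(f"P(OR2 < 1)                 = {p_secondary:.2f}")
print(f"P(OR1 < 0.8 and OR2 < 1)   = {p_joint:.2f}")
```

The joint probability is just the proportion of draws satisfying both assertions at once, so no multiplicity adjustment enters the calculation.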

The paper also contains inappropriate assessments of interactions with treatment using subgroup analysis with arbitrary cutpoints on continuous baseline variables and failure to adjust for other main effects when doing the subgroup analysis.

This paper had a fine statistician as a co-author.  I can only conclude that the pressure to avoid disappointment with a conclusion of spending a lot of money with little to show for it was in play.

Why was such an underpowered study launched?  Why do researchers attempt "hail Mary passes"?  Is a study that is likely to be futile fully ethical?   Do medical journals allow this to happen because of some vested interest?

Similar Examples

Perhaps the above example is no worse than many.  Examples of "absence of evidence" misinterpretations abound.  Consider the JAMA paper by Kawazoe et al published 2017-04-04.  They concluded that "Mortality at 28 days was not significantly different in the dexmedetomidine group vs the control group (19 patients [22.8%] vs 28 patients [30.8%]; hazard ratio, 0.69; 95% CI, 0.38-1.22; P = .20)."  The point estimate was a reduction in hazard of death by 31% and the data are consistent with the reduction being as large as 62%!
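
The same kind of rough calculation can be applied to the mortality result quoted above. As before, this is a sketch under assumptions that are not in the paper: a normal approximation for the log hazard ratio, a standard error backed out of the reported 95% CI, and a flat prior.

```python
from math import log
from statistics import NormalDist

# Reported: hazard ratio 0.69, 95% CI 0.38-1.22
hr_hat, lower, upper = 0.69, 0.38, 1.22

# Standard error of the log hazard ratio recovered from the interval width
z = NormalDist().inv_cdf(0.975)
se = (log(upper) - log(lower)) / (2.0 * z)

# Assumption: flat prior, so the posterior is Normal(log(hr_hat), se)
post = NormalDist(log(hr_hat), se)

p_benefit = post.cdf(0.0)           # P(HR < 1): any reduction in mortality hazard
p_benefit_20 = post.cdf(log(0.8))   # P(HR < 0.8): at least a 20% reduction
max_reduction = 1.0 - lower         # largest reduction consistent with the CI

print(f"P(HR < 1)    = {p_benefit:.2f}")
print(f"P(HR < 0.8)  = {p_benefit_20:.2f}")
print(f"Reduction at lower confidence limit = {max_reduction:.0%}")
```

With these inputs the probability of some mortality benefit comes out near 0.89, which is hard to square with a conclusion that reads as if the drug did nothing.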

Or look at this 2017-03-21 JAMA article in which the authors concluded "Among healthy postmenopausal older women with a mean baseline serum 25-hydroxyvitamin D level of 32.8 ng/mL, supplementation with vitamin D3 and calcium compared with placebo did not result in a significantly lower risk of all-type cancer at 4 years." even though the observed hazard ratio was 0.7, with lower confidence limit of a whopping 53% reduction in the incidence of cancer.  And the 0.7 was an unadjusted hazard ratio; the hazard ratio could well have been more impressive had covariate adjustment been used to account for outcome heterogeneity within each treatment arm.


  1. The study was severely underpowered for even the optimistic targeted effect size. What I find puzzling is they originally expected 201 events from 760 subjects. At an interim point of 600 enrolled, they adjusted the target enrollment to 880. How many endpoints they had at the interim time point is not known but it had to be less than 105 (final number). Why bother adjusting from 760 to 880 if you are observing less than half the expected number of events? Not sure what I would have recommended if I was on a DMB - maybe terminate immediately.

  2. Trial sequential analysis of one trial comparing levosimendan versus placebo on the primary endpoint in patients with left ventricular dysfunction undergoing cardiac surgery.
    Trial sequential analysis of one trial of levosimendan versus placebo in patients with left ventricular dysfunction undergoing cardiac surgery, based on the diversity-adjusted required information size (DARIS) of 2981 patients. This DARIS was calculated based upon a proportion of patients with the low cardiac output syndrome after cardiac surgery of 24.1% in the control group; a RRR of 20% in the experimental intervention group; an alpha (α) of 5%; and a beta (β) of 20%. The cumulative Z-curve (blue line) did not cross the conventional alpha of 5% (green line) after one trial. It implies that there is a random error. The cumulative Z-curve did not reach the futility area, which is not even drawn by the program. Presently, only 28.5% (849/2981) of the DARIS has been obtained. Had we calculated the DARIS based on a more realistic RRR such as <20%, the obtained evidence would represent a much smaller part of the DARIS. There is a need for more adequately powered randomised clinical trials before reliable conclusions can be drawn.
    Figure is available (

    1. Sequential analysis in clinical trials is very important but for this particular trial we can make the needed points by mainly considering the final full-data analysis.

  3. This is a great idea. I'd love to see some sort of database collecting studies with poor methodology. I think this would be very valuable as a teaching resource.

    1. They are so easy to find it's hard to know where to start :-)

  4. I'd like to add that often, even with a significant effect, not much is learned because the noise makes it so the effect may range from plausible to implausible, e.g., d = 1.0, 95% CI [0.01, 2.0].

  5. Yes, you are right! But I see that everything is about amount, or I mean, the true difference is the amount. If I see a difference and it is small, it doesn't seem to matter, does it? If the difference is stronger, then it does matter. That's the problem and the solution. Only the big differences matter!
    That's the solution: ignore the small differences.

    1. But don't equate estimated differences with true differences.

  6. I too wish that the quality of reporting was better, and I think that Professor Harrell's wider work certainly improves the quality of research.
    However, I wonder, given the ubiquity of poorly analysed and reported research, if it is useful to single out particular papers. Researchers who would be embarrassed to see their work featured on this blog (much like they would be embarrassed to be featured as an example of poor-practice on BBC Radio 4's "More or Less") are likely to have agreed to a presentation of results either from weariness of battling with collaborators, a lack of power within the collaboration and/or job pressures.
    Researchers who are ignorant of how to usefully interpret a confidence interval will not recognise the validity of these criticisms. We only need to look in the letters pages to see how authors respond to valid criticisms.
    Hopefully the moves towards reproducible research, code sharing and open data will help with research quality.
    Nevertheless, surely the fundamental problem is the way in which academics are assessed for career progression. As Richard Smith argued when he spoke on this issue at an International Epidemiology Journal conference (I hope I am remembering his talk properly), universities are not fulfilling their duty; to judge and maintain the quality of work their staff is doing, and instead are outsourcing this responsibility to journal editors.
    David McAllister
    Wellcome Trust Intermediate Clinical Fellow and Beit Fellow,
    University of Glasgow

    1. David these are great comments and open up a legitimate debate (which I hope we can continue) of the value and appropriateness of singling out individual papers and authors. Part of my motivation is based on the fact that in addition to universities not fully meeting their duty, journals are very much to blame and need to be embarrassed. Very few journals simultaneously get the points that (1) statistical design has a lot to do with study quality and efficiency and (2) statistical interpretation is nuanced and is clouded by the frequentist approach. I am doing this for journal editors and reviewers as much as for authors and for pointing out limitations of translating study findings to practice. More of your thoughts, and the thoughts of others, welcomed.

    2. It is helpful for me to see actual examples as opposed to a generic criticism of under-powered studies. I am guilty of allowing researchers to say similar things to the NEJM article in the past. This example forces me to examine my role as a consulting statistician. I hope he will continue to point out problems using real examples, and also offer potential remedies.

    3. We should definitely develop a proposal for an optimum way of reporting frequentist inferential results. Overall the solutions are (1) appropriate interpretations, (2) designing studies for precision, and (3) making studies simpler and larger. Regarding (1), confidence intervals should be emphasized no matter what the p-value. I too have been involved in many, many underpowered studies. They usually end up in second-line journals. I am especially concerned when first-line journals don't do their job.

    4. Hi Frank,
      Thanks for your very open-minded reply. I take the point about journals and agree that they do need to be embarrassed.
      I have a perhaps unrealistic suggestion on this issue, which it would be great to hear your views on.
      I think that peer-review should be separated into two stages, a methods stage and a results and discussion stage. I think that introduction, methods section and those results which are a measure of the study robustness (eg total number of participants, loss to follow-up, etc) should be reviewed and given a score by each reviewer which reflects the quality of the methods. Only after having submitted this score, they should be able to access the full paper. Journals ought to report the methods-quality scores for every published paper, as well as some metric which summarises the quality of all of their published papers, and those they reject.
      I realise that there are a lot of issues around definition and measurement, but at least the within-journal comparison between accepted and rejected manuscripts would be illuminating.
      best wishes,

    5. David I think there is tremendous merit in this approach. It raises the question of exactly who reviewers should be. A related issue has been proposed by others: have all medical papers be reviewed without the results. Results bias reviews of methods. The likely reluctance of journals to adopt this approach will reveal the true motives of many journals. They stand not as much for science as they stand for readership and advertising dollars.
