*Updated 2017-11-04*

- Misinterpretation of P-values and Main Study Results
- Dichotomania
- Problems With Change Scores
- Improper Subgrouping
- Serial Data and Response Trajectories

As Doug Altman famously wrote in *The Scandal of Poor Medical Research* (BMJ, 1994), the quality with which statistical principles and analysis methods are applied in medical research is quite poor. According to Doug and to many others, such as Richard Smith, the problems have only gotten worse. The purpose of this blog article is to maintain a running list of new papers in major medical journals that are statistically problematic, based on my random encounters with the literature.

One of the most pervasive problems in the medical literature (and in other subject areas) is misuse and misinterpretation of p-values as detailed here, and chief among these issues is perhaps the *absence of evidence is not evidence of absence* error written about so clearly by Altman and Bland. The following thought will likely rattle many biomedical researchers, but I've concluded that most of the gross misinterpretation of large p-values by falsely inferring that a treatment is not effective is caused by (1) the investigators not being brave enough to conclude "We haven't learned anything from this study", i.e., they feel compelled to believe that their investment of time and money must be worth something, and (2) journals accepting such papers without demanding a proper statistical interpretation in the conclusion. One example of proper wording would be "This study rules out, with 0.95 confidence, a reduction in the odds of death by more than a factor of 2." Ronald Fisher, when asked how to interpret a large p-value, said "Get more data."

Adoption of Bayesian methods would solve many problems including this one. Whether a p-value is small or large, a Bayesian can compute the posterior probability of similarity of outcomes of two treatments (e.g., Prob(0.85 < odds ratio < 1/0.85)), and the researcher will often find that this probability is not large enough to draw a conclusion of similarity. On the other hand, what if even under a skeptical prior distribution the Bayesian posterior probability of efficacy were 0.8 in a "negative" trial? Would you choose for yourself the standard therapy when it had a 0.2 chance of being better than the new drug? [Note: I am not talking here about regulatory decisions.] Imagine a Bayesian world where it is standard to report the results for the primary endpoint using language such as:

- The probability of any efficacy is 0.94 (so the probability of non-efficacy is 0.06).
- The probability of efficacy greater than a factor of 1.2 is 0.78 (odds ratio < 1/1.2).
- The probability of similarity to within a factor of 1.2 is 0.3.
- The probability that the true odds ratio is between [0.6, 0.99] is 0.95 (a credible interval; unlike a confidence interval, its interpretation does not rely on the long-run tendency of the procedure to cover the true value in 0.95 of its applications).
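Such statements can be read directly off posterior draws. A minimal sketch, with a hypothetical Normal posterior for the log odds ratio standing in for real MCMC output:

```python
import math
import random

random.seed(1)
# Hypothetical posterior for the log odds ratio; in practice these
# draws would come from an MCMC fit of the actual trial data.
draws = [random.gauss(math.log(0.8), 0.12) for _ in range(100_000)]

def prob(assertion):
    # Posterior probability of an assertion = fraction of draws satisfying it
    return sum(assertion(lor) for lor in draws) / len(draws)

p_efficacy   = prob(lambda lor: lor < 0)                   # OR < 1
p_nontrivial = prob(lambda lor: lor < math.log(1 / 1.2))   # OR < 1/1.2
p_similar    = prob(lambda lor: abs(lor) < math.log(1.2))  # 1/1.2 < OR < 1.2

print(f"P(any efficacy)                 = {p_efficacy:.2f}")
print(f"P(efficacy beyond a factor 1.2) = {p_nontrivial:.2f}")
print(f"P(similarity within factor 1.2) = {p_similar:.2f}")
```

With real posterior draws from a fitted model, the same `prob(...)` calls yield exactly the probability statements listed above.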

Since my #1 area of study is statistical modeling, especially predictive modeling, I pay a lot of attention to model development and model validation as done in the medical literature, and I routinely encounter published papers whose authors do not have a basic understanding of the statistical principles involved. This seems to be especially true when a statistician is not among the paper's authors. I'll be commenting on papers in which I encounter statistical modeling, validation, or interpretation problems.

### Misinterpretation of P-values and of Main Study Results

One of the most problematic examples I've seen is in the March 2017 paper Levosimendan in Patients with Left Ventricular Dysfunction Undergoing Cardiac Surgery by Rajendra Mehta in the New England Journal of Medicine. The study was designed to detect a miracle: a 35% relative odds reduction with drug compared to placebo, and used a power requirement of only 0.8 (a whopping type II error of 0.2). [The study also used some questionable alpha-spending that Bayesians would find quite odd.] For the primary endpoint, the adjusted odds ratio was 1.00 with 0.99 confidence interval [0.66, 1.54] and p=0.98. Yet the authors concluded "Levosimendan was not associated with a rate of the composite of death, renal-replacement therapy, perioperative myocardial infarction, or use of a mechanical cardiac assist device that was lower than the rate with placebo among high-risk patients undergoing cardiac surgery with the use of cardiopulmonary bypass." Their own data are consistent with a 34% reduction (as well as a 54% increase)! Almost nothing was learned from this underpowered study. It may have been too disconcerting for the authors and the journal editor to have written "We were only able to rule out a massive benefit of drug." [Note: two treatments can have agreement in outcome probabilities by chance just as they can have differences by chance.] It would be interesting to see the Bayesian posterior probability that the true unknown odds ratio is in [0.85, 1/0.85].

The primary endpoint is the union of death, dialysis, MI, or use of a cardiac assist device. This counts these four endpoints as equally bad. An ordinal response variable would have yielded more statistical information/precision and perhaps increased power. And instead of dealing with multiplicity issues and alpha-spending, the multiple endpoints could have been dealt with more elegantly with a Bayesian analysis.
For example, one could easily compute the joint probability that the odds ratio for the primary endpoint is less than 0.8 and the odds ratio for the secondary endpoint is less than 1 [the secondary endpoint was death or assist device, which is harder to demonstrate because of its lower incidence and is perhaps more of a "hard endpoint"]. In the Bayesian world of forward directly relevant probabilities there is no need to consider multiplicity. There is only a need to state the assertions for which one wants to compute current probabilities.
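Such a joint probability falls directly out of posterior draws. A sketch using a hypothetical correlated bivariate Normal posterior for the two log odds ratios (the means, SDs, and correlation are all invented for illustration; real draws would come from a joint Bayesian model of both endpoints):

```python
import math
import random

random.seed(2)
rho = 0.6            # invented posterior correlation between the two log ORs
n = 100_000
joint = 0
for _ in range(n):
    # draw a correlated pair of standard normals
    z1 = random.gauss(0, 1)
    z2 = rho * z1 + math.sqrt(1 - rho ** 2) * random.gauss(0, 1)
    lor_primary   = math.log(0.85) + 0.15 * z1   # hypothetical posteriors
    lor_secondary = math.log(0.90) + 0.20 * z2
    # count draws satisfying BOTH assertions simultaneously
    if lor_primary < math.log(0.8) and lor_secondary < 0:
        joint += 1

p_joint = joint / n
print(f"P(primary OR < 0.8 and secondary OR < 1) = {p_joint:.2f}")
```

No multiplicity adjustment enters anywhere; the joint assertion is simply evaluated on each draw.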

The paper also contains inappropriate assessments of interactions with treatment using subgroup analysis with arbitrary cutpoints on continuous baseline variables and failure to adjust for other main effects when doing the subgroup analysis.

This paper had a fine statistician as a co-author. I can only conclude that pressure was in play to avoid the disappointing conclusion that a lot of money had been spent with little to show for it.

Why was such an underpowered study launched? Why do researchers attempt "hail Mary passes"? Is a study that is likely to be futile fully ethical? Do medical journals allow this to happen because of some vested interest?

#### Similar Examples

Perhaps the above example is no worse than many. Examples of "absence of evidence" misinterpretations abound. Consider the JAMA paper by Kawazoe et al published 2017-04-04. They concluded that "Mortality at 28 days was not significantly different in the dexmedetomidine group vs the control group (19 patients [22.8%] vs 28 patients [30.8%]; hazard ratio, 0.69; 95% CI, 0.38-1.22; P = .20)." The point estimate was a reduction in hazard of death by 31%, and the data are consistent with the reduction being as large as 62%!
Or look at this 2017-03-21 JAMA article in which the authors concluded "Among healthy postmenopausal older women with a mean baseline serum 25-hydroxyvitamin D level of 32.8 ng/mL, supplementation with vitamin D_{3} and calcium compared with placebo did not result in a significantly lower risk of all-type cancer at 4 years." even though the observed hazard ratio was 0.7, with a lower confidence limit representing a whopping 53% reduction in the incidence of cancer. And the 0.7 was an *unadjusted* hazard ratio; the hazard ratio could well have been more impressive had covariate adjustment been used to account for outcome heterogeneity within each treatment arm.

An incredibly high-profile paper published online 2017-11-02 in *The Lancet* demonstrates a lack of understanding of some statistical issues. In Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial by Rasha Al-Lamee et al, the authors (or was it the journal editor?) boldly claimed "In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure." The authors are to be congratulated on using a rigorous sham control, but the authors, reviewers, and editor allowed a classic *absence of evidence is not evidence of absence* error to be made in attempting to interpret p=0.2 for the primary analysis of exercise time in this small (n=200) RCT. In doing so they ignored the useful (but flawed; see below) 0.95 confidence interval of this effect of [-8.9, 42] seconds of exercise time increase for PCI. Thus their data are consistent with a 42 second increase in exercise time by real PCI. It is also important to note that the authors fell into the change from baseline trap by disrespecting their own parallel group design. They should have asked the covariate-adjusted question: For two patients starting with the same exercise capacity, one assigned PCI and one assigned PCI sham, what is the average difference in follow-up exercise time?

**But** there are other ways to view this study. Sham studies are difficult to fund, and it is difficult to recruit large numbers of patients into them. Criticizing the interpretation of the statistical analysis fails to recognize the value of the study. One value is the study's ruling out an exercise time improvement greater than 42s (with 0.95 confidence). If, as several cardiologists have told me, 42s is not very meaningful to the patient, then the study is definitive and clinically relevant. I just wish that authors and especially editors would use exactly correct language in abstracts of articles. For this trial, suitable language would have been along these lines: The study did not find evidence against the null hypothesis of no change in exercise time (p=0.2), but was able to (with 0.95 confidence) rule out an effect larger than 42s. A Bayesian analysis would have been even more clinically useful. For example, one might find that the posterior probability that the increase in exercise time with PCI is less than 20s is 0.97. And our infatuation with 2-tailed p-values comes into play here. A Bayesian posterior probability of *any* improvement might be around 0.88, far more "positive" than what someone who misunderstands p-values would conclude from an "insignificant" p-value. Other thoughts concerning the ORBITA trial may be found here.
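Under a flat prior, approximate versions of such posterior probabilities can even be recovered from the published confidence interval, since the posterior for the effect is then approximately Normal with the same center and spread. A back-of-envelope sketch (this flat-prior calculation is not the skeptical-prior analysis contemplated above, so its numbers differ):

```python
import math

# Reported 0.95 confidence interval for the PCI effect on exercise time (s)
lo, hi = -8.9, 42.0
mean = (lo + hi) / 2            # approximate posterior mean under a flat prior
se = (hi - lo) / (2 * 1.96)     # approximate posterior SD

def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_any_improvement = 1 - normal_cdf((0 - mean) / se)
p_less_than_20s   = normal_cdf((20 - mean) / se)

print(f"P(PCI improves exercise time at all) ~ {p_any_improvement:.2f}")
print(f"P(improvement < 20 s)                ~ {p_less_than_20s:.2f}")
```

The flat-prior probability of any improvement comes out near 0.90, which is roughly 1 - p/2 for a 2-tailed p of 0.2; a skeptical prior would pull it somewhat lower.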

### Dichotomania

Dichotomania, as discussed by Stephen Senn, is a very prevalent problem in medical and epidemiologic research. Categorization of continuous variables for analysis is inefficient at best and misleading and arbitrary at worst. This JAMA paper by VISION study investigators "Association of Postoperative High-Sensitivity Troponin Levels With Myocardial Injury and 30-Day Mortality Among Patients Undergoing Noncardiac Surgery" is an excellent example of bad statistical practice that limits the amount of information provided by the study. The authors categorized high-sensitivity troponin T levels measured post-op and related these to the incidence of death. They used four intervals of troponin, and there is important heterogeneity of patients within these intervals. This is especially true for the last interval (> 1000 ng/L). Mortality may be much higher for troponin values that are much larger than 1000. The relationship should have been analyzed with a continuous analysis, e.g., logistic regression with a regression spline for troponin, nonparametric smoother, etc. The final result could be presented in a simple line graph with confidence bands.
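Here is a sketch of that suggested continuous analysis on simulated data: logistic regression with a restricted cubic spline in log troponin, fit by plain Newton/IRLS. The data-generating mechanism, knot placement, and sample size are all invented for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def rcs_basis(x, knots):
    """Restricted cubic spline basis: linear term + (k-2) nonlinear terms."""
    t = np.asarray(knots, dtype=float)
    d = (t[-1] - t[0]) ** 2          # normalization constant
    cols = [x]
    for j in range(len(t) - 2):
        cols.append((np.maximum(x - t[j], 0) ** 3
                     - np.maximum(x - t[-2], 0) ** 3 * (t[-1] - t[j]) / (t[-1] - t[-2])
                     + np.maximum(x - t[-1], 0) ** 3 * (t[-2] - t[j]) / (t[-1] - t[-2])) / d)
    return np.column_stack(cols)

def fit_logistic(X, y, iters=25):
    """Maximum likelihood logistic regression by Newton-Raphson (IRLS)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X1 @ beta))
        w = p * (1 - p)
        beta += np.linalg.solve(X1.T * w @ X1, X1.T @ (y - p))
    return beta

# Simulated log troponin with a smooth nonlinear effect on death
log_t = rng.normal(3.0, 1.2, 2000)
true_logit = -3 + 0.8 * np.maximum(log_t - 3, 0) ** 2
y = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

knots = np.quantile(log_t, [0.05, 0.35, 0.65, 0.95])
beta = fit_logistic(rcs_basis(log_t, knots), y)

# Predicted risk over a fine grid: the makings of a simple line graph
grid = np.linspace(log_t.min(), log_t.max(), 100)
G = np.column_stack([np.ones(100), rcs_basis(grid, knots)])
risk = 1 / (1 + np.exp(-(G @ beta)))
print("risk at low / median / high troponin:",
      risk[5].round(3), risk[50].round(3), risk[95].round(3))
```

The `risk` vector over the grid is exactly what would be plotted as the final line graph, with confidence bands added from the information matrix or the bootstrap.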
An example of dichotomania that may not be surpassed for some time is Simplification of the HOSPITAL Score for Predicting 30-day Readmissions by Carole E Aubert, et al in *BMJ Quality and Safety* 2017-04-17. The authors arbitrarily dichotomized several important predictors, resulting in a major loss of information, then dichotomized their resulting predictive score, sacrificing much of what information remained. The authors failed to trust probabilities, instead reducing the risk of 30-day readmission to the labels "unlikely" and "likely". The categorization of predictor variables leaves demonstrable outcome heterogeneity within the intervals of predictor values. Then taking an already oversimplified predictive score and dichotomizing it is essentially saying to the reader "We don't like the integer score we just went to the trouble to develop." I now have serious doubts about the thoroughness of reviews at *BMJ Quality and Safety*.

A very high-profile paper was published in BMJ on 2017-06-06: Moderate alcohol consumption as risk factor for adverse brain outcomes and cognitive decline: longitudinal cohort study by Anya Topiwala et al. The authors had a golden opportunity to estimate the dose-response relationship between amount of alcohol consumed and quantitative brain changes. Instead the authors squandered the data by doing analyses that either assumed that responses are linear in alcohol consumption or worse, by splitting consumption into 6 heterogeneous intervals when in fact consumption was shown in their Figure 3 to have a nice continuous distribution. How much more informative (and statistically powerful) it would have been to fit a quadratic or a restricted cubic spline function to consumption to estimate the continuous dose-response curve.

The NEJM keeps giving us great teaching examples with its 2017-08-03 edition. In Angiotensin II for the treatment of vasodilatory shock by Ashish Khanna et al, the authors constructed a bizarre response variable: "The primary end point was a response with respect to mean arterial pressure at hour 3 after the start of infusion, with response defined as an increase from baseline of at least 10 mm Hg or an increase to at least 75 mm Hg, without an increase in the dose of background vasopressors." This form of dichotomania has been discredited by Stephen Senn, who provided a similar example in which he decoded the response function to show that the lucky patient is one (in the NEJM case) who has a starting blood pressure of 74mmHg.

When a clinical trial's response variable is one that is arbitrary, loses information and power, is difficult to interpret, and means different things for different patients, expect trouble.

### Change from Baseline

Many authors and pharmaceutical clinical trialists make the mistake of analyzing change from baseline instead of making the raw follow-up measurements the primary outcomes, covariate-adjusted for baseline. To compute change scores requires many assumptions to hold, e.g.:

1. the variable is not used as an inclusion/exclusion criterion for the study, otherwise regression to the mean will be strong
2. if the variable is used to select patients for the study, a second post-enrollment baseline is measured and this baseline is the one used for all subsequent analysis
3. the post value must be linearly related to the pre value
4. the variable must be perfectly transformed so that subtraction "works" and the result is not baseline-dependent
5. the variable must not have floor and ceiling effects
6. the variable must have a smooth distribution
7. the slope of the pre value vs. the follow-up measurement must be close to 1.0 when both variables are properly transformed (using the same transformation on both)

Regarding 3. above, if pre is not linearly related to post, there is no transformation that can make a change score work.

Regarding 7. above, often the baseline is not as relevant as thought and the slope will be less than 1. When the treatment can cure every patient, the slope will be zero. Sometimes the relationship between baseline and follow-up Y is not even linear, as in one example I've seen based on the Hamilton D depression scale.
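A quick diagnostic for item 7 is to estimate the slope of post on pre directly. In this simulated example the true slope is 0.5 (all numbers invented), and the change score consequently remains strongly correlated with baseline, i.e., subtraction did not remove the baseline effect:

```python
import random

random.seed(3)

def slope(x, y):
    # least-squares slope of y on x
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

pre  = [random.gauss(50, 10) for _ in range(5000)]
post = [25 + 0.5 * p + random.gauss(0, 8) for p in pre]   # true slope 0.5

b_post   = slope(pre, post)                                # far from 1.0
b_change = slope(pre, [b - a for a, b in zip(pre, post)])  # equals b_post - 1

print(f"slope of post on pre:   {b_post:.2f}")
print(f"slope of change on pre: {b_change:.2f}")
```

Since the slope of change on pre is algebraically the slope of post on pre minus 1, a flat-line relationship of change vs. baseline holds only when the post-on-pre slope is 1.0.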

The purpose of a parallel-group randomized clinical trial is to compare the parallel groups, not to compare a patient with herself at baseline. The central question is: for two patients with the same pre measurement value of x, one given treatment A and the other treatment B, will the patients tend to have different post-treatment values? This is exactly what analysis of covariance assesses. Within-patient change is affected strongly by regression to the mean and measurement error. When the baseline value is one of the patient inclusion/exclusion criteria, the only meaningful change score requires one to have a second baseline measurement post patient qualification to cancel out much of the regression to the mean effect. It is the second baseline that would be subtracted from the follow-up measurement.

The savvy researcher knows that analysis of covariance is required to "rescue" a change score analysis. This effectively cancels out the change score and gives the right answer even if the slope of post on pre is not 1.0. But this works only in the linear model case, and it can be confusing to have the pre variable on both the left and right hand sides of the statistical model. And if Y is ordinal but not interval-scaled, the difference in two ordinal variables is no longer even ordinal. Think of how meaningless differences from baseline in ordinal pain categories are. A **major problem** in the use of change score summaries, even when a correct analysis of covariance has been done, is that many papers and drug product labels still quote change scores out of context.
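A small simulation contrasting the two analyses, with an invented true treatment effect of 4 units and a post-on-pre slope of only 0.5: both estimates are unbiased under randomization, but ANCOVA is more precise because the change score over-corrects for baseline:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
pre = rng.normal(50, 10, n)
tx = rng.integers(0, 2, n)               # 0 = control arm, 1 = treated arm
post = 25 + 0.5 * pre + 4 * tx + rng.normal(0, 8, n)

# ANCOVA: regress follow-up value on baseline and treatment
X = np.column_stack([np.ones(n), pre, tx])
coef, *_ = np.linalg.lstsq(X, post, rcond=None)
effect_ancova = coef[2]

# Change-score analysis: difference in mean (post - pre) between arms
change = post - pre
effect_change = change[tx == 1].mean() - change[tx == 0].mean()

print(f"ANCOVA treatment effect estimate:       {effect_ancova:.2f}")
print(f"change-score treatment effect estimate: {effect_change:.2f}")
```

With a residual SD of 8 but a post-on-pre slope of 0.5, the within-arm SD of the change score is about 9.4 versus 8 for the ANCOVA residual, so the change-score comparison wastes information even in this best case where it is unbiased.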

Patient-reported outcome scales are particularly problematic. An article published 2017-05-07 in JAMA, doi:10.1001/jama.2017.5103 like many other articles makes the error of trusting change from baseline as an appropriate analysis variable. Mean change from baseline may not apply to anyone in the trial. Consider a 5-point ordinal pain scale with values Y=1,2,3,4,5. Patients starting with no pain (Y=1) cannot improve, so their mean change must be zero. Patients starting at Y=5 have the most opportunity to improve, so their mean change will be large. A treatment that improves pain scores by an average of one point may average a two point improvement for patients for whom any improvement is possible. Stating mean changes out of context of the baseline state can be meaningless.
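A tiny simulation of the floor effect: the identical hypothetical treatment process (improve each patient by up to two categories, with a floor at Y=1) yields very different mean changes depending only on the baseline mix:

```python
import random

random.seed(5)

def follow_up(y0):
    # hypothetical treatment: improve by up to 2 categories, floor at Y=1
    return max(1, y0 - 2)

results = {}
for name, weights in [("mostly mild baselines",   [0.6, 0.2, 0.1, 0.05, 0.05]),
                      ("mostly severe baselines", [0.05, 0.05, 0.1, 0.2, 0.6])]:
    sample = random.choices([1, 2, 3, 4, 5], weights=weights, k=10_000)
    results[name] = sum(follow_up(y) - y for y in sample) / len(sample)
    print(f"{name}: mean change = {results[name]:+.2f}")
```

Same treatment, same within-patient process, yet the "mean improvement" differs by more than a point between the two populations; the summary is a property of the baseline distribution as much as of the treatment.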

The NEJM paper Treatment of Endometriosis-Associated Pain with Elagolix, an Oral GnRH Antagonist by Hugh Taylor et al is based on a disastrous set of analyses, combining all the problems above. The authors computed change from baseline on variables that do not have the correct properties for subtraction, engaged in dichotomania by doing responder analysis, and in addition used last observation carried forward to handle dropouts. A proper analysis would have been a longitudinal analysis using all available data that avoided imputation of post-dropout values and used raw measurements as the responses. Most importantly, the twin clinical trials randomized 872 women, and had proper analyses been done the required sample size to achieve the same power would have been far less. Besides the ethical issue of randomizing an unnecessarily large number of women to inferior treatment, the approach used by the investigators maximized the cost of these positive trials.

The NEJM paper Oral Glucocorticoid–Sparing Effect of Benralizumab in Severe Asthma by Parameswaran Nair et al not only takes the problematic approach of using change scores from baseline in a parallel group design but they used percent change from baseline as the raw data in the analysis. This is an asymmetric measure for which arithmetic doesn't work. For example, suppose that one patient increases from 1 to 2 and another decreases from 2 to 1. The corresponding percent changes are 100% and -50%. The overall summary should be 0% change, not +25% as found by taking the simple average. Doing arithmetic on percent change can essentially involve adding ratios; ratios that are not proportions are never added; they are multiplied. What was needed was an analysis of covariance of raw oral glucocorticoid dose values adjusted for baseline after taking an appropriate transformation of dose, or using a more robust transformation-invariant ordinal semi-parametric model on the raw follow-up doses (e.g., proportional odds model).
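The arithmetic is easy to verify: averaging the two percent changes gives +25%, while averaging on the log-ratio scale, where ratios legitimately combine, gives the correct 0%:

```python
import math

pairs = [(1, 2), (2, 1)]                 # (baseline, follow-up) for two patients
pct = [100 * (post - pre) / pre for pre, post in pairs]
naive_mean = sum(pct) / len(pct)         # mean of +100% and -50%

# combine on the log-ratio scale, then convert back to a percent change
log_ratios = [math.log(post / pre) for pre, post in pairs]
geo_change = 100 * (math.exp(sum(log_ratios) / len(log_ratios)) - 1)

print(f"mean of percent changes:    {naive_mean:+.0f}%")
print(f"change from mean log ratio: {geo_change:+.0f}%")
```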

In Trial of Cannabidiol for Drug-Resistant Seizures in the Dravet Syndrome in NEJM 2017-05-25, Orrin Devinsky et al take seizure frequency, which might have a nice distribution such as the Poisson, and compute its change from baseline, which is likely to have a hard-to-model distribution. Once again, authors failed to recognize that the purpose of a parallel group design is to compare the parallel groups. Then the authors engaged in improper subtraction, improper use of percent change, dichotomania, and loss of statistical power simultaneously: "The percentage of patients who had at least a 50% reduction in convulsive-seizure frequency was 43% with cannabidiol and 27% with placebo (odds ratio, 2.00; 95% CI, 0.93 to 4.30; P=0.08)." The authors went on to analyze the change in a discrete ordinal scale, where change (subtraction) cannot have a meaning independent of the starting point at baseline.

Troponins (T) are myocardial proteins that are released when the heart is damaged. A high-sensitivity T assay is a high-information cardiac biomarker used to diagnose myocardial infarction and to assess prognosis. I have been hoping to find a well-designed study with standardized serially measured T that is optimally analyzed, to provide answers to the following questions:

- What is the shape of the relationship between the latest T measurement and time until a clinical endpoint?
- How does one use a continuous T to estimate risk?
- If T were measured previously, does the previous measurement add any predictive information to the current T?
- If both the earlier and current T measurement are needed to predict outcome, how should they be combined? Is what's important the difference of the two? Is it the ratio? Is it the difference in square roots of T?
- Is the 99^{th} percentile of T for normal subjects useful as a prognostic threshold?

The *Circulation* paper Serial Measurement of High-Sensitivity Troponin I and Cardiovascular Outcomes in Patients With Type 2 Diabetes Mellitus in the EXAMINE Trial by Matthew Cavender et al was based on a well-designed cardiovascular safety study of diabetes in which uniformly measured high-sensitivity troponin I measurements were made at baseline and six months after randomization to the diabetes drug alogliptin. [Note: I was on the DSMB for this study.] The authors nicely envisioned a landmark analysis based on six-month survivors. But instead of providing answers to the questions above, the authors engaged in dichotomania and never checked whether changes in T or changes in log T possessed the appropriate properties to be used as a valid change score, i.e., they did not plot change in T vs. baseline T or log T ratio vs. baseline T and demonstrate a flat-line relationship. Their statistical analysis used methods from 50 years ago, even including the notorious "test for trend" that tests for a linear correlation between an outcome and an integer category interval number. The authors seem to be unaware of the many flexible tools developed (especially starting in the mid-1980s) for statistical modeling that would answer the questions posed above.

Cavender et al. stratified T into <1.9 ng/L, 1.9-<10 ng/L, 10-<26 ng/L, and ≥26 ng/L. Fully 1/2 of the patients were in the second interval. Except for the first interval (T below the lower detection limit), the groups are heterogeneous with regard to outcome risks. And there are no data from this study or previous studies that validate these cutpoints. To validate them, the relationship between T and outcome risk would have to be shown to be discontinuous at the cutpoints, and flat between them.

From their paper we still don't know how to use T continuously, and we don't know whether baseline T is informative once a clinician has obtained an updated T. The inclusion of a 3-D block diagram in the supplemental material is symptomatic of the data presentation problems in this paper.

It's not as though T hasn't been analyzed correctly. In a 1996 NEJM paper, Ohman et al used a nonparametric smoother to estimate the continuous relationship between T and 30-day risk. Instead, Cavender, et al created arbitrary heterogeneous intervals of both baseline and 6m T, then created various arbitrary ways to look at change from baseline and its relationship to risk.

An analysis that would have answered my questions would have been to

- Fit a standard Cox proportional hazards time-to-event model with the usual baseline characteristics
- Add to this model a tensor spline in the baseline and 6m T levels, i.e., a smooth 3-D relationship between baseline T, 6m T, and log hazard, allowing for interaction, and restricting the 3-D surface to be smooth. See for example BBR Figure 4.23. One can do this by using restricted cubic splines in both T's and by computing cross-products of these terms for the interactions. By fitting a flexible smooth surface, the data would be able to speak for themselves without imposing linearity or additivity assumptions and without assuming that change or change in log T is how these variables combine.
- Do a formal test of whether baseline T (as either a main effect or as an effect modifier of the 6m T effect, i.e., interaction effect) is associated with outcome when controlling for 6m T and ordinary baseline variables
- Quantify the prognostic value added by baseline T by computing the fraction of likelihood ratio chi-square due to both T's combined that is explained by baseline T. Do likewise to show the added value of 6m T. Details about these methods may be found in *Regression Modeling Strategies*, 2^{nd} edition
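A sketch of the basis construction in the second step, on simulated troponin values (knot count, knot placement, and the distributions are all illustrative): restricted cubic splines in each log measurement, plus their cross-products for the smooth interaction surface. The resulting matrix would then be handed to a Cox model routine:

```python
import numpy as np

rng = np.random.default_rng(6)

def rcs_basis(x, knots):
    # restricted cubic spline basis: linear term + (k-2) nonlinear terms
    t = np.asarray(knots, dtype=float)
    d = (t[-1] - t[0]) ** 2
    cols = [x]
    for j in range(len(t) - 2):
        cols.append((np.maximum(x - t[j], 0) ** 3
                     - np.maximum(x - t[-2], 0) ** 3 * (t[-1] - t[j]) / (t[-1] - t[-2])
                     + np.maximum(x - t[-1], 0) ** 3 * (t[-2] - t[j]) / (t[-1] - t[-2])) / d)
    return np.column_stack(cols)

# simulated baseline and 6-month troponin, analyzed on the log scale
base_t = rng.lognormal(1.5, 1.0, 500)
six_t  = rng.lognormal(1.5, 1.0, 500)

Xb = rcs_basis(np.log(base_t), np.quantile(np.log(base_t), [0.1, 0.5, 0.9]))
X6 = rcs_basis(np.log(six_t),  np.quantile(np.log(six_t),  [0.1, 0.5, 0.9]))

# cross-products of the two spline bases form the tensor interaction
inter = np.column_stack([Xb[:, i] * X6[:, j]
                         for i in range(Xb.shape[1]) for j in range(X6.shape[1])])
X = np.column_stack([Xb, X6, inter])   # main effects + smooth interaction
print("design matrix columns:", X.shape[1])   # 2 + 2 + 4 = 8
```

The chunk test for whether baseline T matters at all is then a composite test of every column involving `Xb`, i.e., its two main-effect columns plus the four interaction columns.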

As long as continuous markers are categorized, clinicians are going to get suboptimal risk prediction and are going to find that more markers need to be added to the model to recover the information lost by categorizing the original markers. They will also continue to be surprised that other researchers find different "cutpoints", not realizing that when things don't exist, people will forever argue about their manifestations.

### Improper Subgrouping

The JAMA Internal Medicine paper Effect of Statin Treatment vs Usual Care on Primary Cardiovascular Prevention Among Older Adults by Benjamin Han et al makes the classic statistical error of attempting to learn about differences in treatment effectiveness by subgrouping rather than by correctly modeling interactions. They compounded the error by not adjusting for covariates when comparing treatments in the subgroups, and even worse, by subgrouping on a variable for which grouping is ill-defined and information-losing: age. They used age intervals of 65-74 and 75+. A proper analysis would have been, for example, modeling age as a smooth nonlinear function (e.g., using a restricted cubic spline) and interacting this function with treatment to allow for a high-resolution, non-arbitrary analysis that allows for nonlinear interaction. Results could be displayed by showing the estimated treatment hazard ratio and confidence bands (y-axis) vs. continuous age (x-axis). The authors' analysis avoids the question of a dose-response relationship between age and treatment effect. A full strategy for interaction modeling for assessing heterogeneity of treatment effect (AKA *precision medicine*) may be found in the analysis of covariance chapter in Biostatistics for Biomedical Research.

To make matters worse, the above paper included patients with a sharp cutoff of 65 years of age as the lower limit. How much more informative it would have been to have a linearly increasing (in age) enrollment function that reaches a probability of 1.0 at 65y. Assuming that something magic happens at age 65 with regard to cholesterol reduction is undoubtedly a mistake.

### Serial Data and Response Trajectories

Serial data (aka longitudinal data) with multiple follow-up assessments per patient present special challenges and opportunities. My preferred analysis strategy uses full likelihood or Bayesian continuous-time analysis, using generalized least squares or mixed effects models. This allows each patient to have different measurement times, analysis of the data using actual days since randomization instead of clinic visit number, and non-random dropouts as long as the missing data are missing at random. Missing at random here means that, given the baseline variables and the previous follow-up measurements, the current measurement is missing completely at random. Imputation is not needed.
In the *Hypertension* July 2017 article Heterogeneity in Early Responses in ALLHAT (Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial) by Sanket Dhruva et al, the authors did an advanced statistical analysis that is a level above that of the papers discussed elsewhere in this article. However, their claim of avoiding dichotomania is unfounded. The authors were primarily interested in the relationship between blood pressures measured at randomization, 1m, 3m, 6m with post-6m outcomes, and they correctly envisioned the analysis as a landmark analysis of patients who were event-free at 6m. They did a careful cluster analysis of blood pressure trajectories from 0-6m. But their chosen method assumes that the variety of trajectories falls into two simple homogeneous trajectory classes (immediate responders and all others). Trajectories of continuous measurements, like the continuous measurements themselves, rarely fall into discrete categories with shape and level homogeneity within the categories. The analyses would in my opinion have been better, and would have been simpler, had everything been considered on a continuum.

With landmark analysis we now have 4 baseline measurements: the new baseline (previously called the 6m blood pressure) and 3 historical measurements. One can use these as 4 covariates to predict time until clinical post-6m outcome using a standard time-to-event model such as the Cox proportional hazards model. In doing so, we are estimating the prognosis associated with every possible trajectory and we can solve for the trajectory that yields the best outcome. We can also do a formal statistical test for whether the trajectories can be summarized more simply than with a 4-dimensional construct, e.g., whether the final blood pressure contains all the prognostic information. Besides specifying the model with baseline covariates (in addition to other original baseline covariates), one also has the option of creating a tall and thin dataset with 4 records per patient (if correlations are accounted for, e.g., cluster sandwich or cluster bootstrap covariance estimates) and modeling outcome using updated covariates and possible interactions with time to allow for time-varying blood pressure effects.

A logistic regression trick described in my book *Regression Modeling Strategies* comes in handy for modeling how baseline characteristics such as sex, age, or randomized treatment relate to the trajectories. Here one predicts the baseline variable of interest using the four blood pressures. By studying the 4 regression coefficients one can see exactly how the trajectories differ between patients grouped by the baseline variable. This includes studying differences in trajectories by treatment with no dichotomization. For example, if a composite (chunk) test of the 4 blood pressure coefficients in the logistic model predicting treatment is significant, the reverse association also holds: treatment is associated with one or more of the blood pressures. Suppose for example that the 4 d.f. test demonstrates some association, the 1 d.f. test for the first blood pressure is very significant, and the 3 d.f. test for the last 3 blood pressures is not. This would be interpreted as the treatment having an early effect that wears off shortly thereafter. [For this particular study, with the first measurement being made pre-randomization, such a result would indicate failure of randomization and no blood-pressure response to treatment of any kind.] Were the 4 regression coefficients to be negative and in descending order, this would indicate a progressive reduction in blood pressure due to treatment.
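A simulated sketch of this trick and its chunk tests, in which treatment lowers the three post-randomization pressures progressively and leaves the pre-randomization pressure alone (all effect sizes invented). A logistic model predicting treatment from the four pressures is fit by Newton/IRLS, and likelihood-ratio chi-squares come from deviance differences between nested fits:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
tx = rng.integers(0, 2, n)
# pre-randomization pressure unaffected; months 1, 3, 6 lowered progressively
bp = np.column_stack([130 + rng.normal(0, 10, n)]
                     + [130 - 4 * j * tx + rng.normal(0, 10, n) for j in (1, 2, 3)])

def deviance(X, y, iters=25):
    """-2 log likelihood of a logistic model fit by Newton/IRLS."""
    Xc = X - X.mean(axis=0) if X.size else X   # center for numerical stability
    X1 = np.column_stack([np.ones(len(y)), Xc])
    beta = np.zeros(X1.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X1 @ beta))
        beta += np.linalg.solve(X1.T * (p * (1 - p)) @ X1, X1.T @ (y - p))
    p = 1 / (1 + np.exp(-X1 @ beta))
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

d_full  = deviance(bp, tx)                # all 4 pressures
d_first = deviance(bp[:, :1], tx)         # pre-randomization pressure only
d_null  = deviance(np.empty((n, 0)), tx)  # intercept only

lr_overall = d_null - d_full              # 4 d.f. composite (chunk) test
lr_last3   = d_first - d_full             # 3 d.f. chunk test, post-randomization
lr_first   = d_null - d_first             # 1 d.f. test for the first pressure
print(f"4 d.f. LR chi-square (any association):    {lr_overall:.1f}")
print(f"3 d.f. LR chi-square (post-randomization): {lr_last3:.1f}")
print(f"1 d.f. LR chi-square (pre-randomization):  {lr_first:.1f}")
```

In this simulation the 4 d.f. and 3 d.f. chi-squares are large while the pre-randomization pressure contributes almost nothing, the signature of a genuine post-randomization treatment effect rather than a randomization failure.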

Returning to the originally stated preferred analysis when blood pressure is the outcome of interest (and not time to clinical events), one can use generalized least squares to predict the longitudinal blood pressure trends from treatment. This will be more efficient and also allows one to adjust for baseline variables other than treatment. It would probably be best to make the original baseline blood pressure a baseline variable and to have 3 serial measurements in the longitudinal model. Time would usually be modeled continuously (e.g., using a restricted cubic spline function). But in the Dhruva article the measurements were made at a small number of discrete times, so time could be considered a categorical variable with 3 levels.
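For the continuous-time option, a restricted cubic spline basis for time can be constructed directly. This sketch follows the truncated-power parameterization used in *Regression Modeling Strategies*; the knot placements are arbitrary illustrations:

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (linear in the tails): returns x plus
    k-2 nonlinear terms, truncated-power parameterization."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    t1, tkm1, tk = t[0], t[-2], t[-1]
    cube = lambda u: np.maximum(u, 0.0) ** 3
    cols = [x]
    for tj in t[:-2]:
        term = (cube(x - tj)
                - cube(x - tkm1) * (tk - tj) / (tk - tkm1)
                + cube(x - tk) * (tkm1 - tj) / (tk - tkm1))
        cols.append(term / (tk - t1) ** 2)   # conventional normalization
    return np.column_stack(cols)

months = np.linspace(0, 12, 100)
X = rcs_basis(months, knots=[1, 4, 8, 11])   # 4 knots -> 3 columns
```

By construction the fitted function is linear beyond the last knot, which is what distinguishes the restricted spline from an ordinary cubic spline.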

I have had misgivings for many years about the quality of statistical methods used by the Channing Lab at Harvard, as well as misgivings about the quality of nutritional epidemiology research in general. My misgivings were again confirmed in the 2017-07-13 NEJM publication Association of Changes in Diet Quality with Total and Cause-Specific Mortality by Mercedes Sotos-Prieto et al. There are the usual concerns about confounding and possible alternate explanations, which the authors did not fully deal with (and why did the authors not include an analysis that characterized which types of subjects tended to have changes in their dietary quality?). But this paper manages to combine dichotomania with probably improper change score analysis and hard-to-interpret results. It started off as a nicely conceptualized landmark analysis in which dietary quality scores were measured during both an 8-year and a 16-year period, and these scores were related to total and cause-specific mortality following those landmark periods. But then things went seriously wrong. The authors computed change in diet scores from the start to the end of the qualification period, did not validate that these are proper change scores (see above for more about that), and engaged in percentiling as if the number of neighbors with worse diets than you is what predicts your mortality rather than the absolute quality of your own diet. They then grouped the changes into quintile groups without justification, and examined change score quintile group effects in Cox time-to-event models. It is telling that the baseline dietary scores varied greatly over the change quintiles. The authors emphasized the 20-percentile increase in each score when interpreting results. What does that mean? How is it related to absolute diet quality scores?

The high quality dataset available to the authors could have been used to answer real questions of interest using statistical analyses that did not have hidden assumptions. From their analyses we have no idea of how the subjects' diet trajectories affected mortality, or indeed whether the change in diet quality was as important as the most recent diet quality for the subject, ignoring how the subject arrived at that point at the end of the qualification period. What would be an informative analysis? Start with the simpler one: use a smooth tensor spline interaction surface to estimate relative log hazard of mortality, and construct a 3-D plot with initial diet quality on the x-axis, final (landmark) diet quality on the y-axis, and relative log hazard on the z-axis. Then the more in-depth modeling analysis can be done in which one uses multiple measures of diet quality over time and relates the trajectory (its shape, average level, etc.) to the hazard of death. Suppose that absolute diet quality was measured at four baseline points. These four variables could be related to outcome and one could solve for the trajectory that was associated with the lowest mortality.
For a study that is almost wholly statistical, it is a shame that modern statistical methods appeared to not even be considered. And for heaven's sake **analyze the raw diet scales and do not percentile them**.
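A tiny simulation of why percentiling misleads (all numbers hypothetical): the same absolute diet-quality score corresponds to very different percentiles in two cohorts whose overall diets differ.

```python
import numpy as np

rng = np.random.default_rng(7)
cohort_a = rng.normal(50, 10, 10000)   # hypothetical diet-quality scores
cohort_b = rng.normal(60, 10, 10000)   # a cohort that simply eats better overall

def percentile_of(score, sample):
    """Percentile rank of an absolute score within a cohort."""
    return 100 * np.mean(sample <= score)

same_score = 55.0                          # one absolute level of diet quality
pa = percentile_of(same_score, cohort_a)   # high percentile in cohort A
pb = percentile_of(same_score, cohort_b)   # low percentile in cohort B
```

If mortality depends on absolute diet quality, a percentile-based analysis assigns the same biology two very different "exposures" depending on the neighbors.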

The study was severely underpowered even for the optimistic targeted effect size. What I find puzzling is that they originally expected 201 events from 760 subjects. At an interim point of 600 enrolled, they adjusted the target enrollment to 880. How many endpoints they had at the interim time point is not known, but it had to be fewer than 105 (the final number). Why bother adjusting from 760 to 880 if you are observing fewer than half the expected number of events? I am not sure what I would have recommended had I been on the data monitoring committee; maybe terminate immediately.

Trial sequential analysis of one trial comparing levosimendan versus placebo on the primary endpoint in patients with left ventricular dysfunction undergoing cardiac surgery.

Trial sequential analysis of one trial of levosimendan versus placebo in patients with left ventricular dysfunction undergoing cardiac surgery, based on the diversity-adjusted required information size (DARIS) of 2981 patients. This DARIS was calculated based upon a proportion of patients with the low cardiac output syndrome after cardiac surgery of 24.1% in the control group; a RRR of 20% in the experimental intervention group; an alpha (α) of 5%; and a beta (β) of 20%. The cumulative Z-curve (blue line) did not cross the conventional alpha of 5% (green line) after one trial, implying that the result may be due to random error. The cumulative Z-curve did not reach the futility area, which is not even drawn by the program. Presently, only 28.5% (849/2981) of the DARIS has been obtained. Had we calculated the DARIS based on a more realistic RRR such as <20%, the obtained evidence would represent a much smaller part of the DARIS. There is a need for more adequately powered randomised clinical trials before reliable conclusions can be drawn.

Figure is available (arturo.marti.carvajal@gmail.com)

Sequential analysis in clinical trials is very important but for this particular trial we can make the needed points by mainly considering the final full-data analysis.

This is a great idea. I'd love to see some sort of database collecting studies with poor methodology. I think this would be very valuable as a teaching resource.

They are so easy to find it's hard to know where to start :-)

I'd like to add that often, even with a significant effect, not much is learned, because the noise makes it so that the effect may range from plausible to implausible, e.g., d = 1.0, 95% CI [0.01, 2.0].

Yes, you are right! But I see that everything is about magnitude; I mean, the true difference is the magnitude. If I see a difference and it is small, it doesn't seem to matter, does it? If the difference is stronger, then it does matter. That's the problem and the solution. Only the big differences matter!

That's the solution: ignore the small differences.

But don't equate estimated differences with true differences.

I too wish that the quality of reporting was better, and I think that Professor Harrell's wider work certainly improves the quality of research.

However, I wonder, given the ubiquity of poorly analysed and reported research, if it is useful to single out particular papers. Researchers who would be embarrassed to see their work featured on this blog (much like they would be embarrassed to be featured as an example of poor-practice on BBC Radio 4's "More or Less") are likely to have agreed to a presentation of results either from weariness of battling with collaborators, a lack of power within the collaboration and/or job pressures.

Researchers who are ignorant of how to usefully interpret a confidence interval will not recognise the validity of these criticisms. We only need to look in the letters pages to see how authors respond to valid criticisms.

Hopefully the moves towards reproducible research, code sharing and open data will help with research quality.

Nevertheless, surely the fundamental problem is the way in which academics are assessed for career progression. As Richard Smith argued when he spoke on this issue at an International Epidemiology Journal conference (I hope I am remembering his talk properly), universities are not fulfilling their duty to judge and maintain the quality of the work their staff are doing, and are instead outsourcing this responsibility to journal editors.

David McAllister

Wellcome Trust Intermediate Clinical Fellow and Beit Fellow,

University of Glasgow

David these are great comments and they open up a legitimate debate (which I hope we can continue) about the value and appropriateness of singling out individual papers and authors. Part of my motivation is based on the fact that in addition to universities not fully meeting their duty, journals are very much to blame and need to be embarrassed. Very few journals simultaneously get the points that (1) statistical design has a lot to do with study quality and efficiency and (2) statistical interpretation is nuanced and is clouded by the frequentist approach. I am doing this for journal editors and reviewers as much as for authors, and for pointing out limitations of translating study findings to practice. More of your thoughts, and the thoughts of others, welcomed.

It is helpful for me to see actual examples as opposed to a generic criticism of under-powered studies. I am guilty of allowing researchers to say similar things to the NEJM article in the past. This example forces me to examine my role as a consulting statistician. I hope he will continue to point out problems using real examples, and also offer potential remedies.

We should definitely develop a proposal for an optimum way of reporting frequentist inferential results. Overall the solutions are (1) appropriate interpretations, (2) designing studies for precision, and (3) making studies simpler and larger. Regarding (1), confidence intervals should be emphasized no matter what the p-value. I too have been involved in many, many underpowered studies. They usually end up in second-line journals. I am especially concerned when first-line journals don't do their job.

Hi Frank,

Thanks for your very open-minded reply. I take the point about journals and agree that they do need to be embarrassed.

I have a perhaps unrealistic suggestion on this issue which it would be great to hear your views on.

I think that peer review should be separated into two stages, a methods stage and a results-and-discussion stage. The introduction, the methods section, and those results that measure study robustness (e.g., total number of participants, loss to follow-up, etc.) should be reviewed and given a score by each reviewer reflecting the quality of the methods. Only after submitting this score should reviewers be able to access the full paper. Journals ought to report the methods-quality scores for every published paper, as well as some metric summarising the quality of all of their published papers, and of those they reject.

I realise that there are a lot of issues around definition and measurement, but at least the within-journal comparison between accepted and rejected manuscripts would be illuminating.

best wishes,

David

David I think there is tremendous merit in this approach. It raises the question of exactly who reviewers should be. A related issue has been proposed by others: have all medical papers be reviewed without the results. Results bias reviews of methods. The likely reluctance of journals to adopt this approach will reveal the true motives of many journals. They stand not as much for science as they stand for readership and advertising dollars.


Interesting post. However, I still wonder what would be an appropriate wording / reporting of results. Take your example: "Levosimendan was not associated..." - would it be better to add "significantly" here (Levosimendan was not significantly associated...)? I mean, there is definitely an effect; it's just that the authors cannot say in which direction.

Another question: you mentioned in your comments to focus on CIs instead of p-values. Regarding the problems of CIs, especially for mixed models, would it not be better to focus even more on the standard error than on confidence intervals, because standard errors are more "robust" than CIs (which assume a specific distribution for the test statistic)? But on this point I'm not sure, because I'm not a statistician.

I don't get anything out of 'significantly'. Bayesian posterior probabilities of similarity, efficacy, and harm are to me the ultimate solutions. Within the frequentist world I suggest wording such as in this example: The Wilcoxon two-sample test yielded P=0.09 with 0.95 confidence limits for odds ratios from the proportional odds model of [0.7, 1.1]. Thus we were unable to assemble strong evidence against the null hypothesis of no treatment effect. A larger sample size might have been able to do so. The data are consistent with a reduction in odds of a lower level of response in treatment B by a factor of 0.7 as well as an increase by a factor of 1.1 with 0.95 confidence.
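Under a normal approximation on the log odds ratio and a flat prior (an illustration only, reconstructed from the 0.95 CI [0.7, 1.1] in the wording above), the posterior probability of similarity can be computed directly:

```python
import numpy as np
from scipy import stats

# Reconstruct the normal approximation on the log odds ratio from the
# 0.95 CI [0.7, 1.1] quoted in the example (flat prior; illustration only)
lo, hi = np.log(0.7), np.log(1.1)
mean = (lo + hi) / 2
se = (hi - lo) / (2 * 1.96)

# Posterior probability of similarity: Prob(0.85 < OR < 1/0.85)
a, b = np.log(0.85), np.log(1 / 0.85)
p_similar = stats.norm.cdf(b, mean, se) - stats.norm.cdf(a, mean, se)
```

Here the probability comes out around 0.6: far from strong enough evidence to conclude that the treatments are similar, despite the "non-significant" p-value.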

One other thought for a study with very low information yield, e.g., the confidence interval for a hazard ratio is [0.25, 4.0]. Valid wording of a conclusion might be "The overall mortality observed in this study (both treatment groups combined) was 0.03 at 2 years. This will be useful in planning a future study that is useful, unlike this one. This study provides no information about the relative efficacy of the two treatments."

Thanks! I prefer also reporting non-significant results, because in my opinion these are not meaningless - but it's not easy to convince co-authors sometimes. Their argument: focusing on significant results makes the argumentation clearer to the reader (thus increasing readability).

The study was registered at ClinicalTrials.gov

https://clinicaltrials.gov/ct2/show/record/NCT02025621

The study pre-specified a difference of medical relevance - a 35% reduction in the odds ratio - and type I and type II error rates (1% and 20% respectively).

http://www.sciencedirect.com/science/article/pii/S0002870316301843

"Statistical power and sample size considerations

The sample size is based on an assumed composite primary end point event rate (death, MI, dialysis, mechanical assist) of 32% for placebo, a 35% relative reduction for levosimendan (20.8% event rate at 30 days), a significance level of .01, and 80% power. A total sample size of 760 should provide 201 events."
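As a quick check of the quoted arithmetic (applying the 35% relative reduction to the placebo event rate, as the quote does, and assuming 1:1 allocation):

```python
# Arithmetic check of the quoted design assumptions
placebo_rate = 0.32
levo_rate = placebo_rate * (1 - 0.35)                   # 0.208, as quoted
expected_events = 760 / 2 * (placebo_rate + levo_rate)  # about 201 events
```

The 20.8% event rate and the 201 expected events both follow directly from the stated 32% rate and 35% relative reduction.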

They report measured outcome estimates and confidence intervals, in addition to p-values, as recommended in the ASA statement on p-values.

The trial finding was a null result, lack of statistical significance for a clinical trial powered to detect an a-priori stated difference of medical relevance, and the finding was published in a top tier journal. A null result, published!

How many drugs that appeared promising in early tests and small trials end up failing to show useful effect when studied in a properly sized well controlled trial? This drug appears to be another example of this phenomenon. The review paper cited in this paper (Landoni et al., Crit Care Med 2012; 40:634–646) shows multiple small studies demonstrating a "significant" effect - now there are some papers with problems. Even the meta-analysis in the Landoni paper suggests a relative risk of 0.80, 95%CI (0.72, 0.89) which did not appear to be the case in this large clinical trial.

Given all the a-priori specifications we can conclude that with 80% confidence, this drug is not doing what it was purported to do. Now we can argue the merits of setting the type II error rate to 20% when the type I error rate is 5% or 1%, or argue whether 35% reduction in odds ratio is too much or not enough of a medically relevant difference, but nonetheless we can conclude with 80% confidence based on this large trial that this drug is not doing much relative to the stated difference of medical relevance. How is the evidence provided by this trial not an improvement over all the little studies done previously, with who knows how many of which never came to publication so as to be available for the meta-analysis discussed above?

So many things went right here, so I fail to see why this is a poor example. I have seen many poorly reported study findings, far poorer than this effort. I am surprised that for you this is "One of the most problematic examples I've seen".

Steven you are right that this study is better done than many studies and it is good to see 'negative' studies published (only the more expensive multi-center clinical trials tend to be published when 'negative'). But why did you omit the confidence interval of [0.66, 1.54] from your comment? That is the most important piece of frequentist evidence reported in the paper. The confidence interval tells all. We know little more after the study than we did before the study about the relative efficacy. Notice that the trial was designed to detect a whopping 35% reduction and the lower confidence limit was only a 34% reduction. A big part of the issue in this particular result is that the investigators thought that type I error was 20 times more important than type II error. How does that make any sense? After a study is completed, the error rates are not relevant and the data are. Note also that we are not 80% confident that the drug is not doing much. Not only are the data consistent with a 30% reduction in the odds of a primary event, but the power is not relevant in the calculation of this probability. What would be needed for you to make that statement is a Bayesian posterior probability of non-efficacy (one minus the probability of efficacy). On the non-relevance of error rates see papers by Blume, Royall, etc. One statistician (I wish I remembered to whom this should be attributed) gave this analogy: a frequentist judge is one who brags that in the history of her court she has convicted only 0.03 of innocent defendants. Judges are supposed to maximize the probability of making correct decisions for individual defendants. Long-run operating characteristics are not useful in interpreting results once they are in. A side issue is that confidence intervals, though having a formal definition that will seem non-useful to most, have the nice property that they are equally relevant whether or not the p-value is 'significant'.
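The last point can be sketched numerically. Under a normal approximation on the log hazard ratio reconstructed from the reported CI [0.66, 1.54], and a skeptical prior whose SD of 0.35 is purely an assumption for illustration (not from the trial protocol), the posterior probability of non-efficacy is nowhere near 0.8:

```python
import numpy as np
from scipy import stats

# Normal approximation on the log hazard ratio from the reported CI
lo, hi = np.log(0.66), np.log(1.54)
est = (lo + hi) / 2
se = (hi - lo) / (2 * 1.96)

# Skeptical prior: log HR ~ N(0, tau^2); tau = 0.35 is an assumption
tau = 0.35
post_var = 1 / (1 / se**2 + 1 / tau**2)     # conjugate normal update
post_mean = post_var * est / se**2
p_efficacy = stats.norm.cdf(0, post_mean, np.sqrt(post_var))  # Prob(HR < 1)
p_non_efficacy = 1 - p_efficacy
```

Both probabilities come out near 0.5: the trial leaves us close to where we started, which is exactly what the wide confidence interval is telling us.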

"A big part of the issue in this particular result is that the investigators thought that type I error was 20 times more important than type II error. How does that make any sense?"

Type I error: We declare the drug to be effective when in fact it is not.

Type II error: We declare the drug to be not effective when in fact it is.

The onus is on the drug developers to demonstrate a marked effect for the new drug. Declaring a sugar pill to be something marvellous is expensive. How much money are we spending right now on this drug? How much does a course of this drug cost? If it is not doing anything, we are wasting money that could be better spent on other treatments with more efficacious outcome. We are also wasting money and resources treating people for side effects for a drug that isn't offering much help. Adding more placebos to our formulary is bloating our health care budget unnecessarily. It yields more profits for drug companies, which they love, but yields a health care system that is a cancer on the national economy. So at a time when health care costs are out of control, we need tough tests of treatment effectiveness. We need to trim out any treatment that isn't showing a large benefit.

Declaring a drug to be not effective when in fact it is, is of course tragic. The cost then is in quality of life lost, and length of life lost. But the drug ought to yield a substantial benefit, and substantial benefits are not that difficult to demonstrate in large trials. Treatments should add years to lives, and improve quality of life substantially, which is why a "whopping 35% reduction" is not an unreasonable outcome to expect for a drug such as this one.

Look at Tables 3 and 4 of the Mehta et al. paper in question here. In measure after measure, on hundreds of patients, this drug shows little if any effect. This drug has been around for nearly 20 years so there's been plenty of time to figure out which patients should benefit from this drug, yet this large clinical trial shows precious little.

We have plenty of drugs and treatments that are cheap and bring years of high quality extra life to most patients, with many patients denied such access around the world. We would do far better to make that happen than the minor, if any, benefits that this drug is showing. This is one line of reasoning wherein type I errors are considerably more expensive than type II errors.

You are right that the totality of evidence needs to be considered. Frequentist methods don't help very much in that regard. I am concentrating on the primary endpoint for simplicity. I disagree that a 35% effect is realistic for powering a study. When investigators choose a low-power binary endpoint they have the obligation of finding the money to make that endpoint 'work' by enrolling a much larger number of patients. Relative efficacies of 15% or 25% are commonly used in power calculations, and we need at least 0.9 power, not 0.8. Many papers have been written showing that for common disorders, relative effects smaller than 15% will still result in large public health benefits. And your reasoning to justify type II errors being 20 times less important than type I is not compelling. You've only justified that type II errors may be allowed to be larger than type I errors. I feel that too many resources are wasted by studies being launched on "a wing and a prayer."

Hi Frank,

What do you think about the interpretation of reported treatment-covariate interactions?

This is a big interest of mine as I am doing a 4-year project looking at heterogeneity of treatment effects (on various scales) according to the presence or absence of comorbidities.

I think that while the dangers of sub-group reporting are appreciated by most, the potential that some treatments have lower efficacy (on some relative scale) is largely disregarded.

Moreover, there is a tendency to report stratum-specific hazard ratios along with an NHST P-value for the interaction that is often close to one, which many readers take as evidence of no interaction. When I have used the published data to estimate confidence intervals for interactions, the lower and upper limits have been consistent with massive interactions in either direction.

best wishes,

David

Great comments David and you raise a lot of good points. Stratum-specific treatment effects are inappropriate for many reasons. Interaction effect estimation is what is needed, and such assessment must be done in the presence of aggressive main effect adjustment. I have detailed this in my BBR notes - see http://www.fharrell.com/p/blog-page.html section 13.6. I slightly disagree with your statement 'is largely disregarded'. In my experience real interactions on an appropriate relative scale are not very common.

Frank, if real interactions are not very common, what does that mean for stratified medicine, in your view?

By 'stratified medicine' I mean interactions with baseline characteristics, rather than 'responder analysis', which is obviously silly for reasons you've touched on in your discussion of change.

Thanks.

'Stratified medicine' AKA 'precision medicine' AKA 'personalized medicine' is overhyped and based on misunderstandings. Stephen Senn has written eloquently about this. There are some real differential treatment effects out there (especially in cancer and bacterial infections) but true interactions with treatment are more rare than many believe. Clinicians need to first understand the simple concept of risk magnification that always exists (e.g., sicker patients have more absolute treatment benefit, no matter what is the cause of the sickness) and why that is different from differential treatment effectiveness. Any field that changes names every couple of years is bound to be at least partly BS. See my Links page here to get to the BBR notes that go into great detail in the analysis of covariance chapter.

Thanks for taking the time to reply Frank. I feel like we (methodologists) have a bit of a conflict of interest when it comes to these medical fads. We get grants to develop 'new' precision medicine methodology, so have an interest in promoting the hype.

Jack I believe you are exactly correct. I have seen statisticians switch to 'precision medicine' and invent solutions that are inappropriate for clinical use (e.g., they require non-available information) and I have seen many more statisticians do decent work but not question their medical leaders about either clinical practice or analysis strategy. We are afflicted with two problems: timidity and wanting to profit from the funding that NIH and other agencies make available with too little methodologic peer review.

I don't think many of the dichotomisers understand what they're doing: they've collected data, only to throw a portion of it away by dichotomising continuous measurements (and then they'll throw more data away by randomly splitting their data into development and validation data). The consequences of dichotomising are there for all to see: a substantial loss of predictive accuracy - both a drop in discrimination and poor calibration (see http://bit.ly/2pfbZiw). Plenty of systematic reviews show this poor behaviour is widespread (http://bit.ly/2qkjR2x, http://bit.ly/2ptKRIN). Prediction models are typically developed without statistical input, and peer review (where, again, a statistician is often not reviewing) is clearly failing to pick this up.

BW

Gary

Spot on Gary. The avoidance of statistical expertise is one of the most irritating aspects of this. And researchers don't realize the downstream problems caused by their poor analysis, e.g., requiring more biomarkers to be measured because the information content of any one biomarker is minimized by dichotomization. They also fail to realize that cutpoints must mathematically be functions of the other predictors to lead to optimum decisions. I'm going to edit the text to point to my summary of the problems of categorization in our Author Checklist. I need to add your excellent paper to the list of references there too.

Hi Frank, I'm finding these critiques very useful and can see some immediate changes that I can make in my own work.

I had a query relating to the application of smoothing splines for subgroup analyses. I would like to try what you've described under the 'improper subgrouping' heading and estimate a non-linear interaction between age and treatment effect (which for this outcome and population is fairly likely). The difficulty is that this is from a cluster randomised trial, which means the primary analysis uses a multilevel model. I have looked but can't find a smoothing spline procedure compatible with hierarchical data.

My first instinct is to ignore the clustering and find an appropriate form with a fractional polynomial analysis, then go back and add the continuous variable in this form to the main (multi-level) model and estimate the interaction with the treatment. Do you think that there is a better way to approach this?

Thanks, Kris.

Good question Kris. Smoothing splines require special methods but regression splines don't. When the variable of interest is a subject-level continuous variable, the way you create terms for a regression spline is completely agnostic to other aspects of the model (random effects, semiparametric model, etc.). So I would add the nonlinear terms as main effects and as treatment interactions, whether you use regression splines or fractional polynomials.
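To make the 'agnostic' point concrete: the spline and interaction columns are just a design matrix, built before any model is chosen. In this sketch the variable names and knots are hypothetical, and a simple truncated-power basis stands in for whichever spline or fractional-polynomial basis one prefers:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
age = rng.uniform(40, 80, n)                  # subject-level continuous variable
treat = rng.integers(0, 2, n).astype(float)   # randomized arm

# A truncated-power cubic spline basis for age (knots are hypothetical);
# any regression-spline or fractional-polynomial basis can be swapped in.
knots = [55.0, 65.0]
spline = np.column_stack([age, age**2, age**3] +
                         [np.maximum(age - k, 0.0) ** 3 for k in knots])

# Nonlinear main effects plus treatment interactions, constructed with no
# reference to the rest of the model (the random effects live elsewhere).
X = np.column_stack([treat, spline, treat[:, None] * spline])
```

These columns can then be handed to any multilevel fitting routine as ordinary fixed-effect covariates.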

Thanks for your comment on this! I think one reason for categorising is the unreliability of dietary assessment methods and the difficulty of addressing this in analyses. Study participants can be classified into quantiles - and this is reasonably reliable; using absolute intake data (or scores) would be much more difficult.

How would you address uncertainty in exposure assessment here? There are not many publications that seem to do this (at least in nutritional epidemiology).

Best wishes,

Gunter

Errors in measurement are definitely a challenge. But what also has a large effect on the quality of research in nutritional epidemiology is the low statistical standards that have been set. The use of categorization to deal with measurement errors is a common misconception. Grouping actually makes things worse because of the loss of information and because measurement errors can put a subject in the wrong group, which is a maximal error. This demonstration helps illustrate these points: https://youtu.be/Q6snkhhTPnQ
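A minimal simulation of this point (all numbers hypothetical): dichotomizing an error-contaminated measurement at its median reduces, rather than restores, its correlation with the true exposure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
truth = rng.normal(0, 1, n)                  # true exposure
measured = truth + rng.normal(0, 0.5, n)     # exposure with measurement error

# Correlation with the truth when keeping the continuous measurement,
# versus dichotomizing the noisy measurement at its median
r_cont = np.corrcoef(truth, measured)[0, 1]
high = (measured > np.median(measured)).astype(float)
r_dich = np.corrcoef(truth, high)[0, 1]
```

Grouping compounds the measurement error with the information loss from categorization; it never cancels it.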

The work of Ray Carroll is an excellent source for formal analysis of nutritional data accounting for measurement error.

Thank you - that is very interesting. Given that categorisation is now considered to be the standard method, do you think this will change?

It is not considered to be the standard method; it is just commonly used. It is so invalid and misleading that things MUST change. Quantile groups are also impossible to interpret clinically, as I've written about in my RMS course notes (see the Links page on this blog). If researchers want to do good science and want the research to be reproducible, statistical approaches MUST change.

It's difficult to find many papers in nutritional epidemiology that don't use quantiles (at least in the field I'm working in), and so it is the method most people use. I agree that it has to be changed, but I don't think it is easy.

We don't do the right thing because it's easy. We do it because it's right.

Might I ask: were quantiles used to make analyses easier at a time of limited computing power, so that there is no need to do this anymore? At least on my data, using the approaches you suggest is no more difficult or time consuming, and comparing the results is very interesting.

My personal observation is that users of such methods tend to be afraid of algebra, and they think that categorizing a continuous variable makes fewer assumptions than just fitting a straight line (which they are comfortable with). So they tend to avoid flexible, powerful approaches such as regression splines (the ease of using splines in software packages nowadays notwithstanding). In fact categorization makes far more assumptions: that the relationship is piecewise flat, that the interval boundaries are correctly specified, and that there are discontinuities in nature at these boundaries (which we never see).

Thanks for your blog and your comments - they have given me a lot to think about (and confirmed some of my suspicions about my field).

I do think that categorization is a vestige from the days of hand calculation. I often point out that statisticians at the time understood that it was not ideal and even wrote about it--there was even a correction factor for the loss of power. At some point, that concern was forgotten.

That categorization continues is, I think, a fascinating sociology of science study. An incredible amount of rationalization goes on in the face of compelling evidence. I've given papers to colleagues, held workshops, talks, begged and pleaded one-on-one, all with apparently little effect. Fortunately, my students have taken in more of the message--and they know I'm watching!

Well put Mike. Deep down I think it is as my colleague Drew Levy says 'cognitive laziness' and its cousin is racism and racial profiling. Dichotomization saves time and cranial oxygen consumption.

I could not agree more with the examples. However, they are if anything on the milder side of the scale. For example, there are plenty of papers (like this one http://dx.doi.org/10.2337/dc09-1556) that look at a continuous marker that leads to disease diagnosis, plot it over time relative to when it exceeds the diagnostic threshold, and then claim that this proves an acceleration of the underlying disease just before people get diagnosed. I have stopped counting how many of those have by now come out in supposedly good journals... But try to get a properly done methodological paper published that discusses the problem with this kind of analysis (7 years and counting). I guess that one is still somewhat subtle. But what about combining several biased assessments and one not overly biased one with the claim that the last one fixes all the bias? You'd think that people would not make that obvious a mistake, but I have also seen a few papers like that (e.g. https://doi.org/10.2337/diacare.26.4.977; although it is perhaps not immediately obvious that the type I error rate one produces is as high as 99%).

Fantastic points. Biomarker research in general is of the same low quality as nutritional epidemiology research, and I am constantly amazed at how a researcher will create a potentially good prognostic or diagnostic marker and then proceed to destroy it by categorization, as if she actually hated the marker.

Also, concerning the blog article hitting on the "milder side": I think that is true, especially when I think of the many reviews I do for medical journals. If you think the published papers are bad, the ones that don't get published are often astonishingly bad. It is often as if the authors are allergic to statisticians and don't want them anywhere near.


Prof. Frank Harrell, thank you for your immeasurable contributions to statistical knowledge.

The "change in scores" issue is covered elsewhere, but it is quite intriguing how the authors analysed their primary outcome (a cytokine) in this RCT in JAMA today (22/08/2017). They analysed it as a mean difference between groups of a change from baseline (apologies if I am not being clear), i.e.: delta from d14 - d1 in group A = T; delta from d14 - d1 in group B = Z; they compared the difference T - Z. Their sample size calculation and SAP were also based on this "difference of changes".

Could you comment on it? Could we relate this analysis to an ANCOVA?

Many Thanks,

http://jamanetwork.com/journals/jama/article-abstract/2649189

This sort of double difference is based on a misunderstanding of what a parallel-group randomized trial really is. It assumes many things, including a linear effect of baseline with a slope of 1.0, no regression to the mean, no measurement error, etc. Simple ANCOVA would be far superior here, and would open things up to robust semiparametric methods. And any time you quote a change from baseline in a paper when the baseline value is what qualified the patient to be enrolled, you need a second baseline measurement to use in the difference calculations.

Thank you, Prof Frank Harrell!

The high profile paper published in the Lancet (the ORBITA trial) shows a very good understanding of statistical issues. The authors recognized that no truly blinded study of this medical manoeuvre had ever been done. Years of anecdotal publications litter the literature, enough to convince many who do not understand statistical issues that this trial would be unethical. What is unethical is to continue to promote ill-founded medical manoeuvres based on poorly done studies.

These authors worked hard to convince others of the errors in their thinking, and arranged for a proper blinded clinical trial. They registered their trial plan beforehand with ClinicalTrials.gov and pre-published with the Lancet. They identified a minimum difference of medical relevance (a 30 second difference between the two groups) and performed a power calculation using then-available data, which showed that 100 cases per treatment group would provide 80% power to detect such a difference.

The authors, reviewers, and editor did not allow a classic absence of evidence is not evidence of absence error to be made. In the presence of a power analysis showing adequate sample size to detect any difference larger than that of no medical relevance, a large p-value does provide sound statistical evidence that a difference of medical relevance is likely not present, i.e. that the null hypothesis is the relevant hypothesis to accept, at the stated type II error rate. This study measured a difference of 16.6 seconds between the two groups, well below the minimum difference of medical relevance that they specified a-priori. If a difference of more than 30 seconds had been the true state of affairs, 4 out of 5 such studies would have detected the difference. This study did not yield such a measurement, so their sound statistical conclusion is entirely valid: these data support the null hypothesis at the stated type II error rate.

This weekend I will host a visiting friend who is recovering from a stroke induced by a portion of a stent breaking away and lodging in his brain. My friend will travel by train, no longer being able to drive due to probably permanent damage to a portion of the visual portion of the brain necessary to process important driving-related visual cues. Placing a stent is not a benign manoeuvre, and if people are going to suffer consequences such as my friend experienced, there should be solid statistical evidence that the manoeuvre can provide substantial medical benefit, enough to outweigh the harms that the manoeuvre can also induce.

You made some excellent points Steven, and I think the ORBITA study was carried out in excellent fashion. I wrote http://www.fharrell.com/2017/11/statistical-criticism-is-easy-i-need-to.html to temper my original criticism of the article. However there are two things I really disagree with in your otherwise excellent comments. First, the authors, reviewers, and journal editor did in fact commit an "absence of evidence is not evidence of absence" error, because the abstract of their article contains the phrase "PCI did not increase exercise time by more than the effect of a placebo procedure." This is a major misinterpretation of what a large p-value means (Fisher, the inventor of p-values, said it means "get more data"). Had the article stopped with the confidence interval and not reported the p-value, all would have been well (other than the problem of computing change from baseline in exercise time). This misinterpretation of p-values (a very common one in Lancet, NEJM, and JAMA) caused a firestorm among cardiologists which could have been avoided. Secondly, consider your statement "measured a difference of 16.6 seconds between the two groups, well below the minimum difference ...". The problem with your reasoning is that you are putting faith in the point estimate of 16.6 s when in fact that estimate has a "fuzz", or margin of error. For this purpose you need to dwell on the most favorable confidence limit, which exceeds the 30 second point. You also forgot that power is not relevant after the fact, so the fact that 4 out of 5 replicate studies would have found a difference >= 30 s is completely irrelevant at this point. What we have now is the confidence interval from the study, and an entire Bayesian posterior distribution would have been even better to compute.

This study provided no evidence of a difference greater than the minimum difference of medical relevance. The statement "PCI did not increase exercise time by more than the effect of a placebo procedure" is a valid claim at the type II error rate for which this study was devised. I don't know what it means that an a-priori power analysis is not relevant after the fact. The a-priori power analysis is precisely the reason that we can make a declarative statement about a large p-value after the experimental results are in. The power analysis was not performed with the data from the trial.

In the presence of an a-priori power analysis, this is the proper interpretation of what a large p-value means. This study did not just run a test and comment willy-nilly on a resultant p-value. This study pre-specified a difference of medical relevance and performed a power analysis using similar previously available data to ensure that the type II error rate was known. These are the steps all too frequently not performed in other studies, that then leave such other studies unable to say anything about the state of affairs in the presence of large p-values, as the error rate of claims is then unknown. Without an a-priori power analysis, interpreting a large p-value as demonstrating no difference is a misinterpretation. This study did not misinterpret.

I do put faith in means; the law of large numbers and the central limit theorem suggest that such faith is well placed, and we can know the rate at which our faith is misplaced. I understand very well that there is fuzz in the data; I never forget that. The fuzz means my conclusions will sometimes be wrong. Statistics doesn't always yield the truth, but proficiently performed statistical analyses allow us to understand the rate at which we will err given the fuzz. At least we know how often we are wrong.

To insist that the ends of the confidence interval must not occur beyond the difference of medical relevance is to insist on very high power, with attendant large sample size. Could this study ever have been done if several thousand cases were required? This study revealed that indeed larger studies should in future be performed, but given the erroneous attitude in the medical community that stent placement was just obviously beneficial, obtaining ethical approval for a huge study was likely not possible.

If several thousand cases had been studied, yielding a mean difference near 15 seconds with a nice tight confidence interval ranging from say 10 seconds to 20 seconds, of course the p-value would have been quite small. That would still not permit conclusion that stent placement was superior to placebo, because the difference would still not have been medically relevant.

I maintain that a reasonable Bayesian analysis would yield a highest posterior density interval ranging from about -9 seconds to +42 seconds, as in all but pathological situations, Bayesian and frequentist approaches yield very similar findings. Since you have worked with Darrel Francis, I urge you to obtain the data from him and show us what this Bayesian analysis would have shown. The conundrum of which prior to use will of course be present. Should it be based on years of biased data and opinion?

To declare one thing better than another requires specifying a single dimension in which to measure differences. On what dimension is the Bayesian analysis better?

Steven, I'm sorry to say that you have misunderstood the underpinnings of frequentist statistical inference. There is no formal way, nor need, to embed the power into the interpretation of the results. The evidence provided by the study in the frequentist context is contained in the 0.95 confidence interval, and every value within that interval would be accepted at the 0.05 significance level if tested as a null value. And even though the sample mean converges to the true population mean as n goes to infinity, that has nothing to do with claims about its precision in a finite sample; the confidence interval does that. I stand by my statement that the editors/reviewers/authors made an unjustified 'evidence for absence' claim. And Bayesian methods are dramatically different from frequentist methods in this setting. Bayesian methods will not allow the 'evidence for absence' error: when you compute the probability of similarity or non-inferiority, that probability will not be large enough to support those assertions. The prior distribution for this could be a normal distribution that even gives the benefit of the doubt to similarity, and you'd still find the posterior probability is not high enough. When interpreting the results, you can almost pretend the observed mean is not there. By the way, a few simple simulations drawing data from the lognormal distribution will show the irrelevance of the central limit theorem here. A sample of size 50,000 is not large enough for the CLT to provide accurate confidence intervals when the data have a lognormal distribution. See for example https://stats.stackexchange.com/questions/204585
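To make the Bayesian point concrete, here is a flat-prior sketch using only the approximate interval quoted above (a 16.6 s mean difference with a 95% interval of roughly -9 to +42 s; these are round numbers from the thread, not the trial's raw data). Neither resulting probability is small enough to assert "no effect" nor large enough to assert a medically relevant one.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Approximate summary from the thread: mean difference 16.6 s,
# 95% interval roughly -9 to +42 s.
mean_diff = 16.6
se = (42.0 - (-9.0)) / (2 * 1.96)  # back out the standard error from the width

# With a flat prior the posterior for the true difference is ~ Normal(mean, se):
p_any_benefit = 1 - phi((0.0 - mean_diff) / se)   # P(difference > 0)
p_relevant = 1 - phi((30.0 - mean_diff) / se)     # P(difference > 30 s margin)

print(f"P(benefit > 0 s)  ~ {p_any_benefit:.2f}")
print(f"P(benefit > 30 s) ~ {p_relevant:.2f}")
```

Roughly a 0.9 probability of some benefit and a 0.15 probability of a benefit beyond the 30 s margin: evidence too weak to claim efficacy, and far too weak to claim "no effect". A skeptical (non-flat) prior would pull both numbers toward the null but not change this qualitative conclusion.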

To add one example demonstrating why power is irrelevant after the study is completed, consider a case where a certain standard deviation has been assumed in doing the power calculation, and after the study it is found that the standard deviation is actually larger by a factor of 1.3. The original power calculation is not only not relevant now, it is also quite wrong.
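This is easy to quantify with a normal-approximation power formula. The 30 s effect and 100 patients per group come from the trial's design as described above; the 75 s standard deviation is a made-up planning value chosen only so that the planned power lands near 80%.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def power_two_sample(delta, sd, n_per_group, z_crit=1.96):
    """Normal-approximation power of a two-sided two-sample test."""
    se = sd * math.sqrt(2.0 / n_per_group)
    return (1 - phi(z_crit - delta / se)) + phi(-z_crit - delta / se)

# Hypothetical planning numbers: 30 s effect, 100 per group, assumed SD 75 s.
p_planned = power_two_sample(30.0, 75.0, 100)
p_actual = power_two_sample(30.0, 75.0 * 1.3, 100)  # SD turned out 30% larger

print(f"planned power ~ {p_planned:.2f}, actual power ~ {p_actual:.2f}")
```

Under these assumptions a 30% underestimate of the standard deviation turns a planned power of about 0.81 into an actual power near 0.59, so the pre-study power statement no longer describes the study that was actually run.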

An a-priori power calculation, involving a pre-specified effect size of scientific relevance, is a key ingredient in allowing proper inference after study data are collected and analyzed. There is indeed a formal way to embed the power into the interpretation of the results. References for a sound inferential procedure can be found, for example, in "Why perform a priori sample size calculation?" (DOI: 10.1503/cjs.018012), Deborah Mayo's "Error and Inference" book, page 263, and van Belle et al.'s "Biostatistics", pages 20, 135, and 146-147. If I have misunderstood the underpinnings of frequentist statistical inference, then many other statisticians and philosophers far more talented than I have misunderstood. Power is not irrelevant once a study is done; the a-priori power analysis is our best estimate of the state of affairs before the study is run and remains our best estimate, as power for a study cannot be calculated using the data from the study. There are plenty of discussions of the inappropriateness of post-hoc power calculations, i.e. calculating power for a study using the data from the study, e.g. "Some Practical Guidelines for Effective Sample Size Determination" (DOI: 10.1198/000313001317098149). Of course the standard deviation calculated from the study data will differ from the standard deviation calculated from the a-priori data used in the power calculation. Estimates have "fuzz". That does not render prior data irrelevant and wrong. How could you use a prior distribution in a Bayesian analysis under that tenet - your posterior will differ from your prior, so the prior must have been quite wrong.

As for the StackExchange question you reference, the R code provided shows comparison of two means from a lognormal density. I see no discussion of confidence intervals therein at all.

I used that code to generate 50,000 random draws of 1,000 observations from a lognormal distribution. If I calculate the mean minus the expected value, divide by the specified standard deviation for each of the 50,000 draws, and plot a density estimate, a standard normal distribution fits almost exactly on top of the density of the 50,000 standardized means. So for averages of lognormal data at a sample size of 1,000 the CLT is certainly demonstrably accurate. The example presented by the original poster of that code involves 50,000 repetitions of two samples each consisting of 50 observations.
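A stdlib-only re-implementation of that check (my own sketch, not the StackExchange R code, and with fewer replications to keep it fast) standardizes means of lognormal(0, 1) samples of size 1,000 and confirms they behave approximately like standard normal draws:

```python
import math
import random
import statistics

random.seed(0)

# Standardize means of lognormal(0, 1) samples of size 1,000 and check
# that they look like standard normal draws.
MU, SIGMA = 0.0, 1.0
TRUE_MEAN = math.exp(MU + SIGMA ** 2 / 2)
TRUE_SD = math.sqrt((math.exp(SIGMA ** 2) - 1) * math.exp(2 * MU + SIGMA ** 2))

def standardized_means(n, n_sims=2000):
    zs = []
    for _ in range(n_sims):
        xbar = statistics.fmean(random.lognormvariate(MU, SIGMA) for _ in range(n))
        zs.append((xbar - TRUE_MEAN) / (TRUE_SD / math.sqrt(n)))
    return zs

zs = standardized_means(1000)
z_mean = statistics.fmean(zs)
z_sd = statistics.stdev(zs)
frac_within = sum(abs(z) < 1.96 for z in zs) / len(zs)

print(f"mean = {z_mean:.3f}, sd = {z_sd:.3f}, P(|z| < 1.96) = {frac_within:.3f}")
```

At this sample size the standardized means have mean near 0, standard deviation near 1, and about 95% fall within plus or minus 1.96, consistent with the density-overlay check described above (a residual skew of roughly 6.2/sqrt(1000) remains, which is why the agreement is approximate rather than exact).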

The StackExchange discussion includes a plot showing a density curve for the simulated t statistics with a t-distribution density curve overlaid. On code line 60 there is a bug: dt(x, 8), so a t-distribution density with 8 degrees of freedom is shown in the image instead of one with 98 d.f. If I correct the bug, the t-density tails lie nicely on top of the simulated density curve. So even for two samples of size 50 the t statistic is behaving close to advertised rates. The discussion of the chi-square denominator requiring hundreds of thousands of observations before it shows its CLT convergence is a red herring. The problem concerns estimates of the mean of lognormal data, not variance estimates of its second moment. The CLT says nothing about the form of the first and second central moments, needed in the numerator and denominator, other than that they exist in some finite, non-pathological form. The CLT's focus is on the average of a set of independent random variables.

For the scenario of no difference in means between the two groups, I collected the 50,000 95% confidence intervals as well. 47,736 of them cover zero, the true difference between the two means, for a coverage rate of 0.95472. How are these 95% CIs not performing well?

For a power demonstration, a scenario of blood pressure for two populations is posited, with blood pressure having a lognormal distribution. The simulated scenario demonstrating type II error rates has a difference of 10 units between the known means of the two lognormal distributions. 50,000 simulations of 50 observations per group are run, only 12 of which yield a test statistic less than the cutoff value, for a power of 0.99976. The t-test was easily able to detect this difference under the distributional conditions of the simulation. The comment there is that the power is "only 99%". Well, the simulation estimate is actually 99.976%, so in what sense is the t-test not performing well? I collected the 50,000 95% confidence intervals as well. 47,401 of them cover the difference between the two means, for a coverage rate of 0.94802. How are these 95% CIs not performing well?

Surprisingly, no one discusses the obvious in this StackExchange thread. If blood pressure data are lognormal, then one can also run statistics on the logarithms of the blood pressure data, which of course would be normal.

Steven, you are quite mistaken in your interpretation of ORBITA, and your belief that power calculations with made-up standard deviation estimates remain valid after the study is quite wrong. A likelihoodist or Bayesian would say "the data are everything" (the Bayesian would also need to add "along with the prior"). Type I and II errors are relevant before the data are in, i.e., in the initial planning of fixed-sample-size studies, and are not relevant to a single observed dataset once the data arrive.

Your simulations miss the point. Rand Wilcox and John Tukey have shown that invisible increases in tail density relative to the Gaussian make confidence interval coverage go awry, and in the particular case of the one-sample problem with the lognormal distribution, the confidence non-coverage probabilities in both tails are quite far from 0.025 for a putative 0.95 confidence interval even with n = 50,000. Details are in another StackExchange post.