Tuesday, November 21, 2017

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction

This post will grow to cover questions about data reduction methods, also known as unsupervised learning methods. These are intended primarily for two purposes:

  • collapsing correlated variables into an overall score so that one does not have to disentangle correlated effects, which is a difficult statistical task
  • reducing the effective number of variables to use in a regression or other predictive model, so that fewer parameters need to be estimated
The latter example is the "too many variables too few subjects" problem.  Data reduction methods are covered in Chapter 4 of my book Regression Modeling Strategies, and in some of the book's case studies.

Sacha Varin writes 2017-11-19:

I want to add/sum some variables having different units. I decide to standardize (Z-scores) the values and then, once transformed in Z-scores, I can sum them.  The problem is that my variables' distributions are non-Gaussian (they are not symmetrical (skewed), they are long-tailed, I have all types of weird distributions; I guess we can say the distributions are intractable). I know that my distributions don't need to be Gaussian to calculate Z-scores; however, if the distributions are not close to Gaussian or at least symmetrical enough, I guess the classical Z-score transformation, (Value - Mean)/SD, is not valid. That's why I decide, because my distributions are skewed and long-tailed, to use Gini's mean difference (a robust and efficient estimator).
  1. If the distributions are skewed and long-tailed, can I standardize the values using that formula :(Value - Mean)/GiniMd?  Or the mean is not a good estimator in presence of skewed and long-tailed distributions?  What about (Value - Median)/GiniMd?  Or what else with GiniMd for a formula to standardize?
  2. In presence of outliers, skewed and long-tailed distributions, for standardization, what formula is better to use between (Value - Median)/MAD (=median absolute deviation) or (Value - Mean)/GiniMd?  And why?
My situation is not the predictive modeling case, but I want to sum the variables.

These are excellent questions and touch on an interesting side issue.  My opinion is that standard deviations (SDs) are not very applicable to asymmetric (skewed) distributions, and that they are not very robust measures of dispersion.  I'm glad you mentioned Gini's mean difference, which is the mean of all absolute differences of pairs of observations.  It is highly robust and is surprisingly efficient as a measure of dispersion when compared to the SD, even when normality holds. 

The questions also touch on the fact that when normalizing more than one variable so that the variables may be combined, there is no magic normalization method in statistics.  I believe that Gini's mean difference is as good as any and better than the SD.  It is also more precise than the mean absolute difference from the mean or median, and the mean may not be robust enough in some instances.  But we have a rich history of methods, such as principal components (PCs), that use SDs.
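To make the scaling concrete, here is a minimal Python sketch of Gini's mean difference and of a GMD-based standardization. The function names, the choice of the median as the default center, and the two small data vectors are all assumptions made for illustration; nothing here is a recommended recipe.

```python
from itertools import combinations
from statistics import median

def gini_mean_difference(x):
    """Mean of the absolute differences over all distinct pairs of observations."""
    pairs = list(combinations(x, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

def scale_by_gmd(x, center=median):
    """Subtract a center (median by default, an assumption) and divide by the GMD."""
    c = center(x)
    g = gini_mean_difference(x)
    return [(v - c) / g for v in x]

# Hypothetical skewed measurements recorded in different units
weight_kg = [52, 58, 61, 63, 70, 74, 90, 140]
crp_mg_l = [0.4, 0.6, 0.9, 1.1, 1.5, 2.2, 6.0, 48.0]

# Unitless scaled values that could be summed if linearity is a fair assumption
score = [a + b for a, b in zip(scale_by_gmd(weight_kg), scale_by_gmd(crp_mg_l))]
print([round(s, 2) for s in score])
```

Passing `center=statistics.mean` gives the (Value - Mean)/GiniMd variant from the question, so the two centerings can be compared on the same data.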

What I'm about to suggest is a bit more applicable to the case where you ultimately want to form a predictive model, but it can also apply when the goal is to just combine several variables.  When the variables are continuous and are on different scales, scaling them by SD or Gini's mean difference will allow one to create unitless quantities that may possibly be added.  But the fact that they are on different scales raises the question of whether they are already "linear" or need separate nonlinear transformations to be "combinable".

I think that nonlinear PCs may be a better choice than just adding scaled variables.  When the predictor variables are correlated, nonlinear PCs learn from the interrelationships, even occasionally learning how to optimally transform each predictor to ultimately better predict Y.  The transformations (e.g., fitted spline functions) are solved for to maximize predictability of a predictor, from the other predictors or PCs of them.  Sometimes the way the predictors move together is the same way they relate to some ultimate outcome variable that this unsupervised learning method does not have access to.  An example of this is in Section 4.7.3 of my book.

With a little bit of luck, the transformed predictors have more symmetric distributions, so ordinary PCs computed on these transformed variables, with their implied SD normalization, work pretty well.  PCs take into account that some of the component variables are highly correlated with each other, and so are partially redundant and should not receive the same weights ("loadings") as other variables.

The R transcan function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R homals package.

How do we handle the case where the number of candidate predictors p is large in comparison to the effective sample size n?  Penalized maximum likelihood estimation (e.g., ridge regression) and Bayesian regression typically have the best performance, but data reduction methods are competitive and sometimes more interpretable.  For example, one can use variable clustering and redundancy analysis as detailed in the RMS book and course notes.  Principal components (linear or nonlinear) can also be an excellent approach to lowering the number of variables that need to be related to the outcome variable Y.  Two example approaches are:

  1. Use the 15:1 rule of thumb to estimate how many predictors can reliably be related to Y.  Suppose that number is k.  Use the first k principal components to predict Y.
  2. Enter PCs in decreasing order of variation (of the system of Xs) explained and choose the number of PCs to retain using AIC.  This is far from stepwise regression, which enters variables according to their p-values with Y.  We are effectively entering variables in a pre-specified order with incomplete principal component regression. 
Once the PC model is formed, one may attempt to interpret the model by studying how raw predictors relate to the principal components or to the overall predicted values.
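The arithmetic behind the two approaches can be sketched in a few lines of Python. This is an illustration under assumptions: the helper names are made up, the effective sample size for a binary outcome is taken as the smaller of the two outcome counts (a common convention), and the log-likelihoods are hypothetical numbers standing in for fits that would come from real models using the first k PCs.

```python
def max_components(effective_n, ratio=15):
    """Approach 1: number of components supportable under a ratio:1 rule of thumb."""
    return effective_n // ratio

def choose_by_aic(loglik_by_k):
    """Approach 2: pick k minimizing AIC = -2 logLik + 2k.

    loglik_by_k[k] is the maximized log-likelihood of the model that uses
    the first k PCs, entered in their fixed order (k = 0, 1, 2, ...).
    Parameters shared by every model (e.g., the intercept) are omitted from
    the penalty because they do not change which k is chosen.
    """
    aic = [-2.0 * ll + 2.0 * k for k, ll in enumerate(loglik_by_k)]
    return min(range(len(aic)), key=aic.__getitem__)

# Continuous Y with 300 subjects: effective sample size taken as 300
print(max_components(300))

# Binary Y with 60 events and 240 non-events: effective size taken as min(60, 240)
print(max_components(min(60, 240)))

# Hypothetical log-likelihoods for models with 0, 1, 2, 3, 4 leading PCs
print(choose_by_aic([-120.0, -110.0, -105.0, -104.5, -104.2]))
```

Because the PCs are entered in an order fixed by the Xs alone, the outcome is consulted only to decide where to stop, not which variables to try.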

Returning to Sacha's original setting, if linearity is assumed for all variables, then scaling by Gini's mean difference is reasonable.  But psychometric properties should be considered, and often the scale factors need to be derived from subject matter rather than statistical considerations.

Sunday, November 5, 2017

Statistical Criticism is Easy; I Need to Remember That Real People are Involved

I have been critical of a number of articles, authors, and journals in this growing blog article. Linking the blog with Twitter is a way to expose the blog to more readers. It is far too easy to slip into hyperbole on the blog and even easier on Twitter with its space limitations. Importantly, many of the statistical problems pointed out in my article are very, very common, and I dwell on recent publications to get the point across that inadequate statistical review at medical journals remains a serious problem. Equally important, many of the issues I discuss, from p-values and null hypothesis testing to issues with change scores, are not well covered in medical education (of authors and referees), and p-values have caused a phenomenal amount of damage to the research enterprise. Still, journals insist on emphasizing p-values. I spend a lot of time educating biomedical researchers about statistical issues and serving as a reviewer for many medical journals, but am still on a quest to impact journal editors.

Besides statistical issues, there are very real human issues, and challenges in keeping clinicians interested in academic clinical research when there are so many pitfalls, complexities, and compliance issues. In the many clinical trials with which I have been involved, I've always been glad to be the statistician and not the clinician responsible for protocol logistics, informed consent, recruiting, compliance, etc.

A recent case discussed here has brought the human issues home, after I came to know of the extraordinary efforts made by the ORBITA study's first author, Rasha Al-Lamee, to make this study a reality. Placebo-controlled device trials are very difficult to conduct and to recruit patients into, and this was Rasha's first effort to launch and conduct a randomized clinical trial. I very much admire Rasha's bravery and perseverance in conducting this trial of PCI, when it is possible that many past trials of PCI vs. medical therapy were affected by placebo effects.

Professor of Cardiology at Imperial College London, a co-author on the above paper, and Rasha's mentor, Darrel Francis, elegantly pointed out to me that there is a real person on the receiving end of my criticism, and I heartily agree with him that none of us would ever want to discourage a clinical researcher from ever conducting her second randomized trial. This is especially true when the investigator has a burning interest to tackle difficult unanswered clinical questions. I don't mind criticizing statistical designs and analyses, but I can do a better job of respecting the sincere efforts and hard work of biomedical researchers.

I note in passing that I had the honor of being a co-author with Darrel on this paper of which I am extremely proud.

Dr Francis gave me permission to include his thoughts, which are below. After that I list some ideas for making the path to presenting clinical research findings a more pleasant journey.

As the PI for ORBITA, I apologise for this trial being 40 years late, due to a staffing issue. I had to wait for the lead investigator, Rasha Al-Lamee, to be born, go to school, study Medicine at Oxford University, train in interventional cardiology, and start as a consultant in my hospital, before she could begin the trial.

Rasha had just finished her fellowship. She had experience in clinical research, but this was her first leadership role in a trial. She was brave to choose for her PhD a rigorous placebo-controlled trial in this controversial but important area.

Funding was difficult: grant reviewers, presumably interventional cardiologists, said the trial was (a) unethical and (b) unnecessary. This trial only happened because Rasha was able to convince our colleagues that the question was important and the patients would not be without stenting for long. Recruitment was challenging because it required interventionists to not apply the oculostenotic reflex. In the end the key was Rasha keeping the message at the front of all our colleagues' minds with her boundless energy and enthusiasm. Interestingly, when the concept was explained to patients, they agreed to participate more easily than we thought, and dropped out less frequently than we feared. This means we should indeed acquire placebo-controlled data on interventional procedures.

Incidentally, I advocate the term "placebo" over "sham" for these trials, for two reasons. First, placebo control is well recognised as essential for assessing drug efficacy, and this helps people understand the need for it with devices. Second, "sham" is a pejorative word, implying deception. There is no deception in a placebo controlled trial, only pre-agreed withholding of information.

There are several ways to improve the system that I believe would foster clinical research and make peer review more objective and productive.

  • Have journals conduct reviews of background and methods without knowledge of results.
  • Abandon journals and use researcher-led online systems that invite open post-"publication" peer review and give researchers the opportunities to improve their "paper" in an ongoing fashion.
  • If not publishing the entire paper online, deposit the background and methods sections for open pre-journal submission review.
  • Abandon null hypothesis testing and p-values. Before that, always keep in mind that a large p-value means nothing more than "we don't yet have evidence against the null hypothesis", and emphasize confidence limits.
  • Embrace Bayesian methods that provide safer and more actionable evidence, including measures that quantify clinical significance. And if one is trying to amass evidence that the effects of two treatments are similar, compute the direct probability of similarity using a Bayesian model.
  • Improve statistical education of researchers, referees, and journal editors, and strengthen statistical review for journals.
  • Until everyone understands the most important statistical concepts, better educate researchers and peer reviewers on statistical problems to avoid.
On a final note, I regularly review clinical trial design papers for medical journals. I am often shocked at design flaws that authors state are "too late to fix" in their response to the reviews. This includes problems caused by improper endpoint variables that necessitated the randomization of triple the number of patients actually needed to establish efficacy. Such papers have often been through statistical review before the journal submission. This points out two challenges: (1) there is a lot of between-statistician variation that statisticians need to address, and (2) there are many fundamental statistical concepts that are not known to many statisticians (witness the widespread use of change scores and dichotomization of variables even when senior statisticians are among a paper's authors).

Monday, October 9, 2017

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired

(In a Bayesian analysis) It is entirely appropriate to collect data
until a point has been proven or disproven, or until the data collector
runs out of time, money, or patience. - Edwards, Lindman, Savage (1963)

Bayesian inference, which follows the likelihood principle, is not affected by the experimental design or intentions of the investigator. P-values can only be computed if both of these are known, and as has been described by Berry (1987) and others, it is almost never the case that the computation of the p-value at the end of a study takes into account all the changes in design that were necessitated when pure experimental designs encounter the real world.

When performing multiple data looks as a study progresses, one can accelerate learning by more quickly abandoning treatments that do not work, by sometimes stopping early for efficacy, and frequently by arguing to extend a promising but as-yet-inconclusive study by adding subjects over the originally intended sample size. Indeed the whole exercise of computing a single sample size is thought to be voodoo by most practicing statisticians. It has become almost comical to listen to rationalizations for choosing larger detectable effect sizes so that smaller sample sizes will yield adequate power.

Multiplicity and resulting inflation of type I error when using frequentist methods is real. While Bayesians concern themselves with "what did happen?", frequentists must consider "what might have happened?" because of the backwards time and information flow used in their calculations. Frequentist inference must envision an indefinitely long string of identical experiments and must consider extremes of data over potential studies and over multiple looks within each study if multiple looks were intended. Multiplicity comes from the chances (over study repetitions and data looks) you give data to be more extreme (if the null hypothesis holds), not from the chances you give an effect to be real. It is only the latter that is of concern to a Bayesian. Bayesians entertain only one dataset at a time, and if one computes posterior probabilities of efficacy multiple times, it is only the last value calculated that matters.
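The frequentist side of this can be checked with a small simulation. The sketch below (Python, standard library only) generates trials in which the null hypothesis is true, applies an unadjusted two-sided z-test at each of several looks, and compares the chance of rejecting at any look with the chance of rejecting at the final look alone. The design (10 equally spaced looks up to 200 subjects, known SD of 1, 4000 simulated trials) is an assumption chosen only for illustration.

```python
import math
import random

def simulate_null_trial(rng, looks, z_crit=1.96):
    """One trial with true mean 0 and SD 1, tested at each sample size in `looks`.

    Returns (rejected at any look, rejected at the final look).
    """
    total, n = 0.0, 0
    rejected_any = False
    rejected_now = False
    for look_n in looks:
        while n < look_n:                 # accrue subjects up to this look
            total += rng.gauss(0.0, 1.0)
            n += 1
        z = total / math.sqrt(n)          # z-statistic for H0: mean = 0, SD known to be 1
        rejected_now = abs(z) > z_crit
        rejected_any = rejected_any or rejected_now
    return rejected_any, rejected_now

rng = random.Random(1)
looks = list(range(20, 201, 20))          # 10 looks: n = 20, 40, ..., 200 (assumed)
n_trials = 4000
results = [simulate_null_trial(rng, looks) for _ in range(n_trials)]

any_look_rate = sum(r[0] for r in results) / n_trials
final_look_rate = sum(r[1] for r in results) / n_trials
print(f"rejected at some look: {any_look_rate:.3f}")
print(f"rejected at final look only: {final_look_rate:.3f}")
```

The final-look rate should sit near the nominal 0.05, while the any-look rate is noticeably larger, which is the multiplicity that frequentist sequential methods must correct for.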

To better understand the last point, consider a probabilistic pattern recognition system for identifying enemy targets in combat. Suppose the initial assessment when the target is distant is a probability of 0.3 of being an enemy vehicle. Upon coming closer the probability rises to 0.8. Finally the target is close enough (or the air clears) so that the pattern analyzer estimates a probability of 0.98. The fact that the probability was < 0.98 earlier is of no consequence as the gunner prepares to fire a cannon. Even though the probability may actually decrease while the shell is in the air due to new information, the probability at the time of firing was completely valid based on then available information.

This is very much how an experimenter would work in a Bayesian clinical trial. The stopping rule is unimportant when interpreting the final evidence. Earlier data looks are irrelevant. The only ways a Bayesian would cheat would be to ignore a later look if it is less favorable than an earlier look, or to try to pull the wool over reviewers' eyes by changing the prior distribution once data patterns emerge.

The meaning and accuracy of posterior probabilities of efficacy in a clinical trial are mathematical necessities that follow from Bayes' rule, if the data model is correctly specified (this model is needed just as much by frequentist methods). So no simulations are needed to demonstrate these points. But for the non-mathematically minded, simulations can be comforting. For everyone, simulation code exposes the logic flow in the Bayesian analysis paradigm.

One other thing: when the frequentist does a sequential trial with possible early termination, the sampling distribution of the statistics becomes extremely complicated, but must be derived to allow one to obtain proper point estimates and confidence limits. It is almost never the case that the statistician actually performs these complex adjustments in a clinical trial with multiple looks. One example of the harm of ignoring this problem is that if the trial stops fairly early for efficacy, efficacy will be overestimated. On the other hand, the Bayesian posterior mean/median/mode of the efficacy parameter will be perfectly calibrated by the prior distribution you assume. If the prior is skeptical and one stops early, the posterior mean will be "pulled back" by a perfect amount, as shown in the simulation below.

We consider the simplest clinical trial design for illustration. The efficacy measure is assumed to be normally distributed with mean μ and variance 1.0, μ=0 indicates no efficacy, and μ<0 indicates a detrimental effect. Our inferential jobs are to see if evidence may be had for a positive effect and to see if further there is evidence for a clinically meaningful effect (except for the futility analysis, we will ignore the latter in what follows). Our business task is to not spend resources on treatments that have a low chance of having a meaningful benefit to patients. The latter can also be an ethical issue: we'd like not to expose too many patients to an ineffective treatment. In the simulation, we stop for futility when the probability that μ < 0.05 exceeds 0.9, considering μ=0.05 to be a minimal clinically important effect.

The logic flow in the simulation exposes what is assumed by the Bayesian analysis.

  1. The prior distribution for the unknown effect μ is taken as a mixture of two normal distributions, each with mean zero. This is a skeptical prior that gives an equal chance for detriment as for benefit from the treatment. Any prior would have done.
  2. In the next step it is seen that the Bayesian does not consider a stream of identical trials but instead (and only when studying performance of Bayesian operating characteristics) considers a stream of trials with different efficacies of treatment, by drawing a single value of μ from the prior distribution. This is done independently for 50,000 simulated studies. Posterior probabilities are not informed by this value of μ. Bayesians operate in a predictive mode, trying for example to estimate Prob(μ>0) no matter what the value of μ.
  3. For the current value of μ, simulate an observation from a normal distribution with mean μ and SD=1.0. [In the code below all n=500 subjects' data are simulated at once then revealed one-at-a-time.]
  4. Compute the posterior probability of efficacy (μ > 0) and of futility (μ < 0.05) using the original prior and latest data.
  5. Stop the study if the probability of efficacy ≥0.95 or the probability of futility ≥0.9.
  6. Repeat the last 3 steps, sampling one more subject each time and performing analyses on the accumulated set of subjects to date.
  7. Stop the study when 500 subjects have entered.
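The steps above can be sketched in Python using only the standard library. This is not the R code used for the post; it is a reduced illustration under stated assumptions: the mixture weights (0.5, 0.5) and component SDs (0.5, 1.0) of the prior are made up, only 2000 studies are simulated rather than 50,000, and calibration is checked by comparing simple averages among early-stopped studies rather than by a smooth calibration curve.

```python
import math
import random

PRIOR_W = (0.5, 0.5)    # mixture weights (assumed)
PRIOR_SD = (0.5, 1.0)   # SDs of the two mean-zero normal components (assumed)
MCID = 0.05             # minimal clinically important effect, from the text

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior(n, ybar):
    """Posterior for mu after n observations (data SD 1) with sample mean ybar.

    Returns (P(mu > 0), P(mu < MCID), posterior mean).
    """
    weights, means, sds = [], [], []
    for w, s in zip(PRIOR_W, PRIOR_SD):
        marg_var = s * s + 1.0 / n                 # variance of ybar under this component
        like = math.exp(-0.5 * ybar * ybar / marg_var) / math.sqrt(marg_var)
        post_var = 1.0 / (1.0 / (s * s) + n)       # conjugate normal update
        weights.append(w * like)                   # updated (unnormalized) mixture weight
        means.append(post_var * n * ybar)
        sds.append(math.sqrt(post_var))
    total = sum(weights)
    weights = [w / total for w in weights]
    p_eff = sum(w * norm_cdf(m / s) for w, m, s in zip(weights, means, sds))
    p_fut = sum(w * norm_cdf((MCID - m) / s) for w, m, s in zip(weights, means, sds))
    post_mean = sum(w * m for w, m in zip(weights, means))
    return p_eff, p_fut, post_mean

def run_trial(rng, max_n=500):
    """Simulate one study; returns (true mu, stop reason, p_eff, p_fut, posterior mean)."""
    sd = rng.choices(PRIOR_SD, weights=PRIOR_W)[0]
    mu = rng.gauss(0.0, sd)                        # step 2: draw mu from the prior
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(mu, 1.0)                # step 3: one more subject
        p_eff, p_fut, post_mean = posterior(n, total / n)   # step 4
        if p_eff >= 0.95:                          # step 5: stopping rules
            return mu, "efficacy", p_eff, p_fut, post_mean
        if p_fut >= 0.90:
            return mu, "futility", p_eff, p_fut, post_mean
    return mu, "max_n", p_eff, p_fut, post_mean    # step 7

rng = random.Random(2017)
results = [run_trial(rng) for _ in range(2000)]
eff = [r for r in results if r[1] == "efficacy"]
fut = [r for r in results if r[1] == "futility"]

# Calibration: mean posterior probability at stopping vs. proportion where it was true
mean_p_eff = sum(r[2] for r in eff) / len(eff)
prop_eff = sum(r[0] > 0 for r in eff) / len(eff)
mean_p_fut = sum(r[3] for r in fut) / len(fut)
prop_fut = sum(r[0] < MCID for r in fut) / len(fut)
# Posterior mean at stopping vs. true mu, among studies stopped for efficacy
mean_post = sum(r[4] for r in eff) / len(eff)
mean_mu = sum(r[0] for r in eff) / len(eff)

print(f"efficacy stops: {len(eff)}, mean P = {mean_p_eff:.3f}, truth = {prop_eff:.3f}")
print(f"futility stops: {len(fut)}, mean P = {mean_p_fut:.3f}, truth = {prop_fut:.3f}")
print(f"mean posterior mean = {mean_post:.3f}, mean true mu = {mean_mu:.3f}")
```

Note that the true μ never enters `posterior`; it is used only afterwards to judge whether the probabilities computed at the moment of stopping were accurate.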

What is it that the Bayesian must demonstrate to the frequentist and reviewers? She must demonstrate that the posterior probabilities computed as stated above are accurate, i.e., they are well calibrated. From our simulation design, the final posterior probability will either be the posterior probability computed after the last (500th) subject has entered, the probability of futility at the time of stopping for futility, or the probability of efficacy at the time of stopping for efficacy. How do we tell if the posterior probability is accurate? By comparing it to the value of μ (unknown to the posterior probability calculation) that generated the sequence of data points that were analyzed. We can compute a smooth nonparametric calibration curve for each of (efficacy, futility) where the binary events are μ > 0 and μ < 0.05, respectively. For the subset of the 50,000 studies that were stopped early, the range of probabilities is limited so we can just compare the mean posterior probability at the moment of stopping with the proportion of such stopped studies for which efficacy (futility) was the truth. The mathematics of Bayes dictates the mean probability and the proportion must be the same (if enough trials are run so that simulation error approaches zero). This is what happened in the simulations.

For the smaller set of studies not stopping early, the posterior probability of efficacy is uncertain and will have a much wider range. The calibration accuracy of these probabilities is checked using a nonparametric calibration curve estimator just as we do in validating risk models, by fitting the relationship between the posterior probability and the binary event μ>0.

The simulations also demonstrated that the posterior mean efficacy at the moment of stopping is perfectly calibrated as an estimator of the true unknown μ.

Simulations were run in R and used functions in the R Hmisc and rms packages. The results are below. Feel free to take the code and alter it to run any simulations you'd like.

Wednesday, October 4, 2017

Bayesian vs. Frequentist Statements About Treatment Efficacy

The following examples are intended to show the advantages of Bayesian reporting of treatment efficacy analysis, as well as to provide examples contrasting with frequentist reporting. As detailed here, there are many problems with p-values, and some of those problems will be apparent in the examples below. Many of the advantages of Bayes are summarized here. As seen below, Bayesian posterior probabilities prevent one from concluding equivalence of two treatments on an outcome when the data do not support that (i.e., the "absence of evidence is not evidence of absence" error).

Suppose that a parallel group randomized clinical trial is conducted to gather evidence about the relative efficacy of new treatment B to a control treatment A. Suppose there are two efficacy endpoints: systolic blood pressure (SBP) and time until cardiovascular/cerebrovascular event. Treatment effect on the first endpoint is assumed to be summarized by the B-A difference in true mean SBP. The second endpoint is assumed to be summarized as a true B:A hazard ratio (HR). For the Bayesian analysis, assume that pre-specified skeptical prior distributions were chosen as follows. For the unknown difference in mean SBP, the prior was normal with mean 0 with SD chosen so that the probability that the absolute difference in SBP between A and B exceeds 10mmHg was only 0.05. For the HR, the log HR was assumed to have a normal distribution with mean 0 and SD chosen so that the prior probability that the HR > 2 or HR < 1/2 was 0.05. Both priors specify that it is equally likely that treatment B is effective as it is detrimental. The two prior distributions will be referred to as p1 and p2.
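The prior SDs implied by these tail-probability statements follow directly from normal quantiles. A minimal Python sketch, with a made-up helper name, is:

```python
import math
from statistics import NormalDist

def sd_for_tail_prob(limit, two_sided_prob):
    """SD of a mean-zero normal X such that P(|X| > limit) = two_sided_prob."""
    z = NormalDist().inv_cdf(1.0 - two_sided_prob / 2.0)
    return limit / z

# p1: P(|difference in mean SBP| > 10 mmHg) = 0.05
sd_p1 = sd_for_tail_prob(10.0, 0.05)

# p2: P(HR > 2 or HR < 1/2) = P(|log HR| > log 2) = 0.05
sd_p2 = sd_for_tail_prob(math.log(2.0), 0.05)

print(f"SD for p1 (mmHg): {sd_p1:.3f}")
print(f"SD for p2 (log HR scale): {sd_p2:.4f}")

# Check: recompute the two-sided tail area under each prior
tail_p1 = 2.0 * (1.0 - NormalDist(0.0, sd_p1).cdf(10.0))
tail_p2 = 2.0 * (1.0 - NormalDist(0.0, sd_p2).cdf(math.log(2.0)))
print(f"tail areas: {tail_p1:.3f}, {tail_p2:.3f}")
```

Working on the log scale for the HR is what makes the prior symmetric between benefit (HR < 1) and harm (HR > 1).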

Example 1: So-called "Negative" Trial (Considering only SBP)

  • Frequentist Statement
    • Incorrect Statement: Treatment B did not improve SBP when compared to A (p=0.4)
    • Confusing Statement: Treatment B was not significantly different from treatment A (p=0.4)
    • Accurate Statement: We were unable to find evidence against the hypothesis that A=B (p=0.4). More data will be needed. As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B (but see the confidence interval below).
    • Supplemental Information: The observed B-A difference in means was 4mmHg with a 0.95 confidence interval of [-5, 13]. If this study could be indefinitely replicated and the same approach used to compute the confidence interval each time, 0.95 of such varying confidence intervals would contain the unknown true difference in means.
  • Bayesian Statement
    • Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67. Alternative statement: SBP is probably (0.67) reduced with treatment B. The probability that B is inferior to A is 0.33. Assuming a minimally clinically important difference in SBP of 3mmHg, the probability that the mean for A is within 3mmHg of the mean for B is 0.53, so the study is uninformative about the question of similarity of A and B.
    • Supplemental Information: The posterior mean difference in SBP was 3.3mmHg and the 0.95 credible interval is [-4.5, 10.5]. The probability is 0.95 that the true treatment effect is in the interval [-4.5, 10.5]. [could include the posterior density function here, with a shaded right tail with area 0.67.]

Example 2: So-called "Positive" Trial

  • Frequentist Statement
    • Incorrect Statement: The probability that there is no difference in mean SBP between A and B is 0.02
    • Confusing Statement: There was a statistically significant difference between A and B (p=0.02).
    • Correct Statement: There is evidence against the null hypothesis of no difference in mean SBP (p=0.02), and the observed difference favors B. Had the experiment been exactly replicated indefinitely, 0.02 of such repetitions would result in more impressive results if A=B.
    • Supplemental Information: Similar to above.
    • Second Outcome Variable, If the p-value is Small: Separate statement, of same form as for SBP.
  • Bayesian Statement
    • Assuming prior p1, the probability that B lowers SBP when compared to A is 0.985. Alternative statement: SBP is probably (0.985) reduced with treatment B. The probability that B is inferior to A is 0.015.
    • Supplemental Information: Similar to above, plus evidence about clinically meaningful effects, e.g.: The probability that B lowers SBP by more than 3mmHg is 0.81.
    • Second Outcome Variable: Bayesian approach allows one to make a separate statement about the clinical event HR and to state evidence about the joint effect of treatment on SBP and HR. Examples: Assuming prior p2, HR is probably (0.79) lower with treatment B. Assuming priors p1 and p2, the probability that treatment B both decreased SBP and decreased event hazard was 0.77. The probability that B improved either of the two endpoints was 0.991.
One would also report basic results. For SBP, frequentist results might be chosen as the mean difference and its standard error. Basic Bayesian results could be said to be the entire posterior distribution of the SBP mean difference.
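Once a posterior distribution is in hand, statements like those in the examples are simple summaries of it. The Python sketch below assumes a mean-zero normal prior and a normal approximation to the likelihood (an estimate and its standard error), which gives a normal posterior in closed form. The inputs are hypothetical and the function name is made up; the sketch does not attempt to reproduce the illustrative numbers quoted in the examples.

```python
import math
from statistics import NormalDist

def normal_posterior(prior_sd, estimate, std_error):
    """Posterior for a treatment effect under a N(0, prior_sd^2) prior and a
    normal likelihood summarized by (estimate, std_error)."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / std_error ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * data_prec * estimate    # prior mean is 0
    return NormalDist(post_mean, math.sqrt(post_var))

# Hypothetical inputs: B-A difference in mean SBP (mmHg) and its standard error
post = normal_posterior(prior_sd=5.1, estimate=-4.0, std_error=4.6)

p_b_lower = post.cdf(0.0)                          # P(true B-A difference < 0)
p_b_higher = 1.0 - p_b_lower                       # P(B raises SBP relative to A)
p_similar = post.cdf(3.0) - post.cdf(-3.0)         # P(|difference| < 3 mmHg)
p_big_drop = post.cdf(-3.0)                        # P(B lowers SBP by more than 3 mmHg)
lower, upper = post.inv_cdf(0.025), post.inv_cdf(0.975)

print(f"P(B lowers SBP) = {p_b_lower:.2f}, P(B raises SBP) = {p_b_higher:.2f}")
print(f"P(means within 3 mmHg) = {p_similar:.2f}")
print(f"P(B lowers SBP by > 3 mmHg) = {p_big_drop:.2f}")
print(f"posterior mean = {post.mean:.1f}, 0.95 credible interval = [{lower:.1f}, {upper:.1f}]")
```

Each quantity is read off the same posterior, so any number of such statements can be reported without adjustment.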

Note that if multiple looks were made as the trial progressed, the frequentist estimates (including the observed mean difference) would have to undergo complex adjustments. Bayesian results require no modification whatsoever, but just involve reporting the latest available cumulative evidence.

Tuesday, August 1, 2017

Integrating Audio, Video, and Discussion Boards with Course Notes

As a biostatistics teacher I've spent a lot of time thinking about inverting the classroom and adding multimedia content. My first thought was to create YouTube videos corresponding to sections in my lecture notes. This typically entails recording the computer screen while going through slides, adding a voiceover. I realized that the maintenance of such videos is difficult, and this also creates a barrier to adding new content. In addition, the quality of the video image is lower than just having the student use a pdf viewer on the original notes. For those reasons I decided to create audio narration for the sections in the notes to largely capture what I would say during a live lecture. The audio mp3 files are stored on a local server and are streamed on demand when a student clicks on the audio icon in a section of the notes. The audio recordings can also be downloaded one-at-a-time or in a batch.

The notes themselves are created using LaTeX, R, and knitr using a LaTeX style I created that is a compromise format between projecting slides and printing notes. In the future I will explore using bookdown for creating content in html instead of pdf. In either case, the notes can change significantly when R commands within them are re-executed by knitr in R.

An example of a page of pdf notes with icons that link to audio or video content is in Section 10.5 of BBR. I add red letters in the right margin for each subsection in the text, and occasionally call out these letters in the audio so that the student will know where I am.

There are several student activities for which the course would benefit by recording information. Two of them are students pooling notes taken during class sessions, and questions and answers between sessions. The former might be handled by simultaneous editing or wiki curation on the cloud, and I haven't thought very much about how to link this with the course notes to in effect expand the notes for the next class of students. Let's consider the Q&A aspect. It would be advantageous for questions and answers to "grow", and for future students to take advantage of the Q&As from past students. Being able to look at a subsection in the course notes and quickly link to cumulative Q&A on that topic is a plus. My first attempt at this was to set up a slack.com team for courses in our department, and then to set up a channel for each of the two courses I teach. As slack does not allow sub-channels, the discussions need to be organized in some way. I went about this by assigning a mnemonic in the course notes that should be mentioned when a threaded discussion is started in slack. Students can search for discussions about a subsection in the notes by searching for that mnemonic. I have put hyperlinks from the notes to a slack search expression that is supposed to bring up discussions related to the mnemonic in the course's slack channel. The problem is that slack doesn't have a formal URL construction that guarantees that a hyperlink to a URL with that expression will cause the correct discussions to pop up in the user's browser. This is a work in progress, and other ideas are welcomed. See Section 10.5.1 of BBR for an example where an icon links to slack (see the mnemonic reg-simple).

Besides being hard to figure out how to create URLs to get the student and instructor directly into a specific discussion, slack has the disadvantage that users need to be invited to join the team. If every team member is to be from the instructor's university, you can configure slack so that anyone with an email address in the instructor's domain can be added to the team automatically.

I have entertained another approach of using disqus for linking student comments to sections of notes. This is very easy to set up, but when one wants to have a separate discussion about each notes subsection, I haven't figured out how to have disqus use keywords or some other means to separate the discussions.

stats.stackexchange.com is the world's most active Q&A and discussion board for statistics. Its ability to format questions, answers, comments, math equations, and images is unsurpassed. Perhaps every discussion about a statistical issue should be started in stackexchange and then linked to from the course notes. This has the disadvantage of needing to link to multiple existing stackexchange questions related to one topic, but has the great advantage of gathering input from statisticians around the world, not just those in the class.

No matter which method for entering Q & A is used, I think that such comments need to be maintained separately from the course notes because of the dynamic, reproducible nature of the notes using knitr. Just as important, when I add new static content to the notes I want the existing student comments to just move appropriately with these changes. Hyperlinking to Q & A does that. There is one more issue not discussed above - students often annotate the pdf file, but their annotations are undone when I produce an update to the notes. It would be nice to have some sort of dynamic annotation capability. This is likely to work better as I use R bookdown for new notes I develop.

I need your help in refining the approach or discovering completely new approaches to coordination of information using the course notes as a hub. Please add comments to this post below, or short suggestions to @f2harrell on twitter.

Thursday, June 1, 2017

EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection

Frank Harrell
Professor of Biostatistics
Vanderbilt University School of Medicine

Laura Lazzeroni
Professor of Psychiatry and, by courtesy, of Medicine (Cardiovascular Medicine) and of Biomedical Data Science
Stanford University School of Medicine
Revised July 17, 2017

It is often said that randomized clinical trials (RCTs) are the gold standard for learning about therapeutic effectiveness. This is because the treatment is assigned at random so no variables, measured or unmeasured, will be truly related to treatment assignment. The result is an unbiased estimate of treatment effectiveness. On the other hand, observational data arising from clinical practice have all the biases of physicians and patients in who gets which treatment. Some treatments are indicated for certain types of patients; some are reserved for very sick ones. The fact is that treatment is often chosen on the basis of patient characteristics that influence patient outcome, some of which may be unrecorded. When the outcomes of different groups of patients receiving different treatments are compared, without adjustment for patient characteristics related to treatment selection and outcome, the result is a bias called confounding by indication.

To set the stage for our discussion of the challenges caused by confounding by indication, incomplete data, and unreliable data, first consider a nearly ideal observational treatment study then consider an ideal RCT. First, consider a potentially optimal observational cohort design that has some possibility of providing an accurate treatment outcome comparison. Suppose that an investigator has obtained $2M in funding to hire trained research nurses to collect data completely and accurately, and she has gone to the trouble of asking five expert clinicians in the disease/treatment area to each independently list the patient characteristics they perceive are used to select therapies for patients. The result is a list of 18 distinct patient characteristics, for which a data dictionary is written and case report forms are collected. Data collectors are instructed to obtain these 18 variables on every patient with very few exceptions, and other useful variables, especially strong prognostic factors, are collected in addition. Details about treatment are also captured, including the start and ending dates of treatment, doses, and dose schedule. Outcomes are well defined and never missing. The sample size is adequate, and when data collection is complete, analysis of covariance is used to estimate the outcome difference for treatment A vs. treatment B. Then the study PI discovers that there is a strong confounder that none of the five experts thought of, and a sensitivity analysis indicates that the original treatment effect estimate might have been washed away by the additional confounder had it been collected. The study results in no reliable knowledge about the treatments.

The study just described represents a high level of observational study quality, and still needed some luck to be useful. The treatments, entry criteria, and follow-up clock were well defined, and there were almost no missing data. Contrast that with the electronic health record (EHR). If questions of therapeutic efficacy are so difficult to answer with nearly perfect observational data how can they be reliably answered from EHR data alone?

To complete our introduction to the discussion, envision a well-conducted parallel-group RCT with complete follow-up and highly accurate and relevant baseline data capture. Study inclusion criteria allowed for a wide range of age and severity of disease. The endpoint is time until a devastating clinical event. The treatment B:treatment A covariate-adjusted hazard ratio is 0.8 with 0.95 credible interval of [0.65, 0.93]. The authors, avoiding unreliable subgroup analysis, perform a careful but comprehensive assessment of interaction between patient types and treatment effect, finding no evidence for heterogeneity of treatment effect (HTE). The hazard ratio of 0.8 is widely generalizable, even to populations with much different baseline risk. A simple nomogram is drawn to show how to estimate absolute risk reduction by treatment B at 3 years, given a patient's baseline 3y risk.

There is an alarming trend in advocates of learning from the EHR saying that statistical inference can be bypassed because (1) large numbers overcome all obstacles, (2) the EHR reflects actual clinical practice and patient populations, and (3) if you can predict outcomes for individual patients you can just find out for which treatment the predicted outcomes are optimal. Advocates of such "logic" often go on to say that RCTs are not helpful because the proportion of patients seen in practice that would qualify for the trial is very small with randomized patients being unrepresentative of the clinical population, because the trial only estimates the average treatment effect, because there must be HTE, and because treatment conditions are unrepresentative. Without HTE, precision medicine would have no basis. But evidence of substantial HTE has yet to be generally established and its existence in particular cases can be an artifact of the outcome scale used for the analysis. See this for more about the first two complaints about RCTs. Regarding (1), researchers too often forget that measurement or sample bias does not diminish no matter how large the sample size. Often, large sample sizes only provide precise estimates of the wrong quantity.

To illustrate this problem, suppose that one is interested in estimating and testing the treatment effect, B-A, of a certain blood pressure lowering medication (drug B) when compared to another drug (A). Assume a relevant subset of the EHR can be extracted in which patients started initial monotherapy at a defined date and systolic blood pressure (SBP) was measured routinely at useful follow-up intervals. Suppose that the standard deviation (SD) of SBP across patients is 8 mmHg regardless of treatment group. Suppose further that minor confounding by indication is present due to the failure to adjust for an unstated patient feature involved in the drug choice, which creates a systematic unidirectional bias of 2 mmHg in estimating the true B-A difference in mean SBP. If the EHR has m patients in each treatment group, the variance of the estimated mean difference is the sum of the variances of the two individual means or 64/m + 64/m = 128/m. But the variance only tells us about how close our sample estimate is to the incorrect value, B-A + 2 mmHg. It is the mean squared error, the variance plus the square of the bias or 128/m + 4, that relates to the probability that the estimate is close to the true treatment effect B-A. As m gets larger, the variance goes to zero indicating a stable estimate has been achieved. But the bias is constant so the mean squared error remains at 4 (root mean squared error = 2 mmHg).

Now consider an RCT that is designed not to estimate the mean SBP for A or the mean SBP for B but, as with all randomized trials, is designed to estimate the B-A difference (treatment effect). If the trial randomized m subjects per treatment group, the variance of the mean difference is 128/m and the mean squared error is also 128/m. The comparison of the square root of mean squared errors for an EHR study and an equal-sized RCT is depicted in the figure below. Here, we have even given the EHR study the benefit of the doubt in assuming that SBP is measured as accurately as would be the case in the RCT. This is unlikely, and so in reality the results presented below are optimistic for the performance of the EHR.

EHR studies have the potential to provide far larger sample sizes than RCTs, but note that an RCT with a total sample size of 64 subjects is as informative as an EHR study with infinitely many patients. Bigger is not better. What if the SBP measurements from the EHR, not collected under any protocol, are less accurate than those collected under the RCT protocol? Let’s exemplify that by setting the SD for SBP to 10 mmHg for the EHR while leaving it as 8 mmHg for the RCT. For very large sample sizes, bias trumps variance so the breakeven point of 64 subjects remains, but for non-large EHRs the increased variability of measured SBPs harms the margin of error of the EHR estimate of mean SBP difference.
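The arithmetic behind the 64-subject breakeven can be checked with a few lines of Python. This is only a minimal sketch of the example above (SD of 8 mmHg, bias of 2 mmHg, m patients per group); the function names are mine and purely illustrative.

```python
import math


def rmse_ehr(m, sd=8.0, bias=2.0):
    """Root mean squared error of the B-A estimate from an EHR study.

    m is the number of patients per treatment group; the variance of the
    difference in means is sd^2/m + sd^2/m, and the squared bias is added.
    """
    variance = 2.0 * sd ** 2 / m
    return math.sqrt(variance + bias ** 2)


def rmse_rct(m, sd=8.0):
    """Root mean squared error for an unbiased RCT with m subjects per group."""
    return math.sqrt(2.0 * sd ** 2 / m)


def rct_breakeven_total(sd=8.0, bias=2.0):
    """Total RCT size whose RMSE equals the EHR's limiting RMSE (= |bias|)."""
    m_per_group = 2.0 * sd ** 2 / bias ** 2
    return 2.0 * m_per_group


if __name__ == "__main__":
    print("RCT breakeven total sample size:", rct_breakeven_total())
    for m in (32, 500, 3000, 10 ** 6):
        # sd=10 mimics the noisier EHR measurements considered above
        print(m, rmse_rct(m), rmse_ehr(m), rmse_ehr(m, sd=10.0))
```

Setting `sd=10.0` in `rmse_ehr` reproduces the noisier-EHR scenario: the limiting value stays at the bias, but the EHR needs more patients to get near it.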

We have addressed estimation error for the treatment effect, but note that while an EHR-based statistical test for any treatment difference will have excellent power for large n, this comes at the expense of being far from preserving the type I error, which is essentially 1.0 because the estimation bias causes the two-sample statistical test to be biased.
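To see how quickly the type I error deteriorates, here is a rough sketch that assumes a two-sided z-test with known SD (a simplification of the two-sample test; the function names and the z-test itself are my assumptions for illustration). It computes the probability of rejecting the null hypothesis when the true B-A difference is zero but the estimate carries the 2 mmHg bias.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rejection_probability(m, true_diff=0.0, bias=2.0, sd=8.0, z_crit=1.959964):
    """Probability that a two-sided z-test rejects H0: B-A = 0.

    The estimate is assumed normal with mean true_diff + bias and
    standard error sqrt(2 * sd^2 / m), with m patients per group.
    With true_diff = 0 this is the actual type I error.
    """
    se = math.sqrt(2.0 * sd ** 2 / m)
    shift = (true_diff + bias) / se
    return normal_cdf(-z_crit - shift) + 1.0 - normal_cdf(z_crit - shift)


if __name__ == "__main__":
    for m in (32, 100, 500, 3000):
        print(m, rejection_probability(m))
```

With `bias=0.0` the function returns the nominal 0.05 for any m, which is the behavior one expects from a randomized comparison.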

Interestingly, bias decreases the benefits achieved by larger sample sizes to the extent that, in contrast to an unbiased RCT, the mean squared error for an EHR of size 3000 in our example is nearly identical to what it would be with an infinite sample size. While this disregards the need for larger samples to target multiple treatments or distinct patient populations, it does suggest that overcoming the specific resource-intensive challenges associated with handling huge EHR samples may yield fewer advances in medical treatment than anticipated by some, if the effects of bias are considered.

There is a mantra heard in data science that you just need to "let the data speak." You can indeed learn much from observational data if quality and completeness of data are high (this is for another discussion; EHRs have major weaknesses just in these two aspects). But data frequently teach us things that are just plain wrong. This is due to a variety of reasons, including seeing trends and patterns that can be easily explained by pure noise. Moreover, treatment group comparisons in EHRs can reflect both the effects of treatment and the effects of specific prior patient conditions that led to the choice of treatment in the first place - conditions that may not be captured in the EHR. The latter problem is confounding by indication, and this can only be overcome by randomization, strong assumptions, or having high-quality data on all the potential confounders (patient baseline characteristics related to treatment selection and to outcome--rarely if ever possible). Many clinical researchers relying on EHRs do not take the time to even list the relevant patient characteristics before rationalizing that the EHR is adequate. To make matters worse, EHRs frequently do not provide accurate data on when patients started and stopped treatment. Furthermore, the availability of patient outcomes can depend on the very course of treatment and treatment response under study. For example, when a trial protocol is not in place, lab tests are not ordered at pre-specified times but because of a changing patient condition. If the EHR cannot provide a reliable estimate of the average treatment effect how could it provide reliable estimates of differential treatment benefit (HTE)?

Regarding the problem with signal vs. noise in "let the data speak", we envision a clinician watching someone playing a slot machine in Las Vegas. The clinician observes that a small jackpot was hit after 17 pulls of the lever, and now has a model for success: go to a random slot machine with enough money to make 17 pulls. Here the problem is not a biased sample but pure noise.

Observational data, when complete and accurate, can form the basis for accurate predictions. But what are predictions really good for? Generally speaking, predictions can be used to estimate likely patient outcomes given prevailing clinical practice and treatment choices, with typical adherence to treatment. Prediction is good for natural history studies and for counseling patients about their likely outcomes. What is needed for selecting optimum treatments is an answer to the "what if" question: what is the likely outcome of this patient were she given treatment A vs. were she given treatment B? This is inherently a problem of causal inference, which is why such questions are best answered using experimental designs, such as RCTs. When there is evidence that the complete, accurate observational data captured and eliminated confounding by indication, then and only then can observational data be a substitute for RCTs in making smart treatment choices.

What is a good global strategy for making optimum decisions for individual patients? Much more could be said, but for starters consider the following steps:

  • Obtain the best covariate-adjusted estimate of relative treatment effect (e.g., odds ratio, hazard ratio) from an RCT. Check whether this estimate is constant or whether it depends on patient characteristics (i.e., whether heterogeneity of treatment effect exists on the relative scale). One possible strategy, using fully specified interaction tests adjusted for all main effects, is in Biostatistics for Biomedical Research in the Analysis of Covariance chapter.
  • Develop a predictive model from complete, accurate observational data, and perform strong internal validation using the bootstrap to verify absolute calibration accuracy. Use this model to handle risk magnification whereby absolute treatment benefits are greater for sicker patients in most cases.
  • Apply the relative treatment effects from the RCT, separately for treatment A and treatment B, to the estimated outcome risk from the observational data to obtain estimates of absolute treatment benefit (B vs. A) for the patient. See the first figure below which relates a hazard ratio to absolute improvement in survival probability.
  • Develop a nomogram using the RCT data to estimate absolute treatment benefit for an individual patient. See the second figure below whose bottom axis is the difference between two logistic regression models. (Both figures are from BBR Chapter 13)
  • For more about such strategies, see Stephen Senn's presentation.
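The third step above, applying a constant relative effect to a patient's estimated risk, can be sketched as follows. This is a simplified illustration and not the BBR code: it assumes the baseline risk is the patient's risk under treatment A, that the hazard ratio or odds ratio is constant over patients, and that proportional hazards holds so that the survival probability under B is the survival probability under A raised to the power of the hazard ratio. Function names are hypothetical.

```python
def risk_under_b_from_hr(risk_a, hazard_ratio):
    """Risk by a fixed time under treatment B, given risk under A.

    Proportional hazards implies S_B(t) = S_A(t) ** HR.
    """
    if not 0.0 < risk_a < 1.0:
        raise ValueError("risk_a must be strictly between 0 and 1")
    return 1.0 - (1.0 - risk_a) ** hazard_ratio


def risk_under_b_from_or(risk_a, odds_ratio):
    """Risk under treatment B when the relative effect is an odds ratio."""
    if not 0.0 < risk_a < 1.0:
        raise ValueError("risk_a must be strictly between 0 and 1")
    odds_b = odds_ratio * risk_a / (1.0 - risk_a)
    return odds_b / (1.0 + odds_b)


def absolute_risk_reduction(risk_a, ratio, scale="hazard"):
    """Absolute risk reduction (A minus B) for a given baseline risk."""
    if scale == "hazard":
        return risk_a - risk_under_b_from_hr(risk_a, ratio)
    if scale == "odds":
        return risk_a - risk_under_b_from_or(risk_a, ratio)
    raise ValueError("scale must be 'hazard' or 'odds'")


if __name__ == "__main__":
    # Hazard ratio of 0.8, as in the hypothetical RCT described earlier
    for risk in (0.05, 0.1, 0.3, 0.5):
        print(risk, absolute_risk_reduction(risk, 0.8))
```

Even with the same hazard ratio for everyone, the absolute benefit differs across this range of baseline risks, which is the risk magnification referred to in the second step.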

Saturday, April 8, 2017

Statistical Errors in the Medical Literature

Updated 2017-11-04

  1. Misinterpretation of P-values and Main Study Results
  2. Dichotomania
  3. Problems With Change Scores
  4. Improper Subgrouping
  5. Serial Data and Response Trajectories

As Doug Altman famously wrote in his Scandal of Poor Medical Research in BMJ in 1994, the quality of how statistical principles and analysis methods are applied in medical research is quite poor.  According to Doug and to many others such as Richard Smith, the problems have only gotten worse.  The purpose of this blog article is to contain a running list of new papers in major medical journals that are statistically problematic, based on my random encounters with the literature.

One of the most pervasive problems in the medical literature (and in other subject areas) is misuse and misinterpretation of p-values as detailed here, and chief among these issues is perhaps the absence of evidence is not evidence of absence error written about so clearly by Altman and Bland.   The following thought will likely rattle many biomedical researchers but I've concluded that most of the gross misinterpretation of large p-values by falsely inferring that a treatment is not effective is caused by (1) the investigators not being brave enough to conclude "We haven't learned anything from this study", i.e., they feel compelled to believe that their investments of time and money must be worth something, (2) journals accepting such papers without demanding a proper statistical interpretation in the conclusion.  One example of proper wording would be "This study rules out, with 0.95 confidence, a reduction in the odds of death that is more than by a factor of 2."  Ronald Fisher, when asked how to interpret a large p-value, said "Get more data." Adoption of Bayesian methods would solve many problems including this one.  Whether a p-value is small or large a Bayesian can compute the posterior probability of similarity of outcomes of two treatments (e.g., Prob(0.85 < odds ratio < 1/0.85)), and the researcher will often find that this probability is not large enough to draw a conclusion of similarity.  On the other hand, what if even under a skeptical prior distribution the Bayesian posterior probability of efficacy were 0.8 in a "negative" trial?  Would you choose for yourself the standard therapy when it had a 0.2 chance of being better than the new drug? [Note: I am not talking here about regulatory decisions.]  Imagine a Bayesian world where it is standard to report the results for the primary endpoint using language such as:
  • The probability of any efficacy is 0.94 (so the probability of non-efficacy is 0.06).
  • The probability of efficacy greater than a factor of 1.2 is 0.78 (odds ratio < 1/1.2).
  • The probability of similarity to within a factor of 1.2 is 0.3.
  • The probability that the true odds ratio is between [0.6, 0.99] is 0.95 (credible interval; doesn't use the long-run tendency of confidence intervals to include the true value for 0.95 of confidence intervals computed).
In a so-called "negative" trial we frequently see the phrase "treatment B was not significantly different from treatment A" without thinking out how little information that carries.  Was the power really adequate? Is the author talking about an observed statistic (probably yes) or the true unknown treatment effect?  Why should we care more about statistical significance than clinical significance?  The phrase "was not significantly different" seems to be a way to avoid the real issues of interpretation of large p-values. Since my #1 area of study is statistical modeling, especially predictive modeling, I pay a lot of attention to model development and model validation as done in the medical literature, and I routinely encounter published papers where the authors do not have basic understanding of the statistical principles involved.  This seems to be especially true when a statistician is not among the paper's authors.  I'll be commenting on papers in which I encounter statistical modeling, validation, or interpretation problems.

Misinterpretation of P-values and of Main Study Results

One of the most problematic examples I've seen is in the March 2017 paper Levosimendan in Patients with Left Ventricular Dysfunction Undergoing Cardiac Surgery by Rajendra Mehta in the New England Journal of Medicine.  The study was designed to detect a miracle - a 35% relative odds reduction with drug compared to placebo, and used a power requirement of only 0.8 (type II error a whopping 0.2).  [The study also used some questionable alpha-spending that Bayesians would find quite odd.]  For the primary endpoint, the adjusted odds ratio was 1.00 with 0.99 confidence interval [0.66, 1.54] and p=0.98.  Yet the authors concluded "Levosimendan was not associated with a rate of the composite of death, renal-replacement therapy, perioperative myocardial infarction, or use of a mechanical cardiac assist device that was lower than the rate with placebo among high-risk patients undergoing cardiac surgery with the use of cardiopulmonary bypass."   Their own data are consistent with a 34% reduction (as well as a 54% increase)!  Almost nothing was learned from this underpowered study.  It may have been too disconcerting for the authors and the journal editor to have written "We were only able to rule out a massive benefit of drug."  [Note: two treatments can have agreement in outcome probabilities by chance just as they can have differences by chance.]  It would be interesting to see the Bayesian posterior probability that the true unknown odds ratio is in [0.85, 1/0.85]. The primary endpoint is the union of death, dialysis, MI, or use of a cardiac assist device.  This counts these four endpoints as equally bad.  An ordinal response variable would have yielded more statistical information/precision and perhaps increased power.  And instead of dealing with multiplicity issues and alpha-spending, the multiple endpoints could have been dealt with more elegantly with a Bayesian analysis.  
For example, one could easily compute the joint probability that the odds ratio for the primary endpoint is less than 0.8 and the odds ratio for the secondary endpoint is less than 1 [the secondary endpoint was death or assist device and is harder to demonstrate because of its lower incidence, and is perhaps more of a "hard endpoint"].  In the Bayesian world of forward directly relevant probabilities there is no need to consider multiplicity.  There is only a need to state the assertions for which one wants to compute current probabilities.
The paper also contains inappropriate assessments of interactions with treatment using subgroup analysis with arbitrary cutpoints on continuous baseline variables and failure to adjust for other main effects when doing the subgroup analysis.
This paper had a fine statistician as a co-author.  I can only conclude that the pressure to avoid disappointment with a conclusion of spending a lot of money with little to show for it was in play.
Why was such an underpowered study launched?  Why do researchers attempt "hail Mary passes"?  Is a study that is likely to be futile fully ethical?   Do medical journals allow this to happen because of some vested interest?
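For readers who want a feel for the kind of posterior probability mentioned above, here is a crude sketch. It is not a proper Bayesian analysis of the trial: it assumes a flat prior on the log odds ratio and a normal approximation, so the posterior is centered at the reported estimate with a standard deviation backed out of the published 0.99 confidence interval. The function names are hypothetical, and a skeptical prior would give different answers.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def se_log_ratio_from_ci(lower, upper, z):
    """Standard error of a log ratio implied by a confidence interval.

    z is the normal critical value for the interval's confidence level,
    e.g. about 2.5758 for 0.99 and about 1.96 for 0.95.
    """
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def prob_ratio_between(point, se_log, low, high):
    """Approximate posterior P(low < ratio < high) under a flat prior."""
    mu = math.log(point)
    upper_z = (math.log(high) - mu) / se_log
    lower_z = (math.log(low) - mu) / se_log
    return normal_cdf(upper_z) - normal_cdf(lower_z)


if __name__ == "__main__":
    # Reported primary endpoint: odds ratio 1.00, 0.99 CI [0.66, 1.54]
    se = se_log_ratio_from_ci(0.66, 1.54, z=2.5758)
    print("P(0.85 < OR < 1/0.85):", prob_ratio_between(1.00, se, 0.85, 1 / 0.85))
    print("P(OR < 1):", prob_ratio_between(1.00, se, 1e-9, 1.0))
```

The same two functions can be pointed at any reported ratio and interval, as long as one keeps in mind that the flat prior and normal approximation are assumptions of this sketch and not part of the published analysis.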

Similar Examples

Perhaps the above example is no worse than many.  Examples of "absence of evidence" misinterpretations abound.  Consider the JAMA paper by Kawazoe et al published 2017-04-04.  They concluded that "Mortality at 28 days was not significantly different in the dexmedetomidine group vs the control group (19 patients [22.8%] vs 28 patients [30.8%]; hazard ratio, 0.69; 95% CI, 0.38-1.22; P = .20)."  The point estimate was a reduction in hazard of death by 31% and the data are consistent with the reduction being as large as 62%! Or look at this 2017-03-21 JAMA article in which the authors concluded "Among healthy postmenopausal older women with a mean baseline serum 25-hydroxyvitamin D level of 32.8 ng/mL, supplementation with vitamin D3 and calcium compared with placebo did not result in a significantly lower risk of all-type cancer at 4 years." even though the observed hazard ratio was 0.7, with lower confidence limit of a whopping 53% reduction in the incidence of cancer.  And the 0.7 was an unadjusted hazard ratio; the hazard ratio could well have been more impressive had covariate adjustment been used to account for outcome heterogeneity within each treatment arm.
An incredibly high-profile paper published online 2017-11-02 in The Lancet demonstrates a lack of understanding of some statistical issues. In Percutaneous coronary intervention in stable angina (ORBITA): a double-blind, randomised controlled trial by Rasha Al-Lamee et al, the authors (or was it the journal editor?) boldly claimed "In patients with medically treated angina and severe coronary stenosis, PCI did not increase exercise time by more than the effect of a placebo procedure." The authors are to be congratulated on using a rigorous sham control, but the authors, reviewers, and editor allowed a classic absence of evidence is not evidence of absence error to be made in attempting to interpret p=0.2 for the primary analysis of exercise time in this small (n=200) RCT. In doing so they ignored the useful (but flawed; see below) 0.95 confidence interval of this effect of [-8.9, 42] seconds of exercise time increase for PCI. Thus their data are consistent with a 42 second increase in exercise time by real PCI. It is also important to note that the authors fell into the change from baseline trap by disrespecting their own parallel group design. They should have asked the covariate-adjusted question: For two patients starting with the same exercise capacity, one assigned PCI and one assigned PCI sham, what is the average difference in follow-up exercise time?
But there are other ways to view this study. Sham studies are difficult to fund, and it is difficult to recruit large numbers of patients for them. Criticizing the interpretation of the statistical analysis fails to recognize the value of the study. One value is the study's ruling out an exercise time improvement greater than 42s (with 0.95 confidence). If, as several cardiologists have told me, 42s is not very meaningful to the patient, then the study is definitive and clinically relevant. I just wish that authors and especially editors would use exactly correct language in abstracts of articles. For this trial, suitable language would have been along these lines: The study did not find evidence against the null hypothesis of no change in exercise time (p=0.2), but was able to (with 0.95 confidence) rule out an effect larger than 42s. A Bayesian analysis would have been even more clinically useful. For example, one might find that the posterior probability that the increase in exercise time with PCI is less than 20s is 0.97. And our infatuation with 2-tailed p-values comes into play here. A Bayesian posterior probability of any improvement might be around 0.88, far more "positive" than what someone who misunderstands p-values would conclude from an "insignificant" p-value. Other thoughts concerning the ORBITA trial may be found here.

Dichotomania


Dichotomania, as discussed by Stephen Senn, is a very prevalent problem in medical and epidemiologic research.  Categorization of continuous variables for analysis is inefficient at best and misleading and arbitrary at worst.  This JAMA paper by VISION study investigators "Association of Postoperative High-Sensitivity Troponin Levels With Myocardial Injury and 30-Day Mortality Among Patients Undergoing Noncardiac Surgery" is an excellent example of bad statistical practice that limits the amount of information provided by the study.  The authors categorized high-sensitivity troponin T levels measured post-op and related these to the incidence of death.  They used four intervals of troponin, and there is important heterogeneity of patients within these intervals.  This is especially true for the last interval (> 1000 ng/L).  Mortality may be much higher for troponin values that are much larger than 1000.  The relationship should have been analyzed with a continuous analysis, e.g., logistic regression with a regression spline for troponin, nonparametric smoother, etc.  The final result could be presented in a simple line graph with confidence bands. An example of dichotomania that may not be surpassed for some time is Simplification of the HOSPITAL Score for Predicting 30-day Readmissions by Carole E Aubert, et al in BMJ Quality and Safety 2017-04-17. The authors arbitrarily dichotomized several important predictors, resulting in a major loss of information, then dichotomized their resulting predictive score, sacrificing much of what information remained. The authors failed to grasp probabilities, resulting in risk of 30-day readmission of "unlikely" and "likely". The categorization of predictor variables leaves demonstrable outcome heterogeneity within the intervals of predictor values. 
Then taking an already oversimplified predictive score and dichotomizing it is essentially saying to the reader "We don't like the integer score we just went to the trouble to develop." I now have serious doubts about the thoroughness of reviews at BMJ Quality and Safety.
A very high-profile paper was published in BMJ on 2017-06-06: Moderate alcohol consumption as risk factor for adverse brain outcomes and cognitive decline: longitudinal cohort study by Anya Topiwala et al. The authors had a golden opportunity to estimate the dose-response relationship between amount of alcohol consumed and quantitative brain changes. Instead the authors squandered the data by doing analyses that either assumed that responses are linear in alcohol consumption or worse, by splitting consumption into 6 heterogeneous intervals when in fact consumption was shown in their Figure 3 to have a nice continuous distribution. How much more informative (and statistically powerful) it would have been to fit a quadratic or a restricted cubic spline function to consumption to estimate the continuous dose-response curve.
The NEJM keeps giving us great teaching examples with its 2017-08-03 edition. In Angiotensin II for the treatment of vasodilatory shock by Ashish Khanna et al, the authors constructed a bizarre response variable: "The primary end point was a response with respect to mean arterial pressure at hour 3 after the start of infusion, with response defined as an increase from baseline of at least 10 mm Hg or an increase to at least 75 mm Hg, without an increase in the dose of background vasopressors." This form of dichotomania has been discredited by Stephen Senn who provided a similar example in which he decoded the response function to show that the lucky patient is one (in the NEJM case) who has a starting blood pressure of 74mmHg. His example is below:

When a clinical trial's response variable is one that is arbitrary, loses information and power, is difficult to interpret, and means different things for different patients, expect trouble.
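The "lucky patient" point can be made concrete by decoding the quoted response definition into the smallest blood pressure increase that counts as a response. This is a minimal sketch under simplifying assumptions: it considers only patients whose baseline mean arterial pressure is below 75 mm Hg, ignores the condition about background vasopressor dose, and uses a function name of my own invention.

```python
def required_increase(baseline_map, min_rise=10.0, target=75.0):
    """Smallest rise in mean arterial pressure (mm Hg) counted as a response.

    A patient responds by rising at least min_rise from baseline OR by
    reaching at least target, so the bar is whichever is easier to clear.
    """
    if baseline_map >= target:
        raise ValueError("sketch only covers baselines below the target")
    return min(min_rise, target - baseline_map)


if __name__ == "__main__":
    for baseline in (55, 60, 65, 66, 70, 74):
        print(baseline, required_increase(baseline))
```

A patient starting at 65 mm Hg or lower must gain a full 10 mm Hg, while the bar shrinks steadily for patients starting closer to 75, so "response" means something different for each baseline value.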

Change from Baseline

Many authors and pharmaceutical clinical trialists make the mistake of analyzing change from baseline instead of making the raw follow-up measurements the primary outcomes, covariate-adjusted for baseline.  To compute change scores requires many assumptions to hold, e.g.:
  1. the variable is not used as an inclusion/exclusion criterion for the study, otherwise regression to the mean will be strong
  2. if the variable is used to select patients for the study, a second post-enrollment baseline is measured and this baseline is the one used for all subsequent analysis
  3. the post value must be linearly related to the pre value
  4. the variable must be perfectly transformed so that subtraction "works" and the result is not baseline-dependent
  5. the variable must not have floor and ceiling effects
  6. the variable must have a smooth distribution
  7. the slope of the pre value vs. the follow-up measurement must be close to 1.0 when both variables are properly transformed (using the same transformation on both)
Details about problems with analyzing change may be found in BBR Section 14.4 and here, and references may be found here. See also this.  A general problem with the approach is that when Y is ordinal but not interval-scaled, differences in Y may no longer be ordinal.  So analysis of change loses the opportunity to do a robust, powerful analysis using a covariate-adjusted ordinal response model such as the proportional odds or proportional hazards model.  Such ordinal response models do not require one to be correct in how to transform Y. Regarding 3. above, if pre is not linearly related to post, there is no transformation that can make a change score work.
Regarding 7. above, often the baseline is not as relevant as thought and the slope will be less than 1.  When the treatment can cure every patient, the slope will be zero.  Sometimes the relationship between baseline and follow-up Y is not even linear, as in one example I've seen based on the Hamilton D depression scale.
The purpose of a parallel-group randomized clinical trial is to compare the parallel groups, not to compare a patient with herself at baseline. The central question is for two patients with the same pre measurement value of x, one given treatment A and the other treatment B, will the patients tend to have different post-treatment values? This is exactly what analysis of covariance assesses.  Within-patient change is affected strongly by regression to the mean and measurement error.  When the baseline value is one of the patient inclusion/exclusion criteria, the only meaningful change score requires one to have a second baseline measurement post patient qualification to cancel out much of the regression to the mean effect.  It is the second baseline that would be subtracted from the follow-up measurement.
The savvy researcher knows that analysis of covariance is required to "rescue" a change score analysis. This effectively cancels out the change score and gives the right answer even if the slope of post on pre is not 1.0. But this works only in the linear model case, and it can be confusing to have the pre variable on both the left and right hand sides of the statistical model. And if Y is ordinal but not interval-scaled, the difference in two ordinal variables is no longer even ordinal. Think of how meaningless difference from baseline in ordinal pain categories is. A major problem in the use of change score summaries, even when a correct analysis of covariance has been done, is that many papers and drug product labels still quote change scores out of context.
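The "rescue" can be checked numerically. The following is a minimal sketch on simulated data (the sample size, slope of 0.6, and treatment effect of -5 are all hypothetical choices, not values from any trial): regressing post on treatment and pre, and regressing the change score on treatment and pre, give the same treatment estimate, with the baseline slope simply shifted by 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, size=n)        # hypothetical randomized arm (0/1)
pre = rng.normal(100, 15, size=n)         # hypothetical baseline measurement
# Hypothetical truth: slope of post on pre is 0.6, not 1.0
post = 40 + 0.6 * pre - 5 * treat + rng.normal(0, 10, size=n)

# Design matrix shared by both models: intercept, treatment, baseline
X = np.column_stack([np.ones(n), treat, pre])

def ols(y):
    """Ordinary least squares coefficients for response y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_post = ols(post)            # ANCOVA on the raw follow-up value
b_change = ols(post - pre)    # ANCOVA on the change score

print("treatment effect, post as response:  ", b_post[1])
print("treatment effect, change as response:", b_change[1])
print("baseline slopes:", b_post[2], b_change[2])
```

The two treatment estimates agree because subtracting pre from the response only subtracts 1 from the coefficient of pre, which is already in the model. This identity holds for the linear model only.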
Patient-reported outcome scales are particularly problematic.  An article published 2017-05-07 in JAMA, doi:10.1001/jama.2017.5103 like many other articles makes the error of trusting change from baseline as an appropriate analysis variable.  Mean change from baseline may not apply to anyone in the trial.  Consider a 5-point ordinal pain scale with values Y=1,2,3,4,5.  Patients starting with no pain (Y=1) cannot improve, so their mean change must be zero.  Patients starting at Y=5 have the most opportunity to improve, so their mean change will be large.  A treatment that improves pain scores by an average of one point may average a two point improvement for patients for whom any improvement is possible.  Stating mean changes out of context of the baseline state can be meaningless.
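A small arithmetic sketch of the pain-scale point, using a made-up set of eight patients (the baseline values and the two-point improvement are hypothetical and chosen only to reproduce the numbers in the paragraph above):

```python
from statistics import mean

# Hypothetical baseline pain scores on the 5-point scale (1 = no pain)
baseline = [1, 1, 1, 1, 3, 4, 5, 5]

# Hypothetical treatment effect: two points of improvement whenever
# improvement is possible; patients at the floor (Y=1) cannot change
improvement = [0 if y == 1 else 2 for y in baseline]
follow_up = [y - d for y, d in zip(baseline, improvement)]

overall_mean = mean(improvement)
improvable_mean = mean(d for y, d in zip(baseline, improvement) if y > 1)

print(overall_mean, improvable_mean)  # prints: 1 2
```

The overall mean change of one point describes none of these patients: half of them changed by zero and the other half by two.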
The NEJM paper Treatment of Endometriosis-Associated Pain with Elagolix, an Oral GnRH Antagonist by Hugh Taylor et al is based on a disastrous set of analyses, combining all the problems above. The authors computed change from baseline on variables that do not have the correct properties for subtraction, engaged in dichotomania by doing responder analysis, and in addition used last observation carried forward to handle dropouts. A proper analysis would have been a longitudinal analysis using all available data that avoided imputation of post-dropout values and used raw measurements as the responses. Most importantly, the twin clinical trials randomized 872 women, and had proper analyses been done the required sample size to achieve the same power would have been far less. Besides the ethical issue of randomizing an unnecessarily large number of women to inferior treatment, the approach used by the investigators maximized the cost of these positive trials.
The NEJM paper Oral Glucocorticoid–Sparing Effect of Benralizumab in Severe Asthma by Parameswaran Nair et al not only takes the problematic approach of using change scores from baseline in a parallel group design but also uses percent change from baseline as the raw data in the analysis. This is an asymmetric measure for which arithmetic doesn't work. For example, suppose that one patient increases from 1 to 2 and another decreases from 2 to 1. The corresponding percent changes are 100% and -50%. The overall summary should be 0% change, not +25% as found by taking the simple average. Doing arithmetic on percent change can essentially involve adding ratios; ratios that are not proportions are never added; they are multiplied. What was needed was an analysis of covariance of raw oral glucocorticoid dose values adjusted for baseline after taking an appropriate transformation of dose, or using a more robust transformation-invariant ordinal semi-parametric model on the raw follow-up doses (e.g., proportional odds model).
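The two-patient example can be worked through in a few lines. This sketch contrasts the simple average of percent changes with a summary that multiplies ratios, done here by averaging on the log scale; the geometric mean is one reasonable way to combine ratios and is my choice for illustration, not something taken from the paper.

```python
import math

# (pre, post) for the two patients in the example above
pairs = [(1.0, 2.0), (2.0, 1.0)]

# Percent change from baseline for each patient
pct = [100 * (post - pre) / pre for pre, post in pairs]
mean_pct = sum(pct) / len(pct)

# Ratios are combined multiplicatively: average the log ratios, then exponentiate
log_ratios = [math.log(post / pre) for pre, post in pairs]
geo_mean_ratio = math.exp(sum(log_ratios) / len(log_ratios))
summary_pct = 100 * (geo_mean_ratio - 1)

print(pct)          # [100.0, -50.0]
print(mean_pct)     # 25.0
print(summary_pct)  # 0.0
```

On the log scale the doubling and the halving are equal and opposite, so they cancel, which is the symmetric behavior one wants from a summary of change.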
In Trial of Cannabidiol for Drug-Resistant Seizures in the Dravet Syndrome in NEJM 2017-05-25, Orrin Devinsky et al take seizure frequency, which might have a nice distribution such as the Poisson, and compute its change from baseline, which is likely to have a hard-to-model distribution. Once again, authors failed to recognize that the purpose of a parallel group design is to compare the parallel groups. Then the authors engaged in improper subtraction, improper use of percent change, dichotomania, and loss of statistical power simultaneously: "The percentage of patients who had at least a 50% reduction in convulsive-seizure frequency was 43% with cannabidiol and 27% with placebo (odds ratio, 2.00; 95% CI, 0.93 to 4.30; P=0.08)." The authors went on to analyze the change in a discrete ordinal scale, where change (subtraction) cannot have a meaning independent of the starting point at baseline.
Troponins (T) are myocardial proteins that are released when the heart is damaged. A high-sensitivity T assay is a high-information cardiac biomarker used to diagnose myocardial infarction and to assess prognosis. I have been hoping to find a well-designed study with standardized serially measured T that is optimally analyzed, to provide answers to the following questions:
  1. What is the shape of the relationship between the latest T measurement and time until a clinical endpoint?
  2. How does one use a continuous T to estimate risk?
  3. If T were measured previously, does the previous measurement add any predictive information to the current T?
  4. If both the earlier and current T measurement are needed to predict outcome, how should they be combined? Is what's important the difference of the two? Is it the ratio? Is it the difference in square roots of T?
  5. Is the 99th percentile of T for normal subjects useful as a prognostic threshold?
The 2017-05-16 Circulation paper Serial Measurement of High-Sensitivity Troponin I and Cardiovascular Outcomes in Patients With Type 2 Diabetes Mellitus in the EXAMINE Trial by Matthew Cavender et al was based on a well-designed cardiovascular safety study of diabetes in which uniformly measured high-sensitivity troponin I measurements were made at baseline and six months after randomization to the diabetes drug Alogliptin. [Note: I was on the DSMB for this study] The authors nicely envisioned a landmark analysis based on six-month survivors. But instead of providing answers to the questions above, the authors engaged in dichotomania and never checked whether changes in T or changes in log T possessed the appropriate properties to be used as a valid change score, i.e., they did not plot change in T vs. baseline T or log T ratio vs. baseline T and demonstrate a flat line relationship. Their statistical analysis used statistical methods from 50 years ago, even doing the notorious "test for trend" that tests for a linear correlation between an outcome and an integer category interval number. The authors seem to be unaware of the many flexible tools developed (especially starting in the mid 1980s) for statistical modeling that would answer the questions posed above. Cavender et al stratified T into <1.9 ng/L, 1.9-<10 ng/L, 10-<26 ng/L, and ≥26 ng/L. Fully 1/2 of the patients were in the second interval. Except for the first interval (T below the lower detection limit) the groups are heterogeneous with regard to outcome risks. And there are no data from this study or previous studies that validate these cutpoints. To validate them, the relationship between T and outcome risk would have to be shown to be discontinuous at the cutpoints, and flat between them.
From their paper we still don't know how to use T continuously, and we don't know whether baseline T is informative once a clinician has obtained an updated T. The inclusion of a 3-D block diagram in the supplemental material is symptomatic of the data presentation problems in this paper.
It's not as though T hasn't been analyzed correctly. In a 1996 NEJM paper, Ohman et al used a nonparametric smoother to estimate the continuous relationship between T and 30-day risk. Instead, Cavender, et al created arbitrary heterogeneous intervals of both baseline and 6m T, then created various arbitrary ways to look at change from baseline and its relationship to risk.
An analysis that would have answered my questions would have been to
  1. Fit a standard Cox proportional hazards time-to-event model with the usual baseline characteristics
  2. Add to this model a tensor spline in the baseline and 6m T levels, i.e., a smooth 3-D relationship between baseline T, 6m T, and log hazard, allowing for interaction, and restricting the 3-D surface to be smooth. See for example BBR Figure 4.23. One can do this by using restricted cubic splines in both T's and by computing cross-products of these terms for the interactions. By fitting a flexible smooth surface, the data would be able to speak for themselves without imposing linearity or additivity assumptions and without assuming that change or change in log T is how these variables combine.
  3. Do a formal test of whether baseline T (as either a main effect or as an effect modifier of the 6m T effect, i.e., interaction effect) is associated with outcome when controlling for 6m T and ordinary baseline variables
  4. Quantify the prognostic value added by baseline T by computing the fraction of likelihood ratio chi-square due to both T's combined that is explained by baseline T. Do likewise to show the added value of 6m T. Details about these methods may be found in Regression Modeling Strategies, 2nd edition
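To make step 2 concrete, here is a minimal sketch of how the design matrix for such a surface could be built: a restricted cubic spline basis for each T, plus all cross-products of the two sets of terms. The simulated troponin values, the log transformation, and the 4 knots placed at the 0.05, 0.35, 0.65, and 0.95 quantiles are assumptions made for illustration. Fitting the Cox model itself is not shown; the resulting columns would be passed to whatever survival software is in use.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis: x itself plus k-2 nonlinear terms,
    constrained to be linear beyond the outer knots."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    norm = (t[-1] - t[0]) ** 2            # keeps nonlinear terms on the scale of x

    def pos3(u):
        return np.clip(u, 0.0, None) ** 3  # truncated cubic (u)_+^3

    cols = [x]
    for j in range(k - 2):
        term = (pos3(x - t[j])
                - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        cols.append(term / norm)
    return np.column_stack(cols)

def tensor_spline(x1, knots1, x2, knots2):
    """Main-effect spline terms for both variables plus all cross-products."""
    b1 = rcs_basis(x1, knots1)
    b2 = rcs_basis(x2, knots2)
    cross = np.einsum("ij,ik->ijk", b1, b2).reshape(len(b1), -1)
    return np.column_stack([b1, b2, cross])

# Hypothetical simulated troponin values at baseline and 6m, analyzed on the log scale
rng = np.random.default_rng(0)
log_t0 = np.log(rng.lognormal(mean=1.5, sigma=0.8, size=500))
log_t6 = 0.7 * log_t0 + rng.normal(0, 0.5, size=500)

q = [0.05, 0.35, 0.65, 0.95]
design = tensor_spline(log_t0, np.quantile(log_t0, q),
                       log_t6, np.quantile(log_t6, q))
print(design.shape)  # (500, 15): 3 + 3 main-effect columns, 9 cross-products
```

With 4 knots each variable contributes 3 columns, so the surface uses 15 parameters. The test in step 3 would then be a composite test of the 3 baseline main-effect columns together with the 9 cross-product columns.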
Without proper analyses of T as a continuous variable, the reader is left with confusion as to how to really use T in practice, and is given no insight into whether changes are relevant or the baseline T can be ignored when a later T is obtained. In all the clinical outcome studies I've analyzed (including repeated LV ejection fractions and serum creatinines), the latest measurement has been what really mattered, and it hasn't mattered very much how the patient got there. As long as continuous markers are categorized, clinicians are going to get suboptimal risk prediction and are going to find that more markers need to be added to the model to recover the information lost by categorizing the original markers. They will also continue to be surprised that other researchers find different "cutpoints", not realizing that when things don't exist, people will forever argue about their manifestations.

Improper Subgrouping

The JAMA Internal Medicine paper Effect of Statin Treatment vs Usual Care on Primary Cardiovascular Prevention Among Older Adults by Benjamin Han et al makes the classic statistical error of attempting to learn about differences in treatment effectiveness by subgrouping rather than by correctly modeling interactions. They compounded the error by not adjusting for covariates when comparing treatments in the subgroups, and even worse, by subgrouping on a variable for which grouping is ill-defined and information-losing: age. They used age intervals of 65-74 and 75+. A proper analysis would have been, for example, modeling age as a smooth nonlinear function (e.g., using a restricted cubic spline) and interacting this function with treatment to allow for a high-resolution, non-arbitrary analysis that allows for nonlinear interaction. Results could be displayed by showing the estimated treatment hazard ratio and confidence bands (y-axis) vs. continuous age (x-axis). The authors' analysis avoids the question of a dose-response relationship between age and treatment effect. A full strategy for interaction modeling for assessing heterogeneity of treatment effect (AKA precision medicine) may be found in the analysis of covariance chapter in Biostatistics for Biomedical Research. To make matters worse, the above paper included patients with a sharp cutoff of 65 years of age as the lower limit. How much more informative it would have been to have a linearly increasing (in age) enrollment function that reaches a probability of 1.0 at 65y. Assuming that something magic happens at age 65 with regard to cholesterol reduction is undoubtedly a mistake.

Serial Data and Response Trajectories

Serial data (aka longitudinal data) with multiple follow-up assessments per patient presents special challenges and opportunities. My preferred analysis strategy uses full likelihood or Bayesian continuous-time analysis, using generalized least squares or mixed effects models. This allows each patient to have different measurement times, analysis of the data using actual days since randomization instead of clinic visit number, and non-random dropouts as long as the missing data are missing at random. Missing at random here means that given the baseline variables and the previous follow-up measurements the current measurement is missing completely at random. Imputation is not needed. In the Hypertension July 2017 article Heterogeneity in Early Responses in ALLHAT (Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial) by Sanket Dhruva et al, the authors did advanced statistical analysis that is a level above the papers discussed elsewhere in this article. However, their claim of avoiding dichotomania is unfounded. The authors were primarily interested in the relationship between blood pressures measured at randomization, 1m, 3m, 6m with post-6m outcomes, and they correctly envisioned the analysis as a landmark analysis of patients who were event-free at 6m. They did a careful cluster analysis of blood pressure trajectories from 0-6m. But their chosen method assumes that the variety of trajectories falls into two simple homogeneous trajectory classes (immediate responders and all others). Trajectories of continuous measurements, like the continuous measurements themselves, rarely fall into discrete categories with shape and level homogeneity within the categories. The analyses would in my opinion have been better, and would have been simpler, had everything been considered on a continuum.
With landmark analysis we now have 4 baseline measurements: the new baseline (previously called the 6m blood pressure) and 3 historical measurements. One can use these as 4 covariates to predict time until clinical post-6m outcome using a standard time-to-event model such as the Cox proportional hazards model. In doing so, we are estimating the prognosis associated with every possible trajectory and we can solve for the trajectory that yields the best outcome. We can also do a formal statistical test for whether the trajectories can be summarized more simply than with a 4-dimensional construct, e.g., whether the final blood pressure contains all the prognostic information. Besides specifying the model with baseline covariates (in addition to other original baseline covariates), one also has the option of creating a tall and thin dataset with 4 records per patient (if correlations are accounted for, e.g., cluster sandwich or cluster bootstrap covariance estimates) and modeling outcome using updated covariates and possible interactions with time to allow for time-varying blood pressure effects.
A logistic regression trick described in my book Regression Modeling Strategies comes in handy for modeling how baseline characteristics such as sex, age, or randomized treatment relate to the trajectories. Here one predicts the baseline variable of interest using the four blood pressures. By studying the 4 regression coefficients one can see exactly how the trajectories differ between patients grouped by the baseline variable. This includes studying differences in trajectories by treatment with no dichotomization. For example, if there is a significant association (using a composite (chunk) test) between treatment and any of the 4 blood pressures in the logistic model predicting treatment, that implies that the reverse is true: one or more of the blood pressures is associated with treatment. Suppose for example that a 4 d.f. test demonstrates some association, the 1 d.f. test for the first blood pressure is very significant, and the 3 d.f. test for the last 3 blood pressures is not. This would be interpreted as the treatment having an early effect that wears off shortly thereafter. [For this particular study, with the first measurement being made pre-randomization, such a result would indicate failure of randomization and no blood-pressure response to treatment of any kind.] Were the 4 regression coefficients to be negative and in descending order, this would indicate a progressive reduction in blood pressure due to treatment.
Returning to the originally stated preferred analysis when blood pressure is the outcome of interest (and not time to clinical events), one can use generalized least squares to predict the longitudinal blood pressure trends from treatment. This will be more efficient and also allows one to adjust for baseline variables other than treatment. It would probably be best to make the original baseline blood pressure a baseline variable and to have 3 serial measurements in the longitudinal model. Time would usually be modeled continuously (e.g., using a restricted cubic spline function). But in the Dhruva article the measurements were made at a small number of discrete times, so time could be considered a categorical variable with 3 levels.
I have had misgivings for many years about the quality of statistical methods used by the Channing Lab at Harvard, as well as misgivings about the quality of nutritional epidemiology research in general. My misgivings were again confirmed in the 2017-07-13 NEJM publication Association of Changes in Diet Quality with Total and Cause-Specific Mortality by Mercedes Sotos-Prieto et al. There are the usual concerns about confounding and possible alternate explanations, which the authors did not fully deal with (and why did the authors not include an analysis that characterized which types of subjects tended to have changes in their dietary quality?). But this paper manages to combine dichotomania with probably improper change score analysis and hard-to-interpret results. It started off as a nicely conceptualized landmark analysis in which dietary quality scores were measured during both an 8-year and a 16-year period, and these scores were related to total and cause-specific mortality following those landmark periods. But then things went seriously wrong. The authors computed change in diet scores from the start to the end of the qualification period, did not validate that these are proper change scores (see above for more about that), and engaged in percentiling as if the number of neighbors with worse diets than you is what predicts your mortality rather than the absolute quality of your own diet. They then grouped the changes into quintile groups without justification, and examined change quantile score group effects in Cox time-to-event models. It is telling that the baseline dietary scores varied greatly over the change quintiles. The authors emphasized the 20-percentile increase in each score when interpreting results. What does that mean? How is it related to absolute diet quality scores?
The high quality dataset available to the authors could have been used to answer real questions of interest using statistical analyses that did not have hidden assumptions. From their analyses we have no idea of how the subjects' diet trajectories affected mortality, or indeed whether the change in diet quality was as important as the most recent diet quality for the subject, ignoring how the subject arrived at that point at the end of the qualification period. What would be an informative analysis? Start with the simpler one: use a smooth tensor spline interaction surface to estimate relative log hazard of mortality, and construct a 3-D plot with initial diet quality on the x-axis, final (landmark) diet quality on the y-axis, and relative log hazard on the z-axis. Then the more in-depth modeling analysis can be done in which one uses multiple measures of diet quality over time and relates the trajectory (its shape, average level, etc.) to hazard of death. Suppose that absolute diet quality was measured at four baseline points. These four variables could be related to outcome and one could solve for the trajectory that was associated with the lowest mortality. For a study that is almost wholly statistical, it is a shame that modern statistical methods appeared to not even be considered. And for heaven's sake analyze the raw diet scales and do not percentile them.