What clinicians learn from clinical practice, unless they routinely do n-of-one studies, is based on comparisons of unlikes. Then they criticize like-vs-like comparisons from randomized trials for not being generalizable. This is made worse by not understanding that clinical trials are designed to estimate relative efficacy, and relative efficacy is surprisingly transportable.

Many clinicians do not even track what happens to their patients to be able to inform their future patients. At the least, randomized trials track everyone.

Parallel-group RCTs enroll volunteers whose characteristics do not mimic any population. They are then assigned treatment, and the result is the ability to estimate between-group shifts in outcomes, not to estimate population outcome tendencies in one treatment group. Think of a linear regression. The RCT is used to estimate the slope (relative shift), not the intercept (absolute anchor).

First published 2017-01-27; major revision 2023-02-13

Randomized clinical trials (RCTs) have various goals, including providing evidence that

- a treatment is superior to another treatment in a way that is likely to benefit patients
- a new treatment yields patient outcomes that are similar enough to an established treatment that the two may be considered interchangeable
- a diagnostic device or other technology provides information that improves patient management or outcomes

Let’s consider only the first goal. RCTs have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason. But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:

1. Patients in clinical practice are different from those enrolled in RCTs.
2. Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.

Point 2 is hard to debate because RCTs are run under protocol, and research personnel are watching and asking about patients’ adherence (but more about this below). But point 1 is a misplaced worry in the majority of trials. The explanation requires getting to the heart of what RCTs are really intended to do:

- Provide evidence for **relative** treatment effectiveness over an adequate time horizon for assessing target patient outcomes
- Provide evidence for **relative** safety over a somewhat adequate time horizon for assessing non-target safety outcomes

Let’s go into the meaning of relative effectiveness, for two types of outcome variables. For a continuous response $Y$ such as systolic blood pressure (SBP), which for practical purposes may be considered to have an unrestricted range, the efficacy measure of interest is often the difference in two means, i.e., the mean reduction in SBP. Letting $E$ denote expected value or long-term average, $T$ denote treatment assignment ($A$ or $B$), and $X$ denote a vector of baseline (pre-randomization) patient characteristics, a key quantity of interest for continuous $Y$ is

$$E(Y | T=B, X) - E(Y | T=A, X)$$

where $|$ denotes “conditional on” or “holding constant”. Since the long-run mean is a linear operator and we typically use a linear model to analyze the data, the average is a collapsible quantity, meaning that the covariate-specific treatment effect $E(Y | T=B, X) - E(Y | T=A, X)$ equals the marginal treatment effect $E(Y | T=B) - E(Y | T=A)$. Covariate adjustment is still needed to estimate this difference to reduce variance and hence achieve optimum power and precision. The mean difference can be estimated from the patients in the RCT and this also estimates a population-averaged treatment effect. The patient mix does not matter unless there are interactions between one or more $X$s and $T$ and the distribution of the interacting factors in $X$ differs between RCT and target populations.

Only a minority of RCTs actually use covariate adjustment for the primary analysis, a sad fact frequently lamented by regulatory authorities. The highly problematic consequences of this are discussed here, especially the resulting inflation of sample size to make up for failing to account for within-treatment patient outcome heterogeneity. Besides wasting time and resources, designating unadjusted analysis as the primary analysis leads to ethical concerns about exposing too many patients to experimental therapies. It also leads to much confusion about whether and how to handle observed baseline imbalance, which would have been circumvented by pre-specifying covariate adjustment for important factors. Here we assume that the primary analysis uses best statistical practices and so is covariate-adjusted.

When we condition on baseline characteristics in $X$, these describe *types* of patients. So we obtain patient-type-specific tendencies of $Y$. When $X$ omits an important patient characteristic we obtain patient-type-specific values up to the resolution of how “type” is measured. Such conditional estimates will be *marginal* over omitted covariates, i.e., they will average over the sample distribution of omitted covariates. For linear models this is consequential only in not further reducing the residual variance, so some efficiency is lost. For nonlinear models such as logistic and Cox models, the consequence is that the treatment effect is a kind of weighted average over the sample distribution of omitted covariates that are important. That doesn’t make it wrong or unhelpful. The effect of this on the average is to underestimate the true treatment effect that compares like with like by conditioning on all important covariates.

Any covariate conditioning is better than none. Estimating unadjusted treatment effects in nonlinear model situations will result in stronger attenuation of the treatment effect (e.g., move an odds ratio (OR) towards 1.0) on the average, will get the model wrong, and will not lend itself to understanding the distribution of absolute risk reductions (ARRs) nor provide any basis for treatment interaction/assessment of differential treatment effect. Regarding “get the model wrong”, a good example is that if the treatment effect is constant over time upon covariate adjustment (i.e., the proportional hazards (PH) assumption holds), the unadjusted treatment effect will violate PH. As an example, let there be a large difference in survival time between males and females. Failure to condition on sex will make the analyst see a complex bimodal survival time distribution with unexplained modes, and this can lead to violating PH for treatment. Practical experience has found more studies with PH after covariate adjustment than studies with PH without covariate adjustment.
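The attenuation of an unadjusted odds ratio can be seen with simple arithmetic. The sketch below is illustrative only: the two strata, their control-arm risks, and the conditional OR of 3 are hypothetical values chosen for the example, and treatment is assumed to be balanced within strata, as randomization would give on average.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def risk(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def marginal_odds_ratio(control_risks, weights, conditional_or):
    """Odds ratio obtained when the stratifying covariate is ignored.

    control_risks: control-arm risk within each stratum (hypothetical values)
    weights: stratum proportions, assumed identical in both treatment arms
    conditional_or: the common within-stratum (covariate-adjusted) odds ratio
    """
    # Apply the same conditional OR within each stratum
    treated_risks = [risk(odds(p) * conditional_or) for p in control_risks]
    # Average risks over strata, which is what an unadjusted analysis sees
    p_control = sum(w * p for w, p in zip(weights, control_risks))
    p_treated = sum(w * p for w, p in zip(weights, treated_risks))
    return odds(p_treated) / odds(p_control)


# Hypothetical example: two equally sized strata (e.g., by sex)
control_risks = [0.2, 0.8]
weights = [0.5, 0.5]
conditional_or = 3.0

unadjusted_or = marginal_odds_ratio(control_risks, weights, conditional_or)
print(f"conditional OR = {conditional_or}, unadjusted OR = {unadjusted_or:.2f}")
```

With these made-up inputs the unadjusted OR is about 2.08 even though the OR is exactly 3 within each stratum and there is no confounding; the shrinkage towards 1.0 comes purely from averaging over unlike patients.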

To obtain not only patient-type-specific treatment effects but also patient-specific effects requires conditioning on patient, otherwise estimates are marginalized with respect to patient. Conditioning on patient requires having random effects for patients, e.g., random intercepts. To have random effects requires having multiple post-randomization observations per patient—either, for example, a 6-period 2-treatment randomized crossover study or a longitudinal study with lots of longitudinal assessments per patient.

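A binary logistic regression model for treatment and covariate-specific probability of the outcome event may be stated as $P(Y = 1 | T, X) = \text{expit}(\beta_0 + \beta_1 [T=B] + X\gamma)$ where $\text{expit}(u) = \frac{1}{1 + \exp(-u)}$, $[T=B]$ is a 0/1 indicator for treatment (0=$A$, 1=$B$), $\beta_1$ is the log OR and its anti-log is the adjusted (for $X$) OR.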

The omission of interaction terms in the above model is a default position related to the times I’ve analyzed trial data that were large enough to assess interactions and found no evidence for them, and also the huge number of published forest plots showing remarkable consistency of ORs across patient subgroups (see also this). Subject matter considerations or secondary RCT efficacy analyses (or sensitivity analysis) would cause us to add interaction terms to the model. The material that follows is still relevant but would involve covariate-specific ORs. Were $k$ interaction parameters added to the model, the result would be $k + 1$ primary relative efficacy estimands. For the (up to) $n$ absolute risk reductions, since each one already conditions on $X$, the form of the computations wouldn’t change. ARRs would be exaggerated for certain levels of interacting factors, and it would be helpful to display the ARR distributions by levels of interacting factors.

The decision to include interactions needs to be sample-size dependent in addition to being driven by subject matter knowledge, and should also recognize the huge variance-bias trade-offs involved. Reduction in bias by inclusion of treatment interactions can easily be offset by large increases in variances, so that it would have been better to pretend that the relative treatment effect is constant. This is explored here. In the best of situations, where there is a single binary interacting factor having a prevalence of 0.5, the sample size needed to estimate an interaction effect with a specific precision/margin of error is four times the sample size needed to estimate a main effect to the same precision. In that ideal case, the precision (e.g., confidence interval width) of the treatment effect estimate for one level of the interacting factor is worse by a factor of $\sqrt{2}$ than the precision for a treatment main effect.
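The sample size and precision penalties in this ideal case follow from a short variance calculation. As a sketch, assume (for illustration only) a continuous outcome with common residual variance $\sigma^2$, total sample size $n$, 1:1 randomization, and a binary factor with prevalence 0.5 that is balanced across treatments, so each of the four treatment-by-factor cells holds $n/4$ patients. The treatment main effect is a difference of two means based on $n/2$ patients each:

$$\text{Var}(\text{main effect}) = \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} = \frac{4\sigma^2}{n}$$

The treatment effect within one level of the factor is a difference of two means based on $n/4$ patients each:

$$\text{Var}(\text{subgroup effect}) = \frac{\sigma^2}{n/4} + \frac{\sigma^2}{n/4} = \frac{8\sigma^2}{n}$$

so its standard error, and hence its confidence interval width, is larger by a factor of $\sqrt{8/4} = \sqrt{2}$. The interaction is the difference between the two independent subgroup effects:

$$\text{Var}(\text{interaction}) = \frac{8\sigma^2}{n} + \frac{8\sigma^2}{n} = \frac{16\sigma^2}{n} = 4 \times \text{Var}(\text{main effect})$$

Because variance is proportional to $1/n$, matching the main-effect precision for the interaction requires multiplying $n$ by 4. Any imbalance in the interacting factor makes these ratios worse.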

In RCTs, primary analyses should be pre-specified. Other analyses can be more adaptive. A useful pragmatic strategy when the number of covariates is manageable (e.g., 10 or fewer) is to ask this question: Will predictions of patient-type-specific treatment effects be better made with inclusion of all treatment interactions, or by ignoring them? This question can be answered by comparing Akaike’s information criterion (AIC) of models with and without the interactions and choosing the model with the smaller AIC. This is equivalent to basing the decision on whether the likelihood ratio test statistic for all interactions combined exceeds twice its degrees of freedom. [See this for related ideas.]
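The equivalence between the AIC comparison and the likelihood ratio rule can be checked directly. The sketch below assumes the two log-likelihoods come from nested models fitted by maximum likelihood to the same data; the function names and the numeric values are hypothetical and are not taken from any real trial.

```python
def aic(loglik, n_params):
    """Akaike's information criterion: smaller is better."""
    return -2.0 * loglik + 2.0 * n_params


def lr_statistic(loglik_reduced, loglik_full):
    """Likelihood ratio chi-square statistic for nested models."""
    return 2.0 * (loglik_full - loglik_reduced)


def prefer_interactions_by_aic(ll_reduced, p_reduced, ll_full, p_full):
    """True if the model with interactions has the smaller AIC."""
    return aic(ll_full, p_full) < aic(ll_reduced, p_reduced)


def prefer_interactions_by_lr(ll_reduced, p_reduced, ll_full, p_full):
    """True if the LR statistic exceeds twice its degrees of freedom."""
    df = p_full - p_reduced
    return lr_statistic(ll_reduced, ll_full) > 2.0 * df


# Hypothetical fits: 6 parameters without interactions, 11 with them
ll_reduced, p_reduced = -512.3, 6
ll_full, p_full = -508.9, 11

by_aic = prefer_interactions_by_aic(ll_reduced, p_reduced, ll_full, p_full)
by_lr = prefer_interactions_by_lr(ll_reduced, p_reduced, ll_full, p_full)
print(f"AIC rule keeps interactions: {by_aic}; LR rule keeps interactions: {by_lr}")
```

Here the LR statistic is about 6.8 on 5 degrees of freedom, which is below 10, so both rules favor the model without interactions. The two rules agree by construction, since the AIC difference is the LR statistic minus twice the difference in parameter counts.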

Even better is to use a more linear non-dichotomous process whereby interactions are not “in” or “out” of the model but are always partially “in”. Parameters for interaction terms are present but are discounted using either (1) cross-validation-like considerations to choose a penalty parameter in penalized maximum likelihood estimation, or (2) Bayesian priors that may specify, for example, that interaction effects are unlikely to be larger than main effects or that interaction effects are unlikely to be beyond a certain magnitude (e.g., the ratio of ORs is unlikely to exceed a specified upper bound or be less than a specified lower bound). An example of (1) is here.

More assumption-free ways to incorporate covariates into the analysis to gain precision in estimating the average treatment effect hide the problem of interactions and do not provide insights about effect modification/differential treatment effect.

There are those who believe that traditional statistical models should not be used in RCTs. They tend to favor the use of machine learning or nonparametric risk models that allow treatment to interact with every baseline variable, and use such models to estimate the average risk as if every patient were on treatment B, then estimate the average risk as if they were on treatment A. The difference between these two average risk estimates is a sample average treatment effect (SATE). Unless effectiveness is summarized with a difference in means, the SATE is a function of the distribution of characteristics of patients who happened to enter the trial, and it cannot be used to estimate the population average treatment effect (PATE) because probability samples are not used to select patients at random to enroll in RCTs. Here are some other comments about this disdain for traditional ANCOVA.

- I’ll take a method that has assumptions that are not testable that represent reasonable approximations, over a method that ignores things that are clearly present.
- Minimal-assumption almost-nonparametric approaches to estimation fail to account for the large variance-bias tradeoff they entail. By targeting the estimation of SATEs, the instabilities of such approaches average out to result in precise SATE estimates, but the resulting treatment effectiveness estimates that are averages over unlikes may not apply to any patient in or out of the RCT. And they can’t give you the goal PATE.
- When the aim is not to average but to estimate effectiveness for individual patients, it’s not possible to allow for all possible treatment interactions without exploding the needed sample size. Approximating a needed patient-type-specific estimand is better than not attempting to estimate it.

This famous quote from John Tukey comes to mind.

> Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
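The dependence of a risk-difference SATE on who happened to enroll can be illustrated with a small calculation. This is a sketch under stated assumptions: a constant OR of 0.7 for B versus A with no interactions, and two made-up sets of treatment-A risks standing in for a trial sample and a target population. None of these numbers come from a real study.

```python
def risk_on_b(risk_on_a, odds_ratio):
    """Risk under treatment B implied by the risk under A and a constant OR."""
    odds_a = risk_on_a / (1.0 - risk_on_a)
    odds_b = odds_a * odds_ratio
    return odds_b / (1.0 + odds_b)


def average_risk_difference(risks_on_a, odds_ratio):
    """Average of (risk on B - risk on A) over a set of patients."""
    diffs = [risk_on_b(r, odds_ratio) - r for r in risks_on_a]
    return sum(diffs) / len(diffs)


odds_ratio = 0.7  # hypothetical constant relative effect of B vs. A

# Hypothetical treatment-A risks for low-risk trial volunteers
trial_risks = [0.05, 0.10, 0.10, 0.20]
# Hypothetical treatment-A risks for a sicker target population
target_risks = [0.20, 0.30, 0.40, 0.50]

sate_trial = average_risk_difference(trial_risks, odds_ratio)
ate_target = average_risk_difference(target_risks, odds_ratio)
print(f"average risk difference, trial sample:      {sate_trial:.3f}")
print(f"average risk difference, target population: {ate_target:.3f}")
```

The same OR yields quite different average risk differences in the two groups, which is why an average over the trial sample need not carry over to another patient mix while the OR itself can.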

To select a transportable relative effect measure for binary $Y$, we are then seeking a function $g$ that, in the absence of interactions, satisfies

$$\frac{g(P(Y = 1 | T=B, X))}{g(P(Y = 1 | T=A, X))} = R$$

where $R$ is a relative effect ratio measure and is a single number. For the logistic model $g(u) = \frac{u}{1 - u}$, which is the conversion of risk to odds, and $R$ is the OR.

The cost of having a transportable treatment effect parameter that is consistent with individual patient decision making is to specify a statistical model for how the measurable part of patient heterogeneity happens, so that easily explainable outcome heterogeneity can be explained.

RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable. This is most readily seen in subgroup analyses provided by the trials themselves - so called forest plots that demonstrate remarkable constancy of relative treatment benefit. When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply. It is only likely that the absolute treatment benefit will change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and the absolute baseline risk for the subject. This is covered in detail in Biostatistics for Biomedical Research, Section 13.6. See also Stephen Senn’s excellent presentation.
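The conversion from a relative effect to a patient's absolute benefit is a few lines of arithmetic. The sketch below assumes a constant OR applies to every patient; the OR of 0.7 and the baseline risks are hypothetical, and the function names are made up for the example.

```python
def treated_risk(baseline_risk, odds_ratio):
    """Risk on the new treatment given baseline risk and a constant OR."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = baseline_odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Absolute benefit: baseline risk minus risk on the new treatment."""
    return baseline_risk - treated_risk(baseline_risk, odds_ratio)


odds_ratio = 0.7  # hypothetical relative benefit estimated from an RCT

for baseline_risk in (0.02, 0.10, 0.30, 0.50):
    arr = absolute_risk_reduction(baseline_risk, odds_ratio)
    print(f"baseline risk {baseline_risk:.2f} -> absolute risk reduction {arr:.3f}")
```

The relative effect is identical for all four hypothetical patients, yet the absolute risk reduction grows from under 0.01 for the lowest-risk patient to almost 0.09 for the highest-risk one. In practice the baseline risk would come from a risk model or other estimate appropriate to the patient in question.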

See also the remarkable constancy of ORs in this large RCT, where every opportunity was given for a covariate to interact with treatment, but the best predicting model forced all interactions to zero.

Transportability of relative efficacy estimated from an RCT to patients in the field depends on a number of factors that need to be elucidated. For example, relative efficacy from an RCT is more likely to be transportable if patients in the RCT differ from those in the field by a matter of degree, e.g., are younger or further out on a disease severity continuum. Success can also happen when patients differ in an important etiologic or structural way when there are covariates that are well correlated with these characteristics that were captured in the RCT. But if etiology or structure differ in an undescribed way, the translation of RCT estimates to the field may fail, as is also the case if there is an important covariate treatment interaction that was omitted from the RCT model and the interacting factor has a much different distribution in the field than in the RCT patients. This latter phenomenon is covered in detail here.

Now that we have dived into relative effects and what RCTs are designed to estimate, consider how the “real world” does not provide what is needed to learn about treatment effectiveness in the sense of estimating what using a new treatment instead of an old treatment is likely to accomplish. Clinical practice provides anecdotal evidence that biases clinicians. What a clinician sees in her practice is patient $i$ on treatment $A$ and patient $j$ on treatment $B$. She may remember how patient $i$ fared in comparison to patient $j$, not appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness in treatment $A$ vs. $B$. But the real therapeutic question is how does the outcome of a patient were she given treatment $A$ compare to her outcome were she given treatment $B$. The gold standard design is thus the randomized crossover design, when the treatment is short acting. Stephen Senn eloquently writes about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients beyond what is predicted by covariates.

For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below. Entries are in order from the strongest evidence, requiring the fewest assumptions, to the weakest evidence. Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of information they provide.

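Let $P_i$ denote patient $i$ and the treatments be denoted by $A$ and $B$. Thus $P_2^B$ represents patient 2 on treatment $B$. $\overline{P}_1$ represents the average outcome over a sample of patients from which patient 1 was selected.

| Design | Patients compared |
|---|---|
| 6-period crossover | $P_1^A$ vs $P_1^B$ (directly measure HTE) |
| 2-period crossover | $P_1^A$ vs $P_1^B$ |
| RCT in identical twins | $P_1^A$ vs $P_1^B$ |
| Parallel-group RCT | $\overline{P}_1^A$ vs $\overline{P}_2^B$, $P_1 = P_2$ on avg |
| Observational, good artificial control | $\overline{P}_1^A$ vs $\overline{P}_2^B$, $P_1 = P_2$ hopefully on avg |
| Observational, poor artificial control | $\overline{P}_1^A$ vs $\overline{P}_2^B$, $P_1 \neq P_2$ on avg |
| Real-world physician practice | $P_1^A$ vs $P_2^B$ |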

The best experimental designs yield the best evidence a clinician needs to answer the “what if” therapeutic question for the one patient in front of her. Covariate adjustment allows line four in the above table to be translated to patient-type-specific outcomes and not just group averages.

Regarding adherence, proponents of “real world” evidence advocate for estimating treatment effects in the context of making treatment adherence low as in clinical practice. This would result in lower efficacy and the abandonment of many treatments. It is hard to argue that a treatment should not be available for a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is by far the best hope for estimating efficacy as a function of adherence, through for example an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what should be the target adherence in an RCT, but overall it is a good thing that RCTs do not mimic clinical practice. We are entering a new era of pragmatic clinical trials. Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that the chief advantage of pragmatic trials is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.
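To make the instrumental variable idea mentioned above concrete, here is a sketch of one of the simplest such estimators, the Wald ratio, which uses randomized assignment as the instrument. It is not necessarily the analysis intended in this post. It assumes that assignment affects the outcome only through treatment actually received and that nobody does the opposite of what they were assigned. All numbers and names below are hypothetical.

```python
def wald_iv_estimate(mean_y_arm_b, mean_y_arm_a, took_b_in_arm_b, took_b_in_arm_a):
    """Wald ratio estimate of the effect of actually receiving treatment B.

    mean_y_arm_b, mean_y_arm_a: mean outcome by randomized arm (as assigned)
    took_b_in_arm_b, took_b_in_arm_a: proportion actually taking B in each arm
    """
    # Intention-to-treat effect: compares arms exactly as randomized
    itt_effect = mean_y_arm_b - mean_y_arm_a
    # How much randomization shifted the treatment actually received
    uptake_difference = took_b_in_arm_b - took_b_in_arm_a
    if uptake_difference == 0:
        raise ValueError("assignment did not change treatment received")
    return itt_effect / uptake_difference


# Hypothetical trial: mean change in SBP (mmHg) by assigned arm
mean_change_arm_b = -9.0
mean_change_arm_a = -3.0
# Hypothetical adherence: 75% of arm B took B, nobody in arm A had access to B
effect_if_adherent = wald_iv_estimate(mean_change_arm_b, mean_change_arm_a, 0.75, 0.0)
print(f"ITT effect: {mean_change_arm_b - mean_change_arm_a:.1f} mmHg")
print(f"estimated effect among adherers: {effect_if_adherent:.1f} mmHg")
```

In this made-up example the intention-to-treat effect of -6 mmHg corresponds to an estimated -8 mmHg among patients who would take the drug when assigned to it, which is the quantity relevant to a potentially adherent patient.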

An observational study has great difficulty unbiasedly estimating the average treatment effect. Using the same data to attempt to estimate efficacy under a specific degree of adherence is near impossible.

## Discussion Archive

This discussion from 2017 and 2018 came from the old blog platform.

## Comments from Readers

Giuliano Cruz: Very interesting post! It hurts even more to read sens/spec “optimization” in, say, the biomarker literature after seeing decision theoretic approaches for specification of risk thresholds. One doubt I have: following https://www.tandfonline.com/doi/abs/10.1198/000313008X370302, the formula for rational odds(Pt) (that arrived at Pt=10%) seems inverted. As I read, it should be odds(Pt) = (TN-FP)/(TP-FN). Is that correct? Thanks!

Uriah Finkel: Great post! I wonder what about Risk Percentiles? In some applications it is required to observe results for top 5% population at risk, or that there’s enough budget for intervention for limited absolute amount of people. Should it be considered as another constraint?