Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness
What clinicians learn from clinical practice, unless they routinely do n-of-one studies, is based on comparisons of unlikes. Then they criticize like-vs-like comparisons from randomized trials for not being generalizable. This is made worse by not understanding that clinical trials are designed to estimate relative efficacy, and relative efficacy is surprisingly transportable.
Many clinicians do not even track what happens to their patients to be able to inform their future patients. At the least, randomized trials track everyone.
Parallel-group RCTs enroll volunteers whose characteristics do not mimic any population. Patients are then randomly assigned a treatment, and the result is the ability to estimate between-group shifts in outcomes, not population outcome tendencies within one treatment group. Think of a linear regression: the RCT is used to estimate the slope (the relative shift), not the intercept (the absolute anchor).
Randomized clinical trials (RCTs) have various goals, including providing evidence that
- a treatment is superior to another treatment in a way that is likely to benefit patients
- a new treatment yields patient outcomes that are similar enough to an established treatment that the two may be considered interchangeable
- a diagnostic device or other technology provides information that improves patient management or outcomes
Let’s consider only the first goal. RCTs have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason. But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:
- Patients in clinical practice are different from those enrolled in RCTs
- Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.
Point 2 is hard to debate because RCTs are run under protocol, and research personnel are watching and asking about patients’ adherence (but more about this below). But point 1 is a misplaced worry in the majority of trials. The explanation requires getting to the heart of what RCTs are really intended to do:
- Provide evidence for relative treatment effectiveness over an adequate time horizon for assessing target patient outcomes
- Provide evidence for relative safety over a somewhat adequate time horizon for assessing non-target safety outcomes
Let’s go into the meaning of relative effectiveness, for two types of outcome variables. For a continuous response $Y$ such as systolic blood pressure (SBP), which for practical purposes may be considered to have an unrestricted range, the efficacy measure of interest is often the difference in two means, i.e., the mean reduction in SBP. Letting $T$ denote assigned treatment ($T=A$ for control, $T=B$ for the new treatment) and $X$ denote the vector of baseline covariates, a standard model is

$$E(Y \mid T, X) = \beta_0 + \beta_1 [T=B] + X\gamma,$$

where $[T=B]$ equals 1 for a patient assigned to the new treatment and 0 otherwise, so that $\beta_1$ is the treatment effect: the difference in mean SBP between two patients who have the same covariate values $X$ but receive different treatments.

When we condition on baseline characteristics in $X$, $\beta_1$ is a patient-type-specific treatment effect, and in the absence of treatment × covariate interactions the model takes this relative effect to be the same for every type of patient enrolled.
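As a concrete illustration, here is a minimal sketch (not from the original analysis; simulated data, hypothetical variable names, statsmodels) showing that covariate-adjusted estimation of $\beta_1$ is just ANCOVA:

```python
# Sketch: ANCOVA estimate of the covariate-adjusted treatment effect
# in a two-arm trial with a continuous outcome. All numbers are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
baseline_sbp = rng.normal(150, 15, n)          # baseline covariate X
treat = rng.integers(0, 2, n)                  # 1 if assigned treatment B
# True model: follow-up SBP tracks baseline; treatment B lowers SBP by 5
followup_sbp = 30 + 0.8 * baseline_sbp - 5 * treat + rng.normal(0, 8, n)

exog = sm.add_constant(np.column_stack([treat, baseline_sbp]))
fit = sm.OLS(followup_sbp, exog).fit()
# The coefficient on `treat` estimates beta_1: the mean SBP difference
# (B vs. A) for patients with the same baseline SBP.
print(fit.params, fit.bse)
```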
Any covariate conditioning is better than none. Estimating unadjusted treatment effects in nonlinear-model situations will, on the average, attenuate the treatment effect (e.g., move an OR towards 1.0), will get the model wrong, and will lend itself neither to understanding the distribution of absolute risk reduction (ARR) nor to any assessment of treatment interaction/differential treatment effect. Regarding “get the model wrong”, a good example is that if the treatment effect is constant over time upon covariate adjustment (i.e., the proportional hazards (PH) assumption holds), the unadjusted treatment effect will violate PH. As an example, suppose there is a large difference in survival time between males and females. Failure to condition on sex will make the analyst see a complex bimodal survival time distribution with unexplained modes, and this can lead to a violation of PH for treatment. Practical experience has found more studies with PH after covariate adjustment than without it.
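A small simulation makes the attenuation concrete. This is a sketch under assumed effect sizes; the phenomenon it displays is the non-collapsibility of the odds ratio:

```python
# Sketch: non-collapsibility of the OR. The true conditional
# (covariate-specific) log OR for treatment is 1.0; omitting a strong
# prognostic covariate attenuates the estimate toward 0 (OR toward 1),
# even though treatment is randomized and independent of the covariate.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
treat = rng.integers(0, 2, n)
x = rng.normal(0, 2, n)                         # strong prognostic covariate
p = 1 / (1 + np.exp(-(-1 + 1.0 * treat + 1.5 * x)))
y = rng.binomial(1, p)

adj = sm.Logit(y, sm.add_constant(np.column_stack([treat, x]))).fit(disp=0)
unadj = sm.Logit(y, sm.add_constant(treat)).fit(disp=0)
print("adjusted log OR:  ", adj.params[1])      # close to 1.0
print("unadjusted log OR:", unadj.params[1])    # noticeably closer to 0
```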
To obtain not only patient-type-specific but also patient-specific treatment effects requires conditioning on patient; otherwise estimates are marginalized with respect to patient. Conditioning on patient requires random effects for patients, e.g., random intercepts, and having random effects requires multiple post-randomization observations per patient: for example, a 6-period 2-treatment randomized crossover study, or a longitudinal study with many serial assessments per patient.
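A minimal sketch of conditioning on patient via random intercepts, using statsmodels’ mixed-model API on simulated longitudinal data (all names and effect sizes are hypothetical):

```python
# Sketch: random intercepts per patient in a longitudinal trial, so that
# the treatment effect is interpreted conditionally on the patient effect
# rather than marginalized over patients.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_pat, n_visits = 200, 6
pid = np.repeat(np.arange(n_pat), n_visits)
treat = np.repeat(rng.integers(0, 2, n_pat), n_visits)
u = np.repeat(rng.normal(0, 6, n_pat), n_visits)   # patient random effects
y = 100 + u - 4 * treat + rng.normal(0, 5, n_pat * n_visits)
df = pd.DataFrame({"y": y, "treat": treat, "pid": pid})

fit = smf.mixedlm("y ~ treat", df, groups=df["pid"]).fit()
print(fit.summary())   # fixed treatment effect plus patient-level variance
```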
For continuous $Y$, the difference in means has a special property: in a linear model without treatment interactions, the patient-type-specific effect $\beta_1$ and the marginal effect coincide (the difference in means is collapsible), and covariate adjustment serves mainly to explain outcome heterogeneity and increase precision.

When $Y$ is binary or is the time until an event, things are different. Absolute effect measures such as the risk difference are constrained: a treatment cannot lower the probability of an event by 0.2 in a patient whose baseline risk is 0.1, so the risk difference must vary with baseline risk even when the underlying relative effect is constant. A candidate for a transportable summary of relative efficacy must therefore live on a scale that is not so constrained, such as the log odds or log hazard.

How are models chosen for such classes of $Y$? They are chosen to have a good chance of fitting patterns in the data and to place the treatment effect on a scale on which it can plausibly be constant. For binary $Y$ the standard choice is the binary logistic regression model.

A binary logistic regression model for the treatment- and covariate-specific probability of the outcome event $Y=1$ is

$$\Pr(Y=1 \mid T, X) = \operatorname{expit}(\beta_0 + \beta_1 [T=B] + X\gamma),$$

where $\operatorname{expit}(z) = \frac{1}{1+e^{-z}}$ and $[T=B]$ indicates assignment to the new treatment.

The omission of treatment × covariate interaction terms from the model encodes the assumption that the relative treatment effect is the same for all patient types: $\beta_1$ is the log of the B:A odds ratio, a single candidate measure of relative efficacy.
The decision to include interactions needs to be sample-size dependent in addition to being driven by subject matter knowledge, and should also recognize the huge variance-bias trade-offs involved. Reduction in bias by inclusion of treatment interactions can easily be offset by large increases in variance, so that it would have been better to pretend that the relative treatment effect is constant. This is explored here. In the best of situations, where there is a single binary interacting factor having a prevalence of 0.5, the sample size needed to estimate the interaction effect with a given precision/margin of error is four times the sample size needed to estimate the overall treatment effect with that same precision.
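To see where the factor of four comes from, a back-of-the-envelope calculation (assuming equal allocation and a continuous outcome with residual variance $\sigma^2$): the overall effect compares two arms of size $n/2$, while the differential effect contrasts treatment differences in two strata of size $n/2$, each split into arms of size $n/4$:

$$\operatorname{Var}(\hat\delta) = \frac{\sigma^2}{n/2} + \frac{\sigma^2}{n/2} = \frac{4\sigma^2}{n}, \qquad
\operatorname{Var}(\hat\delta_1 - \hat\delta_2) = 2\left(\frac{\sigma^2}{n/4} + \frac{\sigma^2}{n/4}\right) = \frac{16\sigma^2}{n},$$

so matching the precision of the overall treatment effect requires four times the sample size.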
In RCTs, primary analyses should be pre-specified. Other analyses can be more adaptive. A useful pragmatic strategy when the number of covariates is manageable (e.g., 10 or fewer) is to ask this question: Will predictions of patient-type-specific treatment effects be better made by including all treatment interactions, or by ignoring them? This question can be answered by comparing Akaike’s information criterion (AIC) of models with and without the interactions and choosing the model with the smaller AIC. This is equivalent to basing the decision on whether the likelihood ratio $\chi^2$ statistic for the combined interaction terms exceeds twice the number of interaction parameters.
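To spell out the equivalence: with $\mathrm{AIC} = -2\log L + 2p$, $q$ interaction parameters, and $\mathrm{LR}$ the likelihood ratio $\chi^2$ statistic for those interactions,

$$\mathrm{AIC}_{\text{interactions}} - \mathrm{AIC}_{\text{no interactions}} = -\mathrm{LR} + 2q < 0 \iff \mathrm{LR} > 2q.$$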
Even better is to use a more continuous, non-dichotomous process whereby interactions are not “in” or “out” of the model but are always partially “in”. Parameters for interaction terms are present but are discounted, using either (1) cross-validation-like considerations to choose a penalty parameter in penalized maximum likelihood estimation, or (2) Bayesian priors specifying, for example, that interaction effects are unlikely to be larger than main effects, or are unlikely to be beyond a certain magnitude (e.g., that the ratio of ORs is unlikely to exceed some prespecified fold change such as 2).
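Here is a rough sketch of option (1): penalized maximum likelihood in which only the interaction coefficient is shrunk. The penalty weight `lam` is arbitrary here; in practice it would be chosen by cross-validation or an effective-AIC criterion, and everything in the example is illustrative:

```python
# Sketch: ridge-penalized logistic likelihood where the treatment-by-covariate
# interaction is always "in" the model but discounted; main effects and the
# treatment effect are unpenalized.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 2000
treat = rng.integers(0, 2, n)
x = rng.normal(0, 1, n)
lin = -0.5 + 0.8 * treat + 1.0 * x + 0.2 * treat * x   # weak true interaction
y = rng.binomial(1, 1 / (1 + np.exp(-lin)))

X = np.column_stack([np.ones(n), treat, x, treat * x])
penalized = np.array([0.0, 0.0, 0.0, 1.0])             # penalize only treat*x

def neg_pen_loglik(beta, lam=10.0):
    eta = X @ beta
    loglik = np.sum(y * eta) - np.sum(np.logaddexp(0.0, eta))
    return -loglik + lam * np.sum(penalized * beta**2)

fit = minimize(neg_pen_loglik, np.zeros(4), method="BFGS")
print(fit.x)   # interaction coefficient shrunk toward 0, not forced to 0
```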
More assumption-free ways to incorporate covariates into the analysis to gain precision in estimating the average treatment effect hide the problem of interactions and do not provide insights about effect modification/differential treatment effect.
There are those who believe that traditional statistical models should not be used in RCTs. They tend to favor machine learning or nonparametric risk models that allow treatment to interact with every baseline variable, using such models to estimate the average risk as if every patient were on treatment B, and again as if every patient were on treatment A. The difference in these two average risk estimates is a sample average treatment effect (SATE). Unless effectiveness is summarized with a difference in means, the SATE is a function of the distribution of characteristics of the patients who happened to enter the trial, and it cannot be used to estimate the population average treatment effect (PATE), because probability samples are not used to select patients for enrollment in RCTs. Here are some other comments about this disdain for traditional ANCOVA.
- I’ll take a method with untestable assumptions that represent reasonable approximations over a method that ignores things that are clearly present.
- Minimal-assumption, almost-nonparametric approaches to estimation fail to account for the large variance-bias trade-off they entail. By targeting SATEs, the instabilities of such approaches average out to yield precise SATE estimates, but the resulting effectiveness estimates, being averages over unlikes, may not apply to any patient in or out of the RCT. And they cannot give you the goal, the PATE.
- To avoid averaging, i.e., to estimate effectiveness for individual patients, it is not possible to allow for all possible treatment interactions without exploding the needed sample size. Approximating the needed patient-type-specific estimand is better than not attempting to estimate it.
This famous quote from John Tukey comes to mind.
Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.
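For concreteness, here is a minimal sketch of the SATE computation described above, on simulated data (any flexible risk model could stand in for the logistic fit; names and numbers are hypothetical):

```python
# Sketch of the SATE calculation: fit a risk model, predict each trial
# participant's risk under both treatments, and average the difference.
# The result is tied to the covariate mix of this trial's volunteers.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 5000
treat = rng.integers(0, 2, n)
x = rng.normal(0, 1, n)
p = 1 / (1 + np.exp(-(-1.0 + 0.7 * treat + 1.2 * x)))
y = rng.binomial(1, p)

ones = np.ones(n)
fit = sm.Logit(y, np.column_stack([ones, treat, x])).fit(disp=0)

risk_b = fit.predict(np.column_stack([ones, ones, x]))         # all on B
risk_a = fit.predict(np.column_stack([ones, np.zeros(n), x]))  # all on A
sate = risk_b.mean() - risk_a.mean()
print(sate)  # an average over this particular trial's patient mix, not a PATE
```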
To select a transportable relative effect measure for binary $Y$, write the logistic model on the logit scale:

$$\operatorname{logit} \Pr(Y=1 \mid T, X) = \beta_0 + \beta_1 [T=B] + X\gamma,$$

where $\operatorname{logit}(p) = \log\frac{p}{1-p}$.

So that the RCT provides a single measure that is likely to transport outside of the trial, where there will certainly be a different patient mix on $X$, the treatment effect is summarized by the covariate-adjusted B:A odds ratio $e^{\beta_1}$, a quantity the model takes to be constant across patient types.
The cost of having a transportable treatment-effect parameter that is consistent with individual patient decision making is the need to specify a statistical model for how the measurable part of patient heterogeneity operates, so that easily explainable outcome heterogeneity is actually explained.
An RCT with reasonably wide patient inclusion criteria can also provide good estimates of the absolute risk of the trial’s outcome $Y$ as a function of treatment and the baseline covariates $X$, through the covariate-adjusted model itself.

If there are no treatment × covariate interactions, patient-specific absolute treatment benefit follows directly from the model: the risk under treatment $A$ is $\operatorname{expit}(\beta_0 + X\gamma)$ and the risk under treatment $B$ is $\operatorname{expit}(\beta_0 + \beta_1 + X\gamma)$, so the absolute risk reduction necessarily varies with baseline risk even though the odds ratio $e^{\beta_1}$ is constant.
RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds- or hazard-ratio scales that are highly transportable. This is most readily seen in the subgroup analyses provided by the trials themselves, the so-called forest plots, which often demonstrate remarkable constancy of relative treatment benefit. When an effect ratio is applied to a population with a much different risk profile, the relative effect can still fully apply; it is only the absolute treatment benefit that is likely to change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and that patient’s absolute baseline risk. This is covered in detail in Biostatistics for Biomedical Research, Section 13.6. See also Stephen Senn’s excellent presentation.
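A small illustration of that last point, with hypothetical numbers not taken from any trial: given a patient’s baseline risk and a transportable odds ratio, the absolute risk reduction follows from a few lines of arithmetic:

```python
# Sketch: converting a (transportable) odds ratio into a patient-specific
# absolute risk reduction, given that patient's baseline risk.
def arr_from_or(baseline_risk: float, odds_ratio: float) -> float:
    """Absolute risk reduction implied by an odds ratio at a baseline risk."""
    odds_a = baseline_risk / (1 - baseline_risk)
    odds_b = odds_ratio * odds_a
    risk_b = odds_b / (1 + odds_b)
    return baseline_risk - risk_b

# The same OR of 0.6 yields very different absolute benefits:
for p in (0.05, 0.20, 0.50):
    print(p, round(arr_from_or(p, 0.6), 3))
# 0.05 -> 0.019, 0.20 -> 0.070, 0.50 -> 0.125
```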
Transportability of relative efficacy estimated from an RCT to patients in the field depends on a number of factors that need to be elucidated. For example, relative efficacy from an RCT is more likely to be transportable if patients in the RCT differ from those in the field by a matter of degree, e.g., are younger or are further out on a disease severity continuum. Success can also happen when patients differ in an important etiologic or structural way, provided covariates well correlated with those characteristics were captured in the RCT. But if etiology or structure differ in an undescribed way, the translation of RCT estimates to the field may fail, as is also the case if there is an important covariate that interacts with treatment but was never measured in the RCT.
Now that we have dived into relative effects and what RCTs are designed to estimate, consider how the “real world” does not provide what is needed to learn about treatment effectiveness, in the sense of estimating what using a new treatment instead of an old one is likely to accomplish. Clinical practice provides anecdotal evidence that biases clinicians. What a clinician sees in her practice is patient $i$ on treatment $A$ and patient $j$ on treatment $B$. She may remember how each patient fared, but patients $i$ and $j$ differ in many ways besides treatment, so the comparison is a comparison of unlikes.
For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches in the hierarchy of evidence below. Entries are ordered from the strongest evidence, requiring the fewest assumptions, to the weakest. Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of the information they provide.
Let $P_1, P_2, P_3, \ldots$ denote distinct patients and let $A$ and $B$ denote the treatments being compared.

Design | Patients Compared |
---|---|
6-period crossover | $P_1$ on $A$ vs. $P_1$ on $B$, in multiple periods |
2-period crossover | $P_1$ on $A$ vs. $P_1$ on $B$ |
RCT in identical twins | $P_1$ on $A$ vs. $P_1$’s identical twin on $B$ |
Parallel-group RCT | $P_1$ on $A$ vs. $P_2$ on $B$, with $P_1$ and $P_2$ tending to be alike because of randomization |
Observational, good artificial control | $P_1$ on $A$ vs. $P_3$ on $B$, where $P_3$ was selected to resemble $P_1$ on measured covariates |
Observational, poor artificial control | $P_1$ on $A$ vs. $P_4$ on $B$, where $P_4$ is known to differ from $P_1$ in important ways |
Real-world physician practice | $P_1$ on $A$ vs. $P_5$ on $B$, where $P_5$ differs from $P_1$ in ways that were never recorded |
The best experimental designs yield the best evidence a clinician needs to answer the “what if” therapeutic question for the one patient in front of her. Covariate adjustment allows the fourth row of the above table (the parallel-group RCT) to be translated to patient-type-specific outcomes and not just group averages.
Regarding adherence, proponents of “real world” evidence advocate estimating treatment effects in a context where treatment adherence is as low as it is in clinical practice. This would result in lower estimated efficacy and the abandonment of many treatments. It is hard to argue that a treatment should be unavailable to a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is by far the best hope for estimating efficacy as a function of adherence, for example through an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what the target adherence in an RCT should be, but overall it is a good thing that RCTs do not mimic clinical practice. We are entering a new era of pragmatic clinical trials. Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that their chief advantage is not that they provide results more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.
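To make the instrumental-variable idea concrete, here is a minimal two-stage least squares sketch on simulated data. The effect sizes, the confounding mechanism, and all variable names are assumptions of the sketch, and the naive second-stage standard errors would need correction in real use:

```python
# Sketch: randomization as an instrument for treatment actually taken.
# An unmeasured confounder makes the "as-treated" comparison biased;
# 2SLS using the randomized assignment recovers the adherence effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 20_000
z = rng.integers(0, 2, n)                  # randomized assignment
u = rng.normal(0, 1, n)                    # unmeasured confounder
# Sicker patients (high u) adhere less; assignment strongly drives adherence
took_drug = ((0.8 * z - 0.5 * u + rng.normal(0, 1, n)) > 0).astype(float)
y = 2.0 * took_drug + 1.5 * u + rng.normal(0, 1, n)   # true effect = 2.0

naive = sm.OLS(y, sm.add_constant(took_drug)).fit()
stage1 = sm.OLS(took_drug, sm.add_constant(z)).fit()
stage2 = sm.OLS(y, sm.add_constant(stage1.fittedvalues)).fit()
print("as-treated (biased):   ", naive.params[1])
print("2SLS via randomization:", stage2.params[1])    # close to 2.0
```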
Further Reading
- Why representativeness should be avoided by Rothman, Gallacher, and Hatch
- The magic of randomization versus the myth of real-world evidence by Collins et al.
- Treatment effects may remain the same even when trial participants differed from the target population by MJ Bradburn et al.
- Interpreting randomized controlled trials by P Msaouel, J Lee, P Thall