Friday, January 27, 2017

Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness

Randomized clinical trials (RCTs) have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason.  But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:
  1. Patients in clinical practice are different from those enrolled in RCTs
  2. Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.
Point 2 is hard to debate because RCTs are run under protocol and research personnel are watching and asking about patients' adherence (but more about this below).  But point 1 is a misplaced worry in the majority of trials.  The explanation requires getting to the heart of what RCTs are really intended to do: provide evidence for relative treatment effectiveness.  There are some trials that provide evidence for both relative and absolute effectiveness.  This is especially true when the efficacy measure employed is absolute, as in measuring blood pressure reduction due to a new treatment.  But many trials use binary or time-to-event endpoints, and the resulting efficacy measure is on a relative scale such as the odds ratio or hazard ratio.

RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable.  This is most readily seen in subgroup analyses provided by the trials themselves, the so-called forest plots, which demonstrate remarkable constancy of relative treatment benefit.  When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply.  It is only likely that the absolute treatment benefit will change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and that patient's absolute baseline risk.  This is covered in detail in Biostatistics for Biomedical Research, Section 13.6. See also Stephen Senn's excellent presentation.
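To make the last point concrete, here is a small sketch (not from the post; function names and example numbers are my own illustrative assumptions) of how a transportable relative effect plus a patient-specific baseline risk yields a patient-specific absolute benefit:

```python
# Sketch: turning a relative treatment effect into an absolute risk reduction
# for a specific patient, given that patient's baseline risk.

def risk_under_treatment_or(baseline_risk, odds_ratio):
    """Apply an odds ratio to a baseline risk; return risk under treatment."""
    odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds_ratio * odds
    return treated_odds / (1 + treated_odds)

def risk_under_treatment_hr(baseline_risk, hazard_ratio):
    """Apply a hazard ratio to a baseline cumulative risk by a fixed time,
    assuming proportional hazards: S1(t) = S0(t) ** HR."""
    return 1 - (1 - baseline_risk) ** hazard_ratio

# A low-risk and a high-risk patient with the SAME relative benefit (OR = 0.5)
# get very different absolute benefits:
for p0 in (0.05, 0.40):
    p1 = risk_under_treatment_or(p0, 0.5)
    print(f"baseline risk {p0:.2f} -> treated risk {p1:.3f}, "
          f"absolute risk reduction {p0 - p1:.3f}")
```

The constant odds ratio of 0.5 translates into an absolute risk reduction of about 0.02 for the low-risk patient but 0.15 for the high-risk patient, which is exactly why the absolute benefit, not the relative one, is what changes across populations.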

Clinical practice provides anecdotal evidence that biases clinicians.  What a clinician sees in her practice is patient i on treatment A and patient j on treatment B.  She may remember how patient i fared in comparison to patient j, not appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness of treatment A vs. B.  But the real therapeutic question is how the outcome of a patient, were she given treatment A, compares to her outcome were she given treatment B.  The gold standard design is thus the randomized crossover design, when the treatment is short-acting.  Stephen Senn eloquently writes about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients.

For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below. Entries are in the order of strongest evidence requiring the least assumptions to the weakest evidence. Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of information they provide.

Let Pi denote patient i and let the treatments be denoted by A and B. Thus P2B represents patient 2 on treatment B, and P̄1 represents the average outcome over a sample of patients from which patient 1 was selected.  HTE is heterogeneity of treatment effect.

Design                                  Patients Compared
6-period crossover                      P1A vs P1B (directly measures HTE)
2-period crossover                      P1A vs P1B
RCT in identical twins                  P1A vs P1B
Parallel-group RCT                      P1A vs P2B, P̄1 = P̄2 on average
Observational, good artificial control  P1A vs P2B, P̄1 = P̄2 hopefully on average
Observational, poor artificial control  P1A vs P2B, P̄1 ≠ P̄2 on average
Real-world physician practice           P1A vs P2B

The best experimental designs yield the best evidence a clinician needs to answer the "what if" therapeutic question for the one patient in front of her.
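The top row of the hierarchy can be sketched in a small simulation (my own illustrative numbers, not from the post): in a 6-period crossover, each patient contributes three outcomes on A and three on B, so each patient gets her own treatment-effect estimate, and the spread of those estimates across patients reflects HTE directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients = 200

# Assumed patient-specific true effects of A vs B: this spread IS the HTE
true_effect = rng.normal(1.0, 0.5, n_patients)

est = np.empty(n_patients)
for i in range(n_patients):
    y_a = true_effect[i] + rng.normal(0.0, 1.0, 3)  # 3 periods on treatment A
    y_b = rng.normal(0.0, 1.0, 3)                   # 3 periods on treatment B
    est[i] = y_a.mean() - y_b.mean()                # this patient's own estimate

print(f"mean effect {est.mean():.2f}, sd across patients {est.std():.2f}")
```

A parallel-group trial can only deliver the mean of `est`; the per-patient estimates, and hence a direct look at HTE, require the repeated within-patient comparisons that the crossover design provides. (The sd across patients here mixes true HTE with within-patient measurement noise, which replicated periods help separate.)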

Regarding adherence, proponents of "real world" evidence advocate for estimating treatment effects in settings where treatment adherence is as low as it is in clinical practice. This would result in lower apparent efficacy and the abandonment of many treatments. It is hard to argue that a treatment should not be available for a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is the best hope for estimating efficacy as a function of adherence, through, for example, an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what should be the target adherence in an RCT, but overall it is a good thing that RCTs do not mimic clinical practice.  We are entering a new era of pragmatic clinical trials.  Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that the chief advantage of pragmatic trials is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.
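The instrumental variable idea can be sketched in a few lines (a simulation with assumed numbers, not an analysis from the post): because randomization is independent of everything that determines adherence, the intention-to-treat contrast divided by the adherence contrast (the Wald estimator) recovers the effect in adherers.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

z = rng.integers(0, 2, n)            # randomized assignment: the instrument
comply = rng.random(n) < 0.6         # assume 60% of assigned patients take the drug
d = z * comply                       # treatment actually received (none in control arm)
y = 2.0 * d + rng.normal(0.0, 1.0, n)  # outcome; assumed true effect in takers = 2.0

itt = y[z == 1].mean() - y[z == 0].mean()            # ITT effect, diluted to ~0.6 * 2.0
adherence_gap = d[z == 1].mean() - d[z == 0].mean()  # ~0.6
wald = itt / adherence_gap                           # IV estimate of effect in adherers
print(f"ITT {itt:.2f}, IV (Wald) {wald:.2f}")
```

The ITT estimate lands near 1.2 because non-adherence dilutes the true effect of 2.0, while the IV estimate recovers roughly 2.0; neither is obtainable from uncontrolled practice data, where adherence is confounded with prognosis.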

Updated 2017-06-25 (last paragraph regarding adherence)


  1. Thanks for this thought-provoking material. As an early-career statistician, this is a great format for learning. In relation to the last paragraph, is it your view that that intention to treat effects are not important? While I do think the question of efficacy as a function of adherence is important, I'm not sure that we should dismiss the question of whether or not a treatment works in practice, as it is currently quite fashionable to do. Non-adherence could be due to adverse side effects of the treatment, for example. The analogy I think of is a world class football player who is great on the pitch but is injured most of the time. Her efficacy is great but her practical effectiveness is low.
    Moreover, if we recommend treatments on the basis of average treatment effects, I wonder if it is really less reasonable to incorporate typical patterns of adherence into our decision making. Would be great to hear your thoughts.
    Thanks again.

    1. What great questions Jack! I'm not sure I can do them justice but here is a start. Intent-to-treat is extremely important and should not be de-emphasized. Randomized trials are probably the only reliable platform for estimation of efficacy as a function of adherence, using for example the only perfect instrument we have (randomization assignment) in an instrumental variable analysis. If one can estimate this relationship reliably, then a great way to use an RCT is to show the average treatment effect as a function of adherence, before getting into interactions with treatment (differential treatment effect; heterogeneity of treatment effect). If you bypass an RCT and try to see if a drug "works in practice" the result will generally be uninterpretable, and the adherence achieved in practice may be below what can be achieved once word gets out that the treatment has been objectively demonstrated in an RCT to benefit patients.

      Take a look at the studies where the investigators showed that the patients who adhered to placebo had low cardiovascular mortality. See for example .

      I look forward to continuing this discussion.

    2. Thanks for the reply Frank. I'm sure a lot of people are reading these discussions and benefitting a great deal from them (I certainly am). Thanks for clarifying re: ITT - I hear a lot of (quite senior) folk saying that it is irrelevant. I thought you were saying the same thing with your post - but see now that this was a misreading.

      Thanks again!