Wednesday, October 4, 2017

Bayesian vs. Frequentist Statements About Treatment Efficacy

The following examples are intended to show the advantages of Bayesian reporting of treatment efficacy analysis, as well as to provide examples contrasting with frequentist reporting. As detailed here, there are many problems with p-values, and some of those problems will be apparent in the examples below. Many of the advantages of Bayes are summarized here. As seen below, Bayesian posterior probabilities prevent one from concluding equivalence of two treatments on an outcome when the data do not support that (i.e., the "absence of evidence is not evidence of absence" error).

Suppose that a parallel-group randomized clinical trial is conducted to gather evidence about the efficacy of a new treatment B relative to a control treatment A. Suppose there are two efficacy endpoints: systolic blood pressure (SBP) and time until cardiovascular/cerebrovascular event. Treatment effect on the first endpoint is assumed to be summarized by the B-A difference in true mean SBP. The second endpoint is assumed to be summarized as a true B:A hazard ratio (HR). For the Bayesian analysis, assume that pre-specified skeptical prior distributions were chosen as follows. For the unknown difference in mean SBP, the prior was normal with mean 0 and SD chosen so that the probability that the absolute difference in SBP between A and B exceeds 10mmHg was only 0.05. For the HR, the log HR was assumed to have a normal distribution with mean 0 and SD chosen so that the prior probability that the HR > 2 or HR < 1/2 was 0.05. Both priors specify that treatment B is as likely to be effective as it is to be detrimental. The two prior distributions will be referred to as p1 and p2.
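The SDs of the two skeptical priors follow directly from the stated tail probabilities. The following is a minimal Python sketch (standard library only); the variable names are illustrative and not part of the original analysis.

```python
from math import log
from statistics import NormalDist

# Two-sided tail area of 0.05 corresponds to the 0.975 normal quantile (about 1.96).
z = NormalDist().inv_cdf(0.975)

# p1: P(|difference in mean SBP| > 10 mmHg) = 0.05
sd_p1 = 10 / z

# p2: P(HR > 2 or HR < 1/2) = 0.05, i.e. P(|log HR| > log 2) = 0.05
sd_p2 = log(2) / z

print(round(sd_p1, 2), round(sd_p2, 3))

# Check: the tail areas implied by these SDs
tail_p1 = 2 * (1 - NormalDist(0, sd_p1).cdf(10))
tail_p2 = 2 * (1 - NormalDist(0, sd_p2).cdf(log(2)))
```

Under these assumptions the prior SD is about 5.1 mmHg for the SBP difference and about 0.35 for the log HR.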

Example 1: So-called "Negative" Trial (Considering only SBP)

  • Frequentist Statement
    • Incorrect Statement: Treatment B did not improve SBP when compared to A (p=0.4)
    • Confusing Statement: Treatment B was not significantly different from treatment A (p=0.4)
    • Accurate Statement: We were unable to find evidence against the hypothesis that A=B (p=0.4). More data will be needed. As the statistical analysis plan specified a frequentist approach, the study did not provide evidence of similarity of A and B (but see the confidence interval below).
    • Supplemental Information: The observed B-A difference in means was 4mmHg with a 0.95 confidence interval of [-5, 13]. If this study could be indefinitely replicated and the same approach used to compute the confidence interval each time, 0.95 of such varying confidence intervals would contain the unknown true difference in means.
  • Bayesian Statement
    • Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67. Alternative statement: SBP is probably (0.67) reduced with treatment B. The probability that B is inferior to A is 0.33. Assuming a minimally clinically important difference in SBP of 3mmHg, the probability that the mean for A is within 3mmHg of the mean for B is 0.53, so the study is uninformative about the question of similarity of A and B.
    • Supplemental Information: The posterior mean difference in SBP was 3.3mmHg and the 0.95 credible interval is [-4.5, 10.5]. The probability is 0.95 that the true treatment effect is in the interval [-4.5, 10.5]. [could include the posterior density function here, with a shaded right tail with area 0.67.]
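To show how quantities of this kind could be computed, here is a rough Python sketch that combines a normal prior with a normal approximation to the likelihood (a conjugate normal update). It takes the observed difference and confidence interval from the frequentist supplemental information and the prior p1 as inputs. The normal approximation, the function name `normal_posterior`, and the variable names are assumptions of this sketch. The numbers quoted in the examples appear to be illustrative, so the output is not expected to match them, and the sketch reports probabilities for the B-A difference without deciding which direction favors B.

```python
from statistics import NormalDist

def normal_posterior(estimate, se, prior_mean, prior_sd):
    """Conjugate normal update: precision-weighted average of prior and data."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    post_var = 1 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return NormalDist(post_mean, post_var ** 0.5)

z = NormalDist().inv_cdf(0.975)
prior_sd = 10 / z              # skeptical prior p1: P(|difference| > 10) = 0.05
se = (13 - (-5)) / (2 * z)     # standard error recovered from the 0.95 CI [-5, 13]
post = normal_posterior(4.0, se, 0.0, prior_sd)

p_above_zero = 1 - post.cdf(0)               # P(B-A difference > 0 | data)
p_similar = post.cdf(3) - post.cdf(-3)       # P(|difference| < 3 mmHg | data)
cred = (post.inv_cdf(0.025), post.inv_cdf(0.975))  # 0.95 credible interval

print(f"posterior mean {post.mean:.2f}, SD {post.stdev:.2f}")
print(f"P(diff > 0) = {p_above_zero:.2f}, P(|diff| < 3) = {p_similar:.2f}")
print(f"0.95 credible interval: [{cred[0]:.1f}, {cred[1]:.1f}]")
```

The skeptical prior pulls the posterior mean toward zero and makes the credible interval narrower than the confidence interval, which is the qualitative pattern in the example.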

Example 2: So-called "Positive" Trial

  • Frequentist Statement
    • Incorrect Statement: The probability that there is no difference in mean SBP between A and B is 0.02
    • Confusing Statement: There was a statistically significant difference between A and B (p=0.02).
    • Correct Statement: There is evidence against the null hypothesis of no difference in mean SBP (p=0.02), and the observed difference favors B. Had the experiment been exactly replicated indefinitely, 0.02 of such repetitions would result in more impressive results if A=B.
    • Supplemental Information: Similar to above.
    • Second Outcome Variable, If the p-value is Small: Separate statement, of same form as for SBP.
  • Bayesian Statement
    • Assuming prior p1, the probability that B lowers SBP when compared to A is 0.985. Alternative statement: SBP is probably (0.985) reduced with treatment B. The probability that B is inferior to A is 0.015.
    • Supplemental Information: Similar to above, plus evidence about clinically meaningful effects, e.g.: The probability that B lowers SBP by more than 3mmHg is 0.81.
    • Second Outcome Variable: Bayesian approach allows one to make a separate statement about the clinical event HR and to state evidence about the joint effect of treatment on SBP and HR. Examples: Assuming prior p2, HR is probably (0.79) lower with treatment B. Assuming priors p1 and p2, the probability that treatment B both decreased SBP and decreased event hazard was 0.77. The probability that B improved either of the two endpoints was 0.991.
One would also report basic results. For SBP, frequentist results might be chosen as the mean difference and its standard error. Basic Bayesian results could be said to be the entire posterior distribution of the SBP mean difference.

Note that if multiple looks were made as the trial progressed, the frequentist estimates (including the observed mean difference) would have to undergo complex adjustments. Bayesian results require no modification whatsoever, but just involve reporting the latest available cumulative evidence.


  1. If I understand the last sentence correctly, then I do not agree with it.

    I remember several years ago hearing something similar (but I think I misunderstood at that point, did not hear enough) and tried a test. I simulated a set of 100 values from a binomial (actually Bernoulli) with proportion 0.5 and starting with the 10th observation computed a posterior distribution given a uniform prior and binomial likelihood. From the posterior I calculated the probability that the true proportion was less than 0.5 and had the simulation stop and report the posterior if that probability was less than 0.05. Then I ran this whole process a bunch of times (at least 1,000 but I don't remember exactly). When I looked at all 100 draws from each simulation then the proportion of posterior probabilities less than 0.05 was about 5%, but if I let the simulation stop early, then the proportion was about 14%. Researcher degrees of freedom and the garden of forking paths can affect Bayesian analysis as well.

    Later I saw a presentation by Scott Berry (and later read his book on Bayesian adaptive clinical trials) where he recommended using simulations to choose an appropriate prior and probability cut-off on the posterior to give desired properties of the trial.

    The best option for multiple looks at the data with possible early stopping is to use the simulations to choose the prior and stopping rule. At least the Bayesian analysis should honestly report the number of actual and potential looks at the data.
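The simulation described in comment 1 can be sketched roughly as follows in Python. This is a reconstruction from the description only, not the commenter's code: the seed, the number of replicates (1,000), and the function names are assumptions. The posterior under a uniform prior is Beta(s+1, n-s+1), and its CDF at 0.5 is computed from the equivalent binomial tail sum so that only the standard library is needed.

```python
import random
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def prob_below_half(successes, n):
    """P(theta < 0.5 | data) for a uniform prior, i.e. a Beta(s+1, n-s+1) posterior.

    For integer parameters the Beta CDF equals a binomial tail:
    I_x(a, b) = P(Binomial(a+b-1, x) >= a), evaluated here at x = 0.5.
    """
    m = n + 1
    return sum(comb(m, k) for k in range(successes + 1, m + 1)) / 2**m

def one_trial(rng, n_max=100, first_look=10, cutoff=0.05):
    """Return (crossed the cutoff at any look, crossed it at the final look)."""
    successes, crossed_any = 0, False
    for n in range(1, n_max + 1):
        successes += rng.random() < 0.5          # true proportion is 0.5
        if n >= first_look and prob_below_half(successes, n) < cutoff:
            crossed_any = True                   # an early-stopping rule would fire here
    crossed_final = prob_below_half(successes, n_max) < cutoff
    return crossed_any, crossed_final

rng = random.Random(1)                           # arbitrary seed
results = [one_trial(rng) for _ in range(1000)]
rate_any = sum(a for a, _ in results) / len(results)
rate_final = sum(f for _, f in results) / len(results)
print(f"crossed at some look: {rate_any:.3f}; crossed at look 100: {rate_final:.3f}")
```

Because the final look is one of the looks, the first rate can never be smaller than the second. What this sketch measures is how often the threshold is crossed when the true proportion is fixed at 0.5, which is the quantity the reply below distinguishes from calibration of the posterior probability.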

  2. Thanks for your comments. Scott Berry's statement was in the context of computing frequentist type I error, which I do not care about. I think that the simulation you described did not calculate the needed probability. You don't want the proportion of time the posterior probabilities crossed a threshold. That would be related to Bayesian power. What we want is to determine whether the posterior probability at the moment of stopping is well calibrated. I ran one simulation of 10,000 clinical trials with 400 looks at the data (one look after each new subject is added) with a rule to stop when the posterior prob. exceeds 0.95. The average posterior probability at the moment of stopping was 0.96 which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive.

    There are no multiplicity problems with Bayes. Multiplicity comes from the chances you give data to be more extreme (relevant in the frequentist world) not the chances you give assertions to be true.
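The calibration check described in this reply can be sketched as follows. This is not the author's simulation (which was in R): it assumes a one-sample normal model with known SD 1, a normal prior with mean 0 and a hypothetical SD of 0.5, true effects drawn from that same prior, and 2,000 rather than 10,000 simulated trials to keep the run short. All names are illustrative.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def run_trial(rng, prior_sd=0.5, max_n=400, threshold=0.95):
    """Simulate one trial with a look after every subject.

    Returns (true effect, posterior P(effect > 0) at the last look taken,
    whether the threshold was crossed).
    """
    mu = rng.gauss(0.0, prior_sd)            # truth drawn from the analysis prior
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(mu, 1.0)          # one new subject, known SD 1
        post_var = 1.0 / (1.0 / prior_sd**2 + n)
        post_mean = post_var * total         # prior mean is 0
        p_pos = STD_NORMAL.cdf(post_mean / post_var**0.5)
        if p_pos > threshold:
            return mu, p_pos, True
    return mu, p_pos, False

rng = random.Random(2017)                    # arbitrary seed
trials = [run_trial(rng) for _ in range(2000)]
stopped = [(mu, p) for mu, p, hit in trials if hit]

mean_post = sum(p for _, p in stopped) / len(stopped)
frac_positive = sum(mu > 0 for mu, _ in stopped) / len(stopped)
print(f"trials that crossed the threshold: {len(stopped)}")
print(f"mean posterior probability at stopping: {mean_post:.3f}")
print(f"fraction of those trials with true effect > 0: {frac_positive:.3f}")
```

If the posterior is calibrated at the moment of stopping, the last two printed numbers should agree up to simulation error, provided the truth is drawn from the same prior used in the analysis.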

    1. Could you post your simulation code?

    2. I'll do that along with graphical output in a future blog, probably in a couple of weeks. It's very simple - one-sample problem, in R.

    3. "The average posterior probability at the moment of stopping was 0.96 which exactly equalled the proportion of the 10,000 simulations in which the true efficacy was positive."

      I've seen people use this logic in the past. Vary whether some statement A is true; simulate optional stopping in Bayes; the proportion of replicates for which the stopping rule was met matches the probability that A is true.

      However, that doesn't sit right with me. If the true state of the universe is A, then A is just simply... true. That sort of assumes that the universe randomly generates whether A is true or not, with some fixed probability. 80% of the time, A is true; 20% of the time, it's not. Is that not a weird assumption to you?

      I love Bayesian stats, don't get me wrong, but I don't buy that Bayesian models are unaffected by stopping rules. Seems to only be true if you assume truth is drawn from an urn, and I can't back that. In simulations where there is a Truth, period, and you're simulating from that true state, optional stopping does affect expected rates of decisions. Bayesian models can account for that (by modifying the likelihood to include how optional stopping can modify the observable data space for some stopping rule and N; by modifying the priors), but it seems like those who say "Bayes is unaffected by stopping" either mean "Bayesian interpretation is unaffected" [true] or "decision rates match the proportions for which those decisions are true" [questionable, assumes truth is drawn by universe à la an urn problem].

      Would love to hear your thoughts though.

    4. Your opinion is completely consistent with someone who wishes the unknown parameter to be able to take on exactly one value, i.e., a frequentist. If you really want to believe that then you should not spend any time on this Bayesian stuff. On the other hand, Bayesians describe unknowns with distributions. You don't have to agree with that. But it tends to solve a lot of problems and allow us to represent uncertainty in a reasonable way.

      Your last paragraph doesn't follow for the Bayesian approach. As shown in papers by Berry and others, the likelihood principle used by the Bayes machine has no place for the stopping rule and must ignore it. It would be improper to incorporate modifications of the data space into the Bayesian model. Bayes has nothing to do with the sample space. Bayes is unaffected by the stopping rule (except for the Bayesian power being affected, i.e., the probability that you'll achieve a certain high level of posterior probability of efficacy), and the Bayesian interpretation is unaffected by the stopping rule. Further, if you don't accept that the true parameter values being simulated from should follow a whole distribution, then choose a prior that puts mass at only one point or at a few points. Infinitely many looks at the data will still yield valid posterior probabilities at any moment.

      To put this another way, whatever representation you want to make for the unknown parameter, when used as a prior, will result in a perfectly calibrated posterior probability at any moment, assuming the posterior used the same prior that you simulated from.

    5. Thank you for replying;

      I disagree that thinking a parameter takes on one value makes one a frequentist. When I generate data via a DGP that I create, there is A fixed value. The fixed parameter is indeed "the" parameter responsible for the data generating procedure. The parameter does not change, even if the analysis does not depend on a specification of a fixed parameter (which frequentist analyses do).

      That is not to say parameters are unable to change across time or subpopulations, but in the context of a simulation when I know the DGP responsible for generating the observations, there is indeed a fixed parameter, not a set of parameters randomly drawn from an urn at any given time.

      And to say Bayesian methods do not have anything to do with the sample space is the likelihood principle that I think is too often thrown around. There are stopping rules that can restrict observable sets of data, in which case the likelihood should account for that (to the degree that the likelihood is a description of probabilistic generation of observations). Alternatively, a stopping rule can alter the prior probability (or parameter process) of a parameter, and this is apparent when you include a stopping rule into bayes theorem itself; at some point, the prior can become conditional on the stopping rule, and a proper treatment requires the inclusion of a prior density that combats how the stopping rule changes the prior probability of a parameter. This isn't a violation of the likelihood principle, but rather proper modeling of a DGP under certain stopping rules.

      However, I agree that the *interpretation* of a posterior does not change, but that isn't really as important to me. This is a similar issue to BFs --- The interpretation of a BF is independent of a stopping rule, but the probability of a decision made from a BF can certainly change as a result of stopping rules. And the latter is more important to me --- If the stopping rule can alter the probability of a decision, then stopping rules are nevertheless important to me as a Bayesian. It's just due to collecting data until a sequence of events matches one prior predictive distribution better than another, then stopping --- This still results in altered decision rates. Same with continuous posteriors --- If you observe until a set of events satisfies a condition, the probability of meeting that condition is different than if you do not have such a condition. The consequences are affected by the stopping rule, even if the interpretation is not.

    6. Thanks for continuing the discussion. The practical way of talking about the single value of an unknown parameter is that it requires you, when doing a simulation for example, to know that one value. Bayesians operate on the opinion that this is presumptuous. So you can think of having a prior distribution as having a nice way to admit what we don't know and don't have access to.

      I respectfully disagree with all of your third paragraph. The capturing of prior evidence about a parameter in fact *must not* be influenced by the stopping rule, and the Bayesian approach has no place (nor method) to put the sample space in any of the calculations. The beauty of Bayes is that when you calculate probabilities that are forward in time and information flow, these probabilities are sufficient and simply interpreted, unmolested by multiplicity, etc. Multiplicity problems result from the use of probabilities like type I error that are backwards in time and information flow, so you have to envision "what might have been."

    7. Hi Frank:
      Interesting post. I do simulations with Bayes both ways: 1) assume one value; or 2) draw from prior and then generate data.

      That said, could you post some code relating to your previous comments on peeking many times?

    8. If you assume one value, you are choosing a prior with all its mass at one point. Fully works, but a strange state of knowledge about the truth. I'll be posting a new blog article with simple simulations showing that you can compute posterior probabilities infinitely often with no downside. It will be around 2017-10-16.

    9. I just published a new blog article about sequential testing, with the simulation.

  3. A useful post, thank you. Can you suggest some readings/references for someone new to Bayesian methodology that is specific to the clinical trials context?

    1. and see papers listed at

  4. 'Assuming prior distribution p1 for the mean difference of SBP, the probability that SBP with treatment B is lower than treatment A is 0.67.'
    Clinical researchers will appreciate this as they now think that their study proves that '2 out of 3 patients will benefit from treatment B' (despite the insignificant result). Correct me if I am wrong, but I guess the statement should deal with mean SBP.

    1. Yes I should have qualified that by the mean.

  5. In a Frequentist paradigm, and using the idea that lowering SBP by more than 3mmHg is meaningful, the appropriate test to discuss is an equivalence test (such as TOST). Using TOST *and* Bayes, you can easily perform power analyses to design an informative study, control the Type 1 error rate (which you don't care about, but if I were a patient receiving a drug you worked on, I would care about), and then you can still add the posterior probability. I don't know why you wouldn't at least raise the bar in your Frequentist criticism a little bit. The "p > 0.05, so no effect" fallacy is easy to criticize, but also trivial. I'd much rather read your criticism of equivalence testing, and learn something less trivial.

    Your report of the posterior probabilities is also not very attractive. You can easily calculate your posterior, but you can't compute mine. And for me, my posterior is what matters - I don't care about what you believe. So a better report would contain a sensitivity analysis, plotting posteriors across a range of priors. Do you agree, or not? That obviously changes the conclusion a bit.

    Overall, it is my strong conviction that you lose more by ignoring Frequentist stats than you gain. As long as you make correct inferences (from a Neyman-Pearson perspective) you complement your research, especially when designing studies, at almost no cost (because you will end with your posterior anyway). Bayesian stats has limitations, and Frequentist stats has limitations, but there is nothing preventing you from embracing the relative strengths of both approaches. Saying 'I don't care about error rates' is your right, but you should expect a decent proportion of readers to care about it. Alternatively, you can discuss how you would in practice deal with situations where error control matters - e.g., exploring 100 DVs, and reporting the one with the highest posterior is perfectly fine in Bayesian stats, but I see no guidelines on how to prevent massive amounts of misleading information if people work like this.
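The sensitivity analysis proposed in this comment could look roughly like the Python sketch below, which tabulates (rather than plots) the posterior for the SBP difference across a range of prior SDs. It uses the observed difference and confidence interval from Example 1 with a normal approximation. The grid of prior SDs is hypothetical, and the function and variable names are illustrative only.

```python
from statistics import NormalDist

def posterior(estimate, se, prior_sd, prior_mean=0.0):
    """Normal prior combined with a normal likelihood approximation."""
    w_prior, w_data = 1 / prior_sd**2, 1 / se**2
    var = 1 / (w_prior + w_data)
    return NormalDist(var * (w_prior * prior_mean + w_data * estimate), var**0.5)

z = NormalDist().inv_cdf(0.975)
estimate, se = 4.0, 18 / (2 * z)      # Example 1: difference 4, 0.95 CI [-5, 13]

rows = []
for prior_sd in (2.0, 5.1, 10.0, 50.0):   # hypothetical grid; 5.1 is roughly prior p1
    post = posterior(estimate, se, prior_sd)
    p_above_zero = 1 - post.cdf(0)
    rows.append((prior_sd, post.mean, p_above_zero))
    print(f"prior SD {prior_sd:5.1f}: posterior mean {post.mean:5.2f}, "
          f"P(difference > 0) = {p_above_zero:.2f}")
```

A more skeptical prior (smaller SD) shrinks the posterior mean further toward zero, so a reader can see how much the conclusion depends on the prior they personally hold.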

    1. You've raised a lot of important points and I may not get to all of them right away. The reasons I don't care (and no one doing treatment studies should care) about type I errors are detailed here:
      What we need is the probability that the treatment is ineffective (in an efficacy study), which is just one minus the posterior probability that it is effective. We need evidence for the one study at hand, and long-run error rates are not relevant for that. Regarding 100 DVs, the logical point for inserting skepticism about any one of them is the prior probability, not by putting skepticism on data such that the way you view one DV is influenced by the way you viewed the others.

      Concerning the sensitivity analysis, I see some sense in that. But in a regulated environment we are more likely to need to get the prior agreed upon jointly by the sponsor and the regulator.

      I'm honestly having trouble seeing how the frequentist approach augments a Bayesian analysis. I was a frequentist for about 20 years and was educated only in the frequentist paradigm, so no one can claim I didn't give it a fair shot.

      Concerning equivalence tests, I think we got off on the wrong foot in envisioning equivalence (which should be 'similarity') as something you test rather than something you estimate. A posterior probability is a direct evidential quantity and is something we estimate.