Monday, October 9, 2017

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired

(In a Bayesian analysis) It is entirely appropriate to collect data
until a point has been proven or disproven, or until the data collector
runs out of time, money, or patience. - Edwards, Lindman, Savage (1963)

Bayesian inference, which follows the likelihood principle, is not affected by the experimental design or intentions of the investigator. P-values can only be computed if both of these are known, and as been described by Berry (1987) and others, it is almost never the case that the computation of the p-value at the end of a study takes into account all the changes in design that were necessitated when pure experimental designs encounter the real world.

When performing multiple data looks as a study progress, one can accelerate learning by more quickly abandoning treatments that do not work, by sometimes stopping early for efficacy, and frequently by arguing to extend a promising but as-yet-inconclusive study by adding subjects over the originally intended sample size. Indeed the whole exercise of computing a single sample size is thought to be voodoo by most practicing statisticians. It has become almost comical to listen to rationalizations for choosing larger detectable effect sizes so that smaller sample sizes will yield adequate power.

Multiplicity and resulting inflation of type I error when using frequentist methods is real. While Bayesians concern themselves with "what did happen?", frequentists must consider "what might have happened?" because of the backwards time and information flow used in their calculations. Frequentist inference must envision an indefinitely long string of identical experiments and must consider extremes of data over potential studies and over multiple looks within each study if multiple looks were intended. Multiplicity comes from the chances (over study repetitions and data looks) you give data to be more extreme (if the null hypothesis holds), not from the chances you give an effect to be real. It is only the latter that is of concern to a Bayesian. Bayesians entertain only one dataset at a time, and if one computes posterior probabilities of efficacy multiple times, it is only the last value calculated that matters.

To better understand the last point, consider a probabilistic pattern recognition system for identifying enemy targets in combat. Suppose the initial assessment when the target is distant is a probability of 0.3 of being an enemy vehicle. Upon coming closer the probability rises to 0.8. Finally the target is close enough (or the air clears) so that the pattern analyzer estimates a probability of 0.98. The fact that the probabilty was < 0.98 earlier is of no consequence as the gunner prepares to fire a canon. Even though the probability may actually decrease while the shell is in the air due to new information, the probability at the time of firing was completely valid based on then available information.

This is very much how an experimenter would work in a Bayesian clinical trial. The stopping rule is unimportant when interpreting the final evidence. Earlier data looks are irrelevant. The only ways a Bayesian would cheat would be to ignore a later look if it is less favorable than an earlier look, or to try to pull the wool over reviewers' eyes by changing the prior distribution once data patterns emerge.

The meaning and accuracy of posterior probabilities of efficacy in a clinical trial are mathematical necessities that follow from Bayes' rule, if the data model is correctly specified (this model is needed just as much by frequentist methods). So no simulations are needed to demonstrate these points. But for the non-mathematically minded, simulations can be comforting. For everyone, simulation code exposes the logic flow in the Bayesian analysis paradigm.

One other thing: when the frequentist does a sequential trial with possible early termination, the sampling distribution of the statistics becomes extremely complicated, but must be derived to allow one to obtain proper point estimates and confidence limits. It is almost never the case that the statistician actually performs these complex adjustments in a clinical trial with multiple looks. One example of the harm of ignoring this problem is that if the trial stops fairly early for efficacy, efficacy will be overestimated. On the other hand, the Bayesian posterior mean/median/mode of the efficacy parameter will be perfectly calibrated by the prior distribution you assume. If the prior is skeptical and one stops early, the posterior mean will be "pulled back" by a perfect amount, as shown in the simulation below.

We consider the simplest clinical trial design for illustration. The efficacy measure is assumed to be normally distributed with mean μ and variance 1.0, μ=0 indicates no efficacy, and μ<0 indicates a detrimental effect. Our inferential jobs are to see if evidence may be had for a positive effect and to see if further there is evidence for a clinically meaningful effect (except for the futility analysis, we will ignore the latter in what follows). Our business task is to not spend resources on treatments that have a low chance of having a meaningful benefit to patients. The latter can also be an ethical issue: we'd like not to expose too many patients to an ineffective treatment. In the simulation, we stop for futility when the probability that μ < 0.05 exceeds 0.9, considering μ=0.05 to be a minimal clinically important effect.

The logic flow in the simulation exposes what is assumed by the Bayesian analysis.

  1. The prior distribution for the unknown effect μ is taken as a mixture of two normal distributions, each with mean zero. This is a skeptical prior that gives an equal chance for detriment as for benefit from the treatment. Any prior would have done.
  2. In the next step it is seen that the Bayesian does not consider a stream of identical trials but instead (and only when studying performance of Bayesian operating characteristics) considers a stream of trials with different efficacies of treatment, by drawing a single value of μ from the prior distribution. This is done independently for 50,000 simulated studies. Posterior probabilities are not informed by this value of μ. Bayesians operate in a predictive mode, trying for example to estimate Prob(μ>0) no matter what the value of μ.
  3. For the current value of μ, simulate an observation from a normal distribution with mean μ and SD=1.0. [In the code below all n=500 subjects' data are simulated at once then revealed one-at-a-time.]
  4. Compute the posterior probability of efficacy (μ > 0) and of futility (μ < 0.05) using the original prior and latest data.
  5. Stop the study if the probability of efficacy ≥0.95 or the probability of futility ≥0.9.
  6. Repeat the last 3 steps, sampling one more subject each time and performing analyses on the accumulated set of subjects to date.
  7. Stop the study when 500 subjects have entered.

What is it that the Bayesian must demonstrate to the frequentist and reviewers? She must demonstrate that the posterior probabilities computed as stated above are accurate, i.e., they are well calibrated. From our simulation design, the final posterior probability will either be the posterior probability computed after the last (500th) subject has entered, the probability of futility at the time of stopping for futility, or the probability of efficacy at the time of stopping for efficacy. How do we tell if the posterior probability is accurate? By comparing it to the value of μ (unknown to the posterior probability calculation) that generated the sequence of data points that were analyzed. We can compute a smooth nonparametric calibration curve for each of (efficacy, futility) where the binary events are μ > 0 and μ < 0.05, respectively. For the subset of the 50,000 studies that were stopped early, the range of probabilities is limited so we can just compare the mean posterior probability at the moment of stopping with the proportion of such stopped studies for which efficacy (futility) was the truth. The mathematics of Bayes dictates the mean probability and the proportion must be the same (if enough trials are run so that simulation error approaches zero). This is what happened in the simulations.

For the smaller set of studies not stopping early, the posterior probability of efficacy is uncertain and will have a much wider range. The calibration accuracy of these probabilities is checked using a nonparametric calibration curve estimator just as we do in validating risk models, by fitting the relationship between the posterior probability and the binary event μ>0.

The simulations also demonstrated that the posterior mean efficacy at the moment of stopping is perfectly calibrated as an estimator of the true unknown μ.

Simulations were run in R and used functions in the R Hmisc and rms package. The results are below. Feel free to take the code and alter it to run any simulations you'd like.


  1. Excellent post Frank. Although it still highlights a difference in philosophy about the DGP that I have.

    This assumes that the generative parameter indeed varies according to the prior, as you state. Such that, mu ~ prior; y ~ mu. My issue with that still remains that it's assuming that the true parameter, in the universe, does vary according to some distribution. Whereas parameters may change over time or across subpopulations, I'm not sold on the idea that the generative parameter is indeed drawn from some probability distribution for any given collection effort. Likewise, if you go into the discrete world of things, we see a similar philosophy. E.g., BFs and p(H_k|y) are shown to be calibrated such that when the BF is 6, then across repeated samples, there is indeed a 6:1 odds that the data were generated under H1 than under H0. But my issue with that is it assumes, in the universe, that H1 and H0 are simultaneously true, but vary in probability. E.g., 80% of the time H1 is true, and 20% of the time H0 is true. I think that's a weird statement to make about the universe --- That two hypotheses about how the universe works can simultaneously be true any given proportion of the time. It's drawing a truth statement from an urn.

    The same notion is present in this simulation --- The state of truth randomly varies; ASSUMING this is true, then sure it is calibrated. But I think from a philosophy of science standpoint, it's a little weird. "Theta = .26 is true for the moment you are collecting data; but Theta = -1.2 is true for another moment you are collecting data".

    Again, not to say parameters can't actually vary a bit over time, across cultures, across subpopulations (and for this, longitudinal, mixture, or random effects models can capture that tendency). But to say at any given moment the generative parameter is a totally random process is strange to me.

    When you don't assume the above is true, when you instead assume a generative parameter is stable but unknown (but data are fixed, because we observed them), the probability of making decisions does change, as Kruschke, I, and many others have shown. E.g., if mu = 1.2 [which you know, because you are creating the universe at hand] and y ~ N(mu, sigma); but as an unknowing scientist you estimate a model: mu ~ prior; sigma ~ prior; then decision rates can change with sequential sampling. So it seems to me that the only way Bayesian models are calibrated, is if the true generative parameter can indeed change in the universe each time you sample from it, and that's difficult for me to digest. (Otoh, other Bayesians indeed don't assume that, and use a model that adjusts for sampling intentions in a fully Bayesian way through likelihood or prior modifications).

  2. Hi Stephen - I think the simulation is more consistent with your world view than you think. The value of the parameter generating your data is fixed. The value doesn't need to change the next time you generate data. The use of a probability distribution behind the scene is a useful device to deal with "would that we only knew the value now." Parameters don't need to vary across time, cultures, etc. But for any dataset we have at present THE value is unknown and we find probabilities helpful for dealing with uncertainties.

    If you wanted to use the same value of mu each time, you have to effectively force the prior to be degenerate, and the Bayesian posterior will behave exactly as you wish. There is no way to use a prior that is smooth but having the data be forced to be generated from only one value. I think these are very subtle points, and worthy of further discussion.

    I don't see how you can be Bayesian and need to keep track of sampling intentions.

  3. Interesting. I have done similar simulations, but not for optional stopping. Rather, looking at estimation in longitudinal models across hypothesized distributions. I describe it slightly differently, however. Sure, the value does differ when drawing from prior but the value also differs when assuming one value. The bayesian approach has two sources of uncertainty: 1) in the effect; 2) sampling variability. Frequentist only has sampling variability. Lastly, both approaches assume that there is a true value, but differ in how this is described. Bayes allows the "truth"-eg, effect of interest-to have a probabilty distribution that allows for incorporating uncertainty into the simulations.

  4. Donald I'm not clear on 'the value also differs when assuming one value.' Perhaps this whole issue can become more clear by stating that whether or not we assume the same value of the parameter in different studies comes from the experimental design. If you had two independent replications of a study you could have pooled all the raw data together and estimated a single parameter whether Bayesian or frequentist. The bigger point is that describing uncertainty in known quantities is very conveniently done using probability distributions, and this also results in good solutions to everyday problems as well as to highly complex ones.

    1. Hi:
      The value differs because of sampling variability, but in an unbiased situation the average frequency should be the assumed point estimate. So, for any given iteration, the value (in this case the mean) will never be exactly the same:

      mu_f <- replicate(1000, mean(rnorm(100, 0, 1)))
      # here values span from -.3 to .3.

      In contrast, the Bayes assumption has this variability and incorporates the fact that we do not know the exact value for a treatment. However, we can still think one value exists, which is the assumption of Bayesian methods, but we also allow for uncertainty surrounding the "true" effect. This is different than suggesting many effect exist. Of course, I think the question is not which assumption is "true", but understanding the context under which assuming a set of assumptions allows for richer inference.

      mu_b <- replicate(1000, mean(rnorm(100, rnorm(1, 0, .2), 1)))
      # here, the values span from < - 0.5 and > .5

      In both cases, the returned mean is basically zero but the Bayes has captured uncertainty in the effect. This does not necessarily mean that we think the treatment effect is actually different, as indicated by the unbiasedness of the estimator.

    2. Hi Donald. Sorry to be slow in responding due to attending the ASA Statistical Inference conference. It was good to see you there also.

      The experimental design and assumptions between studies can dictate in all statistical paradigms that the parameter is the same. For example, you can adjust for the subject's sex in the model as a covariate but assume that the sex regression coefficient is zero, meaning that you assume that mu is the same for females and males. You can also pool studies if you are fairly certain they are generated from the same mu. And with Bayes, it is not necessarily the case that there is fuzz in mu. Bayes is consistent with their being one true single-valued mu. It's just that we don't know what it is, and the less we know the more uncertain will be the final Bayes result (wider posterior distribution). I think of Bayes this way: We don't know what an experiments is throwing at the Bayesian analysis. The Bayesian model you specify should be able to handle a variety of possible values. If your theory dictates that some values are impossible (e.g., with blood pressure it can't be below about 30mmHg or the patient would not have lived long enough to enroll in the study), those values will be excluded in the formulation of the prior distribution to improve the final estimates. After such considerations, we want the Bayesian model to properly deal with whatever the experiment is throwing at it. We demonstrate that by simulating mu from a prior consistent with our prior beliefs/knowledge. Uncertainty is coming from absence of knowledge of the single true mu, not from any inherent fuzziness in the true mu that is generating the data at hand for one experimental design.

      Another way to think about this is that in the frequentist world we waste an incredible amount of funding by designing studies to detect a single mu, where that detectable value of mu has been set with too little knowledge. In Bayesian sample size estimation, uncertainty around mu, through a prior distribution, goes into the Bayesian power calculation, resulting in much more honest sample sizes. With Bayes you can compute the expected sample size, the 0.95 quantile of the sample size, etc., because of admission of uncertainty about the true mu.

  5. Frank, in your simulations please calculate the efficacy, effect sized, and variances across the various Ns and plot them. What does a plot of efficacy against N look like with a dashed line of the actual efficacy in the population?

  6. For my situation the pertinent thing to emphasize would be the width of say a 0.95 credible interval. And you'll find that the posterior mean tracks the true mean (efficacy) no matter what the N. Feel free to modify the code to do any of this.

  7. I would disagree with your characterization of the frequentist approach to optional stopping. As I showed in "Almost sure hypothesis testing and a resolution of the Jeffreys-Lindley paradox", one can make the probability of incorrectly stopping an experiment arbitrarily small by selecting a sequence of significance levels that decrease to 0 at an appropriate rate. The frequentist only needs to bound the probability of incorrectly terminating the experiment.

    1. I can't disagree with that, though I'm not convinced of the relevance of a frequentist strategy that no one uses in practice and that hurts type II error.