until a point has been proven or disproven, or until the data collector

runs out of time, money, or patience. - Edwards, Lindman, Savage (1963)

Bayesian inference, which follows the *likelihood principle*, is not affected by the experimental design or intentions of the investigator. P-values can only be computed if both of these are known, and as been described by Berry (1987) and others, it is almost never the case that the computation of the p-value at the end of a study takes into account all the changes in design that were necessitated when pure experimental designs encounter the real world.

When performing multiple data looks as a study progress, one can accelerate learning by more quickly abandoning treatments that do not work, by sometimes stopping early for efficacy, and frequently by arguing to extend a promising but as-yet-inconclusive study by adding subjects over the originally intended sample size. Indeed the whole exercise of computing a single sample size is thought to be voodoo by most practicing statisticians. It has become almost comical to listen to rationalizations for choosing larger detectable effect sizes so that smaller sample sizes will yield adequate power.

Multiplicity and resulting inflation of type I error when using frequentist methods is real. While Bayesians concern themselves with "what did happen?", frequentists must consider "what might have happened?" because of the backwards time and information flow used in their calculations. Frequentist inference must envision an indefinitely long string of identical experiments and must consider extremes of data over potential studies and over multiple looks within each study if multiple looks were intended. Multiplicity comes from the chances (over study repetitions and data looks) you give data to be more extreme (if the null hypothesis holds), not from the chances you give an effect to be real. it is only the latter that is of concern to a Bayesian. Bayesians entertain only one dataset at a time, and if one computes posterior probabilities of efficacy multiple times, it is only the last value calculated that matters.

To better understand the last point, consider a probabilistic pattern recognition system for identifying enemy targets in combat. Suppose the initial assessment when the target is distant is a probability of 0.3 of being an enemy vehicle. Upon coming closer the probability rises to 0.8. Finally the target is close enough (or the air clears) so that the pattern analyzer estimates a probability of 0.98. The fact that the probabilty was < 0.98 earlier is of no consequence as the gunner prepares to fire a canon. Even though the probability may actually decrease while the shell is in the air due to new information, the probability at the time of firing was completely valid based on then available information.

This is very much how an experimenter would work in a Bayesian clinical trial. The stopping rule is unimportant. Earlier data looks are irrelevant. The only ways a Bayesian would cheat would be to ignore a later look if it is less favorable than an earlier look, or to try to pull the wool over reviewers' eyes by changing the prior distribution once data patterns emerge.

The meaning and accuracy of posterior probabilities of efficacy in a clinical trial are mathematical necessities that follow from Bayes' rule, if the data model is correctly specified (this model is needed just as much by frequentist methods). So no simulations are needed to demonstrate these points. But for the non-mathematically minded, simulations can be comforting. For everyone, simulations expose the logic flow in the Bayesian analysis paradigm.

One other thing: when the frequentist does a sequential trial with possible early termination, the sampling distribution of the statistics becomes extremely complicated, but must be derived to allow one to obtain proper point estimates and confidence limits. It is almost never the case that the statistician actually performs these complex adjustments in a clinical trial with multiple looks. One example of the harm of ignoring this problem is that if the trial stops fairly early for efficacy, efficacy will be overestimated. On the other hand, the Bayesian posterior mean/median/mode of the efficacy parameter will be perfectly calibrated by the prior distribution you assume. If the prior is skeptical and one stops early, the posterior mean will be "pulled back" by a perfect amount, as shown in the simulation below.

We consider the simplest clinical trial design for illustration. The efficacy measure is assumed to be normally distributed with mean μ and variance 1.0, μ=0 indicates no efficacy, and μ<0 indicate a detrimental effect. Our inferential jobs are to see if evidence may be had for a positive effect and to see if further there is evidence for a clinically meaningful effect (except for the futility analysis, we will ignore the latter in what follows). Our business task is to not spend resources on treatments that have a low chance of having a meaningful benefit to patients. The latter can also be an ethical issue: we'd like not to expose too many patients to an ineffective treatment. In the simulation, we stop for futility when the probability that μ < 0.05 exceeds 0.9, considering μ=0.05 to be a minimal clinically important effect.

The logic flow in the simulation exposes what is assumed by the Bayesian analysis.

- The prior distribution for the unknown effect μ is taken as a mixture of two normal distributions, each with mean zero. This is a skeptical prior that gives an equal chance for detriment as for benefit from the treatment.
- In the next step it is seen that the Bayesian does not consider a stream of identical trials but instead (and only when studying performance of Bayesian methods) considers a stream of trials with
**different**efficacies of treatment, by drawing a single value of μ from the prior distribution. This is done independently for 50,000 simulated studies. Posterior probabilities are not informed by this value of μ. Bayesians operate in a predictive mode, trying for example to estimate Prob(μ>0) no matter what the value of μ. - For the current value of μ, simulate an observation from a normal distribution with mean μ and SD=1.0. [In the code below all n=500 subjects' data are simulated at once then revealed one-at-a-time.]
- Compute the posterior probability of efficacy (μ > 0) and of futility (μ < 0.05) using the original prior and latest data.
- Stop the study if the probability of efficacy ≥0.95 or the probability of futility ≥0.9.
- Repeat the last 3 steps, sampling one more subject each time and performing analyses on the accumulated set of subjects to date.
- Stop the study when 500 subjects have entered.

What is it that the Bayesian must demonstrate to the frequentist and reviewers? She must demonstrate that the posterior probabilities computed as stated above are accurate, i.e., they are well calibrated. From our simulation design, the final posterior probability will either be the posterior probability computed after the last (500th) subject has entered, the probability of futility at the time of stopping for futility, or the probability of efficacy at the time of stopping for efficacy. How do we tell if the posterior probability is accurate? By comparing it to the value of μ (unknown to the posterior probability calculation) that generated the sequence of data points that were analyzed. We can compute a smooth nonparametric calibration curve for each of (efficacy, futility) where the binary events are μ > 0 and μ < 0.05, respectively. For the subset of the 50,000 studies that were stopped early, the range of probabilities is limited so we can just compare the mean posterior probability at the moment of stopping with the proportion of such stopped studies for which efficacy (futility) was the truth. The mathematics of Bayes dictates the mean probability and the proportion must be the same (if enough trials are run so that simulation error approaches zero). This is what happened in the simulations.

For the smaller set of studies not stopping early, the posterior probability of efficacy is uncertain and will have a much wider range. The calibration accuracy of these probabilities is checked using a nonparametric calibration curve estimator just as we do in validating risk models, by fitting the relationship between the posterior probability and the binary event μ>0.

The simulations also demonstrated that the posterior mean efficacy at the moment of stopping is perfectly calibrated as an estimator of the true unknown μ.

Simulations were run in R and used functions in the R Hmisc and rms package. The results are below. Feel free to take the code and alter it to run any simulations you'd like.