FDA Office of Biostatistics

Enhancing capabilities of CDER and its Office of Biostatistics in Bayesian clinical trial design and analysis

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired

(In a Bayesian analysis) It is entirely appropriate to collect data until a point has been proven or disproven, or until the data collector runs out of time, money, or patience. Edwards, Lindman, Savage (1963) Introduction Bayesian inference, which follows the likelihood principle, is not affected by the experimental design or intentions of the investigator. P-values can only be computed if both of these are known, and as been described by Berry (1987) and others, it is almost never the case that the computation of the p-value at the end of a study takes into account all the changes in design that were necessitated when pure experimental designs encounter the real world.

Bayesian vs. Frequentist Statements About Treatment Efficacy

The following examples are intended to show the advantages of Bayesian reporting of treatment efficacy analysis, as well as to provide examples contrasting with frequentist reporting. As detailed here, there are many problems with p-values, and some of those problems will be apparent in the examples below. Many of the advantages of Bayes are summarized here. As seen below, Bayesian posterior probabilities prevent one from concluding equivalence of two treatments on an outcome when the data do not support that (i.

Statistical Errors in the Medical Literature

Misinterpretation of P-values and Main Study Results Dichotomania Problems With Change Scores Improper Subgrouping Serial Data and Response Trajectories As Doug Altman famously wrote in his Scandal of Poor Medical Research in BMJ in 1994, the quality of how statistical principles and analysis methods are applied in medical research is quite poor. According to Doug and to many others such as Richard Smith, the problems have only gotten worse.

My Journey From Frequentist to Bayesian Statistics

Type I error for smoke detector: probability of alarm given no fire=0.05 Bayesian: probability of fire given current air data Frequentist smoke alarm designed as most research is done: Set the alarm trigger so as to have a 0.8 chance of detecting an inferno Advantage of actionable evidence quantification: Set the alarm to trigger when the posterior probability of a fire exceeds 0.02 while at home and at 0.

A Litany of Problems With p-values

In my opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress. The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values. Now it’s time to act.

Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error

Imagine watching a baseball game, seeing the batter get a hit, and hearing the announcer say “The chance that the batter is left handed is now 0.2!” No one would care. Baseball fans are interested in the chance that a batter will get a hit conditional on his being right handed (handedness being already known to the fan), the handedness of the pitcher, etc. Unless one is an archaeologist or medical examiner, the interest is in forward probabilities conditional on current and past states.

p-values and Type I Errors are Not the Probabilities We Need

In trying to guard against false conclusions, researchers often attempt to minimize the risk of a “false positive” conclusion. In the field of assessing the efficacy of medical and behavioral treatments for improving subjects’ outcomes, falsely concluding that a treatment is effective when it is not is an important consideration. Nowhere is this more important than in the drug and medical device regulatory environments, because a treatment thought not to work can be given a second chance as better data arrive, but a treatment judged to be effective may be approved for marketing, and if later data show that the treatment was actually not effective (or was only trivially effective) it is difficult to remove the treatment from the market if it is safe.

Null Hypothesis Significance Testing Never Worked

Much has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals. Rejection of straw-man null hypotheses leads researchers to believe that their theories are supported, and the unquestioning use of a threshold such as p<0.05 has resulted in hypothesis substitution, search for subgroups, and other gaming that has badly damaged science. But we seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.