Statisticians have convinced regulators that long-run operating characteristics of a testing procedure should rule the day, e.g., if we did 1000 clinical trials where efficacy was always zero, we want no more than 50 of these trials to be judged as "positive." Never mind that this type I error operating characteristic does not refer to making a correct judgment for the clinical trial at hand. Still, there is a belief that type I error is the probability of regulator's regret (a false positive), i.e., that the treatment is not effective when the data indicate it is. In fact, clinical trialists have been sold a bill of goods by statisticians. No probability derived from an assumption that the treatment has zero effect can provide evidence about that effect. Nor does it measure the chance of the error actually in question. All probabilities are conditional on something, and to be useful they must condition on the right thing. This usually means that what is conditioned upon must be knowable.
The probability of regulator's regret is the probability that a treatment doesn't work given the data, i.e., the probability that the treatment has no effect or has a backwards (harmful) effect. This is precisely one minus the Bayesian posterior probability of efficacy.
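Here is a minimal sketch of that calculation under an assumed normal prior for the treatment effect and a normal approximation to the likelihood; the prior parameters, estimated effect, and standard error below are purely hypothetical.

```python
# Minimal sketch (hypothetical numbers): posterior probability of efficacy
# under a conjugate normal prior and a normal approximation to the likelihood.
from scipy.stats import norm

prior_mean, prior_sd = 0.0, 0.5    # skeptical prior centered on "no effect"
obs_effect, obs_se   = 0.30, 0.15  # estimated treatment effect and its standard error

# Precision-weighted conjugate update
prior_prec, data_prec = 1 / prior_sd**2, 1 / obs_se**2
post_prec = prior_prec + data_prec
post_mean = (prior_prec * prior_mean + data_prec * obs_effect) / post_prec
post_sd   = post_prec ** -0.5

p_efficacy = 1 - norm.cdf(0, loc=post_mean, scale=post_sd)  # P(effect > 0 | data)
print(f"P(efficacy | data):        {p_efficacy:.3f}")
print(f"P(no or backwards effect): {1 - p_efficacy:.3f}")   # the regret probability
```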
In reality, a treatment whose effect is exactly zero is unlikely to exist. As Tukey argued in 1991, the effects of treatments A and B always differ at some decimal place, so the null hypothesis is always false and the type I error could be said to be always zero.
The best paper I've read about the many ways in which p-values are misinterpreted is *Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations*, written by a group of renowned statisticians. One of my favorite quotes from this paper is:
> Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions.

In 2016 the American Statistical Association took a stand against over-reliance on p-values. This would have made a massive impact on all branches of science had it been issued 50 years ago, but better late than never.
Update 2017-01-19: Though believed to be true by many non-statisticians, p-values are not the probability that H0 is true, and turning them into such probabilities requires Bayes' rule. If you are going to use Bayes' rule you might as well formulate the problem as a full Bayesian model. This has many benefits, not the least of them being that you can select an appropriate prior distribution and you will get exact inference. Attempts by several authors to convert p-values to probabilities of interest (just as sensitivity and specificity are converted to the probability of disease once one knows the prevalence of disease) have taken the prior to be discontinuous, putting a high probability on H0 being exactly true. In my view it is much more sensible to believe that there is no discontinuity in the prior at the point represented by H0, and, when no relevant prior information is available, to encapsulate prior knowledge by making values near H0 more probable than values far from it.
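The diagnostic-testing analogy can be made concrete with a short calculation; the sensitivity, specificity, and prevalence below are made-up values, included only to show how Bayes' rule supplies the conversion once a prior (here, the prevalence) is specified.

```python
# Minimal sketch (hypothetical numbers) of the diagnostic analogy:
# Bayes' rule converts sensitivity and specificity into the probability
# of disease once the prevalence (the prior) is known.
sensitivity = 0.90   # P(test positive | disease)
specificity = 0.95   # P(test negative | no disease)
prevalence  = 0.02   # P(disease) -- the prior this conversion requires

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(f"P(disease | positive test) = {p_disease_given_positive:.3f}")  # ~0.27 here
```

Converting a p-value into a probability that H0 is true requires an analogous prior on the hypothesis itself, which is where the point-mass-versus-continuous-prior disagreement enters.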
Returning to the non-relevance of type I error as discussed above, and ignoring for the moment that long-run operating characteristics do not directly assist us in making judgments about the current experiment, there is a subtle problem that leads researchers to believe that by controlling type I "error" they have quantified the probability of misleading evidence. As discussed at length by my colleague Jeffrey Blume, once an experiment is done the probability that positive evidence is misleading is not the type I error. And what exactly does "error" mean in "type I error"? It is the probability of rejecting H0 when H0 is exactly true, just as the p-value is the probability of obtaining data more impressive than those observed given that H0 is true. Are these really error probabilities? Perhaps ... if you have been misled earlier into believing that conclusions should be based on how unlikely the observed data would have been under H0. Part of the problem is the loaded word "reject." Rejecting H0 merely because the data are unlikely if H0 is true is perhaps the real error.
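A small simulation may help illustrate the distinction (this is my sketch, not Blume's derivation; the fraction of truly null treatments, the sample size, and the effect size are all assumed): the proportion of "positive" trials in which the evidence is misleading can be far from the type I error α.

```python
# Minimal simulation sketch (assumed settings): the chance that a "positive"
# result is misleading, P(H0 true | rejection), is not the type I error alpha.
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_per_arm = 100_000, 64
p_true_null = 0.9          # assumed: most candidate treatments have zero effect
effect_when_real = 0.5     # assumed standardized effect when an effect exists

null_true = rng.random(n_trials) < p_true_null
effect = np.where(null_true, 0.0, effect_when_real)
# one-sided two-sample z test with unit-variance outcomes: z ~ N(effect*sqrt(n/2), 1)
z = rng.normal(effect * np.sqrt(n_per_arm / 2), 1.0)
reject = z > 1.645         # "positive" trial at one-sided alpha = 0.05

print("P(reject | H0 true), the type I error:", round(reject[null_true].mean(), 3))        # ~0.05
print("P(H0 true | rejected), misleading positives:", round(null_true[reject].mean(), 3))  # ~0.3
```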
The "error quantification" truly needed is the probability that a treatment doesn't work given all the current evidence, which as stated above is simply one minus the Bayesian posterior probability of positive efficacy.
Update 2017-01-20: Type I error control is an indirect way of being careful about claims of effects. It should never have been the preferred method for achieving that goal. Seen another way, we would choose type I error as the quantity to be controlled if we wanted to:
- require the experimenter to visualize an infinite number of experiments that might have been run, and assume that the current experiment could be exactly replicated
- be interested in long-run operating characteristics rather than in the judgment that needs to be made for the one experiment at hand
- be interested in the probability that other replications result in data more extreme than mine if there is no treatment effect
- require early looks at the data to be discounted for future looks
- require later looks at the data to be discounted for earlier inconsequential looks (see the simulation sketch after this list)
- create other multiplicity considerations, all of them arising from the chances you give data to be extreme as opposed to the chances that you give effects to be positive
  - data can be more extreme for a variety of reasons such as trying to learn faster by looking more often or trying to learn more by comparing more doses or more drugs
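As a concrete illustration of the look-related items above, here is a small simulation under assumed settings: repeatedly testing accumulating data at a nominal 0.05 level gives the data more chances to be extreme, so the overall chance of a false "positive" claim grows well beyond 0.05 even though the true effect is exactly zero.

```python
# Minimal sketch (assumed settings): four interim looks at accumulating data,
# each tested at a nominal two-sided 0.05 level, inflate the overall type I error.
import numpy as np

rng = np.random.default_rng(1)
n_sims, n_max = 20_000, 200
looks = (50, 100, 150, 200)                       # interim analyses at these sample sizes

x = rng.normal(0.0, 1.0, size=(n_sims, n_max))    # outcomes with zero true effect
n = np.arange(1, n_max + 1)
z = (np.cumsum(x, axis=1) / n) * np.sqrt(n)       # one-sample z statistic after each observation

reject_at_looks = np.abs(z[:, [k - 1 for k in looks]]) > 1.96
print("Reject at the final look only:", reject_at_looks[:, -1].mean())       # ~0.05
print("Reject at any of the 4 looks: ", reject_at_looks.any(axis=1).mean())  # ~0.13
```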