## Sunday, February 5, 2017

### A Litany of Problems With p-values

In my opinion, null hypothesis testing and p-values have done significant harm to science.  The purpose of this note is to catalog the many problems caused by p-values.  As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values.  Now it's time to act.  To create the needed motivation to change, we need to fully describe the depth of the problem.

It is important to note that no statistical paradigm is perfect.  Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults.  This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.

Consider an assertion such as "the coin is fair", "treatment A yields the same blood pressure as treatment B", "B yields lower blood pressure than A", or "B lowers blood pressure by at least 5mmHg more than A."  Consider also a compound assertion such as "A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke."

### A. Problems With Conditioning

1. p-values condition on what is unknown (the assertion of interest; H0) and do not condition on what is known (the data).
2. This conditioning does not respect the flow of time and information; p-values are backward probabilities.

### B. Indirectness

1. Because of A above, p-values provide only indirect evidence and are problematic as evidence metrics.  They are sometimes monotonically related to the evidence we need (e.g., when the prior distribution is flat) but are not properly calibrated for decision making.
2. p-values are used to bring indirect evidence against an assertion but cannot bring evidence in favor of the assertion.
3. As detailed here, the idea of proof by contradiction is a stretch when working with probabilities, so trying to quantify evidence for an assertion by bringing evidence against its complement is on shaky ground.
4. Because of A, p-values are difficult to interpret and very few non-statisticians get it right.  The best article on misinterpretations I've found is here.
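To make B1 concrete: in the simplest normal-mean model with a flat prior (a textbook special case, not a general claim), the one-sided p-value coincides numerically with the posterior probability that the effect has the opposite sign. A quick sketch in Python, with illustrative numbers:

```python
from scipy.stats import norm

# Illustrative model: a single observation y ~ Normal(theta, 1); observed y0.
y0 = 1.3

# One-sided p-value for H0: theta <= 0, computed at the boundary theta = 0:
# P(Y >= y0 | theta = 0)
p_one_sided = 1 - norm.cdf(y0)

# Bayesian posterior with a flat (improper uniform) prior on theta:
# theta | y0 ~ Normal(y0, 1), so the posterior probability that the effect
# has the "wrong" sign is P(theta <= 0 | y0)
post_wrong_sign = norm.cdf(0, loc=y0, scale=1)

print(p_one_sided, post_wrong_sign)  # the two quantities coincide
```

This is exactly the "monotonically related when the prior is flat" case of B1; the relationship breaks once any real prior information enters.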

### C. Problem Defining the Event Whose Probability is Computed

1. In the continuous data case, the probability of getting a result as extreme as that observed with our sample is zero, so the p-value is the probability of getting a result more extreme than that observed.  Is this the correct point of reference?
2. How does more extreme get defined if there are sequential analyses and multiple endpoints or subgroups?  For sequential analyses do we consider planned analyses or analyses intended to be run even if they were not?

### D. Problems Actually Computing p-values

1. In some discrete data cases, e.g., comparing two proportions, there is tremendous disagreement among statisticians about how p-values should be calculated.  In a famous 2x2 table from an ECMO adaptive clinical trial, 13 p-values have been computed from the same data, ranging from 0.001 to 1.0.  And many statisticians do not realize that Fisher's so-called "exact" test is not very accurate in many cases.
2. Outside of binomial, exponential, and normal (with equal variance) and a few other cases, p-values are actually very difficult to compute exactly, and many p-values computed by statisticians are of unknown accuracy (e.g., in logistic regression and mixed effects models).  The more non-quadratic the log-likelihood function, the more problematic this becomes in many cases.
3. One can compute (sometimes requiring simulation) the type-I error of many multi-stage procedures, but actually computing a p-value that can be taken out of context can be quite difficult and sometimes impossible.  One example: one can control the false discovery probability (incorrectly usually referred to as a rate), and ad hoc modifications of nominal p-values have been proposed, but these are not necessarily in line with the real definition of a p-value.
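To illustrate D1, even the most standard choices of test give different p-values for the same 2x2 table. The table below is hypothetical (not the ECMO data), chosen only to show the disagreement:

```python
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 table -- purely illustrative, NOT the ECMO trial data.
# rows = treatment A/B, columns = success/failure
table = [[10, 2],
         [4, 8]]

# Three common, defensible choices give three different p-values:
_, p_fisher = fisher_exact(table)                      # Fisher's "exact" test
p_chi2 = chi2_contingency(table, correction=False)[1]  # Pearson chi-square
p_yates = chi2_contingency(table, correction=True)[1]  # Yates-corrected chi-square

print(p_fisher, p_chi2, p_yates)  # three answers to the same question
```

With 13 methods in play, as in the ECMO example, the spread only gets wider.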

### E. The Multiplicity Mess

1. Frequentist statistics does not have a recipe or blueprint leading to a unique solution for multiplicity problems, so when many p-values are computed, the way they are penalized for multiple comparisons results in endless arguments.  A Bonferroni multiplicity adjustment is consistent with a Bayesian prior distribution specifying that the probability that all null hypotheses are true is a constant no matter how many hypotheses are tested.  By contrast, Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A), P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions about a true effect.
2. There remains controversy over the choice of 1-tailed vs. 2-tailed tests.  The 2-tailed test can be thought of as a multiplicity penalty for being potentially excited about either a positive effect or a negative effect of a treatment.  But few researchers want to bring evidence that a treatment harms patients; a pharmaceutical company would not seek a licensing claim of harm.  So when one computes the probability of obtaining an effect larger than that observed if there is no true effect, why do we too often ignore the sign of the effect and compute the (2-tailed) p-value?
3. Because it is a very difficult problem to compute p-values when the assertion is compound, researchers using frequentist methods do not attempt to provide simultaneous evidence regarding such assertions and instead rely on ad hoc multiplicity adjustments.
4. Because of A1, statistical testing with multiple looks at the data, e.g., in sequential data monitoring, is ad hoc and complex.  Scientific flexibility is discouraged.  The p-value for an early data look must be adjusted for future looks.  The p-value at the final data look must be adjusted for the earlier inconsequential looks.  Unblinded sample size re-estimation is another case in point.  If the sample size is expanded to gain more information, there is a multiplicity problem and some of the methods commonly used to analyze the final data effectively discount the first wave of subjects.  How can that make any scientific sense?
5. Most practitioners of frequentist inference do not understand that multiplicity comes from chances you give data to be extreme, not from chances you give true effects to be present.
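A quick simulation sketch of the multiplicity problem in E (illustrative settings, assuming independent tests): with 10 true null hypotheses, testing each at 0.05 inflates the chance of at least one "significant" result to roughly 40%, while a Bonferroni adjustment holds it near 5%:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)
m, reps, n = 10, 2000, 30   # 10 true null hypotheses per simulated "study"

any_unadj, any_bonf = 0, 0
for _ in range(reps):
    # all m nulls are TRUE: the data are pure noise
    pvals = [ttest_1samp(rng.normal(0, 1, n), 0).pvalue for _ in range(m)]
    any_unadj += min(pvals) < 0.05        # naive: each test at 0.05
    any_bonf += min(pvals) < 0.05 / m     # Bonferroni: each test at 0.05/m

print(any_unadj / reps, any_bonf / reps)  # ~0.40 vs ~0.05
```

This shows the mechanics of E5 as well: the inflation comes entirely from the chances the data were given to be extreme, since no true effects exist here.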

### F. Problems With Non-Trivial Hypotheses

1. It is difficult to test non-point hypotheses such as "drug A is similar to drug B".
2. There is no straightforward way to test compound hypotheses coming from logical unions and intersections.

### G. Inability to Incorporate Context and Other Information

1. Because extraordinary claims require extraordinary evidence, there is a serious problem with the p-value's inability to incorporate context or prior evidence.  A Bayesian analysis of the existence of ESP would no doubt start with a very skeptical prior that would require extraordinary data to overcome, but the bar for getting a "significant" p-value is fairly low. Frequentist inference has a greater risk for getting the direction of an effect wrong (see here for more).
2. p-values are unable to incorporate outside evidence.  As a converse to 1, strong prior beliefs cannot be handled by p-values, and in some cases this results in a lack of progress.  Nate Silver in The Signal and the Noise beautifully details how the conclusion that cigarette smoking causes lung cancer was greatly delayed (with a large negative effect on public health) because scientists (especially Fisher) were caught up in the frequentist way of thinking, dictating that only randomized trial data would yield a valid p-value for testing cause and effect.  Even a Bayesian prior that was very strongly against the belief that smoking was causal would have been obliterated by the incredibly strong observational data.  Only by incorporating prior skepticism could one make a strong conclusion with non-randomized data in the smoking-lung cancer debate.
3. p-values require subjective input from the producer of the data rather than from the consumer of the data.
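G1 can be quantified with the Sellke-Bayarri-Berger bound, which says the Bayes factor in favor of H0 is at least -e·p·log(p) for p < 1/e. Under a skeptical prior of 1% (an illustrative number for something like ESP), even p = 0.05 leaves the claim highly improbable:

```python
import math

p = 0.05            # a "significant" two-sided p-value
prior_h1 = 0.01     # skeptical prior: 1% chance the effect (e.g., ESP) is real

# Sellke-Bayarri-Berger lower bound on the Bayes factor in favor of H0,
# valid for p < 1/e; its reciprocal bounds the evidence against H0.
bf_h0_lower = -math.e * p * math.log(p)   # ~0.41
max_bf_h1 = 1 / bf_h0_lower               # at most ~2.5:1 against H0

prior_odds_h1 = prior_h1 / (1 - prior_h1)
post_odds_h1 = prior_odds_h1 * max_bf_h1
post_h1 = post_odds_h1 / (1 + post_odds_h1)

print(post_h1)  # ~0.02: the skeptical prior is barely moved by p = 0.05
```

The "significant" p-value can shift the odds by at most a factor of about 2.5, nowhere near the extraordinary evidence an extraordinary claim demands.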

### H. Problems Interpreting and Acting on "Positive" Findings

1. With a large enough sample, a trivial effect can cause an impressively small p-value (statistical significance ≠ clinical significance).
2. Statisticians and subject matter researchers (especially the latter) sought a "seal of approval" for their research by naming a cutoff on what should be considered "statistically significant", and a cutoff of p=0.05 is most commonly used.  Any time there is a threshold there is a motive to game the system, and gaming (p-hacking) is rampant.  Hypotheses are exchanged if the original H0 is not rejected, subjects are excluded, and because statistical analysis plans are not pre-specified as required in clinical trials and regulatory activities, researchers and their all-too-accommodating statisticians play with the analysis until something "significant" emerges.
3. When the p-value is small, researchers act as though the point estimate of the effect is a population value.
4. When the p-value is small, researchers believe that their conceptual framework has been validated.

### I. Problems Interpreting and Acting on "Negative" Findings

1. Because of B2, large p-values are uninformative and do not assist the researcher in decision making (Fisher said that a large p-value means "get more data").

1. Thanks for the list! I for one didn't realize that Fisher's exact test is not very accurate!

2. Great post, Frank. Re problem I.1, I think this one has implications for publication bias and the reproducibility crisis: the fact that non-significant p-values are regarded as uninformative (rather than as evidence against an effect) is part of the reason why studies with non-significant findings aren't published. Some authors and journals might be willing to publish evidence against a hypothesis, but not be interested in publishing findings that are uninformative. But when publication is conditional on p < 0.05, a biased literature results.

3. For I1 I'm not sure B2 is the best argument. Perhaps you might add that because the p-value distribution is flat when the null hypothesis is true, a failure to reject the null hypothesis means that every p-value was equally probable to have occurred, and therefore the magnitude of the p-value is meaningless.

1. Good point. But when we are trying to bring evidence in favor of H0 we can't assume H0, which is what is required when computing the p-value.
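The flat null distribution of the p-value mentioned above is easy to verify by simulation (illustrative settings):

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)

# Simulate 10,000 studies in which the null hypothesis is exactly true
pvals = np.array([ttest_1samp(rng.normal(0, 1, 20), 0).pvalue
                  for _ in range(10_000)])

# Under a true null the p-value is Uniform(0, 1): its mean is ~0.5 and
# the fraction below any threshold alpha is ~alpha
print(pvals.mean(), (pvals < 0.05).mean())  # ~0.5 and ~0.05
```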

2. Why in the world would you stick with Fisherian tests when they were replaced by Neyman-Pearson style tests long ago? I don't care what you call them; there's a statistical alternative. There are a great many results in science that are null results, and setting upper bounds on discrepancies on the basis of "negative" results couldn't be more common. Do we really have to keep coming back to pre-1933, before N-P tests were developed, or focus on some very distorted examples from the social sciences (which ironically use power, which is an N-P notion)?
I don't interpret tests or intervals in a so-called behavioristic fashion, but that doesn't stop one from using proper statistical tests with alternatives.

4. In phase III drug development confirmatory or pivotal trials, sample size is calculated under the Neyman-Pearson decision framework (for example, effect size as a single alternative hypothesis, sigma as an assumption, and alpha + beta as conditions). The p-value does not matter, just the critical region. As long as we have enough patients, this makes sense to me. What do you think?
But I'm confused because this approach is also used in trials aimed at adding evidence (the big majority), not at enabling a decision. From my view, and in accordance with reporting guidelines, what matter are uncertainty measures. Within classical inference, precision trials (designed to achieve some pre-specified width of the 95% CI) are more appropriate. But they are very, very rare...
What do you think?

1. I have a hard time enjoying either Neyman-Pearson or p-values. They are equally indirect. I think things would be drastically different had Rev. Bayes had a PR machine.

2. Frank calls error probabilities of methods "indirect" because he assumes (without argument) that what's "directly" wanted is a posterior probability of a hypothesis or model.

That has not been shown, or even argued for.

On the other hand, a falsification of claim H in science, be it deductive or statistical, takes place by being able to produce results that could not have been brought about were H correct (or approximately correct) in the respect tested. All testing turns on being able to characterize the capabilities of methods (e.g., their ability to have discerned a flaw if present). In the land of formal testing, this characterization is by means of probabilistic properties of the methods; they're given by the relevant sampling distribution. That's what you directly need for reliable inference. So you should not assume a posterior probability (of any sort you can obtain) is "direct", whereas an error probability is "indirect", unless you're prepared to justify this. I've never seen it done. Even where one might speak of assigning a probability to a hypothesis, I deny we want highly probable ones, as they would be the least informative.

3. I’d agree but with a caveat. The scientific benchmark for what is discrepant should not change. We can use statistical tools to assess if that benchmark is achieved or not, but we should not be using statistical tools to set the benchmark, which is what hypothesis testing effectively does. The problem is that the scientific force of the discrepancy changes (it depends on standard errors), so we can end up with statistically significant results that are not actually scientifically discrepant. Personally, this is why I prefer other approaches (e.g., likelihood, Bayesian) that respect the original scale of the data upfront.

5. This is utterly unfair to p-values. Yes, there are misinterpretations, but that applies to most measures. The p-value is one way to quantify a statistical statement, and it helps in decision making (though no decision should be taken on p-values alone). Name other quantitative measures, and then evaluate the relative benefits!

1. This was not an attempt to be fair but rather was intended to catalog the many serious problems with p-values. Future articles will deal with Bayesian approaches and the fact that posterior probabilities are much less likely to be misinterpreted. And as I said at the start of the article, other paradigms don't have to be perfect to be much, much better than p-values.

6. I like the catalog of problems. This will help teach my collaborators (and biostatistics students) about problems and misinterpretation of p-values. I can think of two additional issues.
1. Many physicians and biologists mistakenly treat the 0.05 threshold as a "forced decision": the null hypothesis is true or it is false, and they act accordingly. I believe Neyman-Pearson also advocated "acting as though" the null or alternative hypothesis were true or false. I think the forced dichotomy, while necessary for decision problems (and for misplaced psychological comfort), is profoundly unscientific in inference. We need space to accommodate uncertainty in inferential results.

BTW, I think any threshold-based inference procedure will suffer the same plight.

2. p-values are a nonlinear transformation of test statistics, and are not reproducible in repeated studies (even when the null hypothesis is false).

1. p-values are often not reproducible because they confound the effect size and the level of precision. Also, two equal p-values don't imply the same amount of "evidence", so it's not clear why one would care about or expect them to replicate. The thing to replicate is the effect size, not the p-value.
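A sketch of the replication point: with a real effect and roughly 50% power (an assumed, illustrative design), an exact replication of a "significant" study fails to reach significance about half the time:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# Pairs of identical two-arm studies with a real but modest effect,
# sized so that power is roughly 50% (an assumption for illustration)
n, effect, reps = 32, 0.5, 2000
flip, sig_first = 0, 0
for _ in range(reps):
    p1 = ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue
    p2 = ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue
    if p1 < 0.05:
        sig_first += 1
        flip += p2 >= 0.05   # exact replication fails to reach "significance"

print(flip / sig_first)  # roughly half of "significant" findings don't replicate
```

Nothing went wrong in these replications; the p-value simply is not a quantity one should expect to reproduce.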

7. No, N-P did not advocate that you "act as if a hypothesis is true/false" when you accept/reject it. This is another huge blunder in depicting N-P by people who have never read N-P, which is essentially almost everyone who writes about them. On my blog you will find several posts on N and N-P which also link to articles, so you won't have to continue to repeat false claims about them. Under the behavioristic construal, where accept/reject are "acts", the acts can be anything, e.g., infer there's some evidence for a discrepancy from H0, check your assumptions, withdraw the paper, sell a bag of bolts. One could never identify these "acts" with taking a standpoint on the truth of a claim. When in the behavioristic land, there's no such thing. It has its problems, but it can't have the one you mention.
Please disregard almost everything you've read on N-P by critics. Neyman has some extremely accessible non-technical papers I can point you to on my blog. See also my discussion in Error and the Growth of Experimental Knowledge (Mayo 1996), which can be found in copy form on my publications page off my blog. Good luck.

8. Interesting comments Dean and Deborah. I can see both sides because so many researchers fail to deeply understand the issues and the net result is all kinds of strange behavior. In my opinion an approach that creates endless debate and requires continual deep thinking has its problems. I am drawn to the Bayesian "what should I believe now".

1. Lacking time to complete my comment earlier, let me add this query: please show me where the Bayesian account tells you what "you should believe now". That would presumably mean what you are warranted in believing, what has passed muster on grounds of evidence, yes? Else it's just saying: given that you subjectively believe so and so, and given the assumed data and likelihoods, you should believe such and such, assuming you update beliefs by Bayes' theorem (which many Bayesians do not). How is that relevant for finding out what you ought to believe in the sense of what is the case? The belief priors are often elicited through betting assessments and whatnot. Not only is this a bad idea in all but "personal decision" settings, it's found so impossibly difficult to carry out that nearly all Bayesians appeal instead to default priors of some sort. But default priors may not even be beliefs; they may not even be probabilities (often being improper). So how do you get the desired "what you ought to believe"? It seems there's a gentleman's agreement not to really call out whether there's substance to promises to tell you what you ought to believe.

The account that really requires extraordinarily deep thinking is the subjective Bayesian one. A person has to think very, very carefully about all possible outcomes in order to arrive at an appropriate prior that they can live with. By the time she gets started, we've got a new model, with various bounds on parameters that do not go away.

But there are few true-blue subjective Bayesians now (Kadane is one). Ask the others what they mean by their posteriors, and which method they endorse. They won't agree. Papering over the very real difficulty in figuring out what's even being talked about in Bayesian country does not mean anyone really knows. (Empirical Bayes is different.)

2. The mandatory hypocrisy in the N-P view that "accepting a hypothesis does not imply that you believe in it; it only means that you act as if it were true" [Gigerenzer] certainly isn't very attractive.

3. There's no hypocrisy, and N-P do not tell you to act as if H were true. So I don't get your criticism.
The output is a specific inference, and declaring it is, or can be seen as, an "act". There are two reasons they ever spoke of "acts" or decisions rather than inductive inferences, even though in practice they speak of inferences (qualified of course by the error probabilistic warrant): (1) to contrast with the going tendency at the time (and still) to regard inductive inference as a form of probabilism. (Fisher's confusion about fiducial probability at around the time was also part of reason (1).) (2) There are many who hold that the only kind of logic there is is deductive, e.g., Neyman, Popperians. Now N-P methods deliver outputs that go beyond their premises. (This is unlike that form of Bayesian who simply reports the posterior given the prior, or a comparison of likelihoods.) That is, N-P methods are "ampliative" or inductive in the strict sense. N-P methods do not, however, speak of induction as probabilism or as updating degrees of belief. So they needed a new term. We have inductive (ampliative) outputs, and it's common to call them acts or decisions. So Neyman said one day, if there's anything we inductively change with evidence it's our behavior. So the behavioristic interpretation was born. E. Pearson wasn't keen on it.
Fisher was quite right in telling Neyman that he (Neyman) was merely expressing a preferred way of talking (Fisher 1955). Indeed, and then Neyman showed Fisher he talked that way as well. So did Popper. An output might be "state theta is within [a,b], along with the confidence level". Stating is an act. This rigidity, Neyman made clear, was to distinguish what he was doing from the use of probability to represent psychological strength of some sort (which Fisher abhorred as well).
Accepting a claim as warranted because it has passed stringent tests is very different from strongly believing it. N-P-F were right.

4. Okay - and it's easy to see that there's no more hypocrisy in the N-P robot's behaviors than there is in the Jaynes robot's beliefs - but that doesn't make the emulation of its behavior any more attractive. And its makers' ideas about the proper interpretation and use of probability are profoundly ill-conceived in my view. As Popper discovered, abhorring subjectivity eventually leads to a wholly unnecessary* confrontation with common sense.

* "The EPR experiment pin-points the need for subjectivity in quantum probability; the same need in classical probability has been known and used since Bayes." https://web.archive.org/web/20151117174141/http://www.mth.kcl.ac.uk/~streater/EPR.html

5. Silly to refer to the EPR experiment as indicating anything about the role of probability in science, but no time to say more than: read my new book (Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, CUP) when it's out this year. It has a chapter on what it is about objectivity that we won't give up and shouldn't. While I'm at it, here's a paper I wrote with David Cox that has objectivity in its title:
Cox, D. R. and Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference," in Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science. Cambridge: Cambridge University Press: 276-304. http://www.phil.vt.edu/dmayo/personal_website/ch%207%20cox%20&%20mayo.pdf

6. The EPR experiment illustrates the necessity of a subjective interpretation of probability in science. It, and the probability theory of noncommuting observables more generally, indicates that there is a hard limit to the role of other interpretations in science. What on earth is silly about that? Given its title it's hardly likely that there's anything in your book (and there isn't in that paper) which even addresses it, let alone shows it to be silly.

9. It doesn't require deep thinking to simply do it right. It's exactly the reasoning you'd use every day if you didn't want to be fooled. The confusion has been caused only by criticisms from those who have assumed we had to have a posterior (never mind what in the world it means, and no one has been able to say), and by questionable research practices, cherry-picking and the like (which don't go away but are masked under Bayesianism). If you're saying, gee, with Bayesian inference I don't have to think, then you're better off not using statistics.
That's the opposite attitude that one should have in science.

1. The first sentence is a little ironic, no? Many people have thought long and hard about these issues and we’ve been debating them for over a century. And there are plenty of examples that don’t assume a Bayesian solution that make hypothesis tests look downright insane (Pratt’s voltage detector example, in 1961 I think, is one).

10. A bit harsh but I get some of your points. I remain unswayed about posterior probabilities masking the problem (or at least masking it as much as other methods do) but reproducible research, cherry picking, and the rest are extremely important. Posterior probabilities will be misleading primarily when the Bayesian got the model wrong, e.g., failed to take into account a source of uncertainty or used a prior that disagreed with the prior used by the judge of the research. If one's premise is that extraordinary statements require extraordinary evidence, especially where the claim comes from cherry picking, this would need to be reflected in a skeptical prior for the posterior to help.

1. It's odd to see science as trying to match the beliefs of the judge of research; science is about finding things out.

2. I wish that were true more often. I sometimes work in a regulatory environment where tough decisions (e.g., about whether a drug should appear on the market) have to be made and the judge is not the researcher but regulatory authorities and the experts they call upon.

11. Sorry, I guess I was going too fast because I'm rushing to finish a book (on these very issues). These issues cannot be discussed in a quick comment; I have 5.5 years of blogging and many years of published writings. Your applications may be those with clear priors (subjective? default? frequentist?). If you say you test your model assumptions, and if you ever falsify, I say you need statistical tests. Cherry-picking and other biasing selection effects may not change your prior, because you believe the hypothesis in question; what they change, for the error statistician, is how well you've tested your claim. These are very different assessments; I'm interested in the latter. It doesn't follow that there's no place for subjective beliefs.

1. Nothing about this process would tempt me to use a statistical test (other than what usually happens: I run out of time to do a better analysis). And to me one of the best Bayesian approaches is to use a skeptical prior so that a skeptic can be convinced.

2. Note, by the way, Senn remarking that even skeptical priors don't lead to the lump-and-smear priors used so often these days by Bayes Factor folk. https://errorstatistics.com/2015/05/09/stephen-senn-double-jeopardy-judge-jeffreys-upholds-the-law-guest-post/

If you don't test your assumptions, then what you purport to be offering your "judge" is unlikely to be what you're really offering.

3. I'm not fond of Bayes factors or lump-and-smear priors. I think you've hit the key issue: getting the model right. The frequentist solution is ultimately disappointing because most frequentists make it a hard choice and ignore model uncertainty, resulting in standard errors that are too small and confidence coverage errors exceeding the nominal level. Bayesians' models fit no worse, and when they entertain non-fit, as Box did so eloquently, they allow for the uncertainty between, say, two model choices. E.g., there is a prior for non-normality, resulting in somewhat elongated credible intervals that are more correct. Another avenue worth exploring is testing assumptions for 'comfort' but not changing the model, so as to avoid model uncertainty effects. On a separate issue, I've found that all statisticians are a bit arbitrary as to what constitutes goodness of fit; late in the day everything seems to fit OK.

4. Tests of model assumptions are all frequentist, to my knowledge. That's why Box declared ecumenism. He said that if we had to rely on assigning priors to the different models we'd never discover anything new in life. Discovery, Box said, required statistical significance tests (I can send you the quote if you want). When Bayesians do model checking, they appear to all turn to Bayesian p-values (e.g., Gelman, J. Berger) of one sort or another. I'm not claiming to understand them, really. But there are deeper issues: what are you getting "wrong" when you find your Bayesian model wrong or inadequate? Lindley says he doesn't know what wrong means. The default Bayesian says the prior is just an undefined means for getting a posterior (and there are many different schools). Does it mean it gives an incorrect representation of some specific aspect of the data generation method or question being modelled? The answer would be yes for a frequentist error statistician, but what is it for a Bayesian? Is it getting your belief wrong, or is it wrong about your belief in the data generating mechanism? Or does it merely lead to a frequentist prediction that is rejectable by a significance test or graphical analysis? How do you distinguish an error due to the priors you assigned the parameters from one due to the model itself? You may say it's all part of the model, but the fact is, you can nip and tuck the prior when the real problem is some violation of the statistical model assumptions.
Finally, you'd typically have only one (non-random) selection from the universe of prior distributions for theta. Of course, if you had a frequentist prior that would be different.

5. Great food for thought. The little I've seen of Bayesian posterior or predictive model checking is very impressive: more comprehensive model assessment, incorporating sources of uncertainty, than is available in the frequentist setting. One example: in developing risk models we are usually content with showing good calibration of the point estimates of per-patient risks, whereas Bayesian model assessment involves the entire posterior distribution of each prediction. I need to get more experience with that way of thinking.

12. What then is your favored Bayes way, a default?
What's the posterior distribution of each prediction? Gelman sees posterior checks as error-statistical http://www.stat.columbia.edu/~gelman/research/published/philosophy.pdf

13. Thanks for the reference - look forward to reading. In terms of favored Bayesian way I am a disciple of David Spiegelhalter, heavily influenced by his problem-solving approach as nicely described in his 1994 JRSS A paper which is focused on clinical trials and highly relevant to regulatory decision making.

1. Hi Frank,
Are you referring to the "Bayesian Approaches to Randomized Trials" paper (JRSS A, 157, part 3, 357-416)? In section 6.3 Spiegelhalter recommends using Box's generalized p-value to check prior-data compatibility. If I have a point prior (null), would that not simplify to Fisher's p-value?

2. That's the paper. I haven't looked into the Box approach. I would use a prior that is eventually overwhelmed by data, get agreement on it, and not often revisit the prior.

14. Hi Frank,

Interesting post. Do you see likelihood methods as subject to A, B and/or G?

Meaning e.g. Edwards-style likelihood (as in his book 'Likelihood') or Fisherian-style likelihood (as in e.g. Pawitan's book 'In All Likelihood').

1. So I'll jump in as a Likelihoodist. Likelihood methods for measuring statistical evidence have a very precise framework. The basic axiom is this: the hypothesis that does a better job of predicting the observed data is better supported by the data.

For A, I don't see the "reverse" conditioning here as problematic; this is the natural way to compare predictions.

For B, no issues here for the likelihood approach. The predictions from each hypothesis are directly compared, and we just report which hypotheses did the better job of predicting the observed events. There is no need to rely on proof by contradiction, which Frank correctly points out is problematic when direct implication is replaced by probabilistic tendency.

For G: by design, likelihood methods report which hypotheses are better supported than others given the observed data and model. You need a model under which to specify the predictions of each hypothesis, and that model is often prescribed by context. But I would guess this is not what Frank is referring to, if only because everyone generally agrees on the form of the likelihood function. Additional information, such as data from a previous study, would simply be combined in the likelihood (e.g., if the studies are independent, one could multiply the likelihoods). Information that represents personal belief or some other hunch would best be incorporated in the Bayesian framework. Likelihoodists want to know 'what the data themselves say', not 'what the data say after I add in prior information'.
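A minimal sketch of the likelihoodist's axiom, using hypothetical binomial data: the likelihood ratio directly reports which of two simple hypotheses better predicted what was observed:

```python
from scipy.stats import binom

# Hypothetical data: 7 successes in 10 trials
k, n = 7, 10

# How well did each simple hypothesis predict the observed data?
like_h1 = binom.pmf(k, n, 0.7)   # H1: success probability 0.7
like_h0 = binom.pmf(k, n, 0.5)   # H0: success probability 0.5

lr = like_h1 / like_h0
print(lr)  # ~2.28: the data support H1 over H0, but only mildly
```

Note the contrast with a p-value: nothing here is conditioned on a hypothesis being true in isolation, and evidence can point toward either hypothesis.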

2. I'm glad we can agree on A. I don't think this is a satisfactory argument against p-values, and neither is it satisfactory against likelihood.

We can leave the other argument for another day. For now though I'll note that while I have quite a lot of sympathy for the 'pure' likelihood approach and/or evidential approaches, I don't find your axiom satisfactory.

There is of course a basic axiom behind p-values - Fisher's disjunction. But I don't find this fully satisfactory either.

15. I believe this would be Edwards-Royall. I think that likelihood has a bit of a problem with G but not with A or B.

1. Would you then be OK with inference based on a p-value function defined analogously to a likelihood function? That is,

PF(theta; y0) := Pr(Y > y0; theta)

considered as a function of theta for y0 fixed. Is this still subject to A or B?

(omaclaren)

2. That's not satisfying. High-level view: Aside from non-study information, the p-value is monotonically related to what I need, but it is not calibrated to be the metric I need.

3. Fair enough.

But would you agree it is no more or less subject to A and B than likelihood? Or do you also disagree with this?

4. Not sure. Are you trying to get at Fisher's fiducial method? I prefer having a full model for inference, but I do like the likelihood school of inference because it respects the likelihood principle. For example, inference is independent of a stopping rule. How do you handle multiplicity/sequential stopping rules for a p-value function?

5. Well, it is related to Fisher's fiducial approach - see eg the reminiscences of Fraser at the end here http://www.utstat.toronto.edu/dfraser/documents/ARST04-Fraser-copyedited.pdf - but more generally it is just standard Fisherian-style confidence theory as used by eg Cox, Fraser etc, as opposed to Neymanian confidence and decisions.

I assume the p-value function will depend on stopping rules etc as generally understood - ie they violate the strong likelihood principle.

But, my more general point - rather than advocating for likelihood, Bayes or confidence theory - is that points A and B seem to either (logically) apply to both likelihood and confidence theory/pvalue functions or to neither.

6. Not sure. I'm going to invite a likelihoodist to join the conversation.

7. I did above.

8. And... I would not be 'ok' with “inference based on the p-value function”, largely because there is no axiom to support its use. The axiom would be something like “A set of data supports the hypotheses that do a better job of predicting the data and data more extreme”. I don’t find this compelling because of the inclusion of “more extreme”, which means different things to different people. The NP hypothesis testing framework is clear that “more extreme” means large likelihood ratios. In contrast, significance testing often defines “more extreme” as further away in hypothesis space (tail areas). These two definitions are not the same, which confuses matters. Large likelihood ratios can lead to instances where the rejection regions are near the null hypothesis and not in the tails (e.g., comparing means of two normal models with different variances).

16. Anyone who ignores the stopping rule can erroneously declare significance with probability 1, and with the corresponding Bayesian priors can erroneously leave the true parameter value out of the Bayesian HPD interval. See stopping rules on my blog.
Anyone who obeys the LP cannot use error probabilities. I think only subjective Bayesians and staunch likelihoodists are still prepared to endorse that. Just what today's practice needs. For that matter, let replicationists try and try again until they can get a stat sig effect. Then the replication rate will be 100%.

I have, incidentally, disproved alleged proofs of the LP.

1. When the analyst uses the same prior as the judge, posterior probabilities are perfectly calibrated independent of the stopping rule.

2. Overly broad generalizations can lead to confusion. So let’s be specific and consider the points. The first clause is true for routine usage of classical hypothesis and significance tests when the sample size is allowed to grow forever. I have yet to see a study where this was possible, so the force of this point is diminished, I think. If we take the case where the sample size is finite, the claim is false. The probability might be high in some cases, but it will not be one, and certain cases can be constructed so the probability is not so high. Regardless, this is an excellent reason to avoid using p-values in my opinion. Why not use a tool that does not have this property (e.g., a likelihood ratio)?

3. I think it is important to understand why this happens. When the null hypothesis is true, the likelihood masses right on top of the null value. Every now and then, the tails of the likelihood shift by an infinitesimal amount, and this tiny shift causes the classical hypothesis test to reject because the benchmark for rejecting is measured in standard errors (which are rapidly shrinking to zero) and not standard deviations. It leaves us in the odd position of claiming to reject the null hypothesis when the data support hypotheses that are arbitrarily close to the null. If instead we decided to only count the statistical rejections that also supported clinically meaningful differences, the resulting probability would not approach one or even be all that high.

4. The third sentence is a nice example of a common misinterpretation of the LP. The LP only says that the stopping rule is irrelevant for the measurement of the strength of evidence in the data. It does not say the stopping rule is irrelevant for everything. Confusion reigns because we often fail to distinguish between the measure of the strength of evidence and the probability that the evidence will be misleading.

Likelihoodists follow the LP and use error probabilities. When we design a study, we compute its frequency properties. If we plan to look at the data many times, the probability of observing misleading evidence gets inflated (luckily for us, that inflation is bounded above). These frequency properties help us choose between designs. Modern Bayesians do the same.

5. The key is that once the data are observed, we compute the likelihood ratio to measure the evidence. This, of course, does not depend on the stopping rule. And if we are concerned that the data are likely to be misleading because we looked a lot during our study, we would compute the probability that our observed data are misleading. The point is that the measure of the evidence (the LR) and the probability that the observed data are misleading (written as P(H_0 | LR > k)) are not the same as the probability that the study design will generate misleading evidence (written as P(LR > k | H_0)). Error probabilities have an important role in statistics; they just don’t represent the strength of the evidence in the data or the probability that the observed data are misleading.

6. No time but to register: "absolutely absurd", though grist for my mills if Blume is for real. See my blog for details: errorstatistics.com

7. Those mills are going to be running overtime. I simulated a string of 100,000 standard normal deviates and then computed the running z-statistic for testing the null hypothesis is zero (assuming the variance is known). In 10,000 simulations, only 74% rejected at some point (not bad for 100,000 looks at the data). The 25th, 50th, and 75th quartiles of the stopping time were 10, 98, 1533. The mean stopping time was ~5200. That means, for example, that 25% of the rejections occurred when the sample mean was around 0.05 (=1.96/sqrt(1533)). That’s 5% of a standard deviation. In practice, an observed difference that small is often a rounding error.

The point is not that these are desirable properties, but rather that when these tests make a Type I Error, they often support hypotheses very close to the null. So close, in fact, that they may be practically indistinguishable from the null. If we only counted the rejections where the difference was at least 25% of a standard deviation, the rejection probability drops to about 20%. Not great, but not terrible for 100,000 looks at the data.

8. As for details on the evidential framework I alluded to above, a reference is: Blume JD. Likelihood and its evidential framework. In: Dov M. Gabbay and John Woods, editors, Handbook of The Philosophy of Science: Philosophy of Statistics. San Diego: North Holland, 2011, pp. 493-511.

Any system purporting to measure evidence for or against a hypothesis ought to be subjected to the same scrutiny. This entails identifying three things: (1) the metric that will be used for measuring evidence, (2) the probability of observing a misleading metric under certain experimental conditions (i.e., error probabilities), and (3) the probability that an observed metric is indeed misleading (e.g., this would be a false discovery rate). Systems for measuring evidence can then be compared on the basis of these three criteria and perhaps their axiomatic justification. These three concepts are distinct, so a single mathematical quantity like the tail area probability can’t possibly represent all three. The above paper uses the likelihood paradigm to illustrate the point.

9. For a disproof of the alleged disproof of the alleged proof of the likelihood principle see http://gandenberger.org/research/

Gandenberger G. “A New Proof of the Likelihood Principle.” The British Journal for the Philosophy of Science 66, 3 (2015): 475-503.

17. Anil Potti and Joseph Nevins used Bayesian methods recommended by Mike West during the Duke genomics fiasco a few years back. They overfitted models and committed many other analysis gaffes. Bayesian modelers are not immune from running afoul of proper interpretation of modeling findings to shed light on scientific phenomena.

Proper handling of statistics and interpretation of findings is needed in any statistical exercise, Bayesian, frequentist or otherwise.

Given the wealth of great advice in Regression Modeling Strategies and other writings, I am taken aback to see Frank Harrell blame a statistic when the problem is people misinterpreting a statistic.

A p-value is just a statistic, with certain knowable distributional properties under this and that condition. A Bayesian Highest Posterior Density region can be improperly obtained and misinterpreted just as readily.

1. Ask yourself how many investigative journalism articles mentioned Bayes when writing about the Duke fiasco. I think the answer is zero, because Bayesian modeling had nothing at all to do with the problem. Your comments are most curious. I can't think of any analytical method that cannot be abused, even just using descriptive statistics. You might also look at why Mike West resigned from the collaboration early on.

2. Following McKinney on this, it was their confidence that their prior freed them from having to have genuine hold-out data that they thought made it permissible. McKinney knows more about the specifics. The blatant ignoring of unwelcome results was brought out by the whistleblowers. Of course, if you're not in the business of ensuring error control, actions that alter error probabilities will be irrelevant.

3. This is exactly my point. Whether you use Bayesian methods, as the Duke group did, or you use frequentist methods, as Baggerly and Coombes did in reviewing several of the Duke analyses, you need to exercise discipline, using many ideas some of which are very ably described in Regression Modeling Strategies.

If you want to start using more Bayesian based methods, have at it. I look forward to seeing your examples.

What I find disingenuous and concerning are your statements “Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.” Where do you show this quantification, that Bayesian and likelihood paradigms solve the greatest number of real problems and have the fewest faults? That certainly wasn’t apparent in the Duke fiasco.

I found some course notes online, “An Introduction to Bayesian Methods with Clinical Applications” from July 8, 1998, by Frank Harrell and Mario Peruggia. That’s nearly 20 years ago. Why does it take so long to bring Bayesian methods into practice? If you haven’t been able to do it in nearly 20 years, how are other mere mortals to do it?

18. This comment has been removed by the author.

19. The statistical paradigm had almost nothing to do with this. Ask the forensic biostatisticians Baggerly and Coombes, who uncovered the whole problem. In their wonderful paper (Annals of Applied Statistics 2009), neither Bayes nor prior appears.

1. Precisely. This is why I do not understand this odd stance you have adopted. Bayesian methods require the same discipline in handling analyses and interpreting findings as any other paradigm.

When you and Gelman and others lead the way in demonstrating Bayesian analytical methods, I predict other blog postings such as "The litany of problems with Bayes factors" after groups with no statistical discipline misinterpret and mangle findings from such analyses.

But for those of us still striving to find truth in data, I do genuinely look forward to seeing more Bayesian-based approaches in future revisions of Regression Modeling Strategies, along with the attendant steps needed to interpret findings in a disciplined manner.

20. I take offense at your choice of words. The fact that I have not been a Bayesian my whole career and hence do not have a plethora of Bayesian examples doesn't hold me back from seeking better approaches. You seem to be threatened by the amount of work we have ahead of us to improve the situation. I am not, which is one reason I'm working closely with FDA on this very problem. You can do unscientific, non-reproducible research using any paradigm. I want to do good science and have results that have sensible interpretations.

1. I apologize for my choice of words. I am not threatened by the amount of work we have ahead of us to improve the situation; my entire career has been spent working to improve the situation. That’s why I own two copies of your RMS book and regularly use your software. I am threatened when an avalanche of pop-culture blog posts appears, inappropriately attributing fault to a statistic or paradigm when the fault lies with people mishandling and misinterpreting that statistic or paradigm.

As you say, you can do unscientific, non-reproducible research using any paradigm and the statistics thereof. As you state in the introduction above, “As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.” I hope that the title and opening sentence, and further general discussion of this topic, progress and become

“A Litany of Problems With Misinterpretation of p-values”

“In my opinion, inappropriate handling and interpretation of null hypothesis testing and p-values have done significant harm to science.”

2. I don’t buy the argument that the user is the problem. These procedures have been in use for almost a century by many disciplines. Is the claim really that all the problems, counterexamples, and unintuitiveness that arise are due to user error? Frank’s list might seem daunting, and there are perhaps quibbles to be made, but the theme is on target: hypothesis and significance test procedures are flawed when it comes to measuring and communicating the strength of evidence in a given body of observations. I would not claim that they are useless, although a strong case can be made when only the p-value is reported, but rather that they have serious shortcomings that require attention.

21. I appreciate that Steven, and only take issue with your "find fault" sentence. True, more fault lies with unscientific work than with statistical paradigms, but there are major problems with p-values, and p-values lead to many downstream problems as I've tried to catalog. The paradigm really matters. Not all of the fault lies with practitioners. This becomes more clear for those like me (a follower of David Spiegelhalter) who embrace Bayesian posterior probabilities and favor skeptical priors. Once you do the right simulations or grasp the theory (the former being easier for me) you'll see things such as the fact that multiplicity comes from the chances you give data to be more extreme, not the chances you give assertions to be true. And the fact that frequentist thinking leads usually to fixed sample size designs turns out to be a huge issue in experimental work.

You're giving me the idea to post a separate article on my journey and how this relates to RMS, which I hope to complete in the next few days.

22. Of course it comes from increasing the chances you give the data to be more extreme, but the relevance of such outcomes that didn't occur is just what's denied by Bayesians who endorse the Likelihood Principle. Sequential trials were advocated by frequentists long ago (Armitage), who also argued that optional stopping results in posteriors being wrong with high probability. But Savage switched to a simple point-versus-point hypothesis to defend the LP. That is still going on today, as the latest reforms champion the irrelevance of optional stopping––to them.

Aside from the inability to control error probabilities, the key problem with all Bayesian accounts is that they never quite tell us what they're talking about (and they certainly don't agree with each other), except perhaps empirical Bayesians. Is the prior/posterior an expression of degree of belief in various values of parameters? Of how frequently they occur in universes of parameters? As the number of parameters increases, the assessments–generally default priors–move further and further from anything we can get a grip on. They are not representing background information. And if we're going to test them, we'll have to resort to something like significance tests.
But given what you said earlier, about only caring to match the beliefs of "the judge", the whole business of outsiders critically appraising you may not matter.

1. You'll have a hard time convincing me of the relevance of things that might have happened that didn't. And simulation studies for the frameworks I envision demonstrate that the stopping rule is really irrelevant. The simplest example is a one-sample normal problem with known variance where one tests after each observation is acquired, resulting in n tests for an ultimate sample size n. If you are thinking of a particular issue you might sketch the flow of a simulation that would demonstrate it.
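A sketch of the kind of simulation just described, under illustrative assumptions (known sigma = 1, theta drawn from the same N(0, 1) prior the analyst uses, a look after every observation up to a cap of 400, stopping as soon as P(theta > 0 | data) exceeds 0.95):

```python
import numpy as np
from math import erf, sqrt

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

rng = np.random.default_rng(1)
n_sims, n_max, sigma, tau = 2000, 400, 1.0, 1.0  # illustrative settings

selected = []  # truth of 'theta > 0' for runs that stopped early
for _ in range(n_sims):
    theta = rng.normal(0, tau)   # nature draws theta from the agreed-on prior
    s = 0.0
    for n in range(1, n_max + 1):
        s += rng.normal(theta, sigma)
        v = 1 / (1 / tau**2 + n / sigma**2)  # posterior variance (known sigma)
        mu = v * s / sigma**2                # posterior mean
        if Phi(mu / sqrt(v)) > 0.95:         # look after every observation
            selected.append(theta > 0)
            break

# Despite the aggressive optional stopping, among runs that stopped because
# P(theta > 0 | data) exceeded 0.95, theta really is positive about 95% of
# the time: the posterior probabilities remain calibrated.
print(f"stopped runs: {len(selected)}, frequency theta > 0: {np.mean(selected):.3f}")
```

The calibration holds precisely because theta is drawn from the same prior the analyst uses, matching the "same prior as the judge" condition stated earlier in the thread; with a mismatched prior no such guarantee is claimed.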

2. Re the concern about error rates outside of formal NP theory: I think this is overblown. Just because NP theory is not used does not imply that all resulting inferences are suspect. Likelihood methods are an excellent example. In the likelihood paradigm, both the Type I and Type II error rates go to zero. In fact, if Neyman and Pearson had chosen to minimize the average error rate (instead of holding one constant), they would have been likelihoodists, since that solution is given by the Law of Likelihood. From here, one could make a strong argument that likelihoodists have better frequency properties than those afforded by hypothesis testing, solely because it makes little sense to hold the Type I error fixed over the sample size. In many cases, this is what causes hypothesis tests to go awry. Bayesian analyses benefit from this behavior as long as the prior does not change too quickly and is relatively smooth.

3. Just a quick point on something alluded to above. Much of the discordance between the three schools of inference comes from how composite hypotheses are handled. For what it is worth, Neyman and Pearson acknowledged in their 1933 paper that there is no general solution to this problem. Their approach was to take the best supported alternative hypothesis in the alternative space and let that data-chosen alternative represent the alternative hypothesis when they applied their optimal solution for the simple-versus-simple case. It is a bit of a cheat, which they acknowledged, but a decent practical solution. The problem with this solution is that it can break down when the null hypothesis is true, because in that case the best supported alternative ends up being virtually identical to the null, but is still regarded as an “alternative”. The Bayesian solution, of course, was to average over the alternative space to come up with a new simple hypothesis. Then the two simple hypotheses are compared. This has pros and cons too. So Savage is right on the mark here.

23. If I can't interest you in learning that someone tried and tried again to achieve a stat sig result (or a HPD interval excluding the true value), even though with high or max probability this can be achieved erroneously, then your view of "being cheated" and mine are very different. But I'm glad you stick with this, it's the Bayesians who try to wrangle out of the consequences of accepting the LP that bother me. Please see this blog post:
https://errorstatistics.com/2014/04/05/who-is-allowed-to-cheat-i-j-good-and-that-after-dinner-comedy-hour-2/

1. The solution here is to be specific about what is communicated. If we were to only report the result of the hypothesis test (Reject or Accept), then the false discovery rates would indeed be direct functions of the Type I and Type II error rates. That is, if you only tell me that you rejected the null, that information is more likely misleading if you used a design with a large Type I error rate. However, if you report the data, or some summary of them, then the above argument no longer holds. The probability that the null hypothesis is true given the data (this is now the false discovery rate if the test rejects) does not depend on the Type I and Type II errors. Why? Because here the likelihood function for the observations depends on the data and the model (and not the sample space), whereas in the first example the likelihood function is for the test result (not the data), and that likelihood depends on a binomial model where the error rates determine the likelihood function. So, really, it’s all about the likelihood.

24. Very well written article and interesting history. I think that consideration of what constitutes cheating is a very useful exercise. It is also useful to back-construct a prior that makes certain things possible or likely. For example, sampling to a foregone conclusion happens when a statistician uses a smooth prior but her critic uses a prior with an absorbing state (point mass) at the null. Bonferroni correction is equivalent to a prior that specifies that the probability that all null hypotheses are true is the same no matter how many hypotheses are tested (a very strange assumption). A Bayesian can cheat by changing the prior after observing data or by improper conditioning, e.g., acquiring more data, finding the cumulative result to be less impressive than it was before, and rolling back the data to only analyze the smaller sample. But choosing a smooth prior before looking at the data, and having the prior at least as skeptical as the critic's implied prior, will result in a stream of posterior probabilities that are well calibrated independent of how aggressive the 'data looks' were. Not only are the posterior probabilities calibrated, but the posterior mean is perfectly calibrated, discounted by the prior more when stopping is very early. The frequentist correction for bias in the sample mean upon early stopping is quite complex. Frequentists tend to be very good at correcting p-values for multiplicity but very bad at correcting point estimates for the same.

25. First off, thanks.
The frequentist doesn't assign priors to the hypotheses you mention; it is a fallacy to suppose that a match between numbers (error probabilities and posterior) means the frequentist makes those prior assignments.
But on cheating, I don't see how you can say "a Bayesian can cheat by changing the prior after observing data". You need a notion of cheating. Error statisticians have one, what's the Bayesian's? Bayesians, by and large, aren't troubled by changing their prior post data as you can see from this post:
"Can you change your Bayesian prior?"
https://errorstatistics.com/2015/06/18/can-you-change-your-bayesian-prior-i/
I think only subjective Bayesians may say no, but even Dawid says yes. I'd like to know what you think.

26. I didn't mean to imply that frequentists use priors. But sometimes you can solve for the prior that is consistent with how they operate. Regarding changing the prior, I'm more skeptical that that is OK, but I am influenced by working in a regulatory environment where pre-specification is all-important.

27. No, Greg doesn't even purport to disprove me, nor can he. As a logic professor, I'm clear on the logical mistake that led people to think Birnbaum had proved the LP--though it's quite subtle, and took a long time to explain (not to spot). My disproof is even deeper than Evans's disproof because, as I explain in my paper, it requires more than a mere counterexample.

28. I posted because I think it is good to see alternative viewpoints and because I think it illustrates an important issue: The class of evidence functions being considered must be large enough to include functions that depend on the sample space and those that do not depend on the sample space. Otherwise the argument is effectively tautological.

29. One interesting point is that likelihoodists don’t really care about the proof of the LP from the CP or SP. This is because the LP is implied by the Law of Likelihood. Likewise if one defines the measure of the strength of evidence to be a Bayes factor, posterior probability, or some distance between the two hypothesis. However, if the measure of the strength of evidence is defined to be a probability or some other metric that depends on the sample space, then the LP will not apply. It all boils down to the fundamental building blocks: (1) what is the measure of the strength of evidence, (2) what is the probability that a study will generate misleading evidence, and (3) what is the probability that an observed measure is misleading. This is how systems for measuring evidence ought to be evaluated and compared.