With the many problems that p-values have, and the temptation to “bless” research when the p-value falls below an arbitrary threshold such as 0.05 or 0.005, researchers using p-values should at least be fully aware of what they are getting. They need to know exactly what a p-value means and what assumptions are required for it to have that meaning. A p-value is the probability of getting, in another study, a test statistic more extreme than the one obtained in your study, if a series of assumptions holds.
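That definition can be made concrete by simulation. The following is a minimal sketch (not from the original post; the normal one-sample setup and all numbers are illustrative assumptions): it simulates the null hypothesis directly and computes the p-value as the fraction of replicate studies whose test statistic is at least as extreme as the observed one.

```python
# Illustrative sketch: the p-value as a tail probability under H0,
# computed by simulating replicate studies that satisfy the assumptions
# (independent, normal draws with mean 0 -- the "series of assumptions").
import random
import statistics

random.seed(1)

def t_stat(sample):
    # one-sample t statistic against mu = 0
    n = len(sample)
    return statistics.mean(sample) / (statistics.stdev(sample) / n ** 0.5)

# "Your study": n = 20 draws from a population with a true mean of 0.5
observed = [random.gauss(0.5, 1) for _ in range(20)]
t_obs = t_stat(observed)

# Under H0, how often does another study's statistic exceed |t_obs|?
reps = 10_000
extreme = sum(
    abs(t_stat([random.gauss(0, 1) for _ in range(20)])) >= abs(t_obs)
    for _ in range(reps)
)
p_value = extreme / reps
print(round(p_value, 4))
```

Note that every ingredient of the p-value refers to hypothetical other studies generated under H0, not to the probability that H0 itself is true.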
What clinicians learn from clinical practice, unless they routinely do n-of-one studies, is based on comparisons of unlikes. Then they criticize like-vs-like comparisons from randomized trials for not being generalizable. This is made worse by not understanding that clinical trials are designed to estimate relative efficacy, and relative efficacy is surprisingly transportable. Many clinicians do not even track what happens to their patients to be able to inform their future patients.
Optimum decision making in the presence of uncertainty comes from probabilistic thinking. The relevant probabilities are of a predictive nature: P(the unknown given the known). Fixed thresholds are not helpful, because any optimal threshold depends entirely on the utility/cost/loss function.
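A small worked example, under assumed costs, shows why a threshold cannot be chosen independently of the loss function. For a binary act/don't-act decision with false-positive cost C_FP and false-negative cost C_FN, minimizing expected loss implies acting exactly when the predictive probability exceeds C_FP / (C_FP + C_FN); the cost values below are hypothetical.

```python
# Sketch: the action threshold on a predictive probability is determined
# entirely by the (assumed) loss function, not by any universal cutoff.
def expected_loss(p, act, cost_fp=1.0, cost_fn=5.0):
    # act=True: a false positive costs cost_fp, incurred with prob (1 - p)
    # act=False: a false negative costs cost_fn, incurred with prob p
    return (1 - p) * cost_fp if act else p * cost_fn

def best_action(p, cost_fp=1.0, cost_fn=5.0):
    # choose the action with the smaller expected loss
    return expected_loss(p, True, cost_fp, cost_fn) < \
           expected_loss(p, False, cost_fp, cost_fn)

# The implied threshold is cost_fp / (cost_fp + cost_fn):
threshold = 1.0 / (1.0 + 5.0)        # ~0.167 here, nowhere near 0.5
print(best_action(0.2), best_action(0.1))   # -> True False
```

Change the cost ratio and the threshold moves; the predictive probability itself never needs to change.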
Corollary: Since p-values are P(someone else’s data are more extreme than mine if H0 is true) and we don’t know whether H0 is true, they are non-predictive probabilities that are not useful for decision making.
Methods used to obtain unbiased estimates of future performance of statistical prediction models and classifiers include data splitting and resampling. The two most commonly used resampling methods are cross-validation and bootstrapping. To be as good as the bootstrap, about 100 repeats of 10-fold cross-validation are required.
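The following is an illustrative sketch of repeated 10-fold cross-validation (toy data and a trivial mean-threshold “model”, both assumptions of mine, not code from the referenced work). The point is the structure: a single 10-fold split is noisy, so the procedure reshuffles and re-splits many times and averages the per-fold estimates.

```python
# Sketch: repeated 10-fold cross-validation, averaging over all repeats
# to reduce the variance of any single 10-fold partition.
import random

random.seed(0)

# toy data: x predicts y = 1 when x is large, with noise
data = []
for _ in range(200):
    x = random.gauss(0, 1)
    data.append((x, 1 if x + random.gauss(0, 0.5) > 0 else 0))

def fit_threshold(train):
    # trivial stand-in "model": classify y=1 when x exceeds the training mean
    return sum(x for x, _ in train) / len(train)

def accuracy(model, test):
    return sum((x > model) == (y == 1) for x, y in test) / len(test)

def repeated_kfold_cv(data, k=10, repeats=100):
    scores = []
    for _ in range(repeats):
        idx = list(range(len(data)))
        random.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for fold in folds:
            fold_set = set(fold)
            test = [data[i] for i in fold]
            train = [data[i] for i in idx if i not in fold_set]
            scores.append(accuracy(fit_threshold(train), test))
    return sum(scores) / len(scores)   # average over repeats x folds

print(round(repeated_kfold_cv(data), 3))
```

With `repeats=100` and `k=10`, the estimate averages 1,000 fold-level evaluations, which is what it takes to approach the precision of the bootstrap.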
As discussed in more detail in Section 5.3 of Regression Modeling Strategies Course Notes and the same section of the RMS book, data splitting is an unstable method for validating models or classifiers, especially when the number of subjects is less than about 20,000 (fewer if signal:noise ratio is high).
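The instability of data splitting is easy to see by splitting the same dataset many times at random and watching the holdout estimate jump around. This is a toy demonstration under assumed data and a deliberately simple model, not code from the course notes.

```python
# Sketch: the same 100-observation dataset, split 50/50 at random 200
# times, yields a wide spread of holdout accuracy estimates.
import random
import statistics

random.seed(2)

data = []
for _ in range(100):
    x = random.gauss(0, 1)
    data.append((x, 1 if x + random.gauss(0, 1) > 0 else 0))

def holdout_accuracy(data, frac=0.5):
    d = data[:]
    random.shuffle(d)                 # a fresh random split each call
    cut = int(len(d) * frac)
    train, test = d[:cut], d[cut:]
    thr = sum(x for x, _ in train) / len(train)   # toy "model": mean threshold
    return sum((x > thr) == (y == 1) for x, y in test) / len(test)

estimates = [holdout_accuracy(data) for _ in range(200)]
print(round(min(estimates), 2), round(max(estimates), 2),
      round(statistics.stdev(estimates), 3))
```

At realistic sample sizes the spread between the luckiest and unluckiest split is large enough that any single split is essentially arbitrary, which is why resampling is preferred.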
There are many principles involved in the theory and practice of statistics, but here are the ones that guide my practice the most.
- Use methods grounded in theory or extensive simulation
- Understand uncertainty, and realize that the most honest approach to inference is a Bayesian model that takes into account what you don’t know (e.g., Are variances equal? Is the distribution normal? Should an interaction term be in the model?)
Suggestions for future articles are welcomed as comments to this entry. Some topics I intend to write about are listed below.
- Matching vs. covariate adjustment (see below from Arne Warnke)
- Statistical strategy for propensity score modeling and usage
- What is the full meaning of a posterior probability?
- Moving from pdf to html for statistical reporting
- Is machine learning statistics or computer science?
- Sample size calculation: Is it voodoo?
- Difference between Bayesian modeling and frequentist inference

A few weeks ago we had a small discussion at CrossValidated about the pros and cons of matching here.
It is important to distinguish prediction and classification. In many decision-making contexts, classification represents a premature decision, because classification combines prediction and decision making and usurps the decision maker’s role of specifying the costs of wrong decisions. The classification rule must be reformulated if costs/utilities or sampling criteria change. Predictions are separate from decisions and can be used by any decision maker. Classification is best used with non-stochastic/deterministic outcomes that occur frequently, and not when two individuals with identical inputs can easily have different outcomes.
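The separation can be sketched in a few lines (the probabilities and costs below are hypothetical): the predictions are computed once, and each decision maker derives their own classifications by applying their own costs, with no refitting of the prediction model.

```python
# Sketch: one set of predictions, many classifications. Each decision
# maker applies their own costs; the predictions never change.
def classify(prob, cost_fp, cost_fn):
    # expected-loss-minimizing rule: act when prob > cost_fp / (cost_fp + cost_fn)
    return prob > cost_fp / (cost_fp + cost_fn)

probs = [0.10, 0.30, 0.60]           # one set of predicted risks

# Two cost structures -> two different classification rules from the
# same predictions, without touching the prediction model.
screening = [classify(p, cost_fp=1, cost_fn=9) for p in probs]  # threshold 0.1
confirm   = [classify(p, cost_fp=1, cost_fn=1) for p in probs]  # threshold 0.5
print(screening, confirm)   # -> [False, True, True] [False, False, True]
```

Baking either threshold into the model would force a refit whenever costs change; publishing the predictions leaves every downstream decision open.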
In trying to guard against false conclusions, researchers often attempt to minimize the risk of a “false positive” conclusion. In the field of assessing the efficacy of medical and behavioral treatments for improving subjects’ outcomes, falsely concluding that a treatment is effective when it is not is an important consideration. Nowhere is this more important than in the drug and medical device regulatory environments. A treatment thought not to work can be given a second chance as better data arrive, but a treatment judged to be effective may be approved for marketing, and if later data show that the treatment was actually not effective (or only trivially effective), it is difficult to remove it from the market as long as it is safe.
Much has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals. Rejection of straw-man null hypotheses leads researchers to believe that their theories are supported, and the unquestioning use of a threshold such as p<0.05 has resulted in hypothesis substitution, search for subgroups, and other gaming that has badly damaged science. But we seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.