Sunday, February 19, 2017

My Journey From Frequentist to Bayesian Statistics

If I had been taught Bayesian modeling before being taught the frequentist paradigm, I'm sure I would have always been a Bayesian.  I started becoming a Bayesian about 1994 because of an influential paper by David Spiegelhalter and because I worked in the same building at Duke University as Don Berry.  Two other things strongly contributed to my thinking: difficulties explaining p-values and confidence intervals (especially the latter) to clinical researchers, and difficulty of learning group sequential methods in clinical trials.  When I talked with Don and learned about the flexibility of the Bayesian approach to clinical trials, and saw Spiegelhalter's embrace of Bayesian methods because of their problem-solving abilities, I was hooked.  [Note: I've heard Don say that he became Bayesian after multiple attempts to teach statistics students the exact definition of a confidence interval.  He decided the concept was defective.]

At the time I was working on clinical trials at Duke and started to see that multiplicity adjustments were arbitrary.  This started with a clinical trial coordinated by Duke in which low dose and high dose of a new drug were to be compared to placebo, using an alpha cutoff of 0.03 for each comparison to adjust for multiplicity.  The comparison of high dose with placebo resulted in a p-value of 0.04 and the trial was labeled completely "negative" which seemed problematic to me. [Note: the p-value was two-sided and thus didn't give any special "credit" for the treatment effect coming out in the right direction.]

I began to see that the hypothesis testing framework wasn't always the best approach to science, and that in biomedical research the typical hypothesis was an artificial construct designed to placate a reviewer who believed that an NIH grant's specific aims must include null hypotheses.  I saw the contortions that investigators went through to achieve this, came to see that questions are more relevant than hypotheses, and estimation was even more important than questions.   With Bayes, estimation is emphasized.  I very much like Bayesian modeling instead of hypothesis testing.  I saw that a large number of clinical trials were incorrectly interpreted when p>0.05 because the investigators involved failed to realize that a p-value can only provide evidence against a hypothesis. Investigators are motivated by "we spent a lot of time and money and must have gained something from this experiment." The classic "absence of evidence is not evidence of absence" error results, whereas with Bayes it is easy to estimate the probability of similarity of two treatments.  Investigators will be surprised to know how little we have learned from clinical trials that are not huge when p>0.05.

I listened to many discussions of famous clinical trialists debating what should be the primary endpoint in a trial, the co-primary endpoint, the secondary endpoints, co-secondary endpoints, etc.  This was all because of their paying attention to alpha-spending.  I realized this was all a game.

I came to not believe in the possibility of infinitely many repetitions of identical experiments, as required to be envisioned in the frequentist paradigm.  When I looked more thoroughly into the multiplicity problem, and sequential testing, and I looked at Bayesian solutions, I became more of a believer in the approach.  I learned that posterior probabilities have a simple interpretation independent of the stopping rule and frequency of data looks.  I got involved in working with the FDA and then consulting with pharmaceutical companies, and started observing how multiple clinical endpoints were handled.  I saw a closed testing procedure where a company was seeking a superiority claim for a new drug, and if there was insufficient evidence for such a claim, they wanted to seek a non-inferiority claim on another endpoint.  They developed a closed testing procedure that when diagrammed truly looked like a train wreck.  I felt there had to be a better approach, so I sought to see how far posterior probabilities could be pushed.  I found that with MCMC simulation of Bayesian posterior draws I could quite simply compute probabilities such as P(any efficacy), P(efficacy more than trivial), P(non-inferiority), P(efficacy on endpoint A and on either endpoint B or endpoint C), and P(benefit on more than 2 of 5 endpoints).  I realized that frequentist multiplicity problems came from the chances you give data to be more extreme, not from the chances you give assertions to be true.
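To make the posterior-draw idea concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the draws are simulated from made-up independent normal distributions rather than taken from a real MCMC fit (real posterior draws for several endpoints would usually be correlated), and the thresholds `trivial` and `margin` are hypothetical.  The point is only that once draws exist, the probability of any assertion, simple or compound, is the fraction of draws for which the assertion holds.

```python
import random

random.seed(1)

# Hypothetical posterior draws of treatment effects on three endpoints
# (positive = benefit).  In a real analysis these would come from an MCMC
# sampler; here they are simulated from made-up normal distributions.
n_draws = 20000
draws = [
    {
        "A": random.gauss(0.30, 0.15),
        "B": random.gauss(0.10, 0.20),
        "C": random.gauss(0.05, 0.20),
    }
    for _ in range(n_draws)
]


def posterior_prob(assertion):
    """Fraction of posterior draws for which the assertion is true."""
    return sum(1 for d in draws if assertion(d)) / len(draws)


trivial = 0.05   # assumed threshold for a more-than-trivial effect
margin = -0.10   # assumed non-inferiority margin

p_any = posterior_prob(lambda d: d["A"] > 0)
p_nontrivial = posterior_prob(lambda d: d["A"] > trivial)
p_noninferior = posterior_prob(lambda d: d["A"] > margin)
# Compound assertion: efficacy on A and on either B or C
p_compound = posterior_prob(lambda d: d["A"] > 0 and (d["B"] > 0 or d["C"] > 0))
# Benefit on at least 2 of the 3 endpoints
p_two_of_three = posterior_prob(lambda d: sum(v > 0 for v in d.values()) >= 2)

print(f"P(any efficacy on A)          = {p_any:.3f}")
print(f"P(efficacy more than trivial) = {p_nontrivial:.3f}")
print(f"P(non-inferiority)            = {p_noninferior:.3f}")
print(f"P(A and (B or C))             = {p_compound:.3f}")
print(f"P(benefit on >= 2 of 3)       = {p_two_of_three:.3f}")
```

No multiplicity adjustment appears anywhere in this sketch: each probability is a statement about an assertion given the same set of draws, so asking more questions does not change the answer to any one of them.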

I enjoy the fact that posterior probabilities define their own error probabilities, and that they count not only inefficacy but also harm.  If P(efficacy)=0.97, P(no effect or harm)=0.03.  This is the "regulator's regret", and type I error is not the error of major interest (is it really even an 'error'?).  One minus a p-value is P(data in general are less extreme than that observed if H0 is true) which is the probability of an event I'm not that interested in.

The extreme amount of time I spent analyzing data led me to understand other problems with the frequentist approach.  Parameters are either in a model or not in a model.  We test for interactions with treatment and hope that the p-value is not between 0.02 and 0.2.  We either include the interactions or exclude them, and the power for the interaction test is modest.  Bayesians have a prior for the differential treatment effect and can easily have interactions "half in" the model.  Dichotomous irrevocable decisions are at the heart of many of the statistical modeling problems we have today.  I really like penalized maximum likelihood estimation (which is really empirical Bayes) but once we have a penalized model all of our frequentist inferential framework fails us.  No one can interpret a confidence interval for a biased (shrunken; penalized) estimate.  On the other hand, the Bayesian posterior probability density function, after shrinkage is accomplished using skeptical priors, is just as easy to interpret as had the prior been flat.  For another example, consider a categorical predictor variable that we hope is predicting in an ordinal (monotonic) fashion.  We tend to either model it as ordinal or as completely unordered (using k-1 indicator variables for k categories).  A Bayesian would say "let's use a prior that favors monotonicity but allows larger sample sizes to override this belief."
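As a rough illustration of shrinkage with a skeptical prior, the sketch below uses the textbook conjugate normal-normal update.  It assumes the effect estimate is approximately normal with a known standard error, and the numbers (an estimate of 0.40 with standard error 0.20, and a prior centered at zero with standard deviation 0.25) are hypothetical.  The function name `normal_posterior` is mine, not from any package.

```python
import math
from statistics import NormalDist


def normal_posterior(estimate, se, prior_mean=0.0, prior_sd=float("inf")):
    """Posterior for an effect under a normal prior and normal likelihood.

    An infinite prior_sd stands for a flat prior, in which case the
    posterior is centered at the estimate itself.
    """
    if math.isinf(prior_sd):
        return NormalDist(estimate, se)
    w_prior = 1.0 / prior_sd ** 2   # precision of the prior
    w_data = 1.0 / se ** 2          # precision of the data
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * estimate)
    return NormalDist(post_mean, math.sqrt(post_var))


flat = normal_posterior(0.40, 0.20)
skeptical = normal_posterior(0.40, 0.20, prior_mean=0.0, prior_sd=0.25)

# Both posteriors are read the same way: probability the effect is positive
p_flat = 1.0 - flat.cdf(0.0)
p_skeptical = 1.0 - skeptical.cdf(0.0)

print(f"flat prior:      mean {flat.mean:.3f}, P(effect > 0) = {p_flat:.3f}")
print(f"skeptical prior: mean {skeptical.mean:.3f}, P(effect > 0) = {p_skeptical:.3f}")
```

The skeptical posterior mean is pulled toward zero, yet the statement "P(effect > 0)" is interpreted exactly as it would be under the flat prior, which is the contrast with confidence intervals for penalized estimates drawn above.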

Now that adaptive and sequential experiments are becoming more popular, and a formal mechanism is needed to use data from one experiment to inform a later experiment (a good example being the use of adult clinical trial data to inform clinical trials on children when it is difficult to enroll a sufficient number of children for the child data to stand on their own), Bayes is needed more than ever.  It took me a while to realize something that is quite profound: A Bayesian solution to a simple problem (e.g., 2-group comparison of means) can be embedded into a complex design (e.g., adaptive clinical trial) without modification.  Frequentist solutions require highly complex modifications to work in the adaptive trial setting.

I met likelihoodist Jeffrey Blume in 2008 and started to like the likelihood approach.  It is more Bayesian than frequentist.  I plan to learn more about this paradigm. 

Several readers have asked me how I could believe all this and publish a frequentist-based book such as Regression Modeling Strategies.  There are two primary reasons.  First, I started writing the book before I knew much about Bayes.  Second, I performed a lot of simulation studies that showed that purely empirical model-building had a low chance of capturing clinical phenomena correctly and of validating on new datasets.  I worked extensively with cardiologists such as Rob Califf, Dan Mark, Mark Hlatky, David Prior, and Phil Harris who gave me the ideas for injecting clinical knowledge into model specification.  From that experience I wrote Regression Modeling Strategies in the most Bayesian way I could without actually using specific Bayesian methods.  I did this by emphasizing subject-matter-guided model specification.  The section in the book about specification of interaction terms is perhaps the best example.  When I teach the full-semester version of my course I interject Bayesian counterparts to many of the techniques covered.

There are challenges in moving more to a Bayesian approach.  The ones I encounter most frequently are:
  1. Teaching clinical trialists to embrace Bayes when they already do in spirit but not operationally.  Unlearning things is much more difficult than learning things.
  2. How to work with sponsors, regulators, and NIH principal investigators to specify the (usually skeptical) prior up front, and to specify the amount of applicability assumed for previous data.
  3. What is a Bayesian version of the multiple degree of freedom "chunk test"?  Partitioning sums of squares or the log likelihood into components, e.g., combined test of interaction and combined test of nonlinearities, is very easy and natural in the frequentist setting.
  4. How do we specify priors for complex entities such as the degree of monotonicity of the effect of a continuous predictor in a regression model?  The Bayesian approach to this will ultimately be more satisfying, but operationalizing this is not easy.
With new tools such as Stan and well written accessible books such as Kruschke's it's getting to be easier to be Bayesian each day.  The R brms package, which uses Stan, makes a large class of regression models even more accessible.

Sunday, February 5, 2017

Interactive Statistical Graphics: Showing More By Showing Less

Version 5 of the R rms package interfaces with interactive plotly graphics, which is an interface to the D3 javascript graphics library.  This allows various results of statistical analyses to be viewed interactively, with pre-programmed drill-down information.  More examples will be added here.  We start with a video showing a new way to display survival curves.

Note that plotly graphics are best used with RStudio Rmarkdown html notebooks, and are distributed to reviewers as self-contained (but somewhat large) html files. Printing is discouraged, but possible, using snapshots of the interactive graphics.

Concerning the second bullet point below, boxplots have a high ink:information ratio and hide bimodality and other data features.  Many statisticians prefer to use dot plots and violin plots.  I liked those methods for a while, then started to have trouble with the choice of a smoothing bandwidth in violin plots, and found that dot plots do not scale well to very large datasets, whereas spike histograms are useful for all sample sizes.  Users of dot charts have to have a dot stand for more than one observation if N is large, and I found the process too arbitrary.  For spike histograms I typically use 100 or 200 bins.  When the number of distinct data values is below the specified number of bins, I just do a frequency tabulation for all distinct data values, rounding only when two of the values are very close to each other.  A spike histogram approximately reduces to a rug plot when there are no ties in the data, and I very much like rug plots.

  • rms survplotp video: plotting survival curves
  • Hmisc histboxp interactive html example: spike histograms plus selected quantiles, mean, and Gini's mean difference - replacement for boxplots - show all the data!  Note bimodal distributions and zero blood pressure values for patients having a cardiac arrest.

A Litany of Problems With p-values

In my opinion, null hypothesis testing and p-values have done significant harm to science.  The purpose of this note is to catalog the many problems caused by p-values.  As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.

The American Statistical Association has done a great service by issuing its Statement on Statistical Significance and P-values.  Now it's time to act.  To create the needed motivation to change, we need to fully describe the depth of the problem.

It is important to note that no statistical paradigm is perfect.  Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults.  This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.

Consider an assertion such as "the coin is fair", "treatment A yields the same blood pressure as treatment B", "B yields lower blood pressure than A", or "B lowers blood pressure at least 5mmHg below A."  Consider also a compound assertion such as "A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke."

A. Problems With Conditioning

  1. p-values condition on what is unknown (the assertion of interest; H0) and do not condition on what is known (the data).
  2. This conditioning does not respect the flow of time and information; p-values are backward probabilities.

B. Indirectness

  1. Because of A above, p-values provide only indirect evidence and are problematic as evidence metrics.  They are sometimes monotonically related to the evidence (e.g., when the prior distribution is flat) we need but are not properly calibrated for decision making.
  2. p-values are used to bring indirect evidence against an assertion but cannot bring evidence in favor of the assertion.  
  3. As detailed here, the idea of proof by contradiction is a stretch when working with probabilities, so trying to quantify evidence for an assertion by bringing evidence against its complement is on shaky ground.
  4. Because of A, p-values are difficult to interpret and very few non-statisticians get it right.  The best article on misinterpretations I've found is here.

C. Problem Defining the Event Whose Probability is Computed

  1. In the continuous data case, the probability of getting a result as extreme as that observed with our sample is zero, so the p-value is the probability of getting a result more extreme than that observed.  Is this the correct point of reference?
  2. How does more extreme get defined if there are sequential analyses and multiple endpoints or subgroups?  For sequential analyses do we consider planned analyses to be analyses intended to be run even if they were not?

D. Problems Actually Computing p-values

  1. In some discrete data cases, e.g., comparing two proportions, there is tremendous disagreement among statisticians about how p-values should be calculated.  In a famous 2x2 table from an ECMO adaptive clinical trial, 13 p-values have been computed from the same data, ranging from 0.001 to 1.0.  And many statisticians do not realize that Fisher's so-called "exact" test is not very accurate in many cases.
  2. Outside of binomial, exponential, and normal (with equal variance) and a few other cases, p-values are actually very difficult to compute exactly, and many p-values computed by statisticians are of unknown accuracy (e.g., in logistic regression and mixed effects models). The more non-quadratic the log likelihood function the more problematic this becomes in many cases. 
  3. One can compute (sometimes requiring simulation) the type-I error of many multi-stage procedures, but actually computing a p-value that can be taken out of context can be quite difficult and sometimes impossible.  One example: one can control the false discovery probability (incorrectly usually referred to as a rate), and ad hoc modifications of nominal p-values have been proposed, but these are not necessarily in line with the real definition of a p-value.

E. The Multiplicity Mess

  1. Frequentist statistics does not have a recipe or blueprint leading to a unique solution for multiplicity problems, so when many p-values are computed, the way they are penalized for multiple comparisons results in endless arguments.  A Bonferroni multiplicity adjustment is consistent with a Bayesian prior distribution specifying that the probability that all null hypotheses are true is a constant no matter how many hypotheses are tested.  By contrast, Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A), P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions about a true effect.
  2. There remains controversy over the choice of 1-tailed vs. 2-tailed tests.  The 2-tailed test can be thought of as a multiplicity penalty for being potentially excited about either a positive effect or a negative effect of a treatment.  But few researchers want to bring evidence that a treatment harms patients; a pharmaceutical company would not seek a licensing claim of harm.  So when one computes the probability of obtaining an effect larger than that observed if there is no true effect, why do we too often ignore the sign of the effect and compute the (2-tailed) p-value?
  3. Because it is a very difficult problem to compute p-values when the assertion is compound, researchers using frequentist methods do not attempt to provide simultaneous evidence regarding such assertions and instead rely on ad hoc multiplicity adjustments.
  4. Because of A1, statistical testing with multiple looks at the data, e.g., in sequential data monitoring, is ad hoc and complex.  Scientific flexibility is discouraged.  The p-value for an early data look must be adjusted for future looks.  The p-value at the final data look must be adjusted for the earlier inconsequential looks.  Unblinded sample size re-estimation is another case in point.  If the sample size is expanded to gain more information, there is a multiplicity problem and some of the methods commonly used to analyze the final data effectively discount the first wave of subjects.  How can that make any scientific sense?
  5. Most practitioners of frequentist inference do not understand that multiplicity comes from chances you give data to be extreme, not from chances you give true effects to be present.
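The "chances you give data to be extreme" in items 4 and 5 can be seen in a small simulation.  This is a sketch under stated assumptions, not a description of any particular trial: the outcome is normal with known standard deviation 1, the true effect is exactly zero, the five look times are hypothetical, and every look uses the unadjusted two-sided 0.05 cutoff.

```python
import math
import random

random.seed(2017)

LOOKS = (20, 40, 60, 80, 100)   # hypothetical interim and final sample sizes
Z_CRIT = 1.96                   # nominal two-sided 0.05 cutoff, unadjusted
N_TRIALS = 4000


def simulate_trial():
    """Return (rejected at any look, rejected at final look) under H0."""
    total = 0.0
    n = 0
    any_reject = False
    final_reject = False
    for look in LOOKS:
        while n < look:
            total += random.gauss(0.0, 1.0)   # true effect is exactly zero
            n += 1
        z = total / math.sqrt(n)   # z-statistic for the mean, known SD of 1
        reject = abs(z) > Z_CRIT
        any_reject = any_reject or reject
        if look == LOOKS[-1]:
            final_reject = reject
    return any_reject, final_reject


results = [simulate_trial() for _ in range(N_TRIALS)]
rate_any = sum(a for a, _ in results) / N_TRIALS
rate_final = sum(f for _, f in results) / N_TRIALS

print(f"P(reject at final look only considered) = {rate_final:.3f}")
print(f"P(reject at any of {len(LOOKS)} looks)            = {rate_any:.3f}")
```

The rejection rate across all looks is well above the nominal level even though nothing about the true effect changed.  The inflation comes entirely from giving the data more opportunities to look extreme, which is why frequentist sequential methods must adjust for looks while a posterior probability computed at any look keeps its interpretation.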

F. Problems With Non-Trivial Hypotheses

  1. It is difficult to test non-point hypotheses such as "drug A is similar to drug B".
  2. There is no straightforward way to test compound hypotheses coming from logical unions and intersections. 

G. Inability to Incorporate Context and Other Information

  1. Because extraordinary claims require extraordinary evidence, there is a serious problem with the p-value's inability to incorporate context or prior evidence.  A Bayesian analysis of the existence of ESP would no doubt start with a very skeptical prior that would require extraordinary data to overcome, but the bar for getting a "significant" p-value is fairly low. Frequentist inference has a greater risk for getting the direction of an effect wrong (see here for more).
  2. p-values are unable to incorporate outside evidence.  As a converse to 1, strong prior beliefs are unable to be handled by p-values, and in some cases this results in a lack of progress.  Nate Silver in The Signal and the Noise beautifully details how the conclusion that cigarette smoking causes lung cancer was greatly delayed (with a large negative effect on public health) because scientists (especially Fisher) were caught up in the frequentist way of thinking, dictating that only randomized trial data would yield a valid p-value for testing cause and effect.  A Bayesian prior that was very strongly against the belief that smoking was causal is obliterated by the incredibly strong observational data.  Only by incorporating prior skepticism could one make a strong conclusion with non-randomized data in the smoking-lung cancer debate.
  3. p-values require subjective input from the producer of the data rather than from the consumer of the data.

H. Problems Interpreting and Acting on "Positive" Findings

  1. With a large enough sample, a trivial effect can cause an impressively small p-value (statistical significance ≠ clinical significance).
  2. Statisticians and subject matter researchers (especially the latter) sought a "seal of approval" for their research by naming a cutoff on what should be considered "statistically significant", and a cutoff of p=0.05 is most commonly used.  Any time there is a threshold there is a motive to game the system, and gaming (p-hacking) is rampant.  Hypotheses are exchanged if the original H0 is not rejected, subjects are excluded, and because statistical analysis plans are not pre-specified as required in clinical trials and regulatory activities, researchers and their all-too-accommodating statisticians play with the analysis until something "significant" emerges.
  3. When the p-value is small, researchers act as though the point estimate of the effect is a population value.
  4. When the p-value is small, researchers believe that their conceptual framework has been validated.  
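Item 1 above is easy to verify with arithmetic.  The sketch below assumes a two-group comparison of means with a known common standard deviation and uses a normal (z) test; the observed difference of 0.2 mmHg and the standard deviation of 10 mmHg are hypothetical values chosen to represent a clinically trivial effect.

```python
import math
from statistics import NormalDist


def two_sided_p(diff, sd, n_per_group):
    """Two-sided z-test p-value for a difference in means between two
    equal-sized groups with known common standard deviation."""
    se = sd * math.sqrt(2.0 / n_per_group)
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Same trivial observed difference, very different sample sizes
p_small = two_sided_p(diff=0.2, sd=10.0, n_per_group=200)
p_large = two_sided_p(diff=0.2, sd=10.0, n_per_group=200_000)

print(f"n = 200 per group:     p = {p_small:.3f}")
print(f"n = 200000 per group:  p = {p_large:.2e}")
```

The effect is identical in both cases; only the sample size changed.  The p-value therefore says nothing about whether 0.2 mmHg matters to a patient.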

I. Problems Interpreting and Acting on "Negative" Findings

  1. Because of B2, large p-values are uninformative and do not assist the researcher in decision making (Fisher said that a large p-value means "get more data").

Friday, January 27, 2017

Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness

Randomized clinical trials (RCT) have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason.  But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:
  1. Patients in clinical practice are different from those enrolled in RCTs
  2. Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.
Point 2 is hard to debate because RCTs are run under protocol and research personnel are watching and asking about patients' adherence.  But point 1 is a misplaced worry in the majority of trials.  The explanation requires getting to the heart of what RCTs are really intended to do: provide evidence for relative treatment effectiveness.  There are some trials that provide evidence for both relative and absolute effectiveness.   This is especially true when the efficacy measure employed is absolute as in measuring blood pressure reduction due to a new treatment.  But many trials use binary or time-to-event endpoints and the resulting efficacy measure is on a relative scale such as the odds ratio or hazard ratio.

RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable.  This is most readily seen in subgroup analyses provided by the trials themselves - so called forest plots that demonstrate remarkable constancy of relative treatment benefit.  When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply.  It is only likely that the absolute treatment benefit will change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and the absolute baseline risk for the subject.   This is covered in detail in Biostatistics for Biomedical Research, Section 13.6.
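The conversion from a relative effect to an absolute benefit can be sketched in a few lines of Python.  The odds ratio of 0.7 and the baseline risks are hypothetical, and the function names are mine; the calculation itself is just the definition of an odds ratio applied to a patient's baseline odds.

```python
def risk_on_treatment(baseline_risk, odds_ratio):
    """Risk under treatment implied by a baseline risk and an odds ratio."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = baseline_odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Risk difference (control minus treated) for one patient."""
    return baseline_risk - risk_on_treatment(baseline_risk, odds_ratio)


# The same hypothetical odds ratio applied to patients at different risk
for risk in (0.02, 0.10, 0.30, 0.50):
    arr = absolute_risk_reduction(risk, odds_ratio=0.7)
    print(f"baseline risk {risk:.2f} -> absolute risk reduction {arr:.4f}")
```

A constant odds ratio yields a small absolute benefit for a low-risk patient and a much larger one for a high-risk patient, which is how a transportable relative effect and a patient-specific absolute effect coexist.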

Clinical practice provides anecdotal evidence that biases clinicians.  What a clinician sees in her practice is patient i on treatment A and patient j on treatment B.  She may remember how patient i fared in comparison to patient j, not appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness in treatment A vs. B.  But the real therapeutic question is how does the outcome of a patient were she given treatment A compare to her outcome were she given treatment B.  The gold standard design is thus the randomized crossover design, when the treatment is short acting.  Stephen Senn eloquently writes about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients.

For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below. Entries are in the order of strongest evidence requiring the fewest assumptions to the weakest evidence. Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of information they provide.

Let Pi denote patient i and the treatments be denoted by A and B. Thus P2B represents patient 2 on treatment B.  P1 represents the average outcome over a sample of patients from which patient 1 was selected.  HTE is heterogeneity of treatment effect.

Design | Patients Compared
6-period crossover | P1A vs P1B (directly measure HTE)
2-period crossover | P1A vs P1B
RCT in identical twins | P1A vs P1B
Parallel-group RCT | P1A vs P2B, P1=P2 on avg
Observational, good artificial control | P1A vs P2B, P1=P2 hopefully on avg
Observational, poor artificial control | P1A vs P2B, P1≠P2 on avg
Real-world physician practice | P1A vs P2B

The best experimental designs yield the best evidence a clinician needs to answer the "what if" therapeutic question for the one patient in front of her.

Much more needs to be said about how to handle treatment adherence and what should be the target adherence in an RCT, but overall it is a good thing that RCTs do not mimic clinical practice.  We are entering a new era of pragmatic clinical trials.  Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that the chief advantage of pragmatic trials is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.

Wednesday, January 25, 2017

Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error

Imagine watching a baseball game, seeing the batter get a hit, and hearing the announcer say "The chance that the batter is left handed is now 0.2!"   No one would care.  Baseball fans are interested in the chance that a batter will get a hit conditional on his being right handed (handedness being already known to the fan), the handedness of the pitcher, etc.  Unless one is an archaeologist or medical examiner, the interest is in forward probabilities conditional on current and past states.  We are interested in the probability of the unknown given the known and the probability of a future event given past and present conditions and events.

Clinicians are people trained in the science and practice of medicine, and most of them are very good at it.  They are also very good at many aspects of research.  But they are generally not taught probability, and this can limit their research skills.  Many excellent clinicians even let their limitations in understanding probability make them believe that their clinical decision making is worse than it actually is.  I have taught many clinicians who say "I need a hard and fast rule so I know how to diagnose or treat patients.  I need a hard cutoff on blood pressure, HbA1c, etc. so that I know what to do, and the fact that I either treat or not treat the patient means that I don't want to consider a probability of disease but desire a simple classification rule."  This makes the clinician try to influence the statistician to use inefficient, arbitrary methods such as categorization, stratification, and matching.

In reality, clinicians do not act that way when treating patients.  They are smart enough to know that if a patient has cholesterol just over someone's arbitrary threshold they may not start statin therapy right away if the patient has no other risk factors (e.g., smoking) going against him.  They know that sometimes you start a patient on a lower dose and see how she responds, or start one drug and try it for a while and then switch drugs if the efficacy is unacceptable or there is a significant side effect.

So I emphasize the need to understand probabilities when I'm teaching clinicians.  A probability is a self-contained summary of the current information, except for the patient's risk aversion and other utilities.  Clinicians need to be comfortable with a probability of 0.5 meaning "we don't know much" and not requesting a classification of disease/normal that does nothing but cover up the problem.  A classification does not account for gray zones or patient and physician utility functions.

Even physicians who understand the meaning of a probability often do not understand conditioning.  Conditioning is all important, and conditioning on different things massively changes the meaning of the probabilities being computed.  Every physician I've known has been taught probabilistic medical diagnosis by first learning about sensitivity (sens) and specificity (spec).  These are probabilities that are in backwards time- and information flow order.  How did this happen? Sensitivity, specificity, and receiver operating characteristic curves were developed for radar and radio research in the military.  It was important to receive radio signals from distant aircraft, and to detect an incoming aircraft on radar.  The ability to detect something that is really there is definitely important.  In the 1950s, virologists appropriated these concepts to measure the performance of viral cultures.  Virus needs to be detected when it's present, and not detected when it's not.  Sensitivity is the probability of detecting a condition when it is truly present, and specificity is the probability of not detecting it when it is truly absent.  One can see how these probabilities would be useful outside of virology and bacteriology when the samples are retrospective, as in case-control studies.  But I believe that clinicians and researchers would be better off if backward probabilities were not taught or were mentioned only to illustrate how not to think about a problem.

But the way medical students are educated, they assume that sens and spec are what you first consider in a prospective cohort of patients!  This gives the professor the opportunity of teaching Bayes' rule and requires the use of a supposedly unconditional probability known as prevalence which is actually not very well defined.  The student plugs everything into Bayes' rule and fails to notice that several quantities cancel out.  The result is the following: the proportion of patients with a positive test who have disease, and the proportion with a negative test who have disease.  These are trivially calculated from the cohort data without knowing anything about sens, spec, and Bayes.  This way of thinking harms the student's understanding for years to come and influences those who later engage in clinical and pharmaceutical research to believe that type I error and p-values are directly useful.
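The cancellation is easy to check numerically.  The sketch below uses a hypothetical prospective cohort of 1000 patients laid out as a 2x2 table; the counts are invented for illustration.  It computes the probability of disease given the test result twice: once the long way through sens, spec, prevalence, and Bayes' rule, and once directly from the cohort counts.

```python
# Hypothetical prospective cohort: counts by test result and true disease status
tp, fn = 90, 10     # diseased patients: test positive / test negative
fp, tn = 180, 720   # non-diseased patients: test positive / test negative
n = tp + fn + fp + tn

# The long way: backward probabilities plus Bayes' rule
sens = tp / (tp + fn)
spec = tn / (tn + fp)
prev = (tp + fn) / n
p_dis_pos_bayes = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
p_dis_neg_bayes = (1 - sens) * prev / ((1 - sens) * prev + spec * (1 - prev))

# The direct way: forward probabilities read straight from the cohort
p_dis_pos_direct = tp / (tp + fp)
p_dis_neg_direct = fn / (fn + tn)

print(f"P(disease | test +): Bayes {p_dis_pos_bayes:.4f}, direct {p_dis_pos_direct:.4f}")
print(f"P(disease | test -): Bayes {p_dis_neg_bayes:.4f}, direct {p_dis_neg_direct:.4f}")
```

The two routes agree to rounding error because the margins of the table cancel inside Bayes' rule.  In a cohort, the detour through sens and spec adds steps without adding information.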

The situation in medical diagnosis gets worse when referral bias (also called workup bias) is present.  When certain types of patients do not get a final diagnosis, sens and spec are biased.  For example, younger women with a negative test may not get the painful procedure that yields the final diagnosis.  There are formulas that must be used to correct sens and spec.  But wait!  When Bayes' rule is used to obtain the probability of disease we needed in the first place, these corrections completely cancel out when the usual correction methods are used!  Using forward probabilities in the first place means that one just conditions on age, sex, and result of the initial diagnostic test and no special methods other than (sometimes) logistic regression are required.

There is an analogy to statistical testing.  p-values and type I error are affected by sequential testing and a host of other factors, but forward-time probabilities (Bayesian posterior probabilities) are not.  Posterior probabilities condition on what is known and do not have to imagine alternate paths to getting to what is known (as do sens and spec when workup bias exists).  p-values and type I errors are backwards-information-flow measures, and clinical researchers and regulators come to believe that type I error is the error of interest.  They also very frequently misinterpret p-values.  The p-value is one minus spec, and power is sens.  The posterior probability is exactly analogous to the probability of disease.

Sens and spec are so pervasive in medicine, bioinformatics, and biomarker research that we don't question how silly they would be in other contexts.  Do we dichotomize a response variable so that we can compute the probability that a patient is on treatment B given a "positive" response?  On the contrary we want to know the full continuous distribution of the response given the assigned treatment.  Again this represents forward probabilities.

Monday, January 23, 2017

Split-Sample Model Validation

Methods used to obtain unbiased estimates of future performance of statistical prediction models and classifiers include data splitting and resampling.  The two most commonly used resampling methods are cross-validation and bootstrapping.  To be as good as the bootstrap, about 100 repeats of 10-fold cross-validation are required.

As discussed in more detail in Section 5.3 of Regression Modeling Strategies Course Notes and the same section of the RMS book, data splitting is an unstable method for validating models or classifiers, especially when the number of subjects is less than about 20,000 (fewer if signal:noise ratio is high).  This is because were you to split the data again, develop a new model on the training sample, and test it on the holdout sample, the results are likely to vary significantly.   Data splitting requires a significantly larger sample size than resampling to work acceptably well.  See also Section 10.11 of BBR.
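
The instability can be seen in a toy simulation. The sketch below is written in Python with only the standard library; the data-generating model, the sample size of 100, the 70/30 split, and the 200 repeated splits are all illustrative assumptions, not settings taken from the RMS notes. It fits a one-predictor least squares line on each training sample and records R-squared on the matching holdout sample.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(intercept, slope, xs, ys):
    """R-squared of a fixed line evaluated on the data (xs, ys)."""
    my = statistics.fmean(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst

rng = random.Random(1)
n = 100
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]   # assumed true model

holdout_r2 = []
for _ in range(200):                 # 200 different random 70/30 splits
    idx = list(range(n))
    rng.shuffle(idx)
    train, test = idx[:70], idx[70:]
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    holdout_r2.append(
        r_squared(a, b, [xs[i] for i in test], [ys[i] for i in test])
    )

print("holdout R-squared: min %.2f, max %.2f, SD %.2f" % (
    min(holdout_r2), max(holdout_r2), statistics.stdev(holdout_r2)))
```

Every one of those 200 numbers is a "validation" that an investigator could have reported from the same dataset; the spread between the smallest and largest is the instability in question.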

There are also very subtle problems:

  1. When feature selection is done, data splitting validates just one of a myriad of potential models.  In effect it validates an example model.  Resampling (repeated cross-validation or the bootstrap) validates the process that was used to develop the model.  Resampling is honest in reporting the results because it depicts the uncertainty in feature selection, e.g., the disagreements in which variables are selected from one resample to the next.
  2. It is not uncommon for researchers to be disappointed in the test sample validation and to ask for a "re-do" whereby another split is made or the modeling starts over, or both.  When reporting the final result they sometimes neglect to mention that the result was the third attempt at validation.
  3. Users of split-sample validation are wise to recombine the two samples to get a better model once the first model is validated.  But then they have no validation of the new combined data model.

There is a less subtle problem but one that is ordinarily not addressed by investigators: unless both the training and test samples are huge, split-sample validation is not nearly as accurate as the bootstrap.  See for example the section Studies of Methods Used in the Text here.  As shown in a simulation appearing there, bootstrapping is typically more accurate than data splitting and cross-validation that does not use a large number of repeats.  This is shown by estimating the "true" performance, e.g., the R-squared or c-index on an infinitely large dataset (infinite here means 50,000 subjects for practical purposes).  The performance of an accuracy estimate is taken as the mean squared error of the estimate against the model's performance in the 50,000 subjects.

Data are too precious to not be used in model development/parameter estimation.  Resampling methods allow the data to be used for both development and validation, and they do a good job in estimating the likely future performance of a model.  Data splitting only has an advantage when the test sample is held by another researcher to ensure that the validation is unbiased.
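
As one way to picture using all of the data for both development and validation, here is a minimal Python sketch of an optimism-corrected bootstrap for R-squared. The text says only "the bootstrap", so the optimism-correction variant is an assumption, as are the toy one-predictor model, the function names, the sample size, and the number of resamples.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(intercept, slope, xs, ys):
    """R-squared of a fixed line evaluated on the data (xs, ys)."""
    my = statistics.fmean(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst

def optimism_corrected_r2(xs, ys, n_boot=200, seed=0):
    """Return (apparent R-squared, optimism-corrected R-squared)."""
    rng = random.Random(seed)
    n = len(xs)
    a, b = fit_line(xs, ys)                 # model developed on ALL the data
    apparent = r_squared(a, b, xs, ys)
    optimism = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        ab, bb = fit_line(bx, by)           # repeat the modeling process
        # optimism = performance on the resample minus performance on the original
        optimism.append(r_squared(ab, bb, bx, by) - r_squared(ab, bb, xs, ys))
    return apparent, apparent - statistics.fmean(optimism)

rng = random.Random(2)
xs = [rng.gauss(0, 1) for _ in range(60)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]   # assumed true model
apparent, corrected = optimism_corrected_r2(xs, ys)
print("apparent R-squared %.3f, optimism-corrected %.3f" % (apparent, corrected))
```

The final model is fit to every observation, and the resampling loop is used only to estimate how much the apparent performance overstates future performance. With feature selection or other data-driven steps, those steps would have to be repeated inside the loop for the estimate to remain honest.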

Update 2017-01-25

Many investigators have been told that they must do an "external" validation, and they split the data by time or geographical location.  They are sometimes surprised that the model developed in one country or time does not validate in another.  They should not be; this is an indirect way of saying there are time or country effects.  Far better would be to learn about and estimate time and location effects by including them in a unified model, and then to do rigorous internal validation using the bootstrap, accounting for time and location all along the way.  The end result is a model that is useful for prediction at times and locations that were at least somewhat represented in the original dataset, but without assuming that time and location effects are nil.

Wednesday, January 18, 2017

Fundamental Principles of Statistics

There are many principles involved in the theory and practice of statistics, but here are the ones that guide my practice the most.
  1. Use methods grounded in theory or extensive simulation
  2. Understand uncertainty
  3. Design experiments to maximize information
  4. Understand the measurements you are analyzing and don't hesitate to question how the underlying information was captured
  5. Be more interested in questions than in null hypotheses, and be more interested in estimation than in answering narrow questions
  6. Use all information in data during analysis
  7. Use discovery and estimation procedures not likely to claim that noise is signal
  8. Strive for optimal quantification of evidence about effects
  9. Give decision makers the inputs (other than the utility function) that optimize decisions
  10. Present information in ways that are intuitive, maximize information content, and are correctly perceived
  11. Give the client what she needs, not what she wants
  12. Teach the client to want what she needs