Monday, January 23, 2017

Split-Sample Model Validation

Methods used to obtain unbiased estimates of future performance of statistical prediction models and classifiers include data splitting and resampling.  The two most commonly used resampling methods are cross-validation and bootstrapping.  To be as good as the bootstrap, about 100 repeats of 10-fold cross-validation are required.
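As a rough illustration of what repeated cross-validation involves (the simulated data, ordinary least squares fit, and the function name `cv_r2` below are all my own, not from the text), the procedure simply averages out-of-fold performance over many random partitions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 subjects, intercept + 5 predictors, moderate signal
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
y = X @ np.array([0.0, 1.0, 0.5, 0.0, 0.0, 0.0]) + rng.normal(size=n)

def cv_r2(X, y, k=10, repeats=100, rng=rng):
    """Mean out-of-fold R^2 over `repeats` random k-fold partitions."""
    n = len(y)
    scores = []
    for _ in range(repeats):
        idx = rng.permutation(n)
        for test in np.array_split(idx, k):
            train = np.setdiff1d(idx, test)
            beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
            resid = y[test] - X[test] @ beta
            scores.append(1 - np.sum(resid**2)
                          / np.sum((y[test] - y[test].mean()) ** 2))
    return float(np.mean(scores))

print(f"repeated 10-fold CV estimate of R^2: {cv_r2(X, y):.3f}")
```

Averaging over many repeats is what removes the partition-to-partition noise that a single 10-fold run leaves behind.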

As discussed in more detail in Section 5.3 of Regression Modeling Strategies Course Notes and the same section of the RMS book, data splitting is an unstable method for validating models or classifiers, especially when the number of subjects is less than about 20,000 (fewer if the signal:noise ratio is high).  This is because were you to split the data again, develop a new model on the training sample, and test it on the holdout sample, the results would likely vary markedly.  Data splitting requires a much larger sample size than resampling to work acceptably well.  See also Section 10.11 of BBR.
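A quick simulation makes the instability concrete (the data-generating model and function names here are hypothetical, chosen only to illustrate the point): re-splitting the same modest-sized dataset and re-fitting gives a noticeably different held-out estimate each time.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 150 subjects, one real predictor among 4 candidates
n = 150
X = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X[:, 1] + rng.normal(size=n)

def split_r2(X, y, rng):
    """Fit on a random 50% training half; report R^2 on the held-out half."""
    idx = rng.permutation(len(y))
    train, test = idx[: len(y) // 2], idx[len(y) // 2 :]
    beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
    resid = y[test] - X[test] @ beta
    return 1 - np.sum(resid**2) / np.sum((y[test] - y[test].mean()) ** 2)

estimates = [split_r2(X, y, rng) for _ in range(20)]
print(f"held-out R^2 across 20 re-splits: "
      f"{min(estimates):.2f} to {max(estimates):.2f}")
```

The spread between the smallest and largest estimate is the "luck of the split" that a single split-sample validation silently bakes into its reported performance.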

There are also very subtle problems:

  1. When feature selection is done, data splitting validates just one of a myriad of potential models.  In effect it validates an example model.  Resampling (repeated cross-validation or the bootstrap) validates the process that was used to develop the model.  Resampling is honest in reporting the results because it depicts the uncertainty in feature selection, e.g., the disagreements in which variables are selected from one resample to the next.
  2. It is not uncommon for researchers to be disappointed in the test sample validation and to ask for a "re-do" whereby another split is made or the modeling starts over, or both.  When reporting the final result they sometimes neglect to mention that the result was the third attempt at validation.
  3. Users of split-sample validation are wise to recombine the two samples to get a better model once the first model is validated.  But then they have no validation of the new model fitted to the combined data.
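To make point 1 above concrete, here is a sketch in which the selection step is re-run inside every bootstrap resample (the toy correlation-threshold selection rule and the simulated data are illustrative assumptions, not a method from the text). The disagreement in which variables are chosen from one resample to the next is exactly the uncertainty that validating a single split hides.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: two weak real predictors among 8 candidates
n, p = 100, 8
X = rng.normal(size=(n, p))
y = 0.4 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(size=n)

def select(X, y, threshold=0.2):
    """Toy selection rule: keep variables with |correlation with y| > threshold."""
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return frozenset(np.flatnonzero(np.abs(r) > threshold))

# Re-run the selection step inside each of 200 bootstrap resamples
chosen = [select(X[idx], y[idx])
          for idx in (rng.integers(0, n, n) for _ in range(200))]
n_distinct = len(set(chosen))
print(f"{n_distinct} distinct variable subsets selected across 200 resamples")
```

When many distinct subsets appear, the bootstrap is validating the unstable selection *process*; a single split would have anointed whichever subset happened to emerge from one training sample.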
There is a less subtle problem but one that is ordinarily not addressed by investigators: unless both the training and test samples are huge, split-sample validation is not nearly as accurate as the bootstrap.  See for example the section Studies of Methods Used in the Text here.  As shown in a simulation appearing there, bootstrapping is typically more accurate than data splitting and cross-validation that does not use a large number of repeats.  This is shown by estimating the "true" performance, e.g., the R-squared or c-index on an infinitely large dataset (infinite here means 50,000 subjects for practical purposes).  The performance of an accuracy estimate is taken as the mean squared error of the estimate against the model's performance in the 50,000 subjects.
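The simulation idea described above can be sketched as follows (the data-generating model and sample sizes here are hypothetical; 50,000 subjects stand in for an infinite test set, as in the text): for each simulated study, compute the "true" performance of the fitted model on the huge sample, then accumulate the squared error of the split-sample estimate and of a repeated cross-validation estimate against that truth.

```python
import numpy as np

rng = np.random.default_rng(3)
beta_true = np.array([0.0, 1.0, 0.5, 0.0])  # intercept + 3 predictors

def simulate(n, rng):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
    return X, X @ beta_true + rng.normal(size=n)

def r2(beta, X, y):
    resid = y - X @ beta
    return 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)

X_big, y_big = simulate(50_000, rng)   # stand-in for an infinite test set
err_split, err_cv = [], []
for _ in range(20):                    # 20 simulated studies of n = 100
    X, y = simulate(100, rng)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    truth = r2(beta, X_big, y_big)     # "true" performance of the fitted model

    # Split-sample estimate: fit on one half, evaluate on the other
    idx = rng.permutation(100)
    tr, te = idx[:50], idx[50:]
    b, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)
    err_split.append((r2(b, X[te], y[te]) - truth) ** 2)

    # Repeated 10-fold CV estimate (10 repeats here, to keep the sketch fast)
    scores = []
    for _ in range(10):
        order = rng.permutation(100)
        for te2 in np.array_split(order, 10):
            tr2 = np.setdiff1d(order, te2)
            b2, *_ = np.linalg.lstsq(X[tr2], y[tr2], rcond=None)
            scores.append(r2(b2, X[te2], y[te2]))
    err_cv.append((np.mean(scores) - truth) ** 2)

print(f"MSE split-sample: {np.mean(err_split):.4f}  "
      f"MSE repeated CV: {np.mean(err_cv):.4f}")
```

With only 20 simulated studies this is a sketch of the design, not a definitive result; the simulations referenced in the text use many more replications.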

Data are too precious to not be used in model development/parameter estimation.  Resampling methods allow the data to be used for both development and validation, and they do a good job in estimating the likely future performance of a model.  Data splitting only has an advantage when the test sample is held by another researcher to ensure that the validation is unbiased.

Update 2017-01-25

Many investigators have been told that they must do an "external" validation, and they split the data by time or geographical location.  They are sometimes surprised that the model developed in one country or time does not validate in another.  They should not be; this is an indirect way of saying there are time or country effects.  Far better would be to learn about and estimate time and location effects by including them in a unified model.  Then perform a rigorous internal validation using the bootstrap, accounting for time and location all along the way.  The end result is a model that is useful for prediction at times and locations that were at least somewhat represented in the original dataset, but without assuming that time and location effects are nil.
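A minimal sketch of the unified-model idea, assuming simulated two-country data (the country effect size and variable names are my own assumptions): rather than holding one country out as an "external" test set, the country indicator enters the model and its effect is estimated directly.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical two-country data with a genuine country effect on the outcome
n = 300
country = rng.integers(0, 2, n)       # 0 = country A, 1 = country B
x = rng.normal(size=n)
y = 1.0 * x + 0.8 * country + rng.normal(size=n)

# Unified model: estimate the country effect rather than being surprised by it
X = np.column_stack([np.ones(n), x, country])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"estimated country effect: {beta[2]:.2f}")  # should land near 0.8
```

Bootstrap validation of this unified model would then resample subjects (keeping their country labels) so that the uncertainty in the country effect is carried through the validation.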


  1. Not directly related to split-sample vs. resampling but I hope you think it is a relevant topic worth commenting on.

    Given your previous experience at the FDA, should split-sample or resampling-based methods also be required for proving the clinical validity of deep-learning models? In particular, there are recent instances where the underlying models continually evolve as real-life clinical cases get converted into "training cases", so it is not clear [to me] when the model is "fixed" and when it should subsequently be evaluated (and then reported to the FDA).

    For example, Arterys recently won FDA OK for its deep-learning model that automatically segments heart ventricles. But it is very likely that the model will have changed 6 months after the 510k submission due to additional training data acquired.

  2. Wonderful areas for ongoing discussions, and FDA is very interested in these issues. Some of it boils down to sample size. If you can't assemble tens of thousands of subjects, or even thousands, then split-sample validation is very noisy, inefficient, and unreliable, and resampling is more advantageous. The updating of "training" data as new subjects are enrolled in a study is something I haven't thought much about. But it seems to me that split-sample validation is only for the case where the model is fixed for all time, whereas resampling can be re-executed to update both the model and the validation if one accepts that new subjects should be used for both model refinement and model testing.

  3. Fantastic! The findings of my recent work, recently accepted at IEEE Transactions on Software Engineering, are consistent with yours, i.e., split-sample model validation is problematic and should be avoided, and the findings suggest researchers opt to use advanced bootstrap validation instead.

    1. Great to hear that, and thanks for giving us the link to your paper. Besides the technical problems of split-sample validation, there are psychological problems that I also tried to communicate.

  4. I appreciate your comments in your 2017-01-25 update, but I wanted to distinguish split-sample techniques (which I agree should rarely be used in lieu of resampling-based techniques) from the needs of external validation. I have seen many instances where the practicalities of developing a model mean that the initial model-building stage is done using more easily accessible data (e.g. retrospective data), where the diversity of the population (geographic or otherwise) does not fully represent the population for which the clinical model is intended (i.e. the intended-use population).

    Getting a dataset in which [time or country] effects may be estimated may require a much more expensive prospective study. For this reason I think many investigators rely on resampling-based procedures in building a model, but then view the validation of the model in the intended-use population as a separate step, i.e. the external validation step, which is done at a later stage. As an example, many diagnostic tests are built using data available from retrospective sample biobanks, despite the fact that the population represented in the biobanks may not be from the country in which the test is being developed and clinically deployed. At some point in the future the model needs to be tested in a representative sample.

    Of course, all other things being equal, if the exact intended-use population is available from the start, then it should be incorporated in a dataset to build the single unified model.

    In some cases, a separate external validation may also protect against information/label leakage, in which a useless unknown confounder is inadvertently linked to the outcome of interest purely because of bias in how the data are collected. It may be less likely that this unknown bias occurs in a separately derived dataset.

    1. Excellent points. In some cases one may argue that if a model is to be used internationally then one should wait until international data are collected before the model is developed.