Bootstrap Confidence Limits for Bootstrap Overfitting-Corrected Model Performance

Categories: computing, prediction, r, regression, validation, 2025
The Efron-Gong optimism bootstrap has been used for decades to obtain reliable estimates of likely performance of statistical models on new data. It accomplishes this by estimating the bias (optimism) from overfitting and subtracting that bias from apparent model performance indexes. No fast reliable method for computing confidence intervals for overfitting-corrected measures currently exists, so analysts may have false confidence in internal model validations, especially for small datasets. The purpose of this research is to empirically derive a satisfactory fast algorithm for computing the needed confidence intervals when the model is a binary logistic regression model. The approach is expected to work for a wide variety of models.
Affiliation: Department of Biostatistics, Vanderbilt University School of Medicine

Published: July 24, 2025
Modified: July 25, 2025

Background

The goal here is strong internal validation after fitting a pre-specified regression model, or one derived using backwards step-down variable selection such that the same variable selection procedure can be repeated afresh for each bootstrap repetition. Strong internal validation means estimating a variety of model performance measures in a way that does not reward the model for overfitting and that penalizes for all aspects of model selection and derivation that utilized the outcome variable. The aim is to estimate the likely future performance of the model on new data from the same stream as the data used in model development. A standard approach, as implemented in the R rms package’s validate and calibrate functions, is the Efron-Gong optimism bootstrap discussed in detail in RMS.
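To make the workflow concrete, here is a small self-contained illustration (the dataset and predictor names are simulated placeholders, not part of this report’s simulations):

# Hypothetical example; dat, age, and sex are illustrative only
require(rms)
set.seed(1)
dat   <- data.frame(age = rnorm(300, 50, 10), sex = sample(c('f', 'm'), 300, TRUE))
dat$y <- ifelse(runif(300) <= plogis(-2 + 0.04 * dat$age), 1, 0)
f <- lrm(y ~ rcs(age, 4) + sex, data=dat, x=TRUE, y=TRUE)
validate(f, B=300)          # optimism-corrected Dxy, slope, Brier score, ...
cal <- calibrate(f, B=300)
plot(cal)                   # overfitting-corrected calibration curve estimate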

A key omission from the rms package has been computation of uncertainty intervals for bootstrap overfitting-corrected measures. In 2021 Hisashi Noma et al. wrote a very useful paper entitled Confidence intervals of prediction accuracy measures for multivariable prediction models based on the bootstrap-based optimism correction methods. They studied confidence interval coverage (although only in a two-tailed combined sense) of a full two-stage double bootstrap procedure to compute confidence intervals for Efron-Gong overfitting (bias/optimism)-corrected model performance measures. This is a very computationally demanding but accurate procedure. They also studied an approximate one-stage procedure that does not add to the computational burden of computing the overfitting-adjusted measures themselves.

Recall that the form of the bootstrap optimism-corrected measure \(\tau\) for a single index such as a Brier score or rank correlation between predicted and observed outcomes is

\[\tau = \theta - (\bar{\theta}_{b} - \bar{\theta}_{w})\]

where \(\theta\) is the original index computed as usual on the whole sample (sometimes called the apparent index), \(\theta_b\) is the apparent performance in a bootstrap sample when model coefficients are fitted in the bootstrap sample, \(\theta_w\) is the performance of the bootstrap-fitted model on the original whole sample, and the horizontal bars represent averages over \(B\) bootstrap resamples. The estimated bias is \(\gamma = \bar{\theta}_{b} - \bar{\theta}_{w}\).

Noma et al.’s approximate bootstrap confidence interval takes the bootstrap distribution of \(\theta_{b}\), computes a confidence interval from it using the nonparametric bootstrap percentile method, and shifts the resulting interval by \(-\gamma\), i.e., subtracts the estimated optimism. They showed that in large samples, taking into account only the variation in \(\theta_{b}\) is good enough, and that the overall confidence interval coverage was close to that of the expensive double bootstrap. But when the sample size is not large, considering only variation in \(\theta_b\) significantly underestimates uncertainty, and confidence interval coverage is far below the intended level (of 0.95, for example).
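In code, the corrected estimate and the one-stage interval look like the following sketch, with made-up numbers standing in for the apparent index and the \(B\) per-resample training and test indexes:

# Illustration with made-up numbers; theta_app is the apparent index,
# theta_b and theta_w hold the B per-resample training and test indexes
theta_app <- 0.30
theta_b   <- rnorm(300, mean=0.32, sd=0.03)   # apparent performance in bootstrap samples
theta_w   <- rnorm(300, mean=0.27, sd=0.03)   # bootstrap fits evaluated on original sample
optimism  <- mean(theta_b - theta_w)          # estimated bias (optimism) gamma
corrected <- theta_app - optimism             # overfitting-corrected index tau
# Noma et al. one-stage approximation: percentile interval of theta_b shifted by -gamma
quantile(theta_b, c(0.025, 0.975)) - optimism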

Fast Alternative Bootstrap Confidence Limits

To derive a fast approximate method that I expect to have better confidence coverage for non-large \(n\), consider the \(B\) individual bootstrap estimates \(\theta_{b}\) and \(\theta_{w}\) and compute variances of several combinations of them, including \(\text{var}(\theta_{b})\), \(\text{var}(\theta_{w})\), \(\text{var}(\gamma)\), variances of weighted sums of \(\theta_{b}\) and \(\theta_{w}\), and other variants. Variances of quantities involving both \(\theta_{b}\) and \(\theta_{w}\) take into account uncertainties in two quantities instead of one. One can think of the variance of a difference as a summative measure of two uncertainties, because the variance of a difference is the sum of the variances minus twice the covariance of the two. It is tempting to use the nonparametric bootstrap percentile interval, but we need to respect the width of the bootstrap distribution for the combination of \(\theta_{b}\) and \(\theta_{w}\) while paying no attention to its center. This pushes us to use normal-approximation bootstrap confidence intervals, which take an original point estimate (here a bootstrap bias-corrected index) \(\pm 1.96\times\) the bootstrap standard deviation, for example.

An additional complexity is that the bootstrap distribution may not be symmetric, and assuming that it is (as use of a single standard deviation does) leads to symmetric confidence intervals instead of the needed asymmetric ones. The result is inaccurate confidence coverage in at least one of the tails. To fix that problem, we compute two standard deviations: one for the upper part of the distribution and one for the lower part. There is a question of where to make the high/low split. Traditionally the sample mean is chosen, but the mean is affected by asymmetry. We could use the sample median instead, but it has lower precision. A nice remedy is to use the pseudo-median, i.e., the median of all possible pairwise averages. Each of the two confidence limits is then computed using the square root of the average squared distance from the pseudo-median for observations on that limit’s side of the bootstrap distribution.
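A minimal sketch of this dual-SD idea follows. It is not the exact implementation of the Hmisc dualSD function used later, just the definition written out directly; the function name is mine.

require(Hmisc)    # for pMedian
dual_sd_sketch <- function(x) {
  pm <- pMedian(x)                     # pseudo-median: median of pairwise averages
  lo <- x[x <= pm]                     # observations in the lower part of the distribution
  hi <- x[x >= pm]                     # observations in the upper part
  c(bottom = sqrt(mean((lo - pm) ^ 2)),  # SD computed from the lower part
    top    = sqrt(mean((hi - pm) ^ 2)))  # SD computed from the upper part
}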

Next we attempt to find the best combination of variation of training and test indexes from bootstrap repetitions so that confidence interval coverage for overfitting-corrected indexes is good in both tails for three performance measures. The resulting bootstrap method will be called ABCLOC, for asymmetric bootstrap confidence limits for overfitting-corrected model performance metrics.

Simulation to Derive ABCLOC to Obtain Accurate Confidence Coverage for Simple Performance Measures

Before proceeding to the more complex situation of studying the accuracy of confidence intervals for estimated overfitting-corrected calibration curves, let’s study the performance of ABCLOC confidence limits for standard scalar predictive accuracy measures: Somers’ \(D_{xy}\) (equal to \(2\times (c - \frac{1}{2})\) where \(c\) is the concordance probability or area under the ROC curve), the calibration slope on the logit scale, and the Brier score. The R rms package validate.lrm function uses, among other methods, the Efron-Gong optimism bootstrap for debiasing the apparent performance indexes to obtain overfitting-corrected scores estimating likely future performance of the model for these three and other indexes.

Define a function to simulate a matrix of \(p\) predictors for a sample of size \(n\), and a sample of \(n\) binary responses generated from that predictor matrix, using a true model in which the only nonzero regression coefficient among the \(p\) predictors is \(\beta_{1} = 1\).

require(Hmisc)
require(rms)   # Version 8.1-0 is required
require(ggplot2)
require(data.table)
options(prType='html')

# Simulate an n x p matrix of independent standard normal predictors and a
# binary response whose true logit is x[, 1]  (beta_1 = 1, all other betas = 0)
simd <- function(n, p) {
  x <- matrix(rnorm(n * p), n, p)
  L <- x[, 1]                               # true linear predictor
  y <- ifelse(runif(n) <= plogis(L), 1, 0)  # binary response
  list(x=x, y=y)
}

In what follows let \(n=200\) be the sample size and \(p=15\) be the number of predictors. This is a setting with major overfitting.

To check confidence interval coverage we need to know the true Brier score for our data generating model. This can be estimated well by simulating a single large sample of 200,000 subjects, fitting many models on samples of 200 subjects, computing each model’s Brier score against the large sample, and averaging those Brier scores.

set.seed(1)
N    <- 200
np   <- 15
Nbig <- 200000
big  <- simd(Nbig, np)
big  <- data.frame(x = I(big$x), y=big$y)   # I keeps matrix x as a matrix
Y    <- big$y
nsim <- 5000

Simulate 5000 samples of size 200. For each sample run 300 bootstrap resamples to compute approximate confidence limits. Also compute the true performance measures for each sample, to be averaged later.

# Define a function to generate and analyze one dataset
sim1 <- function(n, p, B=300) {
  .i. <<- .i. + 1
  cat(.i., '', file='/tmp/z', append=TRUE)
  d    <- simd(n, p)
  f    <- lrm(y ~ x, data=d, x=TRUE, y=TRUE)
  phat <- predict(f, newdata=big, type='fitted')
  
  trueB     <- mean((phat - Y) ^ 2)
  trueDxy   <- somers2(phat, Y)['Dxy']
  trueSlope <- lrm.fit(qlogis(phat), Y)$coef[2]
  # saveraw is passed to predab.resample in rms 8.1-0 to cause indexes
  # from individual bootstrap resamples to be saved in a global object .predab_raw.
  v <- validate(f, B=B, saveraw=TRUE)
  r <- .predab_raw.
  w <- which(names(r$orig) %in% c('Dxy', 'Slope', 'B'))
  r$orig   <- r$orig[w]
  r$btrain <- r$btrain[, w]
  r$btest  <- r$btest [, w]
  praw[[.i.]] <<- r
  v <- v[w, c('index.corrected', 'Lower', 'Upper')]
  v <- cbind(v, true=c(trueDxy, trueSlope, trueB))
  v
}
.i.  <- 0
praw <- NULL
set.seed(1)
if(file.exists('indexes.rds')) {
  s    <- readRDS('indexes.rds')
  praw <- readRDS('praw.rds')
  } else {
  s <- replicate(nsim, sim1(n=N, p=np))
  saveRDS(s,    'indexes.rds')
  saveRDS(praw, 'praw.rds')
}

# Compute means across repetitions  
rn <- function(x) round(x, 3)
rn(apply(s, 1:2, mean))
      index.corrected Lower Upper  true
Dxy             0.426 0.306 0.545 0.404
Slope           0.680 0.445 0.915 0.636
B               0.224 0.190 0.258 0.226

The average overfitting-corrected Brier score estimate closely tracks the average of all the true Brier scores. The optimism bootstrap overestimated the performance on the \(D_{xy}\) and calibration slope scales. The positive bias is 0.02 for \(D_{xy}\) and 0.04 for the calibration slope.

It is known that in extremely overfitted models, the bootstrap will underestimate the amount of overfitting when compared with 100 repeats of 10-fold cross-validation. But even in these situations, model performance estimated by the bootstrap will convey enough bad news.

Next examine tail non-coverage probabilities for a variety of ways of utilizing the training and test indexes over the 300 resamples drawn for each simulated dataset. This is done separately for each of the 3 performance indexes.

# Add true indexes to praw
raw <- praw
for(i in 1 : nsim) raw[[i]]$true <- s[, 'true', i]

# Function applied to each element w of raw, for bootstrap type and basis and
# index number idx
# Result is a pair of logical values left and right with TRUE indicating that
# the true index is outside that tail

h <- function(w, what=c('error', 'cltrue')) {
  what  <- match.arg(what)
  true  <- w$true [idx]
  train <- w$btrain[, idx]
  test  <- w$btest [, idx]
  opt   <- train - test
  mopt  <- mean(opt)
  theta <- w$orig[idx]
  x <- switch(basis,
              train = train,
              test  = test,
              opt   = opt,
              wtd   = 2    * train - test,
              wtd2  = 1.5  * train - test,
              wtd3  = train - 0.75 * test,
              wtd4  = train - 1.25 * test,
              wtd5  = train - 1.5 * test)
  
  if(type == 'shifted nonpar') {
    qu <- quantile(x, c(0.025, 0.975), na.rm=TRUE)
    r  <- c(qu[1] - mopt, qu[2] - mopt)
  } else if(substr(type, 1, 2) == 'sd') {
    if(basis == 'v2') {
      v1 <- var(train)
      v2 <- var(train - test)
      s  <- sqrt(v1 + v2 - 0.5 * sqrt(v1 * v2))
      r  <- c(theta - mopt - s * z, theta - mopt + s * z)
    } else {   # dualSD is in Hmisc 5.2-4
      s <- dualSD(x, nmin=if(type == 'sd1') 100000 else 10)
      a <- if(type == 'sd2') 1:2 else 2:1
      r <- c(theta - mopt - s[a[1]] * z, theta - mopt + s[a[2]] * z)
    }
  }
  if(what == 'cltrue') return(c(left=r[1], right=r[2], true=true))
  r <- c(r[1] > true, r[2] < true)
  names(r) <- .q(left, right)
  r
}

z  <- qnorm(0.975)

# validate used wtd4 sd2rev; check against that for Brier score
if(FALSE) {
  idx  <- 3
  type <- 'sd2rev'
  basis <- 'wtd4'
  for(i in c(1, nsim / 2, nsim)) {
    cat('i=', i, '\n')
    print(h(raw[[i]], 'cltrue'))
    print(s[3,,i])
  }
}

types <- c('shifted nonpar', 'sd1', 'sd2', 'sd2rev')
bases <- c('train', 'test', 'opt', 'wtd', 'wtd2', 'wtd3', 'wtd4', 'wtd5', 'v2')

D <- vector('list', 3)
for(idx in 1 :3) {
  idxname <- c('Dxy', 'Slope', 'B')[idx]
  cat('\n-------------------------------------------------------------------\nIndex: ',
      if(idxname == 'B') 'Brier' else idxname, '\n\n', sep='')

  # Analyze simulation results separately for each confidence interval construction method
  # For each simulation compute the components of confidence intervals
  # Then compute the per-simulation limits
  # Then compute for that simulation whether a non-coverage has occurred per tail

  d <- NULL
  for(type in types) {
    for(basis in bases) {
      if(type == 'shifted nonpar' && basis %nin% c('train', 'test')) next
      if(type != 'sd1'            && basis == 'v2') next
      w    <- sapply(raw, h)
      err  <- apply(w, 1, mean, na.rm=TRUE)
      dist <- sum(abs(err - 0.025))
      d <- rbind(d, data.frame(type, basis, left=rn(err[1]), right=rn(err[2]),
                               coverage=rn(1 - sum(err)), dist=rn(dist)))
    }
  }
  row.names(d) <- NULL
  print(d)
  D[[idx]] <- d
}

-------------------------------------------------------------------
Index: Dxy

             type basis  left right coverage  dist
1  shifted nonpar train 0.350 0.001    0.649 0.349
2  shifted nonpar  test 0.122 0.474    0.404 0.546
3             sd1 train 0.124 0.033    0.843 0.107
4             sd1  test 0.355 0.156    0.489 0.461
5             sd1   opt 0.122 0.035    0.842 0.108
6             sd1   wtd 0.010 0.001    0.989 0.039
7             sd1  wtd2 0.043 0.006    0.951 0.037
8             sd1  wtd3 0.128 0.037    0.836 0.114
9             sd1  wtd4 0.114 0.032    0.853 0.097
10            sd1  wtd5 0.103 0.028    0.870 0.080
11            sd1    v2 0.073 0.017    0.910 0.057
12            sd2 train 0.111 0.034    0.855 0.095
13            sd2  test 0.311 0.190    0.498 0.452
14            sd2   opt 0.115 0.036    0.848 0.102
15            sd2   wtd 0.007 0.001    0.992 0.042
16            sd2  wtd2 0.038 0.006    0.956 0.031
17            sd2  wtd3 0.118 0.039    0.843 0.107
18            sd2  wtd4 0.111 0.032    0.857 0.093
19            sd2  wtd5 0.104 0.026    0.869 0.081
20         sd2rev train 0.136 0.032    0.833 0.117
21         sd2rev  test 0.403 0.128    0.470 0.480
22         sd2rev   opt 0.129 0.034    0.837 0.113
23         sd2rev   wtd 0.014 0.001    0.985 0.035
24         sd2rev  wtd2 0.049 0.005    0.946 0.043
25         sd2rev  wtd3 0.137 0.035    0.829 0.121
26         sd2rev  wtd4 0.117 0.032    0.852 0.098
27         sd2rev  wtd5 0.103 0.029    0.868 0.082

-------------------------------------------------------------------
Index: Slope

             type basis  left right coverage  dist
1  shifted nonpar train 0.661 0.339    0.000 0.950
2  shifted nonpar  test 0.000 0.508    0.492 0.508
3             sd1 train 0.661 0.339    0.000 0.950
4             sd1  test 0.065 0.045    0.890 0.060
5             sd1   opt 0.065 0.045    0.890 0.060
6             sd1   wtd 0.065 0.045    0.890 0.060
7             sd1  wtd2 0.065 0.045    0.890 0.060
8             sd1  wtd3 0.170 0.076    0.754 0.196
9             sd1  wtd4 0.015 0.026    0.958 0.011
10            sd1  wtd5 0.002 0.015    0.983 0.033
11            sd1    v2 0.065 0.045    0.890 0.060
12            sd2 train 0.661 0.339    0.000 0.950
13            sd2  test 0.090 0.037    0.873 0.077
14            sd2   opt 0.047 0.056    0.896 0.054
15            sd2   wtd 0.047 0.056    0.896 0.054
16            sd2  wtd2 0.047 0.056    0.896 0.054
17            sd2  wtd3 0.143 0.092    0.765 0.185
18            sd2  wtd4 0.009 0.035    0.955 0.026
19            sd2  wtd5 0.002 0.021    0.977 0.027
20         sd2rev train 0.661 0.339    0.000 0.950
21         sd2rev  test 0.047 0.056    0.896 0.054
22         sd2rev   opt 0.090 0.037    0.873 0.077
23         sd2rev   wtd 0.090 0.037    0.873 0.077
24         sd2rev  wtd2 0.090 0.037    0.873 0.077
25         sd2rev  wtd3 0.207 0.066    0.727 0.223
26         sd2rev  wtd4 0.026 0.019    0.955 0.006
27         sd2rev  wtd5 0.004 0.010    0.986 0.036

-------------------------------------------------------------------
Index: Brier

             type basis  left right coverage  dist
1  shifted nonpar train 0.001 0.303    0.696 0.302
2  shifted nonpar  test 0.626 0.060    0.315 0.635
3             sd1 train 0.047 0.067    0.886 0.064
4             sd1  test 0.212 0.246    0.542 0.408
5             sd1   opt 0.016 0.045    0.938 0.029
6             sd1   wtd 0.000 0.001    0.999 0.049
7             sd1  wtd2 0.001 0.009    0.989 0.039
8             sd1  wtd3 0.025 0.051    0.924 0.026
9             sd1  wtd4 0.010 0.038    0.952 0.028
10            sd1  wtd5 0.006 0.031    0.963 0.025
11            sd1    v2 0.009 0.026    0.965 0.017
12            sd2 train 0.036 0.074    0.889 0.061
13            sd2  test 0.257 0.208    0.535 0.415
14            sd2   opt 0.009 0.058    0.933 0.049
15            sd2   wtd 0.000 0.003    0.997 0.047
16            sd2  wtd2 0.001 0.013    0.986 0.036
17            sd2  wtd3 0.015 0.066    0.919 0.051
18            sd2  wtd4 0.004 0.051    0.944 0.047
19            sd2  wtd5 0.002 0.044    0.954 0.043
20         sd2rev train 0.061 0.060    0.879 0.071
21         sd2rev  test 0.170 0.287    0.542 0.408
22         sd2rev   opt 0.030 0.034    0.936 0.014
23         sd2rev   wtd 0.000 0.001    0.999 0.049
24         sd2rev  wtd2 0.004 0.006    0.989 0.039
25         sd2rev  wtd3 0.037 0.042    0.921 0.029
26         sd2rev  wtd4 0.025 0.028    0.947 0.004
27         sd2rev  wtd5 0.016 0.021    0.963 0.013

For each bootstrap confidence interval method and each performance index, compute the sum of the two tail non-coverage distances from the target probability of 0.025, then sum these distances over the three indexes to see which method has the best overall confidence interval coverage accuracy.

# Sum distances over the 3 indexes
di <- D[[1]]$dist + D[[2]]$dist + D[[3]]$dist
cbind(distance=di, rank=rank(di))
      distance rank
 [1,]    1.601 26.0
 [2,]    1.689 27.0
 [3,]    1.121 24.0
 [4,]    0.929 20.0
 [5,]    0.197 14.0
 [6,]    0.148  9.0
 [7,]    0.136  5.5
 [8,]    0.336 17.0
 [9,]    0.136  5.5
[10,]    0.138  7.0
[11,]    0.134  4.0
[12,]    1.106 23.0
[13,]    0.944 22.0
[14,]    0.205 16.0
[15,]    0.143  8.0
[16,]    0.121  2.0
[17,]    0.343 18.0
[18,]    0.166 13.0
[19,]    0.151 10.0
[20,]    1.138 25.0
[21,]    0.942 21.0
[22,]    0.204 15.0
[23,]    0.161 12.0
[24,]    0.159 11.0
[25,]    0.373 19.0
[26,]    0.108  1.0
[27,]    0.131  3.0

The winner (rank of 1) is sd2rev with the wtd4 basis, which computes standard deviations of the 300 bootstrap re-sampled values of training - 1.25 \(\times\) test indexes, where training denotes apparent performance on the bootstrap sample and test denotes the bootstrap fit’s performance on the original whole sample. The simple SD of the 300 estimates of bias (the opt basis, under sd1, sd2, and sd2rev) had overall ranks of 14, 16, and 15. It stands to reason that the best quantity on which to compute SDs puts more weight on the test-sample indexes than on the training-sample indexes, because there is less variation in the performance of bootstrap fits on the original sample over simulated datasets.

sd2rev refers to using the “top SD” for the lower confidence limit and the “bottom SD” for the upper confidence limit. For the wtd4 basis, the ordinary single SD (sd1) had distance rank 5.5, and using the two directional SDs in the opposite orientation (sd2: bottom SD for the lower limit, top SD for the upper) had rank 13. The standard parametric bootstrap using approximate normality does not handle asymmetric sampling distributions, but bottom and top SDs do. The better performance when using two SDs indicates asymmetry of the sampling distributions of the indexes.
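Putting the pieces together, the following sketch distills the winning sd2rev/wtd4 recipe from the h() function above into a single hypothetical helper. It assumes dualSD() returns the bottom and top SDs in that order, consistent with the indexing used in h(); the function name is mine.

# Sketch; theta = apparent index, train/test = B bootstrap training/test indexes
abcloc_sketch <- function(theta, train, test, conf=0.95) {
  z         <- qnorm((1 + conf) / 2)
  corrected <- theta - mean(train - test)            # optimism-corrected estimate
  s         <- unname(dualSD(train - 1.25 * test))   # wtd4 basis; bottom and top SDs
  # sd2rev: the top SD sets the lower limit, the bottom SD sets the upper limit
  c(index.corrected = corrected,
    Lower = corrected - s[2] * z,
    Upper = corrected + s[1] * z)
}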

Coverage for calibration slope and Brier score was estimated to be 0.955 and 0.947 for the best overall performing method. It did not perform so well for \(D_{xy}\), giving only 0.852 coverage. \(D_{xy}\) has a somewhat degenerate distribution when computed on regression fits, which typically do not allow \(D_{xy} < 0\), so normality is usually not a good fit for \(D_{xy}\).

Accuracy of Estimated Calibration Curves and Confidence Limits

Let’s study how well the overall winning bootstrap confidence interval estimator for the three scalar performance indexes fares when computing confidence bands for a vector of parameters: the bootstrap overfitting-corrected calibration curve. Fix a grid of 50 equally spaced points at which the calibration curve is estimated, spanning the 0.02 to 0.98 quantiles of the pooled distribution of predicted probabilities from 10 repeated model fits. The simulations we are about to run are similar to those for the 3 scalar indexes, but now we have 50 indexes.

A histogram of the pooled distribution shows that the entire range of predicted probabilities is represented by the chosen logistic model and distribution of \(x_1\).

set.seed(1)
phat <- replicate(10, {
  d <- simd(N, np)
  f <- lrm(y ~ x, data=d)
  predict(f, type='fitted')
})
round(apply(phat, 2, quantile, c(0, 0.02, 0.98, 1)), 2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
0%   0.03 0.03 0.01 0.07 0.03 0.02 0.08 0.02 0.02  0.05
2%   0.07 0.09 0.04 0.09 0.07 0.06 0.12 0.07 0.06  0.07
98%  0.95 0.94 0.96 0.90 0.93 0.94 0.84 0.86 0.90  0.93
100% 0.99 0.96 0.99 0.97 0.98 0.98 0.86 0.95 0.97  0.97
hist(phat, nclass=50, main=NULL, xlab=expression(hat(P)))

# On average for n=200, request calibrated probabilities for roughly the 5th
# through 196th ordered predicted risks, which is pushing things a bit
r <- quantile(phat, probs=c(0.02, 0.98))
cat('Limits on predicted probabilities for which calibration curve is estimated:',
    round(r, 3), '\n')
Limits on predicted probabilities for which calibration curve is estimated: 0.075 0.933 
pp <- seq(r[1], r[2], length=50)

For a given logistic model fit f on dataset d, compute the true calibration curve at pp. The calibration curve is a set of points \((\hat{P}, P)\), where \(\hat{P}\) are predicted probabilities and \(P\) are true probabilities that \(Y=1\) given \(\hat{P}\). The points are fitted using least squares on the logit scale (the true calibration is linear in the logit by design). This averages over variation in \(X_{2}, \ldots, X_{15}\), variation that adds unnecessary noise to \(\hat{P}\) by giving the \(p - 1\) noise \(X\)’s a chance to receive nonzero \(\hat{\beta}\)’s. So the true calibration curve is an average over the covariate distribution, conditioning on the covariates only through \(\hat{P}\).

truecal <- function(f, d, pl=FALSE) {
  Lhat  <- predict(f)
  Ltrue <- d$x[, 1]
  phat  <- plogis(Lhat)
  int_slope <- lm.fit.qr.bare(Lhat, Ltrue)$coefficients
  ptrue <- plogis(Ltrue)
  if(pl) plot(phat, ptrue, col='gray80')
  plogis(int_slope[1] + int_slope[2] * qlogis(pp))
}

# Test the function
par(mfrow=c(3,3), mar=c(2,2,0,0))
set.seed(3)
g <- function() {
  d <- simd(N, np)
  f <- lrm(y ~ x, data=d)
  tcal <- truecal(f, d, pl=TRUE)
  lines(pp, tcal)
  abline(a=0, b=1, col='gray80')
  coef(f)
}
round(replicate(9, g()), 2)

           [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]
Intercept  0.28  0.00 -0.02  0.07  0.13 -0.07  0.02  0.24 -0.24
x[1]       0.86  1.13  0.98  1.10  1.28  1.10  0.93  0.77  1.27
x[2]      -0.10  0.02  0.00  0.29  0.07  0.21  0.01 -0.03  0.15
x[3]       0.18  0.28  0.04  0.42  0.24  0.24  0.09 -0.19 -0.43
x[4]      -0.06  0.22  0.21 -0.10 -0.17 -0.05  0.21  0.54  0.16
x[5]      -0.12  0.01  0.08  0.02 -0.12 -0.23  0.20  0.12  0.35
x[6]      -0.07 -0.09  0.22  0.06 -0.53  0.16  0.44  0.05 -0.21
x[7]      -0.43 -0.12  0.09  0.14 -0.15 -0.32  0.15  0.06 -0.31
x[8]       0.12 -0.18 -0.09  0.00  0.12  0.38 -0.15 -0.19  0.07
x[9]      -0.07  0.12 -0.25 -0.21 -0.12 -0.07 -0.03 -0.09 -0.27
x[10]     -0.22 -0.04  0.08  0.03  0.03  0.12  0.00  0.30  0.19
x[11]     -0.12  0.03  0.09 -0.05 -0.03  0.01  0.03  0.16  0.38
x[12]      0.06  0.09 -0.26 -0.11 -0.34  0.03 -0.01 -0.17 -0.03
x[13]      0.15 -0.19 -0.15  0.34  0.21  0.06 -0.18  0.09 -0.28
x[14]      0.06  0.26  0.07  0.29 -0.01 -0.22 -0.23 -0.03  0.04
x[15]      0.00 -0.07  0.01  0.01  0.42  0.17 -0.01 -0.22 -0.01

The fact that the circles are sometimes far from the curves implies that an “average” true calibration curve is hiding some false variation in predicted risks due to variation in the extraneous predictors. In other words, many different predicted values (depending on the noise predictors) can give rise to one true probability. To have a gold standard to compare with bootstrap estimates we must take averages.

The default calibration estimator used in calibrate() is loess, using the R lowess function with outlier detection turned off (iter=0).

Define alternative parametric linear and quadratic (in the logit) estimators.

qcal <- function(x, y, deg=2) {
  # Curtail predicted probabilities to [0.001, 0.999]
  x <- pmax(0.001, pmin(x, 0.999))
  L <- qlogis(x)
  f <- lrm.fit(if(deg == 2) cbind(L, L^2) else L, y, compstats=FALSE)
  k <- coef(f)
  pgrid <- seq(min(x), max(x), length=200)
  Lgrid <- qlogis(pgrid)
  cgrid <- k[1] + k[2] * Lgrid
  if(deg == 2) cgrid <- cgrid + k[3] * Lgrid^2
  cal   <- plogis(cgrid)
  list(x=pgrid, y=cal)
}

Test the linear and quadratic calibrations in a setting where the calibration curve is known to be the line of identity, and also include loess estimates (red). Linear logistic calibration uses a black line, and quadratic a blue line.

par(mfrow=c(3, 3), mar=c(2, 2, 0, 0))
set.seed(1)
for(i in 1 : 9) {
  x <- runif(N)
  y <- ifelse(runif(N) <= x, 1, 0)
  w1 <- qcal(x, y, deg=1)
  plot(w1, type='l', xlim=c(0, 1), ylim=c(0, 1))
  w2 <- qcal(x, y, deg=2)
  lines(w2, col='blue')
  l <- lowess(x, y, iter=0)
  lines(l, col='red')
  abline(a=0, b=1, col='grey80')
}

Define a function that draws one sample, fits a binary logistic model, and computes bootstrap percentile confidence bands for the calibration curve at a regular sequence of predicted probabilities pp. Add the true calibration curve to the output, alongside \(\min(a, b)\), where \(a\) is the number of observed predicted risks that are less than the current value of pp and \(b\) is the number of \(\hat{P}\) that are greater than that value. In addition, save the standard deviation of \(\hat{P}\). This is done for one calibration estimator at a time. A special case is smoother='basic', which stands for using the validate function to get bias-corrected intercept and slope estimates and converting them to linear (in the logit) calibration curves, without confidence limits. Finally, smoother='apparent' does not use bootstrapping and instead takes the \(\hat{P}\) to (falsely) need no calibration, to gauge the accuracy of uncalibrated estimates.

# Use the following if you want to count "close" observations
# pptab <- data.table(pp, low = pp - 0.05, hi = pp + 0.05)
pptab <- data.table(ppgrid = pp)
sim1 <- function(n, p, B=300, smoother='lowess', pl=FALSE) {
  .i. <<- .i. + 1
  cat(.i., '', file='/tmp/z', append=TRUE)
  d <- simd(n, p)
  f <- lrm(y ~ x, data=d, x=TRUE, y=TRUE)
  phat    <- predict(f, type='fitted')
  ptrue   <- truecal(f, d)
  if(is.character(smoother) && smoother != 'lowess') {
    switch(smoother,
           basic = {
            val       <- validate(f, B=B)[, 'index.corrected']
            int_slope <- val[c('Intercept', 'Slope')]
            estcal    <- plogis(int_slope[1] + int_slope[2] * qlogis(pp)) },
           apparent = {
             estcal   <- pp } )
    lower <- NA
    upper <- NA
    if(pl) {
      plot(pp, estcal, type='l', xlim=c(0,1), ylim=c(0,1))
      lines(pp, ptrue, col='blue')
    }
  } else {
    bootcal <- calibrate(f, B=B, predy=pp, smoother=smoother)
    estcal <- bootcal[, 'calibrated.corrected']
    lower  <- pp + bootcal[, 'Lower']
    upper  <- pp + bootcal[, 'Upper']
    if(pl) {
      plot(bootcal, ylim=c(0,1))
      lines(pp, ptrue,   col='blue')
    }
  }
  d      <- data.table(phat)
  lo <- d[pptab, on = .(phat <= ppgrid), .N, by=.EACHI][, N]
  hi <- d[pptab, on = .(phat >= ppgrid), .N, by=.EACHI][, N]
  # count  <- d[pptab, on = .(phat >= low, phat <= hi), .N, by = .EACHI][, N]
  invisible(cbind(pp, estcal, ptrue, lower, upper, count=pmin(lo, hi), sphat=sd(phat)))
}

Run 9 tests for the default loess calibration estimator, plotting the confidence bands. The blue curve is the true calibration curve. In the upper left of each panel show the number of observations with low and high predicted probabilities.

set.seed(9)
par(mfrow=c(3,3), mar=c(2, 2, 0, 0))
for(j in 1 : 9) sim1(N, np, pl=TRUE)

[Nine-panel figure: bootstrap overfitting-corrected calibration plots with confidence bands and the true calibration curve in blue. The plot legends report, for n=200, mean absolute calibration error 0.041 to 0.065, mean squared error 0.00206 to 0.0046, and 0.9 quantile of absolute error 0.067 to 0.114 across the nine panels.]

Run 1000 simulations for each of the five approaches: the three calibration curve estimators (loess, linear, quadratic), the ‘basic’ approach that uses the validate function to debias the intercept and slope from a bootstrap run separate from the calibrate run and converts them to a linear calibration curve, and the ‘apparent’ approach that applies no calibration. For each simulated dataset, record the number of observations with low/high predictions.

sm <- list('lowess',
           function(x, y) qcal(x, y, deg=1),
           function(x, y) qcal(x, y, deg=2),
           'basic',
           'apparent')
set.seed(7)
if(file.exists('cal.rds')) R <- readRDS('cal.rds') else {
  R        <- vector('list', 5)
  names(R) <- c('loess', 'linear', 'quadratic', 'basic', 'apparent')
  for(est in 1 : 5) {
    cat('\nEstimator', est, '\n')
    .i. <- 0
    r <- replicate(1000, sim1(N, np, B=300, smoother=sm[[est]]))  # 44m
    R[[est]] <- r
  }
  saveRDS(R, 'cal.rds')
}

Restructure the results into a list of data tables.

convert2dt <- function(r) {
  npp  <- length(pp)
  nsim <- dim(r)[3]
  vn   <- dimnames(r)[[2]]
  # Restructure array so that pp and simulation # move fastest
  r <- aperm(r, c(1, 3, 2))
  # Stack into a taller matrix
  r <- matrix(r, nrow = npp * nsim, ncol=length(vn))
  d <- data.table(pp  = rep(pp, times=nsim),
                  sim = rep(1:nsim, each=npp) )
  d[, (vn) := as.data.table(r)]
  d
}
d <- lapply(R, convert2dt)

Accuracy of Bootstrap Calibration Estimates for \(n=200\)

Group the simulations into intervals of the SD of \(\hat{P}\) or of the minimum tail frequency (the smaller of the number of \(\hat{P}\) below and above the target value), and compute pseudo-medians of absolute calibration curve estimation error by calibration estimator, target \(\hat{P}\), and group.

The pseudo-median is the median of all possible pairwise means. It is more robust than the mean and is virtually as efficient as the mean, unlike the sample median.
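A brute-force version written directly from that definition is sketched below; the Hmisc pMedian function used in the following code computes it far more efficiently. Exact handling of self-pairs may differ slightly between the two.

xx    <- rnorm(100)
walsh <- outer(xx, xx, `+`) / 2                  # all pairwise averages
median(walsh[upper.tri(walsh, diag=TRUE)])       # pseudo-median by brute force
pMedian(xx)                                      # should closely agree
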
pacc <- function(d, stratby) {
  # Restructure to a single tall and thin data table
  w <- rbindlist(d, idcol='estimator')
  if(! missing(stratby)) {
    w[, let(z         = abs(estcal - ptrue),
            group     = eval(stratby) ) ]
    w[, prn(table(group))]
    wall <- copy(w)
    wall[, group := 'All']
    w <- rbind(w, wall)
    w <- w[, .(err = pMedian(z, na.rm=TRUE)), by=.(estimator, group, pp)]
    g <- ggplot(w, aes(x=pp, y=err, color=estimator)) +
      facet_wrap(~ group)
  } else{
    w[, let(z = abs(estcal - ptrue))]
    w <- w[, .(err = pMedian(z, na.rm=TRUE)), by=.(estimator, pp)]
    g <- ggplot(w, aes(x=pp, y=err, color=estimator))
  }
  g + geom_line() + 
    xlab(expression(hat(P))) + ylab('pMedian |Calibration Estimation Error|')
}
pacc(d, expression(cut2(sphat, c(.175, .2, .25)))) + labs(caption='By groups of SD(Phat), n=200')

table(group)

group
[0.146,0.175) [0.175,0.200) [0.200,0.250) [0.250,0.362] 
         2050         10400        133250        104300 

pacc(d, expression(cut2(count, 50))) + labs(caption='By groups of min(tail count), n=200')

table(group)

group
[  0, 50) [ 50,100] 
   134168    115832 

It is typically true that even in heavily overfitted models, predictions near the grand mean estimated risk are accurate. This is reflected in the above plots, which show the “no calibration” method of simply trusting the apparent predicted risks to have the lowest expected error in the middle of the risk distribution. But that approach is highly overoptimistic for risks in the tails.

Run the following if you want to see a tensor spline surface for \(x = \hat{P}\), \(y=\) SD of \(\hat{P}\).

require(mgcv)
for(nam in names(d)) {
  w <- d[[nam]]
  f <- gam(abs(estcal - ptrue) ~ te(pp, sphat), data=w)
  plot(f, scheme=2, main=nam)
}

Accuracy of Approximate Bootstrap Confidence Limits for Calibration, \(n=200\)

For each value of pp compute the proportion of simulations for which lower > ptrue and the proportion upper < ptrue (tail non-coverage proportions). Also compute overall error proportions.

plotit <- function(R, title) {
  left  <- R[, 'lower', ] > R[, 'ptrue', ]
  right <- R[, 'upper', ] < R[, 'ptrue', ]
  
  rmn <- function(x) round(mean(x, na.rm=TRUE), 3)
  cat('Overall non-coverage for', title, ':\tleft=', rmn(left),
      ' right=', rmn(right), '\n')

  left  <- apply(left , 1, mean, na.rm=TRUE)
  right <- apply(right, 1, mean, na.rm=TRUE)
  nc    <- c(left, right)
  phat  <- c(pp, pp)
  k     <- length(left)
  side  <- c(rep('left', k), rep('right', k))
  ggplot(mapping=aes(x=phat, y=nc, col=side)) + geom_line() + xlab(expression(hat(P))) + 
      ylab('Non-Coverage') + ylim(0, 0.1) + labs(caption=title)
  }
for(nam in .q(loess, linear, quadratic))
  print(plotit(R[[nam]], title=nam))
Overall non-coverage for loess :    left= 0.039  right= 0.036 

Overall non-coverage for linear :   left= 0.015  right= 0.02 

Overall non-coverage for quadratic :    left= 0.017  right= 0.018 

The desired tail non-coverage probability is 0.025, which is achieved for interior predicted probabilities. The coverage problems in the tails seem to be due to wild estimates (especially for loess) or extrapolation at rarer predicted probabilities, i.e., those closer to 0 or 1. Linear logistic calibration worked better than the more flexible calibration estimators. Note that the simulations were stacked in favor of linear calibration, since the true calibration curve is linear in the logit.

Calibration for \(n=800\) and \(n=1500\)

Try two larger sample sizes.

set.seed(8)
Ns <- c(800, 1500)
for(N in Ns) {
  fi <- paste0('cal', N, '.rds')
  if(file.exists(fi)) Rn <- readRDS(fi) else {
    Rn        <- vector('list', 5)
    names(Rn) <- c('loess', 'linear', 'quadratic', 'basic', 'apparent')
    for(est in 1 : 5) {
      cat('\nEstimator', est, '\n')
      .i. <- 0
      r <- replicate(1000, sim1(N, np, B=300, smoother=sm[[est]]))  # 91m
      Rn[[est]] <- r
    }
    saveRDS(Rn, fi)
  }
  if(N ==  800) R800  <- Rn
  if(N == 1500) R1500 <- Rn
}

Again compute accuracies of bias-corrected calibration estimates.

for(N in Ns) {
  Rn <- if(N == 800) R800 else R1500
  d <- lapply(Rn, convert2dt)
  cap <- paste0('By groups of SD(Phat), n=', N)
  g <- pacc(d, stratby=expression(cut2(sphat, g=4))) + labs(caption=cap)
  print(g)
  cap <- paste0('By groups of min(tail counts), n=', N)
  g <- pacc(d, stratby=expression(cut2(count, c(20, 50, 100)))) + labs(caption=cap)
  print(g)
}

table(group)

group
[0.168,0.208) [0.208,0.218) [0.218,0.227) [0.227,0.260] 
        62500         62500         62500         62500 


table(group)

group
[  0, 20) [ 20, 50) [ 50,100) [100,400] 
    27850     30174     36316    155660 


table(group)

group
[0.180,0.206) [0.206,0.213) [0.213,0.220) [0.220,0.250] 
        62500         62500         62500         62500 


table(group)

group
[  0, 20) [ 20, 50) [ 50,100) [100,750] 
    16819     22233     25206    185742 

Now display confidence interval non-coverage proportions.

for(N in Ns) {
  Rn <- if(N == 800) R800 else R1500
  for(nam in .q(loess, linear, quadratic))
    print(plotit(Rn[[nam]], title=paste0(nam, ', N=', N)))
}
Overall non-coverage for loess, N=800 : left= 0.017  right= 0.022 

Overall non-coverage for linear, N=800 :    left= 0.012  right= 0.008 

Overall non-coverage for quadratic, N=800 : left= 0.014  right= 0.015 

Overall non-coverage for loess, N=1500 :    left= 0.019  right= 0.018 

Overall non-coverage for linear, N=1500 :   left= 0.009  right= 0.01 

Overall non-coverage for quadratic, N=1500 :    left= 0.013  right= 0.011 

Increasing the sample size made coverage larger than 0.95, especially for linear and quadratic calibration. Perhaps it’s best to have this conservatism so that the method will not be anti-conservative for small sample sizes.

Summary

I considered approximate confidence bands for true model performance indexes and for entire calibration curves, as estimated by a bootstrap overfitting-corrected approach. The dual standard deviation approximate bootstrap confidence interval ABCLOC balances tail non-coverage probabilities by allowing the bootstrap distribution of estimated biases to be asymmetric. ABCLOC seems to provide accurate enough confidence limits for individual accuracy scores such as the Brier score, calibration slope, and to a lesser extent Somers’ \(D_{xy}\), and I have a similar expectation of good performance for other scalar indexes. For vector measures such as whole calibration curves evaluated over a sequence of predicted risks, ABCLOC can be conservative. ABCLOC intervals are useful, and are certainly better than having no uncertainty intervals. They require essentially no computation beyond the bootstrap repetitions already run for overfitting correction.

I hope that others will refine ABCLOC, and run more diverse simulations.

The choice of calibration curve estimator also matters, and an important point demonstrated by simulation is the difficulty of estimating calibration (not to mention its uncertainty) at very low or high predicted probabilities. The more flexible loess method has worse confidence interval coverage, especially for predicted risks far from the mean. When the true calibration curve is logit-linear, linear calibration estimates, fitted by running binary logistic regression with a sole predictor equal to the logit of predicted risk, are superior as expected. Quadratic calibration is a good compromise, and the performance of restricted cubic spline estimators with 4 knots should also be explored. An advantage of parametric calibration methods (linear, quadratic, spline, …) is that just the parameters (3 for quadratic) of the estimated calibration curve can be saved for each bootstrap fit, and one does not need to track the individual calibration curve points (50 in number in simulations presented here) as must be done for loess.
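To illustrate the parameter-saving idea (a sketch, not how calibrate() currently stores results), a quadratic-in-the-logit calibration can be reconstructed on any grid from its three saved coefficients; the function name is mine.

# k: 3 coefficients (intercept, slope, quadratic term) on the logit scale,
#    e.g. as estimated by qcal(..., deg=2) for one bootstrap fit
# p: grid of predicted probabilities at which to evaluate the calibration curve
eval_qcal <- function(k, p) {
  L <- qlogis(p)
  plogis(k[1] + k[2] * L + k[3] * L ^ 2)
}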

At present, the approach of saving only the calibration parameters is implemented in the validate() functions only for the calibration intercept and slope.

Computing Environment

grateful::cite_packages(pkgs='Session', output='paragraph', out.dir='.',
    cite.tidyverse=FALSE, omit=c('grateful', 'ggplot2'))

We used R version 4.5.1 (R Core Team 2025) and the following R packages: data.table v. 1.17.8 (Barrett et al. 2025), Hmisc v. 5.2.4 (Harrell Jr 2025a), rms v. 8.1.0 (Harrell Jr 2025b).

The code was run on macOS Sequoia 15.5 on a Macbook Pro M2 Max.

References

Barrett, Tyson, Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Toby Hocking, Benjamin Schwendinger, and Ivan Krylov. 2025. data.table: Extension of data.frame. https://doi.org/10.32614/CRAN.package.data.table.
Harrell Jr, Frank E. 2025a. Hmisc: Harrell Miscellaneous. https://hbiostat.org/R/Hmisc/.
———. 2025b. rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
