require(Hmisc)
require(rms) # Version 8.1-0 is required
require(ggplot2)
require(data.table)
options(prType='html')
simd <- function(n, p) {
  x <- matrix(rnorm(n * p), n, p)
  L <- x[, 1]
  y <- ifelse(runif(n) <= plogis(L), 1, 0)
  list(x=x, y=y)
}
Bootstrap Confidence Limits for Bootstrap Overfitting-Corrected Model Performance
Background
The goal here is strong internal validation after fitting a pre-specified regression model or one that was derived using backwards step-down variable selection such that the same variable selection procedure can be repeated afresh for each bootstrap repetition. So strong internal validation means estimating a variety of model performance measures in a way that does not reward them for overfitting and that penalizes for all aspects of model selection and derivation that utilized the outcome variable. The goal is estimating the likely future performance of the model on new data from the same stream as the data used in model development. A standard approach, as implemented in the R rms package's validate and calibrate functions, is the Efron-Gong optimism bootstrap discussed in detail in RMS.
A key omission from the rms package has been computation of uncertainty intervals for bootstrap overfitting-corrected measures. In 2021 Hisashi Noma et al. wrote a very useful paper entitled Confidence intervals of prediction accuracy measures for multivariable prediction models based on the bootstrap-based optimism correction methods. They studied confidence interval coverage (although only in a two-tailed combined sense) of a full two-stage double bootstrap procedure to compute confidence intervals for Efron-Gong overfitting (bias/optimism)-corrected model performance measures. This is a very computationally demanding but accurate procedure. They also studied an approximate one-stage procedure that does not add to the computational burden of computing the overfitting-adjusted measures themselves.
Recall that the form of the bootstrap optimism-corrected measure \(\tau\) for a single index such as a Brier score or rank correlation between predicted and observed outcomes is
\[\tau = \theta - (\bar{\theta}_{b} - \bar{\theta}_{w})\]
where \(\theta\) is the original index computed as usual on the whole sample (sometimes called the apparent index), \(\theta_b\) is the apparent performance in a bootstrap sample when model coefficients are fitted in the bootstrap sample, \(\theta_w\) is the performance of the bootstrap-fitted model on the original whole sample, and the horizontal bars represent averages over \(B\) bootstrap resamples. The estimated bias is \(\gamma = \bar{\theta}_{b} - \bar{\theta}_{w}\).
Noma et al.'s approximate bootstrap confidence intervals take the bootstrap distribution of \(\theta_{b}\), compute a confidence interval from it using the nonparametric bootstrap percentile method, and shift the resulting interval by the estimated bias \(\gamma\) (subtracting \(\gamma\) from both limits). They showed that in large samples, taking into account only the variation in \(\bar{\theta}_{b}\) is good enough, and that the overall confidence interval coverage was close to that of the expensive double bootstrap. But when the sample size is not large, considering only variation in \(\theta_b\) significantly underestimates uncertainties, and confidence interval coverage is far below the intended level (of 0.95, for example).
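To make the shifted percentile idea concrete, here is a minimal sketch. It assumes an apparent index theta, a vector theta_b of indexes of bootstrap fits evaluated on their own bootstrap samples, and a vector theta_w of the same fits evaluated on the original sample; the function name and arguments are illustrative only and are not part of rms.
# Illustrative sketch of the Noma et al. shifted percentile interval (not rms code)
# theta   : apparent index computed on the whole sample
# theta_b : indexes of bootstrap fits evaluated on their own bootstrap samples
# theta_w : indexes of the same fits evaluated on the original whole sample
shiftedPctl <- function(theta, theta_b, theta_w, conf=0.95) {
  gam <- mean(theta_b) - mean(theta_w)   # estimated optimism (bias)
  tau <- theta - gam                     # optimism-corrected index
  a   <- (1 - conf) / 2
  qu  <- unname(quantile(theta_b, c(a, 1 - a), na.rm=TRUE))
  c(corrected=tau, lower=qu[1] - gam, upper=qu[2] - gam)
}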
Fast Alternative Bootstrap Confidence Limits
To derive a fast approximate method that I expect to have better confidence coverage for non-large \(n\), consider the \(B\) individual bootstrap sample estimates \(\theta_{b}\) and \(\theta_{w}\) and compute variances of several combinations of them, including \(\text{var}(\theta_{b})\), \(\text{var}(\theta_{w})\), \(\text{var}(\gamma)\), variances of weighted sums of \(\theta_{b}\) and \(\theta_{w}\), and other variants. Variances of combinations involving both \(\theta_{b}\) and \(\theta_{w}\) take into account uncertainties in two quantities instead of one. One can think of the variance of a difference as a summative measure of two uncertainties, because the variance of a difference is the sum of the variances minus twice the covariance of the two. It is tempting to use the bootstrap nonparametric percentile interval, but we need to respect the width of the bootstrap distribution for the combination of \(\theta_{b}\) and \(\theta_{w}\) and pay no attention to its center. This pushes us to use normal-approximation bootstrap confidence intervals, which take an original point estimate (here a bootstrap bias-corrected index) \(\pm 1.96\times\) the bootstrap standard deviation, for example.
An additional complexity is that the bootstrap distribution may not be symmetric, and assuming that it is (as the standard deviation does) leads to symmetric confidence intervals instead of the needed asymmetric ones. The result is inaccurate confidence coverage in at least one of the tails. To fix that problem, we compute two standard deviations: the SD for upper values and the SD for lower ones. There is a question of where to make the high/low split. Traditionally the sample mean is chosen, but the mean is affected by asymmetry. So we could use the sample median, but this has lower precision. A nice remedy is to use the pseudo-median, i.e., the median of all possible pairwise averages. Each of the two confidence limits is computed using the square root of the average squared distance from the pseudo-median for observations in that limit’s side of the bootstrap distribution.
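As a rough sketch of what the dual-SD computation does (the dualSD function in Hmisc 5.2-4 is the real implementation; the version below is only an illustration of the idea, following the pseudo-median description above):
# Illustration only: directional ("dual") SDs of bootstrap values x around the pseudo-median
dualSDsketch <- function(x) {
  x  <- x[! is.na(x)]
  pm <- median(outer(x, x, "+") / 2)   # pseudo-median; Hmisc::pMedian is the efficient version
  c(bottom = sqrt(mean((x[x <= pm] - pm) ^ 2)),
    top    = sqrt(mean((x[x >= pm] - pm) ^ 2)))
}
# One limit then uses the bottom SD and the other the top SD, e.g.
# corrected index - qnorm(0.975) * (one SD)  and  corrected index + qnorm(0.975) * (the other SD)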
Next we attempt to find the best combination of variation of training and test indexes from bootstrap repetitions so that confidence interval coverage for overfitted-corrected indexes is good in both tails for three performance measures. The resulting bootstrap method will be called ABCLOC for asymmetric bootstrap confidence limits for overfitting-corrected model performance metrics.
Simulation to Derive ABCLOC to Obtain Accurate Confidence Coverage for Simple Performance Measures
Before proceeding to a more complex situation of studying accuracy of confidence intervals for estimated overfitting-corrected calibration curves, let's study the performance of ABCLOC confidence limits for standard scalar predictive accuracy measures: Somers' \(D_{xy}\) (equal to \(2\times (c - \frac{1}{2})\) where \(c\) is the concordance probability or area under the ROC curve), the calibration slope on the logit scale, and the Brier score. The R rms package's validate.lrm function uses, among other methods, the Efron-Gong optimism bootstrap for debiasing the apparent performance indexes to obtain overfitting-corrected scores estimating likely future performance of the model for these three and other indexes.
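For orientation, basic use looks like the following sketch; the data frame d and predictors x1, x2 are placeholders, not objects used elsewhere in this post.
# Illustrative call; x=TRUE, y=TRUE store the design matrix and response for resampling
f <- lrm(y ~ x1 + x2, data=d, x=TRUE, y=TRUE)
validate(f, B=300)   # Efron-Gong optimism bootstrap; rows include Dxy, Slope, and B (Brier score)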
The simd function defined in the setup code above simulates a matrix of p predictors for a sample of size \(n\), and a sample of size \(n\) of binary responses from that predictor matrix, using a true model in which the only non-zero regression coefficient among the \(p\) predictors is \(\beta_{1} = 1\).
In what follows let \(n=200\) be the sample size and \(p=15\) be the number of predictors. This is a setting with major overfitting.
To check confidence interval coverage we need to know the true Brier score for our data generating model. This can be well estimated by simulating a single sample of 200,000 subjects, and computing the Brier score for many models fitted on 200 subjects, averaging their Brier scores.
set.seed(1)
N    <- 200
np   <- 15
Nbig <- 200000
big  <- simd(Nbig, np)
big  <- data.frame(x = I(big$x), y=big$y)   # I keeps matrix x as a matrix
Y    <- big$y
nsim <- 5000
Simulate 5000 samples of size 200. For each sample run 300 bootstrap resamples to compute approximate confidence limits. Also compute the true performance measures for each sample, to average.
# Define a function to generate and analyze one dataset
sim1 <- function(n, p, B=300) {
  .i. <<- .i. + 1
  cat(.i., '', file='/tmp/z', append=TRUE)
  d    <- simd(n, np)
  f    <- lrm(y ~ x, data=d, x=TRUE, y=TRUE)
  phat <- predict(f, newdata=big, type='fitted')
  trueB     <- mean((phat - Y) ^ 2)
  trueDxy   <- somers2(phat, Y)['Dxy']
  trueSlope <- lrm.fit(qlogis(phat), Y)$coef[2]
  # saveraw is passed to predab.resample in rms 8.1-0 to cause indexes
  # from individual bootstrap resamples to be saved in a global object .predab_raw.
  v <- validate(f, B=B, saveraw=TRUE)
  r <- .predab_raw.
  w <- which(names(r$orig) %in% c('Dxy', 'Slope', 'B'))
  r$orig   <- r$orig[w]
  r$btrain <- r$btrain[, w]
  r$btest  <- r$btest [, w]
  praw[[.i.]] <<- r
  v <- v[w, c('index.corrected', 'Lower', 'Upper')]
  v <- cbind(v, true=c(trueDxy, trueSlope, trueB))
  v
}
.i.  <- 0
praw <- NULL
set.seed(1)
if(file.exists('indexes.rds')) {
  s    <- readRDS('indexes.rds')
  praw <- readRDS('praw.rds')
} else {
  s <- replicate(nsim, sim1(n=N, p=np))
  saveRDS(s,    'indexes.rds')
  saveRDS(praw, 'praw.rds')
}
# Compute means across repetitions
rn <- function(x) round(x, 3)
rn(apply(s, 1:2, mean))
index.corrected Lower Upper true
Dxy 0.426 0.306 0.545 0.404
Slope 0.680 0.445 0.915 0.636
B 0.224 0.190 0.258 0.226
The average overfitting-corrected Brier score estimate closely tracks the average of all the true Brier scores. The optimism bootstrap overestimated the performance on the \(D_{xy}\) and calibration slope scales. The positive bias is 0.02 for \(D_{xy}\) and 0.04 for the calibration slope.
Next examine tail non-coverage probabilities for a variety of ways of utilizing the training and test indexes over the 300 resamples drawn for each simulated dataset. This is done separately for each of the 3 performance indexes.
# Add true indexes to praw
raw <- praw
for(i in 1 : nsim) raw[[i]]$true <- s[, 'true', i]
# Function applied to each element w of raw, for bootstrap type and basis and
# index number idx
# Result is a pair of logical values left and right with TRUE indicating that
# the true index is outside that tail
h <- function(w, what=c('error', 'cltrue')) {
  what  <- match.arg(what)
  true  <- w$true  [idx]
  train <- w$btrain[, idx]
  test  <- w$btest [, idx]
  opt   <- train - test
  mopt  <- mean(opt)
  theta <- w$orig[idx]
  x <- switch(basis,
              train = train,
              test  = test,
              opt   = opt,
              wtd   = 2 * train - test,
              wtd2  = 1.5 * train - test,
              wtd3  = train - 0.75 * test,
              wtd4  = train - 1.25 * test,
              wtd5  = train - 1.5  * test)
  if(type == 'shifted nonpar') {
    qu <- quantile(x, c(0.025, 0.975), na.rm=TRUE)
    r  <- c(qu[1] - mopt, qu[2] - mopt)
  } else if(substr(type, 1, 2) == 'sd') {
    if(basis == 'v2') {
      v1 <- var(train)
      v2 <- var(train - test)
      s  <- sqrt(v1 + v2 - 0.5 * sqrt(v1 * v2))
      r  <- c(theta - mopt - s * z, theta - mopt + s * z)
    } else {   # dualSD is in Hmisc 5.2-4
      s <- dualSD(x, nmin=if(type == 'sd1') 100000 else 10)
      a <- if(type == 'sd2') 1:2 else 2:1
      r <- c(theta - mopt - s[a[1]] * z, theta - mopt + s[a[2]] * z)
    }
  }
  if(what == 'cltrue') return(c(left=r[1], right=r[2], true=true))
  r <- c(r[1] > true, r[2] < true)
  names(r) <- .q(left, right)
  r
}
z <- qnorm(0.975)
# validate used wtd4 sd2rev; check against that for Brier score
if(FALSE) {
  idx   <- 3
  type  <- 'sd2rev'
  basis <- 'wtd4'
  for(i in c(1, nsim / 2, nsim)) {
    cat('i=', i, '\n')
    print(h(raw[[i]], 'cltrue'))
    print(s[3,,i])
  }
}

types <- c('shifted nonpar', 'sd1', 'sd2', 'sd2rev')
bases <- c('train', 'test', 'opt', 'wtd', 'wtd2', 'wtd3', 'wtd4', 'wtd5', 'v2')
D <- vector('list', 3)
for(idx in 1 : 3) {
  idxname <- c('Dxy', 'Slope', 'B')[idx]
  cat('\n-------------------------------------------------------------------\nIndex: ',
      if(idxname == 'B') 'Brier' else idxname, '\n\n', sep='')
  # Analyze simulation results separately for each confidence interval construction method
  # For each simulation compute the components of confidence intervals
  # Then compute the per-simulation limits
  # Then compute for that simulation whether a non-coverage has occurred per tail
  d <- NULL
  for(type in types) {
    for(basis in bases) {
      if(type == 'shifted nonpar' && basis %nin% c('train', 'test')) next
      if(type != 'sd1' && basis == 'v2') next
      w    <- sapply(raw, h)
      err  <- apply(w, 1, mean, na.rm=TRUE)
      dist <- sum(abs(err - 0.025))
      d    <- rbind(d, data.frame(type, basis, left=rn(err[1]), right=rn(err[2]),
                                  coverage=rn(1 - sum(err)), dist=rn(dist)))
    }
  }
  row.names(d) <- NULL
  print(d)
  D[[idx]] <- d
}
-------------------------------------------------------------------
Index: Dxy
type basis left right coverage dist
1 shifted nonpar train 0.350 0.001 0.649 0.349
2 shifted nonpar test 0.122 0.474 0.404 0.546
3 sd1 train 0.124 0.033 0.843 0.107
4 sd1 test 0.355 0.156 0.489 0.461
5 sd1 opt 0.122 0.035 0.842 0.108
6 sd1 wtd 0.010 0.001 0.989 0.039
7 sd1 wtd2 0.043 0.006 0.951 0.037
8 sd1 wtd3 0.128 0.037 0.836 0.114
9 sd1 wtd4 0.114 0.032 0.853 0.097
10 sd1 wtd5 0.103 0.028 0.870 0.080
11 sd1 v2 0.073 0.017 0.910 0.057
12 sd2 train 0.111 0.034 0.855 0.095
13 sd2 test 0.311 0.190 0.498 0.452
14 sd2 opt 0.115 0.036 0.848 0.102
15 sd2 wtd 0.007 0.001 0.992 0.042
16 sd2 wtd2 0.038 0.006 0.956 0.031
17 sd2 wtd3 0.118 0.039 0.843 0.107
18 sd2 wtd4 0.111 0.032 0.857 0.093
19 sd2 wtd5 0.104 0.026 0.869 0.081
20 sd2rev train 0.136 0.032 0.833 0.117
21 sd2rev test 0.403 0.128 0.470 0.480
22 sd2rev opt 0.129 0.034 0.837 0.113
23 sd2rev wtd 0.014 0.001 0.985 0.035
24 sd2rev wtd2 0.049 0.005 0.946 0.043
25 sd2rev wtd3 0.137 0.035 0.829 0.121
26 sd2rev wtd4 0.117 0.032 0.852 0.098
27 sd2rev wtd5 0.103 0.029 0.868 0.082
-------------------------------------------------------------------
Index: Slope
type basis left right coverage dist
1 shifted nonpar train 0.661 0.339 0.000 0.950
2 shifted nonpar test 0.000 0.508 0.492 0.508
3 sd1 train 0.661 0.339 0.000 0.950
4 sd1 test 0.065 0.045 0.890 0.060
5 sd1 opt 0.065 0.045 0.890 0.060
6 sd1 wtd 0.065 0.045 0.890 0.060
7 sd1 wtd2 0.065 0.045 0.890 0.060
8 sd1 wtd3 0.170 0.076 0.754 0.196
9 sd1 wtd4 0.015 0.026 0.958 0.011
10 sd1 wtd5 0.002 0.015 0.983 0.033
11 sd1 v2 0.065 0.045 0.890 0.060
12 sd2 train 0.661 0.339 0.000 0.950
13 sd2 test 0.090 0.037 0.873 0.077
14 sd2 opt 0.047 0.056 0.896 0.054
15 sd2 wtd 0.047 0.056 0.896 0.054
16 sd2 wtd2 0.047 0.056 0.896 0.054
17 sd2 wtd3 0.143 0.092 0.765 0.185
18 sd2 wtd4 0.009 0.035 0.955 0.026
19 sd2 wtd5 0.002 0.021 0.977 0.027
20 sd2rev train 0.661 0.339 0.000 0.950
21 sd2rev test 0.047 0.056 0.896 0.054
22 sd2rev opt 0.090 0.037 0.873 0.077
23 sd2rev wtd 0.090 0.037 0.873 0.077
24 sd2rev wtd2 0.090 0.037 0.873 0.077
25 sd2rev wtd3 0.207 0.066 0.727 0.223
26 sd2rev wtd4 0.026 0.019 0.955 0.006
27 sd2rev wtd5 0.004 0.010 0.986 0.036
-------------------------------------------------------------------
Index: Brier
type basis left right coverage dist
1 shifted nonpar train 0.001 0.303 0.696 0.302
2 shifted nonpar test 0.626 0.060 0.315 0.635
3 sd1 train 0.047 0.067 0.886 0.064
4 sd1 test 0.212 0.246 0.542 0.408
5 sd1 opt 0.016 0.045 0.938 0.029
6 sd1 wtd 0.000 0.001 0.999 0.049
7 sd1 wtd2 0.001 0.009 0.989 0.039
8 sd1 wtd3 0.025 0.051 0.924 0.026
9 sd1 wtd4 0.010 0.038 0.952 0.028
10 sd1 wtd5 0.006 0.031 0.963 0.025
11 sd1 v2 0.009 0.026 0.965 0.017
12 sd2 train 0.036 0.074 0.889 0.061
13 sd2 test 0.257 0.208 0.535 0.415
14 sd2 opt 0.009 0.058 0.933 0.049
15 sd2 wtd 0.000 0.003 0.997 0.047
16 sd2 wtd2 0.001 0.013 0.986 0.036
17 sd2 wtd3 0.015 0.066 0.919 0.051
18 sd2 wtd4 0.004 0.051 0.944 0.047
19 sd2 wtd5 0.002 0.044 0.954 0.043
20 sd2rev train 0.061 0.060 0.879 0.071
21 sd2rev test 0.170 0.287 0.542 0.408
22 sd2rev opt 0.030 0.034 0.936 0.014
23 sd2rev wtd 0.000 0.001 0.999 0.049
24 sd2rev wtd2 0.004 0.006 0.989 0.039
25 sd2rev wtd3 0.037 0.042 0.921 0.029
26 sd2rev wtd4 0.025 0.028 0.947 0.004
27 sd2rev wtd5 0.016 0.021 0.963 0.013
For each bootstrap confidence interval method and each performance index, compute the sum of two distances from the target tail probabilities of 0.025, and then sum these distances over the three indexes and see which method has the best overall confidence interval coverage accuracy.
# Sum distances over the 3 indexes
di <- D[[1]]$dist + D[[2]]$dist + D[[3]]$dist
cbind(distance=di, rank=rank(di))
distance rank
[1,] 1.601 26.0
[2,] 1.689 27.0
[3,] 1.121 24.0
[4,] 0.929 20.0
[5,] 0.197 14.0
[6,] 0.148 9.0
[7,] 0.136 5.5
[8,] 0.336 17.0
[9,] 0.136 5.5
[10,] 0.138 7.0
[11,] 0.134 4.0
[12,] 1.106 23.0
[13,] 0.944 22.0
[14,] 0.205 16.0
[15,] 0.143 8.0
[16,] 0.121 2.0
[17,] 0.343 18.0
[18,] 0.166 13.0
[19,] 0.151 10.0
[20,] 1.138 25.0
[21,] 0.942 21.0
[22,] 0.204 15.0
[23,] 0.161 12.0
[24,] 0.159 11.0
[25,] 0.373 19.0
[26,] 0.108 1.0
[27,] 0.131 3.0
The winner (rank of 1) is sd2rev wtd4, which computes standard deviations of the 300 bootstrap re-sampled values of training - 1.25 \(\times\) test indexes, where training denotes apparent performance on the bootstrap sample and test denotes the bootstrap fit's performance on the original whole sample. The simple SDs of the 300 estimates of bias (optimism; the sd* types with basis opt) had overall ranks of 14, 16, and 15. It stands to reason that the best quantity on which to compute SDs puts more weight on the test sample indexes than on the training sample indexes, because there is less variation in performance of bootstrap fits on the original sample over simulated datasets.
sd2rev refers to using the "top SD" for the lower confidence limit and the "bottom SD" for the upper confidence limit. The distance rank when using the ordinary single SD was 5.5, and the rank when the two directional SDs were paired the other way (bottom SD for the lower limit, top SD for the upper limit) was 13. The standard parametric bootstrap using approximate normality does not handle asymmetric sampling distributions, but bottom and top SDs do. Better performance using two SDs indicates asymmetry of the sampling distributions of the indexes.
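Putting the pieces together, the ABCLOC recipe for a single index can be sketched as follows, using the same theta, btrain, and btest notation as in the h function above and dualSD from Hmisc. This mirrors the wtd4/sd2rev combination that validate uses, but is only an illustrative sketch, not the exact rms internals.
# Sketch of ABCLOC for one index
# theta : apparent index;  btrain, btest : the B per-resample training and test indexes
abcloc <- function(theta, btrain, btest, z=qnorm(0.975)) {
  tau <- theta - mean(btrain - btest)             # optimism-corrected index
  s   <- dualSD(btrain - 1.25 * btest, nmin=10)   # bottom and top SDs, per the description above
  c(corrected=tau,
    lower=tau - s[2] * z,   # sd2rev: top SD sets the lower limit
    upper=tau + s[1] * z)   #         bottom SD sets the upper limit
}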
Coverage for calibration slope and Brier score was estimated to be 0.955 and 0.947 for the best overall performing method. It did not perform so well for \(D_{xy}\), giving only 0.852 coverage. \(D_{xy}\) has a somewhat degenerate distribution when computed on regression fits, which typically do not allow \(D_{xy} < 0\), so normality is usually not a good fit for \(D_{xy}\).
Accuracy of Estimated Calibration Curves and Confidence Limits
Let’s study how well the overall winning bootstrap confidence interval estimator for three regular performance indexes fares when computing confidence bands for a vector of parameters, bootstrap overfitting-corrected calibration curves. Let’s fix the grid of 50 equally spaced points for which the calibration curve is estimated, using the 0.02 and 0.98 quantiles of pooled distributions of predicted probabilities for 10 repeated model fits. The simulations we are about to run are similar to those for the 3 statistical indexes, but now we have 50 indexes.
A histogram of the pooled distribution shows that the entire range of predicted probabilities is represented by the chosen logistic model and distribution of \(x_1\).
set.seed(1)
phat <- replicate(10, {
  d <- simd(N, np)
  f <- lrm(y ~ x, data=d)
  predict(f, type='fitted')
})
round(apply(phat, 2, quantile, c(0, 0.02, 0.98, 1)), 2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
0% 0.03 0.03 0.01 0.07 0.03 0.02 0.08 0.02 0.02 0.05
2% 0.07 0.09 0.04 0.09 0.07 0.06 0.12 0.07 0.06 0.07
98% 0.95 0.94 0.96 0.90 0.93 0.94 0.84 0.86 0.90 0.93
100% 0.99 0.96 0.99 0.97 0.98 0.98 0.86 0.95 0.97 0.97
hist(phat, nclass=50, main=NULL, xlab=expression(hat(P)))
# On average for n=200, request calibrated probabilities for 5th to 196th
# largest observed predicted risks, which is pushing things a bit
r <- quantile(phat, probs=c(0.02, 0.98))
cat('Limits on predicted probabilities for which calibration curve is estimated:',
    round(r, 3), '\n')
Limits on predicted probabilities for which calibration curve is estimated: 0.075 0.933
pp <- seq(r[1], r[2], length=50)
For a given logistic model fit f on dataset d, compute the true calibration curve at pp. The calibration curve is a set of points \((\hat{P}, P)\), where \(\hat{P}\) are predicted probabilities and \(P\) are true probabilities that \(Y=1\) given \(\hat{P}\). The points are fitted using least squares in the logit (the true calibration is linear in the logits by design) to average over variation in \(X_{2}, \ldots, X_{15}\) that yields unnecessary variation in \(\hat{P}\), arising from giving the last \(p - 1\) (noise) \(X\)'s a chance to get nonzero \(\hat{\beta}\)'s. So the true calibration curve is an average over covariate distributions, conditioning on the covariates only through conditioning on \(\hat{P}\).
truecal <- function(f, d, pl=FALSE) {
  Lhat  <- predict(f)
  Ltrue <- d$x[, 1]
  phat  <- plogis(Lhat)
  int_slope <- lm.fit.qr.bare(Lhat, Ltrue)$coefficients
  ptrue <- plogis(Ltrue)
  if(pl) plot(phat, ptrue, col='gray80')
  plogis(int_slope[1] + int_slope[2] * qlogis(pp))
}
# Test the function
par(mfrow=c(3,3), mar=c(2,2,0,0))
set.seed(3)
g <- function() {
  d    <- simd(N, np)
  f    <- lrm(y ~ x, data=d)
  tcal <- truecal(f, d, pl=TRUE)
  lines(pp, tcal)
  abline(a=0, b=1, col='gray80')
  coef(f)
}
round(replicate(9, g()), 2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
Intercept 0.28 0.00 -0.02 0.07 0.13 -0.07 0.02 0.24 -0.24
x[1] 0.86 1.13 0.98 1.10 1.28 1.10 0.93 0.77 1.27
x[2] -0.10 0.02 0.00 0.29 0.07 0.21 0.01 -0.03 0.15
x[3] 0.18 0.28 0.04 0.42 0.24 0.24 0.09 -0.19 -0.43
x[4] -0.06 0.22 0.21 -0.10 -0.17 -0.05 0.21 0.54 0.16
x[5] -0.12 0.01 0.08 0.02 -0.12 -0.23 0.20 0.12 0.35
x[6] -0.07 -0.09 0.22 0.06 -0.53 0.16 0.44 0.05 -0.21
x[7] -0.43 -0.12 0.09 0.14 -0.15 -0.32 0.15 0.06 -0.31
x[8] 0.12 -0.18 -0.09 0.00 0.12 0.38 -0.15 -0.19 0.07
x[9] -0.07 0.12 -0.25 -0.21 -0.12 -0.07 -0.03 -0.09 -0.27
x[10] -0.22 -0.04 0.08 0.03 0.03 0.12 0.00 0.30 0.19
x[11] -0.12 0.03 0.09 -0.05 -0.03 0.01 0.03 0.16 0.38
x[12] 0.06 0.09 -0.26 -0.11 -0.34 0.03 -0.01 -0.17 -0.03
x[13] 0.15 -0.19 -0.15 0.34 0.21 0.06 -0.18 0.09 -0.28
x[14] 0.06 0.26 0.07 0.29 -0.01 -0.22 -0.23 -0.03 0.04
x[15] 0.00 -0.07 0.01 0.01 0.42 0.17 -0.01 -0.22 -0.01
The fact that the circles are sometimes far from the curves implies that an “average” true calibration curve is hiding some false variation in predicted risks due to variation in the extraneous predictors. In other words, many different predicted values (depending on the noise predictors) can give rise to one true probability. To have a gold standard to compare with bootstrap estimates we must take averages.
The default calibration estimator used in calibrate() is loess, using the R lowess function with outlier detection turned off (iter=0).
Define alternative parametric linear and quadratic (in the logit) estimators.
qcal <- function(x, y, deg=2) {
  # Curtail predicted probabilities to [0.001, 0.999]
  x <- pmax(0.001, pmin(x, 0.999))
  L <- qlogis(x)
  f <- lrm.fit(if(deg == 2) cbind(L, L^2) else L, y, compstats=FALSE)
  k <- coef(f)
  pgrid <- seq(min(x), max(x), length=200)
  Lgrid <- qlogis(pgrid)
  cgrid <- k[1] + k[2] * Lgrid
  if(deg == 2) cgrid <- cgrid + k[3] * Lgrid^2
  cal <- plogis(cgrid)
  list(x=pgrid, y=cal)
}
Test the linear and quadratic calibrations in a setting where the calibration curve is known to be the line of identity, and also include loess estimates (red). Linear logistic calibration uses a black line, and quadratic a blue line.
par(mfrow=c(3, 3), mar=c(2, 2, 0, 0))
set.seed(1)
for(i in 1 : 9) {
  x  <- runif(N)
  y  <- ifelse(runif(N) <= x, 1, 0)
  w1 <- qcal(x, y, deg=1)
  plot(w1, type='l', xlim=c(0, 1), ylim=c(0, 1))
  w2 <- qcal(x, y, deg=2)
  lines(w2, col='blue')
  l  <- lowess(x, y, iter=0)
  lines(l, col='red')
  abline(a=0, b=1, col='grey80')
}
Define a function that draws one sample, fits a binary logistic model, and computes bootstrap confidence bands for the calibration curve at a regular sequence of predicted probabilities pp. Add the true calibration curve to the output, alongside \(\min(a, b)\), where \(a\) is the number of observed predicted risks that are less than the current value of pp and \(b\) is the number of \(\hat{P}\) that are greater than the value of pp. In addition, save the standard deviation of \(\hat{P}\). This is done for one calibration estimator. A special case is smoother='basic', which stands for using the validate function to get bias-corrected intercept and slope estimates and to convert them to linear (in the logit) calibration curves, without confidence limits. Finally, smoother='apparent' does not use bootstrapping and instead takes the \(\hat{P}\) to (falsely) need no calibration, to gauge the accuracy of uncalibrated estimates.
# Use the following if you want to count "close" observations
# pptab <- data.table(pp, low = pp - 0.05, hi = pp + 0.05)
pptab <- data.table(ppgrid = pp)
sim1 <- function(n, p, B=300, smoother='lowess', pl=FALSE) {
  .i. <<- .i. + 1
  cat(.i., '', file='/tmp/z', append=TRUE)
  d     <- simd(n, p)
  f     <- lrm(y ~ x, data=d, x=TRUE, y=TRUE)
  phat  <- predict(f, type='fitted')
  ptrue <- truecal(f, d)
  if(is.character(smoother) && smoother != 'lowess') {
    switch(smoother,
           basic = {
             val       <- validate(f, B=B)[, 'index.corrected']
             int_slope <- val[c('Intercept', 'Slope')]
             estcal    <- plogis(int_slope[1] + int_slope[2] * qlogis(pp)) },
           apparent = {
             estcal <- pp } )
    lower <- NA
    upper <- NA
    if(pl) {
      plot(pp, estcal, type='l', xlim=c(0,1), ylim=c(0,1))
      lines(pp, ptrue, col='blue')
    }
  } else {
    bootcal <- calibrate(f, B=B, predy=pp, smoother=smoother)
    estcal  <- bootcal[, 'calibrated.corrected']
    lower   <- pp + bootcal[, 'Lower']
    upper   <- pp + bootcal[, 'Upper']
    if(pl) {
      plot(bootcal, ylim=c(0,1))
      lines(pp, ptrue, col='blue')
    }
  }
  d  <- data.table(phat)
  lo <- d[pptab, on = .(phat <= ppgrid), .N, by=.EACHI][, N]
  hi <- d[pptab, on = .(phat >= ppgrid), .N, by=.EACHI][, N]
  # count <- d[pptab, on = .(phat >= low, phat <= hi), .N, by = .EACHI][, N]
  invisible(cbind(pp, estcal, ptrue, lower, upper, count=pmin(lo, hi), sphat=sd(phat)))
}
Run 9 tests for the default loess calibration estimator, plotting the confidence bands. The blue curve is the true calibration curve. In the upper left of each panel show the number of observations with low and high predicted probabilities.
set.seed(9)
par(mfrow=c(3,3), mar=c(2, 2, 0, 0))
for(j in 1 : 9) sim1(N, np, pl=TRUE)
n=200 Mean absolute error=0.047 Mean squared error=0.00321
0.9 Quantile of absolute error=0.101
n=200 Mean absolute error=0.055 Mean squared error=0.00358
0.9 Quantile of absolute error=0.081
n=200 Mean absolute error=0.065 Mean squared error=0.0046
0.9 Quantile of absolute error=0.087
n=200 Mean absolute error=0.05 Mean squared error=0.0031
0.9 Quantile of absolute error=0.081
n=200 Mean absolute error=0.055 Mean squared error=0.00338
0.9 Quantile of absolute error=0.077
n=200 Mean absolute error=0.056 Mean squared error=0.00415
0.9 Quantile of absolute error=0.114
n=200 Mean absolute error=0.042 Mean squared error=0.00206
0.9 Quantile of absolute error=0.067
n=200 Mean absolute error=0.041 Mean squared error=0.00258
0.9 Quantile of absolute error=0.091
n=200 Mean absolute error=0.042 Mean squared error=0.00225
0.9 Quantile of absolute error=0.08
Run 1000 simulations using all three calibration curve estimators. For each dataset simulated, record the number of observations with low/high predictions. Also compute the more direct estimated bias-corrected linear calibration curve by using the validate function to debias the intercept and slope, computed from bootstrapping that is separate from the calibrate run.
sm <- list('lowess',
           function(x, y) qcal(x, y, deg=1),
           function(x, y) qcal(x, y, deg=2),
           'basic',
           'apparent')
set.seed(7)
if(file.exists('cal.rds')) R <- readRDS('cal.rds') else {
  R <- vector('list', 5)
  names(R) <- c('loess', 'linear', 'quadratic', 'basic', 'apparent')
  for(est in 1 : 5) {
    cat('\nEstimator', est, '\n')
    .i. <- 0
    r   <- replicate(1000, sim1(N, np, B=300, smoother=sm[[est]]))   # 44m
    R[[est]] <- r
  }
  saveRDS(R, 'cal.rds')
}
Restructure the results into a list of data tables.
convert2dt <- function(r) {
  npp  <- length(pp)
  nsim <- dim(r)[3]
  vn   <- dimnames(r)[[2]]
  # Restructure array so that pp and simulation # move fastest
  r <- aperm(r, c(1, 3, 2))
  # Stack into a taller matrix
  r <- matrix(r, nrow = npp * nsim, ncol=length(vn))
  d <- data.table(pp  = rep(pp, times=nsim),
                  sim = rep(1 : nsim, each=npp) )
  d[, (vn) := as.data.table(r)]
  d
}
d <- lapply(R, convert2dt)
Accuracy of Bootstrap Calibration Estimates for \(n=200\)
Group the number of points within 0.05 of the target estimated risk \(\hat{P}\) into intervals and compute pseudo-medians of absolute calibration curve estimation error by calibration estimator, target \(\hat{P}\), and group, where groups are intervals of \(\hat{P}\) SDs or minimum tail frequencies.
pacc <- function(d, stratby) {
  # Restructure to a single tall and thin data table
  w <- rbindlist(d, idcol='estimator')
  if(! missing(stratby)) {
    w[, let(z     = abs(estcal - ptrue),
            group = eval(stratby) ) ]
    w[, prn(table(group))]
    wall <- copy(w)
    wall[, group := 'All']
    w <- rbind(w, wall)
    w <- w[, .(err = pMedian(z, na.rm=TRUE)), by=.(estimator, group, pp)]
    g <- ggplot(w, aes(x=pp, y=err, color=estimator)) +
      facet_wrap(~ group)
  } else {
    w[, let(z = abs(estcal - ptrue))]
    w <- w[, .(err = pMedian(z, na.rm=TRUE)), by=.(estimator, pp)]
    g <- ggplot(w, aes(x=pp, y=err, color=estimator))
  }
  g + geom_line() +
    xlab(expression(hat(P))) + ylab('pMedian |Calibration Estimation Error|')
}
pacc(d, expression(cut2(sphat, c(.175, .2, .25)))) + labs(caption='By groups of SD(Phat), n=200')
table(group)
group
[0.146,0.175) [0.175,0.200) [0.200,0.250) [0.250,0.362]
2050 10400 133250 104300
pacc(d, expression(cut2(count, 50))) + labs(caption='By groups of min(tail count), n=200')
table(group)
group
[ 0, 50) [ 50,100]
134168 115832
It is typically true that even in heavily overfitted models, predictions that are near the grand mean estimated risk are accurate. This is reflected in the above plots which show the “no calibration” method of just trusting the apparent predicted risks to have the lowest expected error in the middle of the risk distribution. But it is highly overoptimistic for outer risks.
Run the following if you want to see a tensor spline surface for \(x = \hat{P}\), \(y=\) SD of \(\hat{P}\).
require(mgcv)
for(nam in names(d)) {
  w <- d[[nam]]
  f <- gam(abs(estcal - ptrue) ~ te(pp, sphat), data=w)
  plot(f, scheme=2, main=nam)
}
Accuracy of Approximate Bootstrap Confidence Limits for Calibration, \(n=200\)
For each value of pp compute the proportion of simulations for which lower > ptrue and the proportion for which upper < ptrue (tail non-coverage proportions). Also compute overall error proportions.
plotit <- function(R, title) {
  left  <- R[, 'lower', ] > R[, 'ptrue', ]
  right <- R[, 'upper', ] < R[, 'ptrue', ]

  rmn <- function(x) round(mean(x, na.rm=TRUE), 3)
  cat('Overall non-coverage for', title, ':\tleft=', rmn(left),
      ' right=', rmn(right), '\n')
  left  <- apply(left , 1, mean, na.rm=TRUE)
  right <- apply(right, 1, mean, na.rm=TRUE)
  nc    <- c(left, right)
  phat  <- c(pp, pp)
  k     <- length(left)
  side  <- c(rep('left', k), rep('right', k))
  ggplot(mapping=aes(x=phat, y=nc, col=side)) + geom_line() + xlab(expression(hat(P))) +
    ylab('Non-Coverage') + ylim(0, 0.1) + labs(caption=title)
}
for(nam in .q(loess, linear, quadratic))
  print(plotit(R[[nam]], title=nam))
Overall non-coverage for loess : left= 0.039 right= 0.036
Overall non-coverage for linear : left= 0.015 right= 0.02
Overall non-coverage for quadratic : left= 0.017 right= 0.018
The desired non-coverage probability is 0.025, which is achieved for interior predicted probabilities. The problem with coverage probabilities in the tails seems to be due to wild estimates (especially for loess) or extrapolations at more rare predicted probabilities, i.e., those closer to 0 or 1. Linear logistic calibration worked better than the more flexible calibration estimators. Note that the simulations were stacked in favor of linear calibration.
Calibration for \(n=800\) and \(n=1500\)
Try two larger sample sizes.
set.seed(8)
Ns <- c(800, 1500)
for(N in Ns) {
  fi <- paste0('cal', N, '.rds')
  if(file.exists(fi)) Rn <- readRDS(fi) else {
    Rn <- vector('list', 5)
    names(Rn) <- c('loess', 'linear', 'quadratic', 'basic', 'apparent')
    for(est in 1 : 5) {
      cat('\nEstimator', est, '\n')
      .i. <- 0
      r   <- replicate(1000, sim1(N, np, B=300, smoother=sm[[est]]))   # 91m
      Rn[[est]] <- r
    }
    saveRDS(Rn, fi)
  }
  if(N == 800)  R800  <- Rn
  if(N == 1500) R1500 <- Rn
}
Again compute accuracies of bias-corrected calibration estimates.
for(N in Ns) {
  Rn  <- if(N == 800) R800 else R1500
  d   <- lapply(Rn, convert2dt)
  cap <- paste0('By groups of SD(Phat), n=', N)
  g   <- pacc(d, stratby=expression(cut2(sphat, g=4))) + labs(caption=cap)
  print(g)
  cap <- paste0('By groups of min(tail counts), n=', N)
  g   <- pacc(d, stratby=expression(cut2(count, c(20, 50, 100)))) + labs(caption=cap)
  print(g)
}
table(group)
group
[0.168,0.208) [0.208,0.218) [0.218,0.227) [0.227,0.260]
62500 62500 62500 62500
table(group)
group
[ 0, 20) [ 20, 50) [ 50,100) [100,400]
27850 30174 36316 155660
table(group)
group
[0.180,0.206) [0.206,0.213) [0.213,0.220) [0.220,0.250]
62500 62500 62500 62500
table(group)
group
[ 0, 20) [ 20, 50) [ 50,100) [100,750]
16819 22233 25206 185742
Now display confidence interval non-coverage proportions.
for(N in Ns) {
  Rn <- if(N == 800) R800 else R1500
  for(nam in .q(loess, linear, quadratic))
    print(plotit(Rn[[nam]], title=paste0(nam, ', N=', N)))
}
Overall non-coverage for loess, N=800 : left= 0.017 right= 0.022
Overall non-coverage for linear, N=800 : left= 0.012 right= 0.008
Overall non-coverage for quadratic, N=800 : left= 0.014 right= 0.015
Overall non-coverage for loess, N=1500 : left= 0.019 right= 0.018
Overall non-coverage for linear, N=1500 : left= 0.009 right= 0.01
Overall non-coverage for quadratic, N=1500 : left= 0.013 right= 0.011
Increasing the sample size made coverage larger than 0.95, especially for linear and quadratic calibration. Perhaps it's best to have this conservatism so that the method will not be anti-conservative for small sample sizes.
Summary
I considered approximate confidence bands for the true model performance indexes and entire calibration curves as estimated by a bootstrap overfitting-corrected approach. The dual standard deviation approximate bootstrap confidence interval ABCLOC balances tail non-coverage probabilities by allowing the bootstrap distribution of estimated biases to be asymmetric. ABCLOC seems to provide accurate enough confidence limits for individual accuracy scores such as the Brier score, calibration slope, and to a lesser extent Somers' \(D_{xy}\), and I have a similar expectation of good performance for other scalar indexes. Considering vector measures such as whole calibration curves evaluated over a sequence of predicted risks, ABCLOC can be conservative. ABCLOC intervals are useful, and are certainly better than having no uncertainty intervals. They require no additional computation time.
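If the rms 8.1-0 behavior used in the simulations above holds for your fit, the limits are available directly from the validate output, e.g. (f is an lrm fit created with x=TRUE, y=TRUE):
# ABCLOC limits as returned by validate() in rms 8.1-0
v <- validate(f, B=300)
v[, c('index.corrected', 'Lower', 'Upper')]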
I hope that others will refine ABCLOC, and run more diverse simulations.
The choice of calibration curve estimator also matters, and an important point demonstrated by simulation is the difficulty of estimating calibration (not to mention its uncertainty) at very low or high predicted probabilities. The more flexible loess method has worse confidence interval coverage, especially for predicted risks far from the mean. When the true calibration curve is logit-linear, linear calibration estimates, fitted by running binary logistic regression with a sole predictor equal to the logit of predicted risk, are superior as expected. Quadratic calibration is a good compromise, and the performance of restricted cubic spline estimators with 4 knots should also be explored. An advantage of parametric calibration methods (linear, quadratic, spline, …) is that just the parameters (3 for quadratic) of the estimated calibration curve can be saved for each bootstrap fit, and one does not need to track the individual calibration curve points (50 in number in simulations presented here) as must be done for loess.
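As a starting point for the spline idea, a 4-knot restricted cubic spline analog of qcal might look like the sketch below, using Hmisc's rcspline.eval to build the spline basis; this is untested here and only suggests the form such an estimator could take.
# Sketch of a 4-knot restricted cubic spline calibration estimator, analogous to qcal above
rcal <- function(x, y, nk=4) {
  x  <- pmax(0.001, pmin(x, 0.999))            # curtail predicted probabilities
  L  <- qlogis(x)
  X  <- rcspline.eval(L, nk=nk, inclx=TRUE)    # restricted cubic spline basis for the logit
  f  <- lrm.fit(X, y, compstats=FALSE)
  k  <- coef(f)
  pgrid <- seq(min(x), max(x), length=200)
  Xg    <- rcspline.eval(qlogis(pgrid), knots=attr(X, 'knots'), inclx=TRUE)
  list(x=pgrid, y=plogis(drop(k[1] + Xg %*% k[-1])))
}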
The 'basic' approach above already did this, relying only on the validate() estimates of the calibration intercept and slope.
Computing Environment
grateful::cite_packages(pkgs='Session', output='paragraph', out.dir='.',
                        cite.tidyverse=FALSE, omit=c('grateful', 'ggplot2'))
We used R version 4.5.1 (R Core Team 2025) and the following R packages: data.table v. 1.17.8 (Barrett et al. 2025), Hmisc v. 5.2.4 (Harrell Jr 2025a), rms v. 8.1.0 (Harrell Jr 2025b).
The code was run on macOS Sequoia 15.5 on a MacBook Pro M2 Max.