The ability to estimate how one continuous variable relates to another continuous variable is basic to the ability to create good predictions. Correlation coefficients are unitless, but estimating them requires similar sample sizes to estimating parameters we directly use in prediction such as slopes (regression coefficients). When the shape of the relationship between X and Y is not known to be linear, a little more sample size is needed than if we knew that linearity held so that all we had to estimate was a slope and an intercept. This will be addressed later.

Consider BBR Section 8.5.2 where it is shown that the sample size needed to estimate a correlation coefficient to within a margin of error as bad as ±0.2 with 0.95 confidence is about 100 subjects, and to achieve a better margin of error of ±0.1 requires about 400 subjects. Let’s reproduce that plot for the “hardest to estimate” case where the true correlation is 0.

`require(Hmisc)`

`knitrSet(lang='blogdown')`

```
plotCorrPrecision(rho=0, n=seq(10, 1000, length=100), ylim=c(0, .4), method='none')
abline(h=seq(0, .4, by=0.025), v=seq(25, 975, by=25), col=gray(.9))
```

I have seen many papers in the biomedical research literature in which investigators “turned loose” a machine learning or deep learning algorithm with hundreds of candidate features and a sample size that by the above logic is inadequate had there only been one candidate feature. How can ML possibly learn how hundreds of predictors combine to predict an outcome when our knowledge of statistics would say this is impossible? The short answer is that it can’t. Researchers claiming to have developed a useful predictive instrument with ML in the limited sample size case seldom do a rigorous internal validation that demonstrates the relationship between predicted and observed values (i.e., the calibration curve) to be a straight 45° line through the origin. I have worked with a colleague who had previously worked with a ML group who found a predictive signal (high R^{2}) with over 1000 candidate features and N=50 subjects. In trying to check their results on new subjects we appear to be finding an R^{2} about 1/4 as large as originally claimed.

Ploeg, Austin, and Steyerberg (2014) in their article Modern modelling techniques are data hungry estimated that to have a very high chance of rigorously validating, many machine learning algorithms require 200 events per *candidate* feature (they found that logistic regression requires 20 events per candidate features). So it seems that “big data” methods sometimes create the need for “big data” when traditional statistical methods may not require such huge sample sizes (at least when the dimensionality is not extremely high). [Note: in higher dimensonal situations it is possible to specify a traditional statistical model for the pre-specified “important” predictors and to add in principal components and other summaries of the remaining features.] For more about “data hunger” in machine learning see this. Machine learning algorithms do seem to have unique advantages in high signal:noise ratio situations such as image and sound pattern recognition problems. Medical diagnosis and outcome prediction problems involve a low signal:noise ratio, i.e., the R^{2} are typically low and the outcome variable Y is typically measured with error.

I’ve shown the sample size needed to estimate a correlation coefficient with a certain precision. What about the sample size needed to estimate the whole relationship between a single continuous predictor and the probability of a binary outcome? Similar to what is presented in RMS Notes Section 8.2.3, let’s simulate the average maximum (over a range of X) absolute prediction error (on the probability scale). The following R program does this, for various sample sizes. 1000 simulated datasets are analyzed for each sample size considered.

```
# X = universe of X values if X considered fixed, in random order
# xp = grid of x values at which to obtain and judge predictions
require(rms)
```

```
sim <- function(assume = c('linear', 'smooth'),
X,
ns=seq(25, 300, by=25), nsim=1000,
xp=seq(-1.5, 1.5, length=200), sigma=1.5) {
assume <- match.arg(assume)
maxerr <- numeric(length(ns))
pactual <- plogis(xp)
xfixed <- ! missing(X)
j <- 0
worst <- nsim
for(n in ns) {
j <- j + 1
maxe <- 0
if(xfixed) x <- X[1 : n]
nsuccess <- 0
for(k in 1 : nsim) {
if(! xfixed) x <- rnorm(n, 0, sigma)
P <- plogis(x)
y <- ifelse(runif(n) <= P, 1, 0)
f <- switch(assume,
linear = lrm(y ~ x),
smooth = lrm(y ~ rcs(x, 4)))
if(length(f$fail) && f$fail) next
nsuccess <- nsuccess + 1
phat <- predict(f, data.frame(x=xp), type='fitted')
maxe <- maxe + max(abs(phat - pactual))
}
maxe <- maxe / nsuccess
maxerr[j] <- maxe
worst <- min(worst, nsuccess)
}
if(worst < nsim) cat('For at least one sample size, could only run', worst, 'simulations\n')
list(x=ns, y=maxerr)
}
plotsim <- function(object, xlim=range(ns), ylim=c(0.04, 0.2)) {
ns <- object$x; maxerr <- object$y
plot(ns, maxerr, type='l', xlab='N', xlim=xlim, ylim=ylim,
ylab=expression(paste('Average Maximum ', abs(hat(P) - P))))
minor.tick()
abline(h=c(.05, .1, .15), col=gray(.85))
}
set.seed(1)
X <- rnorm(300, 0, sd=1.5) # Allows use of same X's for both simulations
simrun <- TRUE
# If blogdown handled caching, would not need to manually cache with Load and Save
if(simrun) Load(errLinear) else {
errLinear <- sim(assume='linear', X=X)
Save(errLinear)
}
plotsim(errLinear)
```

But wait—the above simulation assumes that we already knew that the relationship was linear. In practice, most relationships are nonlinear but we don’t know the true transformation. Assume the relationship between X and logit(Y=1) is smooth, we can estimate the relationship reliably with a restricted cubic spline function. Here we use 4 knots, which gives rise to the addition of two nonlinear terms to the model for a total of 3 parameters to estimate not counting the intercept. By estimating these parameters we are estimating the smooth transformation of X and by simulating this process repeatedly we are allowing for “transformation uncertainty”.

```
set.seed(1)
if(simrun) Load(errSmooth) else {
errSmooth <- sim(assume='smooth', X=X, ns=seq(50, 300, by=25))
Save(errSmooth)
}
plotsim(errSmooth, xlim=c(25, 300))
lines(errLinear, col=gray(.8))
```

You can see that the sample size must exceed 300 just to have sufficient reliability in estimating probabilities just over the range of X of [-1.5, 1.5] when we do not know that the relationship is linear and we allow it to be nonlinear.

The morals of the story are

- Beware of claims of good predictive ability for ML algorithms when sample sizes are not huge in relationship to the number of candidate features
- For any problem, whether using machine learning or regression, compute the sample size needed to obtain highly reliable predictions with only a single pre-specified predictive feature
- If you are not sure that relationships are simple so that you allow various transformations to be attempted, uncertainty increases and so does the expected absolute predicton error
- If your sample size is not much bigger than the above minimum, beware of doing any high-dimensional analysis unless you have very clean data and a high signal:noise ratio
- Also remember that when Y is binary, the minimum sample size necessary just to estimate the intercept in a logistic regression model (equivalent to estimating a single proporton) is 96 (see BBR Section 5.6.3)
So it is impossible with binary Y to accurately estimate P(Y=1 | X) when there are
*any*candidate predictors if n < 96 (and n=96 only achives a margin of error of ±0.1 in estimating risk). - When the number of candidate features is huge and the sample size is not, expect the list of “selected” features to be volatile, predictive discrimination to be overstated, and absolute predictive accuracy (calibration curve) to be very problematic
- In general, know how many observations are required to allow you to reliably learn from the number of candidate features you have

See BBR Chapter 20 for an approach to estimating the needed sample size for a given sample size and number of candidate predictors.

# References

Ploeg, Tjeerd van der, Peter C. Austin, and Ewout W. Steyerberg. 2014. “Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints.” *BMC Medical Research Methodology* 14 (1). BioMed Central Ltd: 137+. http://dx.doi.org/10.1186/1471-2288-14-137.