Biostatistical Modeling Plan

2023

accuracy-score

endpoints

ordinal

collaboration

data-reduction

design

medicine

prediction

regression

validation

bootstrap

This is an example statistical plan for project proposals where the goal is to develop a biostatistical model for prediction, and to do external or strong internal validation of the model.

Author

Affiliation

Frank Harrell

Department of Biostatistics
Vanderbilt University School of Medicine

Published

January 26, 2023

Introduction

Even though several of the hypotheses will be translated into statistical tests, at the heart of the aims is the determination of clinically useful predictors of outcomes and the estimation of the magnitude of effects of predictors on the outcomes. Therefore, the statistical plan emphasizes development of multivariable models relating predictors (patient baseline characteristics) to outcomes, and the use of such models for estimating adjusted (partial) effects of risk factors of interest, after controlling for the effects of other characteristics. The strategy to be used for developing multivariable models is given in detail in Frank E. Harrell (2015). This strategy involves such steps as

This has been used successfully as a template for multiple grant proposals.

This material covers only frequentist model development. A different template would be needed for (the preferred) Bayesian approach.

multiply imputing missing predictor values to make good use of partial information on a subject
choosing an appropriate statistical model based on the nature of the response variable
deciding on the allowable complexity of the model based on the effective sample size available
allowing for nonlinear predictor effects using regression splines
incorporating pre-specified interactions
checking distributional assumptions
adjusting the variance-covariance matrix for multiple imputation
graphically interpreting the model using partial effect plots and nomograms
quantifying the clinical utility (discrimination ability) of the model
internally validating the calibration and discrimination of the model using the bootstrap to estimate the model’s likely performance on a new sample of patients from the same patient stream
possibly do external validation

Regarding the choice of statistical model, we will use ordinary multiple regression for truly continuous responses, and for outcome scales that are not truly continuous or that have heavy ties in some of their values, the proportional odds ordinal logistic model (Walker and Duncan (1967)). Particular strategies and assumption checking procedures for these models are given in Frank E. Harrell (2015).

Sample size justification for development of reliable predictive models should be based on Richard D. Riley et al. (2019a) or Richard D. Riley et al. (2019b). For simplicity, in what follows sample size or allowable model complexity will be based on the rough rule of thumb that, on the average, a model must fit no more than \(\frac{m}{15}\) parameters for it to be reliable on future similar patients, where \(m\) is the effective sample size (Frank E. Harrell (2015)). For continuous response variables or ordinal responses with at least 5 well-populated values, \(m\) is the number of subjects. For binary outcomes, \(m\) is the number of incident events in the cohort. But one needs to take into account the number of subjects needed just to get an accurate overall average prediction, i.e., the number of subjects needed to estimate the model intercept (or the underlying survival curve in a Cox proportional hazards model). For a binary outcome, 96 subjects are required to estimate the intercept such that the margin of error (with 0.95 confidence) in estimating the overall risk is \(\pm 0.10\). For a continuous outcome, 70 subjects are required to estimate the residual variance to within a multiplicative margin of error of 1.2 with 0.95 confidence. See here for details about these minimum sample size calculations.

Example of Candidate Predictor Specification

The most comprehensive models have the following candidate predictors. Baseline value refers to the baseline value of the response variable, if one exists.

Two Possible Ways for Specifying Predictors and d.f.

Type is c for a categorical variable, n for a numeric continuous measurement, or g for a group of related variables.

Predictor	Type	d.f.
baseline score	n	2
disease category	c	3
disease history	g	5
age	n	2
…	.	.
Overall		42

Predictor Type	d.f.
Continuous Numeric Predictors
baseline score	2
age	2
Categorical Predictors (exclusive categories)
disease category	3
Non-exclusive Categorical Predictors
disease history	5
Overall	10

For categorical predictors the number of degrees of freedom (d.f., the numerator degrees of freedom in ordinary regression) is one less the number of categories. For continuous ones, degrees of freedom beyond one indicate nonlinear terms. When d.f. is two, the restricted cubic regression spline fit we will use is similar to a quadratic function.

A model with a complexity given by 42 d.f. can be reliably fitted when the effective sample size is at least 42 x 15 = 630 subjects. Note that we will minimize the use of stepwise variable selection techniques that attempt to reduce the apparent d.f., as this would add instability making the effective d.f. at least 42. We will also use a data reduction technique based on variable clustering to reduce the d.f. in some of the groups of related variables. Frank E. Harrell (2015) provides references for pitfalls of stepwise variable selection as well as for data reduction issues.

Alternative text:Data reduction techniques such as principal components analysis will be used to reduce related groups of variables not of primary interest to a single summary score to insert in the outcome model. Alternatively, penalized maximum likelihood estimation (Frank E. Harrell 2015, sec. 9.10) will be used to jointly model all predictors, with shrinkage (a kind of discounting) to eliminate overfitting. The amount of shrinkage to be used would be that yielding an effective d.f. of xx.

Interactions (Effect Modifiers)

The starting point for multivariable modeling is an additive effects model. A purely additive model, even one with allowance for flexible nonlinear relationships, will not fit adequately if interactive or synergistic effects exist. We do not find that an empirical search for interactions is a reliable approach, because there are too many possible interactions (product terms) and frequently interactions found by intensive search techniques are spurious. Instead interactions should be pre-specified based on subject matter knowledge, using general suggestions in (Frank E. Harrell 2015, sec. 2.7.2) which states that sensible interactions to entertain are

Treatment \(\times\) severity of disease being treated
Age \(\times\) risk factors
Age \(\times\) type of disease
Measurement \(\times\) state of a subject during measurement
Race \(\times\) disease
Calendar time \(\times\) treatment (learning curve)
Quality \(\times\) quantity of a symptom

Model Fitting and Testing Effects

Regression models will be fitted by least squares (for ordinary multiple regression) or maximum likelihood. Instead of deleting cases having some of the predictor variables missing, baseline predictors will be multiply imputed using the flexible semiparametric aregImpute algorithm in Harrell’s Hmisc package in R. When assessing the partial association of a predictor of interest with the response variable, it is important that there be as little residual confounding as possible (Greenland 2000). This is accomplished by avoiding stepwise variable selection (keeping all pre-specified variables in the model, significant or not) and by not assuming that effects are linear. For testing the partial effect of a predictor, adjusted for other independent variables, partial Wald tests will be used. When the predictor requires multiple degrees of freedom, these tests are composite or “chunk” tests. For pivotal or equivocal P-values, likelihood ratio tests will be used when maximum likelihood is used for fitting the model.

Model Validation

It is common practice when performing internal validations of predictive models to use a split sample technique in which the original sample is split into training and test samples. Although this provides an unbiased estimate of predictive accuracy of the model developed on the training sample, it has been shown to have a margin of error that is much larger than that obtained from resampling techniques. The reasons are (1) the arbitrariness of the sample split and (2) both the training and test samples must be smaller than the original, whole, sample. Pattern discovery, power, and precision of the effects of predictors are sacrificed by holding out data. For these reasons, we will use the bootstrap to internally validate the model, repeating all steps used in model development for each resample of the data. The bootstrap allows one to estimate the likely future performance of a predictive model without holding out data. It can be used to penalize estimates of predictive discrimination and calibration accuracy obtained from the original model (fit from the whole sample) for overfitting. Details of this approach are given in Harrell (Frank E. Harrell 2015, chap. 5).

Optional: Assessing Biases Caused by Dropouts

Some subjects may not return for their follow-up visit and hence may have missing values for their response variables. Frequently such subjects (“dropouts”) are not a random sample of the entire cohort, and analysis of only the complete cases will bias the results. Without intermediate clinic visits there are no good ways to adjust for such biases. Indirect information about the nonrandom dropouts will be obtained by using binary logistic models to predict the probability of dropping out on the basis of all the baseline predictor variables. When a baseline variable is a significant predictor of dropout, the distribution of that variable differs for completers and subjects who drop out.

Optional: Multiple Endpoints

There are multiple clinical endpoints in this study. We wish to answer separate questions and will report all negative as well as positive results, so no multiplicity corrections will be made for the number of endpoints (Cook and Farewell 1996).

Optional: Ordinal Response Models

Ordinal logistic models such as the proportional odds (PO) model (Walker and Duncan 1967) are the most popular models for analyzing responses when there is a natural ordering of the values of the response but an interval scale cannot be assumed. The usual ordinal logistic models (PO and the continuation ratio model) estimates a single odds ratio for each predictor. For the PO model, letting \(Y\) denote the response variable, \(j\) a level of \(Y\), and \(a\) and \(b\) be two levels for a predictor, the PO model assumes that the odds that \(Y>j\) given \(X=b\) divided by the odds that \(Y>j\) given \(X=a\) (the \(b:a\) odds ratio for \(X\)) depends on \(a\) and \(b\) but not on \(j\). This enables us to arrive at valid conclusions not affected by a choice of \(j\) while doing away with the need to have a separate predictor coefficient for each \(j\). The resulting odds ratio can be thought of as the impact of the predictor for moving \(Y\) to higher levels. Ordinal logistic models also work well for continuous \(Y\), just using the ranks of the variable. If \(Y\) has m unique levels, the ordinal logistic model will have \(m-1\) intercepts but \(m\) may equal \(n\) with nothing more than computer memory problems. Thus the ordinal response model has a great robustness advantage for skewed outcome variables while preserving power. Unlike ordinary regression and ANOVA, transforming \(Y\) will not affect regression coefficients, nor will a high outlier, nor is an equal spacing assumption assumed about \(Y\). The ordinal logistic model also has great power gains over the binary logistic model.

The Wilcoxon-Mann-Whitney two-sample rank test is a special case of the PO model, and like the PO model, the Wilcoxon test requires PO in order to be fully efficient. Details are here. General resources for ordinal modeling are here.

External Validation Plan (for binary outcome)

There are two important aspects of the proposed external validation, and two different statistical quantities to be validated for each of these. First, we will be interested in comparing whether predictive performance measures estimated in the internal validations accurately forecast those measures for the external validation proposed here. Secondly, and more importantly, we will interpret and rely upon the external validation especially in judging generalizability of our predictive models.

The primary methods to be used for external validation are as follows. First, we will estimate predictive discrimination using the c-index (F. E. Harrell et al. 1982) and its rescaled version, Somers’ \(D_{xy}\) rank correlation between predicted risk and observed binary outcome. Second, we will validate the absolute predictive accuracy of predicted risks from our previously developed prediction rules. For this we will use Cleveland’s loess nonparametric regression smoother (Cleveland 1979; Frank E. Harrell, Lee, and Mark 1996; Frank E. Harrell 2015) or a spline function to estimate the relationship between predicted risk and actual risk to obtain high-resolution calibration curves. This is done by regressing predicted risk on observed binary outcomes, with no binning (categorization), so that all levels of predicted risk supported by the data can be validated ¹. For testing whether the resulting calibration curve is ideal (i.e., is the 45 degree line of identity) we will use the single d.f. Spiegelhalter \(z\)-test (Spiegelhalter 1986) and the two d.f. test that simultaneously tests whether a linear logistic calibration curve has intercept of zero and slope of unity (Cox 1958; F. E. Harrell and Lee 1987; Miller, Hui, and Tierney 1991). Calibration accuracy will be summarized graphically by the estimated nonparametric calibration curve and by the mean absolute error and the 90th percentile of absolute calibration error. All of these quantities and tests are provided by the val.prob function in the R rms package (Frank E. Harrell 2020).

¹ We typically plot the calibration curve from the 10th smallest predicted risk to the 10th largest in the data.

Validation benefits from using the mapping of all the predictors down into one variable which is the predicted risk. So the sample size is that needed to very accurately estimate the relationship between predicted risk and observed outcome. Even though this relationship will be estimated by a smooth nonparametric calibration curve, for purposes of assessing the needed size of the validation sample we consider this process to be equivalent to estimating the risk in k distinct risk groups. Within each risk group, consider the worst case where the true risk is 0.5. To achieve a margin of error of \(\pm 0.075\) in estimating the true risk with 0.95 confidence requires a sample size of \(\frac{0.96}{0.075^{2}} = 171\). To be able to have this margin of error in any chosen group would require 171 subjects in that group.

Checklist of Things Reviewers Might Criticize

No plan for variable selection (need to explain why traditional variable selection is a bad idea)
Too little detail about missing value imputation
Important variables omitted from list of predictors
Too many predictors that are unmodifiable and hence less clinically useful

References

Cleveland, W. S. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” J Am Stat Assoc 74: 829–36.

Cook, Richard J., and Vern T. Farewell. 1996. “Multiplicity Considerations in the Design and Analysis of Clinical Trials.” J Roy Stat Soc A 159: 93–110.

Cox, D. R. 1958. “Two Further Applications of a Model for Binary Regression.” Biometrika 45 (3/4): 562–65. http://www.jstor.org/stable/2333203.

Greenland, Sander. 2000. “When Should Epidemiologic Regressions Use Random Coefficients?” Biometrics 56: 915–21. https://doi.org/10.1111/j.0006-341X.2000.00915.x.

Harrell, F. E., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247: 2543–46.

Harrell, F. E., and K. L. Lee. 1987. “Using Logistic Model Calibration to Assess the Quality of Probability Predictions.”

Harrell, Frank E. 2015. Regression Modeling Strategies, with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Second edition. New York: Springer. https://doi.org/10.1007/978-3-319-19425-7.

———. 2020. “rms: R Functions for Biostatistical/Epidemiologic Modeling, Testing, Estimation, Validation, Graphics, Prediction, and Typesetting by Storing Enhanced Model Design Attributes in the Fit.” https://hbiostat.org/R/rms.

Harrell, Frank E., Kerry L. Lee, and Daniel B. Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Stat Med 15: 361–87.

Miller, Michael E., Siu L. Hui, and William M. Tierney. 1991. “Validation Techniques for Logistic Regression Models.” Stat Med 10: 1213–26.

Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell, Karel G. M. Moons, and Gary S. Collins. 2019a. “Minimum Sample Size for Developing a Multivariable Prediction Model: Part I – Continuous Outcomes.” Statistics in Medicine 38 (7): 1262–75. https://doi.org/10.1002/sim.7993.

Riley, Richard D, Kym Ie Snell, Joie Ensor, Danielle L Burke, Frank E Harrell Jr, Karel Gm Moons, and Gary S Collins. 2019b. “Minimum Sample Size for Developing a Multivariable Prediction Model: PART II ‐ Binary and Time‐to‐event Outcomes.” Statistics in Medicine 38 (7): 1276–96. https://doi.org/10.1002/sim.7992.

Spiegelhalter, D. J. 1986. “Probabilistic Prediction in Patient Management and Clinical Trials.” Stat Med 5: 421–33. https://doi.org/10.1002/sim.4780050506.

Walker, S. H., and D. B. Duncan. 1967. “Estimation of the Probability of an Event as a Function of Several Independent Variables.” Biometrika 54: 167–78.

Reuse

CC BY 4.0