Statistical Thinking

Minimal-Assumption Estimation of Survival Probability vs. a Continuous Variable

Frank Harrell — Sat, 19 Apr 2025 05:00:00 GMT

Background

This article considers the following setting. Suppose we have one continuous predictor $X$ and an outcome variable $Y$, and we wish to estimate a smooth, usually nonlinear, relationship between $X$ and some property of $Y$ such as the mean or the probability that $Y$ exceeds some specified value. When there is no censoring on $Y$, one can estimate such a smooth relationship nonparametrically using a standard smoother such as loess or the R “super smoother” supsmu. Semiparametric ordinal regression, using a regression spline for $X$, is also a good approach.
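As a minimal sketch of the uncensored case, assuming simulated data and an arbitrary cutoff (neither comes from the analysis in this article), base R's supsmu and loess can smooth either the outcome itself or the 0/1 indicator that the outcome exceeds the cutoff:

```r
set.seed(1)
n <- 400
x <- runif(n)
y <- 2 * (x - 0.5)^2 + rnorm(n, sd = 0.3)   # hypothetical uncensored outcome

# Smooth estimate of the mean of y given x using the super smoother
m_hat <- supsmu(x, y, bass = 5)             # list with components x and y

# Smooth estimate of P(Y > cutoff | x): smooth the 0/1 exceedance indicator
cutoff <- 0.3                               # hypothetical cutoff
ind    <- as.numeric(y > cutoff)
p_fit  <- loess(ind ~ x, span = 0.75)
xs     <- seq(0.05, 0.95, by = 0.05)
p_hat  <- predict(p_fit, data.frame(x = xs))
# loess is not range-restricted, so clamp the estimates to [0, 1]
p_hat  <- pmin(pmax(p_hat, 0), 1)
```

Neither smoother can use a right-censored outcome directly, which is what motivates the methods compared below.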

Now suppose that $Y$ represents the time until an event, where there may be right-censoring, i.e., for some observations the time to event is only known to be beyond the last recorded follow-up time. If we were willing to make an assumption such as proportional hazards (PH), we could easily estimate the curve in question by fitting a Cox PH model using a regression spline in $X$. But nonparametric smoothers along the lines of loess or supsmu have not been developed for this setting, and we still need to estimate the relationship with minimal assumptions. For example, we may want to avoid making the PH assumption (parallelism over $t$ in $\log(-\log S(t \mid X))$ for different values of $X$).

An especially important application is the estimation of calibration curves, which is the process of estimating how the predicted probability of surviving past a specific time $t$ relates to the estimated actual probability of survival. Here the predicted survival probability is the sole continuous covariate.

Binning $X$ and computing stratified Kaplan-Meier estimates is not a competitive procedure: it does not interpolate between bins, which adds noise and increases mean squared error, and the choice of bins is arbitrary.

Estimation Methods

The $S(t \mid X)$ vs. $X$ estimation methods considered here are as follows.

hazard regression using the R polspline package’s hare function. hare uses adaptive linear splines in $X$ and in $t$ to find a smooth function. Non-parallelism (e.g., non-PH) is handled by adaptively adding product terms involving linear spline terms in $X$ and in $t$.
adaptive ordinal regression using the R rms package’s adapt_orm function (available as of version 8.0-1) with right-censoring, starting by modeling $X$ with a restricted cubic spline function with 4 knots. Four link functions are tried: logit, probit, log-log, and complementary log-log. The link function resulting in the lowest deviance is selected, and fits using that link are then tried with 0, 3, 4, 5, and 6 knots, where 0 represents a linear fit. The fit with the best AIC is selected. This fit is then used to estimate survival probabilities at a specific $t$ over a regular grid of $X$. The models considered span a range from PH to non-PH accelerated failure time models.
moving overlapping window Kaplan-Meier (KM) estimates at a specific $t$ using the R Hmisc package movStats function. At each distinct $X$ value occurring in the data, an interval containing eps observations on either side of $X$ is formed. For each interval, the KM estimate at $t$ is computed. By default, these are then smoothed over all the estimates at all distinct $X$ values, using the R super smoother. The amount of smoothing in supsmu is controlled by the bass parameter.
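The moving-window idea in the last method can be sketched with the survival package alone. This is an illustration, not the movStats implementation: the helper name mov_km is hypothetical, a window is formed at every observation rather than at every distinct value, and windows are simply truncated at the edges of the data.

```r
library(survival)

# Sketch of overlapping moving-window Kaplan-Meier estimation of S(u | x).
# NOT Hmisc::movStats; edge handling here is an assumption.
mov_km <- function(x, y, e, u, eps = 15) {
  o <- order(x)
  x <- x[o]; y <- y[o]; e <- e[o]
  n   <- length(x)
  est <- numeric(n)
  for (i in seq_len(n)) {
    # up to eps observations on either side of the i-th ordered x
    j <- max(1, i - eps) : min(n, i + eps)
    f <- survfit(Surv(y[j], e[j]) ~ 1)
    # KM estimate at time u; extend=TRUE carries the last estimate forward
    est[i] <- summary(f, times = u, extend = TRUE)$surv
  }
  data.frame(x = x, surv = est)
}

set.seed(3)
n    <- 200
x    <- runif(n)
t    <- exp(2 - 2 * (x - 1)^2 + rlogis(n) / 2)   # same AFT form as the simulation below
cens <- runif(n, 2, 6)
raw  <- mov_km(x, pmin(t, cens), as.numeric(t <= cens), u = 3, eps = 30)
sm   <- supsmu(raw$x, raw$surv, bass = 9)        # optional smoothing of the raw estimates
```

The raw estimates are step-like because adjacent windows share most of their observations, which is why a second smoothing pass is offered by default.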

Simulation

The following simulations under one data generating mechanism estimate the performance of each of the above methods, the last method being used multiple times for different eps (and bass if smoothing KM estimates).

For sample sizes ranging from 35 to 500, simulate right-censored data generated from a log-logistic accelerated failure time model that is quadratic in $X$. Our goal is to estimate $S(3 \mid X)$, i.e., 3-unit survival probability as a function of the sole predictor $X$. Define a function simdat to simulate a single dataset, with the censoring distribution being $U(2, 6)$. Plot the true $S(3 \mid X)$ as a function of $X$.

require(Hmisc)
require(rms)
require(ggplot2)
require(data.table)
require(polspline)
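simdat <- function(n) {
  cens <- runif(n, 2, 6)          # censoring times ~ U(2, 6)
  x    <- runif(n)
  # Log-logistic AFT model: log(T) is quadratic in x plus a scaled logistic error
  lp   <- 2 - 2 * (x - 1) ^ 2
  t    <- exp(lp + rlogis(n) / 2)
  e    <- t <= cens               # TRUE if the event was observed
  y    <- pmin(t, cens)           # observed follow-up time
  data.table(x, y, e)
}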

# Function to compute true survival probability at t=3
# T = exp(lp + r / 2)
# P(T > 3) = P(exp(lp + r / 2) > 3) = P(lp + r / 2 > log(3)) =
# P(r / 2 > log(3) - lp) = P(r > 2 * (log(3) - lp))
surv <- function(x) {
  lp <- 2 - 2 * (x - 1) ^ 2
  1 - plogis(2 * (log(3) - lp))
}
x <- seq(0, 1, by=0.01)
plot(x, surv(x), type='l', ylab='S(3 | x)')

For this setup the probability that an observation is right-censored is 0.505.
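The censoring probability can be checked by numerical integration rather than simulation. The sketch below averages P(T > c | x) over c ~ U(2, 6) and x ~ U(0, 1); the helper names lp, p_exceed, and inner are introduced here for illustration.

```r
lp <- function(x) 2 - 2 * (x - 1)^2

# P(T > cc | x) for T = exp(lp + r / 2) with r standard logistic
p_exceed <- function(cc, x) 1 - plogis(2 * (log(cc) - lp(x)))

# Average over the censoring time cc ~ U(2, 6) (density 1/4) for each x
inner <- function(x) sapply(x, function(xi)
  integrate(function(cc) p_exceed(cc, xi) / 4, lower = 2, upper = 6)$value)

# Average over x ~ U(0, 1): probability that T exceeds the censoring time
p_cens <- integrate(inner, lower = 0, upper = 1)$value
p_cens
```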

Next create a function that runs any of the estimation methods on a dataset and computes its integrated mean squared error (and its square root) in estimating the true $S(3 \mid X)$ over $X \in [0, 1]$. Squared errors are computed for each $X$ on a regular grid and averaged over all 101 values.
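To make the error measure concrete, here is a small sketch that evaluates it for a deliberately naive estimator that ignores the predictor. The helper grid_rmse and the naive estimator are hypothetical and are not part of the simulation code.

```r
# True 3-unit survival probability, as defined earlier in the article
surv <- function(x) {
  lp <- 2 - 2 * (x - 1)^2
  1 - plogis(2 * (log(3) - lp))
}
xs <- seq(0, 1, by = 0.01)                  # 101 grid points

# Root of the average squared error over the grid
grid_rmse <- function(s_hat, truth) sqrt(mean((s_hat - truth)^2))

naive <- rep(mean(surv(xs)), length(xs))    # constant estimate that ignores x
grid_rmse(naive, surv(xs))                  # error of the naive estimator
grid_rmse(surv(xs), surv(xs))               # a perfect estimate has zero error
```

Any method worth using should have a root mean squared error well below that of the constant estimator.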

# Function to compute sqrt of average (over grid of x) 
# mean squared error in estimating the true function of x
# using one of several methods

crmse <- function(dat, meth, msmooth='smoothed',
                  eps=15, bass=8, penalty='BIC', u=3, pl=FALSE) {
  xs <- seq(0, 1, by = 0.01)
  x  <- dat$x
  y  <- dat$y
  e  <- dat$e
  S <- survival::Surv(y, e)
  O <- Ocens(y, ifelse(e == 1, y, Inf))
  if(meth == 'hare') {
    f <- if(penalty == 'BIC') hare(y, e, x, maxdim=6)
      else hare(y, e, x, penalty=2, maxdim=6)
    s <- 1 - phare(u, xs, f)
  }
  else if(meth == 'orm') {
    f <- adapt_orm(x, O)
    opt_link <<- c(opt_link, f$family)
    opt_df   <<- c(opt_df,   f$stats['d.f.'])
    s <- survest(f, data.frame(x=xs), times=u, conf.int=0)$surv
  }
  else {
    if(nrow(dat) < 2 * eps) return(NA_real_)
    if(msmooth == 'raw')
      f <- movStats(S ~ x, times=u, msmooth='raw', eps=eps, melt=TRUE)
    else
      f <- movStats(S ~ x, times=u, msmooth=msmooth,
                    eps=eps, bass=bass, melt=TRUE)
    s <- approx(f$x, 1 - f$incidence, xout=xs, rule=2)$y
  }
  if(pl) {
    plot(xs, s, type='l', xlab='x', ylab=expression(hat(S)(t)), ylim=c(0,1))
    lines(xs, surv(xs), col='red')
    title(sub=paste(meth, msmooth, eps, bass))
  }
  c(rmse=sqrt(mean((s - surv(xs)) ^ 2)))
}

Define a function that for one dataset runs all estimation methods.

run <- function(dat, u=3, pl=FALSE) {
  # Not all parameters pertain to all methods
  # Non-applicable parameters are set to NA by rbindlist(fill=TRUE)
  u1 <- expand.grid(meth = 'hare', penalty=c('AIC', 'BIC'))
  u2 <- data.frame( meth = 'orm')
  u3 <- expand.grid(meth = 'km', msmooth='raw',
                    eps=c(10,15,20,25,30))
  u4 <- expand.grid(meth = 'km', msmooth='smoothed',
                    eps=c(10,15,20,25,30), bass=c(1,3,5,7,9))
  u <- rbindlist(list(u1, u2, u3, u4), fill=TRUE)
  g <- function(x) if(is.na(x)) '' else x
  u[, .(rmse = crmse(dat, meth, as.character(msmooth),
                     eps, bass, penalty, pl=pl),
        method=paste(meth, g(msmooth), g(eps), g(bass), g(penalty))),
    by=.(meth, msmooth, eps, bass, penalty)]
}

Now run the simulations. To gain resolution in $n$ while minimizing the number of simulations and obtaining precise results, simply draw one simulated dataset per $n$, and later smooth the results with respect to $n$.

w <- data.table(n = 35 : 500)
set.seed(2)
if(file.exists('sim.rds')) {
  r        <- readRDS('sim.rds')
  opt      <- readRDS('opt.rds')
  opt_link <- opt$link
  opt_df   <- opt$df
  } else {
    opt_link <- character(0)
    opt_df   <- integer(0)
    r        <- w[, run(simdat(n)), by=n]
    opt      <- list(link=opt_link, df=opt_df)
    saveRDS(r,   'sim.rds')
    saveRDS(opt, 'opt.rds')
}
cat('\nFrequencies of selected links:\n\n')


Frequencies of selected links:

table(opt_link)

opt_link
 cloglog logistic   loglog   probit 
      26      237       63      140

cat('\nFrequencies of optimum number of x parameters:\n')


Frequencies of optimum number of x parameters:

table(opt_df)

opt_df
  1   2   3   4   5 
 90 278  53  21  24

r[, eps  := factor(eps)]
r[, bass := factor(bass)]
# Mark the best performing km raw and km smoothed settings
r[, best := (meth == 'km' & eps == 30) & (
              (msmooth =='raw') |
              (msmooth == 'smoothed' & bass == 9) )]

Graph results for methods that don’t have parameters, and the performance of the other methods, at the parameters resulting in lowest root mean squared error overall. These are eps=30, indicating moving windows with 30 observations on each side of the target x value, and bass=9 indicating maximum smoothing using the “super smoother” R function supsmu.

ggplot(r[meth %in% c('hare', 'orm') | best,],
       aes(x=n, y=rmse, col=method)) +
  geom_smooth(se=FALSE) +
  labs(caption='hare and orm vs. best moving window KM estimators')

Oddly enough, hare BIC performed slightly better than AIC for low sample sizes, and there was no advantage of BIC for large $n$. Moving-window Kaplan-Meier estimates, either smoothed or unsmoothed, did not perform as well as the other methods. The best integrated mean squared estimation error was achieved by the adaptive-link orm method.

Next graph results for the method that has one parameter, unsmoothed moving-window KM.

ggplot(r[meth == 'km' & msmooth == 'raw',],
       aes(x=n, y=rmse, col=eps)) +
  geom_smooth(se=FALSE) +
  labs(caption='Unsmoothed KM estimator performance by sample size on either side of target x')

Best overall mean squared estimation error resulted from eps=30 observations on either side of the target predictor value x.

Now consider the method with two parameters – smoothed moving-window KM.

ggplot(r[meth == 'km' & msmooth == 'smoothed',],
       aes(x=n, y=rmse, col=eps)) +
  geom_smooth(se=FALSE) + facet_wrap(~ bass) +
  labs(caption='Smoothed KM performance by eps and smoothing parameter')

Show this another way to better judge effect of the smoothing parameter bass.

ggplot(r[meth == 'km' & msmooth == 'smoothed',],
       aes(x=n, y=rmse, col=bass)) +
  geom_smooth(se=FALSE) + facet_wrap(~ eps) +
  labs(caption='Smoothed KM performance by eps and smoothing parameter')

The best bass value is 9, i.e., maximum smoothing. The best window size was eps=30.

Recommendations

The overall winner is an ordinal regression model with right censoring that considers four link functions, fits the predictor x with a restricted cubic spline function, and chooses the link with minimum deviance. The number of knots in the spline (or linearity) can be selected using AIC. If one wanted to accommodate more exotic effects of x over time, e.g., complex non-parallelism in survival curves, hare is a good choice.

Computing Environment

grateful::cite_packages(pkgs='Session', output='paragraph', out.dir='.',
    cite.tidyverse=FALSE, omit=c('grateful', 'ggplot2'))

We used R version 4.4.2 (R Core Team 2024) and the following R packages: data.table v. 1.17.0 (Barrett et al. 2025), Hmisc v. 5.2.4 (Harrell Jr 2025a), polspline v. 1.1.25 (Kooperberg 2024), rms v. 8.0.1 (Harrell Jr 2025b).

The code was run on macOS Sequoia 15.4.1 on a Macbook Pro M2 Max.

References

Barrett, Tyson, Matt Dowle, Arun Srinivasan, Jan Gorecki, Michael Chirico, Toby Hocking, Benjamin Schwendinger, and Ivan Krylov. 2025. data.table: Extension of “data.frame”. https://CRAN.R-project.org/package=data.table.

Harrell Jr, Frank E. 2025a. Hmisc: Harrell Miscellaneous. https://hbiostat.org/R/Hmisc/.

———. 2025b. rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/.

Kooperberg, Charles. 2024. polspline: Polynomial Spline Routines. https://CRAN.R-project.org/package=polspline.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Reuse

CC BY 4.0

Bayesian Thinking

Frank Harrell — Mon, 27 Jan 2025 06:00:00 GMT

Janice Pogue Lecture in Biostatistics, Department of Health Research Methods, Evidence, and Impact, McMaster University, Hamilton, Ontario, Canada 2024-12-06
Center for Biostatistics, Dept. of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, 2025-03-18.
Department of Biostatistics, Vanderbilt University School of Medicine, 2025-04-23
Slides

Modernizing Clinical Trial Design and Analysis to Improve Efficiency & Flexibility

Frank Harrell — Thu, 05 Dec 2024 06:00:00 GMT

UCLA Cardiology Grand Rounds 2020-10-23 | Video (better video below)
Vanderbilt University Department of Biostatistics 2020-11-18
Vanderbilt Translational Research Forum 2021-11-04 | Video
Consilium Scientific 2024-03-14 | Video and here
Seventh Annual Janice Pogue Lectureship in Biostatistics, Population Health Research Institute, Hamilton, Ontario, Canada 2024-12-05
Slides
Pictures
Video

Statistical Computing Approaches to Maximum Likelihood Estimation

Frank Harrell — Thu, 28 Nov 2024 06:00:00 GMT

Overview

Maximum likelihood estimation (MLE) is a gold standard estimation procedure in non-Bayesian statistics, and the likelihood function is central to Bayesian statistics (even though it is not maximized in the Bayesian paradigm). MLE may be unpenalized (the standard approach), or penalty functions such as L1 (lasso; absolute value) and L2 (ridge regression; quadratic) may be added to the log-likelihood to achieve shrinkage (aka regularization). I have been doing MLE my entire career, using mainly a Newton-Raphson algorithm with step-halving to achieve rapid convergence. I never stopped to think about other optimization algorithms, and the R language has a good many excellent algorithms included in the base stats package. There is also an excellent maxLik R package devoted to MLE (not used here), and two of its vignettes provide excellent introductions to MLE. General MLE background and methods including more about penalization may be found here, which includes details about how to do QR factorization and to back-transform after the fit.

If you think that ROC area or classification accuracy are good objective functions/optimality criteria, please think again.

In this article I’ll explore several optimization strategies and variations of them, in the context of binary and ordinal logistic regression. The programming scheme that is used here is the one used by many R packages: Write functions to compute the scalar objective function (here, -2 log likelihood), the gradient vector (vector of first derivatives), and the hessian matrix (matrix of second derivatives), and pass those functions to general optimization functions. Convergence (at least to local minima for -2 LL (log of the likelihood function)) is achieved for smooth likelihood functions when the gradient vector values are all within a small tolerance of zero or when the -2 LL objective function completely settles down to a chosen number of significant digits. The gradient vector is all zeros if the parameter values are exactly the MLEs (local minima achieved on -2 LL).
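The programming scheme just described can be sketched for binary logistic regression using only base R. The data are simulated and the function names m2ll, grad, and hess are chosen here for illustration; this is not the lrm.fit code.

```r
set.seed(4)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))       # intercept plus two hypothetical predictors
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1, -1)))

m2ll <- function(b) {                   # scalar objective: -2 log likelihood
  eta <- drop(X %*% b)
  -2 * sum(y * eta - log1p(exp(eta)))
}
grad <- function(b) {                   # gradient vector of -2 LL
  p <- plogis(drop(X %*% b))
  -2 * drop(crossprod(X, y - p))
}
hess <- function(b) {                   # hessian matrix of -2 LL
  p <- plogis(drop(X %*% b))
  2 * crossprod(X * (p * (1 - p)), X)
}

# Pass the three functions to a general-purpose optimizer
fit <- nlminb(start = rep(0, 3), objective = m2ll,
              gradient = grad, hessian = hess)
fit$par                                 # maximum likelihood estimates
max(abs(grad(fit$par)))                 # near zero at convergence
```

Swapping the optimizer then only requires changing the final call, since the objective, gradient, and hessian functions are independent of the algorithm that consumes them.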

A different strategy is to use the Bayesian system Stan by specifying LL and letting Stan analytically compute the gradient, then using the Stan optimizer to compute MLEs. If you specify priors, the optimizer provides penalized MLEs. The likelihood function is the bridge between Bayesian and frequentist methods.

This article is based on rms version 7.0-0, a major new release of the package, which will likely be available on CRAN around 2025-01-08.

History

The R rms package lrm function is dedicated to maximum likelihood estimation (MLE) for fitting binary and ordinal (proportional odds) logistic regression models using the logit link, with or without quadratic (ridge) penalization. Semiparametric regression models, also called ordinal regression models, allow one to do efficient analyses without depending on how the dependent variable Y is transformed. Ordinal models encode the entire cumulative distribution function of Y by having an intercept for each distinct Y level less one. For ordinal models, versions of lrm before rms 6.9-0 were efficient for up to 400 distinct Y-values (399 intercepts) in the sense that execution time was under 10 seconds for 10,000 observations on 10 predictors. The rms orm function is intended for modeling continuous outcome Y variables and was efficient for up to 8000 intercepts prior to rms 7.0-0. orm implements 4 link functions other than the logit. For rms 7.0-0, lrm and orm run in 2.5s for a sample size of 300,000 with continuous Y and 20 predictors, i.e., with 299,999 intercepts (9.5s for 40 predictors). lrm uses the R function lrm.fit for its heavy lifting, and likewise orm uses orm.fit. Much of lrm.fit was written in 1980 and served as the computational engine for the first SAS procedure for logistic regression, PROC LOGIST. It used Fortran 77 for computationally intensive work, and used only a Newton-Raphson algorithm with step-halving for iterative MLE. On rare occasions when serious collinearities were present, such as when multiple continuous variables were fitted using restricted cubic splines (which use a truncated power basis), lrm.fit would fail to converge.

Are Intercepts Regular Parameters?

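Let $F$ be a smooth cumulative distribution function. The cumulative probability class of ordinal semiparametric models can be written as follows. It is more traditional to state the model in terms of $\Pr(Y \leq y \mid X)$ but we use $\Pr(Y \geq y \mid X)$ so that higher predicted values are associated with higher $Y$, and when $F$ is the logistic distribution the ordinal logistic (proportional odds) model reduces exactly to a binary logistic model. Let the ordered distinct values of $Y$ be denoted by $y_1, y_2, \dots, y_k$ and let the intercepts associated with $y_1, \dots, y_k$ be $\alpha_1, \alpha_2, \dots, \alpha_k$, where $\alpha_1 = \infty$ because $\Pr(Y \geq y_1) = 1$. Let $\alpha_y = \alpha_i$ for the $i$ such that $y_i = y$. Then the cumulative probability semiparametric model is

$$\Pr(Y \geq y_i \mid X) = F(\alpha_i + X\beta) = F(\alpha_{y_i} + X\beta)$$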
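As a small numerical sketch of a model of this form, written as Pr(Y ≥ y_i | X) = F(α_i + Xβ) with the logistic F, the code below uses made-up intercepts and a made-up slope. The first intercept is taken as infinite because every observation is at least y_1.

```r
alpha <- c(Inf, 1.2, 0.1, -0.9)   # hypothetical intercepts for y_1 < ... < y_4
beta  <- 0.8                      # hypothetical slope
x     <- 0.5

# Exceedance probabilities Pr(Y >= y_i | x) under the logistic distribution
p_ge <- plogis(alpha + x * beta)

# Cell probabilities Pr(Y = y_i | x) by differencing adjacent exceedance probabilities
p_eq <- p_ge - c(p_ge[-1], 0)

# With only two distinct Y values the model is an ordinary binary logistic model
alpha2 <- c(Inf, -0.3)
p_y2   <- plogis(alpha2[2] + x * beta)   # Pr(Y = y_2 | x)
```

The differencing step shows why the intercepts must be ordered: any reversal would produce a negative cell probability.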

When $F$ is the extreme value type I (also called the Gumbel minimum value) distribution, the inverse function $F^{-1}$ is the link and the model is a proportional hazards (PH) model. The Cox 1972 PH model uses the only generating distribution such that the marginal (getting rid of the $\alpha$s) distribution of the ranks of $Y$ can be evaluated without multi-dimensional integrals. This gives rise to Cox’s partial likelihood, which until time-dependent covariates are included can be computationally fast for any number of distinct failure times. The Cox approach requires a second step to estimate the intercepts (underlying survival curve for a person with some reference value of $X$). There is some arbitrariness to which second-step estimator is used, e.g., the Breslow estimator or the Kalbfleisch-Prentice estimator. And variances of the resulting estimates are complicated because uncertainties from both of the steps must be included.

In fact, a partial likelihood, which also makes it difficult to handle interval censoring, is only needed until you realize that

the maximum likelihood estimate (MLE) of the vector of intercepts $\alpha$ when $\beta = 0$ is just the link function applied to one minus the cumulative probabilities in the absence of censoring; this gives rise to instantly-computed and convergence-accelerating initial values for iterative MLE estimation
the $\alpha$s are always in descending order (ascending if using the more popular statement of ordinal models)
the gradient (first derivative) for the log likelihood function can be computed quickly no matter how large $k$ is
the hessian (second derivatives of the log likelihood function) can be computed in just over twice the time needed to compute the gradient, and takes twice the amount of array storage size as the gradient, so MLE scales wonderfully for large $k$
for most needs, the entire information matrix (negative hessian) never needs to be inverted; portions of the inverse of the whole can be quickly computed without inverting the whole information matrix
the R Matrix package is made for efficient storage and calculation on such sparse hessians

Twice because the intercept portion of the hessian is tri-band diagonal and one only needs to store the diagonal and above-diagonal elements due to symmetry.
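The first point in the list above, covariate-free intercept estimates obtained directly from observed proportions, can be illustrated with the logit link on simulated ordinal data. The data are hypothetical.

```r
set.seed(5)
# Hypothetical ordinal outcome with five levels
y  <- sample(1:5, 300, replace = TRUE, prob = c(.1, .2, .4, .2, .1))
ys <- sort(unique(y))

# Observed Pr(Y >= y_i) for i = 2, ..., k, i.e., one minus the
# cumulative proportion strictly below y_i
p_ge <- sapply(ys[-1], function(v) mean(y >= v))

# Intercepts when there are no covariates: the link applied to those proportions
alpha <- qlogis(p_ge)
alpha                       # values decrease as y_i increases
```

Because the exceedance proportions can only shrink as the threshold rises, the resulting intercepts are automatically in descending order.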

Because of the strict ordering of the $\alpha$s, MLE iterations are fast and the effective degrees of freedom of the model are more like $p + 4$ where $p$ is the number of $\beta$s. The 4 comes from the following line of reasoning. In the no-covariate case, consider confidence bands for the empirical cumulative distribution function (ECDF) for $Y$, then fit a 4-parameter parametric distribution to the raw $Y$ values. Confidence intervals for the cumulative distribution function from this parametric fit will be about the same widths as those from the ECDF.

With no covariates ($\beta = 0$), the ECDF is a direct function of the estimated intercepts. This exercise would also point out the lack of value in fitting flexible parametric distributions compared to generalizing the ECDF to handle covariates by using a semiparametric model.

Re-Write of `lrm.fit` and `orm.fit`

To modernize the Fortran code, better condition the design matrix $X$ ($n$ observations and $p$ columns), and to explore a variety of optimization algorithms, I did a complete re-write of the rms package lrm and lrm.fit functions in November, 2024, for rms version 6.9-0. To reduce optimization divergence when there are extreme collinearities, and to better scale $X$, I was interested in implementing mean centering followed by a QR factorization of $X$ to orthogonalize its columns. Details about how this is done and how the parameter estimates and covariance matrix are converted back to the original space may be found here. QR can be turned on by setting the lrm.fit argument transx to TRUE. When QR is in play, the rotated columns of $X$ are scaled to have standard deviation 1.0.
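The centering, QR rotation, and back-transformation can be sketched in base R as follows. This is an illustration on simulated, highly correlated predictors and is not the code used inside lrm.fit; glm.fit stands in for the fitting engine, and the sketch assumes qr did not pivot any columns.

```r
set.seed(6)
n  <- 400
x1 <- rnorm(n)
X  <- cbind(x1 = x1, x2 = x1 + rnorm(n, sd = 0.1))   # two highly correlated columns
y  <- rbinom(n, 1, plogis(0.3 + X[, 1] - 0.5 * X[, 2]))

xbar <- colMeans(X)
Xc   <- sweep(X, 2, xbar)          # mean-center each column
qrX  <- qr(Xc)
s    <- sqrt(n - 1)
Q    <- qr.Q(qrX) * s              # orthogonal columns, each with SD 1
R    <- qr.R(qrX) / s              # chosen so that Xc = Q %*% R

# Fit on the rotated, well-conditioned columns
fq  <- glm.fit(cbind(1, Q), y, family = binomial())
gam <- fq$coefficients[-1]

# Back-transform to the original scale: beta = R^{-1} gamma
beta  <- drop(backsolve(R, gam))
alpha <- unname(fq$coefficients[1]) - sum(xbar * beta)
c(alpha, beta)
```

The covariance matrix can be back-transformed analogously by pre- and post-multiplying the rotated-space covariance by the inverse of R and its transpose.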

Besides dealing with fundamental statistical computing issues, I changed lrm.fit to use Therneau’s survival package concordancefit function to compute concordance probabilities used by various rank correlation indexes such as Somers’ $D_{xy}$. This got rid of a good deal of code. Previously, lrm.fit binned predicted probabilities into 501 bins to calculate rank measures almost instantly. But I decided it was time to use exact calculations now that concordancefit is so fast. Though not used in rms, concordancefit also computes accurate standard errors.

A new Fortran 2018 subroutine lrmll was written to efficiently calculate the -2 log-likelihood function (the deviance), the gradient vector, and the hessian matrix of all second partial derivatives of the log-likelihood with respect to intercept(s) $\alpha$ and regression coefficients (slopes) $\beta$.

lrm.fit now implements several optimization algorithms. When Y is binary and there is no penalization, it has the option to use glm.fit(..., family=binomial()) which runs iteratively reweighted least squares, a fast-converging algorithm for binary logistic regression (but does not extend to ordinal regression). lrm.fit implements 8 optimization algorithms.

optional for proportional odds models (according to lrm.fit initglm argument), does a first pass with glm.fit that fits a binary logistic regression for the probability that Y is greater than or equal to the median Y; this fit can then be used for starting values after shifting the default starting intercept values so that the middle intercept matches the intercept from the binary fit
nlminb: a function in the stats package, tied with NR as the fastest algorithm in general; uses hessians. This uses Fortran routines in the Bell Labs port library for which the original paper may be found here.
NR: Newton-Raphson iteration with step-halving, implemented here as R function newtonr. This algorithm is the default because it has the advantage of having full control over convergence criteria. It requires convergence with respect to 3 simultaneous criteria: changes in -2 LL, changes in parameter estimates, and nearness of gradient to zero. The user can relax any of the 3 criteria thresholds to relax conditions. This is the optimization method used in the old lrm.fit function, but written in Fortran there for slightly increased speed. Defaults for tolerance parameters are such that eps (iteration-to-iteration change in -2 LL) will usually dictate when convergence is declared.
LM: Levenberg-Marquardt algorithm, which is a kind of Newton method with generalized step-halving
glm.fit: for binary Y without penalization only
nlm: the stats function that is usually recommended for maximum likelihood, but I found it is slower than nlminb without offering other advantages
BFGS and L-BFGS-B using the stats optim function: fast general-purpose algorithms that do not require the hessian, so these can be used with an unlimited number of intercepts as long as the user sets the lrm.fit parameter compvar to FALSE so that the hessian is not calculated once after convergence
CG and Nelder-Mead: see optim

This was inspired by the MASS package polr function

The last four methods do not involve computing the hessian, which is the most computationally intensive calculation in MLE. But they are still slower overall if you want to get absolute convergence, due to requiring many more evaluations of the objective function (-2 LL).
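Newton-Raphson with step-halving and three simultaneous convergence criteria, as described for the NR method above, can be sketched for binary logistic regression as below. The function newton_sh and its argument names and default tolerances are hypothetical; this is not the newtonr function from rms.

```r
newton_sh <- function(X, y, eps = 1e-9, abstol = 1e-8, gradtol = 1e-6,
                      maxit = 50) {
  m2ll  <- function(b) { eta <- drop(X %*% b)
                         -2 * sum(y * eta - log1p(exp(eta))) }
  score <- function(b) drop(crossprod(X, y - plogis(drop(X %*% b))))
  b   <- rep(0, ncol(X))
  obj <- m2ll(b)
  for (it in seq_len(maxit)) {
    p    <- plogis(drop(X %*% b))
    info <- crossprod(X * (p * (1 - p)), X)     # information matrix
    step <- solve(info, score(b))               # full Newton step
    bnew <- b + step
    objnew <- m2ll(bnew)
    # Step-halving: shrink the step while -2 LL gets worse
    while ((!is.finite(objnew) || objnew > obj + 1e-10) &&
           max(abs(step)) > 1e-12) {
      step   <- step / 2
      bnew   <- b + step
      objnew <- m2ll(bnew)
    }
    # Require all three criteria at once
    converged <- abs(obj - objnew)     < eps    &&   # change in -2 LL
                 max(abs(bnew - b))    < abstol &&   # change in estimates
                 max(abs(score(bnew))) < gradtol     # gradient near zero
    b <- bnew; obj <- objnew
    if (converged) break
  }
  list(coefficients = b, deviance = obj, iter = it)
}

set.seed(7)
n <- 300
X <- cbind(1, rnorm(n), rnorm(n))               # hypothetical design matrix
y <- rbinom(n, 1, plogis(X %*% c(-0.3, 0.8, -0.6)))
fit <- newton_sh(X, y)
fit$iter
```

Relaxing any one of the three thresholds loosens the stopping rule, which mirrors the control the text attributes to the NR option.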

See this for useful comparisons of some of the algorithms.

For rms 7.0-0, lrm.fit and orm.fit were changed to use the R Matrix package for much more efficient handling of sparse hessian/information matrices. Now there are no limitations on the number of distinct Y-values analyzed by lrm and orm. The primary differences between the two modeling procedures are

lrm only implements a single link function (logistic)
lrm implements multiple optimization methods
orm only implements Newton-Raphson optimization (with step-halving) and Levenberg-Marquardt
lrm output (from print.lrm) includes rank correlation model performance indexes that are more suitable for discrete Y
orm output (from print.orm) includes only Spearman’s $\rho$ as a rank predictive discrimination measure; this is more suitable for continuous Y
orm has a Quantile and an ExProb method; both lrm and orm have Mean methods, as means work on discrete numeric Y (unlike quantiles)
lrm implements transx for QR orthonormalization of the design matrix
orm implements scale for mean-centering and standard deviation scaling of $X$

Background: Convergence

The two most commonly used convergence criteria for MLE are

relative convergence: stop iterations when the change in deviance is small or when the relative change in parameter estimates is small
absolute convergence: stop iterations when the first derivative of the log likelihood is small in absolute value for all parameters, or there is a very small absolute change in parameter values

Absolute convergence with respect to the first derivatives (gradients) is similar to demanding that none of the regression parameters change very much since the last iteration. From the standpoint of what is important statistically, convergence of the deviance (what the gold standard likelihood ratio test is based on) is sufficient. Changes in parameter values that leave the deviance unchanged to several decimal places will be buried in the noise. Absolute convergence may affect $\hat{\beta}$, but relative convergence tends to result in a very stable $X\hat{\beta}$. You might deem convergence satisfactory when parameter estimates between successive iterations change by less than 0.05 standard errors (which would require evaluation of the hessian to know), but approximately this corresponds to relative convergence judged by the deviance.

Despite achieving statistically relevant convergence more easily, there is a real issue of reproducibility. Different algorithms and software may result in semi-meaningful differences in $\hat{\beta}$ taken in isolation (ignoring how much of $\hat{\beta}$ is noise). Only by having all the implementations achieve absolute convergence will different analysts be very likely to reproduce each others’ work. If this is important to you, either avoid the BFGS algorithms (for which the R optim function does not have an absolute convergence criterion) or use a highly stringent relative convergence criterion, i.e., set the lrm.fit argument reltol to a very small value. Below I explore how convergence and execution time are affected by reltol.
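The trade-off can be seen directly with the stats optim function, whose BFGS method stops on a relative change in the objective. The sketch below uses simulated binary data and three arbitrary reltol values, then reports the deviance, the largest absolute gradient element, and the number of objective evaluations at each setting. The helper fit_at is hypothetical.

```r
set.seed(8)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))            # hypothetical design matrix
y <- rbinom(n, 1, plogis(X %*% c(0.2, 1, -1)))

m2ll <- function(b) { eta <- drop(X %*% b)
                      -2 * sum(y * eta - log1p(exp(eta))) }
grad <- function(b) -2 * drop(crossprod(X, y - plogis(drop(X %*% b))))

# Fit with BFGS at a given relative tolerance and summarize convergence
fit_at <- function(reltol) {
  f <- optim(rep(0, 3), m2ll, grad, method = 'BFGS',
             control = list(reltol = reltol, maxit = 1000))
  c(reltol       = reltol,
    deviance     = f$value,
    max_abs_grad = max(abs(grad(f$par))),
    fn_evals     = unname(f$counts['function']))
}

res <- t(sapply(c(1e-4, 1e-8, 1e-12), fit_at))
res
```

Comparing the rows shows how much extra work is spent chasing absolute convergence once the deviance has already settled.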

Overview of Findings

opt_method='NR' and 'nlminb' are the fastest methods, even slightly faster than glm.fit.

Limited tests of transx in lrm.fit to use QR factorization do not show much benefit, but see below for details.

For ordinal Y, using opt_method='BFGS' with compstats=FALSE for lrm mimics the polr function in the MASS package. For a large number of intercepts, lrm.fit is much faster due to computing the deviance and derivatives in highly efficient compiled Fortran and capitalizing on sparse matrices.

Setting initglm=TRUE tells lrm.fit to get initial ordinal model parameter values from a binary logistic glm.fit run when cutting Y at the median. This does not seem to offer much benefit over setting starting values to covariate-less MLEs of $\alpha$ (which are calculated instantly when $\beta = 0$) and setting $\beta = 0$. In one example using nlminb the algorithm actually diverged with initglm=TRUE but ran fine without it. This is probably due to large intercept values with many distinct Y values.

Extensive tests of opt_method='BFGS' show good stochastic performance (convergence to what matters, deviance-wise), but unimpressive execution time if you want absolute convergence by setting reltol to a small number.

The best overall algorithm that uses the hessian is NR in terms of speed and convergence, with nlminb and LM close seconds. NR is the method used in the old lrm.fit, so for most datasets, the new optimization options are not needed.

Even though lrm.fit is optimized for the logistic link function, there is not much difference in execution time between lrm.fit and orm.fit for binary and ordinal logistic models.

Validation

Two kinds of validations appear below.

Validation of the Fortran-calculated deviance and derivatives: for a given small dataset, get parameter estimates from rms::orm with tight convergence criteria, and test the Fortran code by evaluating the deviance and derivatives at these parameter values. The gradient (first derivative) should be very close to zero, and the deviance should be identical to that from orm. The inverse of the negative of the hessian matrix should equal the variance-covariance matrix computed by orm.
Validate lrm.fit overall by letting it pick its usual starting values for iteration, and compare its output to that from orm and other fitting functions including the last pre-6.9-0 version of lrm.fit which is named olrm below.
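The logic of the first kind of validation can be sketched without the Fortran code. Here glm with a tight convergence criterion stands in for orm as the reference fit, and hand-written R expressions stand in for the compiled deviance and derivative calculations; both substitutions are assumptions made for illustration.

```r
set.seed(9)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))     # hypothetical small dataset
d$y <- rbinom(n, 1, plogis(-0.2 + 0.7 * d$x1 - 0.4 * d$x2))

# Reference fit with tight convergence
ref <- glm(y ~ x1 + x2, family = binomial(), data = d,
           control = glm.control(epsilon = 1e-12))
b <- coef(ref)
X <- model.matrix(ref)

# Independently computed deviance, gradient, and information at those estimates
p    <- plogis(drop(X %*% b))
dev  <- -2 * sum(d$y * log(p) + (1 - d$y) * log(1 - p))
g    <- drop(crossprod(X, d$y - p))
info <- crossprod(X * (p * (1 - p)), X)           # negative hessian of LL

c(deviance_diff = dev - deviance(ref),            # should be essentially 0
  max_abs_grad  = max(abs(g)),                    # should be essentially 0
  max_vcov_diff = max(abs(solve(info) - vcov(ref))))
```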

require(rms)
# Fetch the old lrm.fit (function olrm)
require(orms)

# Define simple timer functions with and without printing

stime <- function(...) {
  ti <- system.time(...)['elapsed']
  cat('Elapsed time: ', ti, 's\n', sep='')
  invisible(ti)
}

stim <- function(...) system.time(...)['elapsed']

# Define a function that will time a series of lines of code, repeating each
# reps times (default is 10).  When only one line of code is given, the
# elapsed execution time is printed and the result of the line is returned.
# Otherwise, a vector of run times corresponding to all the lines of code is
# returned, and the values returned from the lines are stored in the global
# environment in object Res, with components named according to inputs to tim.
# When reps > 1 the values from running code lines are from the last rep.

tim <- function(...) {
  reps <- 10
  w <- sys.call()
  n <- names(w)
  m <- length(n)
  k <- 2 : m
  if('reps' %in% n) {
    i    <- which(n == 'reps')
    reps <- eval(w[[i]])
    k    <- setdiff(k, i)
  }
  n <- n[k]
  m <- length(n)
  r <- numeric(m)
  Res        <<- vector('list', m)   # put in global environment
  names(r)   <- n
  names(Res) <<- n
  l <- 0
  for(i in k) {
    l <- l + 1
    s <- stim(for(j in 1 : reps)
    R <- eval(w[[i]], parent.frame())) / reps
    r[l]      <- s
    Res[[l]] <<- R
  }
  label(r) <- paste('Per-run execution time in seconds, averaged over', reps, 'runs')
  if(m == 1) {
    print(r)
    return(R)
  }
  r
}

m   <- function(x) max(abs(x))
mad <- function(a, b) c(mad    =     mean(abs(a - b)),
                        relmad = 2 * mean(abs(a - b) / (abs(a) + abs(b))))
wratio <- function(r) exp(max(abs(log(r))))  # worst ratio, whether < or > 1.0

# Function creating a table of matrix dimensions for matrices in a list
mdim <- function(w) {
  i <- sapply(w, is.matrix)
  g <- function(x) c(rows=nrow(x), columns=ncol(x))
  sapply(w[i], g)
}
            
# Function to summarize a series of model fits stored in Res
smod <- function() {
  max_abs_u <- sapply(Res, function(x) if(length(x$u)) m(x$u) else NA)
  iter      <- sapply(Res, function(x) if(length(x$iter)) tail(x$iter, 1) else NA)
  deviance  <- sapply(Res, function(x) tail(x$deviance, 1))
  print(data.frame(deviance, max_abs_u, iter))
  l <- length(Res)
  n <- names(Res)
  if(l > 1) {
    for(i in 1 : l) {
      r <- Res[[i]]
      # See if polr or BFGS
      a <- inherits(r, 'polr') || (length(r$opt_method) && r$opt_method=='BFGS')
      r$var <- if(inherits(r, 'orm'))
        vcov(r, intercepts='all') else if(! a) vcov(r)
      if(inherits(r, 'polr'))
        r$coefficients <- c(-r$zeta, coef(r))
      Res[[i]] <- r
    }
    
    cat('\nMaximum |difference in coefficients|,',
        'Maximum |relative difference|\n',
        'worst ratio of covariance matrices\n\n')
    d <- NULL
    for(i in 1 : (l-1))
      for(j in (i+1) : l) {
        ri <- Res[[i]]; rj <- Res[[j]]
        comp <- paste(n[i], 'vs.', n[j])
        co <- mad(ri$coefficients, rj$coefficients)
        w <- data.frame(Comparison         = comp,
                        `Max |difference|` = co[1],
                        `Max |rel diff|`   = co[2],
                        `Cov ratio`        = NA, check.names=FALSE)
        if(length(ri$var) && length(rj$var)) 
          w$`Cov ratio` <- wratio(ri$var / rj$var)
        d <- rbind(d, w)
        }
  }
  d$`Cov ratio` <- ifelse(is.na(d$`Cov ratio`), '', format(d$`Cov ratio`))
  rownames(d) <- NULL
  d
}

Check -2 Log Likelihood and Derivatives for a Simple Model

First define an R function that makes it easy to run the Fortran subroutine.

rfort <- function(x, y, alpha, beta, what=2L, debug=0L, penhess=1L,
                  offset=rep(0., n), wt=rep(1., n),
                  penmat=matrix(0., p, p)) {
  x  <- as.matrix(x)
  p  <- ncol(x)
  n  <- nrow(x)
  yd <- sort(unique(y))
  k  <- max(yd)
  nv <- as.integer(k + p)
  if(length(yd) != k + 1 || any(yd != 0 : k))
    stop('y must be coded 0-k for lrmll')
  storage.mode(x) <- storage.mode(offset) <- storage.mode(wt) <- storage.mode(penmat) <-
    storage.mode(alpha) <- storage.mode(beta) <- 'double'
  w <- .Fortran('lrmll', n, as.integer(k), p, x, as.integer(y), offset, wt, penmat, alpha, beta,
                 logL=numeric(1), grad=numeric(nv),
                 a=matrix(0e0, k, 2), b=matrix(0e0, p, p), ab=matrix(0e0, k, p),
                 what=as.integer(what), debug=as.integer(debug),
                 penhess=as.integer(penhess), salloc=integer(1))
  # lrmll creates 3 compact hessian submatrices
  # Put them together into a single hessian
  w$hess <- infoMxop(w[c('a', 'b', 'ab')])
  w
}

Binary Y

x <- 1 : 10
y <- c(0, 1, 0, 0, 0, 1, 0, 1, 1, 1)

# From orm.  Deviance = 13.86294, 10.86673
alpha <- -2.4412879506377 ; beta <- 0.4438705364796 
w <- rfort(x, y, alpha, beta)
w$logL

[1] 10.86673

w$grad

[1] -1.814104e-13 -1.156408e-12

w <- rfort(x, y, alpha, beta, what=3L)
w$hess   # negative inverse of covariance matrix

          [,1]       [,2]
[1,] -1.813852  -9.976185
[2,] -9.976185 -66.123262

- solve(vcov(glm(y ~x, family=binomial(), control=list(epsilon=1e-12))))

            (Intercept)          x
(Intercept)   -1.813852  -9.976185
x             -9.976185 -66.123262

Res <- list(glm     = glm(y ~ x, family=binomial(), control=list(epsilon=1e-12)),
            olrm    = olrm(x, y, eps=1e-10),
            lrm.fit = lrm.fit(x, y, reltol=1e-12),
            orm     = orm(y ~ x, eps=1e-10) )

smod()

        deviance    max_abs_u iter
glm     10.86673           NA    4
olrm    10.86673 2.220446e-15   NA
lrm.fit 10.86673 3.270271e-08    5
orm     10.86673 3.552714e-15    5

Maximum |difference in coefficients|, Maximum |relative difference|
 worst ratio of covariance matrices

        Comparison Max |difference| Max |rel diff| Cov ratio
1     glm vs. olrm     5.551115e-16   4.320309e-16         1
2  glm vs. lrm.fit     5.551115e-16   4.320309e-16         1
3      glm vs. orm     0.000000e+00   0.000000e+00         1
4 olrm vs. lrm.fit     0.000000e+00   0.000000e+00         1
5     olrm vs. orm     5.551115e-16   4.320309e-16         1
6  lrm.fit vs. orm     5.551115e-16   4.320309e-16         1

Cov ratio in the above output is the anti-log of the maximum absolute value of the log of the ratio of corresponding elements of two covariance matrices. So it represents the worst disagreement between the two matrices, whether the ratio is below or above 1.0, with 1.0 being perfect agreement to 7 decimal places. Max |difference| summarizes the absolute differences in estimated regression coefficients between two methods, and Max |rel diff| summarizes the ratios of absolute differences to the sum of absolute values of the two coefficient estimates. As coded in mad, both summaries are means over the coefficients (the relative one multiplied by 2) rather than maxima.
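
To make these summaries concrete, here is a small sketch that applies the wratio and mad helpers, repeated from the definitions earlier in this article, to made-up numbers. The input vectors are hypothetical and only illustrate how the summaries behave.

```r
# Helper definitions repeated from earlier in the article
mad <- function(a, b) c(mad    =     mean(abs(a - b)),
                        relmad = 2 * mean(abs(a - b) / (abs(a) + abs(b))))
wratio <- function(r) exp(max(abs(log(r))))

# Hypothetical covariance elements from two fits
va <- c(1.0, 0.5, 2.0)
vb <- c(1.0, 1.0, 1.6)
# Element ratios are 1, 0.5, 1.25; a ratio of 0.5 is as bad as a ratio of 2
wratio(va / vb)

# Hypothetical coefficient vectors from two fits
# Absolute differences are 0 and 0.2, so the mean absolute difference is 0.1
mad(c(1, 2), c(1, 2.2))
```

Because wratio works on the log scale, a covariance element that is half as large and one that is twice as large count as the same amount of disagreement.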

Y=0, 1, 2

x <- 1 : 10
y <- c(0, 2, 0, 1, 0, 2, 2, 1, 1, 2)
f <- orm(y ~ x, eps=1e-10) # deviance 21.77800  19.79933
- solve(vcov(f, intercepts='all'))

          y>=1       y>=2           x
y>=1 -2.336337   1.148893   -5.103426
y>=2  1.148893  -2.657167  -10.455125
x    -5.103426 -10.455125 -110.128337

- solve(vcov(lrm(y ~ x, eps=1e-10)))

          y>=1       y>=2           x
y>=1 -2.336337   1.148893   -5.103426
y>=2  1.148893  -2.657167  -10.455125
x    -5.103426 -10.455125 -110.128337

- solve(olrm(x, y, eps=1e-10)$var)

          y>=1       y>=2        x[1]
y>=1 -2.336337   1.148893   -5.103426
y>=2  1.148893  -2.657167  -10.455125
x[1] -5.103426 -10.455125 -110.128337

- solve(vcov(MASS::polr(factor(y) ~ x)))

             x       0|1       1|2
x   -110.12802  5.103400 10.455131
0|1    5.10340 -2.336318  1.148877
1|2   10.45513  1.148877 -2.657156

# Note a problem with VGAM
- solve(vcov(VGAM::vglm(y ~ x, VGAM::cumulative(reverse=TRUE, parallel=TRUE))))

              (Intercept):1 (Intercept):2           x
(Intercept):1     -2.434082      1.155134   -5.148614
(Intercept):2      1.155134     -2.580855   -9.505256
x                 -5.148614     -9.505256 -100.063658

- solve(vcov(ordinal::clm(factor(y) ~ x)))

          0|1       1|2           x
0|1 -2.336337  1.148893    5.103426
1|2  1.148893 -2.657167   10.455125
x    5.103426 10.455125 -110.128337

alpha <- c(-0.8263498291155, -2.3040967379853)
beta <- 0.3091154153068 

# Analytically compute 2nd derivative of log L wrt beta
info <- function(x) {
  p1 <- plogis(alpha[1] + x * beta) 
  p2 <- plogis(alpha[2] + x * beta)
  d  <- p1 - p2                         # P(Y = 1)
  v1 <- p1 * (1 - p1)                   # first derivative of expit
  v2 <- p2 * (1 - p2)
  w1 <- p1 * (1 - p1) * (1 - 2 * p1)    # second derivative of expit
  w2 <- p2 * (1 - p2) * (1 - 2 * p2)
  x * x * ((w1 - w2) * d - (v1 - v2)^2) / d / d
}

# Compute 2nd derivative of log(p1 - p2) wrt beta numerically
dif <- function(x, beta) log(plogis(alpha[1] + x * beta) - plogis(alpha[2] + x * beta))
del <- 1e-6
d2 <- function(x) ((dif(x, beta + del) - dif(x, beta)) / del - (dif(x, beta) - dif(x, beta - del)) / del) / del
c(info(4), info(8), info(9))

[1]  -6.88269 -24.55641 -27.93077

num <- c(d2(4), d2(8), d2(9))
print(num)

[1]  -6.882495 -24.556135 -27.931213

sum(num)

[1] -59.36984

w <- rfort(x, y, alpha, beta)
w$logL

[1] 19.79933

w$grad

[1] -1.896261e-13 -1.408873e-13 -2.330580e-12

w <- rfort(x, y, alpha, beta, what=3L)
w$hess

3 x 3 sparse Matrix of class "dgCMatrix"
                                     
[1,] -2.336337   1.148893   -5.103426
[2,]  1.148893  -2.657167  -10.455125
[3,] -5.103426 -10.455125 -110.128337

Simple Ordinal Model With Weights, Offsets, and Penalties

We first ignore weights, offsets, and penalties, then incorporate them.

set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
X  <- cbind(x1, x2)
y  <- sample(0:5, 50, TRUE)
wt <- runif(50)
wt <- wt / sum(wt)
of <- rnorm(50)
pm <- cbind(c(1.2, 0.6), c(0.6, 1.2))

f <- olrm(X, y, eps=1e-15)
f$deviance

[1] 177.1350 176.8656

cof <- coef(f)
w <- rfort(X, y, cof[1:5], cof[6:7])
w$logL

[1] 176.8656

w$grad

[1]  3.330669e-16  3.552714e-15  2.886580e-15 -2.886580e-15  1.998401e-15
[6]  1.998401e-15  1.332268e-15

w <- rfort(X, y, cof[1:5], cof[6:7], what=3L)
range(w$hess + solve(vcov(f)))

[1] -1.421085e-14  1.065814e-14

# Needed reltol=1e-15 to get gradient to 1e-8 with BFGS
# CG achieved 1e-6 with default, with 475 function evaluations

g <- lrm.fit(X, y, trace=1, opt_method='nlm', gradtol=1e-14,
             transx=TRUE, compstats=FALSE)

iteration = 0
Step:
[1] 0 0 0 0 0 0 0
Parameter:
[1]  1.51634749  0.57536414 -0.08004271 -0.84729786 -2.19722458  0.00000000
[7]  0.00000000
Function Value
[1] 177.135
Gradient:
[1]  1.199041e-14 -2.664535e-15 -1.243450e-14  7.993606e-15  3.330669e-15
[6]  2.510851e+00  3.400018e+00

iteration = 4
Parameter:
[1]  1.53972787  0.60066749 -0.06107634 -0.83760533 -2.19876783 -0.07124309
[7] -0.10606549
Function Value
[1] 176.8656
Gradient:
[1] -8.388246e-11 -3.561773e-11  8.475087e-11  1.565637e-11  7.530643e-13
[6]  5.342615e-12  9.455770e-13

Last global step failed to locate a point lower than x.
Either x is an approximate local minimum of the function,
the function is too non-linear for this algorithm,
or steptol is too large.

m(g$u); g$iter  # L-BFGS-B took factor as low as 1e2 to get u=3e-6, 50 iter

[1] 4.237544e-11

[1] 4

mad(cof, g$coefficients)

         mad       relmad 
1.377462e-12 4.676386e-12

range(f$var / vcov(g))

[1] 1 1

Now use weights, offset, and penalties.

f <- olrm(X, y, offset=of, weights=wt, penalty.matrix=2*pm, eps=1e-12)
f$deviance

[1] 3.475801 3.803296 3.799336

cof <- coef(f)
w <- rfort(X, y, cof[1:5], cof[6:7], offset=of, wt=wt, penmat=2*pm)
w$logL

[1] 3.799336

w$grad

[1] -1.578598e-16  1.526557e-16 -6.938894e-17  3.642919e-17 -4.250073e-17
[6]  1.734723e-17  1.387779e-17

w <- rfort(X, y, cof[1:5], cof[6:7], offset=of, wt=wt, penmat=2*pm, what=3L)
range(w$hess + solve(vcov(f)))

[1] -6.661338e-16  1.332268e-15

g <- lrm.fit(X, y, trace=3, offset=of, weights=wt,
             penalty.matrix=2e0*pm, opt_method='nlminb', compstats=FALSE)

  0:     3.8301785:  1.23911 0.611204 -0.0435624 -0.772679 -2.45652
  3:     3.8032957:  1.50965 0.925273 0.280639 -0.513375 -2.39790
  0:     3.8032957:  1.50965 0.925273 0.280639 -0.513375 -2.39790  0.00000  0.00000
  3:     3.7993356:  1.50984 0.927634 0.283085 -0.513316 -2.39750 0.0149656 -0.0431891

m(g$u); g$iter

[1] 3.404934e-09

          iterations evaluations.function evaluations.gradient 
                   3                    4                    3

mad(coef(f), coef(g))

         mad       relmad 
7.703453e-09 7.127106e-09

g$iter

          iterations evaluations.function evaluations.gradient 
                   3                    4                    3

g$u

         y>=1          y>=2          y>=3          y>=4          y>=5 
 3.404934e-09  6.040723e-10 -1.193834e-09 -1.123210e-09 -1.671228e-09 
           x1            x2 
 3.391934e-10  7.970911e-11

f$deviance

[1] 3.475801 3.803296 3.799336

g$deviance

[1] 3.475801 3.803296 3.799336

range(vcov(f) / vcov(g))

[1] 0.9999999 1.0000002

Check Accuracy Against Old `lrm.fit` For a Variety of Levels of Y

set.seed(1)
n <- 150
w <- NULL
for(i in 1 : 40) {
  k <- sample(1 : 20, 1, TRUE)
  y <- sample(0 : k, n, TRUE)
  x1 <- runif(n)
  x2 <- exp(rnorm(n))
  X  <- cbind(x1, x2)
  f <- olrm(X, y, eps=1e-10)
  g <- lrm.fit(    X, y, opt_method='nlminb', compstats=FALSE)
  d <- coef(f) - coef(g)
  r <- wratio(vcov(f) / vcov(g))
  w <- rbind(w, data.frame(i, k, mad.beta=m(d), Cov.ratio=r))
}
range(w$Cov.ratio)

[1] 1 1

with(w, plot(k, mad.beta, log='y'))

Study Convergence and Timings

Fortran vs. R

Regarding execution speed, a key question is whether it’s worth the effort to code part of the calculations in a compiled language such as Fortran, C, or C++, as compared to just using R. Let’s explore this by coding the gradient vector calculation in R and timing it against the new Fortran code. Also write a function making the Fortran routine easy to call when computing the gradient.

The code below makes use of the facts that $\frac{d}{dx}\text{expit}(x) = \text{expit}(x)(1 - \text{expit}(x))$ and $\frac{d}{dx}\log \text{expit}(x) = 1 - \text{expit}(x) = \text{expit}(-x)$. The philosophy of this code is that nothing is calculated unless it is relevant, which makes for more lines of code. For example, the code does not create extra intercepts to yield probabilities of 0 or 1, but instead handles each of the cases $Y=0$ and $Y=k$ separately. For the non-interior levels of Y the gradient is very simple as probabilities are expits and not differences.

grad <- function(alpha, beta, x, y) {
  k  <- length(alpha)
  p  <- length(beta)
  f  <- plogis
  xb <- as.vector(x %*% beta)
  P1 <- P2 <- numeric(nrow(x))
  i0 <- y == 0
  ik <- y == k
  ib <- y > 0 & y < k
  P1[i0] <- 1e0
  P2[i0] <- f(alpha[1] + xb[i0])
  P1[ik] <- f(alpha[k] + xb[ik])
  P2[ik] <- 0e0
  P1[ib] <- f(alpha[y[ib]    ] + xb[ib])
  P2[ib] <- f(alpha[y[ib] + 1] + xb[ib])
  pq1    <- P1 * (1e0 - P1)
  pq2    <- P2 * (1e0 - P2)
  P      <- P1 - P2
  U      <- rep(0e0, k + p)
  
  U[1] <- - sum(1e0 - P[i0])
  U[k] <-   U[k] + sum(1e0 - P[ik])
  
  # Gradient for intercepts
  if(k > 1) for(m in 1 : k) {  # only interior y values create complexity
    if(m < k) U[m] <- U[m] + sum(pq1[y == m] / P[y == m])
    if(m > 1) U[m] <- U[m] - sum(pq2[y == m - 1] / P[y == m - 1])
  }
  # Gradient for slopes
  for(m in 1 : p) {
    U[k + m] <-            - sum(x[i0, m] * (1e0 - P[i0]))
    U[k + m] <-   U[k + m] + sum(x[ik, m] * (1e0 - P[ik]))
    if(k > 1) for(i in 1 : (k - 1)) {
      j <- y == i
      U[k + m] <-   U[k + m] + sum(x[j,  m] * (pq1[j] - pq2[j]) / P[j])
    }
  }
  U
}
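
The two derivative facts that grad relies on can be checked numerically. This sketch assumes they are the standard identities: the derivative of expit(x) is expit(x)(1 - expit(x)), and the derivative of log expit(x) is expit(-x), the latter being stated in a comment in the Fortran code later in this article. The grid and step size are arbitrary choices.

```r
expit <- plogis
x <- seq(-5, 5, by = 0.5)   # arbitrary grid of evaluation points
h <- 1e-5                   # step size for central differences

# Central-difference approximations to the two derivatives
d_expit     <- (expit(x + h) - expit(x - h)) / (2 * h)
d_log_expit <- (log(expit(x + h)) - log(expit(x - h))) / (2 * h)

# Both maximum discrepancies should be tiny, limited only by finite-difference error
max(abs(d_expit     - expit(x) * (1 - expit(x))))
max(abs(d_log_expit - expit(-x)))
```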

fgrad <- function(alpha, beta, x, y) rfort(x, y, alpha, beta)$grad
  
# This calls a version of the Fortran code using the alpha extension approach
fgrad2 <- function(alpha, beta, x, y) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  k <- max(y)  # y assumed to be coded 0-k
  .Fortran('lrmll2',
           as.integer(n), as.integer(k), as.integer(p),
           x, y, rep(0e0, n), rep(1e0, n), matrix(0e0, nrow=p, ncol=p),
           alpha, beta, numeric(1), u=numeric(k + p), numeric(1), 2L, 0L, 0L)$u
}

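But is the elegance of following the letter of the proportional odds model’s definition worth the trouble? What if we used the trick that is most often used in MLE and Bayesian modeling where we define extra intercepts so that all values of Y appear to be interior values and the same difference in probabilities can be computed everywhere, and we did not use special cases to compute derivatives of log likelihood components? Specifically we can write the model as the following, with $y = 0, 1, \ldots, k$ and $\text{expit}(x) = \frac{1}{1 + e^{-x}}$.

$$P(Y = y | X) = \text{expit}(\alpha_{y} + X\beta) - \text{expit}(\alpha_{y+1} + X\beta)$$

by expanding the original vector of intercepts $\alpha_1, \ldots, \alpha_k$ by adding $\alpha_0 = \infty$ and $\alpha_{k+1} = -\infty$. Then $P(Y \geq y | X) = \text{expit}(\alpha_y + X\beta)$ for any $y$. In computations $\infty$ is replaced by a finite value (100 in the code below), chosen so that $\text{expit}(\pm 100)$ is indistinguishable from 1 and 0.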
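
As a quick illustration of the extension idea, the following sketch pads a hypothetical intercept vector with +100 and -100 (the values used in the grad2 code later in this article) and confirms that the resulting cell probabilities match those from handling Y=0 and Y=k as special cases. The intercepts and linear predictor are made up.

```r
k     <- 3
alpha <- c(1.5, 0, -1.2)   # hypothetical decreasing intercepts for Y >= 1, 2, 3
lp    <- 0.4               # hypothetical linear predictor X beta

# Special-case coding: Y = 0 and Y = k are handled separately
p_special <- c(1 - plogis(alpha[1] + lp),
               plogis(alpha[1:(k - 1)] + lp) - plogis(alpha[2:k] + lp),
               plogis(alpha[k] + lp))

# Extension coding: pad the intercepts so that every Y looks interior
ealpha <- c(100, alpha, -100)
y      <- 0:k
p_ext  <- plogis(ealpha[y + 1] + lp) - plogis(ealpha[y + 2] + lp)

max(abs(p_special - p_ext))   # the two codings should agree
sum(p_ext)                    # probabilities over Y = 0, ..., k should sum to 1
```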

We need to run some computational tests to make sure that the upcoming shortcuts do not cause any computational inefficiencies or inaccuracies.

# Check that R computes expit very quickly for extreme values of x
tim(smallvals = plogis(rep(  1, 100000)),
    m50       = plogis(rep(-50, 100000)),
    p50       = plogis(rep( 50, 100000)))

Per-run execution time in seconds, averaged over 10 runs 
smallvals       m50       p50 
    9e-04     9e-04     1e-03

# Check that taking the log of probabilities is as accurate as
# using plogis' special log probability calculation
x <- seq(-50, 50, by=1); m(log(plogis(x)) - plogis(x, log=TRUE))

[1] 8.881784e-16

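So it appears that the “intercept extension” approach will not cause any numerical problems. To code this method while computing the gradient, we need the derivative of the log of the difference in probabilities (call this $Q$) given above. Consider a general parameter $\theta$ which may be one of the (interior) $\alpha$s or one of the $\beta$s.

Since $\frac{d}{dx}\text{expit}(x) = \text{expit}(x)(1 - \text{expit}(x))$, the main part of $\frac{\partial \log Q}{\partial \theta}$, writing $P_1 = \text{expit}(\alpha_y + X\beta)$ and $P_2 = \text{expit}(\alpha_{y+1} + X\beta)$ so that $Q = P_1 - P_2$, is

$$\frac{\partial \log Q}{\partial \theta} = \frac{P_1 (1 - P_1) D_1(\theta) - P_2 (1 - P_2) D_2(\theta)}{Q}$$

Here $D_1(\theta)$ and $D_2(\theta)$ are the derivatives of $\alpha_y + X\beta$ and $\alpha_{y+1} + X\beta$ with respect to $\theta$. Each is the 0/1 indicator function of whether $\theta$ is the intercept in question when $\theta$ is an $\alpha$, and is $X_j$ when $\theta = \beta_j$. Now code this in R.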

grad2 <- function(alpha, beta, x, y) {
  k     <- length(alpha)
  p     <- length(beta)
  xb    <- as.vector(x %*% beta)
  alpha <- c(100e0, alpha, -100e0)
  # Must add 1 to y to compute P1 and P2 since index starts at 1, not 0
  P1    <- plogis(alpha[y + 1] + xb)
  P2    <- plogis(alpha[y + 2] + xb)
  Q     <- P1 - P2
  pq1   <- P1 * (1e0 - P1)
  pq2   <- P2 * (1e0 - P2)
  U     <- numeric(k + p)
  
  # Gradient for intercepts
  for(m in 1 : k)
    U[m] <- sum((pq1 * (y == m) - pq2 * (y + 1 == m)) / Q)
  
  # Use element-wise multiplication then get the sum for each column
  # Element-wise = apply the same value of the weights to each row of x
  U[(k + 1) : (k + p)] <- colSums(x * (pq1 - pq2) / Q)

  U
}
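
A gradient coded this way can also be checked against a numerical derivative of the log-likelihood, independently of any other fitting function. The sketch below is a minimal version of such a check. The function names loglik, agrad, and ngrad are hypothetical, agrad restates the grad2 steps in compact form, and the data and parameter values are simulated for illustration only.

```r
# Log-likelihood of the proportional odds model using padded intercepts
# theta holds the k intercepts followed by the slopes
loglik <- function(theta, x, y, k) {
  alpha <- c(100, theta[1:k], -100)
  xb    <- as.vector(x %*% theta[-(1:k)])
  sum(log(plogis(alpha[y + 1] + xb) - plogis(alpha[y + 2] + xb)))
}

# Analytic gradient, following the same steps as grad2
agrad <- function(theta, x, y, k) {
  alpha <- c(100, theta[1:k], -100)
  xb    <- as.vector(x %*% theta[-(1:k)])
  P1    <- plogis(alpha[y + 1] + xb)
  P2    <- plogis(alpha[y + 2] + xb)
  Q     <- P1 - P2
  pq1   <- P1 * (1 - P1)
  pq2   <- P2 * (1 - P2)
  Ua    <- sapply(1:k, function(m) sum((pq1 * (y == m) - pq2 * (y + 1 == m)) / Q))
  c(Ua, colSums(x * (pq1 - pq2) / Q))
}

# Central-difference gradient of loglik
ngrad <- function(theta, ..., h = 1e-6)
  sapply(seq_along(theta), function(i) {
    e <- replace(numeric(length(theta)), i, h)
    (loglik(theta + e, ...) - loglik(theta - e, ...)) / (2 * h)
  })

set.seed(2)
n <- 200; p <- 2; k <- 3
x     <- matrix(rnorm(n * p), nrow = n)
y     <- sample(0 : k, n, TRUE)
theta <- c(1, 0, -1, 0.3, -0.2)   # hypothetical decreasing intercepts, then slopes

# Maximum discrepancy between analytic and numerical gradients; should be tiny
max(abs(agrad(theta, x, y, k) - ngrad(theta, x, y, k)))
```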

Check that grad and grad2 yield the same answer and check their relative speeds.

set.seed(1)
n <- 50000; p <- 50; k <- 100
x <- matrix(rnorm(n * p), nrow=n)
y <- sample(0 : k, n, TRUE)
stopifnot(length(unique(y)) == k + 1)
alpha <- seq(-6, 6, length=k)
beta  <- runif(p, -0.5, 0.5)
tim(g1 = grad( alpha, beta, x, y),
    g2 = grad2(alpha, beta, x, y),
    reps = 5)   # creates Res

Per-run execution time in seconds, averaged over 5 runs 
    g1     g2 
1.0028 0.0598

m(Res$g1 - Res$g2)  # maximum absolute difference

[1] 4.774847e-11

Even though the streamlined code in grad2 required evaluating a few quantities that are known to be 0 or 1, its vectorization resulted in R code that ran about 17 times faster (0.06s vs. 1.0s per run). Later this will be compared with the speed of Fortran code.

Fortran Code for Gradient

Here is the central part of the Fortran code for computing the gradient vector. Fortran is blazing fast and easier to learn than C and C++, so more users may wish to translate some execution-time critical portions of their R code to Fortran 2018. R makes it easy to include Fortran code in packages, and it is also easy to include Fortran functions in RStudio sessions.

    u = 0_dp

    ! All obs with y=0
    ! The derivative of log expit(x) wrt x is expit(-x)
    ! Prob element is expit(-alpha(1) - lp)
    u(1) = - sum(wt(i0) * (1_dp - d(i0)))
    if(p > 0) then
      do l = 1, p
        u(k + l) = - sum(wt(i0) * x(i0, l) * (1_dp - d(i0)))
      end do
    end if
    ! All obs with y=k
    ! Prob element is expit(alpha(k) + lp)
    u(k) = u(k) + sum(wt(ik) * (1_dp - d(ik)))
    if(p > 0) then
      do l = 1, p
        u(k + l) = u(k + l) + sum(wt(ik) * x(ik, l) * (1_dp - d(ik)))
      end do
    end if
    ! All obs with 0 < y < k
    if(nb > 0) then
      do ii = 1, nb
        i = ib(ii)
        j = y(i)
        ! For p1, D() = 1 for alpha(j), 0 for alpha(j+1)
        ! For p2, D() = 0 for alpha(j), 1 for alpha(j+1)
        u(j)     = u(j)     + wt(i) * v1(i) / d(i)
        u(j + 1) = u(j + 1) - wt(i) * v2(i) / d(i)
        if(p > 0) then
          do l = 1, p
            u(k + l) = u(k + l) + wt(i) * x(i, l) * (v1(i) - v2(i)) / d(i)
          end do
        end if
      end do
    end if

This code can be streamlined using the extension approach:

    ealpha = [100d0, alpha, -100d0]
    p1 = expit(ealpha(y + 1) + lp)
    p2 = expit(ealpha(y + 2) + lp)
    q  = p1 - p2
    pq1 = p1 * (1_dp - p1)
    pq2 = p2 * (1_dp - p2)
    do j = 1, k
      u(j) = sum((pq1 * merge(1_dp, 0_dp, y     == j) - &
                  pq2 * merge(1_dp, 0_dp, y + 1 == j)) / (q / wt))
    end do
    if(p > 0) then
      do j = 1, p
        u(k + j) = sum(x(:, j) * (pq1 - pq2) / (q / wt))
      end do
    end if

But this code runs faster (this is the code tested below):

    do i = 1, n
      w = q(i) / wt(i)
      do j = 1, k
        if(y(i)     == j) u(j) = u(j) + pq1(i) / w
        if(y(i) + 1 == j) u(j) = u(j) - pq2(i) / w
      end do
      if(p > 0) then
        do j = 1, p
          u(k + j) = u(k + j) + x(i, j) * (pq1(i) - pq2(i)) / w
        end do
      end if
    end do

First let’s compare accuracy and speed of two ways of coding the gradient calculation in Fortran.

tim(Fortran  = fgrad (alpha, beta, x, y),
    Fortran2 = fgrad2(alpha, beta, x, y), reps=44/10 )

Per-run execution time in seconds, averaged over 4.4 runs 
   Fortran   Fortran2 
0.02522727 0.01022727

m(Res$Fortran - Res$Fortran2)

[1] 6.82121e-12

Though it produces the same result to within $7 \times 10^{-12}$, the streamlined Fortran is slower than the longer Fortran code.

Run the R grad2 function defined above, and the Fortran routine included in the new rms package, for $n=50000$, $p=50$ predictors, and $k=100$ intercepts.

# Check agreement of R and Fortran code
tim(R       = grad2(alpha, beta, x, y),
    Fortran = fgrad(alpha, beta, x, y), reps=40 )

Per-run execution time in seconds, averaged over 40 runs 
      R Fortran 
 0.0581  0.0146

m(Res$R - Res$Fortran)

[1] 6.82121e-12

We see that the compiled Fortran code is about 4 times faster than the R code (0.0146s vs. 0.0581s per run). In a nutshell Fortran allows you to not worry about vectorizing calculations, allowing for simpler code (there are many functions in Fortran for vectorizing operations but these are used more for brevity than for speed).

A more vectorized version of the code, written by ChatGPT, gave completely incorrect results but only ran faster by a ratio of 0.84. After much prompting, ChatGPT could only get the right answer if it re-wrote the code to be very inefficient by using excessive loops.

ChatGPT’s Compact But Non-Working Code

As streamlined as this code is, it does not improve execution time over my R grad function, taking 1.1s to run on the data given above while yielding the wrong answer.

gradient_proportional_odds <- function(alpha, beta, X, Y) {
  n <- nrow(X)  # Number of observations
  p <- ncol(X)  # Number of predictors
  k <- length(alpha)  # Number of thresholds (max Y value)

  # Compute linear predictors
  eta <- X %*% beta  # n x 1 vector

  # Expand eta to match dimensions with alpha
  eta_matrix <- matrix(eta, n, k, byrow = FALSE)  # n x k matrix

  # Compute expit(alpha_y + eta) for all thresholds y
  eta_alpha <- eta_matrix + matrix(alpha, n, k, byrow = TRUE)
  expit_vals <- 1 / (1 + exp(-eta_alpha))  # n x k matrix of expit values

  # Compute probabilities for P(Y = y)
  expit_upper <- cbind(1, expit_vals)  # P(Y >= 0) = 1
  expit_lower <- cbind(expit_vals, 0)  # P(Y >= k+1) = 0
  prob_Y <- expit_upper[, 1:k] - expit_lower[, 1:k]  # P(Y = y)

  # Indicator matrix for observed Y
  Y_ind <- matrix(0, n, k)
  for (i in 1:k) Y_ind[, i] <- as.numeric(Y == (i - 1))

  # Compute weights (observed minus predicted probabilities)
  weights <- (Y_ind - prob_Y)

  # Gradients w.r.t. alpha
  grad_alpha <- colSums(weights)

  # Gradients w.r.t. beta
  d_expit <- expit_vals * (1 - expit_vals)  # Derivative of expit
  grad_beta <- numeric(p)
  for (j in 1:p) {
    grad_beta[j] <- sum(weights * d_expit * X[, j])
  }

  # Combine gradients
  grad <- c(grad_alpha, grad_beta)
  return(grad)
}

The hessian requires about a factor of more calculations than the gradient when computed inefficiently, so the Fortran code pays off even more when using hessian-based optimization algorithms or computing the final covariance matrix. The lrmll Fortran code called by the new lrm.fit, and ormll called by orm.fit, capitalize on the tri-band diagonal form of the hessian for the cumulative probability model. In rms 7.0-0 lrm.fit and orm.fit were completely re-written to benefit from Fortran code and to use the Matrix package for more efficient sparse matrix handling.
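
The banded structure of the intercept portion of the hessian can be seen numerically. The sketch below is only an illustration: it uses simulated data, a single predictor, hypothetical parameter values, and a crude finite-difference hessian (the hypothetical function d2) rather than the analytic formulas in lrmll. Each observation's log-likelihood contribution involves at most two adjacent intercepts, so second derivatives for intercepts more than one position apart should be zero.

```r
# Log-likelihood with k intercepts and one slope, using padded intercepts
loglik <- function(theta, x, y, k) {
  alpha <- c(100, theta[1:k], -100)
  xb    <- x * theta[k + 1]
  sum(log(plogis(alpha[y + 1] + xb) - plogis(alpha[y + 2] + xb)))
}

# Numerical second partial derivative with respect to theta[i] and theta[j]
d2 <- function(i, j, theta, ..., h = 1e-4) {
  ei <- replace(numeric(length(theta)), i, h)
  ej <- replace(numeric(length(theta)), j, h)
  (loglik(theta + ei + ej, ...) - loglik(theta + ei - ej, ...) -
   loglik(theta - ei + ej, ...) + loglik(theta - ei - ej, ...)) / (4 * h * h)
}

set.seed(3)
n <- 300; k <- 5
x     <- rnorm(n)
y     <- sample(0 : k, n, TRUE)
theta <- c(seq(2, -2, length.out = k), 0.5)   # hypothetical intercepts, then slope

np <- k + 1
H  <- outer(1 : np, 1 : np,
            Vectorize(function(i, j) d2(i, j, theta, x, y, k)))

# Intercept block: only the diagonal and the first off-diagonals are non-zero
round(H[1 : k, 1 : k], 2)
```

The last row and column of H, which involve the slope, are generally full. It is the k by k intercept block that is tridiagonal, which is where most of the savings come from when k is large.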

Efficient Computation of the hessian for General Cumulative Probability Models

The lrmll Fortran subroutine computes the hessian very efficiently for the proportional odds model (logit link), making use of simplifications for the logistic model. Now consider general links. For is written in terms of the cumulative probability function by . We need all the second partial derivatives of the log of this difference in cumulative probabilities. Let’s simplify the expression to . ChatGPT provided the following results, which I simplified somewhat and corrected a sign error in .

Let
Let
Substitute for post differentiation
Divide all the second partial derivatives below by

For derivatives with respect only to , substitute for to get second partial derivatives with respect to and .

Let for and otherwise. For the interior values of , first consider which has probability . The second partial derivatives are, when multiplied by ,

Here stands for etc.

For the hessian elements are, when multiplied by ,

All these formulas are implemented in the ormll Fortran subroutine used in the 6.9-1 version of rms.

Use of Jacobians to Translate hessian on One Scale To Another Scale

Sometimes it’s easier to derive second derivatives on the original scale without logging, and using a Jacobian to translate to the other scale. ChatGPT provided the following Jacobian solution for our example. This is not used in the Fortran code because it might slow it down a bit. I believe that MASS::polr uses the Jacobian approach.

To systematically compute and represent the transformation of the second partial derivatives of into the second partial derivatives of , we organize the derivatives into a hessian matrix and apply the transformation rules using Jacobian operations.

Step 1: Define the hessian Matrix for

The hessian matrix of with respect to is:

Using the expressions for second partial derivatives derived earlier:

Thus:

Step 2: Define the Transformation for

We apply the chain rule:

This can be written compactly in matrix form.

Step 3: Matrix Transformation

Let:

be the hessian matrix of
be the scalar value of .

The hessian of is given by:

Explanation:

: Scales the hessian of by
: Accounts for the interaction of first derivatives of

Step 4: Components of

From earlier:

Thus:

Step 5: Compute

The outer product is:

Final Hessian of

The hessian of is:

Substitute and explicitly to get the full matrix. This provides the second derivatives of in terms of , its derivatives, and the structure of .
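
For reference, the general identity behind this kind of translation can be written compactly. The notation here is assumed for illustration and may differ from the original formulas: $f$ is the unlogged quantity (for our example the difference in cumulative probabilities), $\theta$ is the parameter vector, $\nabla f$ is the gradient of $f$, and $H_f$ is the hessian of $f$.

```latex
% Element-wise form, from applying the chain rule twice to log f
\frac{\partial^2 \log f}{\partial \theta_i \, \partial \theta_j}
  = \frac{1}{f} \frac{\partial^2 f}{\partial \theta_i \, \partial \theta_j}
  - \frac{1}{f^2} \frac{\partial f}{\partial \theta_i} \frac{\partial f}{\partial \theta_j}

% Matrix form: scaled hessian of f minus scaled outer product of its gradient
H_{\log f} = \frac{1}{f} H_f - \frac{1}{f^2} \nabla f \, \nabla f^{\top}
```

The first term scales the hessian of $f$ by $1/f$, and the second accounts for the products of first derivatives of $f$.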

Check Speed of `NR`, `LM`, `nlminb`, and `glm.fit`

glm.fit is tailored to be efficient when Y is binary and there is no penalization, by using iteratively weighted least squares. How does it stack up against NR and nlminb? Let’s try fitting a large binary logistic model.

set.seed(1)
n <- 100000; p <- 100
X <- matrix(rnorm(n * p), nrow=n)
y <- sample(0 : 1, n, TRUE)
tim(NR      = lrm.fit(X, y, compstats=FALSE),
    LM      = lrm.fit(X, y, opt_method='LM',     compstats=FALSE),
    nlminb  = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE),
    glm.fit = glm.fit(cbind(1, X), y, family=binomial()), reps=3)

Per-run execution time in seconds, averaged over 3 runs 
      NR       LM   nlminb  glm.fit 
1.358000 1.252000 1.276667 1.731000

# transx=TRUE adds 2.3s to lrm.fit
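
For readers unfamiliar with iteratively weighted least squares, here is a minimal sketch of the idea for binary logistic regression. It is not the glm.fit implementation, which uses a QR decomposition and more careful safeguards. The simulated data, true coefficients, and stopping rule are assumptions made for illustration.

```r
set.seed(4)
n <- 500
X <- cbind(1, matrix(rnorm(n * 3), nrow = n))   # intercept plus 3 hypothetical predictors
y <- rbinom(n, 1, plogis(as.vector(X %*% c(-0.5, 1, 0, 0.5))))

beta <- rep(0, ncol(X))
for(iter in 1 : 25) {
  eta  <- as.vector(X %*% beta)
  mu   <- plogis(eta)
  w    <- mu * (1 - mu)            # IWLS weights
  z    <- eta + (y - mu) / w       # working response
  # Weighted least squares of z on X: solve (X'WX) beta = X'Wz
  new  <- as.vector(solve(crossprod(X, X * w), crossprod(X, w * z)))
  done <- max(abs(new - beta)) < 1e-10
  beta <- new
  if(done) break
}

# Compare with glm.fit run to a tight convergence criterion
ref <- glm.fit(X, y, family = binomial(),
               control = glm.control(epsilon = 1e-12))$coefficients
max(abs(beta - ref))
```

Each pass is a weighted least squares fit, which is algebraically the same as a Newton-Raphson step for this model. That is why glm.fit and NR are natural competitors when Y is binary and there is no penalization.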

Check Convergence Under Complete Separation

Consider a simple example where there is complete separation because the predictor values are identical to the response. In this case the MLE of the intercept is $-\infty$ and the MLE of the slope is $\infty$. The MLEs are approximated by $-50$ and $100$, yielding predicted logits of $-50$ and $50$, because the expits of these logits are sufficiently close to 0.0 and 1.0.

plogis(-50, log=TRUE)

[1] -50

plogis( 50, log=TRUE)

[1] -1.92875e-22

exp(plogis(-50, log=TRUE))

[1] 1.92875e-22

exp(plogis( 50, log=TRUE))

[1] 1
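
To see why no finite estimates exist here, the following sketch evaluates -2 log likelihood along a path of increasingly extreme parameter values. The function m2ll and the particular path, with intercept $-t$ and slope $2t$ so that the predicted logits are $-t$ and $t$, are choices made for illustration and are not part of lrm.fit.

```r
set.seed(1)
x <- sample(0 : 1, 20, TRUE)
y <- x    # complete separation: the predictor equals the response

# -2 log likelihood for intercept a and slope b
# log P(Y=1) = log expit(a + bx) and log P(Y=0) = log expit(-(a + bx))
m2ll <- function(a, b) {
  eta <- a + b * x
  -2 * sum(ifelse(y == 1, plogis(eta, log.p = TRUE), plogis(-eta, log.p = TRUE)))
}

# Move along the path (a, b) = (-t, 2t)
tt  <- c(1, 5, 10, 25, 50)
dev <- sapply(tt, function(t) m2ll(-t, 2 * t))
data.frame(t = tt, deviance = dev)
```

The deviance keeps decreasing toward zero as $t$ grows, so an optimizer with overly strict convergence criteria keeps taking steps long after the fitted probabilities have stopped changing in any meaningful way.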

See how 5 optimization algorithms fare with their default parameters and with some adjustments. In some of the trace output, the first floating point number listed is -2LL, and following that are the current estimates.

set.seed(1)
x <- sample(0 : 1, 20, TRUE)
y <- x
w <- try(lrm.fit(x, y, opt_method='NR', trace=1))  # default opt_method

Iteration:1  -2LL:26.92047  Max |gradient|:9.6  Max |change in parameters|:4.166667
Iteration:2  -2LL:4.704224  Max |gradient|:1.754066  Max |change in parameters|:2.249045
Iteration:3  -2LL:1.588994  Max |gradient|:0.616103  Max |change in parameters|:2.08089
Iteration:4  -2LL:0.5685801  Max |gradient|:0.2233137  Max |change in parameters|:2.028578
Iteration:5  -2LL:0.2071309  Max |gradient|:0.0817248  Max |change in parameters|:2.010364
Iteration:6  -2LL:0.07592908  Max |gradient|:0.03000809  Max |change in parameters|:2.003793
Iteration:7  -2LL:0.02789647  Max |gradient|:0.01103173  Max |change in parameters|:2.001393
Iteration:8  -2LL:0.01025764  Max |gradient|:0.004057316  Max |change in parameters|:2.000512
Iteration:9  -2LL:0.003772913  Max |gradient|:0.001492464  Max |change in parameters|:2.000188
Iteration:10  -2LL:0.001387888  Max |gradient|:0.000549028  Max |change in parameters|:2.000069
Iteration:11  -2LL:0.0005105632  Max |gradient|:0.0002019736  Max |change in parameters|:2.000025
Iteration:12  -2LL:0.0001878241  Max |gradient|:7.430157e-05  Max |change in parameters|:2.000009
Iteration:13  -2LL:6.90964e-05  Max |gradient|:2.733397e-05  Max |change in parameters|:2.000003
Iteration:14  -2LL:2.541911e-05  Max |gradient|:1.00556e-05  Max |change in parameters|:2.000001

w <- try(lrm.fit(x, y, opt_method='LM', trace=1))

Iteration:1  -2LL:4.726435  Max |gradient|:9.6  Max |change in parameters|:4.15697
Iteration:2  -2LL:1.596267  Max |gradient|:1.760569  Max |change in parameters|:2.249698
Iteration:3  -2LL:0.5711236  Max |gradient|:0.6183991  Max |change in parameters|:2.081206
Iteration:4  -2LL:0.2080488  Max |gradient|:0.2241376  Max |change in parameters|:2.028698
Iteration:5  -2LL:0.07626438  Max |gradient|:0.08202485  Max |change in parameters|:2.010408
Iteration:6  -2LL:0.0280195  Max |gradient|:0.03011806  Max |change in parameters|:2.003809
Iteration:7  -2LL:0.01030286  Max |gradient|:0.01107213  Max |change in parameters|:2.001399
Iteration:8  -2LL:0.003789542  Max |gradient|:0.004072171  Max |change in parameters|:2.000514
Iteration:9  -2LL:0.001394004  Max |gradient|:0.001497928  Max |change in parameters|:2.000189
Iteration:10  -2LL:0.0005128133  Max |gradient|:0.0005510379  Max |change in parameters|:2.00007
Iteration:11  -2LL:0.0001886518  Max |gradient|:0.0002027129  Max |change in parameters|:2.000026
Iteration:12  -2LL:6.94009e-05  Max |gradient|:7.457357e-05  Max |change in parameters|:2.000009
Iteration:13  -2LL:2.553113e-05  Max |gradient|:2.743404e-05  Max |change in parameters|:2.000003
Iteration:14  -2LL:9.392375e-06  Max |gradient|:1.009241e-05  Max |change in parameters|:2.000001

w <- try(lrm.fit(x, y, opt_method='nlminb',  trace=1))

  0:     26.920467: -0.405465  0.00000
  1:     5.9001181: -1.83776  3.67925
  2:     1.9469377: -2.99693  5.99700
  3:    0.69222129: -4.04687  8.09673
  4:    0.25163200: -5.06435  10.1316
  5:   0.092171572: -6.07067  12.1442
  6:   0.033854569: -7.07298  14.1489
  7:   0.012447190: -8.07383  16.1506
  8:  0.0045780905: -9.07414  18.1512
  9:  0.0016840535: -10.0743  20.1514
 10: 0.00061951084: -11.0743  22.1515
 11: 0.00022790289: -12.0743  24.1515
 12: 8.3840460e-05: -13.0743  26.1515
 13: 3.0843137e-05: -14.0743  28.1515
 14: 1.1346550e-05: -15.0743  30.1515
 15: 4.1741617e-06: -16.0743  32.1515
 16: 1.5355882e-06: -17.0743  34.1515
 17: 5.6491130e-07: -18.0743  36.1515
 18: 2.0781925e-07: -19.0743  38.1515
 19: 7.6452432e-08: -20.0743  40.1515
 20: 2.8125276e-08: -21.0743  42.1515
 21: 1.0346714e-08: -22.0743  44.1515
 22: 3.8063437e-09: -23.0743  46.1515
 23: 1.4002772e-09: -24.0743  48.1515
 24: 5.1513283e-10: -25.0743  50.1515
 25: 1.8950352e-10: -26.0743  52.1515
 26: 6.9714901e-11: -27.0743  54.1515
 27: 2.5648816e-11: -28.0743  56.1515
 28: 9.4360075e-12: -29.0743  58.1515
 29: 3.4692249e-12: -30.0743  60.1515
 30: 1.2789769e-12: -31.0743  62.1515
 31: 4.7073456e-13: -32.0743  64.1515
 32: 1.6875390e-13: -33.0743  66.1515
 33: 6.2172489e-14: -34.0743  68.1515
 34: 2.6645353e-14: -35.0743  70.1515
 35: 8.8817842e-15: -36.0743  72.1515
 36:     0.0000000: -37.0743  74.1515
 37:     0.0000000: -37.0743  74.1515

w <- try(lrm.fit(x, y, opt_method='glm.fit', trace=1))

Deviance = 3.368709 Iterations - 1
Deviance = 1.16701 Iterations - 2
Deviance = 0.4207147 Iterations - 3
Deviance = 0.1536571 Iterations - 4
Deviance = 0.0563787 Iterations - 5
Deviance = 0.02072057 Iterations - 6
Deviance = 0.007619969 Iterations - 7
Deviance = 0.002802865 Iterations - 8
Deviance = 0.001031067 Iterations - 9
Deviance = 0.0003793016 Iterations - 10
Deviance = 0.0001395364 Iterations - 11
Deviance = 5.133244e-05 Iterations - 12
Deviance = 1.888413e-05 Iterations - 13
Deviance = 6.947082e-06 Iterations - 14
Deviance = 2.555688e-06 Iterations - 15
Deviance = 9.401851e-07 Iterations - 16
Deviance = 3.458748e-07 Iterations - 17
Deviance = 1.272402e-07 Iterations - 18
Deviance = 4.680906e-08 Iterations - 19
Deviance = 1.722009e-08 Iterations - 20
Deviance = 6.33492e-09 Iterations - 21
Deviance = 2.330491e-09 Iterations - 22
Deviance = 8.573409e-10 Iterations - 23
Deviance = 3.15401e-10 Iterations - 24
Deviance = 1.160316e-10 Iterations - 25
Deviance = 4.268585e-11 Iterations - 26
Deviance = 1.570299e-11 Iterations - 27
Deviance = 5.782042e-12 Iterations - 28

w <- try(lrm.fit(x, y, opt_method='BFGS', trace=1))

initial  value 26.920467 
iter  10 value 0.000153
iter  20 value 0.000017
iter  30 value 0.000004
iter  40 value 0.000001
iter  50 value 0.000000
final  value 0.000000 
stopped after 50 iterations

nlminb took 37 iterations, going so far that the hessian matrix became singular; it should have stopped after 10.
glm.fit took 28 iterations; it should have stopped after 10.
BFGS stopped after 50 iterations; it should have stopped after 10.

Now specify arguments to lrm.fit that are tuned to this task.

w <- lrm.fit(x, y, opt_method='nlminb',  abstol=1e-3, trace=1)

  0:     26.920467: -0.405465  0.00000
  1:     5.9001181: -1.83776  3.67925
  2:     1.9469377: -2.99693  5.99700
  3:    0.69222129: -4.04687  8.09673
  4:    0.25163200: -5.06435  10.1316
  5:   0.092171572: -6.07067  12.1442
  6:   0.033854569: -7.07298  14.1489
  7:   0.012447190: -8.07383  16.1506
  8:  0.0045780905: -9.07414  18.1512
  9:  0.0016840535: -10.0743  20.1514
 10: 0.00061951084: -11.0743  22.1515

w <- lrm.fit(x, y, opt_method='glm.fit', reltol=1e-3, trace=1)

Deviance = 3.368709 Iterations - 1
Deviance = 1.16701 Iterations - 2
Deviance = 0.4207147 Iterations - 3
Deviance = 0.1536571 Iterations - 4
Deviance = 0.0563787 Iterations - 5
Deviance = 0.02072057 Iterations - 6
Deviance = 0.007619969 Iterations - 7
Deviance = 0.002802865 Iterations - 8
Deviance = 0.001031067 Iterations - 9
Deviance = 0.0003793016 Iterations - 10
Deviance = 0.0001395364 Iterations - 11
Deviance = 5.133244e-05 Iterations - 12

w <- lrm.fit(x, y, opt_method='BFGS',    reltol=1e-4, trace=1)

initial  value 26.920467 
iter  10 value 0.000153
final  value 0.000100 
converged

These tolerance values are too large to use when infinite coefficients are not a problem.
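To see why an absolute tolerance on the deviance gives a sensible stopping point under complete separation, here is a minimal base R sketch. It does not call lrm.fit: the perfectly separated data, the hand-written Newton-Raphson loop, and the names X, beta and abstol are all assumptions made for illustration.

```r
# Hypothetical perfectly separated binary data (not the article's x and y)
x <- c(rep(0, 12), rep(1, 8))
y <- x
X <- cbind(1, x)

beta   <- c(qlogis(mean(y)), 0)   # start at the intercept-only solution
abstol <- 1e-3                    # stop once the deviance falls below this

for(iter in 1 : 100) {
  p   <- plogis(drop(X %*% beta))
  dev <- -2 * sum(y * log(p) + (1 - y) * log(1 - p))
  if(dev < abstol) break          # absolute deviance criterion
  W    <- p * (1 - p)             # logistic weights
  beta <- beta + drop(solve(crossprod(X, W * X), crossprod(X, y - p)))
}
c(iterations=iter, deviance=dev)
beta
```

Without the absolute criterion the deviance keeps shrinking by a roughly constant factor per iteration while the coefficients drift toward infinity, which is the behavior seen in the traces above.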

Check Algorithms With k=1000

For timings that follow, compstats=FALSE is specified to lrm.fit so that we can focus on the computational efficiency of the various optimization algorithms. Some of the differences in run times may not seem to be consequential, but once extremely large datasets are analyzed or one needs to fit models in a bootstrap or Monte Carlo simulation loop, the differences in speed will matter.

When the number of distinct Y values is large, and this far exceeds the number of predictors (k >> p), the rms lrm and orm functions are highly efficient. They take into account the sparsity of the intercept portion of the hessian matrix, which is tri-band diagonal and has only about 2k distinct nonzero values for k + 1 distinct Y values, taking the matrix’s symmetry into account. Outside of lrm, orm and SAS JMP, ordinal regression fitting software treats the hessian matrix as being a dense (k + p) x (k + p) matrix and does not capitalize on sparsity.

Note that for Bayesian MCMC this is not an issue as posterior sampling does not need the hessian.
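The payoff from a tri-band representation can be sketched with the Matrix package alone. The matrix below is a made-up, diagonally dominant stand-in for the intercept block of a hessian, not something produced by lrm or orm, and the object names are hypothetical.

```r
library(Matrix)

set.seed(1)
k <- 2000
# Column 1: main diagonal; column 2: first off-diagonal (last entry unused)
d <- cbind(runif(k) + 2, runif(k))
S <- bandSparse(k, k=c(0, 1), diagonals=d, symmetric=TRUE)  # sparse storage
D <- as.matrix(S)                                           # dense k x k copy

g  <- runif(k)                    # stand-in gradient vector
s1 <- as.vector(solve(S, g))      # sparse solve, no explicit inverse formed
s2 <- solve(D, g)                 # dense LU solve

max(abs(s1 - s2))                 # the two solutions agree
c(sparse=object.size(S), dense=object.size(D))   # storage in bytes
```

The dense copy stores all k^2 elements, while the sparse symmetric version stores only the diagonal and one off-diagonal.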

set.seed(1)
n <- 10000; k <- 1000
x <- rnorm(n); y <- sample(0:k, n, TRUE)
length(unique(y))

[1] 1001

tim(orm    = orm.fit(x, y, eps=1e-10,           compstats=FALSE),
    ormlm  = lrm.fit(x, y, opt_method='LM',     compstats=FALSE),
    bfgs   = lrm.fit(x, y, opt_method='BFGS',   compstats=FALSE, maxit=100),
    nlminb = lrm.fit(x, y, opt_method='nlminb', compstats=FALSE),
    nr     = lrm.fit(x, y, opt_method='NR',     compstats=FALSE),
    nlm    = lrm.fit(x, y, opt_method='nlm',    compstats=FALSE),
    polr   = MASS::polr(factor(y) ~ x, control=list(reltol=1e-10)),
    reps= 2)

Per-run execution time in seconds, averaged over 2 runs 
   orm  ormlm   bfgs nlminb     nr    nlm   polr 
0.0155 0.0145 0.0430 0.3320 0.0110 6.6985 6.6995

smod()

       deviance    max_abs_u iter
orm    137142.5 1.118337e-09    4
ormlm  137142.5 5.287807e-04    3
bfgs   137142.5 2.376188e+00    4
nlminb 137142.5 1.431033e-09    3
nr     137142.5 7.625981e-05    2
nlm    137142.5 7.625981e-05    1
polr   137142.5           NA   NA

Maximum |difference in coefficients|, Maximum |relative difference|
 worst ratio of covariance matrices

         Comparison Max |difference| Max |rel diff| Cov ratio
1     orm vs. ormlm     4.528191e-06   2.116413e-05  1.020062
2      orm vs. bfgs     6.991343e-06   3.950698e-04          
3    orm vs. nlminb     1.219205e-13   1.169461e-13  1.000000
4        orm vs. nr     1.219707e-13   1.168651e-13  1.000000
5       orm vs. nlm     4.136310e-07   3.812141e-07  1.001368
6      orm vs. polr     9.637588e-06   1.359060e-05          
7    ormlm vs. bfgs     3.000666e-06   3.747458e-04          
8  ormlm vs. nlminb     4.528191e-06   2.116413e-05  1.020062
9      ormlm vs. nr     4.528191e-06   2.116413e-05  1.020062
10    ormlm vs. nlm     4.219032e-06   2.103439e-05  1.021457
11   ormlm vs. polr     8.539945e-06   3.049453e-05          
12  bfgs vs. nlminb     6.991343e-06   3.950698e-04          
13      bfgs vs. nr     6.991343e-06   3.950698e-04          
14     bfgs vs. nlm     6.776204e-06   3.949523e-04          
15    bfgs vs. polr     9.425718e-06   4.030520e-04          
16    nlminb vs. nr     1.629917e-16   3.095490e-16  1.000000
17   nlminb vs. nlm     4.136309e-07   3.812140e-07  1.001368
18  nlminb vs. polr     9.637588e-06   1.359060e-05          
19       nr vs. nlm     4.136309e-07   3.812140e-07  1.001368
20      nr vs. polr     9.637588e-06   1.359060e-05          
21     nlm vs. polr     9.231301e-06   1.326760e-05

nlminb is slower than NR because running nlminb requires converting the hessian from a sparse Matrix into a regular dense matrix.

Check Timing and Agreement for n=100000, k=10, p=5

Check timing and calculation agreement for n=100000, 10 intercepts, 5 predictors.

set.seed(1)
n <- 100000
y <- sample(0 : 10, n, TRUE)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)
X <- cbind(x1, x2, x3, x4, x5)

tim(old.lrm.fit = olrm(X, y, eps=1e-7),
    nr          = lrm.fit(X, y, opt_method='NR', compstats=FALSE),
    lm          = orm.fit(X, y, opt_method='LM', compstats=FALSE),
    nlm         = lrm.fit(X, y, opt_method='nlm', compstats=FALSE),
    nlminb      = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, transx=TRUE),
    nlminb.notransx = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, transx=FALSE),
    bfgs        = lrm.fit(X, y, opt_method='BFGS', compstats=FALSE),
    bfgs.reltol = lrm.fit(X, y, opt_method='BFGS',
                          compstats=FALSE, reltol=1e-12),
    polr        = MASS::polr(factor(y) ~ X,
                             control=list(reltol=1e-10)),
    reps = 5)

Per-run execution time in seconds, averaged over 5 runs 
    old.lrm.fit              nr              lm             nlm          nlminb 
         0.1468          0.0822          0.0732          0.5538          0.0996 
nlminb.notransx            bfgs     bfgs.reltol            polr 
         0.0832          0.4174          0.6706          2.4946

smod()

                deviance    max_abs_u iter
old.lrm.fit     479562.9 8.754331e-12   NA
nr              479562.9 1.992452e-07    3
lm              479562.9 4.246614e-05    3
nlm             479562.9 5.051226e-02    1
nlminb          479562.9 1.992456e-07    3
nlminb.notransx 479562.9 1.992552e-07    3
bfgs            479562.9 6.987317e-01    7
bfgs.reltol     479562.9 3.683979e-05   19
polr            479562.9           NA   NA

Maximum |difference in coefficients|, Maximum |relative difference|
 worst ratio of covariance matrices

                        Comparison Max |difference| Max |rel diff| Cov ratio
1               old.lrm.fit vs. nr     5.540598e-16   1.373265e-14  1.000000
2               old.lrm.fit vs. lm     1.964277e-09   3.686250e-08  1.000000
3              old.lrm.fit vs. nlm     3.004074e-06   2.122140e-05  1.000202
4           old.lrm.fit vs. nlminb     1.053797e-11   4.463604e-11  1.000000
5  old.lrm.fit vs. nlminb.notransx     1.053798e-11   4.462038e-11  1.000000
6             old.lrm.fit vs. bfgs     9.057305e-06   7.174198e-04          
7      old.lrm.fit vs. bfgs.reltol     5.712939e-08   1.703834e-06          
8             old.lrm.fit vs. polr     5.977469e-06   2.409730e-03          
9                        nr vs. lm     1.964277e-09   3.686251e-08  1.000000
10                      nr vs. nlm     3.004074e-06   2.122140e-05  1.000202
11                   nr vs. nlminb     1.053798e-11   4.462390e-11  1.000000
12          nr vs. nlminb.notransx     1.053798e-11   4.460824e-11  1.000000
13                     nr vs. bfgs     9.057305e-06   7.174198e-04          
14              nr vs. bfgs.reltol     5.712939e-08   1.703834e-06          
15                     nr vs. polr     5.977469e-06   2.409730e-03          
16                      lm vs. nlm     3.002323e-06   2.123986e-05  1.000202
17                   lm vs. nlminb     1.954251e-09   3.688525e-08  1.000000
18          lm vs. nlminb.notransx     1.954251e-09   3.688523e-08  1.000000
19                     lm vs. bfgs     9.056084e-06   7.173837e-04          
20              lm vs. bfgs.reltol     5.903326e-08   1.726187e-06          
21                     lm vs. polr     5.978924e-06   2.409739e-03          
22                  nlm vs. nlminb     3.004063e-06   2.122136e-05  1.000202
23         nlm vs. nlminb.notransx     3.004063e-06   2.122136e-05  1.000202
24                    nlm vs. bfgs     8.867603e-06   7.344671e-04          
25             nlm vs. bfgs.reltol     3.059332e-06   2.243215e-05          
26                    nlm vs. polr     7.995997e-06   2.398944e-03          
27      nlminb vs. nlminb.notransx     2.695688e-17   2.221290e-14  1.000000
28                 nlminb vs. bfgs     9.057301e-06   7.174198e-04          
29          nlminb vs. bfgs.reltol     5.713989e-08   1.703866e-06          
30                 nlminb vs. polr     5.977476e-06   2.409730e-03          
31        nlminb.notransx vs. bfgs     9.057301e-06   7.174198e-04          
32 nlminb.notransx vs. bfgs.reltol     5.713989e-08   1.703866e-06          
33        nlminb.notransx vs. polr     5.977476e-06   2.409730e-03          
34            bfgs vs. bfgs.reltol     9.069540e-06   7.167658e-04          
35                   bfgs vs. polr     1.347117e-05   2.800575e-03          
36            bfgs.reltol vs. polr     5.951212e-06   2.410693e-03

Other Speed Tests

n=1,000,000, p=100 predictors, k=50 intercepts

lrm.fit: 13.5s (13.6 with opt_method=nlminb, 120s with opt_method='BFGS')
orm.fit: 9s
MASS::polr: 70s without the hessian

n=300,000, p=20, k=299,999

lrm.fit: 2.25s
orm.fit: 2s

Execution time is proportional to

Check Impact of `initglm` and `transx`

Generate a sample of 500 with 30 predictors and 269 levels of Y where a subset of the predictors are strongly related to Y and there are collinearities.

set.seed(1)
n <- 500
p <- 30
x <- matrix(runif(n * p), nrow=n)
x[, 1:8] <- x[, 1:8] + 2 * x[, 9]
s <- varclus(~ x)$hclust
plot(as.dendrogram(s), horiz=TRUE, axes=FALSE,
     xlab=expression(paste('Spearman ', rho^2)))
rh <- seq(0, 1, by=0.1)
axis(1, at=1 - rh, labels=format(rh))

y <- x[, 1] + 2 * x[, 2] + 3 * x[, 3] + 4 * x[, 4]+ 5 * x[, 5] +
     3 * runif(n, -1, 1)
y <- round(y, 1)
length(unique(y))

[1] 269

f <- function(..., opt_method='NR')
  lrm.fit(x, y, compstats=FALSE, opt_method=opt_method, maxit=1000, ...)
# f(initglm=TRUE) (using nlminb) would not work: NA/NaN gradient evaluation
# This did not happen without collinearities
tim(default          = f(),
    transx           = f(transx=TRUE),
    nlm              = f(opt_method='nlm'),
    bfgs             = f(opt_method='BFGS'),
    nlminb           = f(opt_method='NR'),
    reps = 10 )

Per-run execution time in seconds, averaged over 10 runs 
default  transx     nlm    bfgs  nlminb 
 0.0131  0.0175  0.6193  0.3089  0.0141

smod()

        deviance    max_abs_u iter
default 3879.848 6.112532e-05    7
transx  3879.848 6.112532e-05    7
nlm     3879.848 6.112532e-05    6
bfgs    3879.848 9.721526e-03  587
nlminb  3879.848 6.112532e-05    7

Maximum |difference in coefficients|, Maximum |relative difference|
 worst ratio of covariance matrices

           Comparison Max |difference| Max |rel diff| Cov ratio
1  default vs. transx     7.418667e-15   8.758898e-16  1.000000
2     default vs. nlm     1.063035e-05   5.990158e-07  1.000291
3    default vs. bfgs     1.918945e-05   3.544624e-06          
4  default vs. nlminb     0.000000e+00   0.000000e+00  1.000000
5      transx vs. nlm     1.063035e-05   5.990158e-07  1.000291
6     transx vs. bfgs     1.918945e-05   3.544624e-06          
7   transx vs. nlminb     7.418667e-15   8.758898e-16  1.000000
8        nlm vs. bfgs     2.249980e-05   3.725436e-06          
9      nlm vs. nlminb     1.063035e-05   5.990158e-07  1.000291
10    bfgs vs. nlminb     1.918945e-05   3.544624e-06

`lrm.fit` vs. `orm.fit` as k Increases

The fitting function for rms::orm, orm.fit, uses sparse hessian matrices so that the computation time is roughly proportional to where k is the number of intercepts and p is the number of predictors. Computation of the hessian in lrm.fit needs about computations, but some parts of the computation are faster and there is some overhead of handling sparse matrices in orm.fit. Let’s explore execution time as a function of k when n=10000, p=30 and k varies. There should be very little difference.

if(! file.exists('breakeven.rds')) {
  set.seed(1)
  n  <- 10000
  p  <- 30
  ks <- seq(100, 10000, by=200)
  l  <- length(ks)
  t1 <- t2 <- d <- numeric(l)
  x  <- matrix(rnorm(n * p), nrow=n)
  for(i in 1 : l) {
    cat(ks[i], ' ')
    y     <- rep(0 : ks[i], length=n)
    t1[i] <- stim(for(j in 1:20) f <- rms::lrm.fit(x, y)) / 20
    t2[i] <- stim(for(j in 1:20) g <- rms::orm.fit(x, y)) / 20
    d [i] <- m(coef(f) - coef(g))
  }
  w <- llist(ks, t1, t2, d)
  saveRDS(w, 'breakeven.rds')
} else {
  w <- readRDS('breakeven.rds')
  ks <- w$ks; t1 <- w$t1; t2 <- w$t2; d <- w$d
}
# Make sure coefficients agree
range(d)

[1] 8.881784e-16 1.532996e-12

plot(ks, t1, type='b', xlab='k', ylab='Time, seconds', ylim=c(0.02, 0.18))
points(ks, t2, col='red')
lines (ks, t2, col='red')

Execution time for both functions is linear in k. orm is consistently a little faster than lrm. Since the code in the orm.fit Fortran ormll subroutine is more general, currently implementing 5 link functions, there is no real reason to maintain separate code. In the future I plan to merge the functions to minimize duplication, and to keep an lrm front-end for orm for backward compatibility.

Better Understanding Convergence with BFGS Optimizer

Using the same simulated predictors just used, now with k=20, use BFGS to fit an ordinal model with relative tolerance varying from 1e-2 to 1e-20. Estimates are compared to orm. In addition to comparing parameter estimates as done above, we also compute differences in units of standard errors as computed by orm.

set.seed(1)
y  <- sample(0:20, n, TRUE)
g  <- orm(y ~ x, eps=1e-10)
se <- sqrt(diag(vcov(g, intercepts='all')))
length(se)

[1] 50

if(file.exists('bfgs-reltol.rds')) d <- readRDS('bfgs-reltol.rds') else {
  d <- NULL
  for(i in 2 : 20) {
    cat(i, '')
    s <- stim({
      for(j in 1:5)
        f <- lrm.fit(x, y, compstats=FALSE,
                     opt_method='BFGS',
                     maxit=1000,
                     reltol=10^(-i))
    } )
  
  w <- data.frame(i, elapsed=s / 5, maxu=m(f$u),
                  maxbeta=m(coef(g) - coef(f)),
                  maxbeta.per.se=m((coef(g) - coef(f)) / se), 
                  deviance=tail(f$deviance, 1),
                  iter=tail(f$iter, 1))
  d <- rbind(d, w)
}
rownames(d) <- NULL
saveRDS(d, 'bfgs-reltol.rds')
}
d

    i elapsed         maxu      maxbeta maxbeta.per.se deviance iter
1   2  0.0150 1.474752e+02 2.559219e-02   1.485374e+00 60853.10    1
2   3  0.0156 1.474752e+02 2.559219e-02   1.485374e+00 60853.10    1
3   4  0.0520 4.754120e+01 8.310188e-03   4.823241e-01 60843.74    3
4   5  0.0676 2.731752e+01 6.488538e-03   3.765954e-01 60843.47    4
5   6  0.1164 2.570406e+01 4.628291e-03   1.152693e-01 60842.69   10
6   7  0.1634 3.232073e+00 4.015687e-03   9.284606e-02 60842.60   16
7   8  0.4220 1.849727e+00 4.627223e-04   1.288221e-02 60842.57   55
8   9  0.4202 4.694068e-01 8.486176e-04   2.144243e-02 60842.57   53
9  10  0.5654 1.027168e-01 1.576543e-04   3.340893e-03 60842.57   75
10 11  0.3670 3.118309e-02 5.997566e-05   1.292625e-03 60842.57   52
11 12  0.5104 1.435433e-02 6.238079e-06   1.344461e-04 60842.57   71
12 13  0.3986 9.525472e-03 1.005773e-05   2.167691e-04 60842.57   56
13 14  0.3882 1.054183e-03 1.688144e-06   3.638369e-05 60842.57   57
14 15  0.5556 1.669547e-03 1.222338e-06   2.832486e-05 60842.57   70
15 16  0.4710 4.720818e-04 1.373869e-07   7.956884e-06 60842.57   60
16 17  0.3992 4.720818e-04 1.373869e-07   7.956884e-06 60842.57   60
17 18  0.3950 4.720818e-04 1.373869e-07   7.956884e-06 60842.57   60
18 19  0.4126 4.720818e-04 1.373869e-07   7.956884e-06 60842.57   60
19 20  0.4522 4.720818e-04 1.373869e-07   7.956884e-06 60842.57   60

z <- c(deviance=8, beta=10, beta.se=6, grad=15)  # minimum i for which success achieved
h <- function() abline(v=z, col=gray(0.60))
h <- function() {}    # remove this line to show reference lines
par(mfrow=c(2, 3), mar=c(4, 4, 1, 1), las=1, mgp=c(2.8, .45, 0))
with(d, {
  plot(i, elapsed, type='l'); h()
  plot(i, maxu, type='l', log='y'); h()
  plot(i, maxbeta, type='l', log='y'); h()
  plot(i, maxbeta.per.se, type='l'); h()
  plot(i, deviance - 3879, type='l', log='y'); h()
  plot(i, iter, type='l', log='y'); h()
})

NULL

See that stochastic convergence, as judged by deviance, occurs by the time the relative tolerance is 1e-8, by 1e-10 to control maximum absolute parameter difference, by 1e-6 to get parameter estimates to within 0.06 standard error, and by 1e-15 to achieve small gradients in absolute value.

When using BFGS a recommendation is to use a relative tolerance of to nail down estimates to the extent that it matters precision-wise, and use to achieve reproducibility.

Recall that BFGS is only appealing when the number of intercepts is large, you don’t need the covariance matrix, and you are not using orm.
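The same sensitivity to reltol can be reproduced with base R's optim on a small problem. This is only a sketch on simulated binary logistic data, not the ordinal model fitted above; nll, grad, fit and the chosen tolerances are hypothetical names and values.

```r
set.seed(2)
n <- 500
X <- cbind(1, rnorm(n), rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -1))))

# Negative log-likelihood and its gradient for binary logistic regression
nll  <- function(b) { eta <- drop(X %*% b); sum(log1p(exp(eta)) - y * eta) }
grad <- function(b) drop(crossprod(X, plogis(drop(X %*% b)) - y))

fit <- function(reltol)
  optim(rep(0, ncol(X)), nll, grad, method='BFGS',
        control=list(reltol=reltol, maxit=1000))

tols <- 10 ^ (- c(2, 6, 12))
fits <- lapply(tols, fit)
ref  <- coef(glm(y ~ X - 1, family=binomial))   # IRLS reference solution

res <- data.frame(
  reltol        = tols,
  max_abs_u     = sapply(fits, function(o) max(abs(grad(o$par)))),
  max_beta_diff = sapply(fits, function(o) max(abs(o$par - ref))),
  deviance      = sapply(fits, function(o) 2 * o$value))
res
```

As in the table above, the deviance settles long before the gradient and the parameter estimates do, so judging convergence by the objective alone can be misleading.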

Matrix Inversion

The information matrix (negative hessian) must be inverted to compute the variance-covariance matrix. The default inversion method in R is the solve function, which defaults to using the LU decomposition. This is a fast algorithm, but the Cholesky decomposition is faster and behaves as well as LU numerically. Another approach uses the QR decomposition, implemented in the qr.solve function. Let’s compare speed and accuracy of the three approaches when applied to a high-dimensional almost-singular matrix.

# ChatGPT created this function to generate an almost-singular symmetric
# positive definite matrix of dimension p x p
genas <- function(p, epsilon = 1e-8) {
  # Generate a random symmetric positive definite matrix
  A <- matrix(rnorm(p^2), p, p)
  A <- A + t(A) + p * diag(p)  # Symmetric and positive definite
  
  # Modify eigenvalues to make the matrix almost singular
  eigen_decomp <- eigen(A)
  eigen_decomp$values[p] <- epsilon  # Set the smallest eigenvalue close to zero
  eigen_decomp$vectors %*% diag(eigen_decomp$values) %*% t(eigen_decomp$vectors)
}
set.seed(3)
x <- genas(750, 1e-9)
x[1:3, 1:3]

           [,1]        [,2]        [,3]
[1,] 748.051108  -1.0242634   0.3929600
[2,]  -1.024263 747.2599805   0.9261079
[3,]   0.392960   0.9261079 745.6809913

# LU method
stime(for(i in 1:3) a <- solve(x))

Elapsed time: 0.656s

mad(x, solve(a))       # reverse the inversion and compare to x

         mad       relmad 
0.0001921486 0.0006909615

m(diag(750) - x %*% a)  # compare x inverse * x to identity matrix

[1] 3.083591e-05

# QR
stime(for(i in 1:3) b <- qr.solve(x, tol=1e-13))

Elapsed time: 1.576s

mad(x, qr.solve(b, tol=1e-13))

         mad       relmad 
0.0002050283 0.0007250491

m(diag(750) - x %*% b)

[1] 3.917762e-05

# Cholesky
stime(for(i in 1:3) ch <- chol2inv(chol(x)))

Elapsed time: 0.334s

mad(x, chol2inv(chol(ch)))  # mean |difference| and mean relative difference

         mad       relmad 
0.0002323721 0.0008362380

m(diag(750) - x %*% ch)   # mean |difference|

[1] 1.71175e-05

QR takes significantly longer and offers no accuracy advantage. Inversion via Cholesky decomposition was almost twice as fast as LU, though both methods took less than a quarter of a second to invert a 750 x 750 matrix. Cholesky was a little more accurate in getting the product of the original matrix and its inverse closer to an identity matrix. Cholesky was very slightly worse in recovering the original matrix by inverting its inverse.

One of the reasons lrm.fit and orm.fit are more efficient in rms 7.0 is that the entire information matrix is not inverted upon convergence when creating the final fit object. Much use is made of the Matrix package for efficient storage and computation of sparse matrices, and only the 3 minimal submatrices that make up the information matrix are stored. These are operated on quite generally by the new rms infoMxop function, which is called by the vcov method to invert pieces of the information matrix only as needed. During Newton-type updating, the Matrix solve function is used, which is quite fast as it uses sparse representations and does not actually invert the hessian but solves directly for the inverse of the hessian multiplied by the gradient vector.

Here is an example where only elements 10:20 from the inverse of a 1000 x 1000 matrix are obtained. This type of coding is used in infoMxop.

# Create a 1000 x 1000 symmetric positive definite matrix
set.seed(1)
x <- matrix(rnorm(10000 * 1000), ncol=1000)
v <- crossprod(x)
i <- 10:20    # submatrix of v inverse that we want
l <- length(i)
w <- matrix(0, 1000, l)
w[cbind(i, 1:l)] <- 1    # w is all zeros except for 1 in i elements
sum(w)

[1] 11

stime(vi <- solve(v, w)[i, , drop=FALSE])

Elapsed time: 0.123s

dim(vi)

[1] 11 11

stime(vi_slow <- solve(v))

Elapsed time: 0.505s

range(vi - vi_slow[i, i])

[1] 0 0

What about inverting the kind of sparse matrices that ordinal models deal with? Let’s build one using the Matrix package and rms::infoMxop.

require(Matrix)
p <- 200
k <- 20000
set.seed(1)
w <- list(a = cbind(runif(k), runif(k)),
          b = v[1:p, 1:p],
          ab= matrix(runif(k * p), k, p) )
stime(z <- infoMxop(w))

Elapsed time: 0.142s

object.size(w)

32641088 bytes

object.size(z)

97282280 bytes

# Sparse representation of intercept components used by infoMxop
ia <- Matrix::bandSparse(k, k=c(0,1), diagonals=w$a, symmetric=TRUE)
object.size(ia)

561720 bytes

# Time to invert this sparse matrix
stime(via <- Matrix::solve(ia))

Elapsed time: 2.066s

dim(via)

[1] 20000 20000

# The inverse of a tri-band diagonal matrix is dense but can be represented efficiently
object.size(via)

383980824 bytes

length(via@x)

[1] 31991591

# Compute size needed if did not make use of sparsity
8 * (p + k) ^ 2

[1] 3264320000

# Get covariate portion of inverted matrix
stime(ub <- infoMxop(w, i='x'))

Elapsed time: 0.381s

# Get the first intercept and beta portion of the inverse
i <- c(1, (k + 1) : (k + p))
stime(u  <- infoMxop(w, i=i))    # 26s for i=one element

Elapsed time: 49.375s

dim(u); dim(ub); range(u[-1, -1] - ub)

[1] 201 201

[1] 200 200

[1] -5.247539e-17  5.086263e-17

# Don't try this:  infoMxop(w, invert=TRUE)

The specialized method with i='x' for getting just the portion of the inverse corresponding to the βs is very fast. Otherwise there are speed challenges but the sparse representation does allow the overall inverse to be computed, something not possible with naive matrix calculations.

Now consider execution time for computing the standard errors of predicted values. Consider a predicted value of the form α + Xβ involving a single intercept α. Let V be the variance-covariance matrix for (α, β), where β is a vector of length p. Let X be the r x (p + 1) matrix consisting of a column of ones followed by p columns of predictor settings, where r is the number of observations for which predictions are sought. Standard errors of interest are square roots of the diagonal of the variance-covariance matrix for the predicted values, XVX', which is an r x r matrix.

This also applies to contrasts, where differences in X are substituted for X.

rms::infoMxop makes computations such as XVX' efficient using the following strategy. Let J be a matrix that is mainly zeros but with ones in positions that indicate which elements of the inverse of the information matrix to compute. Let I denote the information matrix for the entire model, with k + p rows and columns. Concentrate on computation of VX'. Instead of computing V by itself, compute only the needed elements of it by computing solve(I, J) to get the needed columns of the inverse of I. But we immediately want to post-multiply by X', so use solve(I, JX'). Let’s see how fast this is when there are 299,999 intercepts.

set.seed(1)
n <- 300000; p <- 10; x <- matrix(runif(n * p), ncol=p); y <- 1 : n
k <- length(unique(y)) - 1
f <- orm(y ~ x)
# Get intercept number corresponding to median of y
j <- f$interceptRef
# Compute all parameter numbers needed
h    <- c(j, (k + 1) : (k + p))
info <- f$info.matrix
mdim(info)   # show dimensions of submatrices

             a  b     ab
rows    299999 10 299999
columns      2 10     10

# Form covariate values for 2 observations for predicting
X <- rbind(c(1, rep(.2, p)), c(1, rep(.6, p)))
X

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    1  0.2  0.2  0.2  0.2  0.2  0.2  0.2  0.2   0.2   0.2
[2,]    1  0.6  0.6  0.6  0.6  0.6  0.6  0.6  0.6   0.6   0.6

# Get VX' the slow way
# system.time(infoMxop(info, invert=TRUE)[h, h] %*% t(X))   5.4s for k=9999
et <- stime(a <- infoMxop(info, i=h, B=t(X)))

Elapsed time: 0.325s

                   [,1]          [,2]
y>=150000  1.933352e-04 -4.675175e-05
x[1]      -3.589965e-05  1.195570e-05
x[2]      -3.618740e-05  1.220834e-05
x[3]      -3.635284e-05  1.205578e-05
x[4]      -3.595975e-05  1.195893e-05
x[5]      -3.587224e-05  1.212975e-05
x[6]      -3.568397e-05  1.180846e-05
x[7]      -3.607189e-05  1.206576e-05
x[8]      -3.581018e-05  1.194599e-05
x[9]      -3.613558e-05  1.194314e-05
x[10]     -3.610594e-05  1.212468e-05

stime(infoMxop(info, i=h))     # time required to compute needed submatrix of V

Elapsed time: 0.429s

stime(infoMxop(info, i=100))   # time required to retrieve a single intercept

Elapsed time: 0.32s

stime(infoMxop(info, i='x'))   # time to get beta part of v

Elapsed time: 0.128s

0.32 seconds to get the variance of predicted values when there are 299999 intercepts is quite sufficient! The time required to compute the portion of the inverse needed is only 0.1s longer however.

Note that in the source code for lrm and orm you’ll see a shortcut for computing the diagonal elements:

nx <- ncol(X)
X  <- cbind(1, X)
v  <- infoMxop(info, i=c(f$interceptRef, (nrp + 1) : (nrp + nx)), B=t(X))
se <- drop(sqrt((t(v) * X) %*% rep(1, nx + 1)))
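The two ideas above, solving against JX' rather than inverting and then taking row sums instead of forming the whole of XVX', can be checked on a small dense example in base R. This sketch does not use infoMxop; the stand-in information matrix and the names np, h, J and VXt are assumptions for illustration.

```r
set.seed(1)
np   <- 60                                  # hypothetical total number of parameters
Z    <- matrix(rnorm(500 * np), ncol=np)
info <- crossprod(Z)                        # stand-in positive definite information matrix
h    <- c(5, 51 : 60)                       # one "intercept" plus 10 "betas"
X    <- rbind(c(1, rep(.2, 10)), c(1, rep(.6, 10)))   # 2 prediction settings

# J has a single 1 per column, picking out the parameters in h
J <- matrix(0, np, length(h))
J[cbind(h, seq_along(h))] <- 1

# Fast way: one linear solve gives V[h, h] %*% t(X) without forming V
VXt     <- solve(info, J %*% t(X))[h, , drop=FALSE]
se_fast <- sqrt(rowSums(t(VXt) * X))        # diagonal of X V[h, h] X' only

# Slow way: invert everything, then form the full r x r matrix
V       <- solve(info)
se_slow <- sqrt(diag(X %*% V[h, h] %*% t(X)))

rbind(se_fast, se_slow)
```

The row-sum form computes only the r diagonal elements, which matters when r is large and the off-diagonal covariances of the predictions are not needed.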

What is Fast and What is Slow When k is Large

For continuous Y when there is a large number of intercepts, here is a breakdown of what kind of computations involving ordinal regression models are fast:

solving for the MLEs
computing the covariance matrix for the βs alone, and computing it for β and a small number of intercepts
any assessment that is relative (e.g., odds ratios as opposed to absolute risk estimates)
- contrasts
- Wald tests
- likelihood ratio tests (which don’t require covariance matrices)
- standard errors (SEs) and confidence bands for differences on the link (e.g., logit) scale
- predicted absolute quantities (exceedance probabilities, cell probabilities, means, quantiles) without SEs or confidence intervals

Slow operations for very large k:

computing the covariance matrix for all the intercepts or for all parameters combined

For some computations it will be faster to bootstrap the model fit rather than to compute SEs and CLs. The rms bootcov function does this efficiently since it uses lrm.fit or orm.fit with compstats=FALSE to streamline the computation of MLEs from bootstrap samples.

bootcov needs all the intercepts to be represented in all bootstrap samples. To minimally group Y values to make this happen, see the new rms ordGroupBoot function.

But for some “absolute” computations, run time is still exceptionally fast for large k, because the entire information matrix does not need inverting; instead the inverse is multiplied by a vector as was done above, so that solve can be used to quickly solve a system of equations instead of fully inverting. As an example let’s time the estimation of mean Y without confidence limits, and then with them, for n = 15000 (k = 14999). The Mean function uses the delta method to estimate the needed standard error for the normal approximation for the confidence interval for a covariate-specific population mean.

set.seed(1)
n <- 15000
y <- 1:n
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
dd <- datadist(x1, x2, x3); options(datadist='dd')
f <- orm(y ~ x1 + x2 + x3, x=TRUE)
d <- data.frame(x1=0, x2=0, x3=0)
X <- predict(f, d, type='x')  # need original design matrix for accurate CLs of means
X

  x1 x2 x3
1  0  0  0

M <- Mean(f)
stime(print(M(predict(f, d))))

       1 
7500.703 
Elapsed time: 0.023s

stime(print(M(predict(f, d), conf.int=0.95, X=X)))

       1 
7500.703 
attr(,"limits")
attr(,"limits")$lower
     1 
7431.4 

attr(,"limits")$upper
       1 
7570.006 

Elapsed time: 0.044s

stime(print(Predict(f, x1=0, x2=0, x3=0, fun=M)))

  x1 x2 x3     yhat  lower    upper
1  0  0  0 7500.703 7431.4 7570.006

Response variable (y):  

Limits are 0.95 confidence limits
Elapsed time: 0.05s

This is fast because the rate-limiting step inside the R function M looks like this:

info <- f$info.matrix
mdim(info)   # show dimensions of 3 submatrices

            a b    ab
rows    14999 3 14999
columns     2 3     3

np   <- sum(dim(info$ab))
np  # total no. parameters = # rows and cols of info matrix

[1] 15002

# Multiply info inverse times B
stime(infoMxop(info, B=matrix(runif(np), ncol=1)))

Elapsed time: 0.008s

Other Resources

Video by Richard McElreath explaining numerical accuracy issues in log likelihood calculations.
R CRAN Task Views on optimizers
Vignettes by John Nash et al.
R ucminf package
R nloptr package
R code: lrm and lrm.fit
Fortran code: lrmll

Computing Environment

grateful::cite_packages(pkgs='Session', output='paragraph', out.dir='.',
    cite.tidyverse=FALSE, omit=c('grateful', 'ggplot2'))

We used R version 4.4.0 (R Core Team 2024) and the following R packages: Hmisc v. 5.2.2 (Harrell Jr 2025a), Matrix v. 1.7.1 (Bates, Maechler, and Jagan 2024), orms v. 1.0.0 (Harrell Jr 2010), rms v. 7.0.0 (Harrell Jr 2025b), survival v. 3.8.3 (Terry M. Therneau and Patricia M. Grambsch 2000; Therneau 2024).

The code was run on macOS 15.2 on a MacBook Pro M2 Max, running on a single core.

References

Bates, Douglas, Martin Maechler, and Mikael Jagan. 2024. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.

Harrell Jr, Frank E. 2010. orms: Regression Modeling Strategies. http://biostat.mc.vanderbilt.edu/rms.

———. 2025a. Hmisc: Harrell Miscellaneous. https://hbiostat.org/R/Hmisc/.

———. 2025b. rms: Regression Modeling Strategies. https://hbiostat.org/R/rms/.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Therneau, Terry M., and Patricia M. Grambsch. 2000. Modeling Survival Data: Extending the Cox Model. New York: Springer.

Therneau, Terry M. 2024. A Package for Survival Analysis in R. https://CRAN.R-project.org/package=survival.

Reuse

CC BY 4.0

Ordinal State Transition Models as a Unifying Risk Prediction Framework

Frank Harrell — Mon, 18 Nov 2024 06:00:00 GMT

Event:
- International Chinese Statistical Association Applied Statistics Symposium, Nashville, Tennessee USA 2024-06-17
- CANSSI Ontario STatistics Seminars (CAST), Virtual, 2024-11-18
Slides

Adjudication and Statistical Efficiency

Frank Harrell — Thu, 17 Oct 2024 05:00:00 GMT

Background

In clinical and epidemiologic studies one is frequently tasked with maximizing accuracy when assessing the presence of clinical conditions (symptoms, diagnoses, syndromes, etc.) or verifying outcome events such as stroke, myocardial infarction, or death from a specific cause. Prospective studies have the advantage of standardizing definitions of clinical conditions, minimizing bias, and being honest about disagreements about clinical designations. Many studies have clinical endpoint committees or adjudication committees. Statistical efficiency and completeness of reporting are optimized by having as many committee members as feasible, and having the members operate as independently as possible.

Statistical efficiency also comes from minimizing forced choices and utilizing gray zones. For example, if a study has only one adjudicator, and this clinical expert is uncertain about some of the designations, it is best for her to code determinations using at least one level of gray. The way to understand why this is more statistically efficient than having forced choices is to consider a 3-level (negative, uncertain, positive) clinical outcome that is being correlated with a 5-level severity of a symptom. Uncertain outcomes may occur more often for patients having a middle symptom severity. Making use of 3 levels of outcome will capitalize on this to increase power.

Sometimes the clinical condition needs to be used not as a multilevel ordinal outcome but is instead used in subsetting patients. For example, one may want to analyze a subset of the cohort consisting of patients designated as having a certain clinical syndrome at baseline. It is not hard to analyze subsets when the subsetting is uncertain. For example, if one translated an adjudication to the probability the patient has a syndrome, one can easily use multiple imputation to analyze subsets under uncertainties. If a given patient has a probability of 0.6 of having syndrome X, 10 imputations of the binary syndrome can be generated. In the long run, 60% of the imputations will be positive for the syndrome. The needed subset analysis can be done by including, for each of the multiple imputations, all the patients imputed to be positive for X. By repeating this process over, say, 10 multiple imputations, noise in this process will average out and one-time forced choice classification is unnecessary.
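A minimal sketch of the imputation idea in base R, using simulated data. The variable names (`p_syndrome`, `y`), the sample size, and the use of a simple subset mean as "the analysis" are assumptions made for illustration only.

```r
set.seed(1)
n <- 500
# Hypothetical adjudicated probabilities that each patient has syndrome X
p_syndrome <- runif(n)
# Hypothetical continuous outcome, unrelated to the syndrome in this simulation
y <- rnorm(n, mean = 10, sd = 2)

n_imp <- 10
est <- numeric(n_imp)              # subset estimate from each imputation
for (i in seq_len(n_imp)) {
  # Impute binary syndrome status from each patient's probability
  has_x <- rbinom(n, size = 1, prob = p_syndrome) == 1
  # Analyze only the patients imputed to be positive for X
  est[i] <- mean(y[has_x])
}
# Average the subset analysis over the imputations
pooled <- mean(est)

# Long-run check: a patient with probability 0.6 is imputed positive
# about 60% of the time
long_run <- mean(rbinom(1e5, size = 1, prob = 0.6))
```

A full analysis would also combine within- and between-imputation variances (Rubin's rules) rather than only averaging point estimates.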

Even if one does not want to use multiple imputation or Bayesian models to account for adjudication uncertainties, it is important to design the adjudications to lead to an optimum final negative/positive designation.

A Hierarchy of Statistical Information and Power

Besides having independent adjudicators, statistical information is maximized when one delays forced-choice designations as much as possible and respects gray zones to the extent possible. Here is a hierarchy of statistical information/efficiency/power from highest to lowest, for various strategies.

Have each adjudicator record the probability the patient is in the clinical category of interest, then average these probabilities to yield a final result that is used in analyses. When the clinical category is used as an outcome variable, ordinal regression may be used in the final analysis. This can be used to estimate the probability that the outcome is at a certain level $y$ or higher, for any $y$ and for any level of baseline variables.
Classify the patient as negative/positive depending on whether this average probability the condition exists exceeds a pre-specified level.
Have each adjudicator record a forced choice of negative/positive. Code the final result as the proportion (over adjudicators) of positives.
Have each adjudicator record a forced choice of negative/positive. Code the final result as negative/positive depending on a majority rule. One would need to have an odd number of reviewers for this rule.
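The four strategies can be compared side by side on a toy example. This is a sketch only: the probabilities are made up, the 0.5 cutoff is an assumed pre-specified level, and each adjudicator's forced choice is assumed to come from thresholding her own probability at that cutoff.

```r
# Hypothetical probabilities from 3 adjudicators (columns) for 5 patients (rows)
p <- matrix(c(0.90, 0.80, 0.70,
              0.60, 0.40, 0.70,
              0.10, 0.20, 0.10,
              0.50, 0.60, 0.30,
              0.45, 0.45, 0.90),
            ncol = 3, byrow = TRUE)
cutoff <- 0.5                      # assumed pre-specified threshold

avg_prob  <- rowMeans(p)           # strategy 1: average probability
avg_class <- avg_prob > cutoff     # strategy 2: dichotomize the average
forced    <- p > cutoff            # each adjudicator's forced choice
prop_pos  <- rowMeans(forced)      # strategy 3: proportion of positives
majority  <- prop_pos > 0.5        # strategy 4: majority rule

data.frame(avg_prob, avg_class, prop_pos, majority)
```

For the last patient the average probability is 0.6, so strategy 2 calls her positive, while two of three forced choices are negative and majority rule calls her negative. The disagreement comes from information discarded by the early forced choices.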

When one has a probability of being in a clinical class and such probabilities are not all near 0 or 1, the probabilities are self-contained in terms of capturing the difficulty of the task of classifying patients. This translates directly to quantifying the arbitrariness of forced-choice classifications.

Resources

Probabilistic readjudication of heart failure hospitalization events in the PARAGON-HF study by GM Felker, J Butler, JL Januzzi, AS Desai, JJV McMurray, SD Solomon (includes a multiple imputation approach)
A comparison of approaches for adjudicating outcomes in clinical trials by BC Kahan, B Feagan, V Jairath
Descriptive approach to analyzing observer variability
How breaking ties in a variable increases statistical power
Against diagnosis in favor of matters of degree, by AJ Vickers, E Basch, MW Kattan
Probabilistic prediction in patient management and clinical trials by DJ Spiegelhalter
The end of the “syndrome” in critical care by Lawrence Lynn

Reuse

CC BY 4.0

The Burden of Demonstrating Statistical Validity of Clusters

Frank Harrell — Sun, 06 Oct 2024 05:00:00 GMT

Background

Clustering of patients to find new “phenotypes” is now a fad. For example, repeating the false assertion that diabetes was ever a binary diagnosis, Ahlqvist et al claimed to have found 5 diabetes subtypes using a purely statistical analysis not driven by clinical knowledge. What they found is likely just inefficient prognostic stratification that could be improved upon by directly relating patient characteristics to outcomes.

Maarten van Smeden showed that clustering algorithms easily get the wrong number of clusters when the true number of clusters is known, and Darren Dahly showed in a simple example that clustering is essentially telling us, for example, that people who are older than 65 are older than people who are under 65. van Smeden, Harrell, and Dahly wrote a letter to the editor concerning the Ahlqvist paper, casting extreme doubt on the original authors’ assertions that “new forms of diabetes” have been identified or that this is a useful “step towards precision medicine in diabetes”. van Smeden et al pointed out that direct modeling of outcomes is likely to have much greater payoff, and that the clusters found by Ahlqvist et al are very unlikely to be what they seem. Ahlqvist et al did not even assess within-cluster homogeneity of the component variables nor did they assess within-cluster outcome homogeneity.

What is the Question and Why Cluster Patients?

Most medical applications of statistical clustering techniques fail to address the most basic questions such as

What is the ultimate goal to which the results of the statistical analysis will be used?
Is the disease being studied all-or-nothing as assumed by clustering algorithms when doing the analysis on “diseased” patients?
What is the best way to summarize the result? Is it patient cluster membership, a clinical prediction model (which much better handles categorical patient characteristics), or is it variable clustering?

Variable clustering often is more likely to meet investigators’ goals than patient clustering. Variable clustering does not discard nearly as much information as patient clustering, is less arbitrary, scales to more variables, and better handles collinearities / redundancies. Sparse principal components analysis (PCA) is also a very useful tool, combining variable clustering with PCA to handle collinearities while providing a more sparse representation of the patient baseline variables. Both of these variable clustering approaches can easily feed their results into standard clinical prediction models to learn how various dimensions of the patient relate to outcomes.

How Should Clustering Results be Presented?

In the minority of cases where patient clustering is most likely to meet clinical goals, investigators must be made aware that forced-choice classification (assigning each patient to a cluster with no gray zone) is not often the best way to represent clusters. Assignment to discrete clusters assumes that R. A. Fisher’s definition of clusters as compact sets is in play. In other words, pretending that clusters are discrete assumes that clusters are compact, i.e., there is no meaningful heterogeneity within a cluster. When, for example, a patient at the outer boundary of one cluster is closer to a patient at the outer boundary of a different cluster than she is to the center of her own cluster, simply labeling her as a member of “her” cluster is misleading.

It is much more natural to use the results of patient clustering in a continuous, less assumption-laden fashion. For example, one can summarize the results of clustering in the following ways when relating clusters to outcomes:

For $k$ clusters and for each patient, compute the $k$ distances from the cluster centers as outcome predictors.
Similarly, compute the probabilities that the patient belongs to each of the $k$ clusters and use the logits of these probabilities as predictors.

Forced-Choice Cluster Classification Requires Verifying Adequacy of Mere Cluster Membership

When probabilities of cluster membership are not all near 0 or 1, the clusters are not compact enough to be used in forced-choice cluster classification, and likewise if the distributions of distances from cluster centers are wide.

Consider computing the median distance between all possible pairs of cluster centers, and showing that the individual patient distances from their own cluster centers are below, say, 1/5th of the median distance between cluster centers more than 4/5 of the time, to demonstrate that cluster membership is not far from an all-or-nothing phenomenon.
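The compactness check just described can be sketched in base R as follows. The simulated data, the use of `kmeans`, and Euclidean distance are assumptions for illustration; any clustering method that yields cluster centers could be substituted.

```r
set.seed(2)
# Simulated data: 3 well-separated groups in 2 dimensions (an assumption)
centers_true <- rbind(c(0, 0), c(10, 0), c(0, 10))
grp <- rep(1:3, each = 100)
x <- centers_true[grp, ] + matrix(rnorm(600, sd = 0.5), ncol = 2)

km <- kmeans(x, centers = 3, nstart = 20)

# Median distance between all pairs of cluster centers
d_centers <- median(as.vector(dist(km$centers)))

# Distance of each patient from the center of her own cluster
d_own <- sqrt(rowSums((x - km$centers[km$cluster, ])^2))

# Proportion of patients within 1/5 of the median between-center distance
prop_close <- mean(d_own < d_centers / 5)
compact <- prop_close > 4 / 5
```

Here the clusters were simulated to be tight, so `compact` is `TRUE`. With real patient data the proportion is often far lower, which argues against forced-choice assignment.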

If forced-choice cluster assignments are still of interest, these assignments must be validated with regard to adequacy of summarization of statistical information contained in the original component variables. In other words, demonstrate that the cluster identifiers are sufficient for conveying the information (e.g., phenotypes) the clusters are purported to contain, when there is an outcome or response variable that the clusters are supposed to predict. Here are some useful steps in that endeavor:

Define $A$ as the set of indicator variables for membership in clusters
Define $B$ as the set of distances a patient has from each of the cluster centers
Fit models to predict patient outcome, with the models containing as predictors both sets $A$ and $B$, and models containing $A$ and $B$ separately
Compute likelihood ratio tests to assess the prognostic information due to each set
Compute the proportion of overall likelihood ratio for $A$ and $B$ combined that is due to each of the sets
Verify that the proportion of predictive information provided by $B$, after adjusting for $A$, is small. See this and this for more information.
Demonstrate that the clusters provide new prognostic information after accounting for previously known prognostic variables. In a similar fashion to the previous demonstration, replace set $B$ with known prognostic variables and compute the fraction of new prognostic information that is provided by the cluster indicators.
Demonstrate that cluster assignments cannot be easily predicted from simple features, using for example polytomous (multinomial) logistic regression.
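The likelihood ratio comparisons in these steps might look like the following base R sketch. Here `A` holds the cluster indicators and `B` the distances from each cluster center. The simulated data, the binary outcome, and the use of `glm` are assumptions; in practice the outcome model would be whatever suits the response.

```r
set.seed(3)
n <- 600
x <- matrix(rnorm(2 * n), ncol = 2)       # simulated component variables
y <- rbinom(n, 1, plogis(x[, 1]))         # outcome driven by a continuous feature

k  <- 3
km <- kmeans(x, centers = k, nstart = 20)
A  <- factor(km$cluster)                  # set A: cluster membership
# Set B: distance of each patient from each of the k cluster centers
B  <- sapply(seq_len(k), function(j)
        sqrt(rowSums(sweep(x, 2, km$centers[j, ])^2)))

# Likelihood ratio chi-square relative to the intercept-only model
lr <- function(fit) fit$null.deviance - fit$deviance

f_A  <- glm(y ~ A,     family = binomial)
f_B  <- glm(y ~ B,     family = binomial)
f_AB <- glm(y ~ A + B, family = binomial)
lr_A <- lr(f_A); lr_B <- lr(f_B); lr_AB <- lr(f_AB)

# Proportion of the combined LR attributable to each set alone
prop_A <- lr_A / lr_AB
prop_B <- lr_B / lr_AB
# Proportion of predictive information added by B after adjusting for A;
# forced-choice clusters are adequate only if this is small
added_by_B <- (lr_AB - lr_A) / lr_AB

# LR test for B adjusted for A
anova(f_A, f_AB, test = "Chisq")
```

Because the simulated outcome depends on a continuous feature, `added_by_B` will typically be far from zero here, which is the signal that cluster membership alone is an inadequate summary.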

Demonstrating Stability

Besides adequacy of statistical summarization of component variables, clusters must be validated for stability. A simple bootstrap procedure can document stability of found clusters, and when the number of clusters was not completely pre-specified (before analyzing the data), the number of clusters should be allowed to “float” across resamples, and the frequency distribution of the number of found clusters provided.

Ultimate Validations

Statistical validations of cluster structure and especially of the adequacy of the cluster summarizations are easy. But the clusters then need to be validated in the more difficult way, by demonstrating clinical usefulness of the clusters. Examples of clinical usefulness include

demonstrating that the clusters are clinically interpretable and that patients are homogeneous within the finest level of detail used to summarize clusters
- If forced-choice classification is used, show that there is no remaining clinical information within each choice.
- If distances from all cluster centers are used, show that there is no remaining clinical information once the distances are fixed.
demonstrating within a randomized clinical trial that the clusters are uniquely useful for capturing differential treatment effect, e.g., showing that there is an important interaction between treatment and clusters but no important interaction between treatment and pre-specified raw baseline variables
showing that the number of clusters is clinically correct.

Do similar likelihood ratio assessments as above to compare the total treatment × cluster interaction effect to the total treatment × cluster distance effects and to the total treatment × original variable effects. Forced-choice clusters will be embarrassed if the log-likelihood accounted for by simple raw variable (or cluster distance) interactions exceeds that accounted for by cluster memberships.

Reuse

CC BY 4.0

Hosting Web Content

Frank Harrell — Sun, 29 Sep 2024 05:00:00 GMT

One of my best decisions was to build my own web sites hbiostat.org and fharrell.com so that I have total control of content and formatting and can easily and quickly post content updates. I want to share a few things I’ve learned.

While your organization’s web pages are great for static content, my public-facing content evolves rapidly with constant improvements made to course web pages, miscellaneous web pages such as hbiostat.org/data, blog articles, and course handouts. To make it easy to update and to add new pages I’ve found it productive to take control of the situation using web sites that are served by the amazing netlify.com. Some people prefer to create web sites with Github but I like to have total control of formats.

With the Netlify approach you create a static web site on your local computer (I have my main one under ~/web for hbiostat.org), and whenever there is a significant change to your material, you have Netlify re-deploy the web site to Netlify (it only sends what has been changed). I use my own domains but you can use free *.netlify.app domains. Your local computer provides a mirror of what is available publicly and you can easily preview changes by just opening one of your .html files. Deployment can be handled interactively or better by using the Netlify command line app – you just run a single command which I abbreviate as e.g. hdeploy after doing a one-time authentication. If you ever decide to quit using Netlify you have all the web content locally for easy deployment to AWS or anywhere else you want to put it.

I have the deluxe Netlify paid plan because some pages have high traffic but you can do a lot with the Netlify free plan. hdeploy stands for netlify deploy -p -s hbiostat -d ~/web.

To create web content locally I recommend one of the following (or a mix of them):

Simple markdown .md files that are converted to .html using pandoc (creates very small and fast html)
R markdown run through R to create html
Quarto (produces the most beautiful output but if you want the html file to be self-contained and not part of a whole Quarto web site it will create larger html files)

The beauty of Quarto is that it creates books, individual reports, blogs, presentations, and whole web sites. fharrell.com was created completely by Quarto. When you have connected web pages (book chapters; blog articles) the individual html files are lean. Here is an example of a standalone Quarto web page: hbiostat.org/r/hmisc.

The hbiostat.org home page was created using R markdown. The R markdown script is here.

Here is a course web page which I converted from a wiki (see below for a conversion script) to a simple markdown file: hbiostat.org/b2 . The markdown for this is at hbiostat.org/b2/index.md . The pandoc command to convert from .md to .html is

  pandoc --toc --css=https://bootswatch.com/5/cerulean/bootstrap.min.css \
    -s -o index.html index.md

An example of course notes created with Quarto is hbiostat.org/rmsc.

One other lesson I’ve learned over the years after hosting my web pages on Amazon Web Services (AWS) is that when you have to support your own Linux or Windows web server such as an AWS Lightsail instance, the time spent in keeping the site secure and software updated is significant, and doing updates to web pages is not as easy as the local ~/web Netlify mirroring approach. It is far easier to host a static web site where Netlify takes care of 100% of system and web server software issues. There is nothing to update on your site other than the actual web content.

Miscellaneous Tips

Creating an `index` File

Sometimes you want to add a directory full of files to a public web page without taking the time to create index.md to point to each file. The following shell script uses the wonderful tree Linux/Mac app to create index.html such as the one appearing here. Dates are file-last-modified dates.

#! /bin/bash
# Stored in ~/bin/mkindexd

tree --ignore-case -C -I '*confidential*|*cache*|*courseregistrants*' \
  --timefmt "%F  " \
  -H . | sed -e "s/\(.*\)
/\\2\ \ \ 
/" \
  -e "s/\[  \]  //" > index.html

Converting Wiki Content to Markdown

Here is a shell script that converts legacy wiki markup to regular markdown.

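#! /bin/bash
#
# Convert from foswiki-type wiki markup to markdown
# Run e.g. wiki2md foo to convert foo.wiki to foo.md
#
# With sed -E a bare + is a quantifier, so the literal + signs in wiki
# headings are escaped.  Extended regular expressions have no non-greedy
# quantifiers, so negated character classes keep each link match short.
sed -E -e 's/\[\[([^] ]+)\]\[([^]]+)\]\]/[\2](\1)/g' \
    -e 's/\[\[http([^] ]+) ([^]]+)\]\]/[\2](http\1)/g' \
    -e 's/^---\+\+\+/###/' \
    -e 's/^---\+\+/##/' \
    -e 's/^---\+/#/' \
    -e 's/%N%//g' "$1.wiki" > "$1.md"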

Reuse

CC BY 4.0

Tips for Biostatisticians Collaborating with Non-Biostatistician Medical Researchers

Frank Harrell — Tue, 30 Jul 2024 05:00:00 GMT

Slides

Rare Degenerative Diseases & Statistics: Methods for Analyzing Composite Patient Outcomes

Frank Harrell — Thu, 11 Jul 2024 05:00:00 GMT

Traditional Frequentist Inference Uses Unrealistic Priors

Frank Harrell — Mon, 10 Jun 2024 05:00:00 GMT

Background

Consider these four conditions:

There is no reliable prior information about an effect and an uninformative prior is used in the Bayesian analysis
There is only one look at the data
The look was pre-planned and not data-dependent
A one-sided assessment is of interest, so that one-tailed p-values and Bayesian posterior probabilities $P(\theta > 0 \mid \text{data})$ are used, where $\theta$ is the effect parameter of interest (e.g., difference in means, log effect ratio) and $\mid$ means “conditional on” or “given”.

One-Sided vs. Two-Sided Assessment

A two-tailed frequentist test contains a multiplicity adjustment that is designed as if the researcher is equally interested in making a claim for harm as she is for benefit of a treatment. When comparing two-tailed tests to the usual Bayesian posterior probability that the benefit is greater than zero, Bayes’ directionality will give it an instant benefit. Quantifying evidence for either a positive or negative benefit through $\max(P(\theta > 0 \mid \text{data}), P(\theta < 0 \mid \text{data}))$ will also give Bayes a benefit because this maximum must be $\geq 0.5$. Bayes and frequentist two-sided assessments can be put on an equal footing by computing the posterior probability $P(|\theta| > \epsilon \mid \text{data})$ for a certain sample-size-dependent $\epsilon$ (the posterior probability is 1.0 for $\epsilon = 0$ since we assume $\theta$ is a continuous parameter with $P(\theta = 0) = 0$). For simplicity in what follows I address only one-sided assessments.

If all four of the above conditions hold, then Bayesian inference about a positive effect will coincide largely with one minus a frequentist one-tailed p-value. However this way of thinking ignores the very important fact that even when there are no reliable data about the specific magnitude of treatment effect, there is always a reliable constraint on that unknown effect. For example, we know that most treatments are not curative, so it is impossible for the true treatment effect to have, for example, an odds ratio or hazard ratio of 0.0. Turning this idea around, since Bayesian and frequentist inference are “close” if an uninformative prior is used, what are the implications to frequentist inference?
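The near-equivalence can be checked numerically in the simplest case. This sketch assumes a normal mean with known standard deviation and a flat prior, for which the agreement is exact; the simulated data and the names `p_one` and `post_pos` are illustrative.

```r
set.seed(4)
n     <- 50
sigma <- 1                        # SD treated as known (an assumption)
y     <- rnorm(n, mean = 0.2, sd = sigma)
ybar  <- mean(y)
se    <- sigma / sqrt(n)

# Frequentist one-tailed p-value for testing against a positive effect
p_one <- pnorm(ybar / se, lower.tail = FALSE)

# With a flat prior the posterior for the effect is normal(ybar, se^2),
# so the posterior probability of a positive effect is
post_pos <- pnorm(0, mean = ybar, sd = se, lower.tail = FALSE)

all.equal(post_pos, 1 - p_one)
```

With unknown variance or a non-normal likelihood the two quantities would only agree approximately.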

Flatter and Flatter Priors

Motivated by this which contains a quote from the Prior Distributions for rstanarm Models, there are wide implications of placing no constraints on the unknown parameter $\theta$. Before giving the implication of a truly uninformative prior on the $\theta$ scale, consider a prior that is a Gaussian distribution with mean zero and standard deviation $\sigma$. What does $\sigma$ say about what we know about $\theta$?

If the raw data have a standard deviation of 1.0, a value of a difference in means of $\theta$ equal to 3 would be judged to be quite large. In the vast majority of studies, there would be strong expert opinion that the chance that $|\theta| < 3$ would exceed the chance that $|\theta| > 3$. Let’s compute prior probabilities as a function of $\sigma$.
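As a sketch of such a computation, assume the quantity of interest is the prior probability that the absolute effect exceeds 3 under a Gaussian prior with mean zero. The grid of prior standard deviations below is arbitrary.

```r
# Prior standard deviations to examine (arbitrary grid)
sigmas <- c(1, 2, 5, 10, 100, 1000)

# Prior probability that the absolute effect exceeds 3 under a
# normal(0, sigma^2) prior
p_large <- 2 * pnorm(-3 / sigmas)

data.frame(sigma = sigmas, prob_abs_effect_gt_3 = round(p_large, 4))
```

As the prior flattens, almost all of the prior mass moves to effects larger in absolute value than 3, which is the opposite of what expert opinion would assert in most studies.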

Borrowing Information Across Outcomes

Frank Harrell — Tue, 30 Apr 2024 05:00:00 GMT

Background

As explained here, the power for a group comparison can be greatly increased over that provided by a binary endpoint by using an ordinal endpoint, with greater increase when the ordinal endpoint has several well-populated categories or has a great many categories, in which case it becomes a standard continuous variable. When a randomized clinical trial (RCT) is undertaken and deaths can occur, there are disadvantages to

excluding deaths and analyzing responses only on survivors
using death as a competing risk, which makes for hard-to-interpret results and doesn’t penalize efficacy for death
using a complex estimand that involves counterfactuals or other complexities

By making death the worst level of an ordinal response $Y$, nothing is swept under the rug, and a treatment having more deaths is penalized for that. Evidence for treatment effectiveness may be driven by the nonfatal outcomes. Suppose for example that $Y$ is renal function at 6 weeks, measured by serum creatinine, with death coded as a value higher than the highest observed creatinine (it doesn’t matter how high for ordinal analyses). Evidence for treatment effectiveness in improving $Y$ may be stated as “the treatment improved renal function accounting for death”.

Often sponsors want evidence for a specific effect on mortality, even though they are unwilling to budget for a study large enough to provide evidence for a mortality benefit on its own. In that case, the only way to have Bayesian or frequentist power to detect a mortality improvement is to assume that some of the treatment benefit on nonfatal outcome components spills over to mortality. The partial proportional odds (PO) semiparametric ordinal logistic regression model by Peterson & Harrell, 1990 when coupled with a Bayesian implementation of the model provides a very formal way to borrow treatment effect information across levels of Y.

Suppose that for simplicity we ignore power-enhancing baseline covariates, and have an outcome variable $Y = 0, 1, \ldots, k$ where $Y = k$ represents death. The PO model can be written as

$$P(Y \geq y \mid \text{Tx}) = \text{expit}(\alpha_y + \beta [\text{Tx} = B])$$

where $y = 1, \ldots, k$, $\text{expit}(x) = \frac{1}{1 + e^{-x}}$ (inverse logit), $\alpha_y$ is the intercept corresponding to a cutoff of $y$ ($\text{expit}(\alpha_y) = P(Y \geq y \mid \text{Tx} = A)$), $[\text{Tx} = B]$ is 1 for treatment B and 0 for treatment A, and $\beta$ is the B:A log odds ratio. Hence $\exp(\beta)$ is the B:A odds ratio (OR). Under the PO assumption the possible B:A ORs for $Y \geq y$ are the same for all $y$. For example the treatment effect on death is $\exp(\beta)$, just like the treatment effect on the last three categories combined, $Y \geq k - 2$.

Peterson and Harrell proposed the partial PO model and the constrained partial PO model. Using the latter we allow for a special effect of treatment only on $Y = k$ and assume a constant OR for all other $Y$-cutoffs, for example. This constrained partial PO model is

$$P(Y \geq y \mid \text{Tx}) = \text{expit}(\alpha_y + \beta [\text{Tx} = B] + \tau [\text{Tx} = B][y = k])$$

where $[y = k]$ is an indicator variable that is 1 if $y = k$, 0 otherwise. $\tau$ represents a “special effect of treatment B” for $Y = k$. So the B:A odds ratio for $Y = k$ is $\exp(\beta + \tau)$.

If the model is fitted using a frequentist maximum likelihood approach, or using a Bayesian procedure that puts a non-informative prior on $\tau$, the precision of $\beta + \tau$ (or its anti-log, the OR for mortality) will come from the effective sample size for a pure death outcome.
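To make the constrained partial PO model concrete, the following base R sketch assumes the form logit P(Y ≥ y) = alpha_y + beta [B] + tau [B][y = k], with death as the top level k. The parameter values are hypothetical and chosen only so that the arithmetic is easy to follow.

```r
expit <- function(x) 1 / (1 + exp(-x))

# Hypothetical parameters for Y = 0, 1, 2 (k = 2, with Y = 2 being death)
alpha <- c(0, -2)         # intercepts for cutoffs y = 1 and y = 2
beta  <- log(0.6)         # B:A log odds ratio common to all cutoffs
tau   <- log(0.9 / 0.6)   # extra effect of treatment B on death only

k <- length(alpha)

# P(Y >= y) for y = 1, ..., k; tx_b is 1 for treatment B and 0 for A
cum_prob <- function(tx_b) {
  special <- as.numeric(seq_len(k) == k)      # indicator that y = k
  expit(alpha + beta * tx_b + tau * tx_b * special)
}

# P(Y = 0), ..., P(Y = k) by differencing the cumulative probabilities
cell_prob <- function(tx_b) -diff(c(1, cum_prob(tx_b), 0))

odds <- function(p) p / (1 - p)
or_by_cutoff <- odds(cum_prob(1)) / odds(cum_prob(0))
or_by_cutoff   # exp(beta) for y = 1 and exp(beta + tau) for death
```

The OR for the nonfatal cutoff is exp(beta) = 0.6 while the OR for death is exp(beta + tau) = 0.9. A prior that shrinks tau toward zero pulls the mortality OR toward the common OR, which is how information is borrowed across outcome levels.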

Example Partial Proportional Odds Analysis

Suppose that we have a parallel-group two-treatment randomized trial, and are in the bizarre situation where there are no patient risk factors, i.e., patient outcomes are homogeneous within treatment so that no covariates are needed. The treatment is designed to keep hospitalized patients from requiring mechanical ventilation, and hopefully also to lower in-hospital mortality. Suppose the outcomes are $Y = 0, 1, 2$ for alive and not on ventilator, alive and on ventilator, or dead, respectively. Suppose the following outcome frequencies and summary statistics arise.

Proportional Odds Model Power Calculations for Ordinal and Mixed Ordinal/Continuous Outcomes

Frank Harrell — Mon, 22 Apr 2024 05:00:00 GMT

Background

A binary endpoint in a clinical trial is a minimum-information endpoint that yields the lowest power for treatment comparisons. A time-to-event outcome, when only a minority of subjects suffer the event, has little power gain over a pure binary endpoint, since its power comes from the number of events (number of uncensored observations). The highest power endpoint would be from a continuous variable that is measured precisely and reflects the clinical outcome situation. An ordinal outcome with, say, five or more well-populated levels yields power that can approach that of a truly continuous outcome. When the original binary outcome represents the union of several clinical endpoints, the analysis treats all endpoints as equally important. An ordinal outcome variable, even if it has only a few levels, breaks the ties and improves power. When there are clinical responses that are ordinal or continuous and one is willing to assume that any clinical event overrides any level of such a scale, one can easily construct an ordinal scale that incorporates both clinical events and the scale. A great advantage of this approach, besides improved power, is that it gives one a place to record bad clinical events such as death, without the need for using a complex statistical analysis that has to deal with such questions as “what would have happened had the patient not died” or “how do I handle missing data when the patient died before the ordinal/continuous was measured?”.

Continuous and ordinal scales can be compared between treatment groups using the Wilcoxon-Mann-Whitney two-sample test. This does not allow for adjustment of covariates, nor does it properly handle a large number of ties in the patient responses. The proportional odds (PO) ordinal logistic regression model is a generalization of the Wilcoxon test, and it handles arbitrarily heavy ties. Since the Wilcoxon test assumes within-group homogeneity of outcome tendencies, the Wilcoxon test makes more assumptions than the PO model. We will use the PO model in what follows. For now we consider the oversimplified situation in which each patient has recorded only the worst outcome category that occurred within 12 months of randomization. For the actual clinical trial analysis we might use a longitudinal Markov state transition model in which the ordinal outcome scale is assessed weekly until the patient dies or follow-up ends. This approach counts multiple hospitalizations as having more weight than a single hospitalization. Since the power calculations below use only one overall measurement per patient, they represent a lower bound on the power that will actually be obtained. The power formula used here is due to Whitehead¹ using the R² Hmisc package function popower or by simulation using the simRegOrd function³.

The PO model handles treatment effects through odds ratios. Let the ordinal outcome be denoted by $Y$ and one of its levels be $y$. Consider the probability that $Y \geq y$ for a patient on treatment A and for a patient on treatment B. The odds that $Y \geq y$ for a treatment is the probability divided by one minus the probability. The B : A odds ratio for $Y \geq y$ is some constant OR and by the PO assumption this OR is the same no matter which cutoff $y$ is chosen. When the hard clinical events are at the high end of the ordinal scale and a patient oriented outcome scale is at the low end, the PO model assumes that the treatment effect (on the OR scale) for, say, death is the same as the OR for the outcome scale being worse than any given level or the patient dying. When the treatment has a different effect on nonfatal outcomes than on fatal ones, the overall OR represents a kind of weighted average over the various treatment effects. See this for an example where the treatment effect on death is allowed to differ from the effect on nonfatal outcomes when computing power.

Initially Considered Outcome Scale

Consider the Kansas City Cardiomyopathy Questionnaire of JA Spertus et al, which measures symptoms, social and physical limitations, and quality of life in heart failure patients. KCCQ ranges from 0 to 100 with 100 being the most desirable outcome. To account for clinical events, we extend the KCCQ scale with three clinical event overrides. Assume that KCCQ and event status are assessed one year post randomization.

Y	Meaning
103	Cardiovascular death
102	Non-cardiovascular death
101	Hospitalization for HF/Renal Disease
100	KCCQ=0
99	KCCQ=1
..	…
1	KCCQ=99
0	KCCQ=100

KCCQ is the 1 year KCCQ overall summary score. Non-CV death is placed after CV death because we wanted to give more weight to treatment effects on the latter, and we expect less effect on the former. For the actual analysis of trial data, KCCQ will have non-integer values and will be analyzed as a continuous variable, i.e., there will be one intercept in the PO model per distinct value of KCCQ, less one, plus three for clinical events.
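The mapping in the table can be sketched in base R as follows. The six patients, the variable names, and the event labels are hypothetical; the coding itself follows the table, with any clinical event overriding the KCCQ score.

```r
# Hypothetical one-year data for 6 patients
kccq  <- c(100, 72.5, 0, 55, NA, NA)   # NA when the patient died
event <- c("none", "none", "none",
           "hf_renal_hosp", "non_cv_death", "cv_death")

# Clinical events override any KCCQ value; otherwise Y = 100 - KCCQ
y <- ifelse(event == "cv_death",      103,
     ifelse(event == "non_cv_death",  102,
     ifelse(event == "hf_renal_hosp", 101,
            100 - kccq)))

data.frame(kccq, event, y)
```

Only the ordering of the resulting values matters for the PO model, so the specific numbers assigned to the clinical events are arbitrary as long as they exceed 100 and preserve the intended ranking.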

A sample of KCCQ summary scores was provided by Vanderbilt cardiologist Brian Lindman, for which summary statistics are shown below along with a histogram.

The log-rank Test Assumes More Than the Cox Model

Frank Harrell — Thu, 28 Mar 2024 05:00:00 GMT

Background

The log-rank test is a Mantel-Haenszel “observed - expected frequency” type of test that was derived in a slightly ad hoc way by Nathan Mantel in 1966 and named the logrank test by R Peto and J Peto in 1972. It was later formally derived as the rank test having optimal local power for a shift in the type I extreme value (Gumbel) distribution. This horizontal shift is equivalent to a vertical shift in survival distributions after log-log transforming them. This is identical to saying the two survival distributions are in proportional hazards, i.e., that one survival curve is the other one raised to a constant power.

It is well known that when a Cox proportional hazards (PH) model contains only a single covariate, and it is binary, this two-group-comparison setup gives rise to a Rao score test¹ for testing the difference between the two groups. This score test is identical to the log-rank test statistic when there are no ties in the failure times. So we already know that the log-rank test makes all the assumptions of a Cox PH model, and in a sense we can go further than that. The two approaches are one and the same when there are only two groups and no covariates.

¹ The score test has in the numerator the first derivative of the Cox log-likelihood function with respect to the regression coefficient $\beta$, evaluated at $\beta = 0$.

Therefore the only question that is unsettled relates to the fact that the score test is not used very frequently with the Cox model, with analysts instead opting for the Wald statistic or the gold-standard likelihood ratio (LR) statistic. This article uses simulated datasets to show how little it matters when one compares the log-rank statistic with the Cox LR instead of the score statistic. I also show that the Pike log-rank HR estimate agrees extremely well with the Cox HR estimate.

There are two HRs that go along with the log-rank test, the Pike estimator and the Peto estimator. The linked article shows that the Pike HR estimator is better. This HR estimator is the ratio of two ratios. Each of these ratios is the ratio of the observed number of events in a group to the expected number of events in the group.
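The observed and expected event counts behind both the log-rank statistic and the Pike estimator can be sketched in a few lines. This is a minimal plain-Python illustration, not the Hmisc implementation; the function name `logrank_pike`, the 0/1 group coding, and the choice to report the ratio as group 1 : group 0 are assumptions made for the example.

```python
def logrank_pike(times, events, groups):
    """Two-sample log-rank chi-square and Pike HR (group 1 : group 0).

    times  : follow-up times
    events : 1 if the event was observed, 0 if right-censored
    groups : 0 or 1 (hypothetical coding chosen for this sketch)
    """
    groups = list(groups)
    data = sorted(zip(times, events, groups))
    n_at_risk = [groups.count(0), groups.count(1)]
    observed = [0.0, 0.0]
    expected = [0.0, 0.0]
    variance = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = [0, 0]
        leaving = [0, 0]
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            _, e, g = data[i]
            deaths[g] += e
            leaving[g] += 1
            i += 1
        n = n_at_risk[0] + n_at_risk[1]
        d = deaths[0] + deaths[1]
        if d > 0:
            for g in (0, 1):
                observed[g] += deaths[g]
                # Expected events if both groups shared one hazard
                expected[g] += d * n_at_risk[g] / n
            if n > 1:
                # Hypergeometric variance of the group 1 event count
                variance += (d * (n_at_risk[0] / n) * (n_at_risk[1] / n)
                             * (n - d) / (n - 1))
        for g in (0, 1):
            n_at_risk[g] -= leaving[g]
    chi_square = (observed[1] - expected[1]) ** 2 / variance
    pike_hr = (observed[1] / expected[1]) / (observed[0] / expected[0])
    return chi_square, pike_hr


# Tiny made-up example: group 1 fails early, group 0 fails late
chi_square, hr = logrank_pike([1, 2, 3, 4, 5, 6],
                              [1, 1, 1, 1, 1, 1],
                              [1, 1, 1, 0, 0, 0])
print(round(chi_square, 3), round(hr, 3))
```

Swapping the group labels leaves the chi-square unchanged and inverts the Pike HR, which is a quick sanity check on any implementation.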

See this article for more about the log-rank test and the difference between semiparametric and truly nonparametric assumption-free methods.

Simulated Numerical Examples

Fast code for computing the two-sample log-rank statistic and the Pike HR is found in the Hmisc package.

What Does a Statistical Method Assume?

Frank Harrell — Sat, 23 Mar 2024 05:00:00 GMT

A Definition

All statistical procedures have assumptions. Even the most simple response variable (Y) where the possible values are 0 and 1, when analyzed using the proportion that Y=1, assumes that Y is truly binary, every observation has the same probability that Y=1, and that observations are independent. Non-categorical Y have more assumptions. Even simple descriptive statistics have assumptions as described below. But what does it mean that an assumption is required for using a statistical procedure? I’ll offer the following situations in which we deem that a specific assumption (A) is involved in using a specific statistical procedure or estimator (S).

S performs worse when A is not met and better when A is met; ideally S performs as well as any other method when A is met
S is difficult to interpret when A is not met and easier to interpret when A is met
S was derived explicitly under A
S is a special case of a more general method that was derived under A
S is an estimator, and the usual method of estimating uncertainty in S works when A holds and fails when A does not hold

Performance may be of several kinds, for example:

Bias (in a frequentist procedure)
Variance
Mean squared error, mean absolute error, etc.
Actual type I assertion probability equals the stated $\alpha$ (in a frequentist procedure)
High frequentist or Bayesian power, which is related to high relative efficiency (e.g., variance or sample size ratios)
Actual compatibility (confidence) interval coverage equals the stated coverage
- For a 2-sided compatibility interval with coverage $1 - \alpha$, $\frac{\alpha}{2}$ of intervals constructed using the procedure should have the lower limit above the true unknown parameter and $\frac{\alpha}{2}$ of such intervals should have the upper limit below the true unknown value
Accuracy of uncertainty estimates such as standard error

Most of the usual statistical estimators and procedures have these hidden assumptions:

the data to be representative of a process to which you want to infer (e.g., the data are a random sample from a population of interest)
observations are independent unless dependencies are explicitly taken into account
measurements are unbiased unless nonrandom measurement errors are explicitly taken into account
observations are homogeneous (all observations have the same statistical tendencies such as mean and variance) with regard to non-adjusted-for factors. Examples:
- A simple proportion for Y=0/1 is intended to be used on a sample where every observation has the same chance of Y=1
- A two-sample $t$-test assumes homogeneity within each of the two groups
- A linear model for doing analysis of covariance to compare two treatments adjusted for age assumes homogeneity within groups defined by treatment and age combinations
certain other aspects of the study design are taken into account

I take it as given that if the output (e.g., parameter estimate or test statistic) of statistical method 1 has a one-to-one relationship with an output of statistical method 2, with the rank correlation between the two outputs equal to 1.0 over all datasets, then method 1 makes the same assumptions as method 2. In that case method 1 (even if it is an ad hoc procedure) is just method 2 (even if it is a formal model) in disguise.

If violation of assumption A causes equal damage to statistical procedures 1 and 2, those procedures are making assumption A to the same degree.

More About Hidden Assumptions

Some estimators provide reasonable estimates even when there are correlations among observations, but estimates of uncertainties of estimates can be badly affected by correlations.

In a linear model, un-modeled heterogeneity reduces $R^2$ and is added into the error term (residuals are larger). As detailed here, heterogeneity in nonlinear models with no error term acts much differently, tending to attenuate the regression coefficients of modeled factors towards zero.

Historical Note

The Wilcoxon-Mann-Whitney two-sample rank-sum test was developed independently by Wilcoxon in 1945 and Mann and Whitney in 1947. It was developed as a test of equality of distributions against a stochastic ordering alternative. Neither the exact form of the difference in the two distributions for which the test has optimum sensitivity nor a model from which the test could be derived were known at that time. Not until a general theory of linear rank tests¹ did a general method for deriving rank tests to detect specific alternatives become available. Then the Wilcoxon test was derived as the most locally powerful linear rank test for detecting a location shift in two logistic distributions (proportional odds). In 1980, McCullagh showed that the numerator of the Rao efficient score test in the proportional odds model is identical to the Wilcoxon statistic.

In a similar way the log-rank test was proposed in a somewhat ad hoc fashion by Nathan Mantel in 1966 and named the logrank test by R Peto and J Peto in 1972. Later the log-rank test was put into the context of the general theory of linear rank statistics, from which it is derived as a solution to the type of distribution shift that makes log-rank the locally most powerful rank test. That particular shift (location shift in Gumbel distributions) represents a proportional hazards alternative.

That the Wilcoxon and log-rank tests are not nonparametric (unlike the Kolmogorov-Smirnov two-sample test) is readily seen by their achieving very low power when the two distribution curves cross in the middle.

¹ J. Hájek, Z. Šidák, Theory of Rank Tests, Acad. Press (1967)

Examples

Consider assumptions that are specific to the method, but keep in mind the hidden assumptions above. Of the examples below, the ones that could be labeled as truly nonparametric are the simple proportion, quantiles, compatibility interval for a quantile, empirical cumulative distribution, Kaplan-Meier estimator, and Kolmogorov-Smirnov two-sample test. The other examples are semiparametric or parametric. But note that even though they are nonparametric procedures, quantiles and the Kolmogorov-Smirnov test assume continuous distributions (i.e., few ties in the data).

Simple Proportion

For binary (0/1) Y, there are only hidden assumptions, one of which is homogeneity. When a simple proportion is computed for a heterogeneous sample, the result may be precise but difficult to interpret. For example, if males and females have different probabilities that Y=1 and sex is not accounted for in computing the proportion, the proportion will estimate a marginal probability that depends on the F:M mix in the sample. When the sample F:M ratio is not the same as in the population of interest, the marginal estimate will not be very helpful.

That a simple proportion assumes homogeneity is further seen by considering an accepted measure of uncertainty. The variance of a proportion with denominator $n$ in estimating a population probability that Y=1 of $p$ is $\frac{p(1-p)}{n}$. When the observations are heterogeneous, each observation may have a different $p$. Suppose that the observations have true probabilities of $p_1, p_2, \ldots, p_n$. The variance of the overall proportion is $\frac{1}{n^2}\sum_{i=1}^{n} p_i(1-p_i)$, which may be much different from $\frac{\bar{p}(1-\bar{p})}{n}$ where $\bar{p}$ is the average of all $p$s.
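To see how far apart the two variance formulas can be, here is a small sketch comparing the variance of a proportion when each independent observation has its own probability (the sum of $p_i(1-p_i)$ divided by $n^2$) with the usual homogeneous formula evaluated at the average probability. The 50/50 mix of 0.05 and 0.95 probabilities is a made-up extreme case, and the function names are hypothetical.

```python
def var_heterogeneous(probs):
    """Variance of the sample proportion when observation i has its own
    probability p_i and observations are independent."""
    n = len(probs)
    return sum(p * (1 - p) for p in probs) / n ** 2


def var_homogeneous(probs):
    """Usual formula, pretending every observation shares the average p."""
    n = len(probs)
    p_bar = sum(probs) / n
    return p_bar * (1 - p_bar) / n


# Made-up sample: half the observations have p = 0.05, half have p = 0.95
probs = [0.05] * 50 + [0.95] * 50
print(var_heterogeneous(probs))  # about 0.000475
print(var_homogeneous(probs))    # about 0.0025
```

In this example the homogeneous formula overstates the variance by a factor of about five, so the usual standard error does not describe the uncertainty of the proportion actually computed.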

Contingency Table

For binary Y, comparing the probability that Y=1 between groups A and B leads to the Pearson $\chi^2$ test. This test assumes that Y is truly binary and that there is homogeneity, i.e., every observation in group A has the same probability of Y=1, and likewise for group B.

Mean

The sample mean assumes that extreme values are not present to the extent of destroying the mean as a measure of central tendency. When extreme values are present, the mean is not representative of the entire distribution but is heavily swayed by the extreme values. The mean is used because it is sensitive to all data values which gives it precision when you want such sensitivity and when the tails of the distribution are not heavy.

Standard Deviation

The standard deviation assumes that Y has a symmetric distribution whose dispersion is well described by a root mean squared measure. One could argue that when a mean squared difference measure is sought for an asymmetric distribution, then the half-SD should be used. There are two half-SDs: the square root of the average squared difference from the mean for those observations below the mean, and likewise for those above the mean. For asymmetric distributions the two half-SDs differ.
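The two half-SDs described above are simple to compute. The sketch below follows the verbal definition literally (root mean squared difference from the overall mean, separately for observations below and above it); the function name `half_sds` and the toy data are assumptions for illustration.

```python
from math import sqrt
from statistics import mean


def half_sds(y):
    """Return (lower half-SD, upper half-SD) around the overall mean."""
    m = mean(y)
    below = [(v - m) ** 2 for v in y if v < m]
    above = [(v - m) ** 2 for v in y if v > m]
    return sqrt(mean(below)), sqrt(mean(above))


print(half_sds([1, 2, 3, 4, 5]))  # symmetric data: the two half-SDs agree
print(half_sds([1, 1, 1, 1, 6]))  # right-skewed data: (1.0, 4.0)
```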

When the Y distribution is not symmetric, the SD may not be representative of the overall dispersion of Y, as opposed to measures such as Gini’s mean difference, presenting three quartiles, or using the median absolute difference from the median.

When the true distribution is asymmetric or has tails that are heavier than the normal distribution, it is easy to find examples where adding a point makes the difference in two sample means much greater but makes the $t$ statistic much smaller by “blowing up” the SD.
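A toy example makes this concrete. The data below are invented; adding one extreme value to the second group triples the difference in means yet lowers the pooled-variance $t$ statistic, because the SD grows faster than the mean difference.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Equal-variance two-sample t statistic for mean(b) - mean(a)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / na + 1 / nb))


a = [1, 2, 3, 4, 5]
b = [3, 4, 5, 6, 7]
b_with_outlier = b + [30]  # one extreme observation added to group b

print(mean(b) - mean(a), pooled_t(a, b))
print(mean(b_with_outlier) - mean(a), pooled_t(a, b_with_outlier))
```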

Quantiles

The $q$ quantile is the $100q$th percentile of a distribution. The sample median is the 0.5 sample quantile. The use of sample quantiles in effect assumes a continuous distribution. This is seen by the fact that sample quantiles can jump suddenly if a single observation is added to the dataset, or can not move at all if several observations are added, if there are many ties in the data. So in the non-continuous case, sample quantiles can be simultaneously volatile and insensitive to major changes.

Compatibility Interval for a Quantile

A compatibility interval for a population quantile is one of the few truly nonparametric (other than assuming continuity) uncertainty intervals in statistics. See this example for computation of the interval for a median.

Empirical Cumulative Distribution Function and Kaplan-Meier Estimator

The ECDF, which is a cumulative histogram with bins each containing only one distinct data value, has no explicit assumptions. The version of the ECDF that deals with right-censored (e.g., lost to follow-up) observations is the Kaplan-Meier estimator. K-M assumes that the censoring process is independent of the failure process.
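For readers who have not seen the product-limit calculation written out, here is a minimal sketch of the Kaplan-Meier estimator in plain Python. The function name and the toy data are assumptions for illustration; with no censoring the result reduces to one minus the ECDF.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimates at each distinct event time.

    times  : follow-up times
    events : 1 if the event was observed, 0 if right-censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Process all subjects tied at time t together
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Events and censored subjects both leave the risk set
        n_at_risk -= removed
    return curve


# Toy data with two censored observations (at times 2 and 4)
print(kaplan_meier([1, 2, 2, 3, 4, 5], [1, 1, 0, 1, 0, 1]))
```

Censored subjects contribute to the risk set up to their censoring time and then drop out without causing a step, which is where the independent-censoring assumption enters.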

Kolmogorov-Smirnov Two-Sample Test

The Kolmogorov-Smirnov test is a test that with sufficient sample size will detect any difference between two distributions. The test assumes that both distributions are continuous. It also assumes that you are equally interested in all aspects of the distribution. Otherwise it will suffer power-wise, when compared to tests of more specific distribution characteristics.

Two-Sample $t$-test

The standard two-sample $t$-test assumes normality of the raw data, which implies that the mean is a great measure of central tendency and SD is a great measure of dispersion. The standard test also assumes equality of variances in the two groups. We know these assumptions are made because if the normality or the equal variance assumption is violated, the $t$-test loses efficiency (power) and can have erroneous $\alpha$ under the null hypothesis that the two populations have the same mean. When the two sample sizes are unequal and normality holds but the variances are unequal, $\alpha$ can be triple its claimed value.
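The effect of unequal variances with unequal sample sizes is easy to check by simulation. The sketch below is one made-up configuration, not a general result: normal data, equal means, the smaller group having three times the SD of the larger group, and a critical value of 1.96 used as a large-sample approximation to the $t$ critical value. The function names, sample sizes, and seed are all assumptions.

```python
import random
from math import sqrt


def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled_var = (ssa + ssb) / (na + nb - 2)
    return (mb - ma) / sqrt(pooled_var * (1 / na + 1 / nb))


def type_one_error(n1, sd1, n2, sd2, reps=2000, seed=1):
    """Proportion of simulated null datasets with |t| > 1.96."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        a = [rng.gauss(0, sd1) for _ in range(n1)]
        b = [rng.gauss(0, sd2) for _ in range(n2)]
        if abs(pooled_t(a, b)) > 1.96:  # normal approximation, large df
            rejections += 1
    return rejections / reps


print(type_one_error(100, 1, 50, 1))  # equal variances: near the nominal 0.05
print(type_one_error(100, 1, 50, 3))  # smaller group more variable: inflated
```

The inflation arises because the pooled variance is dominated by the larger, less variable group, so the standard error of the mean difference is underestimated.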

A Bayesian $t$-test can easily allow for non-normality and unequal variances.

Central Limit Theorem To The Rescue?

At this point many statisticians will rush to claim that the central limit theorem protects the analyst. That is not the case. First of all, it is a limit theorem which is not intended to apply to non-huge sample sizes. Secondly, when there is high skewness in the data, the asymmetric data distribution makes the SD not independent of the mean (which implies the standard error of the mean difference is also dependent on the means), and neither a $t$ nor a normal distribution applies to the ratio between the difference in means and the standard error of this difference when the two are not independent. Sample sizes of even 50,000 can result in poor compatibility interval coverage from the central limit theorem when extreme skewness is present.

Among the many things the central limit theorem cannot do for you, getting transformations of continuous Y “right” is one of them.

Multiple Regression

The standard linear model assumes normality and equal variance of residuals, and assumes the population mean is the specified function of the predictors. If the mean or variance assumption is violated, least squares estimates may still provide an overall unbiased mean, but the estimate of the mean may be wrong for some covariate settings, or it may be inefficient. Non-normality of residuals will lower power and result in inaccurate compatibility interval coverage. Recall that the best estimate of mean Y when Y has a log-normal distribution is a function of the mean and SD of log(Y).
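The log-normal remark can be made concrete. If log(Y) is normal with mean $\mu$ and SD $\sigma$, the mean of Y is $\exp(\mu + \sigma^2/2)$, so a model-based estimate plugs in the sample mean and variance of log(Y). The simulation below is a sketch with assumed values ($\mu = 0$, $\sigma = 1$, 20,000 draws, arbitrary seed) and compares that estimate with the raw sample mean and with naive back-transformation of the mean log.

```python
import random
from math import exp, log

rng = random.Random(7)
mu, sigma, n = 0.0, 1.0, 20000  # assumed values for the illustration

y = [rng.lognormvariate(mu, sigma) for _ in range(n)]
logs = [log(v) for v in y]
mean_log = sum(logs) / n
var_log = sum((v - mean_log) ** 2 for v in logs) / (n - 1)

true_mean = exp(mu + sigma ** 2 / 2)        # exact log-normal mean
model_based = exp(mean_log + var_log / 2)   # uses mean and SD of log(Y)
naive_backtransform = exp(mean_log)         # estimates the median, not the mean
sample_mean = sum(y) / n

print(true_mean, model_based, sample_mean, naive_backtransform)
```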

Suppose the analyst should have taken log(Y) instead of analyzing Y, and that residuals from log(Y) are normal with constant variance. Suppose further that on the log(Y) scale there is goodness-of-fit for most of the combinations of predictor settings. A linear model fitted on Y will then have wrong predictions for every observation even though the mean of all the predictions will equal the sample mean of Y. Every regression coefficient can be meaningless, and false interactions will be induced by analyzing Y on the wrong scale. So the linear model assumes normality of residuals, equal variance, and properly transformed Y.

Freedom from worrying about how to transform Y is a key reason for using semiparametric (ordinal) regression models, which are Y-transformation invariant.

Having non-normal residuals doesn’t necessarily mean that ordinary least squares estimates of regression coefficients are useless, but they are no longer efficient, and non-normal residuals frequently indicate that the transformation used for Y is inappropriate.

Log-rank Test

The log-rank test is a test for whether two survival distributions are the same, against specific alternatives (types of differences). It makes a fundamental assumption that there are no important covariates when two groups are being compared. Ignoring that for now, to uncover its assumptions we have to study which alternatives to the null hypothesis the test was designed to detect.

The log-rank test was derived as the rank test with optimum efficiency for detecting a simple location shift in two extreme-value type I (Gumbel) distributions, with cumulative distribution function $F(x) = 1 - \exp(-\exp(x))$. This optimum rank test is similar to a Wilcoxon test, but instead of using the standard ranks in the calculation it uses a linear combination of logs of the ranks. A location shift in a Gumbel distribution equates to parallel log-log survival curves, which stated another way means that the two survival curves are connected by $S_2(t) = S_1(t)^{\lambda}$ where $\lambda$ is the group 2 : group 1 hazard ratio. Thus the log-rank test makes the proportional hazards (PH) assumption in order to have full efficiency (optimum power in the homogeneous survival distribution comparison).

Another way to conclude that the log-rank test makes the PH assumption is to know that the log-rank test statistic is exactly the Rao efficient score test that arises from the semiparametric Cox PH model partial likelihood function, when there are no tied failure times. It is difficult to come up with an example where one procedure assumes something and the other doesn’t when the correlation between the results of the two procedures is 1.0 in real data. Finally, since the log-rank test is a special case of the Cox model, it makes all of the assumptions of the Cox model, and more (homogeneity of survival distributions within groups, i.e., there are no risk factors or important baseline covariates). The better likelihood ratio statistic from the Cox model has an extremely high rank correlation with the log-rank over huge varieties of datasets. The log-rank test is asymptotically equivalent to the Cox model likelihood ratio test.

The log-rank test and the Cox model regression coefficient for group always agree on both the presence and the direction of the treatment effect. This is because

the score function is the first derivative of the log-likelihood at $\beta$
the Rao score statistic for testing $H_0: \beta = 0$ has as its numerator the score function at $\beta = 0$
the maximum likelihood estimate $\hat{\beta}$ of the log hazard ratio is zero if and only if the score function is zero at $\beta = 0$ (hazard ratio 1.0), so that the score statistic is also zero
the score statistic is the log-rank statistic and zero on the $\chi^2$ scale is its most null value
the direction of $\hat{\beta}$ is reflected by the score function at $\beta = 0$, and the same thing is reflected in the log-rank statistic, so the direction of the group effect from log-rank and Cox will be identical

Neither the log-rank test nor the Cox model assumes PH under $H_0$ (since PH automatically holds in that case, and the hazard ratio is the constant 1.0) but they both assume PH otherwise, or both will lose power. The only way for the log-rank test to not assume PH is for the Cox model to not assume PH. I would go even further: the two methods are really one method, if there are no covariates and especially if attention is restricted to score tests. Non-PH hurts the log-rank test to the exact same degree as it hurts analysis based on the Cox model. The fact that one doesn’t immediately see a likelihood function for the log-rank test doesn’t mean there is not one lurking in the background.

Kaplan-Meier estimates are nonparametric, only assuming independence of failure time and censoring. But the average difference between the $\log(-\log)$ of two K-M curves is the log hazard ratio.

Whatever one wants to say about the assumptions of the log-rank test, the test assumes PH to exactly the same degree as the Cox model.

Wilcoxon Test

The Wilcoxon-Mann-Whitney two-sample rank-sum test was derived as the optimum linear rank statistic for detecting a simple location shift in two logistic distributions with density functions like $\frac{e^{x}}{(1 + e^{x})^{2}}$. A location shift exists between two logistic distributions when the logits of their cumulative distribution functions are parallel. This is the proportional odds (PO) assumption. Since the Wilcoxon test was designed to have optimum efficiency under a logistic distribution location shift, it has always made the PO assumption.

The proportional odds assumption is an exact analogy to the equal variance assumption in the $t$-test.

To bolster this argument, the Wilcoxon statistic is exactly the numerator of the Rao efficient score test from the PO model when there are only two groups and no covariates. Furthermore, consider the Wilcoxon statistic scaled to be in [0, 1]. This simple linear translation results in the concordance probability, also known as the $c$-index or probability index. Consider a random dataset where one computes the scaled Wilcoxon statistic (concordance probability $c$) and the maximum likelihood estimate (MLE) $\hat{\beta}$ of the regression coefficient for treatment group from a PO ordinal regression model. The MLE of the odds ratio is $e^{\hat{\beta}}$, which I’ll denote $\widehat{\text{OR}}$. As shown here, $\widehat{\text{OR}} = 1$ if and only if $c = \frac{1}{2}$, the Wilcoxon statistic’s most null value. This is because when $\hat{\beta} = 0$ is the MLE, the first derivative of the log-likelihood is zero there, so the score function evaluated at $\beta = 0$ is exactly zero. Since the numerator of the Rao score statistic is the Wilcoxon statistic, centered so that zero is the null value, the exact agreement of $\widehat{\text{OR}} = 1$ and $c = \frac{1}{2}$ follows mathematically.
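The linear rescaling of the Wilcoxon rank-sum statistic to the concordance probability can be sketched as follows. Midranks handle ties, and a brute-force count over all pairs is included as a check; both function names and the toy data are assumptions for the example.

```python
def concordance(a, b):
    """Concordance probability P(B > A) + 0.5 * P(B = A), computed from
    the Wilcoxon rank-sum statistic for group b using midranks."""
    combined = sorted(list(a) + list(b))
    midrank = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Tied values occupy ranks i+1 .. j; assign their average
        midrank[combined[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_b = sum(midrank[v] for v in b)
    na, nb = len(a), len(b)
    return (rank_sum_b - nb * (nb + 1) / 2) / (na * nb)


def concordance_brute_force(a, b):
    """Same quantity by direct comparison of every (a, b) pair."""
    score = sum(1.0 if y > x else 0.5 if y == x else 0.0 for x in a for y in b)
    return score / (len(a) * len(b))


a = [3, 5, 5, 8, 9]
b = [4, 5, 7, 10, 12, 12]
print(concordance(a, b), concordance_brute_force(a, b))
```

A value of 0.5 corresponds to no tendency for either group to have larger values, which is the null value referred to above.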

Furthermore, $c > \frac{1}{2}$ if and only if the estimated OR in the PO model exceeds 1, and $c < \frac{1}{2}$ if and only if $\widehat{\text{OR}} < 1$. So the Wilcoxon test and the PO model always agree on whether or not there is any group effect, and on the direction of the effect. Not only do the two procedures computationally agree on presence and direction of group effects, they agree almost exactly on the estimated effect. The $R^2$ between $\hat{\beta}$ and $\text{logit}(c)$ is 0.996 over a huge variety of datasets with PO and non-PO in play. The mean absolute error in estimating the [0,1]-scaled Wilcoxon statistic from the PO model OR estimate is correspondingly small over those datasets. The Wilcoxon statistic is almost perfectly calculated from the estimated odds ratio using the equation $c = \frac{\widehat{\text{OR}}^{0.66}}{1 + \widehat{\text{OR}}^{0.66}}$.

The only way for the Wilcoxon test not to assume PO is for the PO model to not assume PO. It’s not just that both methods make the PO assumption; the methods are essentially one method if there are no covariates. Non-PO hurts the Wilcoxon test by exactly the same amount that it hurts the PO model.

Random Intercepts Models

Random intercepts (RI) mixed-effects models apply well to clustered data in which elements of a cluster are exchangeable. The compound symmetry assumption of an RI model means an assumption of equal correlation between any two measurements in the same subject is being made. When an individual subject is a cluster, random effects could be used to model rapidly repeated measurements within subject, where elapsed time is not important. Things are different in longitudinal data, for which correlation patterns are almost always such that the correlation between measurements made far apart is less than the correlation between measurements that have a small time gap. This typical serial correlation pattern is in conflict with the symmetric correlation structure assumed by an RI model, and the failure of the RI model to properly fit the correlation structure can invalidate standard errors, p-values, and confidence intervals from such models.

Adding random slopes to RIs makes the model more flexible correlation structure-wise, but this induces a rather strange correlation pattern that is still unlikely to fit the data.

Comparison of Parametric and Semiparametric Model Assumptions

Consider two types of observations, with respective covariate settings of $X_1$ and $X_2$. Let $\beta$ be the regression coefficients, and let $\Delta = (X_2 - X_1)\beta$. For covariate-less two-sample tests ($t$-test, log-rank, Wilcoxon), $X = 1$ for group B and $X = 0$ for group A, and $\Delta$ is the group difference (B - A difference in means, log hazard ratio, log odds ratio, respectively). Let $\Phi$ be the Gaussian (normal) cumulative distribution function, and $\Phi^{-1}$ be its inverse, i.e., the $z$-transformation. Let $F_X(y)$ be the cumulative distribution function for Y conditional on covariate combination $X$. Then the models discussed here make these assumptions:

Linear model and $t$-test: $\Phi^{-1}(F_{X_1}(y))$ and $\Phi^{-1}(F_{X_2}(y))$ are parallel straight lines with vertical separation $\frac{\Delta}{\sigma}$ (parallel lines = equal variances)
Cox PH model and log-rank test: $\log(-\log(1 - F_{X_1}(y)))$ and $\log(-\log(1 - F_{X_2}(y)))$ are parallel curves with vertical separation $\Delta$ (parallel curves = proportional hazards)²
PO model and Wilcoxon test: $\text{logit}(F_{X_1}(y))$ and $\text{logit}(F_{X_2}(y))$ are parallel curves with vertical separation $\Delta$ (parallel curves = proportional odds)

² For the Weibull parametric proportional hazards survival model, parallelism and linearity in $\log(t)$ is assumed.

For the straight line assumption, think of quantile-quantile, i.e., Q-Q plots of observed quantiles vs. theoretical quantiles, where a straight line indicates agreement between the sample and theoretical (assumed) distribution. Note the vast distinction between assuming something is a straight line and assuming something is a curve. The straight line assumption equates to a parametric assumption, i.e., assuming a specific shape of distribution, here Gaussian. The two semiparametric models make no distributional assumption for Y given $X$. All three models make a parallelism assumption.
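The three parallelism statements can be checked numerically. In the sketch below the group difference (0.8), the residual SD (2), the evaluation points, and the baseline distribution used for the two semiparametric cases are all arbitrary choices made for illustration. Under the normal model the probit-transformed distribution functions differ by the mean difference divided by the SD at every point; under PH and PO the log-log and logit transforms differ by the log hazard ratio and log odds ratio, whatever the baseline shape.

```python
from math import exp, log
from statistics import NormalDist

std_normal = NormalDist()
delta, sigma = 0.8, 2.0          # assumed group difference and residual SD
y_grid = [0.5, 1.0, 2.0, 4.0]    # arbitrary evaluation points


def logit(p):
    return log(p / (1 - p))


def loglog(p):
    """log(-log(S)) where S = 1 - p is the survival probability."""
    return log(-log(1 - p))


# Linear model: Y|A ~ N(0, sigma), Y|B ~ N(delta, sigma)
normal_gap = [std_normal.inv_cdf(NormalDist(0, sigma).cdf(y))
              - std_normal.inv_cdf(NormalDist(delta, sigma).cdf(y))
              for y in y_grid]


def cdf_a(y):
    """Arbitrary non-normal baseline CDF for group A (y > 0)."""
    return y * y / (1 + y * y)


def cdf_b_ph(y):
    """Proportional hazards: S_B = S_A ** exp(delta)."""
    return 1 - (1 - cdf_a(y)) ** exp(delta)


def cdf_b_po(y):
    """Proportional odds: odds(Y > y) in B = exp(delta) * odds in A."""
    odds_exceed_b = exp(delta) * (1 - cdf_a(y)) / cdf_a(y)
    return 1 / (1 + odds_exceed_b)


ph_gap = [loglog(cdf_b_ph(y)) - loglog(cdf_a(y)) for y in y_grid]
po_gap = [logit(cdf_a(y)) - logit(cdf_b_po(y)) for y in y_grid]

print(normal_gap)  # each is delta / sigma = 0.4
print(ph_gap)      # each is delta = 0.8
print(po_gap)      # each is delta = 0.8
```

The gaps are constant in all three cases, but only the normal model also requires the transformed curves themselves to be straight lines.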

Both semiparametric models and rank tests are distribution-free, as they don’t depend in any way on the shape of a given group’s distribution to achieve optimum operating characteristics. Both models and “nonparametric” tests make an assumption about the connection between two distributions, e.g., proportional hazards or odds, to the same degree.

Summary

It is important to have a definition in mind for examining whether an assumption is being made. It is also important to note that even though the label nonparametric is frequently used, there are few truly assumption-free statistical procedures. “Nonparametric” tests such as the log-rank, Wilcoxon, and Kruskal-Wallis tests are just special cases of semiparametric regression models, so they make all of the assumptions of the semiparametric models and more. For example, both the log-rank and Wilcoxon tests assume homogeneity of distributions, i.e., absence of important covariates. Semiparametric models easily handle covariates.

Sometimes it is said that nonparametric tests make assumptions only under the null hypothesis while statistical models make assumptions under both the null and any alternative. That this is not the case was discussed above. For example, for the log-rank and Wilcoxon tests to operate optimally (have maximum local power) under the alternative, the alternative must be, respectively, a proportional hazards or a proportional odds situation.

There are major advantages to stopping the practice of using “nonparametric” tests that are special cases of semiparametric models:

There would be less to teach.
Covariate adjustment is readily handled by models.
Semiparametric models have likelihood functions, so they bridge frequentist and Bayesian approaches. Prior information can be used on treatment effects, and shrinkage priors can be used on covariate effects.
Semiparametric models allow one to not only estimate effect ratios, but also to estimate derived quantities such as exceedance probabilities, means, and quantiles. Some examples are:
- For a Cox model one can estimate hazard ratios, survival probabilities, and mean restricted lifetimes.
- For a PO model one can estimate odds ratios, exceedance probabilities, cell probabilities, covariate-specific mean Y (if Y is interval-scaled) and covariate-specific quantiles of Y (if Y is continuous)³.
Semiparametric models are easily extended to multilevel and longitudinal models (with serial correlation structures).
Semiparametric models are easily extended to allow for lower and upper detection limits, interval censoring, and other complexities.

³ The effect measure that is usually associated with the Wilcoxon test is the Hodges-Lehmann estimator, which is the median of all pairwise differences, taking one observation from each group. It is perhaps not as interpretable as the difference in means or medians that one can obtain from the PO model, and a Bayesian PO model provides exact uncertainty intervals for derived quantities such as these.

Reuse

CC BY 4.0

Football Multiplicities

Frank Harrell — Sun, 10 Mar 2024 06:00:00 GMT

Background

Consider the problem of comparing two treatments by doing sequential analyses, avoiding putting too much faith in a fixed sample size design. As shown here the lowest expected sample size will result from looking at the developing data as often as possible in a Bayesian design. The Bayesian approach computes probabilities about unknowns, e.g., the treatment effect, and one can update the current evidence base as often as desired, knowing that the current information has made previous evidence simply obsolete. A stopping rule based on, say, a posterior probability of efficacy exceeding 0.95, will result in perfectly calibrated posterior probabilities at the moment of stopping, as demonstrated in a simple simulation.

On the other hand, sequential frequentist analysis gives data more opportunities to be extreme under the null hypothesis, leading to multiplicity challenges if one wants to preserve the type I assertion probability $\alpha$. The real multiplicity issues in the frequentist approach make traditionally-trained statisticians hesitant to admit that probabilities about parameters are fundamentally different from probabilities about data, and do not involve multiplicities in the sequential testing setting.

The purpose of this article is to demonstrate the vast difference between forwards and backwards probabilities in a simple setting in which the ultimate truth is known.

Data

For many years, the U.S. National Football League (NFL) has provided play-by-play updates to carefully constructed probability estimates. The NFL model is for the probability that the home team will ultimately win the game. The model is based on a large number of variables, and sensibly gives heavy weight to the current score and the time left in the 60-minute American football game. The play-by-play NFL data analyzed in this article comes from the nflreadr R package for the 2008-2023 football seasons¹.

¹ Ho T, Carl S (2024). nflreadr: Download nflverse Data. R package version 1.4.0, https://github.com/nflverse/nflreadr, https://nflreadr.nflverse.com

Overview of Composite Outcome Scales & Statistical Approaches for Analyzing Them

Frank Harrell — Tue, 30 Jan 2024 06:00:00 GMT

How Does a Compound Symmetric Correlation Structure Translate to a Markov Model?

Frank Harrell — Sun, 03 Dec 2023 06:00:00 GMT

Random intercepts in a model induce a compound symmetric correlation structure in which two responses on the same subject have the same correlation no matter the time gap between them. How does one capture such an unnatural structure using a Markov semiparametric model?

Let’s generate some data with repeatedly measured outcome per subject where the outcome is a coarsened continuous response from a linear model, treated as ordinal, and the random effects have a distribution. Generate 5000 subjects having 10 measurements each.
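The compound symmetry induced by random intercepts is easy to see in a small simulation. This sketch is not the article's data-generating code: it assumes normal random intercepts and normal residuals, both with SD 1, keeps the response continuous rather than coarsening it, and uses 2000 subjects to keep the run short. Under those assumptions every pair of time points should have correlation near $\frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2} = 0.5$ regardless of the gap between them.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(2023)
n_subjects, n_times = 2000, 10
sd_intercept, sd_residual = 1.0, 1.0  # assumed values for the illustration

# One row per subject: shared random intercept plus independent noise
y = []
for _ in range(n_subjects):
    u = rng.gauss(0, sd_intercept)
    y.append([u + rng.gauss(0, sd_residual) for _ in range(n_times)])


def corr_at(t1, t2):
    """Correlation across subjects between measurement times t1 and t2."""
    return pearson([row[t1] for row in y], [row[t2] for row in y])


icc = sd_intercept ** 2 / (sd_intercept ** 2 + sd_residual ** 2)
r_adjacent = corr_at(0, 1)   # time gap of 1
r_distant = corr_at(0, 9)    # time gap of 9
print(icc, r_adjacent, r_distant)
```

A serial correlation structure would instead show the second correlation clearly below the first.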

Incorporating Historical Control Data Into an RCT

Frank Harrell — Sat, 04 Nov 2023 05:00:00 GMT

Background

In studies of treatment effects for rare diseases, it is sometimes impossible to recruit enough patients to be able to randomize some of them to control therapy (e.g., standard of care). In other situations, it is difficult to recruit patients when they learn that half of them will not receive the new treatment. In pediatric drug studies, it may be easy to recruit adults into a randomized trial, but too few children may get the disease that is so prevalent in adults. In all these settings it may be desirable to borrow information from non-study patients. There are two major types of information to borrow. For pediatric studies, relative effectiveness of a drug in children can borrow information about relative effectiveness in adults. For other studies, historical data (HD) may be borrowed to provide or bolster the control arm in the RCT. For example, study leaders may wish to use 2:1 experimental : control randomization, but they know that there is a loss of precision and power from such an imbalance. Adding historical control information can make up for that if the HD are reliable and relevant enough.

An area of highly active research in biostatistics is the choice of a prior distribution to properly discount HD for not being concurrent, for selection bias, and for other factors. One may take a posterior distribution from observational data and use discounting when incorporating it into a prior for the new study. For some procedures this amounts to lowering the effective sample size of the HD. This doesn’t directly account for the bias from HD.

An excellent background paper on historical controls is by Schmidli et al. They discuss several ideas not covered here, including the need to adjust for differences in covariate distributions (including by matching on propensity scores). The approach outlined below readily extends to covariate adjustment if the same covariates used in the study are available in the raw HD. Missing values can be handled by adding more models to the joint Bayesian modeling process.

What is presented here is related to the mean squared error (variance plus square of bias) of treatment effects estimated from observational data. A simulated example here shows how to compute the RCT sample size that will overcome the enhanced precision provided by larger observational studies that are biased.
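The mean squared error trade-off can be illustrated with a deliberately simplified calculation, which is not the simulated example referred to above. Assume an estimator whose variance is $\sigma^2/n$, an observational study of size $N$ with fixed bias $b$, and an unbiased RCT of size $n$. The observational MSE is $\sigma^2/N + b^2$, and the RCT matches or beats it once $n \geq \sigma^2 / (\sigma^2/N + b^2)$. The function names and the numbers plugged in are hypothetical.

```python
from math import ceil


def mse_biased(sigma, n_obs, bias):
    """MSE = variance + squared bias for the observational estimate."""
    return sigma ** 2 / n_obs + bias ** 2


def rct_size_to_match(sigma, n_obs, bias):
    """Smallest unbiased-RCT sample size whose MSE (pure variance)
    is no larger than the observational MSE."""
    return ceil(sigma ** 2 / mse_biased(sigma, n_obs, bias))


# Hypothetical numbers: SD 1, 10,000 observational patients, bias of 0.1 SD
print(mse_biased(1.0, 10000, 0.1))         # about 0.0101
print(rct_size_to_match(1.0, 10000, 0.1))  # 100
```

However large the observational study, its MSE cannot fall below $b^2$, so under these assumptions an RCT of about $\sigma^2 / b^2$ observations is always competitive.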

Another Approach

Consider HD used to inform outcome tendencies in control patients. Rather than lowering the effective sample size of the HD, think about respecting the HD sample size but recognizing that the HD is estimating a different quantity than the actual prospective RCT control group outcome tendency. This different quantity is the true unknown control group performance plus the bias in the HD. This approach was proposed by Stuart Pocock in a frequentist context.

Björn Holzhauer developed related Bayesian approaches for borrowing exponential hazard rates from HD.

Now couple that idea with another one. In trying to specify a posterior distribution from an HD analysis to be fed into a prior for the new study, the same goal may be accomplished by noting that Bayesian MCMC simulation procedures (to obtain posterior samples of the parameters of interest) allow multiple models to be processed simultaneously. This is especially the case with Stan. The information from HD can be incorporated into the new analysis by actually specifying a model for the HD and bringing the HD raw data into the analysis of the new trial data. Next, when forming the second model (for HD), explicitly model the possibility that HD are estimating a different estimand than the RCT. When the HD being used consist solely of control patient data, this different estimand is the true unknown outcome measure for randomized concurrent control patients, plus an unknown bias.

There is a subtle and important ramification of the approach of merging multiple raw data streams and distinguishing the data source by having multiple models being fitted simultaneously. Most applications of data borrowing oversimplify the data being borrowed. For example, researchers often use summary statistics from historical data including standard errors (SEs). SEs are only estimates of precision of parameter estimates, and treating the SEs as if they are calculated without error is ignoring important uncertainty, giving false confidence in the historical data. By including raw HD in the RCT analysis, all uncertainties are accounted for, making final Bayesian uncertainty intervals properly wider to avoid false confidence. On a related note, when there is important within-treatment outcome heterogeneity necessitating the use of covariate adjustment in both non-randomized and randomized data, researchers very frequently make the mistake of using marginal treatment effect summaries of the HD. In nonlinear models (e.g., Cox and logistic models) estimated treatment effects are shrunken towards the null by the unexplained outcome heterogeneity (caused by wide covariate distributions) that could have been easily explained through direct covariate adjustment¹. Incorporation of raw HD allows for proper covariate adjustment. The same covariate adjustment also levels the playing field with regard to differences in covariate distributions between RCT and HD. Instead of discarding data using propensity matching of RCT to HD to account for, say, patients in the HD being older, age can be directly adjusted for in a much more efficient exclude-no-data approach.

¹ Suppose that sex is an important prognostic factor. Ignoring sex when comparing treatments may make a within-treatment outcome distribution become a mixture of two distributions and even be bimodal, and make a treatment that operates in proportional hazards for both males and females separately not act in proportional hazards when indiscriminately pooling males and females.

For the examples that follow, we assume that no data are available for estimating the bias, and that bias is described by a prior distribution that is Gaussian with a mean bias of zero and with a standard deviation of sigma. When sigma is zero, the bias is known to be exactly zero, and the HD are pooled with any study control data, with full weight given to HD. When sigma is infinite, we know nothing at all about bias, and HD are non-informative and effectively ignored.
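The two limiting cases of sigma can be checked with a small closed-form calculation. The Python sketch below is not the author's code; it assumes a simplified setting in which outcomes are Gaussian with a known standard deviation of 1, the control mean has a flat prior, and the bias has a Gaussian prior with mean zero and standard deviation sigma. Under those assumptions the historical mean estimates the control mean with variance 1/n_h + sigma², so n_h historical observations are worth n_h / (1 + n_h sigma²) concurrent controls. The function names and the example numbers are hypothetical.

```python
import math


def hd_effective_n(n_h, sigma):
    """Concurrent-control equivalents contributed by n_h historical observations.

    Assumes outcome SD = 1 and bias ~ Normal(0, sigma^2); the historical mean
    then has variance 1/n_h + sigma^2 as an estimate of the control mean.
    """
    if math.isinf(sigma):
        return 0.0  # nothing known about bias: HD carry no information
    return n_h / (1.0 + n_h * sigma ** 2)


def posterior_mua(ybar_c, n_c, ybar_h, n_h, sigma):
    """Posterior mean and SD of the control mean under a flat prior on it."""
    w_h = hd_effective_n(n_h, sigma)
    precision = n_c + w_h  # RCT controls count fully, HD are discounted
    mean = (n_c * ybar_c + w_h * ybar_h) / precision
    return mean, 1.0 / math.sqrt(precision)


# Hypothetical numbers: 50 concurrent controls, 500 historical controls
for sigma in (0.0, 0.1, 0.25, 1.0, math.inf):
    mean, sd = posterior_mua(ybar_c=10.0, n_c=50, ybar_h=10.5, n_h=500, sigma=sigma)
    print(f"sigma={sigma}: effective HD n={hd_effective_n(500, sigma):.1f}, "
          f"posterior mean={mean:.3f}, posterior SD={sd:.3f}")
```

In this simplified setting, sigma = 0 reproduces simple pooling, an infinite sigma reproduces the RCT-only analysis, and for any sigma > 0 the HD can contribute fewer than 1/sigma² patient equivalents no matter how large the historical sample is.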

Example: Augmenting a Control Arm with HD

Let’s see how this plays out with several choices of sigma. First let’s get access to the needed R packages including cmdstanr, the R interface to CmdStan (the command-line interface to Stan), and specify the Stan modeling code we need to get the posterior draws from our two-model system. The key idea in the Stan code is the model for the HD yh, which is yh ~ normal(mua + bias, 1.0). The historical data for treatment A are connected to the RCT data by the yh model having the parameter mua in common with the RCT control arm. The two are also disconnected by having the bias parameter added to the mean model for yh. So mua is estimated by combining study data and HD after de-biasing HD.
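To make the structure of the two-model system concrete, here is a minimal Python sketch of a joint log posterior density built around the statement yh ~ normal(mua + bias, 1.0). It is not the author's Stan program. Only the yh model comes from the text; the RCT arm models (ya and yb Gaussian with means mua and mub and a standard deviation of 1), the flat priors on mua and mub, the name mub, and the toy data are assumptions made for illustration, and sigma must be positive here.

```python
import math


def normal_lpdf(y, mu, sd):
    """Log density of Normal(mu, sd) evaluated at y."""
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * ((y - mu) / sd) ** 2


def log_posterior(mua, mub, bias, ya, yb, yh, sigma):
    """Joint log posterior (up to a constant) for the assumed two-model system."""
    lp = normal_lpdf(bias, 0.0, sigma)                       # bias ~ normal(0, sigma)
    lp += sum(normal_lpdf(y, mua, 1.0) for y in ya)          # RCT control arm (assumed model)
    lp += sum(normal_lpdf(y, mub, 1.0) for y in yb)          # RCT active arm (assumed model)
    lp += sum(normal_lpdf(y, mua + bias, 1.0) for y in yh)   # HD share mua, shifted by bias
    return lp                                                # flat priors on mua, mub add nothing


# Hypothetical toy data: the HD run about 0.5 higher than the concurrent controls
ya = [9.6, 10.1, 10.4]
yb = [11.0, 11.3, 10.8]
yh = [10.4, 10.6, 10.9, 10.5]

for sigma in (0.05, 2.0):
    lp_no_bias = log_posterior(10.0, 11.0, 0.0, ya, yb, yh, sigma)
    lp_bias = log_posterior(10.0, 11.0, 0.5, ya, yb, yh, sigma)
    print(f"sigma={sigma}: log posterior at bias=0 is {lp_no_bias:.2f}, "
          f"at bias=0.5 is {lp_bias:.2f}")
```

With a tight prior on bias the density favors bias = 0, so the HD pull mua toward their own mean. With a loose prior the density favors a nonzero bias, which absorbs the discrepancy and leaves mua to be determined mostly by the RCT controls.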


Wedding Bayesian and Frequentist Designs Created a Mess

Frank Harrell — Tue, 22 Aug 2023 05:00:00 GMT

Background

Medical Setting

Severe asthma that cannot be managed by noninvasive pharmacologic intervention is a serious quality of life issue for patients. Bronchial thermoplasty is an invasive treatment for such patients. It involves inserting a bronchoscope equipped with a device employing radio-frequency ablation to destroy some of the smooth muscle in the airway to allow the patient to breathe more freely. In order to run a rigorous randomized clinical trial (RCT) to unbiasedly determine the clinical effectiveness of thermoplasty it is necessary to randomize some patients to a sham procedure in which a bronchoscope is inserted without performing an intervention, but is manipulated in ways that are almost identical to a true ablation, and stays inserted the same amount of time.

Asthmatx, Inc. created the Alair bronchial thermoplasty system, now owned by Boston Scientific. Asthmatx bravely funded a rigorous pivotal RCT to evaluate Alair, with 2:1 active:control randomization and a true sham control. The trial enrolled 297 patients who remained symptomatic after conventional high dose inhaled corticosteroids. The study protocol is summarized here. The primary study outcome was the Asthma Quality of Life Questionnaire score, assessed at 6w, 3m, 6m, 9m, 12m, with a primary estimand being the between-treatment difference in average scores over the last three follow-ups.

Frequentist vs. Bayesian Approach in a Nutshell

The Bayesian approach to treatment comparisons quantifies evidence for an effect being in any given interval, through the use of posterior probabilities. The probabilities are conditional upon cumulative data collected in the trial to date, and on a prior distribution and data model. It is important to understand that posterior probabilities such as P(effect > 0 | data), the probability of any efficacy, pertain to what was observed (the data) and have nothing to do with what might have happened. This is in stark contrast to classical frequentist statistics, in which assertions of efficacy rest on “the degree to which the data are embarrassed by the null hypothesis” (Maxwell), based on the unlikeliness of getting data more extreme than the observed data (p-value) under the supposition of no effect. In the frequentist paradigm one typically asserts efficacy if the p-value is below some preset α. The type I assertion probability α is an operating characteristic that is the probability one will assert an effect at any look at the data over the life of the study, if the treatment truly has no effect. Controlling α means limiting this probability. In a Bayesian design, α can be simulated based on the schedule for intended data looks. More data looks mean more opportunities for extreme data, which means higher α.
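A rough Python sketch of such a simulation follows. It is an illustration under simplifying assumptions, not the method used in the trial discussed here: the data are Gaussian with known variance, the prior for the effect is flat (so the posterior probability of efficacy at a look equals the standard normal CDF of the z-statistic), and the treatment truly has no effect. The function name, the equally spaced look schedules, and the fixed cutoff of 0.95 are hypothetical choices; the type I assertion probability is written α.

```python
import math
import random
from statistics import NormalDist


def type1_assertion_prob(look_fractions, cutoff, n_sim=20000, seed=1):
    """Simulated chance of ever asserting efficacy when the true effect is zero.

    look_fractions: increasing information fractions of the looks, ending at 1.0.
    cutoff: posterior probability of efficacy required to assert an effect.
    """
    rng = random.Random(seed)
    phi = NormalDist().cdf
    hits = 0
    for _ in range(n_sim):
        score, t_prev = 0.0, 0.0
        for t in look_fractions:
            # Under no effect the score process has independent Gaussian increments
            score += rng.gauss(0.0, math.sqrt(t - t_prev))
            t_prev = t
            z = score / math.sqrt(t)
            # Flat prior + known variance: P(effect > 0 | data so far) = Phi(z)
            if phi(z) > cutoff:
                hits += 1
                break
    return hits / n_sim


for n_looks in (1, 2, 5, 10):
    fractions = [(i + 1) / n_looks for i in range(n_looks)]
    alpha = type1_assertion_prob(fractions, cutoff=0.95)
    print(f"{n_looks} look(s): simulated alpha = {alpha:.3f}")
```

Under these assumptions a single look with cutoff c has α = 1 - c exactly, and holding the cutoff fixed while adding looks can only increase the simulated α, which is why a design constrained to a given α has to raise its posterior probability cutoffs when looks are added.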

Bayesian probabilities represent the current state of knowledge about effects, based on current data. When another data look happens, the previously computed posterior probability is merely obsolete and is ignored. There are no multiplicities with Bayes in sequential testing. One can only cheat with Bayes in the sequential setting by changing the prior after looking at the data or by attempting to reverse the flow of time and information. For example, if the first look yielded a probability of efficacy of 0.94 and the second look 0.89, it would be cheating (by failing to properly condition on all information) to revert back to the 0.94 and declare the study over.

Adjusting for frequentist multiplicities entails discounting observed results. Bayes discounts evidence about a specific effect by incorporating a skeptical prior for that effect. It does not discount one effect because you looked at another effect as does frequentist inference.

An exception occurs in hierarchical models when a large number of effects are connected through a common variance.

So in what follows keep in mind that Bayes deals with “what happened” and frequentist hypothesis testing deals also with “what might have happened.” α is a pre-study quantity that involves a thought experiment in which one looks at various values of test statistics that may occur over time after the real study starts. Computation of α nowhere uses any study result.

Statistical Plan

Berry Consultants wrote the statistical study design and analysis plan. On Berry’s recommendation I was hired as a consultant by Becker & Associates Consulting (later acquired by NSF International). I was charged with composing difficult questions for Asthmatx to help them prepare for an FDA advisory committee meeting.

What is reported in this article is publicly available.

The Bayesian RCT design used a flat prior for the treatment effect. The plan pre-specified one interim efficacy analysis and the final analysis. It specified a posterior probability cutoff of 0.99 for declaring efficacy at the interim look, and a cutoff of 0.964 for declaring efficacy at the final look. These cutoffs were chosen so that the overall type I assertion probability α for this plan is 0.05.

In retrospect the flat prior does not seem reasonable, as it puts the same probabilities on a huge benefit, huge harm, and no effect. My personal opinion is that if one did want to control α (which is irrelevant when computing forward-information-flow probabilities), it should be controlled either by using a skeptical prior or by requiring the amount of efficacy to be non-trivial, not just greater than zero. One implication of that alternate design is that one would not modify the posterior probability cutoff for the interim look; the interim look would automatically be more conservative because of the prior or a nonzero efficacy cutoff.

Result

Designers of the clinical trial thought that patients would be reluctant to enroll in a study where they had a chance of having an uncomfortable sham bronchoscope inserted with no possibility of benefitting. To their great surprise, desperate refractory asthma patients enrolled quickly and in large numbers. Enrollment was so successful that there simply wasn’t time to do the interim analysis.

What does this have to do with the marriage between Bayesian and frequentist design in this study? A lot. The final posterior probability of efficacy was 0.96, below the success target of 0.964. So the study failed to meet its primary endpoint. This happened because (1) an arbitrary threshold was used in the first place, taking the decision away from regulatory decision makers, (2) the threshold for the posterior probability was changed to preserve α, and (3) the threshold was further penalized to allow for one additional look. And the interim look never happened.

Fortunately for Asthmatx, Alair demonstrated a very large reduction in emergency room visits for patients. The advisory committee overrode the negative primary endpoint by voting 6-1 in favor of approval on 2009-10-28, with conditions.

Summary

The study described was “negative” solely because of a data look that never happened. α considers intentions to analyze. α has nothing to do with the chance of making a decision error, and when Bayesians are forced to incorporate α into their statistical plans, needless complexity arises. This case of an actual clinical trial using a hybrid approach demonstrates the illogic of considering what might have happened in a Bayesian probability calculation. The efficacy target was missed because of a planned interim analysis that never occurred.

Posterior probabilities stand perfectly well on their own, with the only logical argument being about the choice of the prior distribution.

Mixing a Bayesian probability that an assertion is true and a frequentist probability of making an assertion if you magically knew the treatment to be ignorable is like mixing apples and coconuts, and the frequentist nut is hard to crack. It can’t translate a probability about data assuming an assertion is correct into a probability that the assertion is correct.

Tying interpretation of a Bayesian procedure to α has another subtle, serious implication: “Preserving α” requires selection of a sample size so that α spending can be defined. This hampers the ability of a sponsor to extend a promising study to obtain definitive results. For example, if at the planned final sample size the probability of efficacy was 0.93, the sponsor should be able to decide to spend resources to obtain more evidence. After all, sample size calculations are quite arbitrary and make many assumptions that turn out to be false. The sponsor would have to withstand the real possibility that the cumulative evidence will actually be less impressive after the study extension. Planning studies around α limits flexibility and offers no added value regarding interpretation.

Frequentist unblinded sample size re-estimation procedures exist. But because they spend additional α, they in effect require the already-collected data to be discounted, the logic of which eludes Bayesian thinking. It is possible for a frequentist study extension to actually result in a lower effective sample size because of the α-spending penalty.
