require(Hmisc)
# Load the test package for the new lrm.fit which also creates
# lrm.fit.old
require(tlrm)
# Define a function that will time a series of lines of code, repeating each
# reps times (default is 10). When only one line of code is given, the
# elapsed execution time is printed and the result of the line is returned.
# Otherwise, a vector of run times corresponding to all the lines of code is
# returned, and the values returned from the lines are stored in the global
# environment in object Res, with components named according to inputs to tim.
# When reps > 1 the values from running code lines are from the last rep.
tim <- function(...) {
  reps <- 10
  w    <- sys.call()
  n    <- names(w)
  m    <- length(n)
  k    <- 2 : m
  if('reps' %in% n) {
    i    <- which(n == 'reps')
    reps <- eval(w[[i]])
    k    <- setdiff(k, i)
  }
  n   <- n[k]
  m   <- length(n)
  r   <- numeric(m)
  Res <<- vector('list', m)   # put in global environment
  names(r)   <- n
  names(Res) <<- n
  l <- 0
  for(i in k) {
    l <- l + 1
    s <- system.time(for(j in 1 : reps)
           R <- eval(w[[i]], parent.frame()))['elapsed'] / reps
    r[l]     <- s
    Res[[l]] <<- R
  }
  label(r) <- paste('Per-run execution time in seconds, averaged over', reps, 'runs')
  if(m == 1) {
    print(r)
    return(R)
  }
  r
}
m      <- function(x) max(abs(x))
mad    <- function(a, b) c(mad    = mean(abs(a - b)),
                           relmad = 2 * mean(abs(a - b) / (abs(a) + abs(b))))
wratio <- function(r) exp(max(abs(log(r))))   # worst ratio, whether < or > 1.0
# Function to summarize a series of model fits stored in Res
smod <- function() {
  max_abs_u <- sapply(Res, function(x) if(length(x$u))    m(x$u)          else NA)
  iter      <- sapply(Res, function(x) if(length(x$iter)) tail(x$iter, 1) else NA)
  deviance  <- sapply(Res, function(x) tail(x$deviance, 1))
  print(data.frame(deviance, max_abs_u, iter))
  l <- length(Res)
  n <- names(Res)
  if(l > 1) {
    for(i in 1 : l) {
      r <- Res[[i]]
      # See if polr or BFGS
      a <- inherits(r, 'polr') || (length(r$opt_method) && r$opt_method == 'BFGS')
      r$var <- if(inherits(r, 'orm'))
        vcov(r, intercepts='all') else if(! a) vcov(r)
      if(inherits(r, 'polr'))
        r$coefficients <- c(-r$zeta, coef(r))
      Res[[i]] <- r
    }
    cat('\nMaximum |difference in coefficients|,',
        'Maximum |relative difference|\n',
        'worst ratio of covariance matrices\n\n')
    d <- NULL
    for(i in 1 : (l-1))
      for(j in (i+1) : l) {
        ri <- Res[[i]]; rj <- Res[[j]]
        comp <- paste(n[i], 'vs.', n[j])
        co   <- mad(ri$coefficients, rj$coefficients)
        w    <- data.frame(Comparison         = comp,
                           `Max |difference|` = co[1],
                           `Max |rel diff|`   = co[2],
                           `Cov ratio`        = NA, check.names=FALSE)
        if(length(ri$var) && length(rj$var))
          w$`Cov ratio` <- wratio(ri$var / rj$var)
        d <- rbind(d, w)
      }
  }
  d$`Cov ratio` <- ifelse(is.na(d$`Cov ratio`), '', format(d$`Cov ratio`))
  rownames(d) <- NULL
  d
}
Statistical Computing Approaches to Maximum Likelihood Estimation
In re-writing the rms package logistic regression function lrm I explored several statistical computing issues related to MLE optimization. Comparisons of execution time in R vs. Fortran are given. Different coding styles in both R and Fortran are also explored. Hopefully some of these explorations will help others who may not have studied MLE optimization and related statistical computing algorithms.
Overview
Maximum likelihood estimation (MLE) is a gold standard estimation procedure in non-Bayesian statistics, and the likelihood function is central to Bayesian statistics (even though it is not maximized in the Bayesian paradigm). MLE may be unpenalized (the standard approach), or penalty functions such as L1 (lasso; absolute value penalty) and L2 (ridge regression; quadratic penalty) may be added to the log-likelihood to achieve shrinkage (aka regularization). I have been doing MLE my entire career, using mainly a Newton-Raphson algorithm with step-halving to achieve rapid convergence. I never stopped to think about other optimization algorithms, and the R language has a good many excellent algorithms included in the base stats package. There is also an excellent maxLik R package devoted to MLE (not used here), and two of its vignettes provide excellent introductions to MLE. General MLE background and methods including more about penalization may be found here, which includes details about how to do QR factorization and to back-transform after the fit.
In this article I’ll explore several optimization strategies and variations of them, in the context of binary and ordinal logistic regression. The programming scheme used here is the one used by many R packages: write functions to compute the scalar objective function (here, -2 log likelihood), the gradient vector (vector of first derivatives), and the Hessian matrix (matrix of second derivatives), and pass those functions to general optimization functions. Convergence (at least to a local minimum of -2 LL, where LL is the log of the likelihood function) is achieved for smooth likelihood functions when the gradient vector values are all within a small tolerance (say \(10^{-7}\)) of zero or when the -2 LL objective function completely settles down in, say, the \(8^\text{th}\) significant digit. The gradient vector is all zeros if the parameter values are exactly the MLEs (a local minimum of -2 LL is achieved).
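To make the scheme concrete, here is a minimal sketch (my own illustration, not the rms code) for binary logistic regression in which the -2 log likelihood and its gradient are written as plain R functions and passed to the general optimizer stats::nlminb, with glm used only as a reference fit.
# Minimal sketch of the objective/gradient/optimizer scheme (illustration only)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))
X <- cbind(1, x)                                   # intercept + one predictor
dev_fn  <- function(theta)                         # -2 log likelihood
  -2 * sum(y * plogis(X %*% theta, log=TRUE) + (1 - y) * plogis(- X %*% theta, log=TRUE))
grad_fn <- function(theta)                         # gradient of -2 LL
  -2 * colSums(X * as.vector(y - plogis(X %*% theta)))
fit <- nlminb(c(0, 0), dev_fn, gradient=grad_fn)
rbind(nlminb = fit$par, glm = coef(glm(y ~ x, family=binomial())))   # should agree closely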
History
The R rms package lrm function is dedicated to maximum likelihood estimation (MLE) for fitting binary and ordinal (proportional odds) logistic regression models using the logit link, with or without quadratic (ridge) penalization. For ordinal models, lrm is efficient for up to 400 distinct Y-values (399 intercepts) in the sense that execution time is under 10 seconds for 10,000 observations on 10 predictors. lrm uses the R function lrm.fit for its heavy lifting. Much of lrm.fit was written in 1980 and served as the computational engine for the first SAS procedure for logistic regression, PROC LOGIST. It used Fortran 77 for computationally intensive work, and used only a Newton-Raphson algorithm with step-halving for iterative MLE. On rare occasions when serious collinearities were present, such as when multiple continuous variables were fitted using restricted cubic splines (which use a truncated power basis), lrm.fit would fail to converge.
The rms orm function is efficient for more than 8,000 intercepts. It runs in 0.45s for 400 intercepts and 10.5s for 8,000, and provides the sparse covariance matrix.
Re-Write of lrm.fit
To modernize the Fortran code, better condition the design matrix \(X_{n\times p}\) (\(n\) observations and \(p\) columns), and to explore a variety of optimization algorithms, I did a complete re-write of the rms package lrm and lrm.fit functions in November 2024, for rms version 6.9-0. To reduce optimization divergence when there are extreme collinearities, and to better scale \(X\), I was interested in implementing mean centering followed by a QR factorization of \(X\) to orthogonalize its columns. Details about how this is done and how the parameter estimates and covariance matrix are converted back to the original \(X\) space may be found here. QR can be turned on by setting the lrm.fit argument transx to TRUE. When QR is in play, the rotated columns of \(X\) are scaled to have standard deviation 1.0.
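Here is a small sketch of the idea (my own illustration under simplifying assumptions, not the rms internals): mean-center \(X\), orthogonalize it with QR, scale the rotated columns to standard deviation 1, fit in that space, and map the estimated slopes back to the original \(X\) space.
# Sketch of mean centering + QR orthogonalization with back-transformation of slopes
set.seed(1)
n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n))
X[, 2] <- X[, 1] + 0.01 * X[, 2]                  # make the columns nearly collinear
y <- rbinom(n, 1, plogis(X[, 1]))
Xc  <- scale(X, center=TRUE, scale=FALSE)         # mean centering
qrx <- qr(Xc)
Q   <- qr.Q(qrx)
R   <- qr.R(qrx)
s   <- apply(Q, 2, sd)
Z   <- sweep(Q, 2, s, '/')                        # orthogonal columns scaled to SD 1
bz   <- coef(glm(y ~ Z, family=binomial()))[-1]   # slopes in the rotated space
beta <- backsolve(R, bz / s)                      # back-transform: beta = R^{-1} diag(1/s) bz
cbind(back_transformed = beta,
      original_space   = coef(glm(y ~ X, family=binomial()))[-1])   # slopes agree closely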
rms 6.9-0 will be on CRAN around 2024-12-10. It requires an update to the Hmisc package which is already on CRAN.
Besides dealing with fundamental statistical computing issues, I changed lrm.fit to use Therneau’s survival package concordancefit function to compute concordance probabilities used by various rank correlation indexes such as Somers’ \(D_{xy}\). This got rid of a good deal of code. Previously, lrm.fit binned predicted probabilities into 501 bins to calculate rank measures almost instantly. But I decided it was time to use exact calculations now that concordancefit is so fast. Though not used in rms, concordancefit also computes accurate standard errors.
A new Fortran 2018 subroutine lrmll
was written to efficiently calculate the -2 log-likelihood function (the deviance), the gradient vector, and the Hessian matrix of all second partial derivatives of the log-likelihood with respect to intercept(s) \(\alpha\) and regression coefficients (slopes) \(\beta\).
lrm.fit now implements several optimization algorithms. When Y is binary and there is no penalization, it has the option to use glm.fit(..., family=binomial()), which runs iteratively reweighted least squares, a fast-converging algorithm for binary logistic regression (but does not extend to ordinal regression). lrm.fit implements 8 optimization algorithms.
- optional for proportional odds models (according to the lrm.fit initglm argument): does a first pass with glm.fit that fits a binary logistic regression for the probability that Y is greater than or equal to the median Y; this fit can then be used for starting values after shifting the default starting intercept values so that the middle intercept matches the intercept from the binary fit
- nlminb: a function in the stats package, tied with NR as the fastest algorithm in general; uses Hessians. This uses Fortran routines in the Bell Labs port library for which the original paper may be found here.
- NR: Newton-Raphson iteration with step-halving, implemented here as R function newtonr. This algorithm is the default because it has the advantage of having full control over convergence criteria. It requires convergence with respect to 3 simultaneous criteria: changes in -2 LL, changes in parameter estimates, and nearness of the gradient to zero. The user can relax any of the 3 criteria thresholds to relax conditions. This is the optimization method used in the old lrm.fit function, but written in Fortran there for slightly increased speed. Defaults for tolerance parameters are such that eps (iteration-to-iteration change in -2 LL) will usually dictate when convergence is called.
- glm.fit: for binary Y without penalization only
- nlm: the stats function that is usually recommended for maximum likelihood, but I found it is slower than nlminb without offering other advantages
- BFGS and L-BFGS-B using the stats optim function: fast general-purpose algorithms that do not require the Hessian, so these can be used with an unlimited number of intercepts as long as the user sets the lrm.fit parameter compvar to FALSE so that the Hessian is not calculated once after convergence
- CG and Nelder-Mead: see optim
- the MASS package polr function
The last four methods do not involve computing the Hessian, which is the most computationally intensive calculation in MLE. But they are still slower overall if you want to get absolute convergence, due to requiring many more evaluations of the objective function (-2 LL).
See this for useful comparisons of some of the algorithms.
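As a usage sketch of these choices (my own example, assuming the new lrm.fit loaded above via the tlrm test package, or from rms 6.9-0), one might select an optimizer and suppress the Hessian-based covariance matrix for a model with many intercepts like this:
# Hedged usage sketch of opt_method, compvar, and compstats (illustration only)
set.seed(1)
x <- rnorm(300)
y <- sample(0 : 50, 300, TRUE)                       # ordinal Y with about 50 intercepts
f1 <- lrm.fit(x, y, opt_method='nlminb', compstats=FALSE)
f2 <- lrm.fit(x, y, opt_method='BFGS', compvar=FALSE, compstats=FALSE, maxit=1000)
c(nlminb = tail(f1$deviance, 1), BFGS = tail(f2$deviance, 1))   # deviances nearly agree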
Background: Convergence
The two most commonly used convergence criteria for MLE are
- relative convergence: stop iterations when the change in deviance is small or when the relative change in parameter estimates is small
- absolute convergence: stop iterations when the first derivative of the log likelihood is small in absolute value for all parameters, or there is a very small absolute change in parameter values
Absolute convergence with respect to the first derivatives (gradients) is similar to demanding that none of the regression parameters change very much since the last iteration. From the standpoint of what matters statistically, convergence of the deviance (on which the gold-standard likelihood ratio test is based) is sufficient. Changing parameter values when the deviance does not change in the \(6^\text{th}\) decimal place will be buried in the noise. Absolute convergence may affect \(\hat{\beta}\), but relative convergence tends to result in a very stable \(\frac{\hat{\beta}}{\text{s.e.}}\). You might deem convergence satisfactory when parameter estimates between successive iterations change by less than 0.05 standard errors (which would require evaluation of the Hessian to know); approximately, this corresponds to relative convergence judged by the deviance.
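To make these criteria concrete, here is a toy Newton-Raphson loop with step-halving for binary logistic regression (a sketch of the idea only, not the newtonr function used by lrm.fit), which stops only when the change in -2 LL, the change in the parameters, and the gradient are all small:
# Toy Newton-Raphson with step-halving; illustration only (not rms::newtonr)
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(1 + x))
X <- cbind(1, x)
dev2 <- function(b) -2 * sum(y * plogis(X %*% b, log=TRUE) +
                             (1 - y) * plogis(- X %*% b, log=TRUE))
grd  <- function(b) -2 * colSums(X * as.vector(y - plogis(X %*% b)))
hes  <- function(b) { p <- as.vector(plogis(X %*% b)); 2 * crossprod(X * (p * (1 - p)), X) }
b <- c(0, 0); d <- dev2(b)
for(iter in 1 : 25) {
  step <- solve(hes(b), grd(b))                 # full Newton step for minimizing -2 LL
  repeat {                                      # halve the step until -2 LL does not increase
    bnew <- b - step
    dnew <- dev2(bnew)
    if(dnew <= d + 1e-12) break
    step <- step / 2
  }
  conv <- abs(d - dnew) < 1e-10      &&         # change in -2 LL
          max(abs(bnew - b)) < 1e-10 &&         # change in parameter estimates
          max(abs(grd(bnew))) < 1e-7            # gradient near zero
  b <- bnew; d <- dnew
  if(conv) break
}
rbind(NR = b, glm = coef(glm(y ~ x, family=binomial())))   # agree to high precision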
Despite achieving statistically relevant convergence more easily, there is a real issue of reproducibility. Different algorithms and software may result in semi-meaningful differences in \(\hat{\beta}\) taken in isolation (ignoring how much of \(\hat{\beta}\) is noise). Only by having all the implementations achieve absolute convergence will different analysts be very likely to reproduce each others’ work. If this is important to you, either avoid the BFGS algorithms (for which the R optim
function does not have an absolute convergence criterion) or use a highly stringent relative convergence criterion, e.g., specify the lrm.fit
argument reltol
as \(10^{-11}\). Below I explore how convergence and execution time are affected by reltol
.
Overview of Findings
opt_method='NR' and 'nlminb' are the fastest methods, even slightly faster than glm.fit.
Limited tests of transx in lrm.fit, which turns on QR factorization, do not show much benefit, but see below for details.
For ordinal Y, using opt_method='BFGS' with compvar=FALSE, compstats=FALSE mimics the polr function in the MASS package. lrm.fit is much faster than polr because it computes the deviance and derivatives in highly efficient compiled Fortran, as long as the number of intercepts is no more than a few hundred.
Setting initglm=TRUE
tells lrm.fit
to get initial ordinal model parameter values from a binary logistic glm.fit
run when cutting Y at the median. This does not seem to offer much benefit over setting starting values to covariate-less MLEs of \(\alpha\) (which are calculated instantly when \(\beta=0\)) and setting \(\beta=0\). In one example using nlminb
the algorithm actually diverged with initglm=TRUE
but ran fine without it. This is probably due to large intercept values with many distinct Y values.
Extensive tests of opt_method='BFGS'
show good stochastic performance (convergence to what matters, deviance-wise), but unimpressive execution time if you want absolute convergence by setting reltol
to a small number.
The best overall algorithm that uses the Hessian is NR
in terms of speed and convergence, with nlminb
a close second. NR
is the method used in the old lrm.fit
, so for most datasets, the new optimization options are not needed.
Validation
Two kinds of validations appear below.
- Validation of the Fortran-calculated deviance and derivatives: for a given small dataset, get parameter estimates from rms::orm with tight convergence criteria, and test the Fortran code by evaluating the deviance and derivatives at these parameter values. The gradient (first derivative) should be very close to zero, and the deviance should be identical to that from orm. The inverse of the negative of the Hessian matrix should equal the variance-covariance matrix computed by orm.
- Validate lrm.fit overall by letting it pick its usual starting values for iteration, and compare its output to that from orm and other fitting functions.
Check -2 Log Likelihood and Derivatives for a Simple Model
Binary Y
x  <- 1 : 10
y  <- c(0, 1, 0, 0, 0, 1, 0, 1, 1, 1)

xx <- matrix(as.double(x), ncol=1)
# From orm. Deviance = 13.86294, 10.86673
alpha <- -2.4412879506377 ; beta <- 0.4438705364796
w <- .Fortran('lrmll', 10L, 1L, 1L, xx, as.integer(y), numeric(10), rep(1e0, 10), 0e0,
              alpha, beta, logL=numeric(1), grad=numeric(2), numeric(0), 2L, 0L, 1L)
w$logL
[1] 10.86673
w$grad
[1] -1.814104e-13 -1.156408e-12
w <- .Fortran('lrmll', 10L, 1L, 1L, xx, as.integer(y), numeric(10), rep(1e0, 10), 0e0,
              alpha, beta, logL=numeric(1), grad=numeric(2), hess=matrix(0e0, nrow=2, ncol=2), 3L, 0L, 1L)
w$hess
[,1] [,2]
[1,] -1.813852 -9.976185
[2,] -9.976185 -66.123262
- solve(vcov(rms::lrm(y ~ x, eps=1e-10)))
Intercept x
Intercept -1.813852 -9.976185
x -9.976185 -66.123262
Res <- list(glm         = glm(y ~ x, family=binomial(), control=list(epsilon=1e-12)),
            lrm.fit     = lrm.fit(xx, y, reltol=1e-12),
            old.lrm.fit = lrm.fit.old(xx, y, eps=1e-10),
            orm         = rms::orm(y ~ x, eps=1e-10) )
smod()
deviance max_abs_u iter
glm 10.86673 NA 4
lrm.fit 10.86673 3.552714e-15 7
old.lrm.fit 10.86673 2.220446e-15 NA
orm 10.86673 NA NA
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 glm vs. lrm.fit 3.358425e-15 2.245424e-15 1
2 glm vs. old.lrm.fit 5.551115e-16 4.320309e-16 1
3 glm vs. orm 7.771561e-16 5.229848e-16 1
4 lrm.fit vs. old.lrm.fit 2.803313e-15 1.813393e-15 1
5 lrm.fit vs. orm 2.581269e-15 1.722439e-15 1
6 old.lrm.fit vs. orm 2.220446e-16 9.095388e-17 1
# Check agreement of stats including rank correlation measures
m(Res$lrm.fit$stats - Res$old.lrm.fit$stats)
[1] 1.421085e-14
Cov ratio in the above output is the anti-log of the absolute value of the log of the ratio of elements of two covariance matrices. So it represents the worst disagreement in the two matrices, with 1.0 being perfect agreement to 7 decimal places. Max |difference| is the highest absolute difference in estimated regression coefficients between two methods, and Max |rel diff| is the maximum ratio of absolute differences to the sum of absolute values of two coefficient estimates.
Y=0, 1, 2
x  <- 1 : 10
xx <- matrix(as.double(x), ncol=1)
y  <- c(0, 2, 0, 1, 0, 2, 2, 1, 1, 2)
f  <- rms::orm(y ~ x, eps=1e-10)   # deviance 21.77800 19.79933
- solve(vcov(f, intercepts='all'))
y>=1 y>=2 x
y>=1 -2.336337 1.148893 -5.103426
y>=2 1.148893 -2.657167 -10.455125
x -5.103426 -10.455125 -110.128337
- solve(vcov(rms::lrm(y ~ x, eps=1e-10)))
y>=1 y>=2 x
y>=1 -2.336337 1.148893 -5.103426
y>=2 1.148893 -2.657167 -10.455125
x -5.103426 -10.455125 -110.128337
- solve(vcov(MASS::polr(factor(y) ~ x)))
x 0|1 1|2
x -110.12802 5.103400 10.455131
0|1 5.10340 -2.336318 1.148877
1|2 10.45513 1.148877 -2.657156
# Note a problem with VGAM
- solve(vcov(VGAM::vglm(y ~ x, VGAM::cumulative(reverse=TRUE, parallel=TRUE))))
(Intercept):1 (Intercept):2 x
(Intercept):1 -2.434082 1.155134 -5.148614
(Intercept):2 1.155134 -2.580855 -9.505256
x -5.148614 -9.505256 -100.063658
- solve(vcov(ordinal::clm(factor(y) ~ x)))
0|1 1|2 x
0|1 -2.336337 1.148893 5.103426
1|2 1.148893 -2.657167 10.455125
x 5.103426 10.455125 -110.128337
alpha <- c(-0.8263498291155, -2.3040967379853)
beta  <- 0.3091154153068
# Analytically compute 2nd derivative of log L wrt beta
info <- function(x) {
  p1 <- plogis(alpha[1] + x * beta)
  p2 <- plogis(alpha[2] + x * beta)
  d  <- p1 - p2
  v1 <- p1 * (1 - p1)
  v2 <- p2 * (1 - p2)
  v1 - v2
  w1 <- p1 * (1 - p1) * (1 - 2 * p1)
  w2 <- p2 * (1 - p2) * (1 - 2 * p2)
  w1 - w2
  x * x * ((w1 - w2) * d - (v1 - v2)^2) / d / d
}
# Compute 2nd derivative of log(p1 - p2) wrt beta numerically
dif <- function(x, beta) log(plogis(alpha[1] + x * beta) - plogis(alpha[2] + x * beta))
del <- 1e-6
d2  <- function(x) ((dif(x, beta + del) - dif(x, beta)) / del - (dif(x, beta) - dif(x, beta - del)) / del) / del
c(info(4), info(8), info(9))
[1] -6.88269 -24.55641 -27.93077
num <- c(d2(4), d2(8), d2(9))
print(num)
[1] -6.882495 -24.556135 -27.931213
sum(num)
[1] -59.36984
w <- .Fortran('lrmll', 10L, 2L, 1L, xx, as.integer(y), numeric(10), rep(1e0, 10), 0e0,
              alpha, beta, logL=numeric(1), grad=numeric(3), numeric(0), 2L, 0L, 1L)
w$logL
[1] 19.79933
w$grad
[1] -1.896261e-13 -1.408873e-13 -2.330580e-12
w <- .Fortran('lrmll', 10L, 2L, 1L, xx, as.integer(y), numeric(10), rep(1e0, 10), 0e0,
              alpha, beta, logL=numeric(1), grad=numeric(3), hess=matrix(0e0, nrow=3, ncol=3), 3L, 0L, 1L)
w$hess
[,1] [,2] [,3]
[1,] -2.336337 1.148893 -5.103426
[2,] 1.148893 -2.657167 -10.455125
[3,] -5.103426 -10.455125 -110.128337
Simple Ordinal Model With Weights, Offsets, and Penalties
We first ignore weights, offsets, and penalties, then incorporate them.
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
X  <- cbind(x1, x2)
y  <- sample(0:5, 50, TRUE)
wt <- runif(50)
wt <- wt / sum(wt)
of <- rnorm(50)
pm <- cbind(c(1.2, 0.6), c(0.6, 1.2))
f <- lrm.fit.old(X, y, eps=1e-15)   # (y ~ x1 + x2, eps=1e-15)
f$deviance
[1] 177.1350 176.8656
cof <- coef(f)
w <- .Fortran('lrmll', 50L, 5L, 2L, X, as.integer(y), rep(0e0, 50),
              rep(1e0, 50), 0 * pm,
              cof[1:5], cof[6:7], logL=numeric(1), grad=numeric(7), numeric(0),
              2L, 0L, 0L)
w$logL
[1] 176.8656
w$grad
[1] 3.330669e-16 3.552714e-15 2.886580e-15 -2.886580e-15 1.998401e-15
[6] 1.998401e-15 1.332268e-15
w <- .Fortran('lrmll', 50L, 5L, 2L, X, as.integer(y), rep(0e0, 50),
              rep(1e0, 50), 0 * pm,
              cof[1:5], cof[6:7], logL=numeric(1), grad=numeric(7),
              hess=matrix(0e0, nrow=7, ncol=7),
              3L, 0L, 0L)
range(w$hess + solve(vcov(f)))
[1] -1.421085e-14 1.065814e-14
# Needed reltol=1e-15 to get gradient to 1e-8 with BFGS
# CG achieved 1e-6 with default, with 475 function evaluations
g <- lrm.fit(X, y, trace=1, opt_method='nlm', gradtol=1e-14,
             transx=TRUE, compstats=FALSE)
iteration = 0
Step:
[1] 0 0 0 0 0 0 0
Parameter:
[1] 1.51634749 0.57536414 -0.08004271 -0.84729786 -2.19722458 0.00000000
[7] 0.00000000
Function Value
[1] 177.135
Gradient:
[1] 1.199041e-14 -2.664535e-15 -1.243450e-14 7.993606e-15 3.330669e-15
[6] 2.510851e+00 3.400018e+00
iteration = 4
Parameter:
[1] 1.53972787 0.60066749 -0.06107634 -0.83760533 -2.19876783 -0.07124309
[7] -0.10606549
Function Value
[1] 176.8656
Gradient:
[1] -8.388246e-11 -3.561773e-11 8.475087e-11 1.565637e-11 7.530643e-13
[6] 5.342615e-12 9.455770e-13
Last global step failed to locate a point lower than x.
Either x is an approximate local minimum of the function,
the function is too non-linear for this algorithm,
or steptol is too large.
m(g$u); g$iter   # L-BFGS-B took factor as low as 1e2 to get u=3e-6, 50 iter
[1] 4.237544e-11
[1] 4
mad(cof, g$coefficients)
mad relmad
1.377462e-12 4.676386e-12
range(f$var / g$var)
[1] 1 1
Now use weights, offset, and penalties.
f <- rms::lrm(y ~ x1 + x2 + offset(of), weights=wt, penalty=2,
              penalty.matrix=pm, eps=1e-12)
f$deviance
[1] 3.475801 3.803296 3.799336
cof <- coef(f)
w <- .Fortran('lrmll', 50L, 5L, 2L, cbind(x1, x2), as.integer(y), of, wt,
              2e0 * pm,
              cof[1:5], cof[6:7], logL=numeric(1), grad=numeric(7), numeric(0),
              2L, 0L, 1L)
w$logL
[1] 3.799336
w$grad
[1] 3.404934e-09 6.040724e-10 -1.193834e-09 -1.123210e-09 -1.671228e-09
[6] 3.391934e-10 7.970911e-11
w <- .Fortran('lrmll', 50L, 5L, 2L, cbind(x1, x2), as.integer(y), of, wt,
              2 * pm,
              cof[1:5], cof[6:7], logL=numeric(1), grad=numeric(7),
              hess=matrix(0e0, nrow=7, ncol=7),
              3L, 0L, 1L)
range(w$hess + solve(vcov(f)))
[1] -6.661338e-16 7.771561e-16
g <- lrm.fit(cbind(x1, x2), y, trace=3, offset=of, weights=wt,
             penalty.matrix=2e0*pm, opt_method='nlminb', compstats=FALSE)
0: 3.8301785: 1.23911 0.611204 -0.0435624 -0.772679 -2.45652
3: 3.8032957: 1.50965 0.925273 0.280639 -0.513375 -2.39790
0: 3.8032957: 1.50965 0.925273 0.280639 -0.513375 -2.39790 0.00000 0.00000
3: 3.7993356: 1.50984 0.927634 0.283085 -0.513316 -2.39750 0.0149656 -0.0431891
m(g$u); g$iter
[1] 3.404934e-09
iterations evaluations.function evaluations.gradient
3 4 3
mad(coef(f), coef(g))
mad relmad
5.699806e-17 1.881056e-16
g$iter
iterations evaluations.function evaluations.gradient
3 4 3
g$u
y>=1 y>=2 y>=3 y>=4 y>=5
3.404934e-09 6.040723e-10 -1.193834e-09 -1.123210e-09 -1.671228e-09
x1 x2
3.391934e-10 7.970911e-11
f$deviance
[1] 3.475801 3.803296 3.799336
g$deviance
[1] 3.475801 3.803296 3.799336
range(f$var / g$var)
[1] 1 1
Check Accuracy Against Old lrm.fit
For a Variety of Levels of Y
set.seed(1)
n <- 150
w <- NULL
for(i in 1 : 40) {
  k  <- sample(1 : 20, 1, TRUE)
  y  <- sample(0 : k, n, TRUE)
  x1 <- runif(n)
  x2 <- exp(rnorm(n))
  X  <- cbind(x1, x2)
  f  <- lrm.fit.old(X, y, eps=1e-10)
  g  <- lrm.fit( X, y, opt_method='nlminb', compstats=FALSE)
  d  <- coef(f) - coef(g)
  r  <- wratio(vcov(f) / vcov(g))
  w  <- rbind(w, data.frame(i, k, mad.beta=m(d), Cov.ratio=r))
}
range(w$Cov.ratio)
[1] 1 1
with(w, plot(k, mad.beta, log='y'))
Study Convergence and Timings
Fortran vs. R
Regarding execution speed, a key question is whether it’s worth the effort to code part of the calculations in a compiled language such as Fortran, C, or C++, as compared to just using R. Let’s explore this by coding the gradient vector calculation in R and timing it against the new Fortran code. Also write a function making the Fortran routine easy to call when computing the gradient.
The code below makes use of the facts that \(\frac{\partial \log \text{expit}(x)}{\partial x} = \text{expit}(-x) = 1 - \text{expit}(x)\) and \(\frac{\partial \text{expit}(x)}{\partial x} = \text{expit(x)}(1 - \text{expit(x)})\). The philosophy of this code is that nothing is calculated unless it is relevant, which makes for more lines of code. For example, the code does not create extra intercepts to yield probabilities of 0 or 1, but instead handles each case Y=0, Y=\(k\), \(0 < \text{Y} < k\) separately. For the non-interior levels of Y the gradient is very simple as probabilities are expits and not differences.
grad <- function(alpha, beta, x, y) {
  k  <- length(alpha)
  p  <- length(beta)
  f  <- plogis
  xb <- as.vector(x %*% matrix(beta, nrow=p))
  xb <- as.vector(x %*% beta)
  P1 <- P2 <- numeric(n)
  i0 <- y == 0
  ik <- y == k
  ib <- y > 0 & y < k
  P1[i0] <- 1e0
  P2[i0] <- f(alpha[1] + xb[i0])
  P1[ik] <- f(alpha[k] + xb[ik])
  P2[ik] <- 0e0
  P1[ib] <- f(alpha[y[ib]    ] + xb[ib])
  P2[ib] <- f(alpha[y[ib] + 1] + xb[ib])
  pq1 <- P1 * (1e0 - P1)
  pq2 <- P2 * (1e0 - P2)
  P   <- P1 - P2
  U   <- rep(0e0, k + p)

  U[1] <- - sum(1e0 - P[i0])
  U[k] <- U[k] + sum(1e0 - P[ik])

  # Gradient for intercepts
  if(k > 1) for(m in 1 : k) {   # only interior y values create complexity
    if(m < k) U[m] <- U[m] + sum(pq1[y == m]     / P[y == m])
    if(m > 1) U[m] <- U[m] - sum(pq2[y == m - 1] / P[y == m - 1])
  }
  # Gradient for slopes
  for(m in 1 : p) {
    U[k + m] <- - sum(x[i0, m] * (1e0 - P[i0]))
    U[k + m] <- U[k + m] + sum(x[ik, m] * (1e0 - P[ik]))
    if(k > 1) for(i in 1 : (k - 1)) {
      j <- y == i
      U[k + m] <- U[k + m] + sum(x[j, m] * (pq1[j] - pq2[j]) / P[j])
    }
  }
  U
}
fgrad <- function(alpha, beta, x, y) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  k <- max(y)   # y assumed to be coded 0-k
  .Fortran('lrmll',
           as.integer(n), as.integer(k), as.integer(p),
           x, y, rep(0e0, n), rep(1e0, n), matrix(0e0, nrow=p, ncol=p),
           alpha, beta, numeric(1), u=numeric(k + p), numeric(1), 2L, 0L, 0L)$u
}
# This calls a version of the Fortran code using the alpha extension approach
fgrad2 <- function(alpha, beta, x, y) {
  x <- as.matrix(x)
  n <- nrow(x)
  p <- ncol(x)
  k <- max(y)   # y assumed to be coded 0-k
  .Fortran('lrmll2',
           as.integer(n), as.integer(k), as.integer(p),
           x, y, rep(0e0, n), rep(1e0, n), matrix(0e0, nrow=p, ncol=p),
           alpha, beta, numeric(1), u=numeric(k + p), numeric(1), 2L, 0L, 0L)$u
}
But is the elegance of following the letter of the proportional odds model’s definition worth the trouble? What if we used the trick that is most often used in MLE and Bayesian modeling where we define extra intercepts so that all values of Y appear to be interior values and the same difference in probabilities can be computed everywhere, and we did not use special cases to compute derivatives of log likelihood components? Specifically we can write the model as the following, with y = \(0, 1, \ldots, k\) and \(\text{expit}(x) = \frac{1}{1 + \exp(-x)}\).
\[\Pr(Y \geq y) = \text{expit}(\alpha_y + X\beta)\]
by expanding the original vector of intercepts \(\alpha\) by adding \(\alpha_0 = 100\) and \(\alpha_{k+1} = -100\). Then \(\Pr(Y = y) = \text{expit}(\alpha_y + X\beta) - \text{expit}(\alpha_{y+1} + X\beta)\) for any \(y=0, \ldots, k\). The \(\pm 100\) is chosen so that \(\text{expit}(x)\) is indistinguishable from 1 and 0.
We need to run some computational tests to make sure that the upcoming shortcuts do not cause any computational inefficiencies or inaccuracies.
# Check that R computes expit very quickly for extreme values of x
tim(smallvals = plogis(rep( 1, 100000)),
m50 = plogis(rep(-50, 100000)),
p50 = plogis(rep( 50, 100000)))
Per-run execution time in seconds, averaged over 10 runs
smallvals m50 p50
1e-03 9e-04 1e-03
# Check that taking the log of probabilities is as accurate as
# using plogis' special log probability calculation
x <- seq(-50, 50, by=1); m(log(plogis(x)) - plogis(x, log=TRUE))
[1] 8.881784e-16
So it appears that the “intercept extension” approach will not cause any numerical problems. To code this method while computing the gradient, we need the derivative of the log of the difference in probabilities (call this \(Q = P_1 - P_2\)) given above. Consider a general parameter \(\theta\) which may be one of the (interior) \(\alpha\)s or one of the \(\beta\)s.
\[\frac{\partial \log(Q)}{\partial\theta} = \frac{\frac{\partial P_1}{\partial\theta} - \frac{\partial P_2}{\partial\theta}}{Q}\] Since the main part of \(\frac{\partial \text{expit}(x)}{\partial\theta} = \text{expit(x)}(1 - \text{expit(x)})\),
\[\frac{\partial P_1}{\partial\theta} = P_1 (1 - P_1) \frac{\partial(\alpha_y + X\beta)}{\partial\theta}\] \[\frac{\partial P_2}{\partial\theta} = P_2 (1 - P_2) \frac{\partial(\alpha_{y + 1} + X\beta)}{\partial\theta}\]
\(\frac{\partial(\alpha_{y} + X\beta)}{\partial\theta}\) is the 0/1 indicator function \([y=j]\) when \(\theta = \alpha_{j}\) and is \(X_{j}\) when \(\theta = \beta_{j}\). Now code this in R.
grad2 <- function(alpha, beta, x, y) {
  k     <- length(alpha)
  p     <- length(beta)
  xb    <- as.vector(x %*% matrix(beta, nrow=p))
  xb    <- as.vector(x %*% beta)
  alpha <- c(100e0, alpha, -100e0)
  # Must add 1 to y to compute P1 and P2 since index starts at 1, not 0
  P1  <- plogis(alpha[y + 1] + xb)
  P2  <- plogis(alpha[y + 2] + xb)
  Q   <- P1 - P2
  pq1 <- P1 * (1e0 - P1)
  pq2 <- P2 * (1e0 - P2)
  U   <- numeric(k + p)

  # Gradient for intercepts
  for(m in 1 : k)
    U[m] <- sum((pq1 * (y == m) - pq2 * (y + 1 == m)) / Q)

  # Use element-wise multiplication then get the sum for each column
  # Element-wise = apply the same value of the weights to each row of x
  U[(k + 1) : (k + p)] <- colSums(x * (pq1 - pq2) / Q)
  U
}
Check that grad
and grad2
yield the same answer and check their relative speeds.
set.seed(1)
n <- 50000; p <- 50; k <- 100
x <- matrix(rnorm(n * p), nrow=n)
y <- sample(0 : k, n, TRUE)
stopifnot(length(unique(y)) == k + 1)
alpha <- seq(-6, 6, length=k)
beta  <- runif(p, -0.5, 0.5)
tim(g1 = grad( alpha, beta, x, y),
    g2 = grad2(alpha, beta, x, y),
    reps = 5)   # creates Res
Per-run execution time in seconds, averaged over 5 runs
g1 g2
0.8862 0.0550
m(Res$g1 - Res$g2) # maximum absolute difference
[1] 4.774847e-11
Even though the streamlined code in grad2
required evaluating a few quantities that are known to be 0 or 1, its vectorization resulted in significantly faster R code. Later this will be compared with the speed of Fortran code.
Here is the central part of the Fortran code for computing the gradient vector. Fortran is blazing fast and easier to learn than C and C++, so more users may wish to translate some execution-time critical portions of their R code to Fortran 2018. R makes it easy to include Fortran code in packages, and it is also easy to include Fortran functions in RStudio
sessions.
u = 0_dp

! All obs with y=0
! The derivative of log expit(x) wrt x is expit(-x)
! Prob element is expit(-alpha(1) - lp)

u(1) = - sum(wt(i0) * (1_dp - d(i0)))
if(p > 0) then
  do l = 1, p
    u(k + l) = - sum(wt(i0) * x(i0, l) * (1_dp - d(i0)))
  end do
end if

! All obs with y=k
! Prob element is expit(alpha(k) + lp)

u(k) = u(k) + sum(wt(ik) * (1_dp - d(ik)))
if(p > 0) then
  do l = 1, p
    u(k + l) = u(k + l) + sum(wt(ik) * x(ik, l) * (1_dp - d(ik)))
  end do
end if

! All obs with 0 < y < k
if(nb > 0) then
  do ii = 1, nb
    i = ib(ii)
    j = y(i)
    ! For p1, D() = 1 for alpha(j), 0 for alpha(j+1)
    ! For p2, D() = 0 for alpha(j), 1 for alpha(j+1)
    u(j)     = u(j)     + wt(i) * v1(i) / d(i)
    u(j + 1) = u(j + 1) - wt(i) * v2(i) / d(i)
    if(p > 0) then
      do l = 1, p
        u(k + l) = u(k + l) + wt(i) * x(i, l) * (v1(i) - v2(i)) / d(i)
      end do
    end if
  end do
This code can be streamlined using the \(\alpha\) extension approach:
ealpha = [100d0, alpha, -100d0]
p1  = expit(ealpha(y + 1) + lp)
p2  = expit(ealpha(y + 2) + lp)
q   = p1 - p2
pq1 = p1 * (1_dp - p1)
pq2 = p2 * (1_dp - p2)
do j = 1, k
  u(j) = sum((pq1 * merge(1_dp, 0_dp, y == j) - &
              pq2 * merge(1_dp, 0_dp, y + 1 == j)) / (q / wt))
end do
if(p > 0) then
  do j = 1, p
    u(k + j) = sum(x(:, j) * (pq1 - pq2) / (q / wt))
  end do
end if
But this code runs faster (this is the code tested below):
do i = 1, n
  w = q(i) / wt(i)
  do j = 1, k
    if(y(i) == j)     u(j) = u(j) + pq1(i) / w
    if(y(i) + 1 == j) u(j) = u(j) - pq2(i) / w
  end do
  if(p > 0) then
    do j = 1, p
      u(k + j) = u(k + j) + x(i, j) * (pq1(i) - pq2(i)) / w
    end do
  end if
end do
First let’s compare accuracy and speed of the two ways of coding the gradient calculation in Fortran shown above.
tim(Fortran = fgrad (alpha, beta, x, y),
Fortran2 = fgrad2(alpha, beta, x, y), reps=40 )
Per-run execution time in seconds, averaged over 40 runs
Fortran Fortran2
0.008075 0.012375
m(Res$Fortran - Res$Fortran2)
[1] 6.82121e-12
Though it produces the same result to within \(7\times 10^{-12}\), the streamlined Fortran is \(1.5\times\) slower than the longer Fortran code.
Run the R grad2
function defined above, and the Fortran routine included in the new rms
package, for \(n=50,000\), \(p=50\) predictors, and \(k=100\) intercepts.
# Check agreement of R and Fortran code
tim(R = grad2(alpha, beta, x, y),
Fortran = fgrad(alpha, beta, x, y), reps=40 )
Per-run execution time in seconds, averaged over 40 runs
R Fortran
0.059850 0.008325
m(Res$R - Res$Fortran)
[1] 6.82121e-12
We see that the compiled Fortran code is \(> 7\times\) faster than the R code. In a nutshell Fortran allows you to not worry about vectorizing calculations, allowing for simpler code (there are many functions in Fortran for vectorizing operations but these are used more for brevity than for speed).
As streamlined as the following code is, it does not improve execution time over my R grad function, taking 1.1s to run on the data given above while yielding the wrong answer.
gradient_proportional_odds <- function(alpha, beta, X, Y) {
  n <- nrow(X)       # Number of observations
  p <- ncol(X)       # Number of predictors
  k <- length(alpha) # Number of thresholds (max Y value)

  # Compute linear predictors
  eta <- X %*% beta  # n x 1 vector

  # Expand eta to match dimensions with alpha
  eta_matrix <- matrix(eta, n, k, byrow = FALSE)  # n x k matrix

  # Compute expit(alpha_y + eta) for all thresholds y
  eta_alpha  <- eta_matrix + matrix(alpha, n, k, byrow = TRUE)
  expit_vals <- 1 / (1 + exp(-eta_alpha))  # n x k matrix of expit values

  # Compute probabilities for P(Y = y)
  expit_upper <- cbind(1, expit_vals)  # P(Y >= 0) = 1
  expit_lower <- cbind(expit_vals, 0)  # P(Y >= k+1) = 0
  prob_Y      <- expit_upper[, 1:k] - expit_lower[, 1:k]  # P(Y = y)

  # Indicator matrix for observed Y
  Y_ind <- matrix(0, n, k)
  for (i in 1:k) Y_ind[, i] <- as.numeric(Y == (i - 1))
  # Compute weights (observed minus predicted probabilities)
  weights <- (Y_ind - prob_Y)

  # Gradients w.r.t. alpha
  grad_alpha <- colSums(weights)

  # Gradients w.r.t. beta
  d_expit   <- expit_vals * (1 - expit_vals)  # Derivative of expit
  grad_beta <- numeric(p)
  for (j in 1:p) {
    grad_beta[j] <- sum(weights * d_expit * X[, j])
  }

  # Combine gradients
  grad <- c(grad_alpha, grad_beta)
  return(grad)
}
The Hessian requires about a factor of \(\frac{k + p}{2}\) more calculations than the gradient, so the Fortran code pays off even more when using Hessian-based optimization algorithms or computing the final covariance matrix.
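As a rough worked example of that factor (using the \(k=100\), \(p=50\) dimensions from the timing runs above, purely for illustration): \[\frac{k + p}{2} = \frac{100 + 50}{2} = 75,\] so a single Hessian evaluation costs on the order of 75 gradient evaluations.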
Check Speed of NR, nlminb, and glm.fit
glm.fit is tailored to be efficient when Y is binary and there is no penalization, by using iteratively weighted least squares. How does it stack up against NR and nlminb? Let's try fitting a large binary logistic model.
set.seed(1)
n <- 100000; p <- 100
X <- matrix(rnorm(n * p), nrow=n)
y <- sample(0 : 1, n, TRUE)
tim(NR      = lrm.fit(X, y, compstats=FALSE, compvar=FALSE),
    nlminb  = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, compvar=FALSE),
    glm.fit = glm.fit(cbind(1, X), y), reps=3)
Per-run execution time in seconds, averaged over 3 runs
NR nlminb glm.fit
1.349667 1.244667 1.642333
# transx=TRUE adds 2.3s to lrm.fit
Check Convergence Under Complete Separation
Consider a simple example where there is complete separation because the predictor values are identical to the response. In this case the MLE of the intercept is \(-\infty\) and the MLE of the slope is \(\infty\). The MLEs are approximated by \(\alpha=-50, \beta=100\), yielding predicted logits of \(-50\) and \(50\) because the \(\text{expit}\)s are sufficiently close to 0.0 and 1.0.
plogis(-50, log=TRUE)
[1] -50
plogis( 50, log=TRUE)
[1] -1.92875e-22
exp(plogis(-50, log=TRUE))
[1] 1.92875e-22
exp(plogis( 50, log=TRUE))
[1] 1
See how 4 optimization algorithms fare with their default parameters and with some adjustments. In some of the trace output, the first floating point number listed is -2LL, and following that are the current \(\alpha, \beta\) estimates.
set.seed(1)
x <- sample(0 : 1, 20, TRUE)
y <- x
w <- try(lrm.fit(x, y, opt_method='NR', trace=1))   # default opt_method
Iteration:1 -2LL:26.92047 Max |gradient|:9.6 Max |change in parameters|:4.166667
Iteration:2 -2LL:4.704224 Max |gradient|:1.754066 Max |change in parameters|:2.249045
Iteration:3 -2LL:1.588994 Max |gradient|:0.616103 Max |change in parameters|:2.08089
Iteration:4 -2LL:0.5685801 Max |gradient|:0.2233137 Max |change in parameters|:2.028578
Iteration:5 -2LL:0.2071309 Max |gradient|:0.0817248 Max |change in parameters|:2.010364
Iteration:6 -2LL:0.07592908 Max |gradient|:0.03000809 Max |change in parameters|:2.003793
Iteration:7 -2LL:0.02789647 Max |gradient|:0.01103173 Max |change in parameters|:2.001393
Iteration:8 -2LL:0.01025764 Max |gradient|:0.004057316 Max |change in parameters|:2.000512
Iteration:9 -2LL:0.003772913 Max |gradient|:0.001492464 Max |change in parameters|:2.000188
Iteration:10 -2LL:0.001387888 Max |gradient|:0.000549028 Max |change in parameters|:2.000069
Iteration:11 -2LL:0.0005105632 Max |gradient|:0.0002019736 Max |change in parameters|:2.000025
w <- try(lrm.fit(x, y, opt_method='nlminb', trace=1))
0: 26.920467: -0.405465 0.00000
1: 5.9001181: -1.83776 3.67925
2: 1.9469377: -2.99693 5.99700
3: 0.69222129: -4.04687 8.09673
4: 0.25163200: -5.06435 10.1316
5: 0.092171572: -6.07067 12.1442
6: 0.033854569: -7.07298 14.1489
7: 0.012447190: -8.07383 16.1506
8: 0.0045780905: -9.07414 18.1512
9: 0.0016840535: -10.0743 20.1514
10: 0.00061951084: -11.0743 22.1515
11: 0.00022790289: -12.0743 24.1515
12: 8.3840460e-05: -13.0743 26.1515
13: 3.0843137e-05: -14.0743 28.1515
14: 1.1346550e-05: -15.0743 30.1515
15: 4.1741617e-06: -16.0743 32.1515
16: 1.5355882e-06: -17.0743 34.1515
17: 5.6491130e-07: -18.0743 36.1515
18: 2.0781925e-07: -19.0743 38.1515
19: 7.6452432e-08: -20.0743 40.1515
20: 2.8125276e-08: -21.0743 42.1515
21: 1.0346714e-08: -22.0743 44.1515
22: 3.8063437e-09: -23.0743 46.1515
23: 1.4002772e-09: -24.0743 48.1515
24: 5.1513283e-10: -25.0743 50.1515
25: 1.8950352e-10: -26.0743 52.1515
26: 6.9714901e-11: -27.0743 54.1515
27: 2.5648816e-11: -28.0743 56.1515
28: 9.4360075e-12: -29.0743 58.1515
29: 3.4692249e-12: -30.0743 60.1515
30: 1.2789769e-12: -31.0743 62.1515
31: 4.7073456e-13: -32.0743 64.1515
32: 1.6875390e-13: -33.0743 66.1515
33: 6.2172489e-14: -34.0743 68.1515
34: 2.6645353e-14: -35.0743 70.1515
35: 8.8817842e-15: -36.0743 72.1515
36: 0.0000000: -37.0743 74.1515
37: 0.0000000: -37.0743 74.1515
Error in chol.default(info, tol = tol) :
the leading minor of order 1 is not positive
Error in chol.default(info, tol = tol) :
the leading minor of order 1 is not positive
w <- try(lrm.fit(x, y, opt_method='glm.fit', trace=1))
Deviance = 3.368709 Iterations - 1
Deviance = 1.16701 Iterations - 2
Deviance = 0.4207147 Iterations - 3
Deviance = 0.1536571 Iterations - 4
Deviance = 0.0563787 Iterations - 5
Deviance = 0.02072057 Iterations - 6
Deviance = 0.007619969 Iterations - 7
Deviance = 0.002802865 Iterations - 8
Deviance = 0.001031067 Iterations - 9
Deviance = 0.0003793016 Iterations - 10
Deviance = 0.0001395364 Iterations - 11
Deviance = 5.133244e-05 Iterations - 12
Deviance = 1.888413e-05 Iterations - 13
Deviance = 6.947082e-06 Iterations - 14
Deviance = 2.555688e-06 Iterations - 15
Deviance = 9.401851e-07 Iterations - 16
Deviance = 3.458748e-07 Iterations - 17
Deviance = 1.272402e-07 Iterations - 18
Deviance = 4.680906e-08 Iterations - 19
Deviance = 1.722009e-08 Iterations - 20
Deviance = 6.33492e-09 Iterations - 21
Deviance = 2.330491e-09 Iterations - 22
Deviance = 8.573409e-10 Iterations - 23
Deviance = 3.15401e-10 Iterations - 24
Deviance = 1.160316e-10 Iterations - 25
Deviance = 4.268585e-11 Iterations - 26
Deviance = 1.570299e-11 Iterations - 27
Deviance = 5.782042e-12 Iterations - 28
w <- try(lrm.fit(x, y, opt_method='BFGS', trace=1))
initial value 26.920467
iter 10 value 0.000153
iter 20 value 0.000017
iter 30 value 0.000004
iter 40 value 0.000001
iter 50 value 0.000000
final value 0.000000
stopped after 50 iterations
- nlminb took 37 iterations, going so far that the Hessian matrix was singular. It should have stopped with 10.
- glm.fit took 28 iterations; should have stopped with 10
- BFGS: stopped after 50 iterations; should have stopped with 10
Now specify arguments to lrm.fit
that are tuned to this task.
w <- lrm.fit(x, y, opt_method='nlminb', abstol=1e-3, trace=1)
0: 26.920467: -0.405465 0.00000
1: 5.9001181: -1.83776 3.67925
2: 1.9469377: -2.99693 5.99700
3: 0.69222129: -4.04687 8.09673
4: 0.25163200: -5.06435 10.1316
5: 0.092171572: -6.07067 12.1442
6: 0.033854569: -7.07298 14.1489
7: 0.012447190: -8.07383 16.1506
8: 0.0045780905: -9.07414 18.1512
9: 0.0016840535: -10.0743 20.1514
10: 0.00061951084: -11.0743 22.1515
w <- lrm.fit(x, y, opt_method='glm.fit', reltol=1e-3, trace=1)
Deviance = 3.368709 Iterations - 1
Deviance = 1.16701 Iterations - 2
Deviance = 0.4207147 Iterations - 3
Deviance = 0.1536571 Iterations - 4
Deviance = 0.0563787 Iterations - 5
Deviance = 0.02072057 Iterations - 6
Deviance = 0.007619969 Iterations - 7
Deviance = 0.002802865 Iterations - 8
Deviance = 0.001031067 Iterations - 9
Deviance = 0.0003793016 Iterations - 10
Deviance = 0.0001395364 Iterations - 11
Deviance = 5.133244e-05 Iterations - 12
w <- lrm.fit(x, y, opt_method='BFGS', reltol=1e-4, trace=1)
initial value 26.920467
iter 10 value 0.000153
final value 0.000100
converged
These loosened tolerance parameters are too large to use when infinite coefficients are not a problem.
Check 4 Algorithms With k=1000
For timings that follow, compstats=FALSE is specified to lrm.fit so that we can focus on the computational efficiency of the various optimization algorithms. Some of the differences in run times may not seem to be consequential, but once extremely large datasets are analyzed or one needs to fit models in a bootstrap or Monte Carlo simulation loop, the differences in speed will matter.
When the number of distinct Y values is large, and this far exceeds the number of predictors (\(k \gg p\)), the rms orm function should be used. It takes into account the sparsity of the intercept portion of the Hessian matrix, which is tri-band diagonal and has only \(3\times k\) nonzero values for \(k+1\) distinct Y values. Outside of orm and SAS JMP, ordinal regression fitting software treats the intercept portion of the Hessian matrix as a dense \(k\times k\) block and does not capitalize on sparsity.
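As a small illustration of why that banding matters (my own sketch using the Matrix package; orm's actual sparse representation may differ), a tridiagonal intercept block can be stored and solved as a band matrix using memory and time that grow only linearly in \(k\):
# Sketch of exploiting a tri-band (tridiagonal) intercept block; illustration only
library(Matrix)
k  <- 1000                                     # number of intercepts
d0 <- runif(k, 2, 3)                           # main diagonal (diagonally dominant,
d1 <- runif(k - 1, -0.5, 0.5)                  #  hence positive definite, for this toy)
H  <- bandSparse(n = k, k = 0:1, diagonals = list(d0, d1), symmetric = TRUE)
length(H@x)                                    # about 2k - 1 stored values vs k^2 if dense
g  <- rnorm(k)
delta <- solve(H, g)                           # sparse solve; fast even for thousands of intercepts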
set.seed(1)
n <- 10000; k <- 1000
x <- rnorm(n); y <- sample(0:k, n, TRUE)
length(unique(y))
[1] 1001
tim(orm = rms::orm.fit(x, y, eps=1e-10),
bfgs = lrm.fit(x, y, opt_method='BFGS', compvar=FALSE, compstats=FALSE, maxit=100),
nlminb = lrm.fit(x, y, opt_method='nlminb', compstats=FALSE),
nr = lrm.fit(x, y, opt_method='NR', compstats=FALSE),
nlm = lrm.fit(x, y, opt_method='nlm', compstats=FALSE),
polr = MASS::polr(factor(y) ~ x, control=list(reltol=1e-10)),
reps= 2)
Per-run execution time in seconds, averaged over 2 runs
orm bfgs nlminb nr nlm polr
0.4855 0.0430 0.5905 0.5075 9.4195 6.3760
smod()
deviance max_abs_u iter
orm 137142.5 NA NA
bfgs 137142.5 2.376188e+00 4
nlminb 137142.5 1.431033e-09 3
nr 137142.5 7.625981e-05 2
nlm 137142.5 7.625981e-05 1
polr 137142.5 NA NA
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 orm vs. bfgs 6.991343e-06 3.950698e-04
2 orm vs. nlminb 1.213373e-13 1.145216e-13 1.000000
3 orm vs. nr 1.213202e-13 1.139669e-13 1.000000
4 orm vs. nlm 4.136310e-07 3.812141e-07 1.001368
5 orm vs. polr 9.637588e-06 1.359060e-05
6 bfgs vs. nlminb 6.991343e-06 3.950698e-04
7 bfgs vs. nr 6.991343e-06 3.950698e-04
8 bfgs vs. nlm 6.776204e-06 3.949523e-04
9 bfgs vs. polr 9.425718e-06 4.030520e-04
10 nlminb vs. nr 3.307971e-16 9.636021e-16 1.000000
11 nlminb vs. nlm 4.136309e-07 3.812140e-07 1.001368
12 nlminb vs. polr 9.637588e-06 1.359060e-05
13 nr vs. nlm 4.136309e-07 3.812140e-07 1.001368
14 nr vs. polr 9.637588e-06 1.359060e-05
15 nlm vs. polr 9.231301e-06 1.326760e-05
Check Timing and Agreement for n=100000, k=10, p=5
Check timing and calculation agreement for n=100000, 10 intercepts, 5 predictors.
set.seed(1)
n  <- 100000
y  <- sample(0 : 10, n, TRUE)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
x5 <- rnorm(n)
X  <- cbind(x1, x2, x3, x4, x5)
tim(old.lrm.fit = lrm.fit.old(X, y, eps=1e-7),
nr = lrm.fit(X, y, opt_method='NR', compstats=FALSE),
nlm = lrm.fit(X, y, opt_method='nlm', compstats=FALSE),
nlminb = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, transx=TRUE),
nlminb.notransx = lrm.fit(X, y, opt_method='nlminb', compstats=FALSE, transx=FALSE),
bfgs = lrm.fit(X, y, opt_method='BFGS', compvar=FALSE, compstats=FALSE),
bfgs.reltol = lrm.fit(X, y, opt_method='BFGS', compvar=FALSE,
compstats=FALSE, reltol=1e-12),
polr = MASS::polr(factor(y) ~ X,
control=list(reltol=1e-10)),
reps = 5)
Per-run execution time in seconds, averaged over 5 runs
old.lrm.fit nr nlm nlminb nlminb.notransx
0.1540 0.0854 0.5158 0.1104 0.0804
bfgs bfgs.reltol polr
0.4230 0.6380 2.4726
smod()
deviance max_abs_u iter
old.lrm.fit 479562.9 8.754331e-12 NA
nr 479562.9 1.992553e-07 3
nlm 479562.9 5.051226e-02 1
nlminb 479562.9 1.992456e-07 3
nlminb.notransx 479562.9 1.992552e-07 3
bfgs 479562.9 6.987317e-01 7
bfgs.reltol 479562.9 3.683979e-05 19
polr 479562.9 NA NA
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 old.lrm.fit vs. nr 5.269016e-12 2.231582e-11 1.000000
2 old.lrm.fit vs. nlm 3.004074e-06 2.122140e-05 1.000202
3 old.lrm.fit vs. nlminb 1.053797e-11 4.463604e-11 1.000000
4 old.lrm.fit vs. nlminb.notransx 1.053798e-11 4.462038e-11 1.000000
5 old.lrm.fit vs. bfgs 9.057305e-06 7.174198e-04
6 old.lrm.fit vs. bfgs.reltol 5.712939e-08 1.703834e-06
7 old.lrm.fit vs. polr 5.977469e-06 2.409730e-03
8 nr vs. nlm 3.004068e-06 2.122138e-05 1.000202
9 nr vs. nlminb 5.268959e-12 2.232022e-11 1.000000
10 nr vs. nlminb.notransx 5.268964e-12 2.230456e-11 1.000000
11 nr vs. bfgs 9.057303e-06 7.174198e-04
12 nr vs. bfgs.reltol 5.713464e-08 1.703850e-06
13 nr vs. polr 5.977473e-06 2.409730e-03
14 nlm vs. nlminb 3.004063e-06 2.122136e-05 1.000202
15 nlm vs. nlminb.notransx 3.004063e-06 2.122136e-05 1.000202
16 nlm vs. bfgs 8.867603e-06 7.344671e-04
17 nlm vs. bfgs.reltol 3.059332e-06 2.243215e-05
18 nlm vs. polr 7.995997e-06 2.398944e-03
19 nlminb vs. nlminb.notransx 2.695688e-17 2.221290e-14 1.000000
20 nlminb vs. bfgs 9.057301e-06 7.174198e-04
21 nlminb vs. bfgs.reltol 5.713989e-08 1.703866e-06
22 nlminb vs. polr 5.977476e-06 2.409730e-03
23 nlminb.notransx vs. bfgs 9.057301e-06 7.174198e-04
24 nlminb.notransx vs. bfgs.reltol 5.713989e-08 1.703866e-06
25 nlminb.notransx vs. polr 5.977476e-06 2.409730e-03
26 bfgs vs. bfgs.reltol 9.069540e-06 7.167658e-04
27 bfgs vs. polr 1.347117e-05 2.800575e-03
28 bfgs.reltol vs. polr 5.951212e-06 2.410693e-03
Check Impact of initglm
and transx
Generate a sample of 500 with 30 predictors and 269 levels of Y where a subset of the predictors are strongly related to Y and there are collinearities.
set.seed(1)
n <- 500
p <- 30
x <- matrix(runif(n * p), nrow=n)
x[, 1:8] <- x[, 1:8] + 2 * x[, 9]
s <- varclus(~ x)$hclust
plot(as.dendrogram(s), horiz=TRUE, axes=FALSE,
     xlab=expression(paste('Spearman ', rho^2)))
rh <- seq(0, 1, by=0.1)
axis(1, at=1 - rh, labels=format(rh))
y <- x[, 1] + 2 * x[, 2] + 3 * x[, 3] + 4 * x[, 4] + 5 * x[, 5] +
     3 * runif(n, -1, 1)
y <- round(y, 1)
length(unique(y))
[1] 269
f <- function(..., opt_method='NR')
  lrm.fit(x, y, compstats=FALSE, opt_method=opt_method, maxit=1000, ...)
# f(initglm=TRUE) (using nlminb) would not work: NA/NaN gradient evaluation
# This did not happen without collinearities
tim(default = f(),
transx = f(transx=TRUE),
nlm = f(opt_method='nlm'),
bfgs = f(opt_method='BFGS'),
nlminb = f(opt_method='NR'),
reps = 10 )
Per-run execution time in seconds, averaged over 10 runs
default transx nlm bfgs nlminb
0.0454 0.0575 0.3746 0.2421 0.0381
smod()
deviance max_abs_u iter
default 3879.848 6.112532e-05 7
transx 3879.848 6.112532e-05 7
nlm 3879.848 6.112532e-05 6
bfgs 3879.848 9.721526e-03 587
nlminb 3879.848 6.112532e-05 7
Maximum |difference in coefficients|, Maximum |relative difference|
worst ratio of covariance matrices
Comparison Max |difference| Max |rel diff| Cov ratio
1 default vs. transx 3.545735e-14 2.931589e-15 1.000000
2 default vs. nlm 1.063035e-05 5.990158e-07 1.000291
3 default vs. bfgs 1.918945e-05 3.544624e-06
4 default vs. nlminb 0.000000e+00 0.000000e+00 1.000000
5 transx vs. nlm 1.063035e-05 5.990158e-07 1.000291
6 transx vs. bfgs 1.918945e-05 3.544624e-06
7 transx vs. nlminb 3.545735e-14 2.931589e-15 1.000000
8 nlm vs. bfgs 2.249980e-05 3.725436e-06
9 nlm vs. nlminb 1.063035e-05 5.990158e-07 1.000291
10 bfgs vs. nlminb 1.918945e-05 3.544624e-06
Better Understanding Convergence with BFGS Optimizer
Using the same simulated data just used, use BFGS to fit an ordinal model with relative tolerance varying from \(10^{-2}\) to \(10^{-20}\). Estimates are compared to orm
. In addition to comparing parameter estimates as done above we also compute differences in units of standard errors as computed by orm
.
g  <- rms::orm(y ~ x, eps=1e-10)
se <- sqrt(diag(vcov(g, intercepts='all')))

d <- NULL
for(i in 2 : 20) {
  s <- system.time(for(j in 1:5)
         f <- lrm.fit(x, y, compstats=FALSE,
                      opt_method='BFGS',
                      maxit=1000,
                      reltol=10^(-i)))['elapsed']
  w <- data.frame(i, elapsed=s / 5, maxu=m(f$u),
                  maxbeta=m(coef(g) - coef(f)),
                  maxbeta.per.se=m((coef(g) - coef(f))/se),
                  deviance=tail(f$deviance, 1),
                  iter=tail(f$iter, 1))
  d <- rbind(d, w)
}
rownames(d) <- NULL
d
i elapsed maxu maxbeta maxbeta.per.se deviance iter
1 2 0.0084 1.781729e+02 3.361068e+01 2.063781e+01 5399.000 1
2 3 0.0100 9.738995e+01 3.361066e+01 2.063722e+01 5094.243 3
3 4 0.0140 6.384426e+01 3.361066e+01 2.063699e+01 5081.404 8
4 5 0.0680 1.892680e+01 3.360747e+01 2.058161e+01 4900.327 91
5 6 0.1972 1.128232e+00 5.167229e-02 4.465468e-02 3879.877 428
6 7 0.2132 3.220871e-01 1.185453e-02 1.024456e-02 3879.851 504
7 8 0.2162 1.538043e-01 2.510429e-03 1.842095e-03 3879.848 531
8 9 0.2392 4.632918e-02 3.877973e-04 3.351305e-04 3879.848 556
9 10 0.2292 9.721526e-03 1.342389e-04 1.055845e-04 3879.848 587
10 11 0.2382 1.345625e-03 8.725462e-05 6.402546e-05 3879.848 608
11 12 0.2554 6.036158e-04 7.179465e-05 4.097949e-05 3879.848 647
12 13 0.2822 1.547283e-04 6.205067e-05 3.721365e-05 3879.848 679
13 14 0.2968 5.574421e-05 5.959543e-05 3.619120e-05 3879.848 706
14 15 0.3056 7.187127e-05 5.930900e-05 3.584911e-05 3879.848 729
15 16 0.3644 2.330094e-05 4.968143e-05 2.996054e-05 3879.848 804
16 17 0.3470 2.330094e-05 4.968143e-05 2.996054e-05 3879.848 804
17 18 0.3484 2.330094e-05 4.968143e-05 2.996054e-05 3879.848 804
18 19 0.3336 2.330094e-05 4.968143e-05 2.996054e-05 3879.848 804
19 20 0.3462 2.330094e-05 4.968143e-05 2.996054e-05 3879.848 804
z <- c(deviance=8, beta=10, beta.se=6, grad=15)   # minimum i for which success achieved
h <- function() abline(v=z, col=gray(0.60))
h <- function() {}   # remove this line to show reference lines
par(mfrow=c(2, 3), mar=c(4, 4, 1, 1), las=1, mgp=c(2.8, .45, 0))
with(d, {
  plot(i, elapsed,         type='l'); h()
  plot(i, maxu,            type='l', log='y'); h()
  plot(i, maxbeta,         type='l', log='y'); h()
  plot(i, maxbeta.per.se,  type='l'); h()
  plot(i, deviance - 3879, type='l', log='y'); h()
  plot(i, iter,            type='l', log='y'); h()
})
NULL
See that stochastic convergence, as judged by deviance, occurs by the time the relative tolerance is \(10^{-8}\), by \(10^{-10}\) to control maximum absolute parameter difference, by \(10^{-6}\) to get parameter estimates to within 0.06 standard error, and by \(10^{-15}\) to achieve gradients \(< 10^{-4}\) in absolute value.
When using BFGS a recommendation is to use a relative tolerance of \(10^{-8}\) to nail down estimates to the extent that it matters precision-wise, and use \(10^{-10}\) to achieve reproducibility.
Recall that BFGS is only appealing when the number of intercepts is large, you don’t need the covariance matrix, and you are not using orm.
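Putting those recommendations into a call (a hedged usage sketch; the data here are simulated only so the call is runnable):
# Sketch: BFGS with a stringent relative tolerance and no Hessian computed afterward
set.seed(2)
x <- rnorm(2000)
y <- sample(0 : 300, 2000, TRUE)                 # many intercepts
f <- lrm.fit(x, y, opt_method='BFGS', compvar=FALSE, compstats=FALSE,
             maxit=1000, reltol=1e-10)
c(deviance = tail(f$deviance, 1), iterations = tail(f$iter, 1))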
Matrix Inversion
The information matrix (negative Hessian) must be inverted to compute the variance-covariance matrix. The default inversion method in R is the solve
function, which defaults to using the LU decomposition. This is a fast algorithm, but the Cholesky decomposition is faster and behaves as well as LU numerically. Another approach uses the QR decomposition, implemented in the qr.solve
function. Let’s compare speed and accuracy of the three approaches when applied to a high-dimensional almost-singular matrix.
# ChatGPT created this function to generate an almost-singular symmetric
# positive definite matrix of dimension p x p
genas <- function(p, epsilon = 1e-8) {
  # Generate a random symmetric positive definite matrix
  A <- matrix(rnorm(p^2), p, p)
  A <- A + t(A) + p * diag(p)   # Symmetric and positive definite

  # Modify eigenvalues to make the matrix almost singular
  eigen_decomp <- eigen(A)
  eigen_decomp$values[p] <- epsilon   # Set the smallest eigenvalue close to zero
  eigen_decomp$vectors %*% diag(eigen_decomp$values) %*% t(eigen_decomp$vectors)
}
set.seed(3)
x <- genas(750, 1e-9)
x[1:3, 1:3]
[,1] [,2] [,3]
[1,] 748.051108 -1.0242634 0.3929600
[2,] -1.024263 747.2599805 0.9261079
[3,] 0.392960 0.9261079 745.6809913
# LU method
system.time(for(i in 1:3) a <- solve(x))
user system elapsed
0.657 0.016 0.695
mad(x, solve(a)) # reverse the inversion and compare to x
mad relmad
0.0001921486 0.0006909615
m(diag(750) - x %*% a) # compare x inverse * x to identity matrix
[1] 3.083591e-05
# QR
system.time(for(i in 1:3) b <- qr.solve(x, tol=1e-13))
user system elapsed
1.559 0.045 1.665
mad(x, qr.solve(b, tol=1e-13))
mad relmad
0.0002050283 0.0007250491
m(diag(750) - x %*% b)
[1] 3.917762e-05
# Cholesky
system.time(for(i in 1:3) ch <- chol2inv(chol(x)))
user system elapsed
0.321 0.004 0.325
mad(x, chol2inv(chol(ch))) # mean |difference| and mean relative difference
mad relmad
0.0002323721 0.0008362380
m(diag(750) - x %*% ch) # mean |difference|
[1] 1.71175e-05
QR takes significantly longer and offers no accuracy advantage. Inversion via Cholesky decomposition was almost twice as fast as LU, though both methods took less than \(\frac{1}{4}\) of a second to invert a \(750\times 750\) matrix. Cholesky was a little more accurate in getting the product of the original matrix and its inverse closer to an identity matrix. Cholesky was very slightly worse in recovering the original matrix by inverting its inverse.
lrm.fit uses Cholesky when inverting the final Hessian matrix. During Newton-Raphson updating when opt_method='NR', it uses solve (LU method) because it is more accurate to let solve compute the product of the inverse Hessian and the gradient vector in one step, i.e., by solving the linear system rather than explicitly forming the inverse.
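The distinction can be sketched as follows (a toy example, not the lrm.fit internals): during iteration the Newton step is obtained by solving a linear system directly, while an explicit inverse via Cholesky is only needed once, for the final variance-covariance matrix.
# Sketch: one-step solve of the Newton system vs. explicit inversion via Cholesky
set.seed(4)
p <- 300
A <- matrix(rnorm(p * p), p, p)
H <- crossprod(A) + diag(p)      # stand-in for an information matrix (positive definite)
g <- rnorm(p)                    # stand-in for a gradient vector
step1 <- solve(H, g)             # LU solve in one step, as during NR updating
V     <- chol2inv(chol(H))       # explicit inverse, as for the final covariance matrix
step2 <- V %*% g
max(abs(step1 - step2))          # tiny here; the direct solve avoids forming V at each iteration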
Other Resources
- Video by Richard McElreath explaining numerical accuracy issues in log likelihood calculations.
- R CRAN Task Views on optimizers
- Vignettes by John Nash et al.
- R ucminf package
- R nloptr package
Computing Environment
grateful::cite_packages(pkgs='Session', output='paragraph', out.dir='.',
                        cite.tidyverse=FALSE, omit=c('grateful', 'ggplot2'))
We used R version 4.4.0 (R Core Team 2024) and the following R packages: Hmisc v. 5.2.1 (Harrell Jr 2024a), tlrm v. 0.0.1 (Harrell Jr 2024b).
The code was run on macOS Sonoma 14.7.1.