# Background

Effect ratios such as odds ratios (OR) and hazard ratios (HR) are useful measures of relative treatment effects and are used extensively in randomized clinical trials (RCT). In their simplest form, where they represent a non-covariate-adjusted treatment effect, they were designed for homogeneous patient populations, i.e., situations in which there are no known risk factors. They were not designed for the case where there is substantial outcome heterogeneity even for patients receiving the same treatment. When this outcome heterogeneity is present (e.g., strong risk factors exist), patients come from a mixture of distributions, and this mixing causes problems for many statistics that are not linear in the outcome data. In the Cox proportional hazards (PH) model this mixing causes non-proportional hazards of treatments, and in both Cox and logistic regression models the mixing causes an attenuation of the treatment effect. This is explained in detail in Chapter 13 of Biostatistics for Biomedical Research (BBR).

In a linear model applied to an RCT, failure to adjust for covariates, or omitting a strong risk factor from the set of adjustment variables, results in residuals having larger absolute values, which increases the residual variance. The residual variance absorbs such unaccounted heterogeneity in the outcome, and increasing this variance decreases the power of the study. In Cox and logistic models there are no residual terms, and unaccounted outcome heterogeneity has nowhere to go. So it goes into the regression coefficients that are in the model, attenuating them towards zero. As a result, failure to adjust for easily accountable outcome heterogeneity in nonlinear models causes a loss of power. As detailed in Chapter 13 of the above-referenced text, adjusting for outcome heterogeneity keeps Cox and logistic models from losing power, unlike the situation in linear models where adjusting actually increases power. The important impact of covariate adjustment on power for making treatment comparisons has been studied extensively by Steyerberg.
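The attenuation described above can be sketched with a short calculation. This is an illustration rather than anything taken from the referenced chapter: it assumes a binary covariate with 50% prevalence, no treatment interaction, and a control-arm risk of 0.1 when the covariate is absent. The names `marginal_log_or`, `beta`, `gamma`, `intercept`, and `prev` are all hypothetical. The function averages outcome probabilities over the covariate within each arm, which is what an unadjusted logistic model would estimate in a very large trial.

```r
# Illustrative assumptions: binary covariate x, no interaction with treatment.
# beta      = conditional (adjusted) treatment log odds ratio
# gamma     = log odds ratio for the omitted covariate
# intercept = log odds of the outcome for a control patient with x = 0
# prev      = proportion of patients with x = 1
marginal_log_or <- function(beta, gamma, intercept = -log(9), prev = 0.5) {
  # Arm-specific outcome probabilities averaged over the covariate
  p_trt  <- (1 - prev) * plogis(intercept + beta) +
                  prev * plogis(intercept + beta + gamma)
  p_ctrl <- (1 - prev) * plogis(intercept) +
                  prev * plogis(intercept + gamma)
  qlogis(p_trt) - qlogis(p_ctrl)
}

beta   <- log(9)
gammas <- log(c(1, 2, 4, 9, 25))
res <- data.frame(
  covariate_OR   = exp(gammas),
  conditional_OR = exp(beta),
  marginal_OR    = round(exp(sapply(gammas,
                     function(g) marginal_log_or(beta, g))), 2)
)
print(res)
```

With `gamma = 0` the marginal and conditional odds ratios coincide at 9, and with a covariate odds ratio of 9 the marginal value is 49/9, about 5.44. Doing the same averaging on the identity scale instead of the logit scale returns `beta` for every `gamma`, which is why in a linear model the omitted heterogeneity ends up in the residual variance rather than in the coefficient.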

The underlying issue is the non-collapsibility of ORs and HRs. Non-collapsibility means that the conditional ratio is different from the marginal (unadjusted) ratio even in the complete absence of confounding (as in our example dataset below). By the way, don’t make the mistake of concluding that non-collapsibility is undesirable. Any measure that has the potential for summarizing a treatment effect with one constant for all types of patients will be non-collapsible when the outcome is categorical or represents time to event<sup>1</sup>. Collapsible measures such as absolute risk reduction and relative risk reduction must vary over risk factors (creating mathematical but not subject-matter-relevant interactions), otherwise probabilities will arise that are outside the allowable range of [0,1]. Log odds ratios and log hazard ratios have unlimited ranges and can possibly apply to everyone. This makes them good bases for studying heterogeneity of treatment effect.
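A small numeric sketch of the range problem, using made-up baseline risks and effect sizes (every number and variable name here is a hypothetical chosen for illustration): a single risk difference or risk ratio applied to all patients can push a high-risk patient's probability above 1, whereas a single odds ratio cannot.

```r
# Hypothetical control-arm risks for low-, medium-, and high-risk patients
baseline_risk <- c(low = 0.05, medium = 0.50, high = 0.95)

# One constant absolute risk increase of 0.10 applied to everyone
risk_constant_diff <- baseline_risk + 0.10

# One constant risk ratio of 1.5 applied to everyone
risk_constant_rr <- baseline_risk * 1.5

# One constant odds ratio of 3 applied to everyone, via the logit scale
risk_constant_or <- plogis(qlogis(baseline_risk) + log(3))

print(round(rbind(baseline_risk, risk_constant_diff,
                  risk_constant_rr, risk_constant_or), 3))
```

For the high-risk patient the constant difference and the constant ratio both give values above 1, so those effects would have to shrink as baseline risk grows. The odds-ratio row stays inside (0, 1) for every baseline risk, while the implied risk difference varies with baseline risk.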

When risk factors exist and a clinical trial enrolls patients having a variety of values of these risk factors, there will be outcome heterogeneity. To easily account for known outcome heterogeneity, it is a good idea to pre-specify known-to-be-important covariates for the primary analysis of an RCT. Failure to do so will result in a significant power loss.

# Odds Ratio Example

Beginning with a slight modification of an example provided by Mitch Gail and presented in the ANCOVA chapter of BBR, consider a 4000-patient randomized clinical trial with one important covariate: patient sex. This prognostic variable has perfect balance over treatments A and B, and has an odds ratio of 9 for the female:male effect within either treatment group. The A:B treatment odds ratio is also 9, for either sex. The data are shown below.

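| Males | A | B | Total | |
|---|---:|---:|---:|---|
| Y=0 | 500 | 900 | 1400 | |
| Y=1 | 500 | 100 | 600 | |
| Total | 1000 | 1000 | 2000 | OR=9 |

| Females | A | B | Total | |
|---|---:|---:|---:|---|
| Y=0 | 100 | 500 | 600 | |
| Y=1 | 900 | 500 | 1400 | |
| Total | 1000 | 1000 | 2000 | OR=9 |

| Pooled | A | B | Total | |
|---|---:|---:|---:|---|
| Y=0 | 600 | 1400 | 2000 | |
| Y=1 | 1400 | 600 | 2000 | |
| Total | 2000 | 2000 | 4000 | OR=5.44 |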
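The odds ratios attached to the three tables can be checked directly from the cell counts. The sketch below uses base R only; the helper name `odds_ratio` and the matrix layout are choices made for this illustration.

```r
# Cell counts copied from the tables: rows are Y=0, Y=1; columns are A, B
males   <- matrix(c(500, 500, 900, 100), nrow = 2,
                  dimnames = list(y = c("0", "1"), tx = c("A", "B")))
females <- matrix(c(100, 900, 500, 500), nrow = 2,
                  dimnames = list(y = c("0", "1"), tx = c("A", "B")))
pooled  <- males + females

# A:B odds ratio = odds of Y=1 under A divided by odds of Y=1 under B
odds_ratio <- function(tab) {
  (tab["1", "A"] / tab["0", "A"]) / (tab["1", "B"] / tab["0", "B"])
}

ors <- round(c(males   = odds_ratio(males),
               females = odds_ratio(females),
               pooled  = odds_ratio(pooled)), 2)
print(ors)
```

Both sex-specific odds ratios are 9, yet adding the two tables cell by cell gives a pooled odds ratio of about 5.44, even though treatment is perfectly balanced across sex.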

By fitting a binary logistic regression model we see there is no sex × treatment interaction on the unrestricted log odds scale, and the adjusted odds ratios for both treatment and sex are 9.

```r
library(rms)    # provides lrm
library(knitr)  # provides kable

d <- expand.grid(y=0:1, tx=c('A','B'), sex=c('male', 'female'))
# Order treatment levels to get OR > 1
d$tx <- factor(d$tx, c('B','A'))
d$freq <- c(500, 500, 900, 100, 100, 900, 500, 500)
kable(d, align='r')
```

| y | tx | sex | freq |
|---:|---:|---:|---:|
| 0 | A | male | 500 |
| 1 | A | male | 500 |
| 0 | B | male | 900 |
| 1 | B | male | 100 |
| 0 | A | female | 100 |
| 1 | A | female | 900 |
| 0 | B | female | 500 |
| 1 | B | female | 500 |

```r
f <- lrm(y ~ tx * sex, weights=freq, data=d)
f
```

Logistic Regression Model

`lrm(formula = y ~ tx * sex, data = d, weights = freq)`

Sum of Weights by Response Category

| 0 | 1 |
|---:|---:|
| 2000 | 2000 |

| | Model Likelihood Ratio Test | Discrimination Indexes | Rank Discrim. Indexes |
|---|---|---|---|
| Obs 8 | LR χ² 1472.26 | R² 0.411 | C 0.800 |
| 0 4 | d.f. 3 | g 1.883 | Dxy 0.600 |
| 1 4 | Pr(>χ²) <0.0001 | gr 6.575 | γ 0.851 |
| Sum of weights 4000 | | gp 0.343 | τa 0.300 |
| max \|∂log L/∂β\| 3×10⁻⁹ | | Brier 0.170 | |

| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---:|---:|---:|---:|
| Intercept | -2.1972 | 0.1054 | -20.84 | <0.0001 |
| tx=A | 2.1972 | 0.1229 | 17.87 | <0.0001 |
| sex=female | 2.1972 | 0.1229 | 17.87 | <0.0001 |
| tx=A × sex=female | 0.0000 | 0.1738 | 0.00 | 1.0000 |

```r
exp(coef(f)['tx=A'])
```

```
tx=A
   9
```

As an aside, absolute risk difference does not fix the problem in general, as sicker patients will show more absolute treatment benefit.

Back to odds ratios, the A:B odds ratio not conditioning on patient sex is 5.44. Note that this is not a weighted average of 9 and 9. The unadjusted OR applies neither to males nor females.

A subtle issue is that pooled marginal estimates (OR=5.44 above) are difficult to interpret and do not generalize to clinical populations with covariate distributions (here, sex) that differ from the RCT sample composition. As an example, let’s take a probability sample of females from our 4000-patient study and see how the A:B treatment OR changes. Compare the result from the whole RCT sample to the result after keeping the following proportions of females: 0.8, 0.6, 0.4, 0.2, 0.1, 0.05.

```r
fraction.females <- c(1, .8, .6, .4, .2, .1, 0.05)
OR <- numeric(7)
j <- 0
for(frac in fraction.females) {
  j <- j + 1
  s <- d
  i <- d$sex == 'female'
  # Down-weight the female rows to mimic keeping only a fraction of females
  s$freq[i] <- s$freq[i] * frac
  # Compute marginal OR
  g <- lrm(y ~ tx, weights=freq, data=s)
  OR[j] <- round(exp(coef(g)[2]), 2)
}
```

```r
kable(cbind(fraction.females, OR), row.names=FALSE, align='r',
      col.names=c('Fraction of females retained', 'OR'))
```

| Fraction of females retained | OR |
|---:|---:|
| 1.00 | 5.44 |
| 0.80 | 5.47 |
| 0.60 | 5.57 |
| 0.40 | 5.84 |
| 0.20 | 6.54 |
| 0.10 | 7.33 |
| 0.05 | 7.99 |

The marginal OR depends on the distribution of the sex variable in the sample, and does not transport to populations with a different sex ratio than the trial enrollment achieved. It is conditional (adjusted) ORs that generalize to other populations. These calculations illustrate that the sex-conditional OR equals the marginal OR only if the distribution is altered so that the conditioning doesn’t matter (e.g., all the males or all the females are excluded).

But what is the exact interpretation of the original marginal OR of 5.44, since it involves hidden conditioning on a 1:1 sex ratio in our example? A definition for this example is that 5.44 is the unconditional OR only when there are equal numbers of males and females, because that’s how the sample was constituted. But what is the interpretation when one wants to apply the RCT results to individual patients? It would seem to apply only to those rare situations where the patient is being counseled but for some reason we don’t know the patient’s sex<sup>2</sup>. The marginal estimate needs the physician to conceptualize the clinical population (or at least the sex ratio) from which the patient came, because it does not take the patient’s actual sex into account.
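The dependence on the sex ratio can also be written in closed form. A logistic model with treatment as its only predictor is saturated, so its estimated OR is simply the odds ratio of the pooled 2×2 table. The sketch below is a base-R cross-check of the retained-fraction table, not the code used above; `marginal_or`, `male`, and `female` are names introduced for this illustration.

```r
# Counts from the example, stored as c(Y=0, Y=1) for each sex and treatment
male   <- list(A = c(500, 500), B = c(900, 100))
female <- list(A = c(100, 900), B = c(500, 500))

# Marginal A:B odds ratio after keeping a fraction `frac` of the females
marginal_or <- function(frac) {
  a <- male$A + frac * female$A   # pooled c(Y=0, Y=1) counts, treatment A
  b <- male$B + frac * female$B   # pooled c(Y=0, Y=1) counts, treatment B
  (a[2] / a[1]) / (b[2] / b[1])
}

fracs <- c(1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0)
check <- data.frame(fraction_females = fracs,
                    OR = round(sapply(fracs, marginal_or), 2))
print(check)
```

The first seven rows match the table of retained fractions. The added row with `frac = 0` (no females at all) gives exactly 9, the conditional OR, which is the limiting case mentioned above where conditioning no longer matters.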

To say this another way, the sample-averaged OR of 5.44 does not apply to males, does not apply to females, and can conceivably only apply to a patient whose sex the physician refuses to know and for whom the probability of being female is exactly 0.5. Even then, the fact that the treatment OR is the same for both males and females makes the use of the sample-averaged value highly questionable.
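To make this concrete for individual patients, one can apply each odds ratio to the sex-specific risks under treatment B and compare the result with the risks actually observed under treatment A. This is an illustrative calculation from the example counts only; `apply_or` and the other variable names are hypothetical.

```r
# Observed probabilities of Y=1 under treatment B, by sex (example tables)
risk_B <- c(male = 100 / 1000, female = 500 / 1000)

# Shift a baseline risk by an odds ratio on the logit scale
apply_or <- function(risk, or) plogis(qlogis(risk) + log(or))

risk_A_observed    <- c(male = 500 / 1000, female = 900 / 1000)
risk_A_conditional <- apply_or(risk_B, 9)               # sex-specific OR
risk_A_marginal    <- apply_or(risk_B, (1400 / 600)^2)  # pooled OR, about 5.44

print(round(rbind(risk_B, risk_A_observed,
                  risk_A_conditional, risk_A_marginal), 3))
```

The conditional OR of 9 reproduces the observed treatment A risks for both sexes (0.5 and 0.9). The marginal OR gives a predicted risk below the observed one for males and for females alike, so it describes neither group.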

Hauck, Anderson, and Marcus summarized this situation well, citing an argument made by Mitch Gail:

> For use in a clinician–patient context, there is only a single person, that patient, of interest. The subject-specific measure then best reflects the risks or benefits for that patient. Gail has noted this previously [ENAR Presidential Invited Address, April 1990], arguing that one goal of a clinical trial ought to be to predict the direction and size of a treatment benefit for a patient with specific covariate values. In contrast, population–averaged estimates of treatment effect compare outcomes in groups of patients. The groups being compared are determined by whatever covariates are included in the model. The treatment effect is then a comparison of average outcomes, where the averaging is over all omitted covariates.

1. An exception is a log-normal survival model which is a linear model in log(Y).

2. Note that the example is oversimplified because we are using an out-of-date binary conceptualization of sex.

##### Frank Harrell
###### Professor of Biostatistics
