Two-day in-person course, Joint Statistical Meetings/ASA, Toronto, ON, Canada, August 2023
https://fharrell.com/course/rms/index.html

Regression Modeling Strategies Pre-Course
One-day virtual course, 2023-05-12
https://fharrell.com/course/prerms/index.html

Seven Common Errors in Decision Curve Analysis
Andrew Vickers
https://fharrell.com/post/edca/index.html
Introductory Remarks

In their classic paper on the evaluation of prediction models, Steyerberg et al. outlined a three-step process: discrimination, calibration, clinical utility. We first ask: does our model discriminate between those who do and don’t have disease? If so, we go on to ask: does the risk we give to an individual patient correspond to their true risk? But even if the answers to the first two questions are positive, we still have to ask whether using the model in clinical practice to aid clinical decision making would do more good than harm. Steyerberg et al. explicitly recommended decision curve analysis as a method to evaluate the clinical utility of models.

Decision curve analysis is now widely used in medical research. For instance, PubMed finds over 1500 papers published in 2022 that used the phrase “decision curve analysis” in the abstract, and, of course, there will be many more papers using the method that do not use that exact phrase in the abstract. Naturally, decision curve analysis is sometimes used well, and sometimes less well. Below I discuss some of the more common errors in empirical practice. For more information on decision curve analysis, including code, tutorials, data sets and a bibliography of introductory papers, see www.decisioncurveanalysis.org.

Error 1: Failure to Specify the Clinical Decision

Statistical prediction models are most commonly created to inform either a specific medical decision or a small number of related decisions. For instance, a model to predict a patient’s risk of cancer is typically used to inform a decision to biopsy; a model to predict risk of postoperative complications might be used to select patients for “prehabilitation” (where risk is moderate) or to advise patients against surgery (where risk is high). Decision curve analysis evaluates those decisions, asking whether a patient’s clinical outcomes are improved if they follow the prediction model compared to the default strategies of “treat all” or “treat none”. However, if no decisions are specified by the investigators, then it becomes hard to see what the decision curve analysis is actually evaluating.
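For readers who want to see the arithmetic behind a decision curve, the standard net benefit formula (Vickers & Elkin) at threshold probability p_t is TP/n − (FP/n) × p_t/(1 − p_t); “treat none” has net benefit 0 by definition, and “treat all” corresponds to classifying every patient as positive. A minimal, dependency-free Python sketch, with made-up data and function names of my own choosing:

```python
def net_benefit(risks, outcomes, pt):
    """Net benefit of treating patients whose predicted risk >= pt.

    net benefit = TP/n - (FP/n) * pt / (1 - pt)
    """
    n = len(risks)
    tp = sum(1 for r, y in zip(risks, outcomes) if r >= pt and y == 1)
    fp = sum(1 for r, y in zip(risks, outcomes) if r >= pt and y == 0)
    return tp / n - fp / n * pt / (1 - pt)

def net_benefit_treat_all(outcomes, pt):
    """'Treat all' comparator: every patient classified positive."""
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * pt / (1 - pt)

# Hypothetical predicted risks and true disease status, for illustration only
risks    = [0.05, 0.10, 0.20, 0.30, 0.60, 0.80, 0.15, 0.40, 0.70, 0.25]
outcomes = [0,    0,    0,    1,    1,    1,    0,    0,    1,    0]

for pt in [0.05, 0.10, 0.20, 0.30]:
    print(pt, net_benefit(risks, outcomes, pt), net_benefit_treat_all(outcomes, pt))
```

Plotting these quantities against the threshold probability, alongside the treat-all and treat-none lines, gives the decision curve.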

One exception is for general prognostic models intended for use in patient counselling. An example comes from cancer care. Cancer patients naturally ask “how long do you think I have, doctor?” because there are all manner of personal decisions that will be based on the answer: Should I retire? Start that new project? Get my financial affairs in order? Go see my kids? Such models are the exception that proves the rule. Hence: A decision curve analysis should specify the decisions that would be informed by the model under study or, alternatively, state that it is a general prognostic model that would be used to inform a very wide range of personal decisions.

Error 2: Showing Too Wide a Range of Threshold Probabilities

It is not uncommon to see a decision curve analysis where the model is clearly intended to inform a specific decision, such as biopsy or prophylactic therapy, but the x axis includes a very wide range of threshold probabilities. Many of these threshold probabilities will be uninformative. For instance, there is no need to know the net benefit for a cancer prediction model at a threshold probability of 80%, because no reasonable decision-maker should demand an 80% risk of cancer before agreeing to biopsy. Unless they are evaluating a prognostic model used for general patient counseling, as described above, investigators should prespecify a limited range of reasonable threshold probabilities and limit the x axis to that range. A reasonable exception to this rule is if the lower bound of the threshold probability is low, such as 5% or 10%, in which case investigators may choose to start the x axis at 0.

Error 3: Too Much White Space Below the x-Axis

Negative net benefits are not very interesting. Investigators should truncate the y axis so that it starts at some small negative level of net benefit (−0.01 is a typical choice), such that the graph shows where curves have negative net benefit without creating excessive white space below the x axis. Compare the graph on the left (what you’d want to avoid) with that on the right (where the y axis is truncated: “ymin(-0.01)” in Stata, “ylim = c(-0.01, 0.04)” in R).

Unless you are careful, it is possible to set the parameters of the graph so that a curve is left dangling above the x axis. For instance:

This is problematic because we need to know where net benefit becomes less than zero, albeit not how much less.

Error 4: Not Smoothing Out Statistical Noise

As threshold probability increases, net benefit should decrease smoothly until it reaches zero, at which point it may continue to decrease or, in some cases, remain zero for all higher values of threshold probability. When net benefit is calculated from a finite data set, however, statistical imprecision (“noise”) can cause local artifacts. It might be, for instance, that by chance there are no patients who have a predicted probability at a given probability threshold, but then several have predicted probabilities at the next highest level, and that causes a big difference in net benefit for a small change in threshold probability. This is shown on the left hand graph. In the right hand graph, net benefit has been set to be calculated every 2.5% (e.g. by “xby(0.025)” in Stata or “thresholds = seq(0, 0.4, 0.025)” in R) and a smoother has been added (e.g. “smooth” in Stata or “plot(smooth = TRUE)” in R). It should be noted that there is some disagreement between statisticians on this point. Some feel you should just “show the data”. My view is that the graph should reflect a best guess as to an underlying scientific truth, and that would be a smooth curve without local artifacts.
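The same two steps can be sketched in a language-agnostic way: evaluate net benefit on a coarse threshold grid, then smooth. Here is a minimal Python illustration, with made-up net benefit values and a simple centered moving average standing in for the lowess-type smoothers the Stata and R packages use:

```python
def coarse_grid(start, stop, step):
    """Threshold probabilities every `step` (e.g. 0.025), like xby()/seq()."""
    out, t = [], start
    while t <= stop + 1e-12:
        out.append(round(t, 6))
        t += step
    return out

def moving_average(values, window=3):
    """Simple centered moving-average smoother (a stand-in for lowess)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

thresholds = coarse_grid(0.0, 0.4, 0.025)   # 17 points instead of hundreds
# Made-up jagged net benefit estimates, one per threshold
noisy_nb = [0.30, 0.29, 0.31, 0.24, 0.28, 0.22, 0.23, 0.19, 0.21,
            0.16, 0.17, 0.12, 0.13, 0.09, 0.08, 0.05, 0.04]
smooth_nb = moving_average(noisy_nb)
```

The coarser grid reduces the chance that a single patient’s predicted probability produces a spike, and the smoother removes the remaining local artifacts.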

Error 5: Recommending Threshold Probabilities on the Basis of the Results

Decision curve analysis involves three steps: specifying a range of reasonable threshold probabilities; calculating net benefit across that range; determining whether net benefit is highest for the model across the complete range. Some investigators have used the results of the decision curve to choose the threshold probability for the model, rather than using the threshold probabilities to evaluate the model. For instance, see the graph below, a decision curve for a model to predict the outcome of cancer biopsy, where a wide range of threshold probabilities is shown for didactic purposes. Some authors might draw the incorrect conclusion that the model should be used with threshold probabilities of 25% - 50%. The correct conclusion would be that, as typical thresholds for biopsy would be around 10%, use of the model would do harm compared to the default strategy of biopsying all patients.

Error 6: Ignoring the Results

Decision curve analysis is not immune from the unfortunate tendency of investigators to ignore the actual results of a statistical analysis and declare success anyway. The decision curve shown under “Error 5” above clearly shows that the model should not be used to inform cancer biopsy, but an investigator might regardless state that the model “showed net benefit over a wide range of threshold probabilities”.

Error 7: Not Correcting for Overfit

Models created and tested on the same data set are prone to overfit, such that they appear to have better performance than they actually do. There are a number of simple and widely-used methods to correct for overfit. It is not uncommon that these are applied for calculation of discrimination (e.g. area-under-the-ROC-curve) but ignored for the decision curve. Authors can create predicted probabilities for each patient in a data set using a cross-validation approach and then use those probabilities for calculating both area-under-the-ROC-curve and net benefit for decision curve analysis. A methodology paper describing this approach has been published, with step-by-step instructions found in the tutorials at www.decisioncurveanalysis.org.
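The cross-validation pattern described above can be sketched generically: generate out-of-fold predicted probabilities once, then reuse those held-out predictions for any performance metric (AUC, net benefit, calibration). The “model” below is deliberately trivial so the sketch stays dependency-free; in practice you would fit your actual regression model on the k−1 training folds. All data and function names here are hypothetical:

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validated_risks(x, y, k=5):
    """Out-of-fold predicted probabilities.

    The 'model' is just the event rate among training-fold patients sharing
    the same binary covariate value, but the pattern is the same for any
    real model: fit on k-1 folds, predict on the held-out fold, and use
    ONLY those held-out predictions for AUC and for net benefit.
    """
    n = len(y)
    preds = [None] * n
    for test_idx in kfold_indices(n, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        for j in test_idx:
            same = [y[i] for i in train_idx if x[i] == x[j]]
            if same:
                preds[j] = sum(same) / len(same)
            else:  # fall back to overall training event rate
                preds[j] = sum(y[i] for i in train_idx) / len(train_idx)
    return preds

x = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]   # hypothetical binary marker
y = [0, 0, 1, 1, 0, 0, 1, 1, 1, 0]   # hypothetical outcomes
cv_risks = cross_validated_risks(x, y, k=5)
```

Because each patient’s predicted probability comes from a model that never saw that patient, the resulting decision curve is corrected for overfit in the same way as the cross-validated AUC.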

Posted 2023-03-18

Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness
Frank Harrell
https://fharrell.com/post/rct-mimic/index.html

What clinicians learn from clinical practice, unless they routinely do n-of-one studies, is based on comparisons of unlikes. Then they criticize like-vs-like comparisons from randomized trials for not being generalizable. This is made worse by not understanding that clinical trials are designed to estimate relative efficacy, and relative efficacy is surprisingly transportable.

Many clinicians do not even track what happens to their patients to be able to inform their future patients. At the least, randomized trials track everyone.

Parallel-group RCTs enroll volunteers whose characteristics do not mimic any population. They are then assigned treatment, and the result is the ability to estimate between-group shifts in outcomes, not to estimate population outcome tendencies in one treatment group. Think of a linear regression. The RCT is used to estimate the slope (relative shift), not the intercept (absolute anchor).

First published 2017-01-27; major revision 2023-02-13

Randomized clinical trials (RCTs) have various goals, including providing evidence that:

a treatment is superior to another treatment in a way that is likely to benefit patients

a new treatment yields patient outcomes that are similar enough to an established treatment that the two may be considered interchangeable

a diagnostic device or other technology provides information that improves patient management or outcomes

Let’s consider only the first goal. RCTs have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason. But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:

Patients in clinical practice are different from those enrolled in RCTs

Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.

Point 2 is hard to debate because RCTs are run under protocol, and research personnel are watching and asking about patients’ adherence (but more about this below). But point 1 is a misplaced worry in the majority of trials. The explanation requires getting to the heart of what RCTs are really intended to do:

Provide evidence for relative treatment effectiveness over an adequate time horizon for assessing target patient outcomes

Provide evidence for relative safety over a somewhat adequate time horizon for assessing non-target safety outcomes

Let’s go into the meaning of relative effectiveness, for two types of outcome variables. For a continuous response $Y$ such as systolic blood pressure (SBP), which for practical purposes may be considered to have an unrestricted range, the efficacy measure of interest is often the difference in two means, i.e., the mean reduction in SBP. Letting $E$ denote expected value or long-term average, $T$ denote treatment assignment ($T=A$ or $T=B$), and $X$ denote a vector of baseline (pre-randomization) patient characteristics, a key quantity of interest for continuous $Y$ is

$$E(Y \mid T=B, X) - E(Y \mid T=A, X)$$

Only a minority of RCTs actually use covariate adjustment for the primary analysis, a sad fact frequently lamented by regulatory authorities. The highly problematic consequences of this are discussed here, especially the resulting inflation of sample size to make up for failing to account for within-treatment patient outcome heterogeneity. Besides wasting time and resources, designating unadjusted analysis as the primary analysis leads to ethical concerns about exposing too many patients to experimental therapies. It also leads to much confusion about whether and how to handle observed baseline imbalance, which would have been circumvented by pre-specifying covariate adjustment for important factors. Here we assume that the primary analysis uses best statistical practices and so is covariate-adjusted.

where $\mid$ denotes “conditional on” or “holding constant”. Since the long-run mean $E$ is a linear operator and we typically use a linear model to analyze the data, the average is a collapsible quantity, meaning that the covariate-specific treatment effect $E(Y \mid T=B, X) - E(Y \mid T=A, X)$ equals the marginal treatment effect $E(Y \mid T=B) - E(Y \mid T=A)$. Covariate adjustment is still needed when estimating this difference to reduce variance and hence achieve optimum power and precision. The mean difference can be estimated from the patients in the RCT and this also estimates a population-averaged treatment effect. The patient mix does not matter unless there are interactions between one or more $X$s and $T$ and the distribution of the interacting factors in $X$ differs between RCT and target populations.
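Collapsibility of the mean difference can be shown with a tiny deterministic example. Under a hypothetical linear outcome model (coefficients made up for illustration), the covariate-specific treatment effect is the same number for every covariate value, and averaging over any covariate distribution returns that same number:

```python
# Hypothetical linear outcome model: E[Y | T, X] = 10 + 3*T - 2*X
def expected_y(t, x):
    return 10 + 3 * t - 2 * x

xs = [0.0, 0.5, 1.0, 1.5, 2.0]   # some patient covariate values

# Covariate-specific treatment effect: identical for every x
specific = [expected_y(1, x) - expected_y(0, x) for x in xs]

# Marginal effect: average outcome over the covariate mix, then difference
marginal = (sum(expected_y(1, x) for x in xs) / len(xs)
            - sum(expected_y(0, x) for x in xs) / len(xs))
# specific is [3.0, 3.0, ...] and marginal is also 3.0
```

The next subsection shows that this convenient equality fails for odds ratios and hazard ratios.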

More About Conditioning

When we condition on baseline characteristics in $X$, these describe types of patients. So we obtain patient-type-specific tendencies of $Y$. When $X$ omits an important patient characteristic we obtain patient-type-specific values only up to the resolution of how “type” is measured. Such conditional estimates will be marginal over omitted covariates, i.e., they will average over the sample distribution of omitted covariates. For linear models this is consequential only in not further reducing the residual variance, so some efficiency is lost. For nonlinear models such as logistic and Cox models, the consequence is that the treatment effect is a kind of weighted average over the sample distribution of omitted covariates that are important. That doesn’t make it wrong or unhelpful. The effect of this averaging is to underestimate the true treatment effect that compares like with like by conditioning on all important covariates.

Any covariate conditioning is better than none. Estimating unadjusted treatment effects in nonlinear model situations will result in stronger attenuation of the treatment effect (e.g., move an OR towards 1.0) on the average, will get the model wrong, and will not lend itself to understanding the ARR distribution nor provide any basis for treatment interaction/assessment of differential treatment effect. Regarding “get the model wrong”, a good example is that if the treatment effect is constant over time upon covariate adjustment (i.e., the proportional hazards (PH) assumption holds), the unadjusted treatment effect will violate PH. As an example let there be a large difference in survival time between males and females. Failure to condition on sex will make the analyst see a complex bimodal survival time distribution with unexplained modes, and this can lead to violating PH for treatment. Practical experience has found more studies with PH after covariate adjustment than studies with PH without covariate adjustment.
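The attenuation from not conditioning can be shown deterministically, with no confounding anywhere. In this made-up example the odds ratio is exactly 2 within each sex, the sex mix is 50/50 and perfectly balanced between arms, yet the unadjusted (marginal) OR computed from the pooled risks is smaller than 2:

```python
def odds(p):
    return p / (1 - p)

def inv_odds(o):
    return o / (1 + o)

# Hypothetical control-arm (A) risks; conditional OR = 2 in both strata
risk_a = {"male": 0.10, "female": 0.50}
risk_b = {sex: inv_odds(2 * odds(p)) for sex, p in risk_a.items()}

# 50/50 sex mix, identical in both arms (so no confounding)
marg_a = (risk_a["male"] + risk_a["female"]) / 2
marg_b = (risk_b["male"] + risk_b["female"]) / 2
marginal_or = odds(marg_b) / odds(marg_a)   # about 1.72, not 2
```

The conditional OR of 2 applies to every patient; the marginal 1.72 applies to no one, and its value depends on the sex mix of whoever happened to enroll.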

To obtain not only patient-type-specific treatment effects but also patient-specific effects requires conditioning on patient, otherwise estimates are marginalized with respect to patient. Conditioning on patient requires having random effects for patients, e.g., random intercepts. To have random effects requires having multiple post-randomization observations per patient—either, for example, a 6-period 2-treatment randomized crossover study or a longitudinal study with lots of longitudinal assessments per patient.

For continuous $Y$, the difference in means quantifies both absolute and relative efficacy. When the outcome is time-to-event, it is possible to have an absolute efficacy measure such as difference in mean time until event (e.g., gain in life expectancy), but as in the case of absolute risk reduction with binary $Y$ discussed below, this absolute effectiveness measure only makes sense when it is covariate-adjusted. So let’s consider the usual treatment effect parameters for binary and time-to-event outcomes. Except for a log-Gaussian accelerated failure time model, most models for these two types of outcomes have non-collapsible parameters, i.e., the treatment effect parameter has a different meaning depending on whether one conditions on covariates or not. Conditioning on $X$ is required to make the results mesh with how clinical decision making is done—one patient at a time. It is necessary to allow patient preferences and trade-offs to be taken into account.

Recognition of these issues will hopefully make some readers realize that this simple approach to personalized medicine can have more impact than measuring new biomarkers.

When $Y$ is binary, the only effect measure that can possibly mean the same thing for every patient is one that conditions on patient characteristics $X$—either a relative effect or individualized $X$-specific risk estimates under alternative therapies. The use of ideas such as ATE (average treatment effects) is not in alignment with medical decision making. This fact is most evident for binary outcomes. When effects of $X$ are more than trivial, generating a wide distribution of risk across subjects, the ATE may not apply to any patient in the trial or encountered in clinical practice.

How are models chosen for such classes of $Y$? Among other criteria, models are chosen so that (1) it is possible for there to be a single number, confidence/compatibility interval, or Bayesian posterior distribution for the treatment effect, and (2) the model form has been found to provide a satisfactory fit in a large number of patient outcome studies. These considerations lead to the popularity of the logistic regression model for binary or ordinal $Y$ and the Cox proportional hazards model for time-to-event $Y$. Relative treatment effects in these two models are, respectively, odds ratios and hazard ratios. These two ratio measures have the distinct advantage that their logarithms (the parameters actually used in the models) do not have mathematical constraints. It is therefore possible for a relative effect model to have a single parameter for treatment, i.e., it is possible for treatment not to interact with any $X$.

Examining variation of absolute risk reduction (ARR) has made many researchers claim that heterogeneity of treatment effect is present, forgetting that treatment benefit is typically smaller for minimally diseased or younger patients who don’t have much room to improve. Variation in ARR in the absence of interactions on the relative scale merely represents patient heterogeneity and not heterogeneity of treatment effects.

A binary logistic regression model for treatment- and covariate-specific probability of the outcome event may be stated as $P(Y=1 \mid T, X) = \text{expit}(\alpha + \beta T + \gamma X)$, where $\text{expit}(u) = \frac{1}{1+e^{-u}}$, $T$ is a 0/1 indicator for treatment (0 = $A$, 1 = $B$), $\beta$ is the log OR, and its anti-log $e^{\beta}$ is the adjusted (for $X$) OR.

Treatment Interactions

The omission of interaction terms in the above model is a default position related to the many times I’ve analyzed trial data that were large enough to assess interactions and found no evidence for them, and also the huge number of published funnel plots showing remarkable consistency of ORs across patient subgroups (see also this). Subject matter considerations or secondary RCT efficacy analyses (or sensitivity analysis) would cause us to add interaction terms to the model. The material that follows is still relevant but would involve covariate-specific ORs. Were interaction parameters added to the model, the result would be primary relative efficacy estimands. For the (up to $n$) absolute risk reductions, since each one already conditions on $X$, the form of the computations wouldn’t change. ARRs would be exaggerated for certain levels of interacting factors, and it would be helpful to display the ARR distributions by levels of interacting factors.

The decision to include interactions needs to be sample-size dependent in addition to being driven by subject matter knowledge, and should also recognize the huge variance-bias trade-offs involved. Reduction in bias by inclusion of treatment interactions can easily be offset by large increases in variances, so that it would have been better to pretend that the relative treatment effect is constant. This is explored here. In the best of situations, where there is a single binary interacting factor having a prevalence of 0.5, the sample size needed to estimate an interaction effect with a specific precision/margin of error is four times the sample size needed to estimate a main effect to the same precision. In that ideal case, the precision (e.g., confidence interval width) of the treatment effect estimate for one level of the interacting factor is worse by a factor of $\sqrt{2}$ than the precision for a treatment main effect.
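The arithmetic behind these factors can be sketched for a continuous outcome with residual variance $\sigma^2$, $n$ patients in total, and equal cell sizes (a simplifying assumption on my part; the same ordering holds approximately for binary outcomes):

```latex
% Treatment main effect: two arms of n/2 patients each
\mathrm{Var}(\hat{\Delta}_{\text{main}})
  = \sigma^2\left(\tfrac{2}{n} + \tfrac{2}{n}\right) = \tfrac{4\sigma^2}{n}

% Interaction (difference of differences): four cells of n/4 patients each
\mathrm{Var}(\hat{\Delta}_{\text{int}})
  = \sigma^2\left(\tfrac{4}{n} + \tfrac{4}{n} + \tfrac{4}{n} + \tfrac{4}{n}\right)
  = \tfrac{16\sigma^2}{n}
  = 4\,\mathrm{Var}(\hat{\Delta}_{\text{main}})
  \quad\Rightarrow\quad \text{need } 4n \text{ patients for equal precision}

% Treatment effect within one level of the factor: two cells of n/4 each
\mathrm{Var}(\hat{\Delta}_{\text{level}})
  = \sigma^2\left(\tfrac{4}{n} + \tfrac{4}{n}\right) = \tfrac{8\sigma^2}{n}
  \quad\Rightarrow\quad \text{SE (and CI width) wider by } \sqrt{2}
```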

In RCTs, primary analyses should be pre-specified. Other analyses can be more adaptive. A useful pragmatic strategy when the number of covariates is manageable (e.g., 10 or fewer) is to ask this question: Will predictions of patient-type-specific treatment effects be better made with inclusion of all treatment interactions, or by ignoring them? This question can be answered by comparing Akaike’s information criterion (AIC) of models with and without the interactions and choosing the model with the smaller AIC. This is equivalent to basing the decision on whether the likelihood ratio test statistic for all interactions combined exceeds twice its degrees of freedom. [See this for related ideas.]
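The stated equivalence between the AIC comparison and the likelihood ratio rule follows directly from the definition of AIC, where $q$ is the number of interaction parameters:

```latex
\mathrm{AIC} = -2\log L + 2p

\mathrm{AIC}_{\text{int}} < \mathrm{AIC}_{\text{main}}
\iff -2\log L_{\text{int}} + 2(p+q) < -2\log L_{\text{main}} + 2p
\iff \underbrace{2\left(\log L_{\text{int}} - \log L_{\text{main}}\right)}_{\text{LR }\chi^2
  \text{ for the } q \text{ interaction terms}} > 2q
```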

Even better is to use a more linear non-dichotomous process whereby interactions are not “in” or “out” of the model but are always partially “in”. Parameters for interaction terms are present but are discounted using either (1) cross-validation-like considerations to choose a penalty parameter in penalized maximum likelihood estimation, or (2) Bayesian priors that may specify, for example, that interaction effects are unlikely to be larger than main effects, or that interaction effects are unlikely to be beyond a certain magnitude (e.g., the ratio of ORs is unlikely to exceed some ratio $r$ or be less than $1/r$). An example of (1) is here.

More assumption-free ways to incorporate covariates into the analysis to gain precision in estimating the average treatment effect hide the problem of interactions and do not provide insights about effect modification/differential treatment effect.

There are those who believe that traditional statistical models should not be used in RCTs. They tend to favor the use of machine learning or nonparametric risk models that allow treatment to interact with every baseline variable, and use such models to estimate the average risk as if every patient were on treatment B, then estimate the average risk as if they were on treatment A. The difference in these two average risk estimates is a sample average treatment effect (SATE). Unless effectiveness is summarized with a difference in means, the SATE is a function of the distribution of characteristics of patients who happened to enter the trial, and it cannot be used to estimate the population average treatment effect (PATE) because probability samples are not used to select patients at random to enroll in RCTs. Here are some other comments about this disdain for traditional ANCOVA.

I’ll take a method that has assumptions that are not testable that represent reasonable approximations, over a method that ignores things that are clearly present.

Minimal-assumption almost-nonparametric approaches to estimation fail to account for the large variance-bias tradeoff they entail. By targeting the estimation of SATEs, the instabilities of such approaches average out to result in precise SATE estimates, but the resulting treatment effectiveness estimates, which average over unlikes, may not apply to any patient in or out of the RCT. And they can’t give you the goal, the PATE.

To avoid averaging, i.e., to estimate effectiveness for individual patients, it is not possible to allow for all possible treatment interactions without exploding the needed sample size. Approximating a needed patient-type-specific estimand is better than not attempting to estimate it.

This famous quote from John Tukey comes to mind.

Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.

To select a transportable relative effect measure for binary $Y$, we are then seeking a function $g$ such that, in the absence of interactions, the treatment effect satisfies

$$\frac{g\left(P(Y=1 \mid T=B, X)\right)}{g\left(P(Y=1 \mid T=A, X)\right)} = r$$

where $r$ is a relative effect ratio measure and is a single number. For the logistic model $g(p) = \frac{p}{1-p}$, which is the conversion of risk to odds, and $r$ is the OR.

So that the RCT provides a single measure that is likely to transport outside of the trial, where there will certainly be a different patient mix on $X$, we must choose a measure such that $X$ cancels out in the ratio. The OR does that. Thus our key estimand for treatment effectiveness (in the absence of interactions) is $e^{\beta}$ in the logistic model.

There have been papers arguing that the logistic regression model is not robust enough to trust as representing the treatment effect, but evidence for such worry is scant.

The cost of having a transportable treatment effect parameter that is consistent with individual patient decision making is to specify a statistical model for how the measurable part of patient heterogeneity happens, so that easily explainable outcome heterogeneity can be explained.

An RCT with reasonably wide patient inclusion criteria can also provide good estimates of absolute risk of the trial’s outcome $Y$, and the statistical model (or even a machine learning algorithm) can be used to estimate risk as if every patient were given B, then the risk as if every patient were given A, then subtract to estimate patient-type-specific absolute risk reduction (ARR) due to treatment. If there are no ties among covariate combinations present in the data, there will be as many ARR estimates as there are patients (see also this). Though the entire distribution of ARR doesn’t appear in RCT reports, it would be highly informative to make this a standard inclusion.
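The arithmetic is simple: apply a constant (covariate-adjusted) OR to each patient’s covariate-specific risk under treatment A, then subtract. A Python sketch with made-up control-arm risks and an assumed OR of 0.5 shows how a single relative effect produces a whole distribution of ARRs, one per patient:

```python
def odds(p):
    return p / (1 - p)

def inv_odds(o):
    return o / (1 + o)

def risk_under_b(risk_a, or_b_vs_a):
    """Apply a constant (covariate-adjusted) OR to a patient's risk under A."""
    return inv_odds(or_b_vs_a * odds(risk_a))

# Hypothetical covariate-specific risks under treatment A for six patients
risks_a = [0.02, 0.05, 0.10, 0.20, 0.40, 0.60]
OR = 0.5   # assumed constant adjusted OR, B vs. A (no interactions)

arr = [ra - risk_under_b(ra, OR) for ra in risks_a]
# One ARR per patient: tiny for low-risk patients, much larger for
# high-risk patients, even though the OR is identical for everyone
```

This is exactly why variation in ARR, by itself, is not evidence of heterogeneity of treatment effect: it arises here from patient heterogeneity alone.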

Fortunately these estimates are all connected by a small number of low-dimensional model parameters.

If there are no interactions and no ties in $X$, there are $n+1$ estimands of interest for $n$ patients: the $n$ absolute risk reductions along with the OR $e^{\beta}$. Evidence for any effectiveness is the same for all estimates, e.g., the Bayesian posterior probabilities of a treatment effect being in the right direction are all equal. As for usage of the RCT estimates in medical practice, four numbers could be provided: individualized estimated outcome risk under $A$, risk under $B$, their difference, and the relative treatment effect (OR).

In practice the estimates would be summarized over a regular grid of $X$ values. Evidence about the OR and ARR is identical because ARR = 0 if and only if OR = 1. For assessing evidence of a clinically worthwhile effect, e.g., ARR > 0.025, the posterior probabilities of efficacy will vary with $X$.

RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable. This is most readily seen in subgroup analyses provided by the trials themselves: so-called forest plots that demonstrate remarkable constancy of relative treatment benefit. When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply. It is only likely that the absolute treatment benefit will change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and the absolute baseline risk for the subject. This is covered in detail in Biostatistics for Biomedical Research, Section 13.6. See also Stephen Senn’s excellent presentation.

See also the remarkable constancy of ORs in this large RCT, where every opportunity was given for a covariate to interact with treatment, but the best predicting model forced all interactions to zero.

More About Transportability

Transportability of relative efficacy estimated from an RCT to patients in the field depends on a number of factors that need to be elucidated. For example, relative efficacy from an RCT is more likely to be transportable if patients in the RCT differ from those in the field by a matter of degree, e.g., are younger or further out on a disease severity continuum. Success can also happen when patients differ in an important etiologic or structural way when there are covariates that are well correlated with these characteristics that were captured in the RCT. But if etiology or structure differ in an undescribed way, the translation of RCT estimates to the field may fail, as is also the case if there is an important covariate treatment interaction that was omitted from the RCT model and the interacting factor has a much different distribution in the field than in the RCT patients. This latter phenomenon is covered in detail here.

Now that we have dived into relative effects and what RCTs are designed to estimate, consider how the “real world” does not provide what is needed to learn about treatment effectiveness in the sense of estimating what using a new treatment instead of an old treatment is likely to accomplish. Clinical practice provides anecdotal evidence that biases clinicians. What a clinician sees in her practice is patient $i$ on treatment $A$ and patient $j$ on treatment $B$. She may remember how patient $i$ fared in comparison to patient $j$, not appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness of treatment $B$ vs. $A$. But the real therapeutic question is how the outcome of a patient were she given treatment $B$ compares to her outcome were she given treatment $A$. The gold standard design is thus the randomized crossover design, when the treatment is short acting. Stephen Senn eloquently writes about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients beyond what is predicted by covariates.

For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below. Entries are in the order of strongest evidence requiring the least assumptions to the weakest evidence. Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of information they provide.

Let $P_i$ denote patient $i$ and denote the treatments by $A$ and $B$; thus $P_2^B$ represents patient 2 on treatment $B$, and $\overline{P_1^A}$ represents the average outcome over a sample of patients from which patient 1 was selected.

Design                                    Patients Compared
6-period crossover                        $P_1^B$ vs. $P_1^A$ (directly measure HTE)
2-period crossover                        $P_1^B$ vs. $P_1^A$
RCT in identical twins                    $P_1^B$ vs. $P_2^A$
Parallel-group RCT                        $\overline{P_1^B}$ vs. $\overline{P_2^A}$, on avg
Observational, good artificial control    $\overline{P_1^B}$ vs. $\overline{P_2^A}$, hopefully on avg
Observational, poor artificial control    $\overline{P_1^B}$ vs. $\overline{P_2^A}$, on avg
Real-world physician practice             $P_1^B$ vs. $P_2^A$

The best experimental designs yield the best evidence a clinician needs to answer the “what if” therapeutic question for the one patient in front of her. Covariate adjustment allows line four in the above table to be translated to patient-type-specific outcomes and not just group averages.

Regarding adherence, proponents of “real world” evidence advocate for estimating treatment effects in the context of making treatment adherence low as in clinical practice. This would result in lower efficacy and the abandonment of many treatments. It is hard to argue that a treatment should not be available for a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is by far the best hope for estimating efficacy as a function of adherence, through for example an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what should be the target adherence in an RCT, but overall it is a good thing that RCTs do not mimic clinical practice. We are entering a new era of pragmatic clinical trials. Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that the chief advantage of pragmatic trials is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.

An observational study has great difficulty unbiasedly estimating even the average treatment effect. Using the same data to attempt to estimate efficacy under a specific degree of adherence is nearly impossible.

]]>generalizabilitydesignmedicineRCTdrug-evaluationpersonalized-medicineevidence20172023https://fharrell.com/post/rct-mimic/index.htmlTue, 14 Feb 2023 06:00:00 GMTBiostatistical Modeling PlanFrank Harrell
https://fharrell.com/post/modplan/index.html

Introduction

Even though several of the hypotheses will be translated into statistical tests, at the heart of the aims is the determination of clinically useful predictors of outcomes and the estimation of the magnitude of effects of predictors on the outcomes. Therefore, the statistical plan emphasizes development of multivariable models relating predictors (patient baseline characteristics) to outcomes, and the use of such models for estimating adjusted (partial) effects of risk factors of interest, after controlling for the effects of other characteristics. The strategy to be used for developing multivariable models is given in detail in Frank E. Harrell (2015). This strategy involves the steps listed below.

This has been used successfully as a template for multiple grant proposals.

This material covers only frequentist model development. A different template would be needed for (the preferred) Bayesian approach.

multiply imputing missing predictor values to make good use of partial information on a subject

choosing an appropriate statistical model based on the nature of the response variable

deciding on the allowable complexity of the model based on the effective sample size available

allowing for nonlinear predictor effects using regression splines

incorporating pre-specified interactions

checking distributional assumptions

adjusting the variance-covariance matrix for multiple imputation

graphically interpreting the model using partial effect plots and nomograms

quantifying the clinical utility (discrimination ability) of the model

internally validating the calibration and discrimination of the model using the bootstrap to estimate the model’s likely performance on a new sample of patients from the same patient stream

possibly do external validation

Regarding the choice of statistical model, we will use ordinary multiple regression for truly continuous responses, and for outcome scales that are not truly continuous or that have heavy ties in some of their values, the proportional odds ordinal logistic model (Walker and Duncan (1967)). Particular strategies and assumption checking procedures for these models are given in Frank E. Harrell (2015).

Sample size justification for development of reliable predictive models should be based on Riley, Snell, Ensor, Burke, Harrell, et al. (2019) or Riley, Snell, Ensor, Burke, Jr, et al. (2019). For simplicity, in what follows sample size or allowable model complexity will be based on the rough rule of thumb that, on the average, a model must fit no more than m/15 parameters for it to be reliable on future similar patients, where m is the effective sample size (Frank E. Harrell (2015)). For continuous response variables or ordinal responses with at least 5 well-populated values, m is the number of subjects. For binary outcomes, m is the number of incident events in the cohort. But one needs to take into account the number of subjects needed just to get an accurate overall average prediction, i.e., the number of subjects needed to estimate the model intercept (or the underlying survival curve in a Cox proportional hazards model). For a binary outcome, 96 subjects are required to estimate the intercept such that the margin of error (with 0.95 confidence) in estimating the overall risk is no greater than ±0.1. For a continuous outcome, 70 subjects are required to estimate the residual variance to within a multiplicative margin of error of 1.2 with 0.95 confidence. See here for details about these minimum sample size calculations.
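These binary-outcome minimums follow from the worst-case (risk = 0.5) binomial margin-of-error formula n = 1.96² × p(1 − p) / δ². A minimal sketch (the function name is illustrative, not from any package):

```python
def n_for_margin(margin, p=0.5, z=1.96):
    """Worst-case number of subjects so that a 0.95 confidence interval
    for a proportion p has half-width <= margin (maximized at p = 0.5)."""
    return z * z * p * (1 - p) / margin ** 2

print(round(n_for_margin(0.10)))   # margin +/- 0.1 -> the 96-subject minimum
print(round(n_for_margin(0.05)))   # margin +/- 0.05 -> 384 subjects
```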

Example of Candidate Predictor Specification

The most comprehensive models have the following candidate predictors. Baseline value refers to the baseline value of the response variable, if one exists.

Two Possible Ways for Specifying Predictors and d.f.

Type is c for a categorical variable, n for a numeric continuous measurement, or g for a group of related variables.

Predictor          Type   d.f.
---------          ----   ----
baseline score     n      2
disease category   c      3
disease history    g      5
age                n      2
…                  …      …
Overall                   42

Predictor Type                                  d.f.
--------------                                  ----
Continuous Numeric Predictors
  baseline score                                2
  age                                           2
Categorical Predictors (exclusive categories)
  disease category                              3
Non-exclusive Categorical Predictors
  disease history                               5
Overall                                         10

For categorical predictors the number of degrees of freedom (d.f., the numerator degrees of freedom in ordinary regression) is one less than the number of categories. For continuous predictors, degrees of freedom beyond one indicate nonlinear terms. When d.f. is two, the restricted cubic regression spline fit we will use is similar to a quadratic function.

A model with a complexity given by 42 d.f. can be reliably fitted when the effective sample size is at least 42 × 15 = 630 subjects. Note that we will minimize the use of stepwise variable selection techniques that attempt to reduce the apparent d.f., as this would add instability, making the effective d.f. at least 42 in any case. We will also use a data reduction technique based on variable clustering to reduce the d.f. in some of the groups of related variables. Frank E. Harrell (2015) provides references for the pitfalls of stepwise variable selection as well as for data reduction issues.

Alternative text:Data reduction techniques such as principal components analysis will be used to reduce related groups of variables not of primary interest to a single summary score to insert in the outcome model. Alternatively, penalized maximum likelihood estimation (Frank E. Harrell 2015, sec. 9.10) will be used to jointly model all predictors, with shrinkage (a kind of discounting) to eliminate overfitting. The amount of shrinkage to be used would be that yielding an effective d.f. of xx.

Interactions (Effect Modifiers)

The starting point for multivariable modeling is an additive effects model. A purely additive model, even one with allowance for flexible nonlinear relationships, will not fit adequately if interactive or synergistic effects exist. We do not find that an empirical search for interactions is a reliable approach, because there are too many possible interactions (product terms) and frequently interactions found by intensive search techniques are spurious. Instead interactions should be pre-specified based on subject matter knowledge, using general suggestions in (Frank E. Harrell 2015, sec. 2.7.2) which states that sensible interactions to entertain are

Treatment × severity of disease being treated

Age × risk factors

Age × type of disease

Measurement × state of a subject during measurement

Race × disease

Calendar time × treatment (learning curve)

Quality × quantity of a symptom

Model Fitting and Testing Effects

Regression models will be fitted by least squares (for ordinary multiple regression) or maximum likelihood. Instead of deleting cases having some of the predictor variables missing, baseline predictors will be multiply imputed using the flexible semiparametric aregImpute algorithm in Harrell’s Hmisc package in R. When assessing the partial association of a predictor of interest with the response variable, it is important that there be as little residual confounding as possible (Greenland 2000). This is accomplished by avoiding stepwise variable selection (keeping all pre-specified variables in the model, significant or not) and by not assuming that effects are linear. For testing the partial effect of a predictor, adjusted for other independent variables, partial Wald tests will be used. When the predictor requires multiple degrees of freedom, these tests are composite or “chunk” tests. For pivotal or equivocal P-values, likelihood ratio tests will be used when maximum likelihood is used for fitting the model.

Model Validation

It is common practice when performing internal validations of predictive models to use a split sample technique in which the original sample is split into training and test samples. Although this provides an unbiased estimate of predictive accuracy of the model developed on the training sample, it has been shown to have a margin of error that is much larger than that obtained from resampling techniques. The reasons are (1) the arbitrariness of the sample split and (2) both the training and test samples must be smaller than the original, whole, sample. Pattern discovery, power, and precision of the effects of predictors are sacrificed by holding out data. For these reasons, we will use the bootstrap to internally validate the model, repeating all steps used in model development for each resample of the data. The bootstrap allows one to estimate the likely future performance of a predictive model without holding out data. It can be used to penalize estimates of predictive discrimination and calibration accuracy obtained from the original model (fit from the whole sample) for overfitting. Details of this approach are given in Harrell (Frank E. Harrell 2015, chap. 5).
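The optimism bootstrap described above can be sketched end-to-end. The illustrative Python example below (not from the plan, which uses R's rms package) deliberately uses a naive "keep the most correlated candidate" modeling step applied to pure-noise predictors; repeating that step in every resample exposes how inflated the apparent performance is:

```python
import random

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

def develop(X, y):
    """The entire 'modeling' procedure: screen all candidates and keep
    the one with the largest |correlation| with the outcome."""
    strengths = [abs(pearson(col, y)) for col in X]
    j = max(range(len(X)), key=lambda k: strengths[k])
    return j, strengths[j]

rng = random.Random(1)
n, p, B = 50, 20, 300
X = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]  # pure-noise markers
y = [rng.gauss(0, 1) for _ in range(n)]

_, apparent = develop(X, y)        # inflated: selection used the same data

optimism = 0.0
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]             # resample subjects
    jb, app_b = develop([[c[i] for i in idx] for c in X], [y[i] for i in idx])
    test_b = abs(pearson(X[jb], y))                        # judge on original data
    optimism += (app_b - test_b) / B

corrected = apparent - optimism     # honest estimate of future performance
print(f"apparent |r| = {apparent:.3f}, optimism-corrected |r| = {corrected:.3f}")
```

Since the predictors carry no signal, the corrected estimate falls well below the apparent one, which is exactly the penalty a split-sample validation would pay far more data to discover.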

Optional: Assessing Biases Caused by Dropouts

Some subjects may not return for their follow-up visit and hence may have missing values for their response variables. Frequently such subjects (“dropouts”) are not a random sample of the entire cohort, and analysis of only the complete cases will bias the results. Without intermediate clinic visits there are no good ways to adjust for such biases. Indirect information about the nonrandom dropouts will be obtained by using binary logistic models to predict the probability of dropping out on the basis of all the baseline predictor variables. When a baseline variable is a significant predictor of dropout, the distribution of that variable differs for completers and subjects who drop out.

Optional: Multiple Endpoints

There are multiple clinical endpoints in this study. We wish to answer separate questions and will report all negative as well as positive results, so no multiplicity corrections will be made for the number of endpoints (Cook and Farewell 1996).

Optional: Ordinal Response Models

Ordinal logistic models such as the proportional odds (PO) model (Walker and Duncan 1967) are the most popular models for analyzing responses when there is a natural ordering of the values of the response but an interval scale cannot be assumed. The usual ordinal logistic models (the PO and continuation ratio models) estimate a single odds ratio for each predictor. For the PO model, letting Y denote the response variable, y a level of Y, and x₁ and x₂ two levels of a predictor, the PO model assumes that the odds that Y ≥ y given x₁ divided by the odds that Y ≥ y given x₂ (the x₁ : x₂ odds ratio) depends on x₁ and x₂ but not on y. This enables us to arrive at valid conclusions not affected by a choice of y while doing away with the need to have a separate predictor coefficient for each y. The resulting odds ratio can be thought of as the impact of the predictor for moving Y to higher levels. Ordinal logistic models also work well for continuous Y, just using the ranks of the variable. If Y has m unique levels, the ordinal logistic model will have m − 1 intercepts, but m may equal the number of subjects with nothing more than computer memory problems. Thus the ordinal response model has a great robustness advantage for skewed outcome variables while preserving power. Unlike ordinary regression and ANOVA, transforming Y will not affect regression coefficients, nor will a high outlier, nor is an equal spacing assumption made about Y. The ordinal logistic model also has great power gains over the binary logistic model.

The Wilcoxon-Mann-Whitney two-sample rank test is a special case of the PO model, and like the PO model, the Wilcoxon test requires PO in order to be fully efficient. Details are here. General resources for ordinal modeling are here.
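The rank-based nature of the Wilcoxon test (and hence of PO-model conclusions) can be demonstrated directly: any monotone transformation of Y leaves the statistic unchanged. A small stdlib sketch with simulated skewed data (names illustrative):

```python
import random

def mann_whitney_u(a, b):
    """U statistic: number of (a_i, b_j) pairs with a_i > b_j, ties counting 1/2."""
    return sum(1.0 if x > z else 0.5 if x == z else 0.0 for x in a for z in b)

rng = random.Random(7)
a = [rng.lognormvariate(0.5, 1.0) for _ in range(30)]  # skewed outcomes, group A
b = [rng.lognormvariate(0.0, 1.0) for _ in range(30)]  # group B

u_raw = mann_whitney_u(a, b)
# cubing is monotone on positive values, so ranks -- and U -- are unchanged
u_transformed = mann_whitney_u([x ** 3 for x in a], [z ** 3 for z in b])
print(u_raw, u_transformed)
```

This is the robustness claimed in the text: an analysis that sees Y only through its ranks cannot be disturbed by transformations or by a single high outlier.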

External Validation Plan (for binary outcome)

There are two important aspects of the proposed external validation, and two different statistical quantities to be validated for each of these. First, we will be interested in comparing whether predictive performance measures estimated in the internal validations accurately forecast those measures for the external validation proposed here. Secondly, and more importantly, we will interpret and rely upon the external validation especially in judging generalizability of our predictive models.

The primary methods to be used for external validation are as follows. First, we will estimate predictive discrimination using the c-index (F. E. Harrell et al. 1982) and its rescaled version, Somers’ Dxy rank correlation between predicted risk and observed binary outcome. Second, we will validate the absolute predictive accuracy of predicted risks from our previously developed prediction rules. For this we will use Cleveland’s loess nonparametric regression smoother (Cleveland 1979; Frank E. Harrell, Lee, and Mark 1996; Frank E. Harrell 2015) or a spline function to estimate the relationship between predicted risk and actual risk, obtaining high-resolution calibration curves. This is done by regressing observed binary outcomes on predicted risk, with no binning (categorization), so that all levels of predicted risk supported by the data can be validated ^{1}. For testing whether the resulting calibration curve is ideal (i.e., is the 45 degree line of identity) we will use the single d.f. Spiegelhalter z-test (Spiegelhalter 1986) and the two d.f. test that simultaneously tests whether a linear logistic calibration curve has an intercept of zero and a slope of unity (Cox 1958; F. E. Harrell and Lee 1987; Miller, Hui, and Tierney 1991). Calibration accuracy will be summarized graphically by the estimated nonparametric calibration curve and by the mean absolute error and the 90th percentile of absolute calibration error. All of these quantities and tests are provided by the val.prob function in the R rms package (Frank E. Harrell 2020).

^{1} We typically plot the calibration curve from the 10th smallest predicted risk to the 10th largest in the data.
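The no-binning calibration idea can be sketched in a few lines. The illustrative Python example below uses a running-mean (nearest-neighbour) smoother as a crude stand-in for the loess smoother named above, applied to deliberately overconfident simulated predictions; miscalibration shows up as the smoothed observed rate departing from the predicted risk at the extremes:

```python
import random

rng = random.Random(11)
n, k = 2000, 201                      # sample size, smoother window width

true_p = [rng.uniform(0.05, 0.95) for _ in range(n)]
y      = [1 if rng.random() < p else 0 for p in true_p]
# overconfident predictions: pushed toward 0 and 1 relative to the truth
pred   = [min(0.99, max(0.01, 0.5 + 1.6 * (p - 0.5))) for p in true_p]

pairs = sorted(zip(pred, y), key=lambda t: t[0])         # order by predicted risk
curve = []
for i in range(k // 2, n - k // 2):
    window = pairs[i - k // 2 : i + k // 2 + 1]
    curve.append((sum(p for p, _ in window) / k,         # mean predicted risk
                  sum(o for _, o in window) / k))        # smoothed observed rate

lo_pred, lo_obs = curve[0]
hi_pred, hi_obs = curve[-1]
print(f"low:  predicted {lo_pred:.2f}, observed {lo_obs:.2f}")
print(f"high: predicted {hi_pred:.2f}, observed {hi_obs:.2f}")
```

Because the simulated predictions are overconfident, the smoothed curve sits above the identity line at low predicted risk and below it at high predicted risk; val.prob's loess curve conveys the same information with a proper smoother.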

Validation benefits from the mapping of all the predictors down into one variable, the predicted risk. So the sample size is that needed to very accurately estimate the relationship between predicted risk and observed outcome. Even though this relationship will be estimated by a smooth nonparametric calibration curve, for purposes of assessing the needed size of the validation sample we consider this process to be equivalent to estimating the risk in k distinct risk groups. Within each risk group, consider the worst case where the true risk is 0.5. To achieve a margin of error of ±0.075 in estimating the true risk with 0.95 confidence requires a sample size of 1.96² × 0.25 / 0.075² ≈ 171. To be able to have this margin of error in any chosen group would require 171 subjects in that group.

Checklist of Things Reviewers Might Criticize

No plan for variable selection (need to explain why traditional variable selection is a bad idea)

Too little detail about missing value imputation

Important variables omitted from list of predictors

Too many predictors that are unmodifiable and hence less clinically useful

References

Cleveland, W. S. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” J Am Stat Assoc 74: 829–36.

Cook, Richard J., and Vern T. Farewell. 1996. “Multiplicity Considerations in the Design and Analysis of Clinical Trials.” J Roy Stat Soc A 159: 93–110.

Harrell, F. E., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. “Evaluating the Yield of Medical Tests.” JAMA 247: 2543–46.

Harrell, F. E., and K. L. Lee. 1987. “Using Logistic Model Calibration to Assess the Quality of Probability Predictions.”

Harrell, Frank E. 2015. Regression Modeling Strategies, with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. Second edition. New York: Springer. https://doi.org/10.1007/978-3-319-19425-7.

———. 2020. “rms: R Functions for Biostatistical/Epidemiologic Modeling, Testing, Estimation, Validation, Graphics, Prediction, and Typesetting by Storing Enhanced Model Design Attributes in the Fit.” https://hbiostat.org/R/rms.

Harrell, Frank E., Kerry L. Lee, and Daniel B. Mark. 1996. “Multivariable Prognostic Models: Issues in Developing Models, Evaluating Assumptions and Adequacy, and Measuring and Reducing Errors.” Stat Med 15: 361–87.

Miller, Michael E., Siu L. Hui, and William M. Tierney. 1991. “Validation Techniques for Logistic Regression Models.” Stat Med 10: 1213–26.

Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell, Karel G. M. Moons, and Gary S. Collins. 2019. “Minimum Sample Size for Developing a Multivariable Prediction Model: Part I – Continuous Outcomes.” Statistics in Medicine 38 (7): 1276–96. https://doi.org/10.1002/sim.7993.

Riley, Richard D., Kym I. E. Snell, Joie Ensor, Danielle L. Burke, Frank E. Harrell Jr, Karel G. M. Moons, and Gary S. Collins. 2019. “Minimum Sample Size for Developing a Multivariable Prediction Model: Part II – Binary and Time-to-Event Outcomes.” Statistics in Medicine 38 (7): 1276–96. https://doi.org/10.1002/sim.7992.

Spiegelhalter, D. J. 1986. “Probabilistic Prediction in Patient Management and Clinical Trials.” Stat Med 5: 421–33. https://doi.org/10.1002/sim.4780050506.

Walker, S. H., and D. B. Duncan. 1967. “Estimation of the Probability of an Event as a Function of Several Independent Variables.” Biometrika 54: 167–78.

]]>2023accuracy-scoreendointsordinalcollaborationdata-reductiondesignmedicinepredictionregressionvalidationbootstraphttps://fharrell.com/post/modplan/index.htmlThu, 26 Jan 2023 06:00:00 GMTHow to Do Bad Biomarker ResearchFrank Harrell
https://fharrell.com/post/badb/index.html

Prelude

Biomarker Uncertainty Principle: A molecular signature derived from high-dimensional data can be either parsimonious or predictive, but not both.

We have more data than ever, more good data than ever, a lower proportion of data that are good, a lack of strategic thinking about what data are needed to answer questions of interest, sub-optimal analysis of data, and an occasional tendency to do research that should not be done.

Fundamental Principles of Statistics

Before discussing how much of biomarker research has gone wrong from a methodologic viewpoint, it’s useful to recall statistical principles that are used to guide good research.

Use methods grounded in theory or extensive simulation

Understand uncertainty

Design experiments to maximize information

Verify that the sample size will support the intended analyses, or pre-specify a simpler analysis for which the sample size is adequate; live within the confines of the information content of the data

Use all information in data during analysis

Use discovery and estimation procedures not likely to claim that noise is signal

Strive for optimal quantification of evidence about effects

Give decision makers the inputs other than the utility function that optimize decisions

Present information in ways that are intuitive, maximize information content, and are correctly perceived

Example of What Can Go Wrong

A paper in JAMA Psychiatry (McGrath et al. (2013)) about a new neuroimaging biomarker in depression typifies how violations of some of the above principles lead to findings that are very unlikely to either be replicated or to improve clinical practice. It also illustrates how poor understanding of research methodology by the media leads to hype.

The Times article claimed that we now know whether to treat a specific depressed patient with drug therapy or behavioral therapy if a certain imaging study was done. This finding rests on assessment of interactions of imaging results with treatment in a randomized trial.

To be able to estimate differential treatment benefit requires either huge numbers of patients or the use of a high-precision patient outcome. Neither is the case in the authors’ research. The authors based the analysis of treatment response on the Hamilton Depression Rating Scale (HDRS), which is commonly used in depression drug studies. The analysis of a large number of candidate PET image markers against the difference in HDRS achieved by each treatment would not lead to a very reliable result with the starting sample size of 82 randomized patients. But the authors made things far worse by engaging in dichotomania, defining an arbitrary “remission” as the patient achieving HDRS of 7 or less at both weeks 10 and 12 of treatment. Besides making the analysis arbitrary (the cutoff of 7 is not justified by any known data), dichotomization of HDRS results in a huge loss in precision and power. So the investigators began with an ordinal response, then lost much of the information in HDRS by dichotomization. This loses information about the severity of depression and misses close calls around the HDRS=7 threshold. It treats patients with HDRS=7 and HDRS=8 as more different than patients with HDRS=8 and HDRS=25. No claim of differential treatment effect should be believed unless a dose-response relationship for the interacting factor can be demonstrated. The “dumbing-down” of HDRS reduces the effective sample size and makes the results highly dependent on the cutoff of 7. A lower effective sample size means decreased reproducibility.
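The information lost to dichotomization can be quantified by simulation. The sketch below (illustrative numbers, not the study's data) compares the signal-to-noise ratio of a treatment effect estimated from a continuous outcome with that from the same outcome median-split into "responders", recovering the familiar ≈ 2/π relative efficiency of a median split under normality:

```python
import random
import statistics

rng = random.Random(42)
n, reps, shift = 100, 400, 0.3          # per-group size, simulations, true effect (SD units)

cont_effects, dich_effects = [], []
for _ in range(reps):
    ctrl = [rng.gauss(0.0, 1.0) for _ in range(n)]
    trt  = [rng.gauss(shift, 1.0) for _ in range(n)]
    # continuous analysis: difference in means
    cont_effects.append(statistics.mean(trt) - statistics.mean(ctrl))
    # dichotomized analysis: difference in "responder" proportions (cut at 0)
    dich_effects.append(sum(x > 0 for x in trt) / n - sum(x > 0 for x in ctrl) / n)

z_cont = statistics.mean(cont_effects) / statistics.stdev(cont_effects)
z_dich = statistics.mean(dich_effects) / statistics.stdev(dich_effects)
efficiency = (z_dich / z_cont) ** 2     # sample-size multiplier needed after the split
print(f"relative efficiency of the dichotomized analysis ~ {efficiency:.2f}")
```

The dichotomized analysis needs roughly 1/efficiency ≈ 1.5 times the sample size to match the continuous analysis even at the best possible (median) cutoff; cutting at HDRS = 7, far from the median, loses more still.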

There are also potential problems with study exclusion criteria that may make the results not fully applicable to the depression population, and potential serious problems with exclusions of 15 of 82 randomized patients because of termination before end of follow-up.

The simplest question that can possibly be answered using binary HDRS involves estimating the proportion of responders in a single treatment group. The sample size needed to estimate this simple proportion to within a 0.05 margin of error with 0.95 confidence is n=384. The corresponding sample size for comparing two proportions (with equal sample sizes) is n=768 in each group. In the best possible (balanced) case, the sample size needed to estimate a differential treatment effect (interaction effect) is four times this.

The idea of estimating differential treatment effects over a large number of image characteristics with a sample size n=82 that is inadequate for estimating a single proportion is preposterous. Yet the investigators claimed to learn from complex brain imaging analysis on 6 non-responders to drug and 9 non-responders to behavioral therapy which depression patients would have better outcomes under drug therapy vs. behavioral therapy.

Estimation of a single proportion is a task two orders simpler than what the authors attempted, i.e., it requires estimating neither a difference nor a double difference.

The cherry-picked right anterior insula was labeled as the optimal treatment-specific biomarker candidate (see their Figure 3 below). It is very unlikely to be reproduced.

A key result of McGrath et al. (2013) is given in their Table 1, reproduced below.

An interesting finding appears in the next-to-last row of this table. There was evidence for an interaction between treatment and the baseline HAMA total score, which is an anxiety score. The direction of the interaction is such that patients with higher anxiety levels at baseline did worse with behavioral therapy. This involves a low-tech metric but has some face validity and is undoubtedly stronger and more clinically relevant than any of the authors’ findings from brain imaging, if validated. The authors’ emphasis of imaging results over anxiety levels is puzzling, especially in view of the fact that the clinical variable analyses involved fewer multiplicity problems than the analysis of imaging parameters.

Analysis That Should Have Been Done

The analysis should use full information in the data and not use an arbitrary information-losing responder analysis. The bootstrap should be used to fully expose the difficulty of the task of identifying image markers that are predictive of differential treatment effect. This is related to the likelihood that the results will be replicated in other datasets.

Here are some steps that would accomplish the biomarker identification goal or would demonstrate the futility of such an analysis. The procedure outlined recognizes that the biomarker identification goal is at its heart a ranking and selection problem.

Keep HDRS as an ordinal variable with no binning.

Use a proportional odds ordinal logistic model to predict final HDRS adjusted for treatment and a flexible spline function of baseline HDRS.

Similar to what was done in the Bootstrap Ranking Example section below, bootstrap all the following steps so that valid confidence intervals on ranks of marker importance may be computed. Take for example 1000 samples with replacement from the original dataset.

For each bootstrap sample and for each candidate image marker, augment the proportional odds model to add each marker and its interaction with treatment one-at-a-time. All candidate markers must be considered (no cherry picking).

For each marker compute the partial χ² statistic for its interaction with treatment.

Over all candidate markers, rank their partial χ² statistics.

From each candidate marker's set of 1000 bootstrap importance ranks, compute a bootstrap confidence interval for its rank.

The likely result is that the apparent “winner” will have a confidence interval that merely excludes it as being one of the biggest losers. The sample size would likely have to be far larger for the confidence intervals to be narrow enough for us to have confidence in the “winner”. The likelihood that the apparent winner is the actual winner in the long run is close to zero.

Dichotomania and Discarding Information

What Kinds of True Thresholds Exist?

Natura non facit saltus (Nature does not make jumps) — Gottfried Wilhelm Leibniz

With much of biomarker research relying on dichotomization of continuous phenomena, let’s pause to realize that such analyses assume the existence of discontinuities in nature that don’t exist. The non-existence of such discontinuities in biomarker-outcome relationships is a key reason that cutpoints “found” in one dataset do not replicate in other datasets. They didn’t exist in the first dataset to begin with, but were found by data torture. Almost all continuous variables in nature operate in a smooth (but usually nonlinear) fashion.

What Do Cutpoints Really Assume? Cutpoints assume discontinuous relationships of the type in the right plot of Figure 1, and they assume that the true cutpoint is known. Beyond the molecular level, such patterns do not exist unless the variable is time and the discontinuity is caused by an event occurring at that time. Cutpoints assume homogeneity of outcome on either side of the cutpoint.

Discarding Information

Dichotomania is pervasive in biomarker research. Not only does dichotomania lose statistical information and power, but it leads to sub-optimal decision making. Consider these examples where dichotomization and incomplete conditioning lead to poor clinical thinking.

Patient: What was my systolic BP this time?

MD: It was > 140

Patient: How is my diabetes doing?

MD: Your HbA1c was > 6.5

Patient: What about the prostate screen?

MD who likes sensitivity: If you have average prostate cancer, the chance that your PSA exceeds the cutoff is high

The problem here is the use of improper conditioning (conditioning on X > c instead of X = x), which loses information, and of transposed conditionals, which are inconsistent with forward-in-time decision making. Sensitivity and specificity exemplify both conditioning problems: sensitivity is Prob(X > c | disease present), whereas the clinical question runs forward in time, Prob(disease | X = x).
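The forward-in-time quantity a decision actually needs can be recovered from the backwards quantities via Bayes' rule. A minimal sketch with illustrative numbers (the function name is hypothetical):

```python
def post_prob_disease(prevalence, sensitivity, specificity):
    """Prob(disease | positive test), the forward-in-time quantity,
    from the backwards quantities sensitivity and specificity."""
    p_pos_given_d  = sensitivity * prevalence               # P(T+ and D+)
    p_pos_given_nd = (1 - specificity) * (1 - prevalence)   # P(T+ and D-)
    return p_pos_given_d / (p_pos_given_d + p_pos_given_nd)

# a "90% sensitive, 90% specific" marker at 10% prevalence yields only
# about a coin-flip probability of disease given a positive result
print(post_prob_disease(0.10, 0.90, 0.90))
```

Even this is still incomplete conditioning: it conditions on X > c rather than on the actual measurement X = x, which is why modeling the continuous marker directly is preferable.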

Clinicians are actually quite good at dealing with continuous markers, gray zones, and trade-offs. So why is poor conditioning so common in biomarker research done by clinical researchers?

How to Do Bad Biomarker Research

Implied Goals of Bad Research

Create a diagnostic or prognostic model that

will be of limited clinical utility

will not be strongly validated in the future

has an interpretation that is not what it seems

uses cut-points, when cut-points don’t even exist, that

others will disagree with

result in discontinuous predictions and thinking

requires more biomarkers to make up for information loss due to dichotomization

Find a biomarker to personalize treatment selection that is not as reliable as using published average treatment effects from RCTs

Study Design to Achieve Unreliable or Clinically Unusable Results

Ignore the clinical literature when deciding upon clinical variables to collect

Don’t allow clinical variables to have dimensionality as high as the candidate biomarkers

Don’t randomize the order of sample processing; inform lab personnel of patient’s outcome status

Don’t study reliability of biomarker assays or clinical variables

Re-label demographic variables as clinical variables

Choose a non-representative sample

Double the needed sample size (at least) by dichotomizing the outcome measure

Reduce the effective sample size by splitting into training and validation samples

Statistical Analysis Plan

Don’t have one as this might limit investigator flexibility to substitute hypotheses

Categorize continuous variables or assume they operate linearly

Even though the patient response is a validated continuous measurement, analyze it as “high” vs. “low”

Use univariable screening and stepwise regression

Ignore time in time-to-event data

Choose a discontinuous improper predictive accuracy score to gauge diagnostic or prognostic ability

Try different cut-points on all variables in order to get a good value on the improper accuracy score

Use Excel or a menu-based statistical package so no one can re-trace your steps in order to criticize them

Interpretation and Validation

Pretend that the clinical variables you adjusted for were adequate and claim that the biomarkers provide new information

Pick a “winning” biomarker even though tiny changes in the sample result in a different “winner”

Overstate the predictive utility in general

Validate predictions using an independent sample that is too small or that should have been incorporated into training data to achieve adequate sample size

If the validation is poor, re-start with a different data split and proceed until validation is good

Avoid checking the absolute accuracy of predictions; instead group predictions into quartiles and show the quartiles have different patient outcomes

Categorize predictions to avoid making optimum Bayes decisions

What’s Gone Wrong with Omics & Biomarkers?

Subramanian and Simon (2010) wrote an excellent paper on gene expression-based prognostic signatures in lung cancer. They reviewed 16 non-small-cell lung cancer gene expression studies from 2002–2009 that studied patients. They scored studies on appropriateness of protocol, statistical validation, and medical utility. Some of their findings are below.

Average quality score: 3.1 of 7 points

No study showed prediction improvement over known risk factors; many failed to validate

Most studies did not even consider factors in guidelines

Completeness of resection only considered in 7 studies

Similar for tumor size

Some studies only adjusted for age and sex

The idea of assessing prognosis post lung resection without adjusting for volume of residual tumor can only be assumed to represent fear of setting the bar too high for potential biomarkers.

Difficulties of Picking “Winners”

When there are many candidate biomarkers for prognostication, diagnosis, or finding differential treatment benefit, a great deal of research uses a problematic reductionist approach to “name names”. Some of the most problematic research involves genome-wide association studies of millions of candidate SNPs in an attempt to find a manageable subset of SNPs for clinical use. Some of the problems routinely encountered are

Multiple comparison problems

Extremely low power; high false negative rate

Potential markers may be correlated with each other

Small changes in the data can change the winners

Significance testing can be irrelevant; this is fundamentally a ranking and selection problem

Ranking Markers

Efron’s bootstrap can be used to fully account for the difficulty of the biomarker selection task. Selection of winners involves computing some statistic for each candidate marker and sorting the features by these strength-of-association measures. The statistic can be a crude unadjusted measure (a correlation coefficient or unadjusted odds ratio, for example) or an adjusted measure. One draws a few hundred samples with replacement from the original raw dataset and repeats the entire analysis afresh on each re-sample. All the biomarker candidates are ranked by the chosen statistic, and bootstrap percentile confidence intervals for these ranks are computed over all re-samples. The 0.95 confidence limits for the rank of each candidate marker capture the stability of the ranks.
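The procedure just described might be sketched as follows. This is an illustrative reimplementation in Python, not the original analysis code; the Spearman rank correlation with a continuous outcome stands in as a hypothetical strength-of-association statistic.

```python
# Sketch: bootstrap percentile confidence intervals for biomarker ranks.
# Assumptions: Spearman correlation (abs. value) as the association
# statistic; rank p = apparently strongest marker, rank 1 = weakest.
import numpy as np

def bootstrap_rank_ci(X, y, B=300, alpha=0.05, seed=1):
    """X: (n, p) candidate markers; y: outcome vector.
    Returns observed ranks and percentile CIs for each marker's rank."""
    rng = np.random.default_rng(seed)
    n, p = X.shape

    def ranks(Xb, yb):
        ry = np.argsort(np.argsort(yb))  # ranks of the outcome
        # Spearman correlation = Pearson correlation of ranks
        stats = np.array([abs(np.corrcoef(np.argsort(np.argsort(Xb[:, j])), ry)[0, 1])
                          for j in range(p)])
        return np.argsort(np.argsort(stats)) + 1  # 1 = weakest, p = strongest

    observed = ranks(X, y)
    boot = np.empty((B, p))
    for b in range(B):
        idx = rng.integers(0, n, n)       # sample patients with replacement
        boot[b] = ranks(X[idx], y[idx])   # repeat the entire ranking afresh
    lo = np.quantile(boot, alpha / 2, axis=0)
    hi = np.quantile(boot, 1 - alpha / 2, axis=0)
    return observed, lo, hi
```

With many noise markers and a modest sample size, the intervals for the ranks come out strikingly wide, which is exactly the instability the bootstrap is meant to expose.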

Bootstrap Ranking Example

Deming Mi, formerly of the Vanderbilt University Department of Biostatistics and the Mass Spectrometry Research Lab, in collaboration with Michael Edgeworth and Richard Caprioli at Vanderbilt, did a moderately high-dimensional protein biomarker analysis. The analysis was based on tissue samples from 54 patients, 0.63 of whom died. The patients had been diagnosed with malignant glioma and received post-op chemotherapy. A Cox model was used to predict time until death, adjusted for age, tumor grade, and use of radiation. The median follow-up time was 15.5 months for survivors, and the median survival time was 15 months.

213 candidate features were extracted from an average spectrum using ProTS-Marker (Biodesix Inc.). The markers were ranked for prognostic potential using the Cox model partial likelihood ratio χ². To learn about the reliability of the ranks, 600 bootstrap re-samples were taken from the original data, and the markers were re-ranked each time. The 0.025 and 0.975 quantiles of the ranks were computed to derive 0.95 confidence limits. Features were sorted by observed ranks in the whole sample. The graphs below depict observed ranks and bootstrap confidence limits for them, and mark statistically “significant” associations with asterisks.

Results - Best

Results - Worst

One can see from the dot charts that for each candidate marker, whether it be the apparent winner, the apparently weakest marker, or anywhere in between, the data are consistent with an extremely wide interval for its rank. The apparent winning protein marker (observed rank 213, meaning the highest association statistic) has a 0.95 confidence interval from 30 to 213. In other words, the data are only able to rule out the winner being among the 29 worst-performing markers out of 213. Evidence for the apparent winner being the real winner is scant.

The data are consistent with the apparent loser (observed rank 1) being as high as the top 16 markers (confidence interval for its rank [4, 198]). The fact that the confidence interval for the biggest loser does not extend down to the lowest rank of 1 indicates that the marker was almost never selected as the weakest marker in the 600 re-samples.

The majority of published “winners” are unlikely to validate when feature selection is purely empirical and there are many candidate features.

Summary

There are many ways for biomarker research to fail. Try to avoid using any of these methods.

McGrath, Callie L., Mary E. Kelley, Paul E. Holtzheimer III, Boadie W. Dunlop, W. Edward Craighead, Alexandre R. Franco, R. Cameron Craddock, and Helen S. Mayberg. 2013. “Toward a Neuroimaging Treatment Selection Biomarker for Major Depressive Disorder.” JAMA Psychiatry 70 (8): 821–29. https://doi.org/10.1001/jamapsychiatry.2013.143.

Subramanian, Jyothi, and Richard Simon. 2010. “Gene Expression-Based Prognostic Signatures in Lung Cancer: Ready for Clinical Use?” J Nat Cancer Inst 102: 464–74.

https://fharrell.com/post/badb/index.html (2022-10-06)

Controversies in Predictive Modeling, Machine Learning, and Validation
Frank Harrell
https://fharrell.com/talk/stratos19/index.html

STRATOS: STRengthening Analytical Thinking for Observational Studies 2019, Banff, Alberta CA 2019-06-04

Why R? 2020, 2020-09-26

Padova University Winter School, 2022-01-24

International Conference on Recent Advances in Big Data and Precision Health, Taiwan, 2022-10-03

https://fharrell.com/talk/stratos19/index.html (2022-09-28)

R Workflow for Reproducible Biomedical Research Using Quarto
Frank Harrell
https://fharrell.com/talk/rflow/index.html

This article is an overview of the R Workflow electronic book, which also contains an R primer and many examples with code, output, and graphics, many of the graphs interactive. Here I outline an analysis project workflow that I’ve found to be efficient in making reproducible research reports using R with R Markdown and now Quarto. I start by covering importing data, creating annotated analysis files, examining the extent and patterns of missing data, and running descriptive statistics, with the goals of understanding the data and their quality and completeness. Functions in the Hmisc package are used to annotate data frames and data tables with labels and units of measurement, show metadata/data dictionaries, and produce tabular and graphical statistical summaries. Efficient and clear methods of recoding variables are given. Several examples of processing and manipulating data using the data.table package are given, including some non-trivial longitudinal data computations. General principles of data analysis are briefly surveyed, and some flexible bivariate and 3-variable analysis methods are presented, with emphasis on staying close to the data while avoiding highly problematic categorization of continuous independent variables. Examples of diagramming the flow of exclusion of observations from analysis, caching results, parallel processing, and simulation are presented in the e-book. In the process several useful report-writing methods are exemplified, including program-controlled creation of multiple report tabs.

A video covering many parts of the first 13 chapters of R Workflow may be found here.

R Workflow is reviewed by Joseph Rickert here. See this article by Norm Matloff for many useful ideas about learning R.

Computing Tools

Key tools for doing modern, high-quality, reproducible analysis are R for data import, processing, analysis, tables, and graphics, and Quarto for producing high-quality static and interactive reports. Users of R Markdown will find it easy to make small changes in syntax to convert to Quarto. Both Quarto and R Markdown rely on the R knitr package to process a blend of sentences and code to insert the tabular and graphical output of the code into the report.

The Research Process

An oversimplified summary of the research process is shown in the following diagram.

flowchart TD
St[Research question] --> M[Determine what<br>to measure<br>based on<br>subject matter<br>knowledge]
M --> Design[Design]
Design --> Exp[Experimental]
Design --> Obs[Observational]
EM[Seek high-resolution<br>high-precision response variables]
BV[Measure key influencers<br>of responses]
Exp --> EM --> BV --> Rsim[R simulations to<br>optimize design<br>estimate sample size<br>study performance of<br>analysis methods] --> Ran[Randomization] --> RR[Extra demand<br>for reproducibility]
Obs --> Comp[Between-Group Comparison]
Obs --> OO[Longitudinal and Other] --> OM[Careful model<br>choices as with<br>comparative studies]
Comp --> Und[Understand factors used<br>in group assignment<br>decisions]
Rat[See which data are<br>available only after<br>detailing which<br>measurements are needed<br><br>Don't rationalize<br>adequacy of variables by<br>being biased by<br>what is available<br><br>Confounding by indication<br>can be managed if indicators<br>are measured reliably]
Und --> MS[Make sure those factors are measured] --> Rat
Rat & RR & OM --> Acq[Data acquisition] --> PA[Principled efficient reproducible analysis<br>respecting information in raw data] --> Rw[R Workflow]

R and Quarto play key roles at several points in this process

R Workflow

The R Workflow detailed at hbiostat.org/rflow has many components depicted in the diagram that follows.

flowchart TD
R[R Workflow] --> Rformat[Report formatting]
Rformat --> Quarto[Quarto setup<br><br>Using metadata in<br>report output<br><br>Table and graph formatting]
R --> DI[Data import] --> Annot[Annotate data<br><br>View data dictionary<br>to assist coding]
R --> Do[Data overview] --> F[Observation filtration<br>Missing data patterns<br>Data about data]
R --> P[Data processing] --> DP[Recode<br>Transform<br>Reshape<br>Merge<br>Aggregate<br>Manipulate]
R --> Des[Descriptive statistics<br>Univariate or simple<br>stratification]
R --> An[Analysis<br>Stay close to data] --> DA[Descriptive<br><br>Avoid tables by using<br>nonparametric smoothers] & FA[Formal]
R --> CP[Caching<br>Parallel computing<br>Simulation]

The above components are greatly facilitated by Quarto and the data.table, Hmisc, and ggplot2 packages. The importance of data.table to the R workflow cannot be overstated. data.table provides a unified, clear, concise, and cohesive grammar for manipulating data in a huge number of ways. And it does this without having any dependencies, which makes software updates much easier to manage over the life of long projects. The R Workflow book covers the most commonly needed data processing operations with many examples of the use of data.table.

R Workflow also makes extensive use of the GitHub repository reptools, which contains a number of functions exemplified in the book.

Report Formatting

Formatting can be broken down into multiple areas:

Formatting sentences, including math notation, making bullet and numbered lists, and cross-referencing

Formatting tabular and graphical output

Using metadata (i.e., variable labels and units of measurements) to enhance all kinds of output

Using Quarto capabilities to extend formats using for example collapsible sections, tabbed sections, marginal notes, and graph sizing and captioning

Initial Workflow

The major steps initially are

importing csv files or binary files from other systems

improving variable names

annotating variables with labels and units

viewing data dictionaries

recoding

Having a browser or RStudio window to view data dictionaries, especially to look up variable names and categorical values, is a major help in coding. R Workflow also shows how to import and process a large number of datasets in one step, without repetitive programming, and how to import metadata from an external source.

For ease and clarity of coding it is highly suggested that variable names be clear and not too long, and that variable labels be used to provide the full description. Variable labels can be looked up programmatically at any point during the analysis, for inclusion in tables and graph axes.

Missing Data

Missing data are a major challenge for analysis. It is important to understand and to document the extent and patterns of missingness. The Hmisc package and the reptools repository provide comprehensive looks at missingness, including which variables tend to be missing on the same subjects. The multiple ways of summarizing missing data patterns are best reported as multiple tabs. The reptools repository makes it easy to create tabs programmatically, and this is used in the reptools missChk function exemplified in the book. missChk fits an ordinal logistic regression model to describe which types of subjects (based on variables with no NAs) tend to have more variables missing.

Data Checking

Spike histograms often provide the best single way of checking the validity of continuous variables. The reptools dataChk function allows the user to easily check ranges and cross-variable consistency by just providing a series of R expressions to run against the data table. dataChk summarizes the results in tabs, one tab per check, and includes in addition two overall summary tabs.

Data Overview

It is valuable to know how the analysis sample came to be. Filtering of observations due to study inclusion/exclusion criteria and due to missing data is often documented with a Consort diagram. The consort package makes it easy to produce such diagrams using R data elements to insert denominators into diagram nodes. Alternatively, the R/knitr/Quarto combination along with the mermaid natural diagramming language can be used to make data-fed consort-like flowcharts among many other types of diagrams.

Another type of data overview is based on computing various metrics on each variable. These metrics include

number of distinct values

number of NAs

an information measure that for numeric variables compares the information in the variable to that in a completely continuous variable with no ties

degree of asymmetry of the variable’s distribution

modal variable value (most frequent value)

frequency of modal value

minimum frequency value

frequency of that value

The reptools dataOverview function computes all these metrics and plots some of them on a scatterplot, where hover text reveals details about the variable represented by each point.

Descriptive Statistics

The best descriptive statistic for a continuous variable is a spike histogram with 100 or 200 bins. This will reveal bimodality, digit preference, outliers, and many other data features. Empirical cumulative distributions, not covered in R Workflow, are also excellent full-information summaries. The next best summary is an extended box plot showing the mean and multiple quantiles. Details and examples of extended box plots are covered in R Workflow.

Tables can summarize frequency distributions of categorical variables, and measures of central tendency, selected quantiles, and measures of spread for continuous variables. When there is a truly discrete baseline variable one can stratify on it and compute the above types of summary measures. Tables fail completely when one attempts to stratify on a continuous baseline variable after grouping it into intervals. This is because categorization of continuous variables is a bad idea. Among other problems,

categorization misses relationships occurring in an interval (this happens most often when an interval is wide such as an outer quartile)

categorization loses information and statistical power because within-interval outcome heterogeneity is ignored and between-interval ordering is not utilized

intervals are arbitrary, and changing interval boundaries can significantly change the apparent relationship

with cutpoints one can easily manipulate results; one can find a set of cutpoints that results in a positive relationship and a different set that results in a negative relationship

So tables are not good tools for describing one variable against a continuous variable. For that purpose, nonparametric smoothers or flexible parametric models are required. When the dependent variable Y is binary or continuous, the loess nonparametric smoother is an excellent default choice. For binary Y, loess estimates the probability that Y=1. For continuous Y, loess estimates the mean of Y as a function of X. In general we need to estimate trends in more than the mean. For example we may want to estimate the median cholesterol as a function of age, the inter-quartile range of cholesterol as a function of age, or 1- and 2-year incidence of disease vs. a continuous risk factor when censoring is present. A general-purpose way to get smoothed estimates against continuous X is through moving statistical summaries, of which the moving average is the oldest example. With moving statistics one computes any statistic of interest (mean, quantile, SD, Kaplan-Meier estimate) within a small overlapping window of observations, then moves the window up a few (say 1-5) observations and repeats the process. The X value taken to represent each window is the mean X within the window. Then the moving estimates are smoothed with loess or the “super smoother”. All this is made easy with the reptools movStats function, and several examples are given in R Workflow.
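The moving-window idea can be sketched as follows. This is an illustrative reimplementation in Python, not the reptools movStats function; window width and step size are arbitrary choices here.

```python
# Sketch: moving statistics in overlapping windows of observations
# sorted by X. Any statistic may be plugged in (mean, quantile, SD, ...).
import numpy as np

def moving_stat(x, y, stat=np.median, window=15, step=2):
    """Compute `stat` of y within overlapping windows of observations
    sorted by x; each window is represented by its mean x."""
    order = np.argsort(x)
    xs, ys = np.asarray(x)[order], np.asarray(y)[order]
    xm, sm = [], []
    for start in range(0, len(xs) - window + 1, step):
        w = slice(start, start + window)
        xm.append(xs[w].mean())   # representative x for the window
        sm.append(stat(ys[w]))    # the moving statistic for the window
    return np.array(xm), np.array(sm)
```

The resulting (xm, sm) pairs would then be smoothed with loess or the super smoother before plotting.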

In randomized clinical trials, it is common to present “Table 1” in which descriptive statistics are stratified by assigned treatment. As imbalances in characteristics are almost by definition chance imbalances, Table 1 is counter-productive. Why not present something that may reveal unexpected results that mean something? This is readily done by replacing Table 1 with a multi-panel plot showing how each of the baseline variables relates to the clinical trial’s main outcome variable. Some examples are presented in R Workflow. These examples include augmenting trend plots with spike histograms and extended box plots to show the marginal distribution of the baseline variable.

Longitudinal Data

R Workflow provides several examples of graphical descriptions of longitudinal data:

representative curves for continuous Y

three types of charts for longitudinal ordinal data

multiple event charts

state transition proportions over all time periods

state occupancy proportions over all time periods and by treatment

event chart for continuous time-to-event data

while often not longitudinal, adverse event charts for comparing safety profiles in clinical trials using a dot chart that includes confidence intervals for differences in proportions

Describing Variable Interrelationships

Variable clustering, using a similarity measure such as Spearman’s rank correlation, is useful for helping the analyst and the investigator understand inter-relationships among variables. Results are presented as tree diagrams (dendrograms).

Data Manipulation and Aggregation

R Workflow has many examples, including the following

use of data.table

subsetting data tables (or data frames) and analyzing the subset

counting observations

running separate analysis or just computing descriptive statistics by one or more stratification variables

adding and changing variables

selecting variables by patterns in their names

deleting variables

recoding variables in many ways

table look-up

automatically running all possible marginal summaries (e.g., subtotals)

merging data tables

reshaping data tables (e.g., wide to tall and thin)

flexible manipulation of longitudinal data on subjects including “overlap joins” and “non-equi” joins

Graphics

ggplot2 and plotly play key roles in graphical reporting. Several examples are given, along with methods for pulling from data dictionaries to better annotate graphs (e.g., putting variable labels on axes, plus units of measurement in a smaller font). The Hmisc package’s ggfreqScatter function is also demonstrated; it allows one to create scatterplots for any sample size, handling coincident points in an easy-to-understand way.

Analysis

Analysis should be driven by statistical principles and be based to the extent possible on a statistical analysis plan to avoid double-dipping and too many “investigator degrees of freedom”, referred to as the garden of forking paths. Analyses should be completely reproducible. They should use the raw data as much as possible, never dichotomizing continuous or ordinal variables. Don’t fall into the trap of computing change from baseline in a parallel-group study.

One exception: an ordinal predictor with few levels can be modeled as a categorical variable but using all possible dichotomizations (less one) with indicator variables.

Descriptive Analysis

This can involve simple stratified estimation when baseline variables are truly categorical and one does not need to adjust for other variables. In general a model-based approach is better, as it can adjust for any number of other variables while still providing descriptive results.

To respect the nature of continuous variables, descriptive analyses can use the general “moving statistics in overlapping windows” approach outlined above, or, if only the mean is of interest, loess.

Recognize that the usual Table 1 is devoid of useful information, instead opting for useful outcome trends.

Don’t make the mistake of interchanging the roles of independent and dependent variables. Many papers in the medical literature show a table of descriptive statistics where there are two columns: patients with and without an event. Not only is this presentation ignoring the timing of events, but it is using the outcome as the sole independent variable. Instead show nonparametric trends predicting the outcome instead of using the outcome to predict baseline variables. R Workflow has many such examples.

Uncertainty intervals (UIs) such as compatibility intervals, confidence intervals, Bayesian credible intervals, and Bayesian highest posterior density intervals are important tools in understanding effects of variables such as treatment. In randomized clinical trials (RCTs) an extremely common mistake, and one that is forced by bad statistical policies of journals such as NEJM, is to compute UIs for outcome tendencies of a single treatment arm in a parallel-group study. As RCTs do not sample from a patient population but rely on convenience sampling, there is no useful inference one can draw from a single-group UI. RCTs are interesting because of what happens after patient enrollment: investigator-controlled randomization to a treatment. The only pertinent UIs are for differences in outcomes.

When the UI is approximately symmetric one can have UIs on the same graph as treatment-specific point estimates by using “half confidence intervals”. In the frequentist domain, these intervals have the property that one group’s point estimate touches the other group’s interval if and only if a statistical test of equality of the two groups is not rejected at a certain level. R Workflow has numerous examples, showing the special value of this approach when comparing two survival curves.

Caching and Parallel Computing

knitr, used by both R Markdown and Quarto, provides an automatic caching system to avoid having to re-run time-consuming steps when making only cosmetic changes to a report. But often one wants to take control of the caching process and of the accounting for software and data dependencies of the current step. The runifChanged function makes this easy. Another way to save time is to use the multiple processors on your machine. R Workflow shows simple examples of both caching and parallel computing.
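The idea behind dependency-aware caching can be sketched as follows. This is a hypothetical Python illustration of the general technique, not the R runifChanged implementation: hash the declared dependencies, and recompute only when the hash changes.

```python
# Sketch: rerun an expensive step only when its dependencies change,
# otherwise load the cached result. File name and API are hypothetical.
import hashlib
import pickle
from pathlib import Path

def run_if_changed(func, deps, cache="cache.pkl"):
    """Rerun func() if the pickled dependencies changed since last run."""
    key = hashlib.sha256(pickle.dumps(deps)).hexdigest()
    p = Path(cache)
    if p.exists():
        saved_key, result = pickle.loads(p.read_bytes())
        if saved_key == key:          # nothing changed: reuse cached result
            return result
    result = func()                   # something changed: recompute
    p.write_bytes(pickle.dumps((key, result)))
    return result
```

A fuller version would also hash the source code of `func` so that editing the analysis step invalidates the cache, which is part of what runifChanged accounts for.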

Simulation

Simulations are used for many reasons including estimation of Bayesian and frequentist power for a study design, trying different designs to find an optimum one, and assessing the performance of a statistical analysis procedure. To minimize coding and be computationally efficient, multi-dimensional arrays can be used when there are many combinations of parameters/conditions to study. For example if one were varying the sample size, the number of covariates, the variance of Y, and a treatment effect to detect, one can set up a 4-dimensional array, run four nested for statements, and stuff results into the appropriate element of the array. For charting results (using for example a ggplot2 dot chart) it is easy to string out the 4-dimensional array into a tall and thin dataset where there is a variable for each dimension. R Workflow provides a full example, along with another method using a data table instead of an array.
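The array-based bookkeeping described above might look like the following sketch. The parameter grids are hypothetical, and a toy two-sample z-test power simulation stands in for a real design simulation.

```python
# Sketch: store simulation results in a 4-dimensional array indexed by
# (sample size, number of covariates, SD of Y, treatment effect), then
# string the array out into a tall-and-thin table for plotting.
import itertools
import numpy as np

ns      = [50, 100]    # sample sizes per arm (hypothetical)
ps      = [2, 5]       # numbers of covariates (carried along; unused in this toy sim)
sds     = [1.0, 2.0]   # SDs of Y
effects = [0.0, 0.5]   # treatment effects to detect

reps = 400
rng = np.random.default_rng(3)
power = np.empty((len(ns), len(ps), len(sds), len(effects)))

for i, n in enumerate(ns):
    for j, p in enumerate(ps):
        for k, sd in enumerate(sds):
            for l, eff in enumerate(effects):
                a = rng.normal(0.0, sd, (reps, n))   # control arm
                b = rng.normal(eff, sd, (reps, n))   # treated arm
                z = (b.mean(axis=1) - a.mean(axis=1)) / (sd * np.sqrt(2 / n))
                power[i, j, k, l] = np.mean(np.abs(z) > 1.96)

# string the 4-D array out into one row per condition
tall = [(n, p, sd, eff, power[i, j, k, l])
        for (i, n), (j, p), (k, sd), (l, eff)
        in itertools.product(enumerate(ns), enumerate(ps),
                             enumerate(sds), enumerate(effects))]
```

The `tall` rows can be fed straight to a dot chart, one variable per dimension, which is the reshaping step the text describes.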

Consider an ordinal outcome variable Y, and note that an interval-scaled variable is also ordinal. Such response variables can be analyzed with nonparametric tests and correlation coefficients and with semiparametric ordinal models. Most semiparametric models in use involve modeling cumulative probabilities of Y. Resources here deal primarily with the cumulative probability family of semiparametric models^{1}.

^{1} One exception is the continuation ratio ordinal logistic model that is covered in Chapters 13 and 14 of Regression Modeling Strategies. That model is a discrete hazard-based proportional hazards model and applies primarily to a discrete dependent variable Y.

Semiparametric models are regression models having an intercept for each distinct value of Y, less one. They have the following properties:

By encoding the entire empirical cumulative distribution of Y in the intercepts, the model assumes nothing about the shape of the distribution for any given covariate setting. The distribution can be continuous, discontinuous, have arbitrary ‘clumping at zero’, or be bimodal. For example, an ordinal model with a single binary covariate for sex, representing females (the reference category) and males, has intercepts that pertain to females and are easily translated by themselves into a cumulative probability distribution of Y for females.

Because of the previous point, semiparametric models allow one to estimate effect ratios, exceedance probabilities P(Y ≥ y | X), differences in exceedance probabilities (e.g., absolute risk reduction due to treatment as a function of y), quantiles of Y, and, if Y is interval-scaled, the mean of Y.

Semiparametric models do assume that there is a systematic shift in the distribution of Y as one moves from one covariate setting to another. In the single binary covariate female-male example, the regression coefficient for being male completely defines how the distribution of Y for females is shifted to get the distribution for males.

Cox proportional hazards (PH) model: the survival distribution for females (one minus the cumulative distribution of Y) is exponentiated by the hazard ratio HR (anti-log of the regression coefficient) to obtain the survival distribution for males, i.e., S_male(y) = S_female(y)^HR. The two distributions are assumed to be parallel on the log-log scale.

Proportional odds (PO) ordinal logistic regression model: the cumulative distribution for females is shifted by an odds ratio OR (anti-log of the sex regression coefficient) to get the cumulative distribution for males in this way: the odds of Y ≥ y for males are OR times the odds of Y ≥ y for females, for every y. The two distributions are assumed to be parallel on the logit (log odds, log(p/(1-p))) scale.

Contrast semiparametric model assumptions with those of a parametric model. For the Gaussian linear model, parallelism of the normal inverse of two cumulative distribution functions corresponds to an equal variance assumption, and the curves need to be straight (i.e., a Gaussian distribution holds), not just parallel. In addition, one must be confident that Y has been properly transformed, an irrelevant requirement for semiparametric models.

Regression coefficients and intercepts do not change if Y is transformed. Neither do predicted cumulative probabilities or quantiles (e.g., the first quartile of transformed Y is the transformation of the first quartile of untransformed Y). Only predicted means are not preserved under transformation.

Since only the rank order of Y is used, the models are robust to outliers in Y (but not in X).

The models work equally well for discrete as for continuous Y. One can have more parameters in the model than the number of observations, due to the large number of intercepts when Y is continuous and has few ties. Since these intercepts are forced to be in order (in the cumulative probability model family we are dealing with), the effective number of parameters estimated is much smaller. Note that the only efficient full-likelihood software implementation at present (besides SAS JMP) for continuous Y in large datasets is the orm function in the R rms package, which can efficiently handle more than 6000 intercepts (one less than the number of distinct values of Y).

Ordinal models allow one to analyze a continuous response variable that is overridden by clinical events. For example, in analyzing renal function measured by serum creatinine, one could override the creatinine measurement with any arbitrary number higher than the highest observed creatinine when a patient requires dialysis. This clumping at the rightmost value of the distribution presents no problems to an ordinal model.
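A tiny numerical illustration of why the override value is arbitrary: rank-based (ordinal) methods use only the ordering of Y, so any override above the observed maximum gives the identical analysis. The creatinine values below are hypothetical, and the sketch is in Python for illustration.

```python
# Sketch: override creatinine with an arbitrary value beyond the observed
# range for patients who required dialysis; only the ranks of Y matter.
import numpy as np

creat    = np.array([0.9, 1.4, 2.2, 3.1, 5.0])      # hypothetical values
dialysis = np.array([False, False, False, True, True])

def overridden_ranks(override):
    y = creat.copy()
    y[dialysis] = override                     # any value > creat.max() works
    order = np.argsort(y, kind="stable")
    return np.argsort(order, kind="stable")    # ranks: all an ordinal model uses
```

Two arbitrary override choices, say 99 and 10^6, produce identical ranks, so an ordinal analysis is unaffected by the choice.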

When one uses ordinary regression to analyze a response that is ordinal but not interval scaled, bad things happen.

The most popular semiparametric models are the Cox PH model and the PO ordinal logistic model. Cox developed a partial likelihood method so that the intercepts could be estimated separately and only the regression coefficients need to be optimized. Other semiparametric models use a full likelihood approach that estimates intercepts simultaneously with regression coefficients.

Particular Models

Some special cases of semiparametric models are as follows.

The log-rank test is a special case of the Cox PH model, and so it must assume proportional hazards.

The Wilcoxon-Mann-Whitney two-sample rank-sum test and the Kruskal-Wallis test are special cases of the proportional odds model^{2}.

^{2} Multiple explanations for the Wilcoxon test assuming proportional odds are given here

In randomized clinical trials, researchers often seek to avoid making a PO assumption by dichotomizing ordinal Y. The costs of this simplification are having to count events of vastly different severities as if they were the same (e.g., hospitalization same as death) and greatly reduced power (equivalently, a much larger sample size to achieve the same power as a PO analysis). In general, designers do not weigh costs of oversimplification but only see costs of model assumptions. Dichotomization of Y involves a severe data assumption. One of the blog articles discusses the ramifications of the PO assumption in detail.

Example Model

Consider a discrete case with response variable Y representing pain levels of none, mild, moderate, severe (Y = 0, 1, 2, 3). Let the covariates be an indicator variable for sex X1 (0=female, 1=male) and treatment X2 (0=control, 1=active). A PO model for P(Y ≥ j | X) could be

logit P(Y ≥ j | X) = α_j + β1 X1 + β2 X2, for j = 1, 2, 3

The male:female OR for Y ≥ j, for any j, is exp(β1), and the active:control OR is exp(β2).

Consider probabilities of outcomes for a male on active treatment. The Xβ part of the model is -0.9. The probabilities of outcomes of level j or worse are as follows:

j   Meaning              log odds(Y ≥ j)   P(Y ≥ j)
1   any pain                  0.1            0.52
2   moderate or severe       -0.9            0.29
3   severe                   -1.9            0.13

The probability that the pain level will be moderate is 0.29 - 0.13 = 0.16. The probability of being pain free is 1 - 0.52 = 0.48.
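The calculations above can be verified numerically. This small sketch (in Python, for illustration) converts the cumulative log odds to the quoted probabilities:

```python
# Sketch: convert cumulative logits to outcome probabilities in a PO model.
import math

def expit(x):
    """Inverse logit: 1 / (1 + exp(-x))."""
    return 1 / (1 + math.exp(-x))

cum = {1: expit(0.1), 2: expit(-0.9), 3: expit(-1.9)}  # P(Y >= j)
p_moderate  = cum[2] - cum[3]   # P(Y = moderate): difference of exceedances
p_pain_free = 1 - cum[1]        # P(Y = none)
```

Rounding to two digits reproduces 0.52, 0.29, 0.13 for the exceedance probabilities and 0.16 and 0.48 for moderate pain and pain-free, respectively.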

A model for continuous Y would look the same, just with many more intercepts α_j.

Tutorials and Course Material

For a gentle introduction see Section 7.6 in Biostatistics for Biomedical Research, below.

Power calculations tailored to the proportional odds model: Section 7.8.3

Bayesian logistic model: Section 6.10.3

Arguments In Favor of Wider Use of the PO Model

For reasons that can only be explained by unfamiliarity, many reviewers question the reliance on ordinal model assumptions even when they do not insist on verification of equal variance, normality, or proportional hazards assumptions in other settings. Some arguments to assist in interactions with such reviewers are the following.

All statistical methods have assumptions

t-test: normality, equal variance; QQ plots: two parallel straight lines of normal inverse cumulative distributions

PO model: two parallel any shape curves of logit of cumulative distributions

Assumptions of the PO model are far less stringent than those of the t-test (PO uses only the rank ordering of Y)

Assumptions

The PO assumption is a model assumption that may or may not hold

Splitting an ordinal outcome at a cutpoint is a data assumption that is known not to hold

assuming e.g. hospitalization is the same as death

Interest is often in whether patients improve, i.e., whether there is a shift in their outcome distribution from being on different treatments. Interest is often not confined to whether a single threshold was achieved.

Just as with the proportional hazards assumption, a PO analysis can provide meaningful summary ORs even when PO is violated, unless the violation is in the form of a major reversal in the direction of the treatment effect for certain cutoffs (where the sample sizes support cutoff-specific estimation)

Examining treatment effects over all cutoffs of the outcome for examining the PO assumption is not reliable (wide confidence intervals for ORs)

Designing a study to be able to test/estimate a specific cutoff’s OR requires much larger sample size than using the whole ordinal spectrum

PO = Wilcoxon test when there is no covariate adjustment; the PO model is an extension of the Wilcoxon test that handles covariates

Wilcoxon test is an accepted way to assess whether one treatment has higher patient responses than the other, when the response is continuous or ordinal

Scaling the Wilcoxon statistic to [0,1] yields a concordance probability (which is the Mann-Whitney statistic)

Probability that a randomly chosen patient on treatment B has a higher value than a randomly chosen patient on treatment A

Over a wide variety of simulated clinical trials the R² between the PO model log OR and the logit of the concordance probability is 0.996

Mean absolute error in computing the Wilcoxon statistic (scaled to [0,1]) from the OR is 0.008

Simple conversion formula: concordance probability ≈ OR^0.65 / (1 + OR^0.65)

The numerator of the score statistic from the PO model is identical to the Wilcoxon statistic

Treatment OR is interpreted just as an OR from a binary outcome; one only needs to pool outcome levels

E.g. the OR for Y ≥ j vs. Y < j for a given cutoff j

Side-by-side stacked bar charts

proportion of patients at each level of Y for each treatment

same but from the model for a specific covariate setting
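The OR-to-concordance conversion mentioned above can be sketched in a couple of lines (Python here for illustration; the 0.65 exponent is the approximation derived later in this post):

```python
def or_to_c(odds_ratio):
    """Approximate concordance probability (Wilcoxon statistic scaled to
    [0, 1]) from a proportional odds model odds ratio:
    c ~ OR^0.65 / (1 + OR^0.65)."""
    return odds_ratio ** 0.65 / (1 + odds_ratio ** 0.65)

# OR = 1 (no treatment effect) corresponds to c = 0.5;
# larger ORs push c toward 1, smaller ORs toward 0.
print(or_to_c(1.0))   # 0.5
print(or_to_c(2.5))   # roughly 0.64
```

The mapping is monotonic, so the OR and the scaled Wilcoxon statistic always order treatments the same way.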

Testing the PO Assumption

Peterson and Harrell (1990), Applied Statistics 39:205-217, showed that the score test for proportional odds is anti-conservative, i.e., can have highly inflated type I error from p-values being too small^{3}. As Peterson and Harrell state on p. 215:

^{3} This was pointed out to the author of SAS PROC LOGISTIC in 1990, and the author elected to ignore this finding in implementing the very problematic score test in the procedure.

The simulations reveal that both tests often give blatantly erroneous results when the cross-tabulation table for the response variable by an explanatory variable contains empty cells at an inner value of Y. Less blatant, but still suspicious, results are occasionally obtained if the table suffers from a general sparseness of cell sizes. The score test also suffers if the number of observations at one of the levels of Y is small relative to the total sample size.

Peterson and Harrell did not study operating characteristics of the likelihood ratio (LR) test for proportional odds, but in general LR tests have better performance than Wald and score tests^{4}. See this, which emphasizes the importance of assessing the (minimal) impact of the PO assumption rather than testing it, for details.

^{4} See this paper by Andrew Thomas who studied the likelihood ratio test for PO.

Relaxing the PO Assumption

Peterson and Harrell (1990) developed the partial PO model to relax the PO (equal slopes / parallelism) assumption for one or more predictors. See Software below for implementations.

Paper exemplifying game playing in choice of RCT endpoint and how things go seriously wrong with time to symptom resolution as an endpoint (authors had to ignore emergency department visits)

lrm function for discrete Y or continuous Y having not more than a couple of hundred distinct levels, only implements the logit link

orm function is intended for continuous Y and efficiently handles up to several thousand distinct values; multiple link functions are handled

VGAM package for discrete Y implements a huge number of link functions and allows for general relaxation of proportional odds and related assumptions, allowing one to fit the partial proportional odds model

blrm function implements Bayesian proportional odds, partial PO, and constrained partial PO models with optional random intercepts for Y with up to perhaps 200 distinct values (execution time is linear in the number of distinct Y)

Hmisc package popower and posamsize functions for power and sample size calculations for unadjusted PO comparisons and the Wilcoxon test

Other

All major statistical software packages have a proportional odds model. The partial PO model is probably only implemented in R, and only SAS JMP has an efficient implementation like rms::orm for continuous Y.

]]>2022endpointsordinalregressionhttps://fharrell.com/post/rpo/index.htmlThu, 05 May 2022 05:00:00 GMTDecision curve analysis for quantifying the additional benefit of a new markerEmily Vertosick and Andrew Vickers
https://fharrell.com/post/addmarkerdca/index.html

Decision curve analysis for quantifying the additional benefit of a new marker

Decision curve analysis is a method to quantify the clinical utility of a model. While discrimination, calibration and other metrics can be used to compare the performance of models, these methods do not give us information on whether the application of a model in a real-life setting would be appropriate or beneficial.

For example, take the question of whether using a model including age, sex and cholesterol is better than a model with age and sex alone at identifying patients with chest pain who have a risk of coronary artery disease (CAD) and should undergo cardiac catheterization. To utilize the new model, a blood draw must be performed to measure the patient’s cholesterol. However, if using the new model with cholesterol could help ensure that the patients with CAD are treated while allowing patients without CAD to avoid the invasive cardiac catheterization procedure, then requiring a blood draw may be reasonable. Decision curves allow us to assess the risk-benefit balance associated with adding cholesterol to the age and sex model.

The x-axis is “threshold probability” while the y-axis is “net benefit”. The threshold probability is the risk level that would prompt an intervention. For example, a threshold probability of 10% would indicate that anyone who had a risk from the model of ≥10% would receive the intervention. Net benefit is calculated as true positives minus false positives, where the latter is given a weighting factor related to the relative benefits and harms of each. Generally in medicine, it is more beneficial to have a true positive (e.g. find a cancer) than it is harmful to have a false positive (e.g. an unnecessary biopsy). Hence false positives are often downweighted in the net benefit calculation. The weights are actually derived from the threshold probability: for more on net benefit and its calculation and interpretation, see www.decisioncurveanalysis.org.

Interpretation of the decision curve comes down to comparing the net benefit between the models and for the default strategies of intervening on all or no patients for the threshold probability of interest. Choosing the threshold probability of interest depends on the trade-off between the risk of the test and the intervention and the preferences of the doctor and patient.

For a younger, healthier patient, a doctor may be more willing to intervene and perform a cardiac catheterization at a lower risk threshold, like 15%. A 15% risk threshold implies that missing a case of CAD that requires catheterization is approximately 6 times as harmful as an unnecessary catheterization (odds 85%/15% = 5.7), so we would be willing to accept the risk associated with catheterization in 6 similar patients to avoid missing a diagnosis and treatment for 1 young healthy patient.

On the other hand, an older, sicker patient may be at a higher risk and receive less benefit from cardiac catheterization and would prefer only to undergo catheterization if they are at a relatively high risk, for example, 35% or higher. Hence we should really only show the decision curve for the range of 15% to 35%.

Up to a risk threshold of about 20%, there is only a very slight difference between the age and sex model and the age, sex and cholesterol model – they both pretty much overlap with the “treat all” line. At higher threshold probabilities, there appears to be a slight benefit associated with adding cholesterol to the model, because the purple line for the cholesterol model indicates a higher net benefit than the model without.

We can also create a corresponding plot where the y-axis is “net reduction in interventions” rather than net benefit. This shows the number of patients who would avoid catheterization – holding the number of cases of CAD found constant – using the cholesterol model instead of age and sex alone.

Below is a table which shows the net benefit, difference in net benefit and net difference in interventions avoided for threshold probabilities from 5-50% (we’ve gone a bit wider than 15% to 35% for the purposes of illustration). A net difference in interventions avoided of 1 means that, if we decide on catheterization based on the model including cholesterol rather than age and sex alone, we will have to do one fewer catheterization for every 100 patients, while keeping the number of cases of CAD found constant.
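Both quantities are simple to compute. Here is a sketch in Python rather than the authors' R code; it uses the standard net benefit formula (true positives minus weighted false positives) and obtains interventions avoided by dividing the difference in net benefit by the odds at the threshold, which reproduces e.g. the 25% row of the table:

```python
import numpy as np

def net_benefit(y, risk, pt):
    """Net benefit at threshold probability pt:
    TP/N - (pt / (1 - pt)) * FP/N, where patients with risk >= pt are
    treated. The weight pt/(1-pt) encodes the harm of a false positive
    relative to the benefit of a true positive."""
    y, risk = np.asarray(y), np.asarray(risk)
    n = len(y)
    treat = risk >= pt
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    return tp / n - fp / n * pt / (1 - pt)

def interventions_avoided(nb_new, nb_base, pt, per=100):
    """Net reduction in interventions per `per` patients when deciding with
    the new model instead of the base model, holding cases found constant."""
    return (nb_new - nb_base) / (pt / (1 - pt)) * per

# Check against the 25% row of the table: NB difference 0.0068 -> ~2.04
print(round(interventions_avoided(0.5545, 0.5477, 0.25), 2))  # 2.04
```

In practice these are computed over a grid of thresholds and plotted, which is exactly what a decision curve is.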

| Threshold | Net Benefit (Age+Sex) | Net Benefit (Age+Sex+Cholesterol) | Difference in Net Benefit | Net Difference in Interventions Avoided per 100 patients |
|-----|--------|--------|--------|--------|
| 5%  | 0.6420 | 0.6420 | 0.0000 | 0.0443 |
| 10% | 0.6221 | 0.6222 | 0.0001 | 0.1329 |
| 15% | 0.6000 | 0.6006 | 0.0005 | 0.3100 |
| 20% | 0.5752 | 0.5752 | 0.0000 | 0.0000 |
| 25% | 0.5477 | 0.5545 | 0.0068 | 2.037  |
| 30% | 0.5150 | 0.5321 | 0.0171 | 3.986  |
| 35% | 0.4902 | 0.5085 | 0.0183 | 3.404  |
| 40% | 0.4616 | 0.4814 | 0.0198 | 2.967  |
| 45% | 0.4380 | 0.4512 | 0.0132 | 1.609  |
| 50% | 0.4101 | 0.4181 | 0.0080 | 0.7972 |

What this table shows is that (with the exception of the 20% threshold, an effect we can likely attribute to statistical noise) the model including cholesterol leads to a net difference in interventions. However, at some thresholds, e.g. 10 or 15%, the effect is very small. A net difference of 0.31 means that we would have to apply the cholesterol model rather than the age and sex model to over 300 patients to avoid one catheterization procedure.

This illustrates the value of net benefit for quantifying the benefit of additional markers. Most patients have a cholesterol available, and even if someone doesn’t, it is a quick, virtually painless and inexpensive test. So yes, we would probably take 300 cholesterol measurements to avoid one cardiac catheterization. This is likely not true for a more expensive, inconvenient or invasive test. For instance, it is unlikely that we would be willing to subject over 300 patients to an MRI to prevent one undergoing a cardiac catheterization.

Compare this to other metrics for quantifying the added value of a predictor in a model, such as explained variation or R². What value of the ratio of variances of the predicted values would indicate that we should use cholesterol or MRI? What increment in AUC would be worthwhile?

Decision curve analysis provides a clinically interpretable metric because it is given in clinical terms: cases of CAD found or cardiac catheterizations avoided.

The Wilcoxon-Mann-Whitney two-sample rank-sum test is a nonparametric test that is much used for comparing two groups on an ordinal or continuous response Y in a parallel group design. Here is how the Wilcoxon statistic is computed and is related to other measures.

The Wilcoxon statistic is the sum of the ranks for those observations belonging to the second group

The Mann-Whitney statistic U equals W - n2(n2 + 1)/2, where W is the Wilcoxon rank sum and n2 is the number of observations in the second group

The concordance probability c, also called the probability index, is U divided by the product of the two groups' sample sizes. c treats ties as having a concordance of ½, which stems from the use of midranks for ties in computing W. c is the probability that a randomly chosen subject from the second group has a response Y that is larger than that of a randomly chosen subject from the first group, plus ½ times the probability that they are tied on Y. c is just the Wilcoxon statistic re-scaled to [0, 1].
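These relationships are easy to verify numerically. A small sketch in Python (illustrative data; midranks handle the tie):

```python
import numpy as np
from scipy.stats import rankdata

y1 = np.array([1, 2, 3, 4])   # first group  (n1 = 4)
y2 = np.array([3, 5, 6])      # second group (n2 = 3)
n1, n2 = len(y1), len(y2)

ranks = rankdata(np.concatenate([y1, y2]))  # midranks for ties
W = ranks[n1:].sum()          # Wilcoxon statistic: rank sum of 2nd group
U = W - n2 * (n2 + 1) / 2     # Mann-Whitney statistic
c = U / (n1 * n2)             # concordance probability, in [0, 1]

# c equals the proportion of (group 1, group 2) pairs in which the
# group 2 value is larger, counting ties as 1/2
c2 = np.mean([1.0 if b > a else 0.5 if b == a else 0.0
              for a in y1 for b in y2])
assert c == c2
print(W, U, c)                # 16.5 10.5 0.875
```

The pairwise computation makes the interpretation of c concrete: it directly counts how often a subject in the second group beats a subject in the first.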

The proportional odds (PO) ordinal logistic model is a generalization of the Wilcoxon-Mann-Whitney two-sample rank-sum test that allows for covariates. When there are no covariates, the Wilcoxon test and PO model are equivalent in two ways:

The numerator of the Rao efficient score test for comparing two groups in the unadjusted PO model, which tests for an odds ratio (OR) of 1.0, is identical to the Wilcoxon test statistic

Over a wide variety of datasets exhibiting both PO and non-PO, the R² for predicting the logit of c from the PO log(OR) is 0.996

This has been explored in two of my blog articles:

Those unfamiliar with the theory of linear rank statistics will sometimes make the claim that the Wilcoxon test does not assume proportional odds. That this is not the case can be seen in multiple ways:

The OR from the PO model is almost exactly a simple monotonic function of the Wilcoxon statistic (the main topic of this article)

The PO model score test and the Wilcoxon statistic are mathematically equivalent (see above) just as the score test from the Cox proportional hazards model is the same as the logrank test statistic

As explained here by Patrick Breheny, linear rank tests such as the Wilcoxon test are derived in general by solving for the locally most powerful test. Each type of linear rank test comes from a generating distribution. Here are some examples:

Normal generating distribution: the optimal weight for the rank of the response variable is the expected value of the corresponding order statistic from a standard normal distribution (Fisher-Yates normal scores) or their approximation (van der Waerden scores). The semiparametric model counterpart is the probit ordinal regression model

Double exponential (Laplace) distribution: the observation weight is the sign of the centered rank (Mood median test or sign test)

Logistic distribution: optimal weights are the ordinary ranks. The logistic generating distribution means the two distributions are parallel on the inverse logistic (logit) scale, i.e., PO holds

So the Wilcoxon test is designed for a logistic distribution/PO situation, i.e., that is where it has the most power. Hence it is fair to say that the Wilcoxon test assumes PO. The statement that the Wilcoxon test does not assume PO in order to validly compute p-values is misleading; under the null hypothesis the treatment is irrelevant so the data must operate under proportional odds. They must also simultaneously operate under proportional hazards and a wide variety of other model assumptions. Since treatment doesn't affect the distribution of Y under the null, there are no distribution shift assumptions.

The Wilcoxon two-sample test, invented in 1945, is embraced by most statisticians and clinical trialists, and it doesn’t matter to them whether the test assumes PO or not. Our argument is that since the unadjusted PO model is equivalent to the Wilcoxon test, any reviewer accepting of the Wilcoxon test should logically be accepting of the PO model (invented in 1967). The only possible criticism would be that adjusting for covariates that do not satisfy the PO assumption would ruin the assessment of the treatment effect in the PO model. That this is not the case is demonstrated by an extreme non-PO example here.

In this report I go a step further than earlier blog articles: I repeat the simulations with more repetitions and many more conditions, and stratify the results by the degree of non-proportional odds exhibited in each random sample. Random trials were simulated for sample sizes 20, 25, 30, 40, 50, 60, …, 100, 150, 200, 500, 1000. For each trial, 0:1 group assignments were generated such that the number of subjects in the first treatment group is a random fraction of n rounded to the nearest integer, the fraction being drawn from a uniform distribution. Ordinal responses Y were generated in five ways by using combinations of the following two aspects:

More continuous vs. more discrete Y

sampling with replacement from the integers 1 to n for the current sample size n

sampling with replacement from the integers 1 to m where for each trial m is randomly chosen from the integers 4, 5, 6, …, 10

Equal vs. unequal sampling probabilities vs. normal distributions with unequal variances

Equal sampling probabilities for both groups and all levels of Y. This is for a null case where there are no true group differences, and will generate samples showing non-PO only for smaller trials

Unequal sampling probabilities, allowing arbitrarily large (or null) treatment effects and arbitrarily large (or small) non-PO for all sample sizes. This is done by taking a random sample of size n or m from a uniform distribution, taking these as the multinomial probabilities for Y in the first group, sampling n0 Y from these unequal probabilities. Then repeat the process independently for the second group with n1 observations. The two sets of multinomial probabilities are disconnected, allowing arbitrarily large non-PO.

Sampling from two normal distributions with varying true differences in means and varying ratios of standard deviations. For this case, large non-PO occurs when the SDs are much different. Trial data for the number of participants assigned to the first group are simulated from a normal distribution with mean μ and SD σ, where μ is a draw from a uniform distribution ranging over -1.5 to 1.5 and σ is a draw from a uniform distribution ranging over 0.4 to 3.0. Then new single draws are made for μ and σ for the second sample, which is similarly drawn from a normal distribution. Both sets of sample values are multiplied by ten and rounded to the nearest integer. For this type of random number generation, n vs. m is ignored so there are more simulations for this third type.

One hundred trials are run for each sample size and for each of five combinations. This process generates many configurations of ties and distinct values of Y, degrees of non-proportionality, and treatment allocation ratios.

Quantifying the Departure from PO

For a given sample the degree of non-PO is quantified using the following steps:

Compute the empirical cumulative distribution function (ECDF) of Y stratified by group assignment

Evaluate these ECDFs at the combined set of distinct Y values occurring in the data for either group

When an ECDF value is outside [0.02, 0.98], set it to the nearest boundary

Take the logits of the two ECDFs

Compute over the grid of combined distinct values the difference between the two logit CDFs to examine parallelism. Parallel curves on the logit scale indicate PO.

Quantify the non-parallelism by taking Gini’s mean difference of all the differences (a robust competitor of the standard deviation that is the mean of all pairwise absolute differences). A low value indicates parallelism. The lowest possible value of 0.0 indicates equidistant logit ECDFs across all values of Y.

This procedure is virtually the same as computing empirical log odds ratios for all possible cutoffs of Y and looking at their variation. It differs only in how differences are computed for a cutpoint that is outside the data for one of the groups. It may give too much weight to unstable ORs, so also compute a second measure of non-PO, not using any extrapolation, that is a weighted standard deviation of log ORs over all cutoffs of Y, with weights equal to the inverse of the estimated variance of the log odds ratios. This index can be computed whenever there are at least two cutpoints having neither 0.0 nor 1.0 proportions in either group.

The indexes of non-PO are exemplified by taking samples of size n=50 in a similar way to how the simulation will be run later. The plotted ECDFs for the two groups are on the logit scale. The index of non-parallelism of these two transformed curves appears on each panel. The bottom right panel shows the relationship of the two indexes.

Code

# Function to curtail to [0.02, 0.98] before taking logit
lg <- function(p) qlogis(pmax(pmin(p, 1. - 0.02), 0.02))
# Function to curtail log ORs to [-6, 6]
cu <- function(x) pmin(pmax(x, -6), 6)
# Function to quantify degree of non-proportional odds first by computing the
# Gini's mean difference of the difference between
# two logit of ECDFs.  Quantifies variability of differences over y
# When ECDF is 0 or 1 replace by 0.02, 0.98 so can take logit
# Note that ecdf produces a function and when it is called with an
# x-value that is outside the range of the data the value computed is 0 or 1
# Computes a second index of non-PO by getting a weighted standard deviation of
# all possible log ORs, where weights are inverse of variance of log OR
# Note that when a P(Y>=y) is 0 or 1 the weight is zero because variance is infinite.
npod <- function(y1, y2, pl=FALSE, xlim=range(ys),
                 ylim=range(lg(f1(r)), lg(f2(r))), axes=TRUE) {
  f1 <- ecdf(y1)
  f2 <- ecdf(y2)
  y  <- c(y1, y2)
  r  <- range(y)
  ys <- sort(unique(y))
  # There cannot be non-PO if only 2 levels of y, and if no overlap
  # there is no way to assess non-PO
  if(length(ys) <= 2 || max(y1) <= min(y2) || max(y2) <= min(y1))
    npo1 <- npo2 <- 0.
  else {
    dif  <- lg(f1(ys)) - lg(f2(ys))
    npo1 <- GiniMd(dif)
    lor <- w <- numeric(length(ys) - 1)
    n1  <- length(y1)
    n2  <- length(y2)
    for(j in 2 : length(ys)) {
      y  <- ys[j]
      p1 <- mean(y1 >= y)
      p2 <- mean(y2 >= y)
      if(min(p1, p2) == 0 || max(p1, p2) == 1) lor[j-1] <- w[j-1] <- 0
      else {
        lor[j-1] <- log(p2 / (1. - p2)) - log(p1 / (1. - p1))
        w  [j-1] <- 1. / ((1. / (n1 * p1 * (1. - p1))) +
                          (1. / (n2 * p2 * (1. - p2))))
      }
    }
    npo2 <- sqrt(wtd.var(lor, w, normwt=TRUE))
  }
  if(pl) {
    plot (ys, lg(f1(ys)), type='s', xlab='', ylab='',
          xlim=xlim, ylim=ylim, axes=axes)
    lines(ys, lg(f2(ys)), type='s', col='red')
    text(xlim[1], ylim[2],
         paste0('npo1=', round(npo1, 2), '\nnpo2=', round(npo2, 2)),
         adj=c(0,1))
  }
  c(npo1=npo1, npo2=npo2)
}
getRs('reptools.r',  put='source')   # defines kabl, htmlList
getRs('hashCheck.r', put='source')   # defines runifChanged
n <- 50; n0 <- n1 <- 25
par(mfrow=c(3,4), mar=c(1.5,1.5,.5,.5), mgp=c(.5, .4, 0))
set.seed(368)
z <- matrix(NA, nrow=11, ncol=2)
for(i in 1 : 11) {
  p0 <- runif(n)
  # Note that sample uses prob as weights and they need not sum to 1
  y0 <- sample(1 : n, n0, prob=p0, replace=TRUE)
  p1 <- runif(n)
  y1 <- sample(1 : n, n1, prob=p1, replace=TRUE)
  z[i, ] <- npod(y0, y1, pl=TRUE, xlim=c(0, 50), ylim=c(-4, 4))
}
plot(z[, 1], z[, 2], xlab='npo1', ylab='npo2')

Examine how much non-PO is present in the simulated samples as a function of the sample size and the sampling strategy. Show this for two different non-PO measures, and see how the measures relate to each other.

To derive the approximating equation for computing the concordance probability c, use robust regression to predict the logit of c from the PO log(OR). c is curtailed to [0.02, 0.98] before taking the logit so as not to allow infinite estimates. The logit is the chosen transformation because it puts c on an unrestricted scale, just as the log odds ratio is. By good fortune (or some unknown theoretical argument) this happens to yield almost perfect linearity.

Quadratic and cubic polynomials were tried on the robust regression fit, with no improvement in R² or mean absolute prediction error.

Code

g <- function(beta, concord, subset=1:length(beta)) {
  require(MASS)
  beta    <- beta[subset]
  concord <- concord[subset]
  i       <- ! is.na(concord + beta)
  concord <- concord[i]
  beta    <- beta[i]
  f <- rlm(lg(concord) ~ beta)
  w <- ggfreqScatter(cu(beta), lg(concord), bins=150, g=20, ylab='logit(c)',
                     xlab=expression(paste('Curtailed ', hat(beta)))) +
       geom_abline(intercept=coef(f)[1], slope=coef(f)[2])
  print(w)
  pc  <- plogis(fitted(f))
  dif <- abs(concord - pc)
  w <- c(mean(dif, na.rm=TRUE), quantile(dif, 0.9, na.rm=TRUE),
         cor(pc, concord)^2)
  names(w) <- c('MAD', 'Q9', 'R2')
  list(Stats=w, Coefficients=coef(f))
}
w <- g(beta, cstat)

MAD is the mean absolute difference between predicted and observed c, and Q9 is the 0.9 quantile of the absolute errors. Both measures are computed on the [0,1]-scaled c. The intercept is virtually zero and the regression coefficient of the log(OR) is 0.6453. Our approximation equation for computing the scaled Wilcoxon statistic c from the PO model estimate of the OR is derived as follows: logit(c) ≈ 0.6453 log(OR), so that c ≈ OR^0.6453 / (1 + OR^0.6453).

From here on the constant 0.65 will be used. Now examine the relationship on the concordance probability scale. The scatterplot uses colors to denote the frequency of occurrence of nearly coincident points. The quality of the approximation on this scale is summarized by the mean absolute difference between c and its approximation from the OR.

Code

ac <- function(b) {
  or <- exp(b)
  (or ^ 0.65) / (1 + or ^ 0.65)
}
ad  <- abs(cstat - ac(beta))
h   <- function(x) round(mean(x), 4)
MAD <- ad
s   <- summary(MAD ~ Ydiscrete + Type, fun=h)
print(s, markdown=TRUE)

MAD, N = 8400

|  | N | MAD |
|---|---|---|
| Ydiscrete: Discrete Y | 4200 | 0.0049 |
| Ydiscrete: Semi-Continuous Y | 4200 | 0.0038 |
| Type: Common Probabilities | 2800 | 0.0016 |
| Type: Normal, Unequal Var. | 2800 | 0.0081 |
| Type: Unequal Probabilities | 2800 | 0.0033 |
| Overall | 8400 | 0.0043 |

Code

xl <- expression(OR ^ 0.65 / (1 + OR ^ 0.65))
yl <- 'Concordance Probability c'
ggfreqScatter(ac(beta), cstat, xlab=xl, ylab=yl) +
  geom_abline(intercept=0, slope=1, alpha=.1, size=.2)

The points that are more consistent with a curved relationship are mostly singletons or frequency 2-4.

Repeat the last graph stratified by intervals of study sample sizes.

| linear npo R² | spline npo R² | linear npo2 R² | spline npo2 R² | npo + npo2 R² |
|---|---|---|---|---|
| 0.291 | 0.303 | 0.257 | 0.294 | 0.353 |

Code

dd <- datadist(npo, npo2); options(datadist='dd')
label(npo) <- 'Degree of Non-PO'
f <- ols(ad ~ rcs(npo, 5))
ggplot(Predict(f), ylab='Mean |error|', xlab='Degree of Non-PO',
       rdata=data.frame(npo), ylim=c(0, 0.1),
       histSpike.opts=list(frac=function(f) 0.01 + 0.02 * f / (max(f, 2) - 1),
                           side=1, nint=100))

The worst MAD is estimated to be around 0.02 and the relationship steepens around npo=0.5. Even though the best model for predicting MAD uses nonlinear functions of both non-PO indexes, for simplicity let’s use the stronger of the two, npo, for key results.

Earlier plots demonstrate the practical equivalence of the no-covariate PO model and the Wilcoxon test, as the points hover about the line of identity. Here is a summary of the 6 out of 8400 simulated trials for which the discrepancy between predicted and actual c was greatest.

The discrepant cases are primarily from smaller unbalanced trials with many ties in Y and non-PO. Most importantly, even in the most discrepant datasets there is complete agreement between the PO model and the Wilcoxon test on which group has the higher response tendency: both approaches yield estimates on the same side of their null values in 8400 out of 8400 trials. The Wilcoxon statistic and the PO model estimate also agree completely in their judgments of equality of treatments: agreement between c being within a small tolerance of 0.5 and the log OR being within a small tolerance of 0.0 occurred in 8400 out of 8400 trials.

Note that under the null, PO must hold, so no simulations are needed to compute type I error. The null hypothesis implies that the treatment is ineffective everywhere.

Now go a step further and stratify results by intervals of the non-PO metric.

It can be seen that the extremely tight relationship between the PO OR and the Wilcoxon statistic is unaffected by the amount of non-PO exhibited in the sample.

Repeat these plots using the second non-PO measure.

To explore the data patterns that corresponded to the strongest PO violations according to the first non-PO measure in the lower right panel, here are the logit transformed ECDFs for those 52 trials. On each panel the total sample size and group allocation ratios are shown. These large non-PO cases are mainly from smaller trials with heavy ties in Y. The first 42 of 52 trials are shown.

The unadjusted proportional odds model's odds ratio estimate almost perfectly reflects the Wilcoxon test statistic regardless of the degree of non-proportional odds and sample size. A simple formula allows for conversion between the two, and even under severe non-PO the mean absolute error in estimating c from the OR is 0.008. Importantly, the PO results and the Wilcoxon statistic never disagree on the direction of the treatment effect, and they never disagree about the exact equality of treatments, i.e., OR = 1.0 if and only if there is complete overlap in the two groups as indicated by c = 0.5, with the Wilcoxon p-value being 1.0.

R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] MASS_7.3-58 rms_6.4-0 SparseM_1.81 Hmisc_4.7-2
[5] ggplot2_3.3.5 Formula_1.2-4 survival_3.4-0 lattice_0.20-45

To cite R in publications use:

R Core Team (2022).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
https://www.R-project.org/.

]]>2022endpointsordinaldrug-evaluationhypothesis-testingRCTregressionhttps://fharrell.com/post/powilcoxon/index.htmlWed, 06 Apr 2022 05:00:00 GMTLongitudinal Data: Think Serial Correlation First, Random Effects SecondFrank Harrell
https://fharrell.com/post/re/index.html

Random effects/mixed effects models shine for multi-level data such as measurements within cities within counties within states. They can also deal with measurements clustered within subjects. There are at least two contexts for the latter: rapidly repeated measurements where elapsed time is not an issue, and serial measurements spaced out over time for which time trends are more likely to be important. An example of the first is a series of tests on a subject over minutes when the subject does not fatigue. An example of the second is a typical longitudinal clinical trial where patient responses are assessed weekly or monthly. For the first setup, random effects are likely to capture the important elements of within-subject correlation. Not so much for the second setup, where serial correlation dominates and time ordering is essential.

A random effects model that contains only random intercepts, which is the most common use of mixed effects modeling in randomized trials, assumes that the responses within subject are exchangeable. This can be seen from the statement of the linear mixed effects model with random intercepts. For the ith subject assessed on the jth occasion we have Y_ij = X_ij β + u_i + e_ij, where the random effects u_i might be assumed to have a normal distribution with mean zero and variance σ²_u. The residuals e_ij are irreducible errors assumed to represent white noise and are all independent of one another. The e_ij's don't know any subject boundaries.

Note from the linear mixed model statement that time plays no role in u_i or e_ij. Time may play a role as a fixed effect in X_ij β, but not in the components encoding intra-subject correlation. Shuffling the time order of measurements within subject does not affect the correlation (nor the final parameter estimates, if time is not part of X). Thus the multiple measurements within subject are exchangeable and the forward flow of time is not respected. This induces a certain correlation structure within subject: the compound symmetric correlation structure. A random intercept model assumes that the correlation between any two measurements on the same subject is unrelated to the time gap between the two measurements. Compound symmetry does not fit very well for most longitudinal studies, which instead usually have a serial correlation structure in which the correlation between two measurements wanes as the time gap widens. Serial correlations can be added on top of compound symmetry, but as this is not the default in SAS PROC MIXED this is seldom used in the pharmaceutical industry.

I’ve heard something frightening from practicing statisticians who frequently use mixed effects models. Sometimes when I ask them whether they produced a variogram to check the correlation structure they reply “what’s that?”. A variogram is a key diagnostic for longitudinal models in which the time difference between all possible pairs of measurements on the same subject is played against the covariance of the pair of measurements within subject. These data are pooled over subjects and an average is computed for each distinct time gap occurring in the data, then smoothed. See RMS Course Notes Section 7.8.2 Figure 7.4 for an example.

Faes et al 2009 published a highly useful and intuitive paper on estimating the effective sample size for longitudinal data under various correlation structures. They point out an interesting difference between a compound symmetric (CS) correlation structure and a first-order autoregressive serial correlation structure (AR(1)). Under compound symmetry there is a limit to the information added by additional observations per subject, whereas for AR(1) there is no limit. They explained this thus: “Under CS, every measurement within a cluster is correlated in the same way with all other measurements. Therefore, there is a limit to what can be learned from a cluster and the additional information coming from new cluster members approaches zero with increasing cluster size. In contrast, under AR(1), the correlation wanes with time lag. So, with time gap between measurements tending to infinity, their correlation tends to zero and hence the additional information tends to that of a new, independent observation.”
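Faes et al.'s point can be made concrete numerically. Taking the effective sample size of one cluster of size m as 1'R⁻¹1 (the number of independent observations giving the same precision for a mean; my formulation for illustration, not Faes et al.'s exact estimator), compound symmetry plateaus at 1/ρ as m grows, while AR(1) keeps growing:

```python
import numpy as np

def cs_corr(m, rho):
    """Compound symmetric (exchangeable) m x m correlation matrix."""
    return np.full((m, m), rho) + (1 - rho) * np.eye(m)

def ar1_corr(m, rho):
    """First-order autoregressive correlation matrix: rho^|j-k|."""
    idx = np.arange(m)
    return rho ** np.abs(np.subtract.outer(idx, idx))

def ess(R):
    """Effective sample size of one cluster: 1' R^{-1} 1."""
    return float(np.sum(np.linalg.inv(R)))

rho = 0.5
for m in (2, 10, 100):
    print(m, round(ess(cs_corr(m, rho)), 2), round(ess(ar1_corr(m, rho)), 2))
# CS approaches 1/rho = 2 no matter how large m gets;
# AR(1) grows roughly linearly in m (1 + (m-1)(1-rho)/(1+rho))
```

With ρ = 0.5, ten measurements per subject are worth about 1.8 independent observations under compound symmetry but about 4 under AR(1), matching the intuition quoted above.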

Random intercepts comprise one parameter per subject. Even though the effective number of parameters is smaller than the number of subjects, the large number of parameters results in a computational burden and convergence issues. Random intercept models are extended into random slope-and-intercept models and random shape models, but these entail even more parameters and may be harder to interpret. In addition, random effects models cannot handle an absorbing state, i.e., a level of the response variable such that once a subject reaches it she never recovers from it. When an incorrect correlation structure is assumed, the random effects must compensate, and the effective number of parameters estimated may be large. See this for an example where modeling the correlation structure correctly made the random effects inconsequential.

The first model for longitudinal data was the growth curve model. See Wishart 1938 and Potthoff and Roy 1964. Multivariate normality was assumed and no random effects were used. Generalized least squares is based on these ideas, and can incorporate multiple types of correlation structures without including any random effects. Markov models (see especially here and here for references) are more general ways to incorporate a variety of correlation structures with or without random effects. Markov models are more general because they easily extend to binary, nominal, ordinal, and continuous Y. They are computationally fast and require only standard frequentist or Bayesian software until one gets to the post-model-fit stage of turning transition probabilities into state occupancy probabilities. A first-order Markov process models transitions from one time period to the next, conditioning the transition probabilities on the response at the previous period as well as on baseline covariates. To be able to use this model you must have the response variable assessed at baseline. Responses at previous time periods are treated exactly like covariates in the period-to-period transition models. Semiparametric models are natural choices for modeling the transitions, allowing Y to be binary, ordinal, or continuous. Multiple absorbing states can be handled. Once the model is fitted, one uses a recursive matrix multiplication to uncondition on previous states, yielding current status probabilities, also called state occupancy probabilities. An example of a first-order proportional odds Markov transition model that has quite general application to longitudinal data is below. Let Y_t be the response assessed at time period t, t = 1, …, T:

logit Pr(Y_t ≥ y | X, Y_{t-1}) = α_y + Xβ + g(Y_{t-1}, t)

Here the α_y are intercepts, and there are k − 1 of them when Y takes on k distinct values. X is the baseline covariate matrix, and the function g expresses how you want to model the effect of the previous state Y_{t-1}. This may require multiple parameters, all of which are treated just like β. The strength of the effect of Y_{t-1} goes along with the strength of the intra-subject correlation, and involvement of t adds further flexibility in correlation patterns.
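The recursive matrix multiplication that turns transition probabilities into state occupancy probabilities can be sketched as follows. This Python illustration uses a hypothetical 3-state chain with time-constant transition probabilities for simplicity; in a fitted model each row of P would come from the transition model and could vary with time and covariates:

```python
import numpy as np

# Hypothetical transition matrix: rows = previous state, cols = current state
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.0, 0.0, 1.0]])   # state 2 is absorbing

occ = np.array([1.0, 0.0, 0.0])   # everyone starts in state 0 at baseline
for t in range(1, 5):
    occ = occ @ P                 # uncondition on the previous state
    print(t, occ.round(3))        # state occupancy probabilities at time t
```

Each row of occupancy probabilities sums to 1, and the absorbing-state probability is nondecreasing over time.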

Generalized estimating equations (GEE) is a flexible way to model longitudinal responses, but it has some disadvantages: it is a large sample approximate method; it does not use a full likelihood function so cannot be used in a Bayesian context; not being full likelihood the repeated observations are not properly “connected” to each other, so dropouts and missed visits must be missing completely at random, not just missing at random as full likelihood methods require. Generalized least squares, Markov models when no random effects are added, and GEE are all examples of marginal models, marginal meaning in the sense of not being conditional on subject so not attempting to estimate individual subjects’ trajectories.

Mixed effects conditional (on subject) models are indispensable when a goal is to estimate the outcome trajectory for an individual subject. When the goal is instead to make group level estimates (e.g., treatment differences in trajectories) then one can do excellent analyses without using random effects. Above all, don’t default to only using random intercepts to handle within-subject correlations of serial measurements. This is unlikely to fit the correlation structure in play. And it will not lead to the correct power calculation for your next longitudinal study.

]]>drug-evaluation, endpoints, measurement, RCT, regression, 2022
https://fharrell.com/post/re/index.html
Tue, 15 Mar 2022 05:00:00 GMT

Assessing the Proportional Odds Assumption and Its Impact
Frank Harrell
https://fharrell.com/post/impactpo/index.html

Reviewers who do not seem to worry about the proportional hazards assumption in a Cox model or the equal variance assumption in a t-test seem to worry a good deal about the proportional odds (PO) assumption in a semiparametric ordinal logistic regression model. This is in spite of the fact that proportional hazards and equal variance in other models are exact analogies to the PO assumption. Furthermore, when there is no covariate adjustment, the PO model is equivalent to the Wilcoxon test, and reviewers do not typically criticize the Wilcoxon test or realize that it has optimum power only under the PO assumption.

The purpose of this report is to (1) demonstrate examinations of the PO assumption for a treatment effect in a two-treatment observational comparison, and (2) discuss various issues around PO model analysis and alternative analyses using cutpoints on the outcome variable. It is shown that exercises such as comparing predicted vs. observed values can be misleading when the sample size is not very large.

Dataset

The dataset, taken from a real observational study, consists of a 7-level ordinal outcome variable y having values 0-6, a treatment variable trt, and a strong baseline variable baseline defined by a disease scale that is related to y but with more resolution. This is a dominating covariate, and failure to adjust for it will result in a weaker treatment comparison. trt levels are A and B, with 48 patients given treatment B and 100 given treatment A.

y
trt 0 1 2 3 4 5 6
A 4 8 26 19 27 2 14
B 2 11 13 5 13 3 1

Proportional Odds Model

Code

f <- lrm(y ~ trt + baseline, data=d)
f

Logistic Regression Model

lrm(formula = y ~ trt + baseline, data = d)

Frequencies of Responses

0 1 2 3 4 5 6
6 19 39 24 40 5 15

                          Model Likelihood      Discrimination     Rank Discrim.
                                Ratio Test             Indexes           Indexes
Obs                 148   LR χ²        69.26   R²          0.386   C         0.778
max |∂log L/∂β| 4×10⁻¹³   d.f.             2   R²(2,148)   0.365   Dxy       0.556
                          Pr(>χ²)    <0.0001   R²(2,141.3) 0.379   γ         0.571
                                               Brier       0.151   τa        0.449

               β        S.E.    Wald Z   Pr(>|Z|)
y≥1          6.1565    0.6167    9.98    <0.0001
y≥2          4.3821    0.4718    9.29    <0.0001
y≥3          2.5139    0.3600    6.98    <0.0001
y≥4          1.5520    0.3174    4.89    <0.0001
y≥5         -0.3033    0.3150   -0.96     0.3357
y≥6         -0.6738    0.3361   -2.00     0.0450
trt=B       -1.1328    0.3290   -3.44     0.0006
baseline    -0.0888    0.0121   -7.32    <0.0001

Code

summary(f)

Effects   Response: y

              Low  High   Δ    Effect     S.E.    Lower 0.95   Upper 0.95
baseline        4    32   28   -2.48800   0.3397   -3.15300     -1.8220
  Odds Ratio    4    32   28    0.08311             0.04271      0.1617
trt — B:A       1     2        -1.13300   0.3290   -1.77800     -0.4880
  Odds Ratio    1     2         0.32210             0.16900      0.6138

Code

anova(f)

Wald Statistics for y

            χ²    d.f.       P
trt       11.86      1    0.0006
baseline  53.63      1   <0.0001
TOTAL     56.57      2   <0.0001

Volatility of ORs Using Different Cutoffs

Even when the data generating mechanism is exactly proportional odds for treatment, different cutoffs of the response variable Y can lead to much different ORs when the sample size is not in the thousands. This is just the play of chance (sampling variation). To illustrate this point, consider the observed proportions of Y for trt=A as population probabilities for A. Apply an odds ratio of 0.3 to get the population distribution of Y for treated patients. For 10 simulated trials, sample from these two multinomial distributions and compute sample ORs for all Y cutoffs.
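The sampling exercise just described can be sketched in Python (the post's own code is in R): the common OR of 0.3 holds exactly in the population, yet cutoff-specific sample ORs bounce around considerably at these sample sizes.

```python
import numpy as np
from scipy.special import logit, expit

rng = np.random.default_rng(1)

# Observed trt A proportions for Y = 0..6, treated as population probabilities
pA = np.array([4, 8, 26, 19, 27, 2, 14]) / 100
cumA = np.cumsum(pA[::-1])[::-1][1:]          # P(Y >= j), j = 1..6
cumB = expit(logit(cumA) + np.log(0.3))       # apply a common true OR of 0.3
pB = np.r_[1.0, cumB] - np.r_[cumB, 0.0]      # back to cell probabilities

def sample_ors(n_a=100, n_b=48):
    ya = rng.multinomial(n_a, pA)
    yb = rng.multinomial(n_b, pB)
    ors = []
    for j in range(1, 7):                     # OR for each cutoff Y >= j
        a1, a0 = ya[j:].sum(), ya[:j].sum()
        b1, b0 = yb[j:].sum(), yb[:j].sum()
        # 0.5 continuity correction guards against empty cells
        ors.append(((b1 + .5) / (b0 + .5)) / ((a1 + .5) / (a0 + .5)))
    return np.array(ors)

for _ in range(10):
    print(np.round(sample_ors(), 2))
```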

Code

p <- table(d$y[d$trt == 'A'])
p <- p / sum(p)
p   # probabilities for SOC

For discrete Y we are interested in checking the impact of the PO assumption on predicted probabilities for all of the Y categories, while also allowing for covariate adjustment. This can be done using the following steps:

Select a set of covariate settings over which to evaluate accuracy of predictions

Vary at least one of the predictors, i.e., the one for which you want to assess the impact of the PO assumption

Fit a PO model the usual way

Fit models that relax the PO assumption

to relax the PO assumption for all predictors fit a multinomial logistic model

to relax the PO assumption for a subset of predictors fit a partial PO (PPO) model

For all the covariate combinations evaluate predicted probabilities for all levels of Y using the PO model and the relaxed assumption models

Use the bootstrap to compute confidence intervals for the differences in predicted values between a PO model and a relaxed model. This will put the differences in the right context by accounting for uncertainties. This guards against over-emphasis of differences when the sample size does not support estimation, especially for the relaxed model with more parameters. Note that the same problem occurs when comparing predicted unadjusted probabilities to observed proportions, as observed proportions can be noisy.
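The bootstrap step can be sketched generically in Python (hypothetical stand-in estimators only; the real analysis refits the PO and relaxed models on each resample). The essential idea is that both estimates are computed on the same resample so that their correlation is respected when differencing:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 7, size=148)          # hypothetical ordinal outcomes

def est_po(y):                            # stand-in for a PO-model estimate
    return np.mean(y >= 4)

def est_relaxed(y):                       # stand-in for a relaxed-model estimate
    return np.mean(y >= 4) + 0.01 * np.mean(y == 6)

# Percentile bootstrap of the paired difference
diffs = [est_po(yb) - est_relaxed(yb)
         for yb in (y[rng.integers(0, len(y), len(y))] for _ in range(300))]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(round(lo, 4), round(hi, 4))
```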

Level 5 of y has only 5 patients so we combine it with level 6 for fitting the two relaxed models that depend on individual cell frequencies. Similarly, level 0 has only 6 patients, so we combine it with level 1. The PPO model is fitted with the VGAM R package, and the nonpo argument below signifies that the PO assumption is only being relaxed for the treatment effect. The multinomial model allows not only non-PO for trt but also for baseline. See here for impactPO source code.

PO PPO Multinomial
Deviance 395.58 393.10 388.36
d.f. 6 9 12
AIC 407.58 411.10 412.36
p 2 5 8
LR chi^2 69.41 71.89 76.63
LR - p 67.41 66.89 68.63
LR chi^2 test for PO 2.48 7.22
d.f. 3 6
Pr(>chi^2) 0.4792 0.3013
MCS R2 0.374 0.385 0.404
MCS R2 adj 0.366 0.364 0.371
McFadden R2 0.149 0.155 0.165
McFadden R2 adj 0.141 0.133 0.130
Mean |difference| from PO 0.021 0.042
Covariate combination-specific mean |difference| in predicted probabilities
method trt baseline Mean |difference|
1 PPO A 4 0.010
2 PPO B 4 0.033
11 Multinomial A 4 0.032
21 Multinomial B 4 0.052
Bootstrap 0.95 confidence intervals for differences in model predicted
probabilities based on 300 bootstraps
trt baseline
1 A 4
PO - PPO probability estimates
1 2 3 4 5
Lower -0.004 -0.017 -0.058 -0.055 -0.042
Upper 0.008 0.019 0.008 0.081 0.058
PO - Multinomial probability estimates
1 2 3 4 5
Lower 0.002 -0.017 -0.152 -0.105 -0.037
Upper 0.020 0.071 -0.006 0.107 0.133
trt baseline
2 B 4
PO - PPO probability estimates
1 2 3 4 5
Lower -0.043 -0.077 -0.025 -0.191 -0.102
Upper 0.013 0.083 0.197 0.065 0.095
PO - Multinomial probability estimates
1 2 3 4 5
Lower -0.050 -0.025 -0.051 -0.272 -0.143
Upper 0.035 0.147 0.194 0.041 0.095

Comparisons of the PO model fit with models that relax the PO assumption above can be summarized as follows.

By AIC, the model that is most likely to have the best cross-validation performance is the fully PO model (the lower the AIC the better)

There is no evidence for non-PO, either when judging against a model that relaxes the PO assumption for treatment (P=0.48) or against a multinomial logistic model that does not assume PO for any variables (P=0.30).

The McFadden adjusted R² index, in line with AIC, indicates the best fit is from the PO model

The Maddala-Cox-Snell adjusted R² indicates the PO model is competitive. See this for information about general adjusted R² measures.

Nonparametric bootstrap percentile confidence intervals for the difference in predicted values between the PO model and one of the relaxed models take into account uncertainties and correlations of both sets of estimates. In all cases the confidence intervals are quite wide and include 0 (except for one case, where the lower confidence limit is 0.002), which is very much in line with apparent differences being clouded by overfitting (high number of parameters in non-PO models).

These assessments must be kept in mind when interpreting the inter-model agreement between probabilities of all levels of the ordinal outcome in the graphic that follows. According to AIC and adjusted R², the estimates from the partial PO model and especially those from the multinomial model are overfitted. This is related to the issue that odds ratios computed from oversimplifying an ordinal response by dichotomizing it are noisy (also see the next to last section below).

AIC is essentially a forecast of what is likely to happen were the accuracy of two competing models computed on a new dataset not used to fit the model. Had the observational study’s sample size been much larger, we could have randomly split the data into training and test samples and had a head-to-head comparison of the predictive accuracy of a PO model vs. a non-PO (multinomial or partial PO) model in the test sample. Non-PO models will be less biased but pay a significant price in terms of variance of estimates. The AIC and adjusted R² analyses above suggest that the PO model will have lower mean squared errors of outcome probability estimates due to the strong reduction in variance (also see below).

Efficiency of Analyses Using Cutpoints

Clearly, the dependence of the proportional odds model on the assumption of proportionality can be over-stressed. Suppose that two different statisticians would cut the same three-point scale at different cut points. It is hard to see how anybody who could accept either dichotomy could object to the compromise answer produced by the proportional odds model. — Stephen Senn

Above I considered evidence in favor of making the PO assumption. Now consider the cost of not making the assumption. What is the efficiency of using a dichotomous endpoint? Efficiency can be captured by comparing the variance of an inefficient estimate to the variance of the most efficient estimate (which comes from the PO model by using the full information in all levels of the outcome variable). We don’t know the true variances of estimated treatment effects so instead use the estimated variances from fitted PO and binary logistic models.

Code

vtrt <- function(fit) vcov(fit)['trt=B', 'trt=B']
vpo  <- vtrt(f)
w <- NULL
for(cutoff in 1:6) {
  h   <- lrm(y >= cutoff ~ trt + baseline, data=d)
  eff <- vpo / vtrt(h)
  # To discuss later: critical multiplicative error in OR
  cor <- exp(sqrt(vtrt(h) - vpo))
  w <- rbind(w,
             data.frame(Cutoff=paste0('y≥', cutoff),
                        Efficiency=round(eff, 2),
                        `Sample Size Ratio`=round(1/eff, 1),
                        `Critical OR Factor`=round(cor, 2),
                        check.names=FALSE))
}
w

Under PO the odds ratio from the PO model estimates the same quantity as the odds ratio from any dichotomization of the outcome. The relative efficiency of a dichotomized analysis is the variance of the most efficient (PO model) model’s log odds ratio for treatment divided by the variance of the log odds ratio from a binary logistic model using the dichotomization. The optimal cutoff (mainly due to being a middle value in the frequency distribution) is y≥4. For this dichotomization the efficiency is 0.56 (i.e., analyzing y≥4 vs. y is equivalent to discarding 44% of the sample) and the variance of the treatment log odds ratio is 1/0.56 ≈ 1.8 times that of the log odds ratio from the proportional odds model without binning. This means that the study would have to be larger to have the same power when dichotomizing the outcome as a smaller study that did not dichotomize it. Other dichotomizations result in even worse efficiency.

PO Model Results are Meaningful Even When PO is Violated

Overall Efficacy Assessment

Putting aside covariate adjustment, the PO model is equivalent to a Wilcoxon-Mann-Whitney two-sample rank-sum test statistic. The normalized Wilcoxon statistic (concordance probability; also called the probability index) is to within a high degree of approximation a simple function of the estimated odds ratio from a PO model fit. Over a wide variety of datasets satisfying and violating PO, the R² for predicting the log odds ratio from the logit of the scaled Wilcoxon statistic is 0.996, and the mean absolute error in predicting the concordance probability from the log odds ratio is 0.002. See Violation of Proportional Odds is Not Fatal and If You Like the Wilcoxon Test You Must Like the Proportional Odds Model.

Let’s compare the actual Wilcoxon concordance probability with the concordance probability estimated from the odds ratio without covariate adjustment, using the approximation c ≈ OR^0.65 / (1 + OR^0.65).

Code

w <- wilcox.test(y ~ trt, data=d)
w

Wilcoxon rank sum test with continuity correction
data: y by trt
W = 2881, p-value = 0.04395
alternative hypothesis: true location shift is not equal to 0

Code

W <- w$statistic
concord <- W / prod(table(d$trt))

Code

u <- lrm(y ~ trt, data=d)
u

Logistic Regression Model

lrm(formula = y ~ trt, data = d)

Frequencies of Responses

0 1 2 3 4 5 6
6 19 39 24 40 5 15

                          Model Likelihood      Discrimination     Rank Discrim.
                                Ratio Test             Indexes           Indexes
Obs                 148   LR χ²         4.18   R²          0.029   C         0.555
max |∂log L/∂β|  2×10⁻⁷   d.f.             1   R²(1,148)   0.021   Dxy       0.110
                          Pr(>χ²)     0.0409   R²(1,141.3) 0.022   γ         0.247
                                               Brier       0.240   τa        0.088

            β        S.E.    Wald Z   Pr(>|Z|)
y≥1       3.4217    0.4390    7.79    <0.0001
y≥2       1.8302    0.2524    7.25    <0.0001
y≥3       0.4742    0.1948    2.43     0.0149
y≥4      -0.1890    0.1929   -0.98     0.3272
y≥5      -1.6691    0.2561   -6.52    <0.0001
y≥6      -1.9983    0.2858   -6.99    <0.0001
trt=B    -0.6456    0.3174   -2.03     0.0420

Note that the C statistic in the above table handles ties differently than the concordance probability we are interested in here.

Code

or <- exp(-coef(u)['trt=B'])
cat('Concordance probability from Wilcoxon statistic: ', concord, '\n',
    'Concordance probability estimated from OR: ', or^0.65 / (1 + or^0.65), '\n', sep='')

Concordance probability from Wilcoxon statistic: 0.6002083
Concordance probability estimated from OR: 0.6033931

In the absence of adjustment covariates, the treatment odds ratio estimate from a PO model is essentially the Wilcoxon statistic whether or not PO holds. Many statisticians are comfortable with using the Wilcoxon statistic for judging which treatment is better overall, e.g., which treatment tends to move responses towards the favorable end of the scale. So one can seldom go wrong in using the PO model to judge which treatment is better, even when PO does not hold.

Simulation Study of Effect of Adjusting for a Highly Non-PO Covariate

What if the treatment operates in PO but an important covariate strongly violates its PO assumption? Let’s find out by simulating a specific departure from PO for a binary covariate. For a discrete ordinal outcome y with levels 0,1,…,6 let the intercepts corresponding to y≥1,…,y≥6 be α = (4.4, 2.6, 0.7, −0.2, −2, −2.4). Let the true treatment effect be β = −1. The simulated covariate x is binary with a prevalence of 1/2. The true effect of x is an OR of 3.0 on y≥1, y≥2, y≥3, but an OR of 1/3 on y≥4, y≥5, and y≥6. So the initial regression coefficient for x is log(3), and the additional effect of x once y crosses to 4 and above is a decrement in its prevailing log odds by 2 log(3). So here is our model to simulate from:

logit Pr(y ≥ j | tx, x) = α_j + β·tx + log(3)·x − 2 log(3)·x·[j ≥ 4]

Over repeated simulations compare these three estimates and their standard errors:

unadjusted treatment effect

treatment effect adjusted for covariate assuming both treatment and covariate act in PO

treatment effect adjusted for covariate assuming treatment is PO but allowing the covariate to be arbitrarily non-PO

To test the simulation, simulate a very large sample size of n=50,000 and examine the coefficient estimates from the correct partial PO model and from two other models.

Code

sim <- function(beta, n, nsim=100) {
  tx <- c(rep(0, n/2), rep(1, n/2))
  x  <- c(rep(0, n/4), rep(1, n/4), rep(0, n/4), rep(1, n/4))
  # Construct a matrix of logits of cumulative probabilities
  L <- matrix(alpha, nrow=n, ncol=6, byrow=TRUE)
  L[tx == 1, ]   <- L[tx == 1, ] + beta
  L[x == 1, ]    <- L[x == 1, ] + log(3)
  L[x == 1, 4:6] <- L[x == 1, 4:6] - 2 * log(3)
  P <- plogis(L)                    # cumulative probs
  P <- cbind(1, P) - cbind(P, 0)    # cell probs (each row sums to 1.0)
  b <- v <- pv <- matrix(NA, nrow=nsim, ncol=3)
  colnames(b) <- colnames(v) <- colnames(pv) <- c('PPO', 'PO', 'No X')
  y <- integer(n)
  a <- 'tx'
  msim <- 0
  for(i in 1 : nsim) {
    for(j in 1 : n) y[j] <- sample(0:6, 1, prob=P[j, ])
    f <- try(vglm(y ~ tx + x, cumulative(reverse=TRUE, parallel=FALSE ~ x)))
    if(inherits(f, 'try-error')) next
    msim <- msim + 1
    g <- lrm(y ~ tx + x)
    h <- lrm(y ~ tx)
    co <- c(coef(f)[a], coef(g)[a], coef(h)[a])
    vs <- c(vcov(f)[a,a], vcov(g)[a,a], vcov(h)[a,a])
    b[msim, ]  <- co
    v[msim, ]  <- vs
    pv[msim, ] <- 2 * pnorm(-abs(co / sqrt(vs)))
  }
  b  <- b [1:msim,, drop=FALSE]
  v  <- v [1:msim,, drop=FALSE]
  pv <- pv[1:msim,, drop=FALSE]
  bbar   <- apply(b, 2, mean)
  bmed   <- apply(b, 2, median)
  bse    <- sqrt(apply(v, 2, mean))
  bsemed <- sqrt(apply(v, 2, median))
  sd  <- if(msim < 2) rep(NA, 3) else sqrt(diag(cov(b)))
  pow <- if(nsim < 2) rep(NA, 3) else apply(pv, 2, function(x) mean(x < 0.05))
  list(summary=cbind('Mean beta'               = bbar,
                     'Median beta'             = bmed,
                     'Sqrt mean estimated var' = bse,
                     'Median estimated SE'     = bsemed,
                     'Empirical SD'            = sd,
                     'Power'                   = pow),
       sims=list(beta=b, variance=v, p=pv),
       nsim=msim)
}
require(VGAM)
alpha <- c(4.4, 2.6, 0.7, -0.2, -2, -2.4)
set.seed(1)
si <- sim(beta=-1, 50000, 1)
round(si$summary, 4)

Mean beta Median beta Sqrt mean estimated var Median estimated SE
PPO -0.9832 -0.9832 0.0094 0.0094
PO -0.9271 -0.9271 0.0168 0.0168
No X -0.9280 -0.9280 0.0168 0.0168
Empirical SD Power
PPO NA NA
PO NA NA
No X NA NA

With n=50,000, extreme non-PO in the binary covariate hardly affected the estimated treatment effect and its standard error, and did not affect the ratio of the coefficient estimate to its standard error. Non-PO in x does affect the intercepts, which has an implication in estimating absolute effects (unlike the treatment OR). But by examining the intercepts when the covariate is omitted entirely one can see that the problems with the intercepts when PO is forced are no worse than just ignoring the covariate altogether (not shown here).

Now simulate 2000 trials with n=300 and study how the various models perform.

Code

set.seed(7)
fi <- '~/data/sim/simtx.rds'
if(file.exists(fi)) simr <- readRDS(fi) else {
  s  <- sim(-1, 300, 2000)
  s0 <- sim( 0, 300, 2000)   # also simulate under the null
  simr <- list(s=s, s0=s0)
  saveRDS(simr, fi)
}
cat('Convergence in', simr$s$nsim, 'simulations\n\n')

Convergence in 1947 simulations

Code

kab(round(simr$s$summary, 4))

       Mean beta   Median beta   Sqrt mean estimated var   Median estimated SE   Empirical SD   Power
PPO    -1.0157     -1.0100       0.2273                    0.2281                0.2340         0.9979
PO     -0.9609     -0.9565       0.2189                    0.2184                0.2227         0.9974
No X   -0.9599     -0.9556       0.2188                    0.2183                0.2227         0.9974

The second line of the summary shows what to expect when fitting a PO model in the presence of severe non-PO for an important covariate. The mean estimated treatment effect is the same as not adjusting for the covariate and so is its estimated standard error. Both are close to the estimate from the proper model—the partial PO model that allows for different effects of x over the categories of y. And for all three models the standard error of the treatment effect estimated from that model’s information matrix is very accurate as judged by the closeness to the empirical SD of the simulated regression coefficient estimates.

Check simulations under the null, i.e., with β = 0 for treatment. Look at the distribution of p-values for the three models’ treatment 2-sided Wald tests (which should be uniform), and the empirical α, the fraction of Wald p-values < 0.05.

The empirical α under the improper (with respect to x) PO model is just under that of the model ignoring x, which is estimated to be at the nominal 0.05. The estimated α for the appropriate partial PO model is just over 0.05.

Using the PO Model to Estimate the Treatment Effect for a Specific Y Cutoff

Just as in the case where one thinks that a sex by treatment interaction may be present, actually estimating such an interaction effect can make treatment estimates worse for both sexes in small samples even when the interaction is truly present. This is because estimating an unknown quantity well requires both minimal bias and good precision (low variance), and adding a parameter to the model increases variance (one must estimate both the main effect and the interaction, equivalent to estimating separate treatment effects for females and males). The probability that an estimate is within a given tolerance of the true value is closely related to the mean squared error (MSE) of the estimator. MSE equals variance plus the square of bias. Bias is the systematic error that can result from model misspecification, e.g., fitting a common OR (assuming PO) when the treatment OR needs to vary for some levels of Y (non-PO).

A log odds ratio estimate for a specific cutoff Y≥y derived from a model that dichotomized the raw data at y will tend to be unbiased for estimating that specific log odds ratio. Suppose this log OR has variance u. The MSE of the log OR estimate is then u, since the bias is approximately zero. Now consider estimating the common OR in a PO model and using that to estimate the OR for Y≥y. Suppose that common log OR has variance v and bias δ (δ is the weighted log OR the PO model estimates minus the true log OR for Y≥y), so that the MSE of the log OR for the PO model is v + δ². The multiplicative bias (fold-change bias) is exp(δ). How large must this multiplicative bias in the OR estimate be (i.e., how much non-PO needs to exist) before the tailored model for Y≥y has lower mean squared error (on the log scale) than the less-well-fitting PO model? By comparing the two MSEs u and v + δ² we find that the critical multiplicative error in the OR is exp(√(u − v)).
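This break-even algebra can be verified with a quick numeric check (u and v below are hypothetical variances for illustration):

```python
import numpy as np

# The tailored (dichotomized) estimator has MSE u (variance only, no bias);
# the PO estimator has MSE v + delta^2 (smaller variance v plus squared bias).
# Setting u = v + delta^2 gives delta = sqrt(u - v), i.e. a critical
# multiplicative (fold-change) error in the OR of exp(sqrt(u - v)).
u = 0.19   # hypothetical variance of the cutoff-specific log OR
v = 0.11   # hypothetical variance of the PO-model log OR
delta = np.sqrt(u - v)
print(round(np.exp(delta), 2))   # → 1.33, the critical OR factor
```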

For the dataset we have been analyzing, the critical fold change in OR is tabulated in the table above under the column Critical OR Factor. For example, for the lowest cutoff this factor is 2.33. This is interpreted as saying that an ill-fitting PO model would still break even with a tailored well-fitting model (one that suffers from having higher variance due to not breaking ties in Y) in terms of the chance of having the OR estimate close to the true OR, as long as the OR the PO model estimates is not more than a factor of 2.33 away from the true OR for Y≥1. For example, if the OR that the PO model is estimating is 2, this estimate would be equal in accuracy to a tailored sure-to-fit estimate if the true Y≥1 OR is 4.66, and would be better than the tailored estimate if the true OR is less than 4.66.

Looking over all possible cutoffs, a typical OR critical fold change is 1.5. Loosely speaking if ORs for two different cutoffs have a ratio less than 1.5 and greater than 1/1.5 the PO model will provide a more accurate treatment OR for a specific cutoff than will an analysis built around estimating the OR only for that cutoff. As the sample size grows, the critical multiplicative change in OR will fall. This leads to the next section.

A Continuous Solution

Instead of assessing the adequacy of the PO assumption, hoping that the data contain enough information to discern whether a PO model is adequate and then making a binary decision (PO or non-PO model), a far better approach is to allow for non-PO to the extent that the current sample size allows. By scaling the amount of non-PO allowed, resulting in a reasonable amount of borrowing of information across categories of Y, one can achieve a good mean squared error of an effect estimator. This can be achieved using a Bayesian partial proportional odds model with a skeptical prior distribution for the parameters representing departures from the PO assumption. As the sample size increases, the prior wears off, and the PO assumption is progressively relaxed. All uncertainties are accounted for, and the analyst need not make a PO/non-PO choice. This is implemented in the R rmsb package blrm function. See this for discussion of using this approach for a formal analysis studying to what extent a treatment affects one part of the outcome scale differently than it affects other parts.

To get a feeling for how the degree of skepticism of the prior for the departure from PO relates to the MSE of a treatment effect, we choose normal distributions with mean 0 and various variances, and compute penalized maximum likelihood estimates (PMLEs). These PMLEs are computed by forming the prior and the likelihood and having the Bayesian procedure optimize the penalized likelihood and not do posterior sampling, to save time. Note that the reciprocal of the variance of the prior is the penalty parameter in PMLE (ridge regression).

Going along with examples shown here, consider a 3-level response variable Y=0,1,2 and use the following partial PO model for the two-group problem without covariates, where treatment is coded x=0 for control, x=1 for active treatment:

logit Pr(Y ≥ 1 | x) = α₁ + βx
logit Pr(Y = 2 | x) = α₂ + βx + τx

When τ = 0, PO holds. τ is the additional treatment effect on Y=2.

Consider true probabilities for Y=0,1,2 when x=0 to be the vector p0 in the code below, and when x=1 to be the vector p1. These vectors are not in proportional odds. Draw samples of size 100 from each of these two multinomial distributions, with half having x=0 and half having x=1. Compute the PMLE for various prior distributions for τ that are normal with mean 0 and with SD varying over 0.001 (virtually assuming PO), 0.1, 0.5, 0.75, 1, 1.5, 2, 4 (almost fully trusting the partial PO model fit, with very little discounting of τ). When the prior SD for the amount of non-PO is 0.5, this translates to a prior probability of 0.02275 that τ > 1 and the same for τ < −1.
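The 0.02275 tail probability follows from the normal prior: exceeding 1 in absolute value means exceeding two prior SDs when the SD is 0.5. A one-line check in Python:

```python
from scipy.stats import norm

# P(tau > 1) under a N(0, 0.5^2) prior = P(Z > 2)
print(round(norm.sf(1, loc=0, scale=0.5), 5))   # → 0.02275
```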

True model parameters are solved for using the following:

require(rmsb)
p0 <- c(.4, .2, .4)
p1 <- c(.3, .1, .6)
lors <- c('log OR for Y>=1' = qlogis(0.7) - qlogis(0.6),
          'log OR for Y=2'  = qlogis(0.6) - qlogis(0.4))
alpha1 <- qlogis(0.6)
alpha2 <- qlogis(0.4)
beta   <- qlogis(0.7) - alpha1
tau    <- qlogis(0.6) - alpha2 - beta
c(alpha1=alpha1, alpha2=alpha2, beta=beta, tau=tau)

alpha1 alpha2 beta tau
0.4054651 -0.4054651 0.4418328 0.3690975

Let’s generate a very large (n=20,000) patient dataset to check the above calculations by getting unpenalized MLEs (by setting the SD of prior distributions to 1000).

Code

m  <- 10000   # observations per treatment
m0 <- p0 * m  # from proportions to frequencies
m1 <- p1 * m
x  <- c(rep(0, m), rep(1, m))
y0 <- c(rep(0, m0[1]), rep(1, m0[2]), rep(2, m0[3]))
y1 <- c(rep(0, m1[1]), rep(1, m1[2]), rep(2, m1[3]))
y  <- c(y0, y1)
table(x, y)

y
x 0 1 2
0 4000 2000 4000
1 3000 1000 6000

Code

f <- blrm(y ~ x, ~ x, priorsd=1000, method='opt')
coef(f)

y>=1 y>=2 x x:y>=2
0.4054412 -0.4054380 0.4418775 0.3690140

Code

# Also check estimates when a small prior SD is put on tau
f <- blrm(y ~ x, ~ x, priorsd=1000, priorsdppo=0.0001, method='opt')
coef(f)   # note PMLE of tau is almost zero

y>=1 y>=2 x x:y>=2
3.176311e-01 -3.177820e-01 6.601258e-01 2.155386e-10

Code

# Compare with a PO model
coef(lrm(y ~ x))

y>=1 y>=2 x
0.3176995 -0.3176995 0.6600599

Let’s also simulate, for 1000 patients in each group, the variance of the difference in log ORs.

Code

m <- 1000
x <- c(rep(0, m), rep(1, m))
nsim <- 5000
set.seed(2)
lg <- function(y) qlogis(mean(y))
dlor <- numeric(nsim)
for(i in 1 : nsim) {
  y0 <- sample(0:2, m, replace=TRUE, prob=p0)
  y1 <- sample(0:2, m, replace=TRUE, prob=p1)
  dlor[i] <- lg(y1 == 2) - lg(y0 == 2) - (lg(y1 >= 1) - lg(y0 >= 1))
}
mean(dlor)

[1] 0.368478

Code

v1000 <- var(dlor)
v100  <- v1000 * (1000 / 100)
cat('Variance of difference in log(OR): 1000 per group:', v1000,
    ' 100 per group:', v100, '\n')

Variance of difference in log(OR): 1000 per group: 0.004667525 100 per group: 0.04667525

For a sample containing n subjects per treatment arm, the variance of the difference in the two log ORs (i.e., the amount of deviation from PO) is approximately 4.67/n. An approximate way to think of the effect of a skeptical prior on the difference in log ORs is to assume that the estimate of τ has a normal distribution with mean τ and variance σ² = 4.67/n. When the prior for τ has mean 0 and variance γ², the posterior mean for τ is the estimate divided by 1 + σ²/γ². The denominator is the shrinkage factor s = 1 + σ²/γ². Study how s varies with n and γ.
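The shrinkage factor can be tabulated directly (a Python sketch; 4.67/n is the simulation-based variance approximation from above):

```python
# s = 1 + sigma^2 / gamma^2, with sigma^2 ~ 4.67 / n (n per arm) and
# prior SD gamma on the departure-from-PO parameter tau
for n in (20, 50, 100, 200):
    row = [round(1 + (4.67 / n) / g ** 2, 2) for g in (0.5, 1, 2)]
    print(n, row)
```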

One can see, for example, that when the prior SD γ for τ is 1, the prior causes an estimate of τ to shrink by only about a factor of 1.25 even for very small sample sizes. By the time there are 200 patients per treatment arm the shrinkage towards PO is not noticeable.

The following simulations for 100 patients per arm provide more accurate estimates because formal PMLE is used and the data likelihood is not assumed to be Gaussian. In addition to quantifying the effect of shrinkage caused by different $\tau$ (the prior SD of $\gamma$), we compute the root mean squared errors for estimating the log(OR) for $Y \geq 1$ and for $Y = 2$.

Code

m    <- 100
x    <- c(rep(0, m), rep(1, m))
nsim <- 500
sds  <- c(.0001, 0.1, 0.5, 0.75, 1, 1.5, 2, 4, 10, 50)
lsd  <- length(sds)
gam  <- if(ishtml) 'γ' else '$\\gamma$'
R    <- array(NA, c(nsim, lsd, 2),
              dimnames=list(NULL, paste0(gam, '=', sds), c('Y>=1', 'Y=2')))
set.seed(3)
for(i in 1 : nsim) {
  y0 <- sample(0:2, m, replace=TRUE, prob=p0)
  y1 <- sample(0:2, m, replace=TRUE, prob=p1)
  y  <- c(y0, y1)
  for(j in 1 : lsd) {
    f <- blrm(y ~ x, ~ x, priorsd=1000, priorsdppo=sds[j], method='opt')
    k <- coef(f)
    # save the two treatment log ORs (for Y>=1 and for Y=2)
    R[i, j, 1:2] <- c(k['x'], k['x'] + k['x:y>=2'])
  }
}
# For each prior SD compute the two mean log ORs and compare with the truth
cat('True values:\n')

True values:

Code

lors

log OR for Y>=1 log OR for Y=2
0.4418328 0.8109302

In a mixed Bayesian/frequentist sense (computing the MSE of a posterior mean), the optimum MSE in estimating the two treatment effects (log ORs) was obtained at an intermediate value of the prior SD $\tau$. The observed shrinkage factors do not track very well with the approximate ones derived earlier. A better approximation is needed.

Further Reading

See a similar case study in RMS Section 13.3.5. In that example, the sample size is larger and PO is clearly violated.

Computing Environment

R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 22.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
attached base packages:
[1] splines stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] rmsb_0.2.0 VGAM_1.1-6 rms_6.4-0 SparseM_1.81
[5] Hmisc_4.7-2 ggplot2_3.3.5 Formula_1.2-4 survival_3.4-0
[9] lattice_0.20-45

To cite R in publications use:

R Core Team (2022).
R: A Language and Environment for Statistical Computing.
R Foundation for Statistical Computing, Vienna, Austria.
https://www.R-project.org/.

https://fharrell.com/post/impactpo/index.html
Tags: 2022, accuracy-score, dichotomization, endpoints, ordinal
Wed, 09 Mar 2022 06:00:00 GMT

A Comparison of Decision Curve Analysis with Traditional Decision Analysis
Andrew Vickers
https://fharrell.com/post/dca/index.html

Introduction

In a traditional decision analysis, the analyst creates a decision tree and then estimates probabilities and assigns utilities for each possible outcome. Decision curve analysis is a type of decision analysis that can be applied to the evaluation of prognostic models and diagnostic tests. The major advantage is that it does not require specification of multiple utilities for different outcomes. Instead, the threshold probability of disease – a concept essential for clinical use of any model – is used to reflect the consequences of different decisions. The theory underlying decision curve analysis, and the use of threshold probability, has been discussed in multiple papers here, here, and here. The website decisioncurveanalysis.org includes links to these papers plus additional code, tutorials and data sets.

Here we compare decision curve analysis and traditional decision analysis to illustrate their similarities and differences.

Motivating example

We used the data set from an evaluation of a commercial prostate cancer marker, the 4Kscore. We evaluated two prespecified models: the Prostate Cancer Prevention Trial (PCPT) risk calculator, described in the paper, and a new model (the “free PSA” model) consisting of total PSA, free PSA, prior negative biopsy, race and digital rectal exam. The aim of these models is to inform the decision to biopsy. A biopsy is seen as justified if it results in the detection of high-grade prostate cancer.

The PCPT model had reasonable discrimination for high-grade prostate cancer (0.735) but was miscalibrated, with important underestimation of risk at low threshold probabilities (see top figure). The free PSA model had slightly higher discrimination (0.774) but much better calibration, with only some slight overestimation of risk (bottom figure).

Traditional decision analysis

A traditional decision analysis starts with a tree showing the decision point, indicated by a square, chance nodes, indicated by circles, and then the outcomes. This is shown in the figure below. Here “cancer” is defined as high-grade prostate cancer and “no cancer” is benign findings or low-grade disease.

The next step is to assign utilities to each of the four possible outcomes. We can assign the outcome of no biopsy in a patient without cancer, the best possible outcome, the maximum utility of 1. A prostate biopsy is uncomfortable and is associated with a risk of infection, so an unnecessary biopsy is assigned a utility of 0.95. A diagnosis of high-grade cancer entails treatments such as radiotherapy and surgery, which can have long-term impacts on urinary, bowel and erectile function. A cancer diagnosis also raises the possibility of recurrent disease, with more toxic treatments such as hormonal therapy or chemotherapy, as well as the risk of metastasis and cancer-specific death. Hence finding a high-grade cancer will be assigned a utility of 0.8. Missing a cancer increases the risk of toxic treatment, metastasis and death and is assigned a utility of 0.35.

This is a somewhat simplified tree. A more complete tree might separate low-grade from benign disease and also allow for the possibility of biopsy complications, whether major or minor. This would lead to 12 different outcomes: high-grade / low-grade / no cancer detected on a biopsy leading to no / minor / major complications, plus no biopsy where the true disease state is high-grade / low-grade / no cancer. The simplified tree is used here because it is better suited to illustrating the underlying principles.

We can now apply the tree to the strategies of biopsying all men at risk (the current clinical strategy), biopsying no men, or biopsying according to the PCPT or free PSA prediction models. For the prediction models, we need to choose cut-points to determine what counts as a positive test indicating biopsy. The table shows utilities for different cut-points between a 2.5% and 25% risk of high-grade prostate cancer (there is actually a more rational way to choose the cut-point, which we will come back to). The expected utilities of biopsying all vs. no patients are 0.9161 and 0.8529 respectively. The table also shows the difference in expected utility between each model and the strategy of biopsying all men, and the difference in expected utility between the two models. The PCPT model is favored for low cut-points (2.5% and 5%); otherwise the highest expected utility is for the free PSA model.
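
The expected utilities of the two default strategies can be reproduced directly from the four utilities in the tree. A short sketch; the prevalence of high-grade cancer (about 0.226) is an assumption backed out of the reported values, not a number given in the text:

```r
# Utilities from the decision tree
u_tp <- 0.80   # biopsy, high-grade cancer found
u_fp <- 0.95   # unnecessary biopsy
u_tn <- 1.00   # no biopsy, no cancer
u_fn <- 0.35   # cancer missed
prev <- 0.2263 # assumed prevalence of high-grade cancer (backed out)

eu_all  <- prev * u_tp + (1 - prev) * u_fp   # biopsy everyone
eu_none <- prev * u_fn + (1 - prev) * u_tn   # biopsy no one
round(c(all = eu_all, none = eu_none), 4)    # reproduces 0.9161 and 0.8529
```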

Table 1

Decision curve analysis

In a decision curve analysis, there is no need to consider and specify multiple different utilities. For the simplified tree, these would be the four utilities for true and false positives and negatives; for the more complete tree described above, the analyst would have to specify, for instance, the utility of having a minor complication after a biopsy in which low-grade cancer was found. Instead, all the analyst needs to do is obtain a range of reasonable probability thresholds. In the current case, that might be 5% – 20% (see this), although below we show a wider range for illustrative purposes. The free PSA model, shown in green, has much higher net benefit than the PCPT model (orange), which has lower net benefit than the strategy of biopsying all men (blue) for much of the range.

The net benefits are shown in table form below. Also shown is the net benefit calculated from the expected utilities in the first table using the formula (expected utility of the model − expected utility of biopsying no one) ÷ (utility of true positive − utility of false negative).

Table 2

The critical point is that when the risk threshold is congruent with the utilities, net benefit calculated from a decision curve is the same as that calculated from utilities. This is shown in the bolded row. The rational risk threshold Pt can be obtained as Pt ÷ (1 − Pt) = (utility of true negative − utility of false positive) ÷ (utility of true positive − utility of false negative). This gives Pt ÷ (1 − Pt) = (1 − 0.95) ÷ (0.8 − 0.35) = 1/9 and hence Pt = 10%.
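
The rational threshold and the agreement between the two net-benefit calculations can be checked numerically. A sketch using the same utilities and, as an assumption not stated in the text, a prevalence of roughly 0.226:

```r
u_tp <- 0.80; u_fp <- 0.95; u_tn <- 1.00; u_fn <- 0.35
prev <- 0.2263                      # assumed prevalence of high-grade cancer

# Rational threshold: odds(Pt) = (U_TN - U_FP) / (U_TP - U_FN)
odds_pt <- (u_tn - u_fp) / (u_tp - u_fn)     # 1/9
pt      <- odds_pt / (1 + odds_pt)           # 0.10

# Net benefit of "biopsy all", computed two ways
nb_direct <- prev - (1 - prev) * pt / (1 - pt)       # decision-curve formula
eu_all    <- prev * u_tp + (1 - prev) * u_fp
eu_none   <- prev * u_fn + (1 - prev) * u_tn
nb_from_u <- (eu_all - eu_none) / (u_tp - u_fn)      # from expected utilities
c(pt = pt, nb_direct = nb_direct, nb_from_utilities = nb_from_u)
```

The two net-benefit numbers agree, illustrating the equivalence when the threshold is congruent with the utilities.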

The table also shows that the rank ordering of utilities is altered when a cut-point is used that is inconsistent with the utilities. For instance, at a cut-point of 25%, the decision curve shows higher net benefit for the PCPT compared to the strategy of biopsying all men; the expected utility from the decision analysis favors biopsying all men. This can also be seen by comparing the strategies of treating all or no men. The decision curve analysis shows, appropriately, that the strategy of treating all men is only of higher net benefit if the threshold probability is below the prevalence.

Conclusion

We have compared traditional decision analysis with decision curve analysis. We show that, where the threshold probability in the decision curve analysis is congruent with the utilities used in the decision analysis, utility estimates are identical for both methods. We also show that decision curve analysis is a more natural fit for the evaluation of models because, unlike traditional decision analysis, there can be no inconsistency between the cut-point used for the model and the utilities used in the decision analysis.

Comments from Readers

Giuliano Cruz: Very interesting post! It hurts even more to read sens/spec “optimization” in, say, the biomarker literature after seeing decision theoretic approaches for specification of risk thresholds. One doubt I have: following https://www.tandfonline.com/doi/abs/10.1198/000313008X370302, the formula for rational odds(Pt) (that arrived at Pt=10%) seems inverted. As I read, it should be odds(Pt) = (TN-FP)/(TP-FN). Is that correct? Thanks!

Uriah Finkel: Great post! I wonder what about Risk Percentiles? In some applications it is required to observe results for top 5% population at risk, or that there’s enough budget for intervention for limited absolute amount of people. Should it be considered as another constraint?

https://fharrell.com/post/dca/index.html
Tags: decision-making, diagnosis, medicine, 2021
Mon, 27 Dec 2021 06:00:00 GMT

Longitudinal Ordinal Models as a General Framework for Medical Outcomes
Frank Harrell
https://fharrell.com/talk/cmstat/index.html

Tags: RCT, drug-development, bayes, regression, endpoints
https://fharrell.com/talk/rcteff/index.html
Thu, 04 Nov 2021 05:00:00 GMT

Commentary on Improving Precision and Power in Randomized Trials for COVID-19 Treatments Using Covariate Adjustment, for Binary, Ordinal, and Time-to-Event Outcomes
Frank Harrell, Stephen Senn
https://fharrell.com/post/ipp/index.html

Standard covariate adjustment as commonly used in randomized clinical trials recognizes which quantities are likely to be constant (relative treatment effects) and which quantities are likely to vary (within-treatment-group outcomes and absolute treatment effects). Modern statistical modeling tools such as regression splines, penalized maximum likelihood estimation, Bayesian shrinkage priors, semiparametric models, and hierarchical or serial correlation models allow for great flexibility in the covariate adjustment context. A large number of parametric model adaptations are available without changing the meaning of model parameters or adding complexity to interpretation. Standard covariate adjustment provides an ideal basis for capturing evidence about heterogeneity of treatment effects through formal interaction assessment. It is argued that absolute treatment effects (e.g., absolute risk reduction estimates) should be computed only on an individual patient basis and should not be averaged, because of the predictably large variation in risk reduction across patient types. This is demonstrated with a large randomized trial. Quantifying treatment effects through average absolute risk reductions hides many interesting phenomena, is inconsistent with individual patient decision making, is not demonstrated to add value, and provides fewer insights than standard regression modeling. And since it is not likelihood based, focusing on average absolute treatment effects does not build a bridge to Bayesian or longitudinal frequentist models that are required to take external information and various design complexities into account.

We usually treat individuals not populations. That being so, rational decision-making involves considering who to treat not whether the population as a whole should be treated. I think there are few exceptions to this. One I can think of is water-fluoridation but in most cases we should be making decisions about individuals. In short, there may be reasons on occasion to use marginal models but treating populations will rarely be one of them. — Stephen Senn, 2022

Benkeser, Díaz, Luedtke, Segal, Scharfstein, and Rosenblum^{1} have written a paper aimed at improving precision and power in COVID-19 randomized trials. Some relevant analytic results were already available^{2} and could have been compared with the new proposals. But here we want to raise the questions of whether ordinary covariate adjustment is more insightful, whether it already solves the problems needing to be solved, and what the advantages of the authors’ multi-step approach are. Benkeser et al focus on the following treatment effect estimands:

^{1} Benkeser D, Díaz I, Luedtke A, Segal J, Scharfstein D, Rosenblum M (2020): Improving precision and power in randomized trials for COVID-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes. To appear in Biometrics, DOI:10.1111/biom.13377.

^{2} Lesaffre E, Senn S (2003): A note on non-parametric ANCOVA for covariate adjustment in randomized clinical trials. Statistics in Medicine 22:3583-3596.

Wilcoxon-Mann-Whitney statistic or probability index, also known as the concordance probability for an ordinal outcome

average log odds ratio for an ordinal outcome

difference in restricted mean survival times for time-to-event outcomes

difference in cumulative incidence for same

relative risk for same

Taking for example the risk difference, one of the authors’ procedures is as follows for a trial of $n$ subjects (both treatment arms combined).

fit a binary logistic model containing treatment and baseline covariates

for each of the $n$ subjects compute the risk of the outcome using the subject’s covariates but setting treatment to control for all

for each of the $n$ subjects compute the risk of the outcome using the subject’s covariates but now setting treatment to the new treatment for all; the estimated risk difference is then the difference in the averages of the two sets of predicted risks
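
The steps above amount to standardization ("g-computation") over the observed covariate distribution. A minimal sketch using a plain logistic fit on simulated data (the variable names and data-generating model are illustrative, not from the paper):

```r
# Simulate a hypothetical trial: binary outcome y, randomized treatment tx,
# baseline covariates age and sex
set.seed(1)
n <- 400
d <- data.frame(tx  = rbinom(n, 1, 0.5),
                age = rnorm(n, 60, 10),
                sex = rbinom(n, 1, 0.5))
d$y <- rbinom(n, 1, plogis(-4 + 0.05 * d$age + 0.5 * d$tx))

# Step 1: fit a binary logistic model with treatment and baseline covariates
f <- glm(y ~ tx + age + sex, family = binomial, data = d)

# Steps 2-3: predict each subject's risk with treatment set to control,
# then to the new treatment, for everyone
p0 <- predict(f, newdata = transform(d, tx = 0), type = 'response')
p1 <- predict(f, newdata = transform(d, tx = 1), type = 'response')

mean(p1) - mean(p0)   # the marginal risk-difference estimand
```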

The paper in our humble opinion began on the wrong foot by taking as given that “the primary goal of covariate adjustment is to improve precision in estimating the marginal treatment effect.” Not only are marginal effects distorted by subjects not being representative of the target population, but the primary goals of covariate adjustment are (1) in a linear model, to improve power and precision by reducing residual variation, and (2) in a nonlinear model, to improve the fit of the model, which improves power, and to get a more appropriate treatment effect estimate in the face of strong heterogeneity of outcomes within a single treatment arm. Whichever statistical model is used, a treatment effect estimand should have meaning outside of the study. We seek an estimand that has the highest chance of applying to patients not in the clinical trial. When the statistical model is nonlinear (e.g., logistic or Cox proportional hazards models), the effect that is capable of being constant is relative efficacy, e.g., conditional odds or conditional hazards ratio. Because of non-collapsibility of the odds and hazards ratio, the effect ratio that does not condition on covariates will be tilted towards 1.0 when easily explainable outcome heterogeneity is not explained. And in the Cox model, failure to condition on covariates will result in a greater departure from the proportional hazards assumption. Ordinary covariate adjustment works quite well and has been relied upon in countless randomized clinical trials. As will be discussed later, standard covariate adjustment provides more transportable effect estimates. On the other hand, the average treatment effect advocated by Benkeser et al does not even transport to the original clinical trial, in the sense that it is an average of unlikes and will apply neither to low-risk subjects nor to high-risk subjects. 
Importantly, this average difference is a strong function of not only the clinical trial’s inclusion/exclusion criteria but also of the characteristics of subjects actually randomized, limiting its application to other populations. Note that the idea of making results from a conditional model consistent with a marginal one was discussed in Lee and Nelder^{3}.

^{3} Lee Y, Nelder JA (2004): Conditional and marginal models: Another view. Statistical Science 19:219-228.

Benkeser et al (2020) are quite correct in stating that covariate adjustment is a much underutilized statistical method in randomized clinical trials (a point argued by many for many years^{4}). The resulting loss of power and precision and inflation of sample size borders on scandalous. But we fear that the authors’ remedy will make clinical trialists less likely to account for covariates. This is because the authors’ approach is overly complicated without a corresponding gain over traditional covariate adjustment, is harder to interpret, can provide misleading effect estimates by the averaging of unlikes as discussed in detail below, is harder to pre-specify (pre-specification being a hallmark of rigorous randomized clinical trials), does not provide a basis for interaction assessment (heterogeneity of treatment effect), and does not extend to longitudinal or clustered data. One can argue that what the authors proposed should not be called covariate adjustment in the usual sense, but we will leave that for others to debate.

^{4} Senn SJ (2013): Being efficient about efficacy estimation. Statistics in Biopharmaceutical Research 5:204-210.

When we read a statistical methods paper, the first questions we ask ourselves are these: Does the paper solve a problem that needed to be solved? Are there other problems that would have been better to solve? Did the authors conduct a proper comparative study to demonstrate the net benefit of the new method? Unfortunately, we do not view the authors’ paper favorably on these counts. Thinking of all the many problems we have in COVID-19 therapeutic research alone, we have more pressing problems that talented statistical researchers such as Benkeser et al could have attacked instead. Some of these problems include

How do we model outcomes in a way that recognizes that a treatment may not affect mortality by the same amount that it affects non-fatal outcomes?

What is the best statistical evidential measure for whether a mortality effect is consistent with the other effects?

What are the best ways to judge which treatment is better when results for different patient outcomes conflict (e.g., when a treatment slightly raises mortality but seems to cause a sharp reduction in a disease severity measure)?

What is the best combination of statistical efficiency and clinical interpretation in constructing an outcome variable?

What is the information gain from a longitudinal ordinal response when compared to a time-to-event outcome?

How should one elicit prior distributions in a pandemic, or how should one form skeptical priors?

How should simultaneous studies inform each other during a pandemic?

What is the optimal number of parameters to devote to covariate adjustment and what is the best way to relax linearity assumptions when doing covariate modeling?

Is it better to use subject matter knowledge to pre-specify a rather small number of covariates, or should one use a large number of covariates with a ridge ($L_2$) penalty for their effects^{5}?

What is the best way to model treatment covariate interactions? Should Bayesian priors be put on the interaction effects, and how should such priors be elicited?

Can Bayesian models provide exact small-sample inference in the presence of missing covariate values?

What is the best adaptive design to use and how often should one analyze the data?

What is the best longitudinal model to use for ordinal outcomes, is a simple random effects model as good as a serial correlation model, what prior distributions work best for correlation-related parameters, and should absorbing states be treated in a special way?

^{5} Chen Q, Nian H, Zhu Y, Talbot HK, Griffin MR, Harrell FE (2016): Too many covariates and too few cases? - a comparative study. Statistics in Medicine 35:4546-4558.

To frame the discussion below, consider an alternative to the authors’ proposed methods: flexible parametric models that adjust for key pre-specified covariates without assuming that continuous covariates operate linearly (e.g., continuous covariates are represented with regression splines in the model). Such a standard model has the following properties.

The parametric model parameterizes the treatment effect on a scale for which it is possible for the treatment effect to be constant (log odds, log hazard, etc.) and hence represented by a single number.

Because the treatment effect parameter has an unrestricted range, the need for interactions between treatment and covariates is minimized.

The model provides a basis for interactions that more likely represent biologic or pharmacologic effects than tricks to restrict the model’s probability estimates to $[0,1]$.

The model is readily extended to handle longitudinal binary and ordinal responses and multi-level models (e.g., days within patients within clinical centers).

Well studied multiple imputation procedures exist to handle missing covariates.

Full likelihood methods used in fitting the model elegantly handle various forms of censoring and truncation.

As shown by example here, standard covariate adjustment is more robust than Benkeser et al imply, and even when an ill-fitting model is used, the result may be more useful than marginal estimates.

More on Estimands

Benkeser et al’s primary estimand is the average difference in outcome probability in going from treatment A to treatment B. The authors have chosen a method of incorporating covariates that is robust, but it is robust because the method averages out estimation errors into an oversimplified effect measure. In order to get proper individualized absolute risk reductions, the authors would have to model covariates to the same standard sought by regular covariate adjustment. The authors’ method is flexible, but their estimand hides important data features. Standard regression models are very likely to fit data well enough to provide excellent covariate adjustment, but an estimand that represents a difference in averages (e.g., an overall absolute risk reduction, ARR) is an example of combining unlikes. ARR due to a treatment is a strong function of the base risk. For example, patients who are sicker at baseline have “more room to move” so have larger ARRs than less sick patients. ARR is an estimand that should be estimated only on a single-patient basis (see here, here, and here). In the binary outcome case, personalized ARR is a simple function of the odds ratio and baseline risk in the absence of treatment interactions. When there is a treatment interaction on the logit scale, the difference in ARR estimated from the two logistic model predictions is only slightly more complex.
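
In the no-interaction case the mapping from baseline risk and odds ratio to personalized ARR is one line of code. A minimal sketch (the example risks and odds ratio are illustrative):

```r
# Personalized ARR for a patient with baseline (control) risk p0 when the
# treatment odds ratio is or, assuming no treatment-covariate interaction
arr <- function(p0, or) p0 - plogis(qlogis(p0) + log(or))

# Same odds ratio, three baseline risks: sicker patients get larger ARR
round(arr(p0 = c(0.02, 0.10, 0.30), or = 0.8), 4)
```

For a fixed odds ratio of 0.8, the ARR roughly an order of magnitude larger for a baseline risk of 0.30 than for 0.02, illustrating why averaging ARRs over dissimilar patients combines unlikes.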

The authors did not expose this issue to the readers. For example, one would find that a baseline-risk-specific version of their Figure 1 would reveal much variation in ARR.

As an example, the GUSTO I study^{6} of thrombolytics in treatment of acute myocardial infarction with its sample size of 41,021 patients and overall 0.07 incidence of death has been analyzed in detail, and various risk models have been developed from this randomized trial’s data. As shown here, for a treatment comparison of major interest, accelerated t-PA ($n=$10,348) vs. streptokinase (SK, $n=$20,162), the average ARR for the probability of 30d mortality is 0.011. But this is misleading as it is dominated by a minority of high risk patients as shown in the figure described below. The median ARR is 0.007 which is much more representative of what patients can expect. But far better is to provide individualized ARR estimates.

^{6} GUSTO Investigators (1993): An international randomized trial comparing four thrombolytic strategies for acute myocardial infarction. New England Journal of Medicine 329:673-682.

In the analysis of $n=$40,830 patients (2851 deaths) in GUSTO-I presented here, three treatments (two indicator variables) were allowed to interact with 6 covariates, with the continuous covariates expanded into spline functions. The interaction likelihood ratio test statistic was 16.6 on 20 d.f., showing little evidence for interaction. Thus the assumption of constancy of the treatment odds ratios is difficult to reject. Another way to allow a more flexible model fit is to penalize all the interaction terms down to what optimally cross-validates. Using a quadratic (ridge) penalty, the optimum penalty was effectively infinite (shrinking the interactions to zero), again indicating no reason that the treatment odds ratio should not be considered to be a single number. Giving interactions one more benefit of the doubt, penalized maximum likelihood estimates were obtained, penalizing all the covariate treatment interaction terms to effectively a single degree of freedom. This model was used to estimate the distribution of individual patient ARR with t-PA shown in the figure below.

One sees the large amount of variation in ARR. Other results show large variation in risk ratios and minimal variation on ORs (and keep in mind that the optimal estimate of this OR variation is in fact zero). The relationship between baseline risk under control therapy (SK) and the estimated ARR is shown below.

Through proper conditioning and avoidance of averaging of unlikes by estimating ARR for individual patients and not averaging these estimates over patients, it is seen that standard covariate adjustment using well accepted relative treatment effect estimands is simpler, can fit data patterns better, requires only standard software, and is more insightful.

See Hoogland et al^{7} for a detailed article on individualized treatment effect prediction.

^{7} Hoogland J, IntHout J, Belias M, Rovers MM, Riley RD, Harrell FE, Moons KGM, Debray TPA, Reitsma JB (2021): A tutorial on individualized treatment effect prediction from randomized trials with a binary endpoint. Accepted, Statistics in Medicine. Preprint

Others have claimed that our argument in favor of transportability of conditional estimands for treatment effects is incorrect, and that marginal estimands should form the basis for transportability of findings to other patient populations, as advocated by Pearl and Bareinboim^{8}. The marginal estimand is not appropriate in our context for the following reasons:

^{8} Pearl J, Bareinboim E (2014): External validity: From do-calculus to transportability across populations. Statistical Science 29:579-595.

Pearl and Bareinboim developed their approach to apply to complex situations where a covariate may be related to the amount of treatment received. We are dealing only with exogeneous pre-existing patient-level covariates here.

The transport equation 3.4 of Pearl and Bareinboim requires the use of the covariate distribution in the target population. This distribution is not usually available when a clinical trial finding is published or is evaluated by regulators.

It is easier to transport an efficacy estimate to an individual patient (at least under the no-interaction assumption or if interactions are correctly modeled in the fully conditional model). The one patient is the target and one only needs to know her/his covariate value, not the distribution of an entire target population.

We are ultimately interested in individual patient decision making, not group decision making.

Lack of Comparisons

Benkeser et al showed large efficiency gains of their approach over ignoring covariates. But the authors did not compare the power for their approach vs. standard covariate adjustment. This leaves readers wondering whether the new method is worth the trouble or is in fact less efficient than the standard.

New Odds Ratio Estimator

With the aim of dealing with non-proportional odds, the authors developed a weighted log odds ratio estimator. But the standard maximum likelihood estimator already solves this problem. As detailed here and here, the Wilcoxon-Mann-Whitney concordance probability $c$, i.e., the probability that a randomly chosen patient on treatment B has a higher level response than a randomly chosen patient on treatment A, also called the probability index, is a simple function of the maximum likelihood estimate $\hat{\beta}$ of the treatment regression coefficient whether or not proportional odds holds. The conversion equation is $c = \frac{\mathrm{OR}^{0.66}}{1 + \mathrm{OR}^{0.66}}$, where OR $= e^{\hat{\beta}}$. This formula is accurate to an average absolute error of 0.002 for computing $c$ from OR.
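
The accuracy of this conversion is easy to check by simulation. A sketch using a continuous logistic location shift, for which proportional odds holds exactly (the sample size and effect size are illustrative, not from the paper):

```r
# Compare the approximation c = OR^0.66 / (1 + OR^0.66) with the empirical
# Wilcoxon-Mann-Whitney concordance for a logistic location-shift model
set.seed(4)
n    <- 20000
beta <- 0.8                          # true treatment log OR
y0   <- rlogis(n)                    # control responses
y1   <- rlogis(n, location = beta)   # treated responses

# Wilcoxon statistic W = number of pairs with y1 > y0 (no ties here)
W      <- wilcox.test(y1, y0)$statistic
c_emp  <- as.numeric(W) / (n * n)    # empirical concordance P(Y1 > Y0)
or     <- exp(beta)
c_appr <- or ^ 0.66 / (1 + or ^ 0.66)
round(c(empirical = c_emp, approximate = c_appr), 3)
```

The two values agree to roughly the stated accuracy of the approximation.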

The equivalence of the OR and the Wilcoxon-Mann-Whitney estimand also makes the authors’ estimand 2 in Section 3.2 somewhat moot.

We note that overlap measures are not without problems^{9}.

^{9} Senn SJ (2011): U is for unease: Reasons to mistrust overlap measures in clinical trials. Statistics in Biopharmaceutical Research 3:302-309.

Lack of a Likelihood

COVID-19 therapeutic research is an area where Bayesian methods are being used with increasing frequency. Frequentist methods that use a full likelihood approach provide an excellent bridge to development of or comparison with Bayesian analogs that use the same likelihood and only need to add a prior. The authors’ methods are not likelihood based, so they do not provide a bridge to Bayes. The proposed methods do not provide exact inference for small $n$ as does Bayes, have no way of incorporating skepticism or external information about treatment effects, and have no way to quantify evidence that is as actionable as posterior probabilities of (1) any efficacy and (2) clinically meaningful efficacy.

The authors briefly discuss missing ordinal outcomes. Instead of this being an issue of missingness, in many situations an ordinal outcome is interval censored. This is the case, for example, when a range of ordinal values is not ascertained on a given day. To deal with interval censoring, a full likelihood approach is helpful, and the authors’ approach may not be extendible to deal with general censoring.

The lack of a likelihood also prevents the authors’ approach from dealing with variation across sites in a multi-site clinical trial through the use of random effects models, and extensions to longitudinal outcomes are not provided. The flexibility of a Bayesian longitudinal proportional odds model that allows for general censoring and departures from the proportional odds assumption is described here.

Technical Issues

The authors stated “By incorporating baseline variable information, covariate adjusted estimators often enjoy smaller variance …”. This must be clarified to apply to ARR estimates, but not in general. In logistic and Cox models, for example, covariate adjustment increases treatment effect variance on the log ratio scale (see this and this for a literature review, and also the important paper by Robinson and Jewell^{10}). Despite this increase, traditional covariate adjustment still results in increased Bayesian or frequentist power because the model being more correct (i.e., some of the previously unexplained outcome heterogeneity is now explained) moves the treatment effect estimate farther from zero. The regression coefficient increases in absolute value faster than the standard error increases, hence the power gain. For logistic and Cox models, variance reduction occurs only on probability-scale estimands.
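
The phenomenon described by Robinson and Jewell is easy to reproduce by simulation. A sketch (the effect sizes and covariate are illustrative): conditioning on a strong prognostic covariate increases the standard error of the treatment log OR, but the estimate moves away from zero even faster:

```r
# Non-collapsibility demo: adjusted vs. unadjusted treatment log OR
set.seed(5)
n  <- 4000
tx <- rbinom(n, 1, 0.5)                      # randomized treatment
z  <- rnorm(n)                               # strong prognostic covariate
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * tx + 1.5 * z))

f_unadj <- glm(y ~ tx,     family = binomial)
f_adj   <- glm(y ~ tx + z, family = binomial)

s <- function(f) summary(f)$coefficients['tx', c('Estimate', 'Std. Error')]
rbind(unadjusted = s(f_unadj), adjusted = s(f_adj))
```

The adjusted estimate is larger in absolute value and its standard error is also larger, yet the ratio of estimate to standard error (and hence power) is improved by adjustment.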

^{10} Robinson LD, Jewell NP (1991): Some surprising results about covariate adjustment in logistic regression models. International Statistical Review 58:227-240.
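The Robinson–Jewell phenomenon is easy to reproduce. The sketch below is my own toy simulation, not from the paper under discussion; all effect sizes and sample sizes are invented. It fits unadjusted and covariate-adjusted logistic models with a hand-rolled Newton–Raphson step and shows the adjusted treatment log odds ratio moving farther from zero while its standard error grows.

```python
# Toy demonstration (invented numbers): adjusting for a strong prognostic
# covariate in logistic regression increases SE(log OR) for treatment yet
# moves the estimate farther from zero (Robinson & Jewell, 1991).
import numpy as np

def fit_logistic(X, y, n_iter=30):
    """Newton-Raphson logistic fit; X must include an intercept column.
    Returns (coefficients, standard errors)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None])            # observed information
        beta = beta + np.linalg.solve(H, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))       # recompute at the solution
    W = p * (1.0 - p)
    H = X.T @ (X * W[:, None])
    return beta, np.sqrt(np.diag(np.linalg.inv(H)))

rng = np.random.default_rng(1)
n  = 10000
tx = rng.integers(0, 2, n)                     # randomized treatment
x  = rng.integers(0, 2, n)                     # strong prognostic covariate
p  = 1.0 / (1.0 + np.exp(-(-1.0 + 1.0 * tx + 3.0 * x)))
y  = (rng.uniform(size=n) < p).astype(float)

ones = np.ones(n)
b_un,  se_un  = fit_logistic(np.column_stack([ones, tx]),    y)
b_adj, se_adj = fit_logistic(np.column_stack([ones, tx, x]), y)
print(f"unadjusted log OR {b_un[1]:.3f} (SE {se_un[1]:.3f})")
print(f"adjusted   log OR {b_adj[1]:.3f} (SE {se_adj[1]:.3f})")
```

Despite the larger standard error, the adjusted Wald statistic tends to be at least as large because the conditional log odds ratio is not attenuated toward zero the way the marginal one is.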

In their Section 6 the authors recommended that when a utility function can be agreed upon, one should consider the difference in mean utilities between treatment arms when the outcome is ordinal. Even though the difference in mean utilities can be a highly appropriate measure of treatment effectiveness, it is important to note that the patient-specific utilities are still discrete and are likely to have a very strange, even bimodal, distribution. Hence utilities may be best modeled with a semiparametric model such as the proportional odds model.

The authors credited the proportional odds model to McCullagh^{11} but this model was developed by Walker and Duncan^{12} and other work even predates this.

^{11} McCullagh P (1980): Regression models for ordinal data. Journal of the Royal Statistical Society Series B 42:109-142.

^{12} Walker SH, Duncan DB (1967): Estimation of the probability of an event as a function of several independent variables. Biometrika 54:167-178.

The authors discuss asymptotic accuracy of their method, which is of interest to statisticians but not to practitioners, who need to know the accuracy of the method in their (usually too small) sample.

The authors did not seem to have rigorously evaluated the accuracy of bootstrap confidence intervals. We have an example where none of the standard bootstraps provides sufficient accuracy for a confidence interval for a standard logistic model odds ratio. Non-coverage probabilities are not close to 0.025 in either tail when attempting to compute a 0.95 interval. It is important to evaluate the left and right non-coverage probabilities, as the confidence interval can be right on the average but wrong for both tails.
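Checking the two tails separately is straightforward to set up. Below is a minimal sketch, with my own arbitrary settings and a simple skewed-data mean rather than the logistic odds-ratio example mentioned above, that tallies how often a percentile bootstrap 0.95 interval misses on each side; for a well-calibrated interval each tally should be near 0.025 of the simulations.

```python
# Sketch (my own toy setup): tally left- and right-tail non-coverage of a
# percentile bootstrap 0.95 CI for the mean of skewed (Exponential) data.
import numpy as np

rng = np.random.default_rng(2)
true_mean, n, B, nsim = 1.0, 20, 400, 500
miss_lo = miss_hi = 0                          # truth below / above the interval
for _ in range(nsim):
    x = rng.exponential(true_mean, n)
    boot = np.array([rng.choice(x, n, replace=True).mean() for _ in range(B)])
    lo, hi = np.quantile(boot, [0.025, 0.975])
    miss_lo += true_mean < lo                  # interval entirely above truth
    miss_hi += true_mean > hi                  # interval entirely below truth
print(f"non-coverage: {miss_lo/nsim:.3f} (truth below CI), "
      f"{miss_hi/nsim:.3f} (truth above CI)")
```

With skewed data and small n the two tail non-coverage rates are typically unequal, which is exactly the problem an overall coverage figure hides.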

The use of categorization for continuous predictors (e.g., the treatment of age just before section 4.1.3) does not represent best statistical practice.

To be very picky, the authors’ (and many other authors’) use of the term “type I error” does not refer to the probability of an error but rather to the probability of making an assertion.

The paper advises readers to consider using variable selection algorithms. Stepwise variable selection brings a host of problems, typically ruins standard error estimates (Greenland^{13}), and is not consistent with full pre-specification.

^{13} Greenland S (2000): When should epidemiologic regressions use random coefficients? Biometrics 56:915-921.

The idea to use information monitoring in forming stopping rules needs to be checked for consistency with optimum decision making, and it may be difficult to specify the information threshold.

Related to missing covariates, the recommendation to use single imputation, and its implication that the outcome not be used in the imputation process, has been well studied and found to be lacking, especially with regard to getting standard errors correct.

In the authors’ Supporting Information, the intuition for how covariate adjustment can lead to precision gains begins with a discussion of covariate imbalance. With the linear model, a random marginal imbalance term is identified, thus becoming a conditional bias, which is then removed, adjusting the estimate if necessary and reducing the variance. It is the possibility that the estimate may have to be adjusted that makes the variance for the conditional and the marginal estimates ‘correct’ given the model^{14}. Covariate adjustment produces conditionally and unconditionally unbiased estimates. But imbalance is not the primary reason for doing covariate adjustment in randomized trials. Covariate adjustment is more about accounting for easily explainable outcome heterogeneity^{15}^{16}. At any rate, apparent covariate imbalances may be offset by counterbalancing covariates one did not bother to analyze.

^{14} Senn SJ (2019): The well-adjusted statistician: Analysis of covariance explained. https://www.appliedclinicaltrialsonline.com/view/well-adjusted-statistician-analysis-covariance-explained . A valid standard error reflects how the estimate will vary over all randomisations. The adjustment that occurs as a result of fitting a covariate can be represented as (Ȳ_t − Ȳ_c) − β̂(X̄_t − X̄_c). Here Ȳ and X̄ are means of the outcome and the covariate respectively, and the subscript t stands for treatment and c stands for control. For simplicity assume the uncorrected differences at outcome and baseline have been standardised to have a variance of one. Other things being equal there cannot be a reduction in the residual variance by fitting the covariate unless β̂ is large, which in turn implies that Y and X must vary together. If you don’t adjust, you are allowing for the fact that in the absence of any treatment effect Ȳ_t − Ȳ_c might differ from 0 due to the fact that X̄_t − X̄_c differs randomly from 0, and this will affect the outcome. By adjusting you cash in the bonus by reducing the variance of the estimate. But this can only happen because adjustment is possible.

^{15} Lane PW, Nelder JA (1982): Analysis of covariance and standardization as instances of prediction. Biometrics 38:613-621.

^{16} Senn SJ (2013): Seven myths of randomization in clinical trials. Statistics in Medicine 32:1439-1450.
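Senn’s algebra is easy to check numerically. The toy linear-model simulation below uses my own invented numbers and shows the standard error of the treatment effect shrinking once a covariate that co-varies with the outcome is fitted — the linear-model case where adjustment unambiguously reduces variance.

```python
# Toy check (invented numbers): in a linear model, fitting a prognostic
# baseline covariate reduces residual variance and hence SE(treatment effect).
import numpy as np

rng = np.random.default_rng(3)
n  = 1000
tx = np.repeat([0.0, 1.0], n // 2)                  # randomized treatment
x  = rng.normal(size=n)                             # baseline covariate
y  = 1.0 * tx + 2.0 * x + rng.normal(size=n)        # heterogeneity through x

def ols_se(X, y):
    """OLS coefficients and conventional model-based standard errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])      # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

ones = np.ones(n)
b_un,  se_un  = ols_se(np.column_stack([ones, tx]),    y)
b_adj, se_adj = ols_se(np.column_stack([ones, tx, x]), y)
print(f"unadjusted SE {se_un[1]:.3f}  vs adjusted SE {se_adj[1]:.3f}")
```

Both estimates are unbiased; the adjusted one simply has a much smaller variance because the covariate explains most of the outcome heterogeneity.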

Methods that develop models from only one treatment arm are prone to overfitting, i.e., fitting idiosyncratic associations in that arm; when the outcome comparison is then made with the other arm, this can result in bias in non-huge samples.

Omission of simulated samples with empty cells may slightly bias simulation results and is not necessary in ordinary proportional odds modeling.

Contrasted with the group sequential design outlined by the authors, a continuously sequential Bayesian design using a traditional proportional odds model for covariate adjustment is likely to provide more understandable evidence and result in earlier stopping for efficacy, harm, or futility.

Discussion

To add your comments, discussions, and questions go to datamethods.org here. See the end of this post for a discussion archive.

Grant Support

This work was supported by CONNECTS and by CTSA award No. UL1 TR002243 from the National Center for Advancing Translational Sciences. Its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Center for Advancing Translational Sciences or the National Institutes of Health. CONNECTS is supported by NIH NHLBI 1OT2HL156812-01, ACTIV Integration of Host-targeting Therapies for COVID-19 Administrative Coordinating Center from the National Heart, Lung, and Blood Institute (NHLBI).

Related Commentaries, Examples, and Other Resources

Lars v: Thank you for this very interesting read that leads to much thoughtful discussion. Here is some food-for-thought. Suppose we observe (X,A,Y) where X are baseline variables, A is a binary treatment, and Y is a continuous outcome. Let us assume that E[Y|A=a, X=x] = c·a + d·x is a linear model, so that we can identify conditional effects by parameters that one can estimate at the rate square-root n (which is a big assumption). Consider the marginal ATE parameter E_X E[Y|A=1,X] - E_X E[Y|A=0,X]. Under the linear model assumption, we actually have E_X E[Y|A=1,X] - E_X E[Y|A=0,X] = E_X[E[Y|A=1,X] - E[Y|A=0,X]] = E_X[c·1 + d·X - c·0 - d·X] = E_X[c] = c. Thus, the marginal ATE equals the conditional treatment effect parameter! This is no coincidence. Most, if not all, conditional treatment effect parameters based on parametric assumptions correspond with some nonparametric marginal treatment effect parameter (e.g. marginalized hazard ratios, odds ratios, etc.). The strong parametric assumptions allow one to turn marginal effects into conditional effects. Note that simply including an interaction between A and X already makes the identification of a conditional effect parameter much more challenging. The benefit of estimating the marginal effect E_X E[Y|A=1,X] - E_X E[Y|A=0,X] instead is that our inference is non-parametrically correct even when the conditional mean is not linear. Nonetheless, there is substantial work on estimating and obtaining inference for conditional treatment effects nonparametrically (e.g. CATE). Note that this is a fairly difficult problem, as conditional treatment effect parameters are usually not square-root(n) estimable in the nonparametric setting.

As a note, this certainly motivates looking for marginal effect parameters that identify conditional effect parameters under stricter assumptions. If one is willing to believe that the assumptions are true, one can always interpret it as a conditional effect. Also, pairing estimates and inference of conditional effects based on parametric models with those of marginal effects from nonparametric models is a good way to obtain robust results that cover all bases. If the marginal effect and conditional effect are substantially different (assuming they identify the same parameter under stricter assumptions), then this might lead one to conclude that the parametric model assumptions are violated.

Frank Harrell: It’s easier to manage discussions on datamethods.org where there is a link above to a topic already started for this area. On a quick read of your interesting comments I get the idea that the thinking is restricted to linear models. If so, the scope may be too narrow.

Tags: bayes, covid-19, design, generalizability, inference, metrics, ordinal, personalized-medicine, RCT, regression, reporting, 2021
https://fharrell.com/post/ipp/index.html — Sat, 17 Jul 2021 05:00:00 GMT

Incorrect Covariate Adjustment May Be More Correct than Adjusted Marginal Estimates

Frank Harrell
https://fharrell.com/post/robcov/index.html

We usually treat individuals not populations. That being so, rational decision-making involves considering who to treat not whether the population as a whole should be treated. I think there are few exceptions to this. One I can think of is water-fluoridation but in most cases we should be making decisions about individuals. In short, there may be reasons on occasion to use marginal models but treating populations will rarely be one of them. — Stephen Senn, 2022

Background

In a randomized clinical trial (RCT) it is typical for several participants’ baseline characteristics to be much more predictive of the outcome variable Y than treatment is predictive of Y. Covariate adjustment in an RCT gains power by making the analysis more consistent with the data generating model, i.e., by accounting for outcome heterogeneity due to wide distributions of baseline prognostic factors. When Y is continuous and random errors have a normal distribution, it is well known that classical covariate adjustment improves power over an unadjusted analysis no matter how poorly the model fits. Lack of fit makes the random errors larger, but not as large as omitting covariates entirely. Nonlinear models such as logistic or Cox regression have no error term to absorb lack of fit, so lack of fit changes (usually towards zero) parameter estimates for all terms in the model, including the treatment effect. But the most poorly specified model is one that assumes all covariate effects are nil, i.e., one that does not adjust for covariates. Even ill-fitting models will provide more useful treatment effect estimates than a model that ignores covariates. See this for details and references about covariate adjustment in RCTs.

Because they are more easily connected to causal inference, many statisticians and epidemiologists like marginal effect estimates. Adjusted marginal effects can adjust for covariate imbalance and can take covariate distributions and resulting outcome heterogeneity into account, at least until the very last step that involves averaging over the covariate distribution. Adjusted marginal estimation does take outcome heterogeneity fully into account at all times when estimating variances. The following simple example for a single treatment arm study shows how that works.

Suppose that a sample of 100 subjects had 40 females and 60 males, and that 10 of the females and 30 of the males had disease. The marginal estimate of the probability of disease is 40/100 = 0.4, and the variance of the estimator assuming constant risk (i.e., assuming risk for females = risk for males) is 0.4 × 0.6 / 100 = 0.0024. But with knowledge of each person’s sex we compute the variance using sex-specific risk estimates as follows. The sex-specific estimates are 10/40 = 0.25 for females and 30/60 = 0.5 for males, and the marginal estimate combines them according to the sex distribution in the sample (a hint that what is about to happen doesn’t apply to the population). This estimate is 0.4 × 0.25 + 0.6 × 0.5 = 0.4, identical to the marginal estimate. But the true and estimated variances are not the same as that computed in the absence of knowledge of subjects’ sex. The estimated variance is 0.4² × (0.25 × 0.75 / 40) + 0.6² × (0.5 × 0.5 / 60) = 0.00225, which is smaller than 0.0024 due to the fact that the correct variance recognizes that males and females do not have the same outcome probabilities. So marginal stratified/adjusted estimation corrects the mistake of using the wrong variance formulas when computing crude marginal estimates, among other benefits such as preventing Simpson’s “paradox”. Any time you see p(1 − p)/n for the variance of a proportion, remember that this formula assumes that p applies equally to all subjects.
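The arithmetic in this example can be verified directly; a quick check in Python (the counts 40/60 females/males and 10/30 diseased come from the text above):

```python
# Verify the worked one-arm example: marginal proportion, naive variance
# assuming constant risk, and the stratified (sex-specific) variance.
n_f, n_m = 40, 60
p_f, p_m = 10 / n_f, 30 / n_m           # sex-specific risk estimates
w_f, w_m = n_f / 100, n_m / 100         # sample sex distribution

marginal  = w_f * p_f + w_m * p_m                       # weighted combination
naive_var = marginal * (1 - marginal) / 100             # assumes one common risk
strat_var = (w_f**2 * p_f * (1 - p_f) / n_f +           # acknowledges that the
             w_m**2 * p_m * (1 - p_m) / n_m)            # two sexes differ
print(marginal, naive_var, strat_var)   # ≈ 0.4, 0.0024, 0.00225
```

The stratified variance is smaller precisely because it does not pretend that one common risk applies to everyone.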

Turning to the more interesting two-sample problem, the adjusted marginal approach can be used to derive other interesting quantities such as marginal adjusted hazard or odds ratios. As opposed to estimating relative treatment effect conditional on covariate values, marginal estimands that account for covariates (outcome heterogeneity) are based on differences in average predicted values. Advocates of marginal treatment effect estimates for nonlinear models such as Benkeser, Díaz, Luedtke, Segal, Scharfstein, and Rosenblum^{1} cite as one of the main advantages of the method its robustness to model misspecification. In their approach, gains in efficiency from covariate adjustment can result, and certain types of model lack of fit are in effect averaged out. But it is this averaging that makes the resulting treatment effect in an RCT hard to interpret. Their marginal treatment effect uses a regression model as a stepping stone to an effect estimator in which estimates are made on a probability scale and averaged. For example, if Y is binary, one might fit a binary logistic regression model on the baseline covariates and the treatment variable (Benkeser et al. prefer to fit separate models by treatment, omitting the treatment indicator from each.) Then for every trial participant one obtains, for example, the estimated probability that Y=1 under treatment A, then for the same covariate values, the probability that Y=1 under treatment B. The estimates are averaged over all subjects and then subtracted to arrive at a marginal treatment effect estimate.

^{1} Benkeser D, Díaz I, Luedtke A, Segal J, Scharfstein D, Rosenblum M (2020): Improving precision and power in randomized trials for COVID-19 treatments using covariate adjustment, for binary, ordinal, and time-to-event outcomes. To appear in Biometrics, DOI:10.1111/biom.13377.
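As a concrete sketch of this two-stage averaging: the code below skips the model-fitting step by plugging in known coefficients patterned after the simulation model used later in this post (so the numbers are illustrative, not a fitted analysis), then carries out the counterfactual predict-under-A / predict-under-B averaging just described.

```python
# Sketch of the plug-in (g-computation) marginal risk difference, using
# known coefficients instead of a fitted model for self-containment.
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

rng  = np.random.default_rng(4)
n    = 4000
age  = rng.normal(70, 8, n)          # covariate distribution of the trial sample
male = rng.integers(0, 2, n)

def risk(tx_b, age, male):
    # logit = b1*[tx=B] + b2*age + b3*(age-65)+ + b4*[male]
    return expit(np.log(2) * tx_b + 0.01 * age +
                 0.07 * np.maximum(age - 65, 0) + 0.5 * male)

pa = risk(0, age, male)              # everyone counterfactually on A
pb = risk(1, age, male)              # everyone counterfactually on B
marg_rd = pb.mean() - pa.mean()      # marginal risk difference
print(f"marginal RD = {marg_rd:.3f}; patient-level RDs span "
      f"{(pb - pa).min():.3f} to {(pb - pa).max():.3f}")
```

Note that the single averaged number depends entirely on the `age`/`male` distribution fed into it, which is the crux of the interpretation problem discussed next.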

There are problems with this approach:

it changes the estimand to something that is not applicable to individual patient decision making

it estimates the difference in probabilities over the distribution of observed covariate values and is dependent on the covariate distributions of participants actually entering the trial

this fails to recognize that RCT participants are not a random sample from the target population; RCTs are valid when their designs result in representative treatment effects, and they do not require representative participants

to convert marginal estimates to estimates that are applicable to the population requires the RCT sample to be a probability sample and for the sampling probabilities from the population to be known

these sampling weights are almost never known; RCTs are almost always based on convenience samples

the marginal approach makes assessment of differential treatment effect (interactions) difficult

The claim that ordinary conditional estimates are not robust also needs further exploration. Here I take a simple example where there are two strong covariates—age and sex—and age has a very nonlinear true effect on the log odds that Y=1. Suppose that the investigators do not know much about flexible parametric modeling (the use of regression splines, etc.) but assume that age has a linear effect, and does the covariate adjustment assuming linearity. Suppose also that sex is omitted from the model. What happens? Is the resulting conditional odds ratio (OR) for treatment valid? We will see that it is not exactly correct, but that it can be more valid than the marginal estimate. In regression analysis one can never get the model “correct.” Instead, modeling is a question of approximating the effects of baseline variables that explain outcome heterogeneity. The better the model the more complete the conditioning and the more accurate the patient-specific effects that are estimated from the model. Omitted covariates or under-fitting strong nonlinear relationships results in effectively conditioning on only part of what one would like to know. This partial conditioning still results in useful estimates, and the estimated treatment effect will be somewhere between a fully correctly adjusted effect and a non-covariate-adjusted effect.

Simulation Model

Assume a true model as specified below:

Pr(Y = 1) = expit(a + b1[tx = B] + b2·age + b3·(age − 65)₊ + b4[sex = male]),

where age is in years, treatment tx is A or B, sex is female or male, and expit denotes the inverse of the logit function, i.e., expit(x) = 1/(1 + exp(−x)). (age − 65)₊ is defined as age − 65 if age > 65 and 0 otherwise, and [u] is 1 if u is true and 0 otherwise. We assume the effect of using treatment B instead of treatment A raises the odds that Y=1 by a factor of 2.0, i.e., the treatment effect is OR=2 so that b1 = log(2). The age effect is a linear spline with slope change at 65y. Assume the true age effect is given by the initial slope of b2 = 0.01 and the increment in slope starting at age 65 is b3 = 0.07. Assume that a = 0 and b4 = 0.5. Then the true relationships are given in the following graph.

Simulate a clinical trial from this model, with 2000 participants in each treatment arm. Assume that the age distribution for those volunteering for the trial has a mean of 70 and a standard deviation of 8.

Code

require(rms)   # provides datadist, lrm, lsp, Predict, robcov

simdat <- function(n, mage, sdage=8, fem=0.5,
                   a=0, b1=log(2), b2=0.01, b3=0.07, b4=0.5) {
  age   <- rnorm(n, mage, sdage)
  sex   <- sample(c('female', 'male'), n, replace=TRUE, prob=c(fem, 1. - fem))
  tx    <- c(rep('A', n/2), rep('B', n/2))
  logit <- a + b1 * (tx == 'B') + b2 * age + b3 * pmax(age - 65, 0) +
           b4 * (sex == 'male')
  prob  <- plogis(logit)
  y     <- ifelse(runif(n) <= prob, 1, 0)
  data.frame(age, sex, tx, y)
}
set.seed(1)
d  <- simdat(n=4000, mage=70)
dd <- datadist(d); options(datadist='dd')

First fit the correct structure—a linear spline in age and with the sex variable included—to make sure we can approximately recover the truth with this fairly large sample size.

Code

f <- lrm(y ~ tx + lsp(age, 65) + sex, data=d)
f

Logistic Regression Model

lrm(formula = y ~ tx + lsp(age, 65) + sex, data = d)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}        196.93     R^{2}              0.080     C         0.664
 0              676     d.f.                 4     R^{2}_{4,4000}     0.047     D_{xy}    0.329
 1             3324     Pr(>χ^{2})     <0.0001     R^{2}_{4,1685.3}   0.108     γ         0.329
max |∂log L/∂β| 4×10^{-12}                         Brier              0.134     τ_{a}     0.092

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept   -0.2723   0.8159   -0.33     0.7385
tx=B         0.6550   0.0885    7.41     <0.0001
age          0.0150   0.0132    1.14     0.2539
age'         0.0597   0.0190    3.15     0.0016
sex=male     0.4661   0.0875    5.33     <0.0001

Code

ggplot(Predict(f, age, tx, sex='female')) +
  geom_line(data=wf, aes(x=age, y=xb, color=tx, linetype=I(2))) +
  labs(title="Fitted quadratic model and true model for females",
       caption="Solid lines: fitted\nDashed lines: truth")

Now fit the incorrect model that assumes age is linear and omits sex. Also compute Huber-White robust variance-covariance estimates.

Code

g <- lrm(y ~ tx + age, data=d, x=TRUE, y=TRUE)
g

Logistic Regression Model

lrm(formula = y ~ tx + age, data = d, x = TRUE, y = TRUE)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}        158.98     R^{2}              0.065     C         0.649
 0              676     d.f.                 2     R^{2}_{2,4000}     0.038     D_{xy}    0.299
 1             3324     Pr(>χ^{2})     <0.0001     R^{2}_{2,1685.3}   0.089     γ         0.299
max |∂log L/∂β| 5×10^{-7}                          Brier              0.135     τ_{a}     0.084

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept   -2.3459   0.3657   -6.41     <0.0001
tx=B         0.6434   0.0881    7.31     <0.0001
age          0.0530   0.0053    9.96     <0.0001

Code

ggplot(Predict(g, age, tx)) +
  geom_line(data=wf, aes(x=age, y=xb, color=tx, linetype=I(2))) +
  labs(title="Fitted linear-in-age model ignoring sex, and true model for females",
       caption="Solid lines: fitted\nDashed lines: truth")

Code

rob <- robcov(g)
rob

Logistic Regression Model

lrm(formula = y ~ tx + age, data = d, x = TRUE, y = TRUE)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}        158.98     R^{2}              0.065     C         0.649
 0              676     d.f.                 2     R^{2}_{2,4000}     0.038     D_{xy}    0.299
 1             3324     Pr(>χ^{2})     <0.0001     R^{2}_{2,1685.3}   0.089     γ         0.299
max |∂log L/∂β| 5×10^{-7}                          Brier              0.135     τ_{a}     0.084

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept   -2.3459   0.3524   -6.66     <0.0001
tx=B         0.6434   0.0882    7.29     <0.0001
age          0.0530   0.0051   10.37     <0.0001

The poorly fitting model profited from balance in the (unknown to it) sex distribution. The robust standard error estimates did not change the (improperly chosen) model-based standard errors very much in this instance.

Now fit an unadjusted model.

Code

h <- lrm(y ~ tx, data=d)
h

Logistic Regression Model

lrm(formula = y ~ tx, data = d)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}         55.68     R^{2}              0.023     C         0.578
 0              676     d.f.                 1     R^{2}_{1,4000}     0.014     D_{xy}    0.157
 1             3324     Pr(>χ^{2})     <0.0001     R^{2}_{1,1685.3}   0.032     γ         0.309
max |∂log L/∂β| 2×10^{-12}                         Brier              0.139     τ_{a}     0.044

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept    1.3069   0.0546   23.93     <0.0001
tx=B         0.6390   0.0869    7.35     <0.0001

Treatment effect estimates and SEs are summarized below.

Model        β      SE     OR
Correct      0.66   0.09   1.92
Linear       0.64   0.09   1.90
Unadjusted   0.64   0.09   1.89

The difference between fitting the correct and the incorrect models usually results in larger changes in treatment effect estimates and/or standard errors than what we see here, but the main points of this exercise are (1) how far the unadjusted treatment effect estimate is from the true value used in data generation, and (2) the difficulty in interpreting marginal estimates.

Adjusted Marginal Estimates

The crude marginal proportions of Y=1 stratified by treatment A, B are 0.787 and 0.875. A slight simplification of the Benkeser estimate (we are not fitting separate models for treatments A and B) is computed below.

Code

marg <- function(fit, data, sx=FALSE) {
  prop <- with(data, tapply(y, tx, mean))
  # Compute estimated P(Y=1) as if everyone was on treatment A
  da <- data.frame(tx='A', age=data$age)
  if(sx) da$sex <- data$sex
  pa <- plogis(predict(fit, da))
  # Compute the same as if everyone was on treatment B
  db <- data.frame(tx='B', age=data$age)
  if(sx) db$sex <- data$sex
  pb <- plogis(predict(fit, db))
  odds <- function(x) x / (1. - x)
  ma <- mean(pa); mb <- mean(pb)
  z <- rbind('Marginal covariate adjusted' =
               c(ma, mb, mb - ma, odds(mb) / odds(ma)),
             'Observed proportions' =
               c(prop, prop[2] - prop[1], odds(prop[2]) / odds(prop[1])))
  colnames(z) <- c('A', 'B', 'B - A', 'OR')
  round(z, 3)
}
marg(g, d)

A B B - A OR
Marginal covariate adjusted 0.788 0.874 0.086 1.871
Observed proportions 0.787 0.875 0.088 1.895

In this instance, the marginal B:A odds ratio happened to almost equal the true conditional OR, which will not be the case in general. For comparison, compute marginal estimates for the correctly fitted model.

Code

marg(f, d, sx=TRUE)

A B B - A OR
Marginal covariate adjusted 0.788 0.875 0.087 1.883
Observed proportions 0.787 0.875 0.088 1.895

The marginal estimates are very close to the raw proportions (this will not be true in general), but as Benkeser et al discussed, they always have an advantage over crude estimates in that their standard errors are smaller. Estimates from the incorrect covariate model are virtually the same as from the correct model.

The question now is how to interpret the 0.087 estimated marginal difference in P(Y=1) between treatments. This difference is a function of all of the values of age observed in the data. It is specific to the participants volunteering to be in our simulated clinical trial. Without selecting a probability sample from the population (i.e., without relying on volunteerism), we have no way to weight the individual P(Y=1) estimates to the population.

How does the marginal difference apply to a clinical population in which the age distribution has a mean that is 15 years younger than the volunteers from which our trial participants were drawn and that instead of a 50:50 sex distribution has 0.65 females? We simulate such a sample.

A B B - A OR
Marginal covariate adjusted 0.678 0.807 0.13 1.996
Observed proportions 0.677 0.807 0.13 1.997

Code

marg(f, d, sx=TRUE)

A B B - A OR
Marginal covariate adjusted 0.676 0.808 0.132 2.019
Observed proportions 0.677 0.807 0.130 1.997

Traditional covariate adjustment with a misspecified model managed to correctly estimate the treatment effect:

Code

lrm(y ~ tx + age, data=d)

Logistic Regression Model

lrm(formula = y ~ tx + age, data = d)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}         97.22     R^{2}              0.035     C         0.597
 0             1030     d.f.                 2     R^{2}_{2,4000}     0.024     D_{xy}    0.194
 1             2970     Pr(>χ^{2})     <0.0001     R^{2}_{2,2294.3}   0.041     γ         0.194
max |∂log L/∂β| 2×10^{-11}                         Brier              0.187     τ_{a}     0.074

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept    0.0270   0.2558    0.11     0.9160
tx=B         0.6926   0.0743    9.32     <0.0001
age          0.0130   0.0046    2.84     0.0045

Code

lrm(y ~ tx + age + sex, data=d)

Logistic Regression Model

lrm(formula = y ~ tx + age + sex, data = d)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}        150.88     R^{2}              0.054     C         0.623
 0             1030     d.f.                 3     R^{2}_{3,4000}     0.036     D_{xy}    0.246
 1             2970     Pr(>χ^{2})     <0.0001     R^{2}_{3,2294.3}   0.062     γ         0.246
max |∂log L/∂β| 5×10^{-9}                          Brier              0.184     τ_{a}     0.094

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept   -0.1607   0.2586   -0.62     0.5343
tx=B         0.7128   0.0749    9.52     <0.0001
age          0.0128   0.0046    2.79     0.0053
sex=male     0.5809   0.0811    7.16     <0.0001

Code

lrm(y ~ tx + lsp(age, 65) + sex, data=d)

Logistic Regression Model

lrm(formula = y ~ tx + lsp(age, 65) + sex, data = d)

                              Model Likelihood          Discrimination            Rank Discrim.
                                    Ratio Test                 Indexes                  Indexes
Obs            4000     LR χ^{2}        162.36     R^{2}              0.058     C         0.624
 0             1030     d.f.                 4     R^{2}_{4,4000}     0.039     D_{xy}    0.247
 1             2970     Pr(>χ^{2})     <0.0001     R^{2}_{4,2294.3}   0.067     γ         0.248
max |∂log L/∂β| 2×10^{-5}                          Brier              0.184     τ_{a}     0.095

            β         S.E.     Wald Z    Pr(>|Z|)
Intercept    0.3295   0.2996    1.10     0.2714
tx=B         0.7148   0.0749    9.54     <0.0001
age          0.0032   0.0055    0.58     0.5587
age'         0.1040   0.0328    3.17     0.0015
sex=male     0.5824   0.0812    7.17     <0.0001

Now instead of an increase in the probability of Y=1 due to treatment B in the mean age 70 group of 0.086 we have an increase of 0.130 in the younger and more female general clinical population that has lower risk. The 0.086 estimate no longer applies.

Frequentist Operating Characteristics

Type I

One may worry that the parametric model that falsely assumed linear age and zero sex effects has poor frequentist operating characteristics. Let’s explore the type I assertion probability α of the linear-in-age logistic model omitting sex, by simulating 5000 trials just like our original trial but with a zero age- and sex-conditional treatment effect. We consider the ordinary Wald statistic for treatment, and the Huber-White robust Wald statistic. Instead of using sample sizes of 4000 we use 600. Also compute α using the actual standard errors for the logistic model, and run the likelihood ratio test for treatment based on the incorrect model. Compute α for the marginal adjusted effect test assuming normality and by using the true standard error (to within simulation error, that is).

SE Power
Usual LRM 0.182 0.045
Robust SE 0.182 0.045
Actual SE 0.180 0.048
Correct model, usual 0.183 0.046
Incorrect model, LR NA 0.047
Marginal 0.036 0.049

Model-based treatment effect standard errors are indistinguishable from robust standard errors in this example. For both types of Wald statistics for testing the treatment effect, α was estimated to be very slightly below the nominal 0.05 level. The significant lack of fit caused by assuming that (1) a very strong covariate (age) is linear and (2) the sex effect is zero did not harm the frequentist assertion probability under the null. The adjusted marginal method had an accurate α, at least when the standard error did not need to be estimated from the data.

Type II

Now consider power to detect a treatment OR of 1.75 for n=600. Give the adjusted marginal method the benefit of not having to estimate the standard error of the difference in average probabilities, by using the standard deviation of observed point estimates over the simulation. For ordinary logistic covariate adjustment we include both the correct and the incorrect models. For the marginal method the incorrect model is used.

Code

set.seed(8)
simoc(b1=log(1.75))

SE Power
Usual LRM 0.197 0.808
Robust SE 0.197 0.809
Actual SE 0.200 0.794
Correct model, usual 0.202 0.813
Incorrect model, LR NA 0.812
Marginal 0.035 0.809

Ordinary covariate adjustment resulted in slightly better power (0.813 vs. 0.808) when the correct model was used. The power of the adjusted marginal comparison (0.809) was virtually the same as these, when the standard error did not need to be estimated.

Conclusion

The marginal treatment effect estimate involves averaging of unlikes. Even though outcome heterogeneity is taken into account, the averaging at the final step hides the outcome heterogeneity that dictates that the risks of outcomes vary systematically by strong baseline covariates. This makes the marginal estimate sample-specific, difficult to interpret, and prevents it from applying to target populations. And critics of traditional covariate conditional models focus on possible non-robustness of nonlinear regression models. As exemplified in simulations shown here, such criticisms may be unwarranted.

If our simulation results hold more generally, the issue with adjusted marginal estimates is more with estimating magnitudes of treatment effectiveness than with statistical power. But ordinary covariate adjustment with ill-fitting nonlinear models may be just as powerful as the two-stage robust marginal procedure, while controlling α.

Risk differences are clinically relevant measures of treatment effect. But because of extreme baseline risk-dependent heterogeneity of risk differences, risk differences should be covariate-specific and not averaged. This is discussed in more detail here.

Tags: 2021, generalizability, RCT, regression
https://fharrell.com/post/robcov/index.html — Tue, 29 Jun 2021 05:00:00 GMT

Avoiding One-Number Summaries of Treatment Effects for RCTs with Binary Outcomes

Frank Harrell
https://fharrell.com/post/rdist/index.html

Background

In a randomized clinical trial (RCT) with a binary endpoint Y it is traditional in a frequentist analysis to summarize the estimated treatment effect with an odds ratio (OR), risk ratio (RR), or risk difference (RD, also called absolute risk reduction). For any of these measures there are several forms of estimation:

covariate-conditional estimates (the usual covariate adjustment approach using a single stage regression analysis)

marginal adjusted estimate (average personalized RD using a two-stage approach)

ORs have some potential to simplify things and can easily be translated into RD, but are not without controversy, and tend to present interpretation problems for some clinical researchers. Marginal adjusted estimates may be robust, but may not accurately estimate RD for either any patient in the RCT or for the clinical population to which RCT results are to be applied, because in effect they assume that the RCT sample is a random sample from the clinical population, something not required and never realized for RCTs.

When Y is a Gaussian continuous response with constant variance, it is possible to reduce the results of an RCT treatment comparison to two numbers: the difference in mean Y and the between-subject covariate-adjusted variance (the latter being used for prediction intervals, as opposed to intervals for group means). Things are much different with binary Y. Among other things, the variance of Y is a function of the mean of Y (P(Y=1)), and model misspecification will alter all of the coefficients in a logistic model. A rich discussion and debate about effect measures, especially ORs, has been held on datamethods.org. At the heart of the debate, as well stated by Sander Greenland, are the problems caused by attempting to reduce a treatment effect to a single number such as an OR or RD.

For individual patient decision making, when narrowing the focus to efficacy alone, the best information to present to the patient is the estimated individualized risk of the bad outcome under each treatment alternative. That is because patients tend to think in terms of absolute risk, and differences in risks don't tell the whole story: an RD often means different things to patients depending on whether the base risk is very small, intermediate, or very large. In this article, results are presented for all patients rather than any one patient, and the picture is simplified to estimation of the RD rather than its two component probabilities. One full-information graphical representation of the component probabilities is also presented.

Statistical Evidence for Efficacy

I propose that the entire distribution of RD from the trial be presented rather than putting so much emphasis on single-number summaries. Some single-number summaries are still needed, however, such as

covariate-adjusted OR

adjusted marginal RD (mean personalized predicted risk as if all patients were on treatment A minus mean predicted risk as if all patients were on treatment B)

median RD

The adjusted marginal RD is the mean over all estimated RDs. As exemplified in the GUSTO-I study, the mean RD may not be representative due to outliers, i.e., the mean RD may be dominated by a minority of high-risk patients. Median RD is more representative of the mortality reduction patients may achieve in this study.
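The pull that a high-risk minority exerts on the mean RD can be seen with a toy mixture in base R; the proportions and beta parameters here are made-up illustrative numbers, not GUSTO estimates.

```r
set.seed(2)
# 90% of patients with small per-patient RDs, plus a 10% high-risk minority
# with much larger RDs (all distributional choices are illustrative)
rd <- c(rbeta(9000, 2, 300),   # low-risk majority: RD centered near 0.007
        rbeta(1000, 2, 40))    # high-risk minority: RD centered near 0.05
mean(rd)     # dragged upward by the high-risk tail
median(rd)   # closer to the benefit a typical patient can expect
```

The mean exceeds the median here for the same reason the GUSTO mean RD exceeds its median: a skewed right tail of high-risk patients.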

The covariate-adjusted logistic regression model does not have to be perfect for its estimates (ORs or RDs) to be valid. This model is recommended for obtaining the primary statistical evidence for any efficacy, by testing the null hypothesis that the treatment OR is 1.0. In the absence of treatment × baseline covariate interactions there is only one p-value. What happens when our RCT presentation involves a distribution of 10,000 RDs for 10,000 patients? There is still only one p-value, because under the model assumptions the risk difference is zero if and only if OR=1.0.
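The "if and only if" claim, and the way a single OR fans out into many RDs, can be checked directly with a couple of lines of base R (a sketch; the OR value 0.8 and the baseline risks are arbitrary):

```r
# RD implied by a baseline risk p0 and a common treatment odds ratio
rd_given_baseline <- function(p0, or) p0 - plogis(qlogis(p0) + log(or))

rd_given_baseline(c(0.02, 0.10, 0.40), or = 0.8)  # one OR, many RDs
rd_given_baseline(c(0.02, 0.10, 0.40), or = 1.0)  # OR = 1 implies RD = 0 for every p0
```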

When one moves from testing whether there is any efficacy vs. whether there is absolute efficacy beyond a certain level, e.g., RD > 0.02, the statistical evidence will vary depending not only on the RD cutoff but also on how high risk is the patient. For example, Bayesian posterior probabilities that RD > d are functions of d and baseline covariates.
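As a sketch of that point, suppose we had posterior draws of the treatment log OR; here they are simulated from a normal distribution as a stand-in for a real posterior, with made-up location and spread. The posterior probability that RD exceeds a fixed cutoff d then differs sharply between a low-risk and a high-risk patient.

```r
set.seed(3)
logor <- rnorm(4000, log(0.8), 0.07)   # stand-in posterior draws of the log OR

# Posterior probability that the RD for a patient with baseline risk p0 exceeds d
prob_rd_gt <- function(p0, d) mean(p0 - plogis(qlogis(p0) + logor) > d)

# Same cutoff d = 0.01, two different baseline risks
c(low_risk  = prob_rd_gt(0.05, 0.01),
  high_risk = prob_rd_gt(0.30, 0.01))
```

The evidence for a clinically meaningful benefit is far stronger for the high-risk patient even though a single posterior for the OR underlies both probabilities.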

As a side note, from an RD perspective, the treatment benefit a patient gets from being at high baseline risk is indistinguishable from the benefit she gets from a treatment × baseline covariate interaction. The distinction does matter, however, for quantifying statistical evidence.

For the remainder of the article I concentrate on summarizing trials using the entire RD distribution.

GUSTO-I

The 41,000 patient GUSTO-I study has been a gold mine for predictive modeling and efficacy exploration, spawning many excellent statistical re-analyses. The study goal was to provide evidence for whether or not t-PA lowered mortality over streptokinase (SK) for patients with acute myocardial infarction. Overall proportion of deaths at 30d was 0.07. I analyze a subset of 30,510 patients to compare the accelerated dosing of t-PA with streptokinase. Descriptive statistics may be found here.

Extensive analyses have found no evidence for treatment interactions in this dataset, and have shown that an additive (on the logit scale) binary logistic model with flexible modeling of continuous covariates provides an excellent fit to the data.

The t-PA:SK odds ratio is 0.81, but let's present richer results by estimating the distribution of SK - t-PA risk differences from the above full covariate-adjusted model. First, though, show the full information needed for medical decision making: the estimated risks for individual patients under both treatment alternatives. One can readily see risk magnification: the absolute risk reduction grows as baseline risk increases. The points are all on a line because the logistic model allowed for no interactions with treatment (and no interactions were needed).

Code

prisk <- function(fit) {
  d <- copy(gusto)   # otherwise next command will make data.table change the original dataset
  d[, tx := 'SK']
  p1 <- plogis(predict(fit, d))
  d[, tx := 'tPA']
  p2 <- plogis(predict(fit, d))
  list(p1=p1, p2=p2)
}
pr <- prisk(f)
with(pr, ggfreqScatter(p1, p2, bins=1000) +
       geom_abline(intercept=0, slope=1) +
       xlab('Risk of Mortality for SK') +
       ylab('Risk of Mortality for t-PA'))
xl <- 'SK - t-PA Risk Difference'
d  <- with(pr, p1 - p2)
hist(d, nclass=100, prob=TRUE, xlab=xl, main='')
lines(density(d, adj=0.4))
mmed <- function(x, pl=TRUE) {
  mn <- mean(x)
  md <- median(x)
  if(pl) abline(v=c(mn, md), col=c('red', 'blue'))
  round(c('Mean risk difference'=mn, 'Median RD'=md), 4)
}
mmed(d)

Mean risk difference Median RD
0.0111 0.0067

Code

title(sub='Red line: mean RD; Blue line: median RD', adj=1)

This graph provides a much fuller picture than an OR or than the blue and red vertical lines (median and mean RD). Clinicians can readily see that most patients are at lower risk and receive little absolute benefit from t-PA, while a minority of very-high-risk patients can receive almost an absolute 0.05 risk reduction. But the previous graph shows even more.

RD Distribution Under Different Models

We can never make the data better than they are. We would like to know the RD distribution for a model that adjusts for all prognostic factors, but it is not possible to know or to measure all such factors. In the spirit of SIMEX (simulation-extrapolation), which is used to correct for measurement error when estimating parameters, we can instead make things worse than they are and study how the results vary. Use a fast backwards step-down procedure to rank covariates by their apparent predictive importance, then remove covariates one at a time until only the apparently most important variable (age) remains. For each model, show the distribution of RDs computed from it. Treatment is kept in all models, however.

Code

fastbw(f, aics=100000)

Deleted Chi-Sq d.f. P Residual d.f. P AIC
tx 15.27 1 1e-04 15.27 1 1e-04 13.27
sex 30.92 1 0e+00 46.19 2 0e+00 42.19
pmi 63.89 1 0e+00 110.09 3 0e+00 104.09
miloc 105.93 2 0e+00 216.02 5 0e+00 206.02
pulse 286.28 2 0e+00 502.30 7 0e+00 488.30
sysbp 292.69 1 0e+00 794.98 8 0e+00 778.98
Killip 633.77 3 0e+00 1428.75 11 0e+00 1406.75
age 996.28 3 0e+00 2425.03 14 0e+00 2397.03
Approximate Estimates after Deleting Factors
Coef S.E. Wald Z P
[1,] -2.101 0.02446 -85.89 0
Factors in Final Model
None

Code

forms <- list(
  'full'    = day30 ~ tx + rcs(age,4) + Killip + pmin(sysbp, 120) + lsp(pulse, 50) + pmi + miloc + sex,
  '-sex'    = day30 ~ tx + rcs(age,4) + Killip + pmin(sysbp, 120) + lsp(pulse, 50) + pmi + miloc,
  '-pmi'    = day30 ~ tx + rcs(age,4) + Killip + pmin(sysbp, 120) + lsp(pulse, 50) + miloc,
  '-miloc'  = day30 ~ tx + rcs(age,4) + Killip + pmin(sysbp, 120) + lsp(pulse, 50),
  '-pulse'  = day30 ~ tx + rcs(age,4) + Killip + pmin(sysbp, 120),
  '-sysbp'  = day30 ~ tx + rcs(age,4) + Killip,
  '-Killip' = day30 ~ tx + rcs(age,4))
z <- u <- NULL; i <- 0; nm <- names(forms)
for(form in forms) {
  i <- i + 1
  f <- lrm(form, data=gusto, maxit=30)
  d <- with(prisk(f), p1 - p2)
  z <- rbind(z, data.frame(model=nm[i], d=d))
  u <- rbind(u, data.frame(model=nm[i], what=c('mean', 'median'),
                           stat=c(mean(d), median(d))))
  if(i > 1) cat('Mean |difference in RD| compared to previous model:',
                round(mean(abs(d - dprev)), 4), '\n')
  dprev <- d
}

Mean |difference in RD| compared to previous model: 0.001
Mean |difference in RD| compared to previous model: 0.0011
Mean |difference in RD| compared to previous model: 0.002
Mean |difference in RD| compared to previous model: 0.0025
Mean |difference in RD| compared to previous model: 0.0024
Mean |difference in RD| compared to previous model: 0.0028

The final model (labeled -Killip) contains only age and treatment. The risk difference distribution is fairly stable over the various models.

Summary

The never-ending discussion about the choice of effect measures when Y is binary is best resolved by avoiding the oversimplifications that are required to make such choices. The proposed summarization of the main trial result is more clinically interpretable, more consistent with individual patient decision making, and embraces rather than hides outcome heterogeneity in the RD distribution. See this for some related discussions.

Comments from Readers

Giuliano Cruz: Very interesting post! It hurts even more to read sens/spec "optimization" in, say, the biomarker literature after seeing decision-theoretic approaches for specifying risk thresholds. One doubt I have: following https://www.tandfonline.com/doi/abs/10.1198/000313008X370302, the formula for the rational odds(Pt) (that arrived at Pt=10%) seems inverted. As I read it, it should be odds(Pt) = (TN-FP)/(TP-FN). Is that correct? Thanks!

Uriah Finkel: Great post! I wonder about risk percentiles. In some applications it is required to report results for the top 5% of the population at risk, or there is only enough budget to intervene on a limited absolute number of people. Should that be considered as another constraint?