Posts on Statistical Thinking
http://fharrell.com/post/
© 2018-2019
Sun, 01 Jan 2017 00:00:00 +0000

Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements
http://fharrell.com/post/addvalue/
Wed, 17 Oct 2018 00:00:00 +0000
<style>
p.caption {
font-size: 0.6em;
}
pre code {
overflow: auto;
word-wrap: normal;
white-space: pre;
}
</style>
<p class="rquote">
When the outcome variable Y is continuous, there are only three measures of added value that are commonly used: increase in <span class="math inline">\(R^2\)</span>, decrease in mean squared prediction error, and decrease in mean absolute prediction error. Why have so many measures been invented when Y is binary or censored?
</p>
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<p>A recurring topic in clinical and translational research is the assessment of new information provided by molecular, physiologic, imaging, and other biomarkers. The notion of added value is in the predictive sense, either when diagnosing a hidden disease, e.g., predicting current status using a binary or ordinal logistic model, or when predicting future events, e.g., using a time-to-event statistical model to predict future disease occurrence, recurrence of disease, or occurrence of a clinical event such as death or stroke.</p>
<p>Before getting into purely analytical issues, note that there are many study design issues. One of the most common mistakes, sometimes intentional, is to fail to collect variables that are available in medical practice to put into the base model. This leads to inadequate adjustment for prior information, setting the bar too low when measuring added value of new information. Other common mistakes are categorizing some of the adjustment variables, resulting in residual confounding and failure to adjust for the real background variables, and categorizing new continuous measurements, resulting in understatement of their added value. Without knowing it, many translational researchers, by dichotomizing new biomarkers, are requiring additional biomarkers to be measured to make up for the lost information in dichotomized markers.</p>
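<p>The information cost of dichotomizing a marker can be checked by simulation. The following is a minimal sketch on simulated data; the variable names, the sample size, and the median cutpoint are illustrative assumptions and are not part of the case study. It compares the model likelihood ratio <span class="math inline">\(\chi^2\)</span> for a marker kept continuous with that for the same marker split at its median.</p>
<pre class="r"><code>set.seed(6)
n      <- 1000
marker <- rnorm(n)                       # hypothetical continuous biomarker
y      <- rbinom(n, 1, plogis(marker))   # true log odds are linear in the marker

# Model LR chi-square: null deviance minus residual deviance
LR <- function(fit) fit$null.deviance - fit$deviance

high    <- as.numeric(marker > median(marker))   # dichotomized version of the marker
lr_cont <- LR(glm(y ~ marker, family = binomial))
lr_dich <- LR(glm(y ~ high,   family = binomial))

round(c(continuous = lr_cont, dichotomized = lr_dich,
        ratio = lr_dich / lr_cont), 2)</code></pre>
<p>The ratio of the two statistics is a rough gauge of how much of the marker's information survives the split in this simulated setting.</p>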
</div>
<div id="howdidwegethere" class="section level1">
<h1>How Did We Get Here</h1>
<p>Statisticians have no better sense of history than other scientists. In the quest for publishing new ideas, measures of added value are constantly being invented by statisticians, without asking whether older methods already solve the problem at hand. Examples of measures that are commonly used but are not needed in this setting are the <span class="math inline">\(c\)</span>-index (<span class="citation">Harrell et al. (<a href="#refhar82">1982</a>)</span>; area under the ROC curve if the outcome is binary), and <a href="https://onlinelibrary.wiley.com/toc/10970258/27/2">IDI and NRI</a>. They are not needed because measures based on standard regression methods are not only adequate to the task, but are more powerful and <strong>more flexible and insightful</strong>, especially when interactions are involved. Especially problematic are measures such as the categorical version of NRI (net reclassification improvement), which not only requires arbitrary categorization of risk estimates but then goes on to use inefficient binary summaries of them. Pencina (personal communication) has regretted including statistical tests for these measures in his highly cited paper, as these tests have nowhere near the power of the gold-standard likelihood ratio test.</p>
<p>Comparing two c-indexes (one from the base model and one from the larger model containing the new biomarkers) is a low-power procedure. This is because the c-index is a rank measure (the concordance probability from the Wilcoxon test or Somers’ <span class="math inline">\(D_{xy}\)</span> rank correlation) that does not sufficiently reward extreme predictions that are correct<a href="#fn1" class="footnoteref" id="fnref1"><sup>1</sup></a>. And taking the difference between two c-indexes corresponds to taking the difference in two Wilcoxon statistics, which is never done. Instead, a head-to-head comparison is demanded. This can be done by using a different U-statistic based on all possible pairs of observations and counting the fraction of pairs for which one model is “more concordant” with the outcome than the other model. This respects pairings of pairings, and is implemented in the R <code>Hmisc</code> package <code>rcorrp.cens</code> function. Still, this is not as powerful as the likelihood ratio <span class="math inline">\(\chi^2\)</span> test.</p>
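<p>The rank-based nature of the concordance probability can be seen in a small simulation. The sketch below uses only base R and simulated data; the helper names <code>cindex</code> and <code>logscore</code> are hypothetical and written here for illustration. Two sets of predictions that order the observations identically receive the same concordance probability, while the per-observation log-likelihood separates them.</p>
<pre class="r"><code># Concordance probability for binary y, from the Wilcoxon rank-sum statistic
cindex <- function(p, y) {
  r  <- rank(p)                      # midranks are used for ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Logarithmic accuracy score: mean log-likelihood per observation
logscore <- function(p, y) mean(y * log(p) + (1 - y) * log(1 - p))

set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x))     # true log odds = 2x

p_true   <- plogis(2 * x)            # predictions from the true model
p_shrunk <- plogis(0.5 * x)          # same ordering, pulled toward 0.5

c(c_true   = cindex(p_true, y),   c_shrunk   = cindex(p_shrunk, y))
c(log_true = logscore(p_true, y), log_shrunk = logscore(p_shrunk, y))</code></pre>
<p>Because both prediction vectors are increasing functions of the same <code>x</code>, their ranks agree and so do the two concordance probabilities; only the log-likelihood-based score credits the more extreme predictions for being correct.</p>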
<p>Worse than any of these problems is the continued use of <a href="http://fharrell.com/post/classdamage">discontinuous improper accuracy scoring rules</a> such as sensitivity, specificity, precision, and recall.</p>
</div>
<div id="keymeasures" class="section level1">
<h1>Key Measures</h1>
<p>There are three gold standards, and statisticians have too often tried to forget them in the search for novelty:</p>
<ul>
<li>frequentist: log-likelihood, including the likelihood ratio <span class="math inline">\(\chi^2\)</span> test (LR) and AIC</li>
<li>Bayesian: log-likelihood + log prior, including various Bayesian information criteria</li>
<li>explained variation in Y</li>
</ul>
<p>Methods based on one of these gold standards are simpler, more powerful, and allow for greater complexity. The best example of handling complexity will be demonstrated in the case study below, in which the new marker interacts with a standard variable (there, age) when predicting disease probability.</p>
<p>In the binary outcome (Y) case, there are two commonly used proper accuracy scores: the logarithmic accuracy score (a per-observation log-likelihood), and the quadratic accuracy score (Brier score; mean squared error). The log-likelihood can be turned into a 0-1 pseudo <span class="math inline">\(R^2\)</span> measure: <span class="math inline">\(1 - \exp(-\text{LR} / n)\)</span>. The only thing going against pseudo <span class="math inline">\(R^2\)</span> is the difficulty in interpreting its absolute value. But it is excellent for comparing two or more models, even though examining increases in LR is better.</p>
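<p>As a minimal sketch of these quantities, assuming simulated data and a base R <code>glm</code> fit rather than the case-study models, the LR statistic, the pseudo <span class="math inline">\(R^2\)</span> <span class="math inline">\(1 - \exp(-\text{LR}/n)\)</span>, and the two proper accuracy scores can be computed as follows. All object names are illustrative.</p>
<pre class="r"><code>set.seed(2)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit <- glm(y ~ x, family = binomial)
p   <- fitted(fit)                           # predicted Prob(Y = 1)

LR       <- fit$null.deviance - fit$deviance # model likelihood ratio chi-square
pseudoR2 <- 1 - exp(-LR / n)                 # 0-1 pseudo R^2 from the text
logscore <- mean(y * log(p) + (1 - y) * log(1 - p))  # logarithmic score
brier    <- mean((p - y)^2)                  # quadratic (Brier) score

round(c(LR = LR, pseudoR2 = pseudoR2, logscore = logscore, brier = brier), 3)</code></pre>
<p>For ungrouped binary data the residual deviance equals minus twice the log-likelihood, so the logarithmic score is <code>-fit$deviance / (2 * n)</code>.</p>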
<p>Explained outcome variation is another key type of measure. In the linear model, the traditional <span class="math inline">\(R^2\)</span> is often used. This is SSR / SST where SSR is the sum of squares due to regression (the sum of squares of differences between predicted values and the mean predicted value), and SST is the sum of squares total, which is n-1 times the variance of Y. <span class="math inline">\(R^2\)</span> may also be written as SSR / (SSR + SSE) where SSE is the sum of squared residuals. In the linear model, <span class="math inline">\(R^2\)</span> and the log-likelihood are measuring the same thing since LR = <span class="math inline">\(-n \log(1 - R^{2})\)</span>.</p>
<p>We can re-express <span class="math inline">\(R^2\)</span> as <span class="math inline">\(\frac{\text{var}(\hat{Y})}{\text{var}(Y)}\)</span> where <span class="math inline">\(\hat{Y} = X \hat{\beta}\)</span> is the linear predictor (predicted mean, for the linear model).</p>
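<p>Both identities for the linear model, LR = <span class="math inline">\(-n \log(1 - R^{2})\)</span> and <span class="math inline">\(R^2 = \text{var}(\hat{Y}) / \text{var}(Y)\)</span>, can be verified numerically. The sketch below assumes simulated data and ordinary least squares via base R <code>lm</code>; the variable names are illustrative.</p>
<pre class="r"><code>set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

full <- lm(y ~ x1 + x2)
null <- lm(y ~ 1)                             # intercept-only model

# LR chi-square from the maximized Gaussian log-likelihoods
LR <- as.numeric(2 * (logLik(full) - logLik(null)))
R2 <- summary(full)$r.squared

c(LR = LR, minus_n_log_1_minus_R2 = -n * log(1 - R2))
c(R2 = R2, var_ratio = var(fitted(full)) / var(y))</code></pre>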
<p><span class="math inline">\(R^2\)</span> can equivalently be written as the ratio of the explained variance to the sum of explained and unexplained variance. The unexplained variance is the variance of the residuals in the linear model. For a probability model, the natural way to express the proportion of explained variance is through the predicted probabilities that Y=1, denoted by <span class="math inline">\(\hat{P}\)</span>. The <span class="math inline">\(R^2\)</span> measure for a binary Y model is then <span class="math display">\[\frac{\text{var}(\hat{P})}{\text{var}(\hat{P}) + \sum_{i=1}^{n} \hat{P}_{i} (1 - \hat{P}_{i}) / n}\]</span>
where <span class="math inline">\(\text{var}(\hat{P})\)</span> is the sample variance of the <span class="math inline">\(n\)</span> <span class="math inline">\(\hat{P}_{i}\)</span>.<a href="#fn2" class="footnoteref" id="fnref2"><sup>2</sup></a></p>
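<p>This measure needs only the vector of predicted probabilities. A minimal sketch follows, in which the function name <code>r2_binary</code> and the simulated predictors are assumptions made for illustration: the explained part is the sample variance of <span class="math inline">\(\hat{P}\)</span> and the unexplained part is the average Bernoulli variance <span class="math inline">\(\hat{P}_{i}(1 - \hat{P}_{i})\)</span>.</p>
<pre class="r"><code># Proportion of explained variation for a binary-outcome model
r2_binary <- function(p) {
  explained   <- var(p)              # sample variance of predicted probabilities
  unexplained <- mean(p * (1 - p))   # average Bernoulli variance
  explained / (explained + unexplained)
}

set.seed(4)
n <- 500
x <- rnorm(n)                        # informative predictor
z <- rnorm(n)                        # pure-noise predictor
y <- rbinom(n, 1, plogis(1.5 * x))

p_x <- fitted(glm(y ~ x, family = binomial))
p_z <- fitted(glm(y ~ z, family = binomial))

round(c(informative = r2_binary(p_x), noise = r2_binary(p_z)), 3)</code></pre>
<p>Constant predictions have zero variance and therefore give a value of exactly zero.</p>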
<p><span class="citation">Kent and O’Quigley (<a href="#refken88mea">1988</a>)</span> have extended the idea of the fraction of explained variation in the outcome to various nonlinear models including those used in survival analysis. The SST or var(Y) is distribution-specific. The beauty of this approach is its focus on the variance of <span class="math inline">\(\hat{Y}\)</span>, which is independent of the prevalence of Y=1 in the binary case and of the amount of censoring in time-to-event analysis. See also <span class="citation">B. Choodari-Oskooei, Royston, and Parmar (<a href="#refcho12simII">2012</a>)</span> and <span class="citation">Babak Choodari-Oskooei, Royston, and Parmar (<a href="#refcho12simI">2012</a>)</span>.</p>
<p>A different type of key measure will also be exemplified in the case study: differences in predicted values between the base model and the expanded model.</p>
<div id="relativeexplainedvariation" class="section level2">
<h2>Relative Explained Variation</h2>
<p>Relative explained variation is a simple concept that has the extra advantage of being completely free of the distribution of Y and the customized error variance or SST necessary for computing the proportion of explained variance for distributions other than the normal. For a linear model relative explained variation is the ratio of the <span class="math inline">\(R^2\)</span> for the base model to the larger <span class="math inline">\(R^2\)</span> for the combined model that contains also the new markers being tested. And since for <span class="math inline">\(R^{2} \leq 0.25\)</span>, <span class="math inline">\(-n \log(1 - R^{2})\)</span> is approximately <span class="math inline">\(n R^{2}\)</span>, the relative variation explained by the base variables is approximately equal to the <em>adequacy index</em> discussed in the maximum likelihood estimation chapter of <a href="http://biostat.mc.vanderbilt.edu/rms">Regression Modeling Strategies</a> (<span class="citation">Harrell (<a href="#refrms2">2015</a>)</span>): <span class="math display">\[\text{Adequacy index} = \text{LR}_{A} / \text{LR}_{AB}\]</span> where the base model is denoted by A, the added predictors (e.g., biomarkers) are denoted by B, AB represents the combined model with A and B as predictors, and LR is defined above. Here adequacy refers to the adequacy of the model that ignores the new predictors.</p>
<p>Whether using the adequacy index or relative variation explained, one minus such an index is the fraction of new information provided by predictors in B. It is the proportion of explainable variation that is explained by B.</p>
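<p>A minimal sketch of both indexes on simulated data follows. The predictors <code>a</code> and <code>b</code> stand in for the base variables A and the new marker B, and the choice of the linear predictor scale for the variance ratio is an assumption made here for illustration.</p>
<pre class="r"><code>set.seed(5)
n <- 1000
a <- rnorm(n)                                  # base predictor (A)
b <- rnorm(n)                                  # new marker (B)
y <- rbinom(n, 1, plogis(a + 0.7 * b))

fitA  <- glm(y ~ a,     family = binomial)     # base model
fitAB <- glm(y ~ a + b, family = binomial)     # combined model

# Model LR chi-square: null deviance minus residual deviance
LR <- function(fit) fit$null.deviance - fit$deviance

adequacy <- LR(fitA) / LR(fitAB)               # adequacy of the base model
new_info <- 1 - adequacy                       # fraction of new information from B

# Relative explained variation: ratio of variances of the linear predictors
rel_var <- var(predict(fitA)) / var(predict(fitAB))

round(c(adequacy = adequacy, new_info = new_info, rel_var = rel_var), 3)</code></pre>
<p>Because the two models are nested and share the same null deviance, the adequacy index computed this way cannot exceed 1.</p>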
<p>To emphasize the simplicity of relative explained variation, it is just the ratio of variances of predicted values. And other statistical indexes may be computed from the predicted values, such as the mean absolute difference from the mean predicted value, and the <span class="math inline">\(g\)</span> index described in <em>Regression Modeling Strategies</em>. The latter is the mean absolute difference between any two predicted values. But it is possible for the <span class="math inline">\(g\)</span> index for AB to be smaller than that for model A.</p>
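<p>The mean absolute difference between any two predicted values is Gini's mean difference, which can be computed from the sorted values without forming all pairs. The sketch below is illustrative: the function name <code>gmd</code> is hypothetical, the data are simulated, and applying the index to the linear predictors of two nested logistic models is an assumption about the scale of interest.</p>
<pre class="r"><code># Gini's mean difference: mean |x_i - x_j| over all pairs i != j
gmd <- function(x) {
  n  <- length(x)
  xs <- sort(x)
  2 * sum((2 * seq_len(n) - n - 1) * xs) / (n * (n - 1))
}

set.seed(7)
n <- 300
a <- rnorm(n)
b <- rnorm(n)
y <- rbinom(n, 1, plogis(a + 0.7 * b))

lpA  <- predict(glm(y ~ a,     family = binomial))  # linear predictor, model A
lpAB <- predict(glm(y ~ a + b, family = binomial))  # linear predictor, model AB

round(c(g_A = gmd(lpA), g_AB = gmd(lpAB)), 3)

# Brute-force check over all pairs
all.equal(gmd(lpA), mean(as.vector(dist(lpA))))</code></pre>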
</div>
</div>
<div id="casestudyquantifyingdiagnosticinformation" class="section level1">
<h1>Case Study: Quantifying Diagnostic Information</h1>
<p>Consider a series of patients from the Duke Cardiovascular Disease Databank. These patients were referred to Duke University Medical Center for chest pain and underwent cardiac catheterization during which a dye is injected and a coronary angiography is used to view blockages of coronary arteries. Significant coronary artery disease is here defined as a blockage of at least 75% by vessel diameter, in at least one major coronary artery. Here we consider total cholesterol as if it were a new diagnostic marker, and we wish to quantify the new diagnostic information provided by cholesterol. The base model is oversimplified for purposes of illustration. It contains only the powerful variables age and sex. In practice it should contain in addition to age and sex all relevant easily available baseline variables, such as pain characteristics, blood pressure, smoking history, etc. From previous analysis, a nonlinear interaction was demonstrated between age and sex and between age and cholesterol. The former is related to women “catching up” with men with respect to cardiovascular risk, after menopause. The latter interaction captures the fact that high cholesterol is not as dangerous for older patients and the possibility that very low cholesterol is actually harmful for them.</p>
<p>The dataset is available from the Vanderbilt Department of Biostatistics wiki, and includes 2258 patients with all variables measured. It may be automatically downloaded using the R <code>Hmisc getHdata</code> function. The dataset is also analyzed in the <a href="http://hbiostat.org/doc/bbr.pdf">BBR diagnosis chapter</a> and <a href="http://hbiostat.org/talks/memtab18.pdf">here</a>. The latter link contains more analyses that compare pre- and post-test probabilities.</p>
<div id="developmentofbinarylogisticmodel" class="section level2">
<h2>Development of Binary Logistic Model</h2>
<p>We first fit a binary logistic regression model with interactions, modeling age and cholesterol in a smooth nonlinear fashion using restricted cubic splines with default knot<a href="#fn3" class="footnoteref" id="fnref3"><sup>3</sup></a> locations. The nonlinear interaction between age and cholesterol is a restricted one such that terms that are nonlinear in both predictors are excluded. This is to save degrees of freedom. The code below also fits the base model containing only age and sex.</p>
<p>Estimated log odds of the risk of significant CAD is plotted against cholesterol for ages 40 and 70 years for males. One can readily see that the diagnostic value of cholesterol is greater for younger patients, and there is some evidence that very low total cholesterol is risky at age 70.</p>
<pre class="r"><code>require(rms)</code></pre>
<pre class="r"><code>options(prType='html')
getHdata(acath)
acath <- subset(acath, !is.na(choleste))
acath$sex <- factor(acath$sex, 0:1, c('male', 'female'))
dd <- datadist(acath); options(datadist='dd')
f <- lrm(sigdz ~ rcs(age,4) * sex, data=acath)
f</code></pre>
<div align="center">
<strong>Logistic Regression Model</strong>
</div>
<pre>
lrm(formula = sigdz ~ rcs(age, 4) * sex, data = acath)
</pre>
<table class="gmisc_table" style="bordercollapse: collapse; margintop: 1em; marginbottom: 1em;">
<thead>
<tr>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderright: 1px solid black; textalign: center;">
Model Likelihood<br>Ratio Test
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderright: 1px solid black; textalign: center;">
Discrimination<br>Indexes
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderright: 1px solid black; textalign: center;">
Rank Discrim.<br>Indexes
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
Obs 2258
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
LR χ<sup>2</sup> 489.51
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>R</i><sup>2</sup> 0.270
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>C</i> 0.770
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
0 768
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
d.f. 7
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>g</i> 1.216
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>D</i><sub>xy</sub> 0.539
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
1 1490
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
Pr(>χ<sup>2</sup>) <0.0001
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>g</i><sub>r</sub> 3.372
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
γ 0.546
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
max ∂log <i>L</i>/∂β 1×10<sup>-7</sup>
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>g</i><sub>p</sub> 0.242
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
τ<sub>a</sub> 0.242
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
</td>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderright: 1px solid black; textalign: center;">
</td>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderright: 1px solid black; textalign: center;">
Brier 0.177
</td>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderright: 1px solid black; textalign: center;">
</td>
</tr>
</tbody>
</table>
<table class="gmisc_table" style="bordercollapse: collapse; margintop: 1em; marginbottom: 1em;">
<thead>
<tr>
<th style="fontweight: 900; borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: center;">
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
β
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
S.E.
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
Wald <i>Z</i>
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
Pr(><i>Z</i>)
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="textalign: left;">
Intercept
</td>
<td style="minwidth: 7em; textalign: right;">
3.9266
</td>
<td style="minwidth: 7em; textalign: right;">
0.8405
</td>
<td style="minwidth: 7em; textalign: right;">
4.67
</td>
<td style="minwidth: 7em; textalign: right;">
<0.0001
</td>
</tr>
<tr>
<td style="textalign: left;">
age
</td>
<td style="minwidth: 7em; textalign: right;">
0.1130
</td>
<td style="minwidth: 7em; textalign: right;">
0.0213
</td>
<td style="minwidth: 7em; textalign: right;">
5.30
</td>
<td style="minwidth: 7em; textalign: right;">
<0.0001
</td>
</tr>
<tr>
<td style="textalign: left;">
age’
</td>
<td style="minwidth: 7em; textalign: right;">
0.0761
</td>
<td style="minwidth: 7em; textalign: right;">
0.0658
</td>
<td style="minwidth: 7em; textalign: right;">
1.16
</td>
<td style="minwidth: 7em; textalign: right;">
0.2477
</td>
</tr>
<tr>
<td style="textalign: left;">
age’’
</td>
<td style="minwidth: 7em; textalign: right;">
0.1890
</td>
<td style="minwidth: 7em; textalign: right;">
0.2856
</td>
<td style="minwidth: 7em; textalign: right;">
0.66
</td>
<td style="minwidth: 7em; textalign: right;">
0.5082
</td>
</tr>
<tr>
<td style="textalign: left;">
sex=female
</td>
<td style="minwidth: 7em; textalign: right;">
2.2574
</td>
<td style="minwidth: 7em; textalign: right;">
1.5121
</td>
<td style="minwidth: 7em; textalign: right;">
1.49
</td>
<td style="minwidth: 7em; textalign: right;">
0.1355
</td>
</tr>
<tr>
<td style="textalign: left;">
age × sex=female
</td>
<td style="minwidth: 7em; textalign: right;">
0.0930
</td>
<td style="minwidth: 7em; textalign: right;">
0.0378
</td>
<td style="minwidth: 7em; textalign: right;">
2.46
</td>
<td style="minwidth: 7em; textalign: right;">
0.0138
</td>
</tr>
<tr>
<td style="textalign: left;">
age’ × sex=female
</td>
<td style="minwidth: 7em; textalign: right;">
0.0629
</td>
<td style="minwidth: 7em; textalign: right;">
0.1073
</td>
<td style="minwidth: 7em; textalign: right;">
0.59
</td>
<td style="minwidth: 7em; textalign: right;">
0.5577
</td>
</tr>
<tr>
<td style="borderbottom: 2px solid grey; textalign: left;">
age’’ × sex=female
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.0905
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.4412
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.21
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.8375
</td>
</tr>
</tbody>
</table>
<pre class="r"><code>pre <- predict(f, type='fitted') # pretest probability
g <- lrm(sigdz ~ rcs(age,4) * sex + rcs(choleste,4) + rcs(age,4) %ia%
rcs(choleste,4), data=acath)
g</code></pre>
<div align="center">
<strong>Logistic Regression Model</strong>
</div>
<pre>
lrm(formula = sigdz ~ rcs(age, 4) * sex + rcs(choleste, 4) +
rcs(age, 4) %ia% rcs(choleste, 4), data = acath)
</pre>
<table class="gmisc_table" style="bordercollapse: collapse; margintop: 1em; marginbottom: 1em;">
<thead>
<tr>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderright: 1px solid black; textalign: center;">
Model Likelihood<br>Ratio Test
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderright: 1px solid black; textalign: center;">
Discrimination<br>Indexes
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; borderright: 1px solid black; textalign: center;">
Rank Discrim.<br>Indexes
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
Obs 2258
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
LR χ<sup>2</sup> 596.99
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>R</i><sup>2</sup> 0.322
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>C</i> 0.795
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
0 768
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
d.f. 15
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>g</i> 1.401
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>D</i><sub>xy</sub> 0.590
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
1 1490
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
Pr(>χ<sup>2</sup>) <0.0001
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>g</i><sub>r</sub> 4.060
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
γ 0.590
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
max ∂log <i>L</i>/∂β 4×10<sup>-5</sup>
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
<i>g</i><sub>p</sub> 0.266
</td>
<td style="minwidth: 9em; borderright: 1px solid black; textalign: center;">
τ<sub>a</sub> 0.265
</td>
</tr>
<tr>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderleft: 1px solid black; borderright: 1px solid black; textalign: center;">
</td>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderright: 1px solid black; textalign: center;">
</td>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderright: 1px solid black; textalign: center;">
Brier 0.167
</td>
<td style="minwidth: 9em; borderbottom: 2px solid grey; borderright: 1px solid black; textalign: center;">
</td>
</tr>
</tbody>
</table>
<table class="gmisc_table" style="bordercollapse: collapse; margintop: 1em; marginbottom: 1em;">
<thead>
<tr>
<th style="fontweight: 900; borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: center;">
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
β
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
S.E.
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
Wald <i>Z</i>
</th>
<th style="borderbottom: 1px solid grey; bordertop: 2px solid grey; textalign: right;">
Pr(><i>Z</i>)
</th>
</tr>
</thead>
<tbody>
<tr>
<td style="textalign: left;">
Intercept
</td>
<td style="minwidth: 7em; textalign: right;">
7.3635
</td>
<td style="minwidth: 7em; textalign: right;">
4.9896
</td>
<td style="minwidth: 7em; textalign: right;">
1.48
</td>
<td style="minwidth: 7em; textalign: right;">
0.1400
</td>
</tr>
<tr>
<td style="textalign: left;">
age
</td>
<td style="minwidth: 7em; textalign: right;">
0.1688
</td>
<td style="minwidth: 7em; textalign: right;">
0.1113
</td>
<td style="minwidth: 7em; textalign: right;">
1.52
</td>
<td style="minwidth: 7em; textalign: right;">
0.1292
</td>
</tr>
<tr>
<td style="textalign: left;">
age’
</td>
<td style="minwidth: 7em; textalign: right;">
0.0364
</td>
<td style="minwidth: 7em; textalign: right;">
0.2516
</td>
<td style="minwidth: 7em; textalign: right;">
0.14
</td>
<td style="minwidth: 7em; textalign: right;">
0.8850
</td>
</tr>
<tr>
<td style="textalign: left;">
age’’
</td>
<td style="minwidth: 7em; textalign: right;">
0.0410
</td>
<td style="minwidth: 7em; textalign: right;">
1.0461
</td>
<td style="minwidth: 7em; textalign: right;">
0.04
</td>
<td style="minwidth: 7em; textalign: right;">
0.9687
</td>
</tr>
<tr>
<td style="textalign: left;">
sex=female
</td>
<td style="minwidth: 7em; textalign: right;">
3.0665
</td>
<td style="minwidth: 7em; textalign: right;">
1.6093
</td>
<td style="minwidth: 7em; textalign: right;">
1.91
</td>
<td style="minwidth: 7em; textalign: right;">
0.0567
</td>
</tr>
<tr>
<td style="textalign: left;">
choleste
</td>
<td style="minwidth: 7em; textalign: right;">
0.0145
</td>
<td style="minwidth: 7em; textalign: right;">
0.0256
</td>
<td style="minwidth: 7em; textalign: right;">
0.57
</td>
<td style="minwidth: 7em; textalign: right;">
0.5709
</td>
</tr>
<tr>
<td style="textalign: left;">
choleste’
</td>
<td style="minwidth: 7em; textalign: right;">
0.0316
</td>
<td style="minwidth: 7em; textalign: right;">
0.0889
</td>
<td style="minwidth: 7em; textalign: right;">
0.36
</td>
<td style="minwidth: 7em; textalign: right;">
0.7226
</td>
</tr>
<tr>
<td style="textalign: left;">
choleste’’
</td>
<td style="minwidth: 7em; textalign: right;">
0.1417
</td>
<td style="minwidth: 7em; textalign: right;">
0.2811
</td>
<td style="minwidth: 7em; textalign: right;">
0.50
</td>
<td style="minwidth: 7em; textalign: right;">
0.6142
</td>
</tr>
<tr>
<td style="textalign: left;">
age × choleste
</td>
<td style="minwidth: 7em; textalign: right;">
0.0003
</td>
<td style="minwidth: 7em; textalign: right;">
0.0006
</td>
<td style="minwidth: 7em; textalign: right;">
0.51
</td>
<td style="minwidth: 7em; textalign: right;">
0.6072
</td>
</tr>
<tr>
<td style="textalign: left;">
age × choleste’
</td>
<td style="minwidth: 7em; textalign: right;">
0.0002
</td>
<td style="minwidth: 7em; textalign: right;">
0.0017
</td>
<td style="minwidth: 7em; textalign: right;">
0.12
</td>
<td style="minwidth: 7em; textalign: right;">
0.9016
</td>
</tr>
<tr>
<td style="textalign: left;">
age × choleste’’
</td>
<td style="minwidth: 7em; textalign: right;">
0.0005
</td>
<td style="minwidth: 7em; textalign: right;">
0.0055
</td>
<td style="minwidth: 7em; textalign: right;">
0.09
</td>
<td style="minwidth: 7em; textalign: right;">
0.9296
</td>
</tr>
<tr>
<td style="textalign: left;">
age’ × choleste
</td>
<td style="minwidth: 7em; textalign: right;">
0.0004
</td>
<td style="minwidth: 7em; textalign: right;">
0.0011
</td>
<td style="minwidth: 7em; textalign: right;">
0.40
</td>
<td style="minwidth: 7em; textalign: right;">
0.6866
</td>
</tr>
<tr>
<td style="textalign: left;">
age’’ × choleste
</td>
<td style="minwidth: 7em; textalign: right;">
0.0004
</td>
<td style="minwidth: 7em; textalign: right;">
0.0046
</td>
<td style="minwidth: 7em; textalign: right;">
0.09
</td>
<td style="minwidth: 7em; textalign: right;">
0.9301
</td>
</tr>
<tr>
<td style="textalign: left;">
age × sex=female
</td>
<td style="minwidth: 7em; textalign: right;">
0.1134
</td>
<td style="minwidth: 7em; textalign: right;">
0.0402
</td>
<td style="minwidth: 7em; textalign: right;">
2.82
</td>
<td style="minwidth: 7em; textalign: right;">
0.0048
</td>
</tr>
<tr>
<td style="textalign: left;">
age’ × sex=female
</td>
<td style="minwidth: 7em; textalign: right;">
0.0687
</td>
<td style="minwidth: 7em; textalign: right;">
0.1140
</td>
<td style="minwidth: 7em; textalign: right;">
0.60
</td>
<td style="minwidth: 7em; textalign: right;">
0.5468
</td>
</tr>
<tr>
<td style="borderbottom: 2px solid grey; textalign: left;">
age’’ × sex=female
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.1509
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.4678
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.32
</td>
<td style="minwidth: 7em; borderbottom: 2px solid grey; textalign: right;">
0.7471
</td>
</tr>
</tbody>
</table>
<pre class="r"><code>post <- predict(g, type='fitted') # posttest probability
ageg <- c(40, 70) # test=cholesterol
psig <- Predict(g, choleste, age=ageg, sex='male', fun=plogis)
ggplot(psig, adj.subtitle=FALSE, ylab='Prob(CAD)')</code></pre>
<p><img src="http://fharrell.com/post/addvalue_files/figurehtml/fit1.png" width="672" /></p>
</div>
<div id="visualizinginformationprovidedbycholesterolbyanalyzingpreandposttestprobabilities" class="section level2">
<h2>Visualizing Information Provided by Cholesterol By Analyzing Pre- and Post-test Probabilities</h2>
<p>The most basic display of added value is back-to-back high-resolution histograms of pre- and post-test predicted risks. When the new markers add clinically important predictive information, the histogram widens.</p>
<pre class="r"><code>par(mgp=c(4,1,0), mar=c(6,5,2,2))
xlab <- c(paste0('Pretest\nvariance=', round(var(pre), 3)),
paste0('Posttest\nvariance=',round(var(post), 3)))
histbackback(pre, post, brks=seq(0.01, 0.99, by=0.01), xlab=xlab, ylab='Prob(CAD)')</code></pre>
<p><img src="http://fharrell.com/post/addvalue_files/figurehtml/hist1.png" width="672" /></p>
<p>Another good way to visualize the diagnostic yield of cholesterol is to plot estimated post- vs. pre-test probability of CAD. We use quantile regression to estimate the 0.1 and 0.9 quantiles of post-test risk as a function of pre-test risk. This allows one to readily see the typical changes in a pre-test probability once cholesterol is known.</p>
<pre class="r"><code>plot(pre, post, xlab='PreTest Probability (age + sex)',
ylab='PostTest Probability\n(age + sex + cholesterol)', pch=46)
abline(a=0, b=1, col=gray(.8))
lo <- Rq(post ~ rcs(pre, 7), tau=0.1) # 0.1 quantile
hi <- Rq(post ~ rcs(pre, 7), tau=0.9) # 0.9 quantile
at <- seq(0, 1, length=200)
lines(at, Predict(lo, pre=at)$yhat, col='red', lwd=1.5)
lines(at, Predict(hi, pre=at)$yhat, col='red', lwd=1.5)
abline(v=.5, col='red')</code></pre>
<p><img src="http://fharrell.com/post/addvalue_files/figurehtml/prepost1.png" width="672" /></p>
<p>By plotting the difference between post-test and pre-test risk vs. age we can easily see the dependence on age of the added value of cholesterol. We show the data for males.</p>
<pre class="r"><code>d <- cbind(acath, pre=pre, post=post)
with(subset(d, sex == 'male'),
{
  # 0.1 and 0.9 quantiles of the post - pre difference as a function of age
  lo <- Rq(post - pre ~ rcs(age, 5), tau=0.1)
  hi <- Rq(post - pre ~ rcs(age, 5), tau=0.9)
  at <- seq(min(age), max(age), length=200)
  w <- data.frame(age=at, lo=Predict(lo, age=at)$yhat,
                  hi=Predict(hi, age=at)$yhat)
  ggfreqScatter(age, post - pre,
                xlab='Age', ylab='Estimated Post - Pre-test Probability') +
    geom_line(aes(x=age, y=lo), data=w, inherit.aes=FALSE) +
    geom_line(aes(x=age, y=hi), data=w, inherit.aes=FALSE) +
    geom_hline(aes(yintercept=0), col='red')
})</code></pre>
<p><img src="http://fharrell.com/post/addvalue_files/figurehtml/vsage1.png" width="672" /></p>
</div>
<div id="statisticalindexesforaddedvalueofcholesterol" class="section level2">
<h2>Statistical Indexes for Added Value of Cholesterol</h2>
<p>The gold standard LR test for added value of cholesterol is obtained by comparing log likelihoods of the pre- and post-test models. Cholesterol has a total of 8 parameters: 5 for interacting with the spline function of age and 3 for its main effect<a href="#fn4" class="footnoteref" id="fnref4"><sup>4</sup></a>. Before getting the 8 d.f. LR test, let’s also get the LR test for interaction between cholesterol and age.</p>
<pre class="r"><code>h <- lrm(sigdz ~ rcs(age,4)*sex + rcs(choleste,4), data=acath)
lrtest(h, g) # test interaction</code></pre>
<pre><code>
Model 1: sigdz ~ rcs(age, 4) * sex + rcs(choleste, 4)
Model 2: sigdz ~ rcs(age, 4) * sex + rcs(choleste, 4) + rcs(age, 4) %ia%
rcs(choleste, 4)
L.R. Chisq d.f. P
11.19828202 5.00000000 0.04758732 </code></pre>
<pre class="r"><code>lrtest(f, g) # test cholesterol as main or interacting effect</code></pre>
<pre><code>
Model 1: sigdz ~ rcs(age, 4) * sex
Model 2: sigdz ~ rcs(age, 4) * sex + rcs(choleste, 4) + rcs(age, 4) %ia%
rcs(choleste, 4)
L.R. Chisq d.f. P
107.4788 8.0000 0.0000 </code></pre>
<p>From the last test there is very strong evidence that cholesterol adds diagnostic value. The question is to what extent. So we turn to the various indexes discussed earlier, indexes that are sample-size independent, unlike the LR <span class="math inline">\(\chi^2\)</span> statistic, which doubles when the sample size doubles, all else being equal.
As with the graphical summaries above, our indexes require no binning of data and fully allow for cholesterol to have varying importance as a function of age.</p>
<p>The detailed statistics that were listed for the pre- and post-test models above provide the LR <span class="math inline">\(\chi^2\)</span>, Nagelkerke pseudo <span class="math inline">\(R^2\)</span>, three <span class="math inline">\(g\)</span> indexes (log odds ratio scale, odds ratio scale, and risk scale), Brier score, <span class="math inline">\(c\)</span>-index, Somers’ <span class="math inline">\(D_{xy}\)</span>, Goodman-Kruskal <span class="math inline">\(\gamma\)</span>, and Kendall’s <span class="math inline">\(\tau_a\)</span>, the latter three being rank correlations between predicted disease risk and actual disease presence. <span class="math inline">\(D_{xy}\)</span> is connected to <span class="math inline">\(c\)</span> by <span class="math inline">\(D_{xy} = 2 (c - \frac{1}{2})\)</span>.</p>
<div id="newoldmeasures" class="section level3">
<h3>New Old Measures</h3>
<p>Now consider measures of explained variation in disease status, and relative explained variation. Recall that pre- and post-test predicted risks are stored in the variables <code>pre</code> and <code>post</code>, respectively.</p>
<pre class="r"><code>r <- function(x) round(x, 2)
s <- function(x) round(x, 3)
lra <- f$stats['Model L.R.']   # pre-test LR chi-square
lrb <- g$stats['Model L.R.']   # post-test LR chi-square
ra <- f$stats['R2']            # pre-test Nagelkerke pseudo R^2
rb <- g$stats['R2']            # post-test Nagelkerke pseudo R^2
# fraction of explained risk: var(p) / (var(p) + mean(p * (1 - p)))
br2 <- function(p) var(p) / (var(p) + sum(p * (1 - p)) / length(p))</code></pre>
<p>In the table below, “Fraction of new information” is the proportion of total predictive information in age, sex, and cholesterol that was added by cholesterol (main effect + age interaction effect). “Fraction explained risk” is the <span class="math inline">\(R^2\)</span> measure explicitly for binary Y, i.e., the ratio of the variance of predicted risk to the sum of the variance of predicted risk and the average of risk times (1 - risk), computed by the <code>br2</code> function above.</p>
<table>
<thead>
<tr class="header">
<th></th>
<th>Index</th>
<th>Value</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>a</td>
<td>Pre-test LR <span class="math inline">\(\chi^2\)</span></td>
<td>489.51</td>
<td></td>
</tr>
<tr class="even">
<td>b</td>
<td>Post-test LR <span class="math inline">\(\chi^2\)</span></td>
<td>596.99</td>
<td></td>
</tr>
<tr class="odd">
<td>c</td>
<td>Adequacy of base model</td>
<td>0.82</td>
<td>a/b</td>
</tr>
<tr class="even">
<td>d</td>
<td>Fraction of new information from cholesterol</td>
<td>0.18</td>
<td>1 - c</td>
</tr>
<tr class="odd">
<td>e</td>
<td>Pre-test Nagelkerke pseudo <span class="math inline">\(R^2\)</span></td>
<td>0.27</td>
<td></td>
</tr>
<tr class="even">
<td>f</td>
<td>Post-test Nagelkerke pseudo <span class="math inline">\(R^2\)</span></td>
<td>0.32</td>
<td></td>
</tr>
<tr class="odd">
<td>g</td>
<td>Variance of pre-test risk</td>
<td>0.047</td>
<td></td>
</tr>
<tr class="even">
<td>h</td>
<td>Variance of post-test risk</td>
<td>0.057</td>
<td></td>
</tr>
<tr class="odd">
<td>i</td>
<td>Relative explained variation</td>
<td>0.83</td>
<td>g/h</td>
</tr>
<tr class="even">
<td>j</td>
<td>Fraction of new information</td>
<td>0.17</td>
<td>1 - i</td>
</tr>
<tr class="odd">
<td>k</td>
<td>Pre-test fraction explained risk</td>
<td>0.21</td>
<td></td>
</tr>
<tr class="even">
<td>l</td>
<td>Post-test fraction explained risk</td>
<td>0.25</td>
<td></td>
</tr>
<tr class="odd">
<td>m</td>
<td>Relative explained variation</td>
<td>0.83</td>
<td>k/l</td>
</tr>
<tr class="even">
<td>n</td>
<td>Fraction of new information</td>
<td>0.17</td>
<td>1 - m</td>
</tr>
</tbody>
</table>
<p>By the various indexes in the above table, the fraction of total diagnostic information that was due to the “new” test (cholesterol) ranges from 0.17 to 0.18 (rows d, j, and n). My favorite measure for assessing the fraction of diagnostic information that is new information from cholesterol is given by row j, i.e., one minus the ratio of the variance of pre-test probability of disease <span class="math inline">\(\hat{P}\)</span> to the variance of post-test <span class="math inline">\(\hat{P}\)</span>. This is related to one of the most useful displays one can make: high-resolution histograms of pre- and post-test probabilities, where the emphasis is on the width of the distributions. More discriminating models provide a greater variety of predictions, subject to assuming the model is well-calibrated. See above for histograms of pre- and post-test predicted risks.</p>
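<p>As a quick arithmetic check, the following minimal sketch recomputes rows c and d from the rounded LR <span class="math inline">\(\chi^2\)</span> values in rows a and b. The variable names are made up for illustration and do not appear in the analysis above. The difference between the two statistics should also match, up to rounding, the 8 d.f. LR <span class="math inline">\(\chi^2\)</span> of 107.48 reported for cholesterol.</p>
<pre class="r"><code># LR chi-square values copied (rounded) from rows a and b of the table
lr_pre  <- 489.51   # age + sex
lr_post <- 596.99   # age + sex + cholesterol

adequacy <- lr_pre / lr_post   # row c: adequacy of the base model
frac_new <- 1 - adequacy       # row d: fraction of new information
lr_added <- lr_post - lr_pre   # LR chi-square for adding cholesterol

round(c(adequacy=adequacy, frac_new=frac_new, lr_added=lr_added), 2)
# adequacy 0.82, frac_new 0.18, lr_added 107.48</code></pre>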
</div>
</div>
<div id="analysisconclusions" class="section level2">
<h2>Analysis Conclusions</h2>
<p>The diagnostic yield of cholesterol is age- and sex-dependent. Measuring cholesterol seldom allows one to rule in or rule out coronary disease with certainty. It adds a fraction of 0.17 of new diagnostic information to age and sex.</p>
</div>
</div>
<div id="areasofapplication" class="section level1">
<h1>Areas of Application</h1>
<p>The methods described here have wide applicability to diagnostic and prognostic research. One area in particular is ripe for applying the methods: genetics. Genetic markers have been touted as being valuable for diagnosis and prognosis. It would be worthwhile to quantify the added value of genetic markers using one of the above indexes, along with the graphics displaying diagnostic or prognostic yield for individual subjects.</p>
</div>
<div id="references" class="section level1 unnumbered">
<h1>References</h1>
<div id="refs" class="references">
<div id="refcho12simI">
<p>Choodari-Oskooei, Babak, Patrick Royston, and Mahesh K. B. Parmar. 2012. “A Simulation Study of Predictive Ability Measures in a Survival Model I: Explained Variation Measures.” <em>Stat Med</em> 31 (23): 2627–43. <a href="https://doi.org/10.1002/sim.4242">https://doi.org/10.1002/sim.4242</a>.</p>
</div>
<div id="refcho12simII">
<p>Choodari-Oskooei, Babak, Patrick Royston, and Mahesh K. B. Parmar. 2012. “A Simulation Study of Predictive Ability Measures in a Survival Model II: Explained Randomness and Predictive Accuracy.” <em>Stat Med</em> 31 (23): 2644–59. <a href="https://doi.org/10.1002/sim.5460">https://doi.org/10.1002/sim.5460</a>.</p>
</div>
<div id="refhar82">
<p>Harrell, F. E., R. M. Califf, D. B. Pryor, K. L. Lee, and R. A. Rosati. 1982. “Evaluating the Yield of Medical Tests.” <em>JAMA</em> 247: 2543–6.</p>
</div>
<div id="refrms2">
<p>Harrell, Frank E. 2015. <em>Regression Modeling Strategies, with Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis</em>. Second edition. New York: Springer. <a href="https://doi.org/10.1007/978-3-319-19425-7">https://doi.org/10.1007/978-3-319-19425-7</a>.</p>
</div>
<div id="refken88mea">
<p>Kent, John T., and John O’Quigley. 1988. “Measures of Dependence for Censored Survival Data.” <em>Biometrika</em> 75: 525–34.</p>
</div>
<div id="refsch03pre">
<p>Schemper, Michael. 2003. “Predictive Accuracy and Explained Variation.” <em>Stat Med</em> 22: 2299–2308.</p>
</div>
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The Wilcoxon, Spearman, and other rank tests are very powerful for testing for existence of an association, but not for assessing <em>differences</em> in the strength of association.<a href="#fnref1" class="footnoteback">↩</a></p></li>
<li id="fn2"><p><span class="citation">Schemper (<a href="#refsch03pre">2003</a>)</span> is an excellent paper advocating for measures based on absolute rather than squared differences.<a href="#fnref2" class="footnoteback">↩</a></p></li>
<li id="fn3"><p>Knots are locations of curvature changes, i.e., locations in the covariate space where cubic polynomials are joined together.<a href="#fnref3" class="footnoteback">↩</a></p></li>
<li id="fn4"><p>The LR test could have easily tested multiple new biomarkers simultaneously.<a href="#fnref4" class="footnoteback">↩</a></p></li>
</ol>
</div>

In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion
http://fharrell.com/post/mlconfusion/
Tue, 28 Aug 2018 00:00:00 +0000
http://fharrell.com/post/mlconfusion/
<p>Drew Griffin Levy, PhD<br><small>GoodScience, Inc.</small><br><small><tt>drew@dogoodscience.com</tt></small><br><small><tt> <a href="http://linkedin.com/in/drewglevy" target="_blank">Linkedin:drewglevy</a> </tt></small><br><small><tt> <a href="http://www.DoGoodScience.com" target="_blank">DoGoodScience.com</a> </tt></small><br><br></p>
<p>Machine Learning (ML) has already transformed e-commerce, web search, advertising, finance, intelligence, media, and more. ML is becoming ubiquitous and its centripetal gravity draws health care into the swirl. ML will potentially impact all aspects of health care: prevention, diagnostics, drug discovery, drug development, therapeutics, safety, health care delivery, population health management, administration, etc. The question is not whether ML will disrupt healthcare; this is inexorable. The fundamental question is how ML will ultimately help patients: and more immediately, how it will help clinicians provide better care to patients.</p>
<p>Recent demonstrations of ML applications in health care (e.g., <a href="https://www.nature.com/articles/s4174601800291" target="_blank">Rajkomar, et al., Scalable and accurate deep learning with electronic health records. 2018</a>) feature advances for interoperability, scalability, and integrating all available digital health information to “harmonize inputs and predict medical events through direct feature learning with high accuracy for tasks such as predicting in-hospital mortality (area under the receiver operator curve [AUROC] across sites 0.93–0.94), predicting 30-day unplanned readmission (AUROC 0.75–0.76), predicting prolonged length of stay (AUROC 0.85–0.86), and predicting all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90).” This may be an inflection point for ML in health care, prefiguring ML as a routine tool for efficiently integrating and making sense of health care data at scale.</p>
<p>These AUROC measures of performance feel reassuring, and results like these will surely encourage additional work. However, <em>the quality and character of ML is exposed in the nature of the performance metrics chosen</em>. We must carefully choose the goals we ask these systems to optimize.</p>
<p>Evaluation of models for use in health care should scrupulously take the intended purpose of the model into account. The AUROC may be the right answer, but to a question subtly, yet fundamentally, different from <em>prediction</em>. For prediction, the AUROC is only an oblique answer to the correct question: how well does the data-algorithm system make accurate predictions of responses for future observations? This problem exposes a general confusion in ML of classification and prediction and several additional problems. These include categorization of inherently continuous variables in the interest of spurious prediction by classification, and misconstrued discrimination, among other issues. Calibration approaches appropriate for representing prediction accuracy are largely overlooked in ML. Rigorous calibration of prediction is important for model optimization, but also ultimately crucial for medical ethics. Finally, the amelioration and evolution of ML methodology is about more than just technical issues: it will require vigilance for our own human biases that make us see only what we want to see, and keep us from thinking critically and acting consistently.</p>
<p>With AUROCs such as those reported in Rajkomar, <em>et al.</em>, we feel comforted that something is working as expected. These AUROCs feel very satisfying. Such is the power of cognitive comfort and the social force of convention that the receiver operating characteristic (ROC) curve has been reflexively used to evaluate model performance in almost all reports of statistical modeling and statistical learning. The ROC <em>does</em> indicate something about the capability of a tool for identifying a binary signal (such as for radar tuning for detecting incoming planes) and patterns in the errors of signal discrimination (noise). But the ROC and AUROC are predicated on sensitivity and specificity, and consequently do not provide correct direct information about the potential value of the prediction tool. The reasoning is so subtle as to be generally elusive. It is due to improper (inverse or perverse) conditioning: the error of <a href="https://en.wikipedia.org/wiki/Confusion_of_the_inverse" target="_blank">transposed conditionals</a> or affirming the consequent.</p>
<h2 id="theconfusionmatrix">The Confusion Matrix</h2>
<p>A <a href="https://en.wikipedia.org/wiki/Confusion_matrix" target="_blank">confusion matrix</a> is used to describe the performance of a classification model (a “classifier”) in binary data for which the true values are known as well. In its simplest and most typical presentation, it is a special contingency table with two dimensions used to evaluate the results of the test or algorithm. Each row of the matrix represents the ascribed or attributed class while each column represents the actual condition (the truth). The cells of the confusion matrix report the number of true positives and false positives, and false negatives and true negatives.</p>
<p>The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity (or probability of detection in machine learning). The false-positive rate is calculated as 1 - specificity.</p>
<p>While sensitivity and specificity sound like the right thing and a good thing, there is an essential misdirection for prediction. The problem for prediction with focusing on sensitivity and specificity is that you are conditioning on the wrong thing: the true underlying condition; you are conditioning on the thing you actually want information about. In $Pr($true positive $\mid$ true condition positive$)$ and $Pr($true negative $\mid$ true condition negative$)$ these measures make fixed the aspect that should be free to vary to provide the information actually needed to assess something meaningful about the future performance of the algorithm. To understand how the algorithm will actually perform in new data, the measures required are $Pr($true positive $\mid$ ascribed class positive$)$ and $Pr($true negative $\mid$ ascribed class negative$)$, i.e., the other dimension of the confusion matrix. To measure something meaningful about future performance, the outcome of interest (or what will be found to be true) must not be fixed by conditioning.</p>
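<p>To make the two directions of conditioning concrete, here is a minimal sketch in R using made-up counts. The matrix <code>cm</code> and every number in it are hypothetical and not taken from any study. Conditioning on the columns (the actual condition) gives sensitivity and specificity; conditioning on the rows (the ascribed class) gives the predictive values.</p>
<pre class="r"><code># Hypothetical counts: rows are the ascribed class, columns the actual condition
cm <- matrix(c( 90,  10,    # actual positive: 90 ascribed pos, 10 ascribed neg
               180, 720),   # actual negative: 180 ascribed pos, 720 ascribed neg
             nrow=2,
             dimnames=list(ascribed=c('pos', 'neg'), actual=c('pos', 'neg')))

by_truth <- prop.table(cm, margin=2)   # condition on the actual condition
by_class <- prop.table(cm, margin=1)   # condition on the ascribed class

round(c(sensitivity = by_truth['pos', 'pos'],
        specificity = by_truth['neg', 'neg'],
        ppv         = by_class['pos', 'pos'],
        npv         = by_class['neg', 'neg']), 3)
# sensitivity 0.900, specificity 0.800, ppv 0.333, npv 0.986</code></pre>
<p>In this invented example a sensitivity of 0.9 coexists with only a one-in-three chance that a patient ascribed to the positive class is actually positive.</p>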
<p>While sensitivity and specificity are widely employed as accuracy measures and have intuitive appeal, it is well established that our intuitions can mislead us. Sensitivity and specificity are properties of the measurement process. Sensitivity- and specificity-based measures are not meaningful probabilities for prediction <em>per se</em>, unless we are specifically interested in our informed guesses when we already know the outcomes (the retrospective view); for example, the probability of the antecedent test result given present knowledge of disease or outcome status. Sensitivity and specificity would be applicable in case-control studies, for example. It is generally the reverse that is useful information in providing health care. For prediction and decision making we need to directly forecast the likelihood that the patient has the disease, given the test result (and other available information). How good the test is among patients with and without disease is ancillary.</p>
<p>Sensitivity and specificity only tell you something obliquely about prediction. They tell you something about the observed error proportions for specific tests or algorithms, but nothing about uncertainties for future observations or events, and nothing directly about the quality of the prognostication. Prediction requires conditional probabilities in which the frequency of the outcome or response variable of interest is random and depends on earlier events (e.g., test results or algorithm results). For decision making our uncertainty is generally and properly about prospective probabilities (likelihood of a future event of concern) given past and present conditions and events.</p>
<p>The information in the two axes of the confusion matrix is not symmetric. They are related, but not the same. This is why Bayes’ theorem is so valuable. The conditional probability, $Pr(A \mid B)$, that event A occurs given that event B has occurred already, is not equal to $Pr(B \mid A)$. Falsely equating the two probabilities frequently causes various errors of reasoning.</p>
<p>The confusion matrix and its subtle informational asymmetries are a source of confusion. This is nuanced but not trivial, inconsequential, or negligible.</p>
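<p>A small worked example may help show how far apart the two conditional probabilities can be. The sketch below applies Bayes’ theorem with a hypothetical sensitivity of 0.9 and specificity of 0.8 (assumed values, chosen only for illustration) and evaluates the probability of disease given a positive result at three assumed prevalences.</p>
<pre class="r"><code># Hypothetical test characteristics, held fixed
sens <- 0.9   # Pr(positive result | disease)
spec <- 0.8   # Pr(negative result | no disease)

# Bayes' theorem: Pr(disease | positive result) as a function of prevalence
ppv <- function(prev) sens * prev / (sens * prev + (1 - spec) * (1 - prev))

prev <- c(0.01, 0.1, 0.5)   # assumed prevalences
round(ppv(prev), 3)
# 0.043 0.333 0.818</code></pre>
<p>The probability of a positive result given disease stays at 0.9 throughout, while the probability of disease given a positive result moves from about 0.04 to about 0.82.</p>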
<p>The problem with how the ROC and AUROC are used is in confusing signal detection for prediction. The confusion comes from the fact that both signal detection and prediction involve reckoning uncertainty, but these are different kinds of uncertainty: measurement error estimation vs. stochastic estimation. This confusion is exacerbated when the concept of prediction is redefined as “filling in missing information” (<a href="https://www.predictionmachines.ai/" target="_blank">“Prediction Machines: The Simple Economics of Artificial Intelligence,” 2018</a>). This liberal epistemology discounts the fundamental importance of time for information and creates ambiguity. Whether the arrow is in the bow or on the target matters for epistemology and information. But a proper accuracy <a href="https://en.wikipedia.org/wiki/Scoring_rule" target="_blank">scoring rule</a> for prediction is a metric applied to probability forecasts. It is a metric that is optimized when the forecasted probabilities are identical to the true outcome probabilities.</p>
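<p>The defining property of a proper scoring rule can be checked numerically. The following is a minimal sketch using the Brier (quadratic) score as one example of a proper rule; the true probability of 0.3 and the function name <code>expected_brier</code> are assumptions made for illustration.</p>
<pre class="r"><code># Expected Brier score when the true event probability is p and the forecast is q
expected_brier <- function(q, p) p * (1 - q)^2 + (1 - p) * q^2

p <- 0.3                   # hypothetical true outcome probability
q <- seq(0, 1, by=0.01)    # candidate forecasts

best <- q[which.min(expected_brier(q, p))]
best                       # 0.3: the forecast equal to the true probability
expected_brier(best, p)    # 0.21, i.e., p * (1 - p)</code></pre>
<p>Any forecast other than the true probability gives a larger expected score, which is what makes the rule reward honest probability forecasts.</p>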
<p>The prevalence and stubborn persistence of the ROC and AUROC may be attributable to the complexity and nuance of the underlying statistical reasoning, and the assumed wisdom of existing practice. It is much like the zombie inertia of the use of the null-hypothesis statistical testing paradigm, the p-value and Type I error in epidemiology and the social sciences (and elsewhere). Sensitivity, specificity, the ROC and the c-index have a nice technical “truthiness” quality about them that makes them attractive and “sticky”. The relative merit of positive predictive value (PPV) and negative predictive value (NPV) over sensitivity and specificity for clinical purposes involves similar issues and is emphasized in current medical education, but neglected in ML. This is also very much like the inability of case-control studies to measure absolute risk because of conditioning on case status. And there are numerous psychological and social factors that also explain the ROC’s persistent attendance in the literature. For example, when confronted with a perplexing problem, question, or decision, we make life easier for ourselves by unwittingly answering a substitute, simpler question. Instead of estimating the probability of a certain complex outcome we subconsciously rely on an estimate of another, less complex or more accessible outcome. We never get around to answering the harder question (<a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow" target="_blank">D. Kahneman, <em>Thinking Fast and Slow</em></a>). It is also very difficult to not adopt or capitulate to what has become a norm in scientific communication.</p>
<p>The problem of using the ROC as a performance measure for ML prediction systems is further complicated by specious categorization of inherently continuous variables in the interest of accommodating inefficient discrimination measures, and moreover by spurious prediction by classification.</p>
<h2 id="amatrixofconfusion">A Matrix of Confusion</h2>
<p>The measures of sensitivity and specificity in the confusion matrix are binary or dichotomous; and a binary measure (the bit) is the most elemental and simplest form of information. Humans have a very strong bias for avoiding complexity and strong tendencies for reducing dimensionality whenever possible. This tendency is so strong that we often seek to satisfy it even if unconsciously we throw away information to do so. Models are frequently developed and reported which dichotomize predicted values post hoc. For instance, information-rich predicted probabilities from logistic regression and other models are frequently dichotomized at some threshold (e.g., arbitrarily at 0.5) to permit expression as categories (of 0’s and 1’s) just to supply the units that sensitivity and specificity measures and the ROC require for an index of discrimination. This is coercing prediction into a classification paradigm and confusing fundamentally different objectives. Regardless of the optimization algorithm, the practice of using categorical accuracy scores for measuring predictive performance and to drive feature selection for models has led to countless misleading findings in bioinformatics, machine learning, and data science (<a href="http://fharrell.com/post/classdamage/">Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules</a>).</p>
<p>Categorization of inherently continuous independent or predictor variables (the right-hand side of the equation) is also highly prevalent and associated with a host of problems (<a href="http://biostat.mc.vanderbilt.edu/CatContinuous" target="_blank">Problems Caused by Categorizing Continuous Variables</a>). This is unnecessary as it is easy, using regression splines, to allow every continuous predictor in modeling to have a smooth nonlinear effect. Categorization of continuous variables, whether dependent or independent variables, is associated with waste of information at best, but more generally leads to distortions of information and purposes. Categorization of continuous variables is entropic; and it only appears to help.</p>
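<p>The information discarded by dichotomizing can be seen in a toy comparison. In the sketch below, two hypothetical sets of predicted probabilities for the same five invented patients are indistinguishable once cut at 0.5, even though one set sits barely on the right side of the cutoff and the other is far from it. All values and function names are made up for illustration.</p>
<pre class="r"><code># Hypothetical outcomes and predicted probabilities from two models
y  <- c(0, 0, 1, 1, 1)
p1 <- c(0.45, 0.40, 0.55, 0.60, 0.51)   # barely on the right side of 0.5
p2 <- c(0.05, 0.10, 0.90, 0.95, 0.85)   # far from 0.5

accuracy <- function(p, y) mean((p >= 0.5) == y)   # after dichotomizing at 0.5
brier    <- function(p, y) mean((p - y)^2)         # uses the probabilities themselves

c(acc1=accuracy(p1, y), acc2=accuracy(p2, y))      # both 1
round(c(brier1=brier(p1, y), brier2=brier(p2, y)), 4)
# brier1 0.1930, brier2 0.0095</code></pre>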
<p>And vice versa: just as probabilistic prediction models are coerced into classification models, classification models are frequently misconstrued as prediction models. The confusion is that classification (a selection of alternative states) can be tantamount to a decision (a choice among alternative actions) using the data alone without incorporating subject-specific utilities (see <a href="http://fharrell.com/post/classification/">Classification vs. Prediction</a>). The presumption comes from the fallacious view that ultimately end users need to make a binary decision, so binary classification is needed. Optimum decisions require making full use of available data, developing expectations quantitatively expressed as individual probabilities on a continuous scale, and applying an independently derived loss/utility/cost function to make a decision that minimizes expected loss or maximizes expected utility. Different end users have different utility functions, which leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function, one only implicit in the classification system (though one wouldn’t know it from the literature). The author of a paper is not in possession of the real utilities for patients.</p>
<p>For all applications it is well to distinguish and clearly differentiate prediction and classification. Formally, for ML, classification is using labels you have for data in hand to correctly label new data. This is feature recognition and class or category attribution. Strictly understood, it is about <em>identification</em>, and not about stochastic outcomes. Classification is best used with nonstochastic mechanistic or deterministic processes that yield outcomes that occur frequently. Classification should be used when outcomes are inherently distinct and predictors are strong enough to provide, for all subjects, a probability closely approximating 1.0 for one of the outcomes. A classification does not account well for gray zones. Classification techniques are appropriate in situations in which there is a known gold standard and replicate observations with approximately the same result each time, for instance in pattern recognition (e.g., optical character recognition algorithms, etc.). In such situations the processes generating the data are primarily nonstochastic, with high signal:noise ratios.</p>
<p>Classification is frequently not strictly understood (for the source of wisdom inspiring this thesis, see <a href="http://fharrell.com/post/statml/">Road Map for Choosing Between Statistical Modeling and Machine Learning</a>). Classification is inferior to probability modeling for driving the development of a predictive instrument. It is better to use the full information in the data to develop a probability model and to preserve gray zones. In ML, classification methods are frequently employed for ersatz prediction, or in ways that conflate prediction and decision making, which generates more confusion.</p>
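<p>The role of the loss function can be sketched in a few lines. The example below assumes a deliberately simple setting with two actions and two losses; the function <code>decide</code> and the loss values are hypothetical. The same predicted probability leads to different decisions for users whose losses differ.</p>
<pre class="r"><code># Choose the action with the smaller expected loss for disease probability p
# loss_fn: loss from waiting when disease is present (hypothetical)
# loss_fp: loss from acting when disease is absent (hypothetical)
decide <- function(p, loss_fn, loss_fp) {
  loss_act  <- (1 - p) * loss_fp
  loss_wait <- p * loss_fn
  ifelse(loss_wait > loss_act, 'act', 'wait')
}

p <- 0.15                          # one patient's predicted probability
decide(p, loss_fn=10, loss_fp=1)   # 'act':  risk threshold is 1/11
decide(p, loss_fn=2,  loss_fp=1)   # 'wait': risk threshold is 1/3</code></pre>
<p>Under these assumptions the risk threshold for action is <code>loss_fp / (loss_fp + loss_fn)</code>, so a single cutoff built into a classifier cannot suit both users.</p>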
<h2 id="discrimination">Discrimination</h2>
<p>The AUROC (or its equivalent for the case of binary response variable, the c-index or c-statistic) is conventionally employed as a measure of the discrimination capacity of a model: the ability to correctly classify observations into categories of interest. Setting aside the question of the appropriateness of classification-focused measures (sensitivity, specificity and their summary in the ROC) of performance for prediction models, I speculate that the AUROC and the c-statistic do not really reflect what people generally think they do. And here again, nuances and behavioral economics (inconsistencies in perceptions, cognition, behavior and logic) are pertinent.</p>
<p>Discrimination literally indicates the ability to identify a meaningful difference between things and connotes the ability to put observations into groups correctly. As applied to a prediction model the area under the ROC curve or the c-statistic, however, is based on the <em>ranks</em> of the predicted probabilities and compares these ranks between observations in the classes of interest. The AUC is closely related to the Mann–Whitney U, which tests whether positives are ranked higher than negatives, and to the Wilcoxon rank-sum statistic. Because this is a rank-based statistic, the area under the curve is the probability that a randomly chosen subject from one outcome group will have a higher score than a randomly chosen subject from the other outcome group, and that is all.</p>
<p>In health care, discrimination is frequently concerned with the ability of a test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups by some gold standard of pathology. To realize the AUROC measure, you randomly pick one patient from the disease group and one from the non-disease group and perform the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which the test correctly rank orders the test measures for the two patients in the random pair. It is something like accuracy, but <em>not</em> accuracy.</p>
<p>Also frequently important in health care is the ability of a model to prognosticate something like death. For the binary logistic prediction model the area under the curve is the probability that a randomly chosen deceased patient will have a higher estimated probability of death than a randomly chosen survivor. This is only a corollary of what is desirable for evaluation of the performance of a prediction model: a measure of quantitative absolute agreement between observed and predicted mortality. The probability of correctly ranking a pair seems of secondary interest.</p>
<p>If you develop a model indicating that I am likely to develop a cancer, and you tell me that you assert this because the model has an AUROC of 0.9, you really have only told me something about the expected relative ranking of my predicted value; that on average people who do not go on to develop the cancer tend to have a lower predicted value. This seems like something less than strong inference. Whether the absolute risks are 0.19 vs. 0.17, or 0.9 vs. 0.2 does not enter into the information.</p>
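<p>The rank-based nature of the statistic is easy to verify by hand. The following minimal sketch, using invented predictions and outcomes, computes the area under the curve twice: once directly as the proportion of (event, non-event) pairs that are ordered correctly, and once from the Wilcoxon rank-sum statistic. Variable names are illustrative only.</p>
<pre class="r"><code># Hypothetical predicted probabilities and observed binary outcomes
p <- c(0.10, 0.35, 0.40, 0.80, 0.20, 0.70, 0.55, 0.90)
y <- c(0,    0,    1,    1,    0,    1,    0,    1)

# Direct definition: proportion of (event, non-event) pairs ranked correctly,
# counting ties as 1/2
d <- outer(p[y == 1], p[y == 0], '-')
c_pairs <- mean((d > 0) + 0.5 * (d == 0))

# Same quantity from the Wilcoxon rank-sum (Mann-Whitney U) statistic
n1 <- sum(y == 1)
n0 <- sum(y == 0)
c_rank <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(c_pairs=c_pairs, c_rank=c_rank, Dxy=2 * (c_pairs - 0.5))
# c_pairs 0.9375, c_rank 0.9375, Dxy 0.875</code></pre>
<p>Only the ordering of <code>p</code> enters either calculation; the predicted probabilities themselves never do.</p>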
<p>Various other rigorous interpretations of the AUC include, the average value of sensitivity for all possible values of specificity, and vice versa (<a href="https://www.wiley.com/enus/Statistical+Methods+in+Diagnostic+Medicine%2C+2nd+Editionp9780470183144" target="_blank">Zhou, Obuchowski, McClish, 2011</a>); or the probability that a randomly selected subject with the condition has a test result indicating greater suspicion than that of a randomly chosen subject without the condition (<a href="https://pubs.rsna.org/doi/10.1148/radiology.143.1.7063747?url_ver=Z39.882003&rfr_id=ori%3Arid%3Acrossref.org&rfr_dat=cr_pub%3Dpubmed" target="_blank">Hanley & McNeil, 1982</a>).</p>
<p>I would bet that there is a substantial disconnect between all these rigorous formal interpretations of the AUROC and how the audience for reported evaluations of prediction models thinks about what it means. I suspect that we interpret (confabulate) the AUROC or cstatistic as something like a fancy <em>Proportion Classified Correctly</em> accuracy measure or an $R^2$. We read more into it than there is. But the area under the curve is the probability that a randomly chosen subject from one outcome group will have a higher score than a randomly chosen subject form the other outcome group, nothing more. I do not feel that tells us enough.</p>
<p>And there are various ways we may be misled by this measure (<a href="http://circ.ahajournals.org/content/115/7/928.long" target="_blank">Cook, 2007</a>; <a href="https://www2.unil.ch/biomapper/Download/LoboGloEcoBioGeo2007.pdf" target="_blank">Lobo, 2007</a>). As the ROC curve does not use the estimated probabilities themselves, only ranks, it may be insensitive to absolute differences in predicted probabilities. Hence, a well discriminating model can have poor calibration. And perfect calibration is possible with poor discrimination when the range of predicted probabilities is small (as with a homogeneous population casemix), as discrimination is sensitive to the variance in the predictor variables. Overfitted models can show both poor discrimination and calibration when validated in new patients. Inferential tests for comparing AUROC are problematic (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3617074/" target="_blank">Seshan, 2013</a>), and other disadvantages with the AUROC are noted (<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4356897/" target="_blank">Halligan, 2015</a>). For various reasons, the AUROC and the cindex or cstatistic are problematic and of limited value for comparing among tests or models, though unfortunately, still widely used for such.</p>
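<p>The point that good discrimination can coexist with poor calibration can be illustrated with a toy example. In the sketch below, which uses invented data and a hypothetical helper <code>auroc</code>, dividing every predicted risk by ten leaves the ordering, and therefore the AUROC, untouched, while the predicted risks become far too small on average.</p>
<pre class="r"><code># AUROC as the proportion of correctly ordered (event, non-event) pairs
auroc <- function(p, y) {
  d <- outer(p[y == 1], p[y == 0], '-')
  mean((d > 0) + 0.5 * (d == 0))
}

# Hypothetical outcomes and predicted risks
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
p <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9)
p_shrunk <- p / 10    # same ordering, risks ten times too small

c(auroc(p, y), auroc(p_shrunk, y))     # 0.9375 0.9375
c(mean(y), mean(p), mean(p_shrunk))    # 0.5000 0.4750 0.0475</code></pre>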
<p>There are no single model performance measures that are at once simple, intuitive, correct, and foolproof. Again, the AUROC has a nice technical “truthiness” quality; but this may be a chimera. As an assessment of performance for a prediction model there may be less there than meets the eye; and we too often tend to see only what we want to see. Again, nuances in statistical thinking combined with the complexities of human psychology are conditions conducive to confusion.</p>
<p>I feel we use discrimination measures somewhat indiscriminately, much as p-values are generally misinterpreted (hypothetical frequencies of data patterns under an arbitrary assumed statistical model interpreted as hypothesis probabilities). And like other forms of discrimination, because of circular reasoning the AUROC tends to confirm our biases or reinforce our prejudices about the merits of a model. For purposes of evaluating prediction models, discrimination may better be represented by the distributions of predicted probabilities rather than a facile single statistic that is a derivative of sensitivity and specificity. And the AUROC should not be the only criterion in assessment of model performance. Although the AUROC may be useful and sufficient under some circumstances (and even then, perhaps less well and often than is generally thought), the evaluation of prognostic models should not rely solely on the ROC curve, but should assess both discrimination and calibration, perhaps with a much greater emphasis on calibration.</p>
<h2 id="calibration">Calibration</h2>
<p>A prediction generator is optimized when the forecasted probabilities are identical to the true outcome probabilities. Calibration is assessed by directly comparing the model output with the corresponding measured values. The distance between the predicted outcomes and actual outcomes is central to quantifying overall performance for prediction models. From a statistical perspective, what we want for confidence in a model is a measure reflecting uniformly negligible absolute distances between predicted and observed outcomes.</p>
<p>A continuous accuracy scoring rule is a metric that makes full use of the entire range of predicted probabilities and does not have a large categorical jump because of an arbitrary threshold marking an infinitesimal change in a predicted probability.</p>
<p>For the purposes of comparing predicted and observed outcomes, instead of transforming predicted probabilities to the discrete scale with dichotomization at an arbitrary threshold such as 0.5 (as is done for the categorical sensitivity and specificity-based accuracy measures), the observed outcomes are transformed to the probability scale. The discrete (0,1) observed outcomes are mapped with fidelity on the continuous probability scale using nonparametric smoothers. High-resolution calibration or calibration-in-the-small assesses the absolute forecast accuracy of predictions at granular levels. When the calibration curve is linear, this can be summarized by the calibration slope and intercept. A more general visualization approach uses a loess smoother or spline function to estimate and illustrate the calibration curve. With such visualizations, lack of fit for individuals in any part of the range becomes evident. Several continuous summary metrics for calibration are available, such as the various descriptive statistics for calibration error, the logarithmic proper scoring rule (average log-likelihood), or the Brier score.</p>
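<p>The calibration tools named above can be sketched in a few lines of base R. This is an illustrative simulation under stated assumptions, not a recommended workflow: the data are simulated, the predictions <code>p_hat</code> are deliberately made overconfident by doubling the true linear predictor, and <code>lowess</code> stands in for whichever nonparametric smoother one prefers.</p>
<pre><code class="languager">set.seed(2)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))   # outcomes from the true model
p_hat <- plogis(2 * (-0.5 + x))       # hypothetical overconfident predictions

# Smooth calibration curve: binary outcomes smoothed against predictions
cal <- lowess(p_hat, y, iter = 0)     # iter = 0 turns off outlier down-weighting
plot(cal, type = "l", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted probability", ylab = "Smoothed observed proportion")
abline(0, 1, lty = 2)                 # line of perfect calibration

# Calibration intercept and slope on the logit scale
lp <- qlogis(p_hat)
cal_fit <- glm(y ~ lp, family = binomial)
coef(cal_fit)

# Continuous summary metrics
brier     <- mean((p_hat - y)^2)
log_score <- mean(y * log(p_hat) + (1 - y) * log(1 - p_hat))
c(brier = brier, log_score = log_score)
</code></pre>
<p>Because the predictions were built by doubling the true linear predictor, the estimated calibration slope should come out near 0.5, the signature of predictions that are too extreme, and the smoothed curve should be flatter than the diagonal.</p>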
<p>ML needs to be more circumspect about the methods employed in assessment of predictive performance and driving tuning and feature selection for models.</p>
<h2 id="ethicsandaccuracyvalidation">Ethics and Accuracy Validation</h2>
<p>Artificial intelligence (AI) and ML are widely conceived as programs that learn from data to perform a task or make a decision automatically. Some aspects of AI/ML in health care may be different from AI/ML applications in other domains because of the nature of health care decision-making and the ultimate role of medical ethics. These differences are exposed in consideration of classification vs. prediction and how models are validated.</p>
<p>Much technical acumen is applied to developing systems such as those that can determine whether a potential customer will click a button and purchase something from a set of advertised offerings. Whether framed as a classification or a prediction, the negligible particular costs and risks involved in most commercial applications do not drive careful consideration of methodology in the same way that some health care activities might. As the power and appeal of ML in many fields leads ineluctably to technology transfer to health care, the ethics of technology usage involve examination of issues that are not always apparent or immediate.</p>
<p>From a medical ethics perspective, fundamental principles for consideration include regard for a trusting patient–physician relationship emphasizing beneficence (concern for the patient’s best interests, and the benefits the patient may derive from testing or a prognostication) and respect for autonomy (an appreciation that patients make choices about their medical care). Ethical issues of nonmaleficence (using tests when the consequences of the test are uncertain) and justice also arise in providing health care. AI/ML will eventually have to address where and how decisions are made and the locus of responsibility in health care. The ethical principles of respect for autonomy, beneficence, and justice in health care should guide ML methods development in many cases. Concerns about classification vs. prediction for decision-making, and calibration for model evaluation will eventually impose themselves.</p>
<p>Beyond the ML methods themselves, any bias that exists in the health system may also be represented in the EHR data, and an EHR-ML algorithm may perpetuate that bias in future medical decisions. The ethics of autonomy and justice may be served when the <em>data-model system</em> informing care is transparent and the evidence and reasoning supporting a clinical decision are available for scrutiny by all stakeholders, especially the patient-provider partnership. Where data and models are inscrutable, potential ethical conflicts may emerge for patient autonomy and raise questions concerning the locus of responsibility in clinical decision making.</p>
<p>Model validation and accuracy are germane to how ML will ultimately help provide better care to patients. Proportion classified correctly, sensitivity, specificity, precision, and recall are all improper accuracy scoring rules for prediction and should not play a role when individual patient risk estimation is the real goal. The credibility of an ML tool for individual-level decision making will require assessment with calibration. For informing patients and medical decision making, a reassuring calibration should be a primary requirement. Emphasizing accurate calibration over discrimination serves the fundamental medical ethics of respecting patient individuality and autonomy, and moreover helps to optimize decision making under uncertainty.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Received wisdom or conventional and prevalent practice is not always a useful guide. ML and deep learning may well provide accurate predictions, but with the accuracy measures typically reported we still do not really know. The AUROC is a reasonable initial screen for exploratory propositions about data-algorithm systems. But without calibration expressed across the full range of outcome probabilities in the population of interest we will not know how reliable the probability-based predictions will ultimately be, no matter how they were generated (no matter how big the data or technical the algorithm). We should also be generally wary of inappropriate applications of classification, and of classification impersonating prediction, and other necessary distinctions.</p>
<p>Applications of Big Data and Data Science to health care will likely follow a path similar to that of other sectors (e.g., finance, marketing) and of other trends as they are adopted for commercial purposes. This process includes innovation to expose, create or capture value with high initial expectations; recalibration as the gap between the original aspirations and the reality of discovered limitations is understood; consolidation and coalescing around new practices and standards; and, finally maturation in understanding how best to use this new resource; ultimately leading toward routine productive use in commercialization, operations and decision making. This is the natural history of innovation. And much time and fortune is lost in the interval between the peak of expectations and the plateau of productivity (see <a href="https://en.wikipedia.org/wiki/Hype_cycle" target="_blank">Gartner Hype Cycle; Technology Hype</a>).</p>
<p>It <em>is</em> possible to ‘bend the curve’ of this cycle by leveraging knowledge and good strategy. There will be a <a href="https://en.wikipedia.org/wiki/John_Henry_(folklore)" target="_blank">“John Henry”</a> moment between ML and conventional prediction modeling. The comparisons made to date are specious because the availability of high quality conventional prediction models employing modern applied methods (<a href="https://link.springer.com/book/10.1007%2F9783319194257" target="_blank">Harrell, 2015</a>; <a href="https://www.springer.com/us/book/9780387772431" target="_blank">Steyerberg, 2009</a>) by experienced analysts is very, very limited (e.g., see <a href="https://www.ncbi.nlm.nih.gov/pubmed/19628409" target="_blank">Cui, 2009</a>), and the use of AUROC in comparison unsuitable. Building an ML algorithm that includes rigorous evaluation against the best alternative is highly valuable for bending the curve, as it will sharpen thinking, forfend surprise and obviate criticism. Such comparisons will eventually be compelled by issues concerning medical ethics.</p>
<p>For AI and ML to ultimately help clinicians provide better care to patients, the technical issues in the performance metrics chosen for ML evaluation will eventually prove to be critically important. To develop AI/ML that delivers better care for patients will require rigorous thinking about what are often complex and nuanced issues, and require deep understanding of health data and the various forms of predictive and evaluative models. Understanding of what is real “information” (as inputs to, and outputs from, algorithms), including its quality, its value and usefulness, will not come automatically or easily. The path may be fraught with abstruse or inconvenient truths. Vigilance for our own human biases that make us see only what we want to see and keep us from thinking critically and acting consistently will help us navigate. The destination, though, is better care for patients.</p>

Data Methods Discussion Site
http://fharrell.com/post/disc/
Tue, 19 Jun 2018 00:00:00 +0000
http://fharrell.com/post/disc/
<h2>Table of Contents</h2>
<nav id="TableOfContents">
<ul>
<li><a href="#background">Background</a></li>
<li><a href="#plan">Plan</a></li>
<li><a href="#topics">Topics</a></li>
<li><a href="#roleoftwitter">Role of Twitter</a></li>
<li><a href="#linkstoresources">Links to Resources</a></li>
<li><a href="#discourseinformation">Discourse Information</a>
<ul>
<li><a href="#waystocreateanaccount">Ways to Create an Account</a></li>
</ul></li>
</ul>
</nav>
<h1 id="background">Background</h1>
<p>I have learned more from Twitter than I ever thought possible, from those I follow and from my followers. Quick pointers to useful resources have been invaluable. I have also gotten involved in longer discussions. Some of those, particularly those related to design and interpretation of newly published studies (especially randomized clinical trials — RCTs), have gotten very involved and controversial. Twitter is not designed for in-depth discourse, and I soon lose track of the discussion and others’ previous points. This is particularly true if I’m away from a discussion for more than 24 hours. Also, some Twitter discussions would have been more civil had there been a moderator.</p>
<p>There are excellent discussion boards related to statistics, e.g. <a href="http://stats.stackexchange.com" target="_blank">stats.stackexchange.com</a>, <a href="https://groups.google.com/forum/#!forum/medstats" target="_blank">medstats</a>, and <a href="http://talkstats.com" target="_blank">talkstats</a>, and a variety of sites related to medical research (including clinical trials), epidemiology, and machine learning. An informal <a href="https://twitter.com/f2harrell/status/989486563947098112" target="_blank">Twitter poll</a> was conducted from 2018-04-26 to 2018-04-27, resulting in 242 responses from those in my Twitter sphere. Of those, 0.71 were in favor of creating a new site vs. 0.29 who wanted to solely use Twitter for discussions on the intended topics.</p>
<h1 id="plan">Plan</h1>
<p>After much research, I’ve chosen <a href="http://discourse.org" target="_blank">discourse.org</a> for the platform for a new discussion board. This will require putting up a server to host the site. Fortunately all the software needed (linux, ruby, discourse, etc.) is free. After the site is up and running, more moderators will be required. The site name will be <code>datamethods.org</code>. We hope to have it running in July 2018.</p>
<p>Because communication and collaboration between quantitative experts and clinical/translational researchers is a key function of the <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5263220" target="_blank">Biostatistics Epidemiology and Research Design</a> (BERD) program of the national <a href="https://ncats.nih.gov/ctsa" target="_blank">Clinical and Translational Science Award</a> program of NIH/NCATS, the Vanderbilt BERD program will support this discussion site under its CTSA-funded <a href="https://victr.vanderbilt.edu" target="_blank">VICTR</a> center, and the national CTSA BERD consortium is likely to also be involved. This has the potential to bring dozens of experienced statisticians and epidemiologists to the table to assist clinical and translational investigators and research consumers with their study design, analysis, and interpretation questions.</p>
<p>discourse.org recognizes participation and helpfulness. A good example may be found <a href="http://discourse.mcstan.org" target="_blank">here</a>. The software also makes it very easy to find your place in a large number of discussions, and to upvote answers to questions.</p>
<h1 id="topics">Topics</h1>
<p>The areas that will be emphasized in the new discussion site follow. Global emphasis is on fostering communication between quantitatively skilled persons and researchers not specializing in the math side of things.</p>
<ul>
<li>quantitative methods in general, including enhancing numeracy of those participants who are not into math or statistics</li>
<li>general statistical issues such as analysis of change scores and categorization of continuous variables</li>
<li>measurement issues</li>
<li>interpretation of published statistical analyses</li>
<li>statistical design of particular studies/clinical trials</li>
<li>statistical analysis issues in published biomedical and epidemiologic research papers</li>
<li>choosing optimal statistical graphics for presenting study results</li>
<li>discussing statistical models and machine learning for biomedical and epidemiologic problems</li>
</ul>
<p>The site will be organized into the following major and minor categories, with lots of tags available to further distinguish and cross-reference topics. <b>Your input about these categories is needed.</b> Words in brackets are the topic names used within the web site, if different from the description.</p>
<ul>
<li>statistics and data analysis [data analysis]
<ul>
<li>descriptive and exploratory [descriptive]</li>
<li>formal analysis [formal], including inference, confidence limits, statistical tests</li>
<li>uncertainty</li>
<li>missing data and measurement error [data problems]</li>
<li>models</li>
<li>modeling strategies (including model specification, nonlinearities, interactions and heterogeneity of treatment effect, avoiding categorization, how to sequence multiple steps) [modeling strategy]</li>
<li>variable selection</li>
<li>data reduction, clustering, unsupervised learning [data reduction]</li>
<li>accuracy and information measures [accuracy]</li>
<li>model validation and interpretation [validation]</li>
<li>bayes</li>
<li>machine learning</li>
<li>comparative methods performance [comparative methods]</li>
<li>causal inference</li>
<li>probability</li>
<li>inference and generalizability [generalization]</li>
</ul></li>
<li>research methods
<ul>
<li>measurement</li>
<li>study and experimental design [design]</li>
<li>sample survey design [sample survey]</li>
<li>study interpretation</li>
<li>metaanalysis</li>
<li>research data management [data management]</li>
</ul></li>
<li>treatment comparison
<ul>
<li>general issues and ethics [general]</li>
<li>design issues for a specific trial [design]</li>
<li>statistical interpretation of specific studies [interpret]</li>
<li>methods for comparison of nonrandomized treatments [observational] including propensity scores and instrumental variables approaches</li>
</ul></li>
<li>computing
<ul>
<li>systems</li>
<li>tools</li>
<li>languages</li>
<li>web applications</li>
<li>databases</li>
<li>big data</li>
</ul></li>
<li>graphics
<ul>
<li>static</li>
<li>dynamic</li>
<li>programming</li>
</ul></li>
<li>education
<ul>
<li>teaching and learning methods [teaching learning]</li>
<li>statistical [stat]</li>
<li>math and numeracy [math numeracy]</li>
<li>scientific method & design [scientific method]</li>
<li>collaboration</li>
<li>knowledge dissemination [dissemination]</li>
<li>career development</li>
</ul></li>
<li>topical areas
<ul>
<li>diagnosis</li>
<li>biomarker research</li>
<li>cardiology</li>
<li>cancer</li>
<li>psychology</li>
<li>nutrition</li>
<li>epidemiology</li>
<li>health policy</li>
<li>drug development</li>
</ul></li>
<li>journal articles
<ul>
<li>journal club</li>
<li>reviews</li>
</ul></li>
<li>news (courses, webcasts, meetings, etc.)</li>
<li>meta (discussion, pointers, etiquette about the site)</li>
</ul>
<p>More major categories can be added as needed. As alluded to above, a number (by default limited to 5) of tags can be added to any post to allow easy search and cross-referencing beyond just using categories and subcategories. Once a user gains a certain “trust level” she can define new tags in the system for everyone. Initially defined tags are <code>propensity</code> and <code>datareduction</code>. <code>discourse</code> makes it easy to construct URLs that anyone can share, that can bring up all posts in a certain category or containing a certain tag.</p>
<p>To discuss this proposal, post a tweet mentioning @f2harrell, or use the commenting facility at the end of this post.</p>
<h1 id="roleoftwitter">Role of Twitter</h1>
<p>Informed by <a href="https://blog.discourse.org/2018/04/effectivelyusingdiscoursetogetherwithgroupchat" target="_blank">this</a>, my opinion is that <code>twitter</code> is good at the following things:</p>
<ul>
<li>short-term discussions that are not emotional
<ul>
<li>Short messages</li>
<li>Discussions needing a “memory” are best done using a modern discussion board platform</li>
<li>Those that are emotional need moderators to flag inappropriate content (including personal attacks) for review and fixing, or to delete content</li>
</ul></li>
<li>pointers to other resources including longer discussions on discussion boards</li>
<li>requests for interested persons to join the longer discussions elsewhere</li>
<li>quick polls</li>
<li>news alerts</li>
</ul>
<p>Twitter is time-oriented, but discussion boards are topic-oriented (then time-ordered within topic/subtopic). Discussion boards such as <code>discourse.org</code> have a permanent memory and do not have any real message length limits.</p>
<h1 id="linkstoresources">Links to Resources</h1>
<ul>
<li><a href="http://discourse.mcstan.org/faq" target="_blank">discourse.org civility guidelines</a></li>
<li><a href="https://blog.discourse.org/2018/04/effectivelyusingdiscoursetogetherwithgroupchat" target="_blank">Using Discourse effectively with group chat</a></li>
</ul>
<h1 id="discourseinformation">Discourse Information</h1>
<h2 id="waystocreateanaccount">Ways to Create an Account</h2>
<ul>
<li>Link to your Google, Twitter, or Github account</li>
<li>Soon to be added: authentication by Facebook and Yahoo</li>
</ul>

Viewpoints on Heterogeneity of Treatment Effect and Precision Medicine
http://fharrell.com/post/hteview/
Mon, 04 Jun 2018 00:00:00 +0000
http://fharrell.com/post/hteview/
<p class="rquote">
To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. 'Additive at the point of analysis but relevant at the point of application' should be the motto. <br>
— Stephen Senn, <a href="http://errorstatistics.com/2013/04/19/stephensennwhenrelevanceisirrelevant">When Relevance is Irrelevant</a>
<br><br>
The simple idea of risk magnification has more potential to improve medical decision making and cut costs than "omics" precision medicine approaches. Risk magnification uses standard statistical tools and standard clinical variables. Maybe it's not sexy enough or expensive enough to catch on.
</p>
<h2 id="notesonthepcorituftpacemeeting">Notes on the PCORI/Tufts PACE Meeting</h2>
<p>This is a reflection on what I heard and didn’t hear at the 2018-05-31 meeting <a href="https://nam.edu/event/evidenceandtheindividualpatientunderstandingheterogeneoustreatmenteffectsforpatientcenteredcare" target="_blank">Evidence and the Individual Patient: Understanding Heterogeneous Treatment Effects (HTE) for Patient-Centered Care</a>, sponsored by <a href="https://pcori.org" target="_blank">PCORI</a>, the Tufts University <a href="https://www.tuftsmedicalcenter.org/ResearchClinicalTrials/InstitutesCentersLabs/InstituteforClinicalResearchandHealthPolicyStudies/ResearchPrograms/CenterforPredictiveAnalyticsandComparativeEffectivness.aspx" target="_blank">PACE Center</a>, and the <a href="http://nam.edu" target="_blank">National Academy of Medicine</a>. I learned a lot and thought that the meeting was well organized and very intellectually stimulating. Hats off to David Kent for being the meeting’s mastermind. Some of the high points of the meeting to me personally were</p>
<ul>
<li>Meeting critical care clinical trial guru Derek Angus for the first time<sup class="footnoteref" id="fnref:Onenitpickyco"><a rel="footnote" href="#fn:Onenitpickyco">1</a></sup></li>
<li>Being with my longtime esteemed colleague Ewout Steyerberg</li>
<li>Hearing Cecile Janssens’ sanguine description of the information yield of genetics to date</li>
<li>Listening to Michael Pencina convey a big picture understanding of predictive modeling</li>
<li>Hearing Steve Goodman’s wisdom about risk applying to individuals but only being estimable from groups</li>
<li>Seeing Patrick Heagerty’s clear description of exactly what can be learned about an individual treatment effect when a patient can undergo only one treatment, and clearly discuss individual vs. population analysis</li>
<li>Hearing inspiring stories from two patient stakeholders who happen to also be researchers</li>
<li>Getting reminded of the pharmacogenomic side of the equation from my Vanderbilt colleague Josh Peterson</li>
<li>Watching John Spertus give a spirited report about how smart clinicians in one cardiovascular treatment setting are more likely to use a treatment for patients who stand to benefit the least, from the standpoint of predicted risk</li>
<li>Watching Rodney Hayward give an even more spirited talk about how medical performance incentives often do not achieve their intended effects</li>
<li>It was gratifying to hear extreme criticism of one-variable-at-a-time subgrouping from several speakers</li>
<li>It was worrying to see some speakers dividing predicted risk into quantile groups when in fact risk is a continuous variable, and quantile interval boundaries are driven by demographics (and are arbitrarily manipulated by changing study inclusion criteria) and not by biomedicine</li>
</ul>
<h2 id="backgroundissuesinprecisionmedicine">Background Issues in Precision Medicine</h2>
<p>There are five ways I can think of achieving personalized/precision medicine, besides the physician talking and listening to the patient:</p>
<ul>
<li>Development of new diagnostic tests that contain validated, new information</li>
<li>Breakthroughs in treatments for well-defined patient subpopulations</li>
<li>Finding strong evidence of patient-specific treatment random effects using randomized crossover studies<sup class="footnoteref" id="fnref:StephenSennhas"><a rel="footnote" href="#fn:StephenSennhas">2</a></sup>, and finding actionable treatment plans once such heterogeneity of treatment effects (HTE) is demonstrated and understood</li>
<li>Finding strong evidence of interaction between treatment and patient characteristics, which I’ll call differential treatment effects (DTE)</li>
<li>Giving treatments to patients with the largest expected absolute benefit of the treatment (largest absolute risk reduction)</li>
</ul>
<p>The last approach has little to do with HTE and is mainly a mathematical issue arising from the fact that there is only room to move a probability (risk) when the patient’s risk is in the middle. Patients who are at very low risk of a clinical outcome cannot have a large absolute risk reduction. I’ll call this phenomenon <em>risk magnification</em> (RM) because the absolute risk difference is magnified by having a higher baseline risk.</p>
<p>The conference focused more on RM than on HTE. RM is the simplest and most universal approach to medical decision making, and requires the least amount of information<sup class="footnoteref" id="fnref:Atleastatthe"><a rel="footnote" href="#fn:Atleastatthe">3</a></sup>. Before discussing RM vs HTE, we must define relative and absolute treatment effects. For a continuous variable such as blood pressure that is semi-linearly related to clinical outcome (at least with regard to the normal-to-hypertensive range of blood pressure), reduction in mean blood pressure as estimated in a randomized clinical trial (RCT) is both an absolute and a relative measure. For binary and time-to-event endpoints, an absolute difference is the difference in cumulative incidence of the event at 2y, or the difference in life expectancy. A relative effect may be an odds ratio, hazard ratio, or in an accelerated failure time model, a survival time ratio.</p>
<h2 id="riskmagnification">Risk Magnification</h2>
<p>There are two stages in the understanding and implementation of RM:</p>
<ul>
<li>In an RCT, estimate the relative treatment effect and try to find evidence for constancy of this relative effect. If there is evidence for interaction on the relative scale, then the relative treatment effect is a function of patient characteristics.</li>
<li>When making patient decisions, one can easily (in most situations) convert the relative effect from the first step into an absolute risk reduction if one has an estimate of the current patient’s absolute risk. This estimate may come from the same trial that produced the relative efficacy estimate, if the RCT enrolled a sufficient variety of subjects. Or it can come from a purely observational study if that study contains a large number of subjects given usual care or some other appropriate reference set.</li>
</ul>
<p>These issues are discussed <a href="http://fharrell.com/post/ehrsrcts">here</a> and <a href="http://fharrell.com/post/rctmimic">here</a>, in <a href="https://jamanetwork.com/journals/jama/articleabstract/209767" target="_blank">Kent and Hayward’s paper</a>, and in Stephen Senn’s <a href="http://slideshare.net/StephenSenn1/realworldmodified" target="_blank">presentation</a>. An early application is in <a href="https://www.ahjonline.com/article/S00028703(97)701649/abstract" target="_blank">Califf et al</a>.</p>
<p>In most cases one can compute the absolute benefit as a function of (known or unknown) patient baseline risk using simple math, without requiring any data, once the relative efficacy is estimated. It is only at the decision point for the patient at hand that the risk estimate is needed.</p>
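<p>For the binary-endpoint case with a constant odds ratio, the “simple math” can be written out explicitly. The symbols below are introduced here for illustration: <span class="math inline">\(p\)</span> is the patient’s risk under usual care and <span class="math inline">\(\mathrm{OR}\)</span> is the treatment odds ratio, assumed to be less than 1 and constant across patients. Multiplying the baseline odds <span class="math inline">\(p/(1-p)\)</span> by <span class="math inline">\(\mathrm{OR}\)</span> and converting back to a probability gives the risk under treatment, and subtracting gives the absolute risk reduction:</p>
<p><span class="math display">\[
p_{\mathrm{treated}} = \frac{\mathrm{OR}\, p}{1 - p + \mathrm{OR}\, p},
\qquad
\mathrm{ARR}(p) = p - p_{\mathrm{treated}} = \frac{p\,(1-p)\,(1-\mathrm{OR})}{1 - p + \mathrm{OR}\, p}.
\]</span></p>
<p>The factor <span class="math inline">\(p(1-p)\)</span> makes the reduction vanish as the baseline risk approaches 0 or 1. Setting the derivative to zero shows that, under these assumptions, the reduction is largest at</p>
<p><span class="math display">\[
p^{*} = \frac{1}{1 + \sqrt{\mathrm{OR}}},
\qquad
\mathrm{ARR}(p^{*}) = \frac{1 - \sqrt{\mathrm{OR}}}{1 + \sqrt{\mathrm{OR}}},
\]</span></p>
<p>which is about 0.127 at a baseline risk of about 0.56 when <span class="math inline">\(\mathrm{OR} = 0.6\)</span>, and about 0.056 at a baseline risk of about 0.53 when <span class="math inline">\(\mathrm{OR} = 0.8\)</span>.</p>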
<p>Here is an example for a binary endpoint in which the treatment effect is given by a constant odds ratio. The graph below exemplifies two possible odds ratios: 0.8 and 0.6. One can see that the absolute risk reduction by treatment is strongly a function of baseline risk (no matter how this risk arose), and this reduction can be estimated even without a risk model, under certain assumptions.</p>
<pre><code class="languager">require(Hmisc)
</code></pre>
<pre><code class="languager">knitrSet(lang='blogdown')
plot(0, 0, type="n", xlab="Patient Risk Under Usual Care",
ylab="Absolute Risk Reduction",
xlim=c(0,1), ylim=c(0,.15))
i <- 0
or <- c(0.8, 0.6)
for(h in or) {
i <- i + 1
p <- seq(.0001, .9999, length=200)
logit <- log(p/(1 - p)) # same as qlogis(p)
logit <- logit + log(h) # modify by odds ratio
p2 <- 1/(1 + exp(-logit))# same as plogis(logit)
d <- p - p2
lines(p, d)
maxd < max(d)
smax < p[d==maxd]
text(smax, maxd + .005, paste0('OR=', format(h)), cex=.8)
}
</code></pre>
<p><img src="http://fharrell.com/post/hteview_files/figurehtml/setup1.png" width="672" /></p>
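<p>The same calculation can be applied at the decision point for individual patients. The following is a minimal sketch, assuming a constant odds ratio; the helper <code>arr</code> and the baseline risks are hypothetical and chosen only for illustration.</p>
<pre><code class="languager"># Absolute risk reduction at baseline risk p under a constant odds ratio
# (hypothetical helper, not part of Hmisc)
arr <- function(p, or) p - plogis(qlogis(p) + log(or))

baseline <- c(0.02, 0.10, 0.30, 0.60)   # illustrative patient risks under usual care
round(data.frame(baseline,
                 arr_or_0.8 = arr(baseline, 0.8),
                 arr_or_0.6 = arr(baseline, 0.6)), 4)

# Baseline risk at which the reduction peaks for OR = 0.6, found numerically
peak <- optimize(arr, interval = c(0.001, 0.999), or = 0.6, maximum = TRUE)
peak
</code></pre>
<p>With the same relative effect, the patient at 2% baseline risk gains well under one percentage point, while the patient at 30% gains several, which is the risk magnification phenomenon in tabular form.</p>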
<p>For an example analysis where the relative treatment effect varies with patient characteristics, see <a href="http://hbiostat.org/doc/bbr.pdf" target="_blank">BBR Section 13.6.2</a>.</p>
<h2 id="heterogeneityoftreatmenteffects">Heterogeneity of Treatment Effects</h2>
<p>The conference did not emphasize the underpinnings of HTE, but this article gives me an excuse to describe my beliefs about HTE. In what follows I’m referring actually to DTE because I’m assuming that estimates are based on parallel-group studies, but I’ll slip into the HTE nomenclature.</p>
<p>It is only meaningful to define HTE on a relative treatment effect scale, because otherwise HTE is always present (because of RM) and the concept of HTE becomes meaningless. A relative scale such as log odds or log relative hazard is a scale in which it is mathematically possible for the treatment effect to be constant over the whole patient spectrum<sup class="footnoteref" id="fnref:Notethatonan"><a rel="footnote" href="#fn:Notethatonan">4</a></sup>. It is only the relative scale in which treatment effectiveness differences have the potential to be related to mechanisms of action. By contrast, absolute risk reduction comes from <strong>generalized risk</strong>, and generalized risk can come from any source including advanced age, greater extent of disease, and comorbidities. Researchers frequently make the mistake of examining variation in absolute risk reduction by subgrouping, one day shouting “older patients get more benefit” and another day concluding “patients with comorbidities get more benefit”, but these are illusory. It is often the case that <strong>anything</strong> giving the patient more risk will be related to enhancing absolute treatment benefit. It is an error in labeling to attribute these effects to a specific variable<sup class="footnoteref" id="fnref:DavidKentmenti"><a rel="footnote" href="#fn:DavidKentmenti">5</a></sup>.</p>
<p>Though the PCORI/Tufts meeting did not intend to cover the following topic, it would be useful at some point to have in-depth discussions about HTE/DTE, to address at least two general points:</p>
<ul>
<li>Which sorts of treatment/disease combinations should be selected for examining HTE?</li>
<li>What happens when we quantify the outcome variation explained by HTE?</li>
</ul>
<p>On the first point, I submit that the situations having the most promise for finding and validating HTE/DTE are trials in which the average treatment effect is large (and is in the right direction). It is tempting to try to find HTE in a trial with a small overall difference, but there are two problems in doing so. First, the statistical signal or information content of the data is unlikely to be sufficient to estimate differential effects<sup class="footnoteref" id="fnref:Underthebesto"><a rel="footnote" href="#fn:Underthebesto">6</a></sup>. Second, to say that HTE exists when the average treatment effect is close to zero implies that there must exist patient subgroups where the treatment does significant harm to the patients. The plausibility of this assumption should be questioned.</p>
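<p>The factor of four in the footnote can be checked with simple variance arithmetic. This sketch assumes the most favorable setup: a continuous outcome with common variance, equal allocation to two arms, and a binary patient characteristic that splits each arm in half. All names and numbers are illustrative.</p>

```python
def var_average_effect(sigma2, n_total):
    """Variance of a difference in means with n_total/2 patients per arm."""
    per_arm = n_total / 2
    return sigma2 / per_arm + sigma2 / per_arm


def var_interaction(sigma2, n_total):
    """Variance of a difference of differences with n_total/4 patients per cell."""
    per_cell = n_total / 4
    return 4 * sigma2 / per_cell


sigma2, n = 1.0, 400  # hypothetical outcome variance and total sample size
ratio = var_interaction(sigma2, n) / var_average_effect(sigma2, n)
print(f"variance ratio, interaction vs. average effect: {ratio:.1f}")
```

<p>Because variance is inversely proportional to sample size, a variance that is four times larger means four times as many patients are needed to estimate the interaction with the same precision as the average effect.</p>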
<p>On the second point, about quantification of nonconstancy of relative treatment effect, a very fruitful area of research could involve developing strategies for “proof of concept” studies of DTE that parallel how principal component analysis has been used in gene microarray and GWAS research to show that a possible predictive signal exists in the genome. This same approach could be used to quantify the signal strength for differential treatment effects by patient characteristics. This would address a common problem: factors that potentially interact with treatment can be correlated, diminishing statistical power of individual interaction tests. By reducing a large number of potential interaction factors to several principal components (or other summary scores) and getting a “chunk test” for the joint interaction influence of those variables with treatment, one could show that something is there without spending statistical power “naming names.”</p>
<p>This relates to what I perceive is a major need in HTE research: to quantify the amount of patient outcome variation that is explained by treatment interactions in comparison to the variation explained by just using an additive model that includes treatment and a long list of covariates. A powerful index for quantifying such things is the “adequacy index” described in the maximum likelihood estimation chapter in <em><a href="http://fharrell.com/links">Regression Modeling Strategies</a></em>. This index answers the question “what proportion of the explainable outcome variation as measured by the model likelihood ratio chi-square statistic is explainable by ignoring all the interaction effects?” One minus this is the fraction of predictive information provided by DTE. In my experience, the outcome variation explained by main effects swamps that explained by adding interaction effects to models. I predict that clinical researchers will be surprised how little differential treatment effects matter when compared to outcomes associated with patient risk factors, and when compared to RM.</p>
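<p>As a rough sketch of the adequacy index calculation, the following takes the maximized log-likelihoods of three nested models (intercept only, additive, and additive plus treatment interactions) and forms the ratio of likelihood ratio chi-squares. The log-likelihood values are made up for illustration and the function names are mine, not those of any package.</p>

```python
def lr_chi_square(loglik_model, loglik_null):
    """Model likelihood ratio chi-square: twice the gain in log-likelihood."""
    return 2.0 * (loglik_model - loglik_null)


def adequacy_index(loglik_null, loglik_additive, loglik_full):
    """Share of the full model's LR chi-square achieved without interactions."""
    lr_additive = lr_chi_square(loglik_additive, loglik_null)
    lr_full = lr_chi_square(loglik_full, loglik_null)
    return lr_additive / lr_full


# Made-up log-likelihoods for illustration only (not from any real trial)
a = adequacy_index(loglik_null=-1000.0, loglik_additive=-900.0, loglik_full=-895.0)
print(f"adequacy of additive model: {a:.3f}")
print(f"fraction attributable to interactions: {1 - a:.3f}")
```

<p>With these invented numbers the additive model captures about 95% of the explainable variation, leaving about 5% to the interaction terms.</p>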
<p>My suggestions for developing statistical analysis plans for testing and estimating DTE/HTE are in <a href="http://hbiostat.org/doc/bbr.pdf" target="_blank">BBR Section 13.6</a>.</p>
<h2 id="averagesvscustomizedestimates">Averages vs. Customized Estimates</h2>
<p class="rquote">
Advocates of precision medicine are required to show that customized treatment selection results in better patient outcomes than optimizing average outcomes.
</p>
<p>An unspoken issue occurred to me during the meeting. We need to be talking much more about mean squared error (MSE) of estimates of individualized treatment effects. MSE equals the variance of an estimate plus the square of the estimate’s bias. Variance is reduced by increasing the sample size or by being able to explain more outcome variation (having a higher signal:noise ratio). Bias can come from a problematic study design that misestimated the average treatment effect, or from assuming that the effect for the patient at hand is the same as the average relative treatment effect when in fact the treatment effect interacted with one or more patient characteristics. But when one allows for interactions, the variance of estimates increases substantially (especially for patient types that are not well represented in the sample). So interaction effects must be fairly large for it to be worthwhile to include these effects in the model, i.e., for MSE to be reduced, which requires the square of bias to decrease by more than the variance increases.</p>
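<p>The bias-variance tradeoff described above can be worked through in a stylized case. Assume a continuous outcome with variance <code>sigma2</code>, equal allocation, and two equally sized subgroups A and B whose true treatment effects differ by a hypothetical amount <code>delta</code>. The question is which estimate has the lower MSE for a patient in subgroup B. All values are illustrative.</p>

```python
def mse(variance, bias):
    """Mean squared error = variance + bias squared."""
    return variance + bias ** 2


def compare(sigma2, n_total, delta):
    """MSE of two estimates of the treatment effect for a patient in subgroup B.

    delta is the hypothetical difference in treatment effect between two
    equally sized subgroups A and B.
    """
    # Pooled estimate: uses all patients, but is off by delta/2 for subgroup B
    pooled = mse(variance=4 * sigma2 / n_total, bias=delta / 2)
    # Subgroup-specific estimate: unbiased, but uses only half the patients
    specific = mse(variance=8 * sigma2 / n_total, bias=0.0)
    return pooled, specific


for delta in (0.1, 0.4, 1.0):
    pooled, specific = compare(sigma2=1.0, n_total=400, delta=delta)
    better = "pooled" if specific > pooled else "subgroup-specific"
    print(f"delta={delta}: pooled MSE={pooled:.4f}, "
          f"specific MSE={specific:.4f} -> {better}")
```

<p>In this setup the subgroup-specific estimate wins only when the absolute value of <code>delta</code> exceeds four standard deviations divided by the square root of the total sample size, which is 0.2 here.</p>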
<p>To really understand HTE, one must understand how a patient disagrees with herself over time, even when the treatment doesn’t change. Stephen Senn has written extensively about this, and a new paper entitled <a href="http://www.pnas.org/content/early/2018/06/15/1711978115" target="_blank">Lack of group-to-individual generalizability is a threat to human subjects research</a> by Fisher, Medaglia, and Jeronimus is a worthwhile read. Also see this excellent article: <a href="https://academic.oup.com/ije/article/40/3/537/747708" target="_blank">Epidemiology, epigenetics and the ‘Gloomy Prospect’: embracing randomness in population health research and practice</a> by George Davey Smith.</p>
<p>In the absence of knowledge about patient-specific treatment effects, the best estimate of the relative treatment effect for an individual patient is the average relative treatment effect<sup class="footnoteref" id="fnref:Thiscanbeeasi"><a rel="footnote" href="#fn:Thiscanbeeasi">7</a></sup>. Selecting the treatment that provides the best average relative effect will be the best decision for an individual unless DTEs are large. To better personalize the decision, other than accounting for absolute risk (which is a different issue and may objectively deal with cost issues), requires abundant data on DTE.</p>
<h2 id="generalizabilityofrcts">Generalizability of RCTs</h2>
<p>At the meeting I heard a couple of comments implying that randomized trials are not generalizable to patient populations that are different from the clinical trial patients. This thought comes largely from a misunderstanding of what RCTs are intended to do, as described in detail <a href="http://fharrell.com/post/rctmimic">here</a>: to estimate relative efficacy. Even though absolute efficacy varies greatly from patient to patient due to RM, evidence for variation in relative efficacy has been minimal outside of the molecular tumor signature world.</p>
<p>The beauty of the conference concentrating on risk magnification is that RM always exists whenever risk is an issue (not so much in a pure blood pressure trial), RM is easier to deal with, and accounting for RM does not require increasing the sample size. RM does benefit from having large observational cohorts upon which to estimate risk models for computing absolute risk reduction given patient characteristics and relative treatment effects. RM does not require crossover studies, and can be actualized even without a risk model if the treating physician has a good gestalt of her patient’s outcome risk. In my view, RM should be emphasized more than HTE because of its practicality. To do RM “right” to obtain personalized estimates of absolute treatment benefit, we do need to spend more effort checking that risk models are absolutely calibrated.</p>
<h2 id="otherthingsidliketohaveheardorfurtherdiscussed">Other Things I’d Like to Have Heard or Further Discussed</h2>
<ul>
<li>There was some discussion of multiple endpoints and tradeoffs between safety and efficacy. Patient utility analysis and the use of ordinal clinical outcomes would have been a nice addition, though there’s not time for everything.</li>
<li>The ACCORD-BP trial was described as “negative”, but that was a frequentist trial so all one can say is that the trial did not amass enough information to reject the null hypothesis.</li>
<li>I heard someone mention at one point that subgroup analysis “breaks the randomization.” I don’t think that’s strictly true. It’s just that subgroup analysis is not competitive statistically, and is usually misleading because of noise, arbitrariness, and collinearities.</li>
<li>Someone mentioned tree methods, but single trees require 100,000 patients to work adequately and even then are not competitive with regression.</li>
<li>There needs to be more discussion about the choice of outcome measures in trials. DTE/HTE analysis requires high-information outcomes to have much hope; binary outcomes have low information content.</li>
<li>It may have been Fan Li and Michael Pencina who mentioned the use of penalized maximum likelihood estimation for estimating DTE (e.g., lasso, elastic net). These do not provide any statistical inference capabilities (as opposed to Bayesian penalization through skeptical priors).</li>
</ul>
<p><a class="anchor" id="evidence"></a></p>
<h2 id="evidenceforhomogeneityoftreatmenteffects">Evidence for <strong>Homo</strong>geneity of Treatment Effects</h2>
<p>For continuous outcome variables Y where the variance of measurements can be disconnected from the mean, one way to estimate the magnitude of HTE is to compare the variance in Y in the active treatment group and that in the control group. If HTE exists, it cannot affect a pure control group but should increase the variance of Y in the treatment group due to heterogeneity of treatment effect across types of subjects. Two studies have examined this issue in meta-analyses. The first, related to weight loss treatments, found “evidence is limited for the notion that there are clinically important differences in exercise-mediated weight change.” The second paper reviewed 208 studies and found evidence in the opposite direction from HTE: the average ratio of variances of Y for treated:control was 0.89.</p>
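<p>The variance comparison described above can be illustrated with simulated data. This is only a sketch under strong assumptions of my own: normally distributed outcomes, patient-specific treatment effects that are independent of the untreated outcome, and a made-up mean effect of -1. It is not the method used in either meta-analysis.</p>

```python
import random
import statistics


def variance_ratio(treated, control):
    """Ratio of sample variances of Y, treated:control."""
    return statistics.variance(treated) / statistics.variance(control)


def simulate(n, effect_sd, seed):
    """Simulate one parallel-group trial with patient-specific treatment effects.

    Each treated patient's effect is drawn from Normal(-1, effect_sd);
    effect_sd = 0 means a constant effect (no HTE on this scale).
    """
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(0.0, 1.0) + rng.gauss(-1.0, effect_sd) for _ in range(n)]
    return treated, control


for effect_sd in (0.0, 1.0):
    treated, control = simulate(n=20000, effect_sd=effect_sd, seed=1)
    print(f"effect SD {effect_sd}: variance ratio "
          f"{variance_ratio(treated, control):.2f}")
```

<p>With a constant effect the ratio stays near 1, while independent patient-to-patient variation in the effect pushes it above 1. A ratio below 1, as in the second paper, is the opposite of what this kind of HTE would produce.</p>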
<ul>
<li><a href="https://doi.org/10.1111/obr.12682" target="_blank">Inter‐individual differences in weight change following exercise interventions: a systematic review and meta‐analysis of randomized controlled trials</a> by Williamson, Atkinson, Batterham</li>
<li><a href="https://f1000research.com/articles/730" target="_blank">Does evidence support the high expectations placed in precision medicine? A bibliographic review</a> by Cortés et al.</li>
</ul>
<h4 id="footnotes">Footnotes</h4>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Onenitpickyco">One nitpicky comment about a small point in Derek’s presentation: He described an analysis in which a risk model was developed in placebo patients and applied to active arm patients. This approach has the possibility of creating a bias caused by fitting idiosyncrasies of placebo patients, in a way that may exaggerate treatment effect estimates. <a class="footnotereturn" href="#fnref:Onenitpickyco"><sup>^</sup></a></li>
<li id="fn:StephenSennhas">Stephen Senn has <a href="https://www.bmj.com/content/329/7472/966" target="_blank">shown</a> how one may estimate patient random effects representing individual response to therapy in a 6-period, 2-treatment crossover study. See also <a href="http://journals.sagepub.com/doi/abs/10.1177/0962280210379174" target="_blank">this</a>. <a class="footnotereturn" href="#fnref:StephenSennhas"><sup>^</sup></a></li>
<li id="fn:Atleastatthe">At least at the analysis phase, if not at the implementation stage. <a class="footnotereturn" href="#fnref:Atleastatthe"><sup>^</sup></a></li>
<li id="fn:Notethatonan">Note that on an additive risk scale, interactions must be present to prevent risks from going outside the legal range of [0,1]. <a class="footnotereturn" href="#fnref:Notethatonan"><sup>^</sup></a></li>
<li id="fn:DavidKentmenti">David Kent mentioned to me that he had some strong examples where <em>relative</em> treatment benefit was a function of <em>absolute</em> baseline risk. I need to know more about this. <a class="footnotereturn" href="#fnref:DavidKentmenti"><sup>^</sup></a></li>
<li id="fn:Underthebesto">Under the best of situations, the sample size needed to estimate an interaction effect is four times that needed to estimate the average treatment effect. <a class="footnotereturn" href="#fnref:Underthebesto"><sup>^</sup></a></li>
<li id="fn:Thiscanbeeasi">This can be easily translated into a customized absolute risk reduction estimate as discussed earlier. <a class="footnotereturn" href="#fnref:Thiscanbeeasi"><sup>^</sup></a></li>
</ol>
</div>

Navigating Statistical Modeling and Machine Learning
http://fharrell.com/post/statml2/
Mon, 14 May 2018 00:00:00 +0000
http://fharrell.com/post/statml2/
<p>Drew Levy<br><small><tt>drew@dogoodscience.com</tt></small><br><small><tt> <a href="http://linkedin.com/in/drewglevy" target="_blank">Linkedin:drewglevy</a> </tt></small><br><small><tt> <a href="http://www.DoGoodScience.com" target="_blank">DoGoodScience.com</a> </tt></small><br><br></p>
<p class="rquote">
... the art of data analysis is about choosing and using multiple tools.<br><a href="http://biostat.mc.vanderbilt.edu/rms"> —Regression Modeling Strategies</a>, pp. vii
</p>
<p>Frank Harrell’s post, <a href="http://fharrell.com/post/statml/">Road Map for Choosing Between Statistical Modeling and Machine Learning</a>, does us the favor of providing a contrast of statistical modeling (SM) and machine learning (ML) in terms of fundamental attributes (signal:noise and data requirements, dependence on assumptions and structure, interest in “special” parameters, accounting of uncertainties and predictive accuracy). This is a clarifying perspective. Despite the prevalent conflation of SM and ML within the rubric of ‘data science’, Frank’s post underscores that SM and ML are different in important ways and the individual considerations in this contrast should assist us in making deliberated decisions about when and how to apply one approach or another. This cogent set of criteria helps us better select tools that are fit-for-purpose and serve our particular ends with the best means. Getting clarity about what our real ends are might be the harder part.</p>
<p>To extend the analogy, the guideposts identified by Frank could be illustrated as a route map if put into the format of a series of junctures (and termini). Here is an example:</p>
<ol>
<li>Do you want to isolate the effect of special variables or have an interpretable model? If yes, turn left toward SM; if no, keep driving …</li>
<li>Is your sample size less than huge? If yes, park in the space designated “SM”; if no, …</li>
<li>Is your signal:noise low? If yes, take the ramp toward “SM”; if no, …</li>
<li>Is there interest in estimating the uncertainty in forecasts? If yes, merge into SM lane; if no, …</li>
<li>Is nonadditivity/complexity expected to be strong? If yes, gun the pedal toward ML; if no, … you can continue the journey with SM.</li>
</ol>
<p>This allegorical cartoon is simplistic: the situation is certainly much more nuanced than this. But it is more systematic thinking than is often employed (such as, ‘I have lots of data, therefore ML’). There are other maps that people could draw, and junctures to consider. The route illustrated above is intended to encourage others to plot a course thoughtfully. And the allegory is certainly narratively thin: there are surprises lurking in the landscape along the highway.</p>
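<p>For readers who prefer code to cartoons, the five junctures listed above can be sketched as a single function. This is a literal transcription of the example route and inherits all of its simplifications. The function and argument names are mine.</p>

```python
def choose_approach(isolate_effects_or_interpret,
                    sample_size_huge,
                    signal_to_noise_low,
                    want_forecast_uncertainty,
                    strong_nonadditivity_expected):
    """Walk the five junctures in order and return 'SM' or 'ML'."""
    if isolate_effects_or_interpret:   # juncture 1: turn left toward SM
        return "SM"
    if not sample_size_huge:           # juncture 2: less than huge, park at SM
        return "SM"
    if signal_to_noise_low:            # juncture 3: take the ramp toward SM
        return "SM"
    if want_forecast_uncertainty:      # juncture 4: merge into the SM lane
        return "SM"
    if strong_nonadditivity_expected:  # juncture 5: gun the pedal toward ML
        return "ML"
    return "SM"                        # otherwise continue the journey with SM


# A huge, high signal:noise problem with strong expected nonadditivity
print(choose_approach(False, True, False, False, True))
# The same problem, but with an interest in isolating a treatment effect
print(choose_approach(True, True, False, False, True))
```

<p>The ordering matters: every early exit leads to SM, so ML is reached only when all four earlier junctures have been passed.</p>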
<p>Frank’s contrast between SM and ML exposes an essential question: “who/what is actually learning?” For the most part, in ML only the machine is learning. Little or no understanding is escaping from the black box for human knowledge, and this means that ML is purely instrumental. In some ways ML is like operant conditioning, or the automatic System 1 thinking process in humans (Kahneman’s <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow" target="_blank">Thinking, Fast and Slow</a>). They take in information and result in behavioral outputs, but operate below the level of conscious awareness. The machine is largely ‘dumb’ and cannot tell you very well what it has learned; nor can it be aware of when and how it may be fundamentally wrong. While ML can serve many purposes, there are potential risks and costs associated with mechanical opacity (viz, machine trading).</p>
<p>The Scientific Method demonstrates that you can use controlled experiments and strong claims to understand causation and predict events. SMs sort of blur the boundaries between rigorous causal understanding and purely instrumental utility: they are predictive tools that are also comprehensible to humans (<a href="http://www.informationphilosopher.com/knowledge/best_explanation.html" target="_blank">Inference to the Best Explanation</a>), and the random component of the model models the sources of uncontrolled variation. ML shows that if you relax the first two requirements to weak claims, you can still predict events, but perhaps not understand them [special thanks to <a href="https://www.linkedin.com/in/garrettgrolemund49328411" target="_blank">Garrett Grolemund</a> for his thinking and language about these issues]. We have seen how ML vs. SM can be reframed as situated on some spectrum (e.g., the spectrum of human intermediation, in <a href="https://jamanetwork.com/journals/jama/articleabstract/2675024" target="_blank">Big Data and Machine Learning in Health Care</a>, by AL Beam and IS Kohane). This suggests yet another spectrum:<br>
<p class="rquote">
Experimental Science → clear causal understanding and predictions<br>
Statistical Models → understanding that holds under a set of assumptions, and supplies predictions and uncertainty estimates<br>
Machine Learning → predictions
</p>
This spectrum invites consideration of the various ways we use predictions: from corroborating or refuting theory (as in the scientific method), to calibrating fit and positing structure, to utilitarian prognostication.</p>
<p>For several medical applications a black box prediction tool would appear to be entirely suitable, such as reading pathology, predicting treatment nonadherence, or some high-complexity nonlinear systems biology problems, etc. Predicting accurately in such applications may be entirely enough, whether or not you know why the predictions are accurate. We don’t need to be mechanics to have a car get us to our destinations. In this, ML may best be construed literally as a ‘tool’ in the instrumental sense: a form of augmentation of human effective capacity. In a generic sense, AI/ML is, to date, primarily about building systems that can address a discrete and specific problem by processing enormous volumes of data and providing answers to highly structured questions in an automated way, very quickly.</p>
<p>So, it may be that one of the first forks in the road map for choosing between ML and SM should be whether you want to claim to be doing formal science or not. For the endeavor to be scientific, you have to have and empirically assess hypotheses or theories about how some aspect of the world works, which are minimal or absent in ML. If learning, in the sense of accruing knowledge about how the world works, is not a predicate of ML, however highly technical ML may be, it should not be misconstrued as scientific. Despite being a central feature of the current Data Science meme, ML should surrender any pretensions about being science. But it is a potentially highly effective technology.</p>
<p>This reasoning exposes as well an obverse issue in how SM is sometimes used in medicine. While SM provides prediction based on evaluation of specific hypotheses about nature, it is very frequently used to rationalize a simplistic heuristic approach for clinical decision making, inadvertently forsaking the full probabilistic information available for the decision. Ultimately, real-world medical decision-making is a forecast: conditional on a set of premises provided in data it is a prediction about what course of action is likely to yield the best result, especially for individual patient-level decision-making (e.g., Precision Medicine, Personalized Medicine). Traditional rigorous causal inference has led to a reductionist focus on particular independent effects and has encouraged a selective focus on a limited set of terms in the right-hand side (r.h.s.) of the equation. With SM a prevalent tendency is to focus, after adjustment, on selected variables and just use these ‘risk factors’. Frequently, just categorical classes of the selected variables are used in making decisions about care, further reducing these to heuristics for decision making, much as we tend to use p-values as facile surrogates for richer evidence. This is also similar to promoting the value of a new biomarker that in isolation provides less information than the basic clinical data available. We have a strong tendency to reduce information for decisions to singular and simple binary inputs. This is entropic dissipation of information, due largely to our stubborn preference for cognitive ease in decision making.</p>
<p>Models that make accurate predictions of responses for future observations by incorporating relevant information for decision making perform the correct calculus of integrating information, and provide correct output for informing decisions with explicit probability and uncertainty estimates (the left-hand side of the equation: l.h.s.). You will hear remarks that reflect resistance to probability-based clinical decision making: complaints that probabilities are too complex and that emphasize what physicians want or need. I think this is a misplaced objective at a fundamental level. The correct objective and focus is what leads ultimately to the best outcomes for patients. This should not be about how to make it easy for physicians; it is about finding and adopting the best process for decision making that serves the interest of patients, no matter how difficult, awkward or inconvenient for physicians. I have sympathy for clinicians: they are only human, with limited cognitive capacity (information bandwidth) like the rest of us. Because thinking consumes our limited energy, human cognition is prone to take the path of least resistance. And we are generally entirely unaware of this as we are doing it. Cognitive laziness is built deep into our nature. But the real value to be served in clinical decision making is the quality of care and outcomes for patients. Where individual patients are involved, rich multivariable information rigorously integrated for individual patient-level decision-making leads to much greater acuity in predicting the consequences of health care actions; and ultimately, to better decisions and outcomes.</p>
<p>Well formulated real-world posterior conditional probabilities (i.e., l.h.s.) are high-value information about both potential outcomes and uncertainties. Left-hand-side-based decision-making maps observations to actions, and better informs effective care-related decision-making, potentially improving outcomes for patients. Paradoxically, while we may have learned something specific and scientific from the data with SM, we also are not using the predictive capacity of SM (the l.h.s.) optimally either. Prediction generalizes estimation and to some extent hypothesis testing (<a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank">Regression Modeling Strategies</a>, p. 1). For SM, like ML, overall prediction remains a major goal.</p>
<p>Our tendency toward cognitive ease (our allergy to complexity) may explain part of the sex appeal of ML: the allure of outsourcing cognitive effort to the machine. The perceived value of our technology is in removing the difficulty and uncertainty from our lives. This is a source of the seductive power of technology. Part of what is attractive about ML is that it appears to absolve humans of the need to think hard and that solutions will appear out of the machine ‘automagically’. ML appeals to our bias for cognitive ease, and risks beguiling “magical thinking” (a term I borrowed from <a href="https://mitpress.mit.edu/books/whatalgorithmswant" target="_blank">What Algorithms Want</a> by Ed Finn). There is a prevalent fantasy about “the killer app”, and how it will liberate us from our cognitive limitations and the effort of hard thought. And this “killer app” fantasy (in combination with our lazy thinking) reinforces the notion that success is all about the technology, that is, about the algorithm.</p>
<p>Judging from the prevalence of articles and advertisements in the vocational literature and the lay press, the requirement for ML experience among job postings, the emphasis on ML at professional meetings, etc., you might think that SM has gone the way of the horse-and-buggy, or is an endangered species occupying a precarious ecological niche. But, whereas in this epoch we are carried away in a tsunami of data, and ML requires big data, it does not follow that doing ML should now be obligatory. We need to be thinking more carefully than that. An important initial reflection should be on the temptation to be doing ‘Big Data Science’ for the sake of ‘doing Big Data Science’. This is a prevalent confusion of means and ends: solutions in search of a problem. It confuses instruments with objectives. While there are many useful technologies, wisdom resides in knowing which to use and when to (and not to) use them. True value is in the quality of the results, not in just being able to claim pride of place on the Data Science bandwagon. Notwithstanding the rare lucky shots, arbitrary applications of a technology more often than not have underwhelming results. “Give somebody a hammer, and he will treat everything as a nail” very often leads to “This hammer is no good at pounding this screw!” There are many and diverse sources of knowledge about individual statistical methods and applications, but “… the art of data analysis is about choosing and using multiple tools” (<a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank">Regression Modeling Strategies</a>, p. vii). True value will emerge from the judicious and appropriate application of tools for settled purposes. This is where the road map for choosing between ML and SM is useful.</p>
<p>The issue of a false dichotomy is moot: ML and SM are different. A better question may be, are there conditions and ways in which ML and SM can be complementary for specific purposes? Are there ways they can be combined? Are they compatible within the domain of modern applied practice? In the general domain of practice SM and ML only fully displace one another in a perspective of chauvinistic zero-sum domination. They only appear to compete if their respective advantages under specific conditions and for specific purposes are not understood. They only appear to compete under conditions of prejudice or incomplete understanding. Frank’s roadmap does much to resolve this.</p>

Road Map for Choosing Between Statistical Modeling and Machine Learning
http://fharrell.com/post/statml/
Mon, 30 Apr 2018 00:00:00 +0000
http://fharrell.com/post/statml/
<p class="rquote">
Machine learning (ML) may be distinguished from statistical models (SM) using any of three considerations:<br><b>Uncertainty</b>: SMs explicitly take uncertainty into account by specifying a probabilistic model for the data.<br><b>Structural</b>: SMs typically start by assuming additivity of predictor effects when specifying the model.<br><b>Empirical</b>: ML is more empirical including allowance for high-order interactions that are not prespecified, whereas SMs have identified parameters of special interest.<br><br>There is a growing number of hybrid methods combining characteristics of traditional SMs and ML, especially in the Bayesian world. Both SMs and ML can handle high-dimensional situations.
<br><br>
It is often good to let the data speak. But you must be comfortable in assuming that the data are speaking rationally. Data can fool you.<br><br>Whether using SM or ML, work with a methodologist who knows what she is doing, and don't begin an analysis without ample subject matter input.
</p>
<p>Data analysis methods may be described by their areas of applications, but for this article I’m using definitions that are strictly methods-oriented. A statistical model (SM) is a data model that incorporates probabilities for the data generating mechanism and has identified unknown parameters that are usually interpretable and of special interest, e.g., effects of predictor variables and distributional parameters about the outcome variable. The most commonly used SMs are regression models, which potentially allow for a separation of the effects of competing predictor variables. SMs include ordinary regression, Bayesian regression, semiparametric models, generalized additive models, longitudinal models, time-to-event models, penalized regression, and others. Penalized regression includes ridge regression, lasso, and elastic net. Contrary to what some machine learning (ML) researchers believe, SMs easily allow for complexity (nonlinearity and second-order interactions) and an unlimited number of candidate features (if penalized maximum likelihood estimation or Bayesian models with sharp skeptical priors are used). It is especially easy, using regression splines, to allow every continuous predictor to have a smooth nonlinear effect.</p>
<p>ML is taken to mean an algorithmic approach that does not use traditional identified statistical parameters, and for which a preconceived structure is not imposed on the relationships between predictors and outcomes. ML usually does not attempt to isolate the effect of any single variable. ML includes random forests, recursive partitioning (CART), bagging, boosting, support vector machines, neural networks, and deep learning. ML does not model the data generating process but rather attempts to learn from the dataset at hand. ML is more a part of computer science than it is part of statistics. Perhaps the simplest way to distinguish ML from SMs is that SMs (at least in the regression subset of SM) favor additivity of predictor effects while ML usually does not give additivity of effects any special emphasis.</p>
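<p>To show how little machinery a smooth nonlinear effect requires, here is a sketch of one common regression spline, the restricted cubic spline (a cubic spline constrained to be linear beyond the outer knots). The knot locations are arbitrary illustrative values and the function names are mine. In practice one would use an established implementation rather than this hand-rolled version.</p>

```python
def pos_cube(u):
    """Truncated cubic: u**3 when u is positive, else 0."""
    return u ** 3 if u > 0 else 0.0


def rcs_basis(x, knots):
    """Restricted cubic spline basis for one value x: k - 1 columns for k knots."""
    k = len(knots)
    t_last, t_prev = knots[-1], knots[-2]
    scale = (knots[-1] - knots[0]) ** 2  # keeps columns on the scale of x
    basis = [x]  # first column is the linear term
    for t_j in knots[:k - 2]:
        term = (pos_cube(x - t_j)
                - pos_cube(x - t_prev) * (t_last - t_j) / (t_last - t_prev)
                + pos_cube(x - t_last) * (t_prev - t_j) / (t_last - t_prev))
        basis.append(term / scale)
    return basis


knots = [1.0, 3.0, 5.0, 7.0]  # hypothetical knot locations
for x in (0.0, 4.0, 10.0):
    print(x, [round(b, 4) for b in rcs_basis(x, knots)])
```

<p>Each continuous predictor is expanded into these columns and entered into an ordinary regression model, so the fit remains additive and every coefficient is still an identified parameter.</p>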
<p>ML and AI have had their greatest successes in high signal:noise situations, e.g., visual and sound recognition, language translation, and playing games with concrete rules. What distinguishes these is quick feedback while training, and availability of <strong>the</strong> answer. Things are different in the low signal:noise world of medical diagnosis and human outcomes. A great use of ML is in pattern recognition to mimic radiologists’ expert image interpretations. For estimating the probability of a positive biopsy given symptoms, signs, risk factors, and demographics, not so much.</p>
<p>There are many published comparisons of predictive performance of SM and ML. In many of the comparisons, only naive regression methods are used (e.g., everything is assumed to operate linearly), so the SM comparator is nothing but a straw man. And not surprisingly, ML wins. The reverse also happens, where the ML comparator algorithm uses poorly chosen default parameters or the particular ML methods chosen for comparison are out of date. As a side note, when the SM method is just a straw man, the outcry from the statistical community is relatively muted compared with the outcry from ML advocates when the “latest and greatest” ML algorithm was not used in the comparison with SMs. ML seems to require more tweaking than SMs. But SMs often require a time-consuming data reduction step (unsupervised learning) when the number of candidate predictors is very large and penalization (lasso or otherwise) is not desired.</p>
<p>Note that there are ML algorithms that provide superior <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184" target="_blank">predictive discrimination</a> but that pay insufficient attention to <a href="http://fharrell.com/post/medml" target="_blank">calibration</a> (absolute accuracy).</p>
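<p>The difference between discrimination and calibration is easy to see on a toy example. The sketch below computes the concordance (c) index by brute force and compares mean predicted risk with the observed event rate, a crude check sometimes called calibration-in-the-large. The outcomes and predictions are invented for illustration.</p>

```python
def c_index(pred, y):
    """Proportion of (event, non-event) pairs ranked correctly; ties count 1/2."""
    events = [p for p, yi in zip(pred, y) if yi == 1]
    nonevents = [p for p, yi in zip(pred, y) if yi == 0]
    score = 0.0
    for pe in events:
        for pn in nonevents:
            if pe > pn:
                score += 1.0
            elif pe == pn:
                score += 0.5
    return score / (len(events) * len(nonevents))


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


# Toy data for illustration only
y = [0, 0, 1, 0, 1, 1, 0, 1]
pred = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.5, 0.9]
distorted = [p ** 3 for p in pred]  # monotone transform: same ranking

print(c_index(pred, y), c_index(distorted, y))  # 0.875 0.875
print(f"observed rate {mean(y):.2f}, mean predicted {mean(pred):.2f}, "
      f"mean distorted {mean(distorted):.2f}")
```

<p>The distorted predictions rank patients exactly as well as the originals, yet their average is far below the observed event rate. A measure of discrimination alone cannot detect this.</p>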
<p>Because SMs favor additivity as a default assumption, when additive effects dominate, SM requires far lower sample sizes (typically 20 events per candidate predictor) than ML, which typically requires <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471228814137" target="_blank">200 events</a> per candidate predictor. Thus ML can sometimes create a demand for “big data” when smallmoderate sized datasets will do. I sometimes dislike ML solutions for particular medical problems because of ML’s <strong>lack</strong> of assumptions. But SMs are not very good at reliably finding nonprespecified interactions; SM typically requires interactions to be prespecified. On the other hand, <a href="https://www.ahrq.gov" target="_blank">AHRQ</a>sponsored research I did on large medical outcomes datasets in the 1990s with the amazing University of Nevada Reno physicianstatistician <a href="https://www.legacy.com/obituaries/rgj/obituary.aspx?n=philgoodman&pid=144885798" target="_blank">Phil Goodman</a>, whom we lost at an alltooearly age, demonstrated that important nonadditive effects are rare when predicting patient mortality. As a result, neural networks were no better than logistic regression in terms of predictive discrimination in these datasets.</p>
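<p>The two rules of thumb quoted above translate into a quick back-of-the-envelope check. The number of events below is hypothetical and the function name is mine. The rules are rough guides, not guarantees.</p>

```python
def max_candidate_predictors(n_events, events_per_predictor):
    """Largest number of candidate predictors a rule of thumb would allow."""
    return n_events // events_per_predictor


n_events = 400  # hypothetical number of outcome events in the dataset
print("SM:", max_candidate_predictors(n_events, 20))   # SM: 20
print("ML:", max_candidate_predictors(n_events, 200))  # ML: 2
```

<p>With the same 400 events, the rule for SM supports about 20 candidate predictors while the rule for ML supports only 2, which is one way to see where the demand for “big data” comes from.</p>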
<p>There are many current users of ML algorithms who falsely believe that one can <a href="http://fharrell.com/post/mlsamplesize" target="_blank">make reliable predictions from complex datasets with a small number of observations</a>. Statisticians are pretty good at knowing the limitations caused by the effective sample size, and at stopping short of trying to incorporate model complexity that is not supported by the information content of the sample.</p>
<p>Here are some rough guidelines that attempt to help researchers choose between the two approaches, for a prediction problem<sup class="footnoteref" id="fnref:Notethatasdes"><a rel="footnote" href="#fn:Notethatasdes">1</a></sup>.</p>
<p><strong>A statistical model may be the better choice if</strong></p>
<ul>
<li>Uncertainty is inherent and the signal:noise ratio is not large—even with identical twins, one twin may get colon cancer and the other not; one should model tendencies instead of doing classification when there is randomness in the outcome</li>
<li>One doesn’t have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error</li>
<li>One wants to isolate effects of a small number of variables</li>
<li>Uncertainty in an overall prediction or the effect of a predictor is sought</li>
<li>Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be prespecified</li>
<li>The sample size isn’t huge</li>
<li>One wants to isolate (with a predominantly additive effect) the effects of “special” variables such as treatment or a risk factor</li>
<li>One wants the entire model to be interpretable</li>
</ul>
<p><strong>Machine learning may be the better choice if</strong></p>
<ul>
<li>The signal:noise ratio is large and the outcome being predicted doesn’t have a strong component of randomness; e.g., in visual pattern recognition an object must be an <code>E</code> or not an <code>E</code></li>
<li>The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)</li>
<li>Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)</li>
<li>One is not very interested in estimating uncertainty in forecasts or in effects of selected predictors</li>
<li>Nonadditivity is expected to be strong and can’t be isolated to a few prespecified variables (e.g., in visual pattern recognition the letter <code>L</code> must have both a dominating vertical component <strong>and</strong> a dominating horizontal component <strong>and</strong> these two must intersect at their endpoints)</li>
<li>The sample size is <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471228814137" target="_blank">huge</a></li>
<li>One does not need to isolate the effect of a special variable such as treatment</li>
<li>One does not care that the model is a “black box”</li>
</ul>
<h2 id="editorialcomment">Editorial Comment</h2>
<p>Some readers have <a href="https://twitter.com/samfin55/status/991031725189984258" target="_blank">commented on Twitter</a> that I’ve created a false dichotomy of SMs vs. ML. There is some truth in this claim. The motivations for my approach to the presentation are</p>
<ul>
<li>to clarify that regression models are <strong>not</strong> ML<sup class="footnoteref" id="fnref:Thereisaninte"><a rel="footnote" href="#fn:Thereisaninte">2</a></sup></li>
<li>to sharpen the discussion by having a somewhat concrete definition of ML as a method without “specialness” of the parameters, that does not make many assumptions about the structure of predictors in relation to the outcome being predicted, and that does not explicitly incorporate uncertainty (e.g., probability distributions) into the analysis</li>
<li>to recognize that the bulk of machine learning being done today, especially in biomedical research, seems to be completely uninformed by statistical principles (much to its detriment IMHO), even to the point of many ML users not properly understanding predictive accuracy. It is impossible to have good predictions that address the problem at hand without a thorough understanding of measures of predictive accuracy when choosing the measure to optimize.</li>
</ul>
<p>Some definitions of ML and discussions about the definitions may be found <a href="https://www.techemergence.com/whatismachinelearning" target="_blank">here</a>, <a href="https://machinelearningmastery.com/whatismachinelearning" target="_blank">here</a>, and <a href="https://stackoverflow.com/questions/2620343" target="_blank">here</a>. I like the following definition from <a href="http://www.amazon.com/dp/0070428077?tag=inspiredalgor20" target="_blank">Tom Mitchell</a>: <em>The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.</em></p>
<p>The two fields may also be defined by how their practitioners spend their time. Someone engaged in ML will mainly spend her time choosing algorithms, writing code, specifying tuning parameters, waiting for the algorithm to run on a computer or cluster, and analyzing the accuracy of the resulting predictions. Someone engaged mainly in SMs will tend to spend time choosing a statistical model family, specifying the model, checking goodness of fit, analyzing accuracy of predictions, and interpreting estimated effects.</p>
<p>See <a href="https://twitter.com/f2harrell/status/990991631900921857" target="_blank">this</a> for more Twitter discussions.</p>
<figure >
<img src="http://fharrell.com/img/keithmageeTweet.png" width="60%" />
</figure>
<h2 id="furtherreading">Further Reading</h2>
<ul>
<li>Follow-up article by Drew Levy: <a href="http://fharrell.com/post/statml2">Navigating Statistical Modeling and Machine Learning</a></li>
<li><a href="http://www2.math.uu.se/~thulin/mm/breiman.pdf" target="_blank">Statistical Modeling: The Two Cultures</a> by Leo Breiman <br><small>Note: I very much disagree with Breiman’s view that data models are not important. How would he handle truncated/censored data for example? I do believe that data models need to be flexible. This is facilitated by Bayesian modeling.</small></li>
<li><a href="https://jamanetwork.com/journals/jama/articleabstract/2675024" target="_blank">Big Data and Machine Learning in Health Care</a> by AL Beam and IS Kohane</li>
<li>Harvard Business Review article <a href="https://hbr.org/2016/12/whyyourenotgettingvaluefromyourdatascience" target="_blank">Why You’re Not Getting Value From Your Data Science</a>, about regression vs. machine learning in business applications</li>
<li><a href="http://www.sharpsightlabs.com/blog/differencemachinelearningstatisticsdatamining" target="_blank">What’s the Difference Between Machine Learning, Statistics, and Data Mining?</a></li>
<li><a href="https://jamanetwork.com/journals/jama/fullarticle/2683125" target="_blank">Big Data and Predictive Analytics: Recalibrating Expectations</a> by Shah, Steyerberg, Kent</li>
<li><a href="https://matloff.wordpress.com/2018/06/20/neuralnetworksareessentiallypolynomialregression" target="_blank">Neural Networks are Essentially Polynomial Regression</a> by Norman Matloff</li>
</ul>
<h4 id="footnotes">Footnotes</h4>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Notethatasdes">Note that as described <a href="http://fharrell.com/post/classification" target="_blank">here</a>, it is not appropriate to cast a prediction problem as a classification problem except in special circumstances that usually entail instant visual or sound pattern recognition requirements in a high signal:noise situation where the utility/cost/loss function cannot be specified. ML practitioners frequently misunderstand this, leading them to use <a href="http://www.fharrell.com/post/classdamage" target="_blank">improper accuracy scoring rules</a>. <a class="footnotereturn" href="#fnref:Notethatasdes"><sup>^</sup></a></li>
<li id="fn:Thereisaninte">There is an intersection of ML and regression in neural networks. See <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780140108" target="_blank">this article</a> for more. <a class="footnotereturn" href="#fnref:Thereisaninte"><sup>^</sup></a></li>
</ol>
</div>

Musings on Multiple Endpoints in RCTs
http://fharrell.com/post/ymult/
Mon, 26 Mar 2018 00:00:00 +0000
<p class="rquote">
Learning is more productive than avoiding mistakes. And if one wishes to just avoid mistakes, make sure the mistakes are real. Question whether labeling endpoints is productive, and whether type I error risks are valuable in quantifying evidence for effects and should interfere with asking questions.
</p>
<p>The <a href="https://www.nhlbi.nih.gov" target="_blank">NHLBI</a>-funded <a href="https://www.ischemiatrial.org" target="_blank">ISCHEMIA</a> multinational randomized clinical trial<sup class="footnoteref" id="fnref:DisclosureIlea"><a rel="footnote" href="#fn:DisclosureIlea">1</a></sup> is designed to assess the effect of cardiac catheterization-guided coronary revascularization strategy (which includes optimal medical management) compared to optimal medical management alone (with cardiac cath reserved for failure of medical therapy) for patients with stable coronary artery disease. It is unique in that the use of cardiac catheterization is randomized, so that the entire “strategy pipeline” can be studied. Previous studies performed randomization after catheterization results were known, allowing the so-called “oculostenotic reflex” of cardiologists to influence adherence to randomization to a revascularization procedure.</p>
<p>As well summarized <a href="https://www.tctmd.com/news/ischemiafracasamidchargesmovinggoalpostsinvestigatorscomeoutswinging" target="_blank">here</a> and <a href="http://circoutcomes.ahajournals.org/content/11/4/e004744" target="_blank">here</a>, the ISCHEMIA trial recently created a great deal of discussion in the cardiology community when the primary outcome was changed from cardiovascular death or nonfatal myocardial infarction to a 5-category endpoint that also includes hospitalization for unstable angina or heart failure, and resuscitated cardiac arrest. The 5-component endpoint was the trial’s original primary endpoint and was the basis for the NIH grant funding. The possibility of, and procedure for, reverting to this 5-component endpoint were thought out even before the study began. The change was pragmatic as is usually the case: the accrual and event rates seldom go as hoped. The main concern in the cardiology community is the use of so-called “soft” endpoints. The original two-component endpoint is now an important secondary endpoint.</p>
<p>The purpose of this article is not to discuss ISCHEMIA but to discuss the general study design, endpoint selection, and analysis issues ISCHEMIA raises that apply to a multitude of trials.</p>
<h1 id="powervoodooevenwithonlyoneendpointsizingastudyischallenging">Power Voodoo: Even With Only One Endpoint, Sizing a Study is Challenging</h1>
<p>Before discussing power, recall that the type I error α is the probability (risk) of making an assertion of a nonzero effect when the true effect is zero. In any given study we don’t know if a type I error has been committed. A type I error is not an error in the usual sense; it is a long-run operating characteristic, i.e., the chance of someone observing data <strong>more</strong> extreme than ours if they could indefinitely repeat our experiment but with a treatment effect of exactly zero magically inserted. Type I error is the chance of making an <strong>assertion</strong> of efficacy <em>in general</em>, when there is no efficacy.</p>
<p>Power calculations, and sample size calculations based on power, have long been thought by statisticians to be more voodoo than science. Besides all the problems related to null hypothesis testing in general, and arbitrariness in the setting of α and power (1 − type II error β), a significant difficulty and chance for arbitrariness is the choice of the effect size δ to detect with probability 1 − β. δ is invariably manipulated, at least partly, to result in a sample size that meets budget constraints. What if instead a fully <a href="http://fharrell.com/post/bayesseq">sequential trial</a> were done and budgeting were incremental depending on the promise shown by current results? δ could be held at the original effect size determined by clinical experts, and a Bayesian approach could be used in which no single δ was assumed. Promising evidence for a more-than-clinically-trivial effect could result in the release of more funds<sup class="footnoteref" id="fnref:Thissequential"><a rel="footnote" href="#fn:Thissequential">2</a></sup>. Total program costs could even be reduced, by more quickly stopping studies with a high risk of being futile. A sequential approach makes it less necessary to change an endpoint for pragmatic reasons once the study begins. So would adoption of a Bayesian approach to evidence generation, as a replacement for null hypothesis significance testing. If one “stuck it out” with the original endpoint no matter what the accrual and event frequency, and found that the treatment efficacy assessment is not “definitive” but that the posterior probability of efficacy was 0.93 at the planned study end, many would regard the result as providing good evidence (i.e., a betting person would not make money by betting against the new treatment).
On the other hand, p > 0.05 would traditionally be seen as “the study is uninformative since statistical significance was not achieved<sup class="footnoteref" id="fnref:Interpretingthe"><a rel="footnote" href="#fn:Interpretingthe">3</a></sup>.” To some extent the perceived need to change endpoints in a study<sup class="footnoteref" id="fnref:Whichisethical"><a rel="footnote" href="#fn:Whichisethical">4</a></sup> occurs because study leaders and especially sponsors are held hostage by the null hypothesis significance testing/power paradigm.</p>
<p>Speaking of Bayes and sample size calculations, the Bayesian philosophy is to not have any unknowns in any calculation. Posterior probabilities are conditional on current cumulative data and do not use a single value for δ. An entire prior distribution is used for δ. By allowing for uncertainty in δ, Bayesian power calculations are more honest than frequentist calculations.
Some useful references are <a href="https://www.zotero.org/groups/2199991/feh/items/tag/bayes/tag/samplesize" target="_blank">here</a>.</p>
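<p>To make the contrast between a single assumed δ and a whole prior distribution for δ concrete, here is a minimal sketch. It uses one common hybrid calculation, sometimes called assurance: the frequentist power averaged over a normal prior for δ. This is only one possible reading of “Bayesian power”, and every specific here (a two-arm comparison of means with known standard deviation, the function names, and all numbers) is a hypothetical assumption chosen for illustration.</p>

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def standard_error(sigma, n_per_arm):
    """Standard error of a difference in two means with known SD sigma."""
    return sigma * math.sqrt(2.0 / n_per_arm)


def power_fixed_delta(delta, sigma, n_per_arm, alpha=0.05):
    """Chance of a 'significant' result favoring treatment when the true
    effect is exactly delta (two-sided alpha, benefit direction only)."""
    se = standard_error(sigma, n_per_arm)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(delta / se - z_crit)


def power_averaged_over_prior(prior_mean, prior_sd, sigma, n_per_arm, alpha=0.05):
    """The same chance averaged over delta ~ Normal(prior_mean, prior_sd).

    For a normal prior the average has a closed form, so no simulation
    is needed.
    """
    se = standard_error(sigma, n_per_arm)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    scale = math.sqrt(1.0 + (prior_sd / se) ** 2)
    return _STD_NORMAL.cdf((prior_mean / se - z_crit) / scale)


# Hypothetical planning values: SD 10, 64 subjects per arm, hoped-for effect 5
fixed = power_fixed_delta(delta=5.0, sigma=10.0, n_per_arm=64)
averaged = power_averaged_over_prior(prior_mean=5.0, prior_sd=2.5,
                                     sigma=10.0, n_per_arm=64)
print(round(fixed, 3), round(averaged, 3))
```

<p>With these made-up inputs, acknowledging uncertainty about δ lowers the chance of success noticeably compared with pretending that δ is known, which is the sense in which the averaged calculation is the more honest of the two.</p>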
<p>One of the challenges in power and sample size calculations, and knowing when to stop a study, is that there are competing goals. One might be interested in concluding any of the following:</p>
<ul>
<li>the treatment is beneficial (working in the right direction)</li>
<li>the treatment is more than trivially beneficial</li>
<li>the estimate of the magnitude of the treatment effect has sufficient precision (e.g., the multiplicative margin of error in a hazard ratio)</li>
</ul>
<p>In the frequentist domain, planning studies around <a href="https://www.zotero.org/groups/2199991/feh/items/tag/precision" target="_blank">precision</a> frees the researcher from having to choose δ. The ISCHEMIA study, in addition to doing traditional power calculations, also emphasized having a sufficient sample size to estimate the hazard ratio for the two most important endpoints to within an adequate multiplicative margin of error with 0.95 confidence. Bayesian precision can likewise be determined using the half-width of the 0.95 credible interval for the treatment effect<sup class="footnoteref" id="fnref:Iftherearemul"><a rel="footnote" href="#fn:Iftherearemul">5</a></sup>.</p>
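<p>A multiplicative margin of error for a hazard ratio can be sketched as follows. The code assumes the usual large-sample approximation that the variance of the log hazard ratio is roughly 1/e<sub>A</sub> + 1/e<sub>B</sub>, where e<sub>A</sub> and e<sub>B</sub> are the numbers of events in the two arms. The function names and event counts are hypothetical and are not taken from the ISCHEMIA design.</p>

```python
import math
from statistics import NormalDist


def hr_margin_of_error(events_a, events_b, confidence=0.95):
    """Multiplicative margin of error for a hazard ratio.

    Assumes var(log HR) is approximately 1/events_a + 1/events_b.
    The confidence limits are then HR / margin and HR * margin.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    se_log_hr = math.sqrt(1.0 / events_a + 1.0 / events_b)
    return math.exp(z * se_log_hr)


def events_per_arm_for_margin(target_margin, confidence=0.95):
    """Events needed in each arm (equal allocation assumed) so that the
    multiplicative margin of error does not exceed target_margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil(2.0 * (z / math.log(target_margin)) ** 2)


print(round(hr_margin_of_error(200, 200), 3))
print(events_per_arm_for_margin(1.2))
```

<p>Note that no effect size δ appears anywhere in this calculation; only the number of events and the desired precision are needed.</p>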
<h1 id="multipleendpointsandendpointprioritization">Multiple Endpoints and Endpoint Prioritization</h1>
<p>To a large extent, the perceived need to adjust/penalize for asking multiple questions (about multiple endpoints) or at least the need for prioritization of endpoints arises from the perceived need to control overall type I error (also known as α spending). The chance of making an “effectiveness” assertion if any of three endpoints shows evidence against a null hypothesis is greater than α for any one endpoint. As an aside, <a href="https://www.zotero.org/groups/2199991/feh/items/itemKey/NCNTAZ5R/q/Farewell" target="_blank">Cook and Farewell</a> give a persuasive argument for prioritization of endpoints but not adjusting their p-values for multiplicity when one is asking separate questions regarding the endpoints<sup class="footnoteref" id="fnref:Thatiswhenthe"><a rel="footnote" href="#fn:Thatiswhenthe">6</a></sup>. Think of prioritization of endpoints as prespecification of the order for publication and how the study results are publicized. It is OK to announce a “significant” third endpoint as long as the “insignificant” first and second endpoints are announced first, and the context for the third endpoint is preserved.</p>
<p>Having been privy to dozens of hours of discussions among clinical trialists during protocol writing for many randomized clinical trials, I can confidently say that the reasoning for the final choices comes from a mixture of practical, clinical, and patient-oriented considerations, perhaps with too much emphasis on the pragmatic statistical question “for which endpoint that the treatment possibly affects are we likely to have a sufficient number of events?”. Though statistical considerations are important, this approach is not fully satisfying because</p>
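<p>The inflation of the overall type I error can be illustrated with a small calculation. The sketch assumes, purely for simplicity, that the tests on the endpoints are statistically independent and that all null hypotheses are true; with correlated endpoints the inflation is smaller but still present. The function name is hypothetical.</p>

```python
def prob_any_false_assertion(alpha, n_endpoints):
    """Chance that at least one of n independent tests, each run at level
    alpha, rejects its null hypothesis when every null is true."""
    return 1.0 - (1.0 - alpha) ** n_endpoints


for k in (1, 2, 3):
    print(k, round(prob_any_false_assertion(0.05, k), 4))
```

<p>Under these assumptions, three endpoints each tested at α = 0.05 give an overall chance of about 0.14 of asserting an effect on at least one of them when no effect exists.</p>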
<ul>
<li>the final choices remain too arbitrary and are not purely clinically/public health motivated</li>
<li>binary endpoints <a href="http://fharrell.com/post/ordinalinfo">are not statistically efficient anyway</a></li>
<li>using separate binary endpoints does not combine the endpoints into an overall patient-utility scale<sup class="footnoteref" id="fnref:Anordinalwhat"><a rel="footnote" href="#fn:Anordinalwhat">7</a></sup>.</li>
</ul>
<p>Having multiple prespecified endpoints also sets the stage for a blinded committee to change the endpoint priority for pragmatic reasons, related to the “slavery to statistical power and null hypothesis testing” discussed above.</p>
<p>It is important to note for ISCHEMIA and in general that having a primary endpoint does not prevent anyone interpreting the study’s final result from emphasizing a secondary or tertiary endpoint.</p>
<h1 id="jointmodelingofmultipleendpoints">Joint Modeling of Multiple Endpoints</h1>
<p>Joint modeling of multiple outcomes allows uncovering relationships of multiple outcome variables, and quantifying joint evidence for all outcomes simultaneously, while providing the usual marginal outcome evidence (for each outcome separately). As discussed <a href="http://fharrell.com/post/bayesfreqstmts">here</a> and <a href="http://fharrell.com/post/journey">here</a>, Bayesian posterior inference has many advantages in this context. For example, the final analysis of a clinical trial with three endpoints E<sub>1</sub>, E<sub>2</sub>, E<sub>3</sub> might be based on posterior probabilities of the following forms:</p>
<table>
<thead>
<tr>
<th>Posterior probability</th>
<th>Interpretation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prob(E<sub>1</sub> > 0 or E<sub>2</sub> > 0 or E<sub>3</sub> > 0)</td>
<td>Prob(efficacy) on <strong>any</strong> endpoint</td>
</tr>
<tr>
<td>Prob(E<sub>1</sub> > 0 and E<sub>2</sub> > 0)</td>
<td>Prob(efficacy) on both of the first two endpoints</td>
</tr>
<tr>
<td>Prob(E<sub>1</sub> > 0 or (E<sub>2</sub> > 3 and E<sub>3</sub> > 4))</td>
<td>Prob(any mortality reduction or large reductions on two nonfatal endpoints)</td>
</tr>
<tr>
<td>Prob(at least two of E<sub>1</sub> > 0, E<sub>2</sub> > 0, E<sub>3</sub> > 0)</td>
<td>Prob(hitting any two of the three efficacy targets)</td>
</tr>
<tr>
<td>Prob(−1 < E<sub>1</sub> < 1)</td>
<td>Prob(similarity of E<sub>1</sub> outcome)</td>
</tr>
</tbody>
</table>
<p>One can readily see that once you get away from null hypothesis testing, many clinically relevant possibilities exist, and multiplicity considerations are cast aside. A reasonable strategy would be to demand an extrahigh probability of hitting any one of three targets, or a somewhat lower probability of hitting any two of the three targets. More about this way of thinking may be found <a href="http://fharrell.com/post/bayesfreqstmts">here</a><sup class="footnoteref" id="fnref:Justasonecan"><a rel="footnote" href="#fn:Justasonecan">8</a></sup>.</p>
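<p>Probabilities of compound assertions like those in the table are simple to compute once joint posterior draws of the three effects are available: each one is the fraction of draws for which the assertion holds. The sketch below uses made-up correlated draws in place of real output from a fitted Bayesian model. The distributions, the function names, and the similarity interval (−1, 1) for E<sub>1</sub> are all assumptions for illustration.</p>

```python
import random


def simulate_posterior_draws(n_draws=5000, seed=2018):
    """Return made-up correlated 'posterior draws' of (E1, E2, E3).

    These are NOT from a fitted model; a real analysis would take the
    draws from the joint posterior produced by a posterior sampler.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        shared = rng.gauss(0.0, 1.0)  # shared component induces correlation
        e1 = 0.5 + 0.6 * shared + rng.gauss(0.0, 1.0)
        e2 = 2.0 + 1.5 * shared + rng.gauss(0.0, 2.0)
        e3 = 3.0 + 1.5 * shared + rng.gauss(0.0, 2.0)
        draws.append((e1, e2, e3))
    return draws


def posterior_prob(draws, assertion):
    """Fraction of joint draws for which the assertion is true."""
    return sum(1 for d in draws if assertion(*d)) / len(draws)


draws = simulate_posterior_draws()

p_any = posterior_prob(draws, lambda e1, e2, e3: e1 > 0 or e2 > 0 or e3 > 0)
p_first_two = posterior_prob(draws, lambda e1, e2, e3: e1 > 0 and e2 > 0)
p_mixed = posterior_prob(draws, lambda e1, e2, e3: e1 > 0 or (e2 > 3 and e3 > 4))
p_two_of_three = posterior_prob(
    draws, lambda e1, e2, e3: (e1 > 0) + (e2 > 0) + (e3 > 0) >= 2)
p_similar = posterior_prob(draws, lambda e1, e2, e3: -1 < e1 < 1)

for label, p in [("any endpoint", p_any), ("first two", p_first_two),
                 ("mortality or two large", p_mixed),
                 ("two of three", p_two_of_three), ("similarity", p_similar)]:
    print(label, round(p, 3))
```

<p>Every probability is read off the same set of joint draws, so asking an additional question costs nothing and requires no adjustment of the earlier answers.</p>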
<p>Posterior probabilities also provide the direct forward predictive type of evidence that leads to optimum decisions. Barring cost considerations, a treatment that has a 0.93 chance of reducing mortality may be deemed worthwhile, especially if a skeptical prior was used.</p>
<p>Joint Bayesian modeling of multiple endpoints also allows one to uncover interrelationships among the endpoints as described in the recent paper by <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.1852" target="_blank">Costa and Drury</a>. One of the methods proposed by the authors, the one based on a multivariate copula, has several advantages. First, one obtains the usual marginal treatment effects on each endpoint separately. Second, the Bayesian analysis they describe allows one to estimate the amount of dependence between two endpoints, which is interesting in its own right and will help in estimating power when planning future studies. Third, the amount of such dependence can be allowed to vary by treatment. For example, if one endpoint is an efficacy endpoint (or continuous measurement) and another is the occurrence of an adverse event, placebo subjects may randomly experience the adverse event such that there is no within-person correlation between it and the efficacy response. On the other hand, subjects on the active drug may experience the efficacy and safety outcomes together. E.g., subjects getting the best efficacy response may be those with more adverse events. Estimation of between-outcome dependencies is of real clinical interest.</p>
<p>Most importantly, Bayesian analysis of clinical trials, when multiple endpoints are involved, allows the results for each endpoint to be properly interpreted marginally. That is because the prior state of knowledge, which may reasonably be encapsulated into a skeptical prior (i.e., a prior distribution that assumes large treatment effects are unlikely) leads to a posterior probability of efficacy for each endpoint that is straightforwardly interpreted regardless of context. Because Bayes deals with <a href="http://fharrell.com/post/pvalprobs">forward probabilities</a>, these posterior probabilities of efficacy are calibrated by their priors. For example, the skepticism with which we view efficacy of a treatment on endpoint E<sub>2</sub> comes from the data about the E<sub>2</sub> effect and the prior skepticism about the E<sub>2</sub> effect, no matter what the effect on E<sub>1</sub>. This way of thinking shows clearly the value of trying to learn more from a study by asking multiple questions. One should not be penalized for curiosity.</p>
<h3 id="furtherreading">Further Reading</h3>
<ul>
<li><a href="https://jamanetwork.com/journals/jama/articleabstract/185214?redirect=true" target="_blank">Composite end points in randomized trials: There is no free lunch</a></li>
<li><a href="https://www.zotero.org/groups/2199991/feh/items/tag/multipleendpoints" target="_blank">Miscellaneous papers</a> on multiple endpoints</li>
</ul>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes">
<hr />
<ol>
<li id="fn:DisclosureIlea">Disclosure: I lead the independent statistical team at Vanderbilt that supports the DSMB for the ISCHEMIA trial and was involved in the trial design. As the DSMB reporting statistician I am unblinded to treatment assignment and outcomes. I had no interaction with the blinded independent advisory committee that recommended the primary endpoint change or with the process that led to that recommendation. <a class="footnotereturn" href="#fnref:DisclosureIlea"><sup>^</sup></a></li>
<li id="fn:Thissequential">This sequential funding approach assumes that outcomes occur quickly enough to influence the assessment. <a class="footnotereturn" href="#fnref:Thissequential"><sup>^</sup></a></li>
<li id="fn:Interpretingthe">Interpreting the p-value in conjunction with a 0.95 confidence interval would help, but there are two problems. First, most users of frequentist theory are hung up on the p-value. Second, the confidence interval has endpoints that are not controllable by the user, in contrast to Bayesian posterior probabilities of treatment effects being available for any user-specified interval endpoints. For example, one may want to compute Prob(blood pressure reduction > 3 mmHg). <a class="footnotereturn" href="#fnref:Interpretingthe"><sup>^</sup></a></li>
<li id="fn:Whichisethical">Which is ethically done, by decision makers using only pooled treatment data. <a class="footnotereturn" href="#fnref:Whichisethical"><sup>^</sup></a></li>
<li id="fn:Iftherearemul">If there are multiple data looks, the traditional frequentist confidence interval is no longer valid and a complicated adjustment is needed. The adjusted confidence interval would be seen by Bayesians as conservative. <a class="footnotereturn" href="#fnref:Iftherearemul"><sup>^</sup></a></li>
<li id="fn:Thatiswhenthe">That is, when the endpoint comparisons are to be interpreted marginally. <a class="footnotereturn" href="#fnref:Thatiswhenthe"><sup>^</sup></a></li>
<li id="fn:Anordinalwhat">An ordinal “what’s the worst thing that happened to the patient” scale would have few assumptions, would increase power, and would give credit to a treatment that has more effect on more serious outcomes than it has on less serious ones. <a class="footnotereturn" href="#fnref:Anordinalwhat"><sup>^</sup></a></li>
<li id="fn:Justasonecan">Just as one can compute the probability of rolling a six on either of two dice, or rolling a total greater than 9, direct predictive-mode probabilities may be computed as often as desired with no multiplicity. Multiplicity with backwards probabilities comes from giving data more chances to be extreme (the frequentist sample space) and not from the chances you give more efficacy parameters to be positive. <a class="footnotereturn" href="#fnref:Justasonecan"><sup>^</sup></a></li>
</ol>
</div>

Improving Research Through Safer Learning from Data
http://fharrell.com/post/improveresearch/
Thu, 08 Mar 2018 00:00:00 +0000
<h1 id="overview">Overview</h1>
<p>There are two broad classes of data analysis. The first class, exploratory data analysis, attempts to understand the data at hand, i.e., to understand <em>what happened</em>, and can use descriptive statistics, graphics, and other tools, including multivariable statistical models<sup class="footnoteref" id="fnref:Whenmanyvariab"><a rel="footnote" href="#fn:Whenmanyvariab">1</a></sup>. The second broad class of data analysis is inferential analysis which aims to provide evidence and assist in judgments about the process generating the one dataset. Here the interest is in generalizability, and a statistical model is not optional<sup class="footnoteref" id="fnref:Everystatistica"><a rel="footnote" href="#fn:Everystatistica">2</a></sup>. Sometimes this is called population inference but it can be thought of less restrictively as understanding the data generating process. Also there is prediction, which is a mode of inference distinct from judgment and decision making.</p>
<p>The following discussion concentrates on inference, although several of the concepts, especially measurement accuracy, fully pertain to exploratory data analysis.</p>
<p>The key elements of learning from data using statistical inference involve the following:</p>
<ol>
<li>prespecification if doing formal inference, intending to publish, or intending to be reviewed by regulatory authorities</li>
<li>choosing an experimental design</li>
<li>considering the spectrum of per-subject information available</li>
<li>considering information content, bias, and precision in measurements</li>
<li>understanding variability in measurements</li>
<li>specification of the statistical model</li>
<li>incorporating beliefs of judges/regulators/consumers into the model parameters if Bayesian</li>
<li>incorporating beliefs of judges/regulators/consumers into model interpretation if frequentist</li>
<li>using the model to quantify evidence</li>
<li>replication/validation, when needed</li>
<li>translating the evidence to a decision or an action</li>
</ol>
<h1 id="prespecification">Prespecification</h1>
<p>Prespecification of the study design and analysis is an incredibly important component of reproducible research. It is necessary unless one is engaging in exploratory learning (especially in the initial phase of research) and not intending for the results to be considered confirmatory. Prespecification controls investigator degrees of freedom (see below) and keeps the investigator from entering the <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf" target="_blank">garden of forking paths</a>. A large fraction of studies that failed to validate can be traced to the nonexistence of a prospective, specific data transformation and statistical analysis plan. Randomized clinical trials require almost complete prespecification. Animal and observational human subjects research does not enjoy the same protections, and many an experiment has resulted in statistical disappointment that tempted the researcher to modify the analysis, choice of response variable, sample membership, computation of derived variables, normalization method, etc. The use of cutoffs on p-values causes a large portion of this problem.</p>
<p>Frequentist and Bayesian analysis are alike with regard to the need for prespecification. But a Bayesian approach has an advantage here: you can include parameters for what you don’t know and hope you don’t need (but are not sure). For example, one could specify a model in which a dose-response relationship is linear, but add a parameter that allows a departure from linearity. One can hope that interaction between treatment and race is absent, but <a href="https://www.ncbi.nlm.nih.gov/pubmed/9192445" target="_blank">include parameters allowing for such interactions</a>. In these two examples, skeptical prior distributions for the “extra” parameters would favor a linear dose-response or absence of interaction, but as the sample size increases the data would allow these parameters to “float” as needed. Bayes still provides accurate inferences when one is not sure of the model. This is discussed further below. Two-stage analyses as typically employed in the frequentist paradigm (e.g., pretesting for linearity of dose-response) do not control type I error.</p>
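<p>How a skeptical prior lets an “extra” parameter float as the sample size grows can be sketched with the simplest conjugate case. The code assumes a normal prior centered at zero for the extra parameter (say, a nonlinearity or interaction coefficient) and a normally distributed data estimate with known per-observation standard deviation. This is a deliberate simplification of a real model, and the function name and all numbers are hypothetical.</p>

```python
def posterior_for_extra_parameter(estimate, sigma, n, prior_sd):
    """Normal-normal posterior for an 'extra' parameter with prior mean 0.

    estimate : data-only estimate of the parameter
    sigma    : assumed known per-observation SD, so SE = sigma / sqrt(n)
    prior_sd : SD of the skeptical prior centered at zero
    Returns (posterior_mean, posterior_sd, weight_on_data).
    """
    data_precision = n / sigma ** 2
    prior_precision = 1.0 / prior_sd ** 2
    weight = data_precision / (data_precision + prior_precision)
    post_mean = weight * estimate  # shrunk toward the prior mean of 0
    post_sd = (data_precision + prior_precision) ** -0.5
    return post_mean, post_sd, weight


# Same data estimate, increasing sample sizes
for n in (20, 200, 2000):
    mean, sd, w = posterior_for_extra_parameter(
        estimate=0.5, sigma=2.0, n=n, prior_sd=0.25)
    print(n, round(w, 3), round(mean, 3), round(sd, 3))
```

<p>With few observations the posterior mean is pulled strongly toward zero (favoring linearity or no interaction); with many observations the weight on the data approaches one and the prior has little influence.</p>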
<h1 id="experimentaldesign">Experimental Design</h1>
<p>The experimental design is all important, and is what allows interpretations to be causal. For example, in comparing two treatments there are two types of questions:</p>
<ol>
<li>Did treatment B work better in the group of patients receiving it in comparison to those patients who happened to receive treatment A?</li>
<li>Would this patient fare better were <em>she</em> given treatment B vs. were <em>she</em> given treatment A?</li>
</ol>
<p>The first question is easy to answer using statistical models (estimation or prediction), not requiring any understanding of physicians’ past treatment choices. The second question is one of causal inference, and it is impossible for observational data to answer that question without additional unverifiable assumptions. (Compare this to a randomized crossover study where the causal question can be almost directly answered.)</p>
<p>In a designed experiment, the experimenter usually knows exactly which variables to measure, and some of the variables are completely controlled. For example, in a 3x3 randomized factorial design, two factors are each experimentally set to three different levels giving rise to 9 controlled combinations. The experiment can block on yet other factors to explain outcome variation caused by them. In a randomized crossover study, an investigator can estimate causal treatment differences per subject if carryover effects are washed out. In an observational therapeutic effectiveness study it is imperative to measure a long list of relevant variables that explain outcomes <strong>and</strong> treatment choices. Still not guaranteeing an ability to answer the causal therapeutic question, having a wide spectrum of accurately collected baseline data is required to begin the process. Other design elements of observational studies are extremely important, including such aspects as when variables are measured, which subjects are included, what is the meaning of “time zero”, and how does one avoid losses to follow-up.</p>
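<p>The 3x3 factorial layout mentioned above can be sketched in a few lines. The factor names, level labels, number of subjects, and function names below are hypothetical; the sketch simply enumerates the 9 controlled combinations and randomly allocates subjects to them with equal cell sizes.</p>

```python
import itertools
import random


def factorial_cells(levels_a, levels_b):
    """All combinations of the levels of two experimentally set factors."""
    return list(itertools.product(levels_a, levels_b))


def randomize_balanced(subject_ids, cells, seed=0):
    """Randomly assign subjects to cells with equal cell sizes.

    Requires the number of subjects to be a multiple of the number of cells.
    """
    if len(subject_ids) % len(cells) != 0:
        raise ValueError("number of subjects must be a multiple of the cells")
    replicates = len(subject_ids) // len(cells)
    assignments = cells * replicates
    random.Random(seed).shuffle(assignments)
    return dict(zip(subject_ids, assignments))


dose_levels = ("low", "medium", "high")       # hypothetical factor 1
temperature_levels = ("cold", "room", "warm")  # hypothetical factor 2

cells = factorial_cells(dose_levels, temperature_levels)
allocation = randomize_balanced(list(range(1, 37)), cells)

print(len(cells))
print(allocation[1])
```

<p>Because the experimenter sets both factors and chance decides who gets which combination, differences between cells can be interpreted causally, which is not true of combinations that merely happen to occur in observational data.</p>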
<h1 id="measurementsandunderstandingvariability">Measurements and Understanding Variability</h1>
<p>Understanding what measurements really mean, what they do not capture, minimizing systematic bias, minimizing measurement error, and maximizing data resolution are key to optimizing statistical power and soundness of inference. Resolution is related to data acquisition, variable definitions, and measurement errors. Optimal statistical information comes from continuous measurements whose measurement errors are small.</p>
<p>Understanding sources of variability and incorporating those into the experimental design and the statistical model are important. What is the disagreement in technical replicates (e.g. splitting one blood sample into two and running both through a blood analyzer)? Are there batch effects? Edge effects in a gene microarray? Variation due to different temperatures in the lab each day? Do patients admitted on Friday night inherently have longer hospital stays? Other day of week effects? Seasonal variation and other long-term time effects? How about region, country, and lab variation?</p>
<h1 id="beliefsmatterwheninterpretingresultsorquantifyingabsoluteevidence">Beliefs Matter When Interpreting Results or Quantifying Absolute Evidence</h1>
<p>Notice the inclusion of <em>beliefs</em> in the original list. Frequentists operate under the illusion of objectivity and believe that beliefs are not relevant. This is an illusion, for four reasons.</p>
<ol>
<li>IJ Good showed that all probabilities are subjective because they depend on the knowledge of the observer. One of his examples is that a card player who knows that a certain card is sticky will know a different probability that the card will be at the top of the deck than will a player who doesn’t know that.<br /></li>
<li>To compute p-values, one <strong>must</strong> know the <em>intentions</em> of the investigator. Did she intend to study 90 patients and happen to observe 10 bad outcomes, or did she intend to sample patients until 10 outcomes happened? Did she intend to do an early data look? Did she actually do an early data look but first write an affidavit affirming that she would not take any action as a result of the look? Did she intend to analyze three dependent variables and was the one reported the one she would have reported even had she looked at the data for all three? All of these issues factor into computation of a p-value.</li>
<li>The choice of the statistical model is always subjective (more below).</li>
<li>Interpretations are subjective. Do you multiplicity-adjust a p-value? Using which of the competing approaches? What if other studies have results that are inconsistent with the new study? How do we discount the current p-value for that? But most importantly, the necessary conversion of a frequentist probability of data given a hypothesis into evidence about the hypothesis is entirely subjective.</li>
</ol>
<p>Bayesian inference gets criticized for being subjective when in fact its distinguishing feature is that it is stating subjective assumptions clearly.</p>
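<p>To make the second point in the list concrete, here is a minimal sketch in base R. It assumes, purely for illustration, a null hypothesis that the probability of a bad outcome is 0.2 and a one-sided alternative that it is lower; neither assumption comes from the example above. The same data, 10 bad outcomes in 90 patients, yield different p-values depending on whether the number of patients or the number of outcomes was fixed in advance.</p>
<pre class="r"><code># Illustrative assumption (not from the text): H0 is P(bad outcome) = 0.2,
# with a one-sided alternative that the true probability is lower.
p0     <- 0.2
n      <- 90   # patients observed
events <- 10   # bad outcomes observed

# Design A: n = 90 fixed in advance; the number of events is random.
# p-value = P(10 or fewer events among 90 patients | H0)
p_fixed_n <- pbinom(events, size = n, prob = p0)

# Design B: sample until 10 events occur; the number of patients is random.
# Needing 90 or more patients means at most 9 events among the first 89.
# p-value = P(90 or more patients needed | H0)
p_fixed_events <- pbinom(events - 1, size = n - 1, prob = p0)

c(fixed_n = p_fixed_n, fixed_events = p_fixed_events)</code></pre>
<p>The two tail probabilities differ even though the data are identical, because each design defines “more extreme” over a different set of outcomes that were never observed.</p>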
<h1 id="specificationofthestatisticalmodel">Specification of the Statistical Model</h1>
<p>The statistical model is the prior belief about the <em>structure</em> of the problem. Outside of mathematics and physics, its choice is all too arbitrary, and statistical results and their interpretation depend on this choice. This applies equally to Bayesian and frequentist inference. The model choice has more impact than the choice of a prior distribution; the model choice does not “wear off” nearly as much as the prior does as the sample size gets large.</p>
<p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5122713" target="_blank">Investigator degrees of freedom</a> greatly affect the reliability and generalizability of scientific findings. This applies to measurement, experimental design, choice of variables, the statistical model, and other facets of the research process. Turning to just the statistical model, there are aspects of modeling about which we continually delude ourselves in such a way as to have false confidence in results. This happens primarily in two ways. Either the investigator “plays with the data” to try different models and uses only the apparently best-fitting one, making confidence and credible intervals too narrow and p-values and standard errors too small, or she selects a model a priori and hopes that it fits “well enough”. The latter occurs even in confirmatory studies with rigorous prespecification of the analysis. Whenever we use a model that makes an assumption about data structure, including assumptions about linearity, interactions, which variables to include, the shape of the distribution of the response given covariates, constancy of variance, etc., the inference is conditional on all those assumptions being true. The Bayesian approach provides an out: make all the assumptions you want, but allow for departures from those assumptions. If the model contains a parameter for everything we know we don’t know (e.g., a parameter for the ratio of variances in a two-sample t-test), the resulting posterior distribution for the parameter of interest will be flatter, credible intervals wider, and confidence intervals wider. This makes them more likely to lead to the correct interpretation, and makes the result more likely to be reproducible.</p>
<p>Consider departures from the normality assumption.
<a href="http://onlinelibrary.wiley.com/book/10.1002/9781118033197" target="_blank">Box and Tiao</a> show how to elegantly allow for non-normality in a Bayesian two-sample t-test. This is done by allowing the data distribution to have a kurtosis (tail heaviness) that is different from what the normal curve allows. They place a prior distribution on the kurtosis parameter favoring normality, but as the sample size increases, less and less normality is assumed. When the data indicate that the tails are heavier than Gaussian, they showed that the resulting point estimates of the two means are very similar to trimmed means. In the same way a prior distribution for the ratio of variances that may favor 1.0, a prior for the degree of interaction between treatment and a baseline variable that favors no interaction, and a prior for the degree of nonlinearity in an age effect that favors linearity for very small sample sizes could all be specified. The posterior distribution for the main parameter of interest will reflect all of these uncertainties in an honest fashion. This is related to penalized maximum likelihood estimation or shrinkage in the frequentist domain<sup class="footnoteref" id="fnref:Thefrequentist"><a rel="footnote" href="#fn:Thefrequentist">3</a></sup>.</p>
<p>Besides the ability to handle more uncertainties, the Bayesian paradigm provides <a href="http://fharrell.com/post/bayesfreqstmts">direct evidentiary statements</a> such as the probability that the treatment reduces blood pressure. This is in contrast with the frequentist paradigm, which results in a probability of getting observed effects greater than what we observed were the true effect exactly zero, the model correct, and the same experiment (other than changing H<sub>0</sub> to be true) were to be repeated indefinitely<sup class="footnoteref" id="fnref:Thisimpliestha"><a rel="footnote" href="#fn:Thisimpliestha">4</a></sup>.</p>
<h1 id="usingstatisticalmodelstoquantifyevidence">Using Statistical Models to Quantify Evidence</h1>
<p>Since the model choice is subjective, if we want our quantified evidence for effects to be accurate and not overstated, we should use Bayesian models acknowledging what we don’t know. Nate Silver in <a href="https://www.amazon.com/SignalNoiseManyPredictionsFailbut/dp/0143125087" target="_blank">The Signal and the Noise</a> eloquently wrote in detail about a different example of “what we don’t know”, related to causal inference from observational data. In his description of the controversy about cigarette smoking and lung cancer he pointed out that many people believed Ronald Fisher when Fisher said that since one can’t randomize cigarette exposure there is no way to draw trustworthy inference; therefore we should draw no conclusions (Fisher also had a significant conflict of interest, as he was a consultant to the tobacco industry). Silver showed that even an extremely skeptical prior distribution about the effect of cigarette smoking on lung cancer would be overridden by the data. Because only the Bayesian approach allows insertion of skepticism at precisely the right point in the logic flow, one can think of a full Bayesian solution (prior + model) as a way to “get the model right”, taking the design and context into account, to obtain reliable scientific evidence. Note that possible failure to have all confounders measured can be somewhat absorbed into the skeptical prior distribution with a Bayesian approach.</p>
<p>In some tightly-controlled experiments, the statistical model is somewhat less relevant.</p>
<h1 id="replicationandvalidation">Replication and Validation</h1>
<p>Finally, turn to the complex issue of replication and/or validation. First of all, it is important to know what replication is <em>not</em>. <a href="http://fharrell.com/post/splitval">This article</a> discusses split-sample validation as a data-wasting form of <em>internal validation</em>. It does not demonstrate that other investigators with different measurement and survey techniques, different data cleaning procedures, and different subtle ways to “cheat” would arrive at the same answer. It turns geographical differences and time trends into surprises rather than useful covariates. A far better form of internal validation is the bootstrap, which has the added advantage (and burden) of requiring the researcher to completely specify all analytic steps. Now contrast internal validation with true independent replication. The latter has the advantages of validating the following:</p>
<ol>
<li>the investigators and their hidden biases</li>
<li>the specificity of the statistical analysis plan</li>
<li>the technologies on which measurements are based (e.g., gene or protein expression)</li>
<li>the survey techniques including how subjects are interviewed (with respect to leading questions, etc.)</li>
<li>subject inclusion/exclusion criteria</li>
<li>subtle decisions that biased estimates such as treatment effects (e.g., deleting outliers, avoiding blinding and blinded data correction, remeasuring something when its value is suspect, etc.)</li>
<li>other systemic biases that one suspects would be different for different research teams</li>
</ol>
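<p>For contrast with independent replication, the bootstrap form of internal validation mentioned above can be sketched in base R. This is a minimal illustration of the optimism bootstrap under stated assumptions: the data are simulated, the model is an ordinary linear model, and accuracy is measured by <span class="math inline">\(R^2\)</span>. The names <code>fit_model</code> and <code>rsq</code> are hypothetical helpers, not functions from any package.</p>
<pre class="r"><code># Simulated data and a linear model are assumptions for illustration only.
set.seed(1)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 0.5 * d$x1 + rnorm(n)

# Every analytic step must live inside this function so that it is
# repeated afresh in each bootstrap resample.
fit_model <- function(dat) lm(y ~ x1 + x2 + x3, data = dat)

# Performance measure: R^2 of predictions against observed y
rsq <- function(fit, dat) {
  pred <- predict(fit, newdata = dat)
  1 - sum((dat$y - pred)^2) / sum((dat$y - mean(dat$y))^2)
}

apparent <- rsq(fit_model(d), d)   # accuracy on the data used for fitting

B <- 200
optimism <- replicate(B, {
  boot <- d[sample(n, replace = TRUE), ]
  f    <- fit_model(boot)
  rsq(f, boot) - rsq(f, d)   # accuracy in the resample minus in the original data
})

corrected <- apparent - mean(optimism)
c(apparent = apparent, optimism = mean(optimism), corrected = corrected)</code></pre>
<p>Any data-driven step left outside <code>fit_model</code> (variable selection, transformation choices, outlier deletion) would escape the correction, which is why the bootstrap forces the analysis to be fully specified.</p>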
<p>When is an independent replication or model validation warranted? This is difficult to say, but is related to the potential impact of the result, on subjects and on future researchers.</p>
<p>The quickest and cheapest form of partial validation is to validate the investigators and code, in the following sense. Have the investigators provide the prespecified data manipulation (including computation of derived variables) and statistical analysis or machine learning plan, along with the raw data, to an independent team. The independent team executes the data manipulation and analysis plan and compares the results to the results obtained by the original team. Ideally the independent researchers would run the original code on their systems and also do some independent coding. This process verifies code, computations, and specificity of the analysis plan and verifies that once the paper is published others will also be able to replicate the findings. This approach would have entirely prevented the <a href="https://en.wikipedia.org/wiki/Anil_Potti" target="_blank">Duke University Potti scandal</a> had the cancer biomarker investigators at Duke been interested in collaborating with an outside team.</p>
<p>If rigorous internal validation or attempted duplication of results by outsiders fails, there is no need to undertake an acquisition of new independent data to validate the original approach<sup class="footnoteref" id="fnref:Thisespecially"><a rel="footnote" href="#fn:Thisespecially">5</a></sup>.</p>
<h1 id="stepstoenhancethescientificprocess">Steps to Enhance the Scientific Process</h1>
<ol>
<li>Choose an experimental design that is appropriate for the question of interest, taking into account whether association or causation is central to the question</li>
<li>Choose the right measurements and measure them accurately or at least without a systemic bias favoring your viewpoint</li>
<li>Understand sources of variability and incorporate those into the design and the model</li>
<li>Formulate prior distributions for effect parameters that are informed by the subject matter and other reliable data. Even if you use p-values this process will benefit the research.</li>
<li>Formulate a data model that is informed by subject matter, knowledge about the measurements, and experience with similar data</li>
<li>Add parameters to the model for what you don’t know, putting priors on those parameters so as to favor your favorite model (e.g., normal distribution with equal variances for the t-test; absence of interactions) but not rule out departures from it. If using a frequentist approach, parameters must be “all in”, which will make confidence intervals honest but <a href="https://www.ncbi.nlm.nih.gov/pubmed/9192445" target="_blank">wider than Bayesian credible intervals</a>.</li>
<li>Independently validate code and calculations while verifying the specificity of the statistical analysis or machine learning plan</li>
<li>In many situations, especially when large scale policies are at stake, independently replicate the findings from scratch before believing them</li>
</ol>
<hr />
<h2 id="someusefulreferences">Some Useful References</h2>
<ul>
<li><a href="https://arxiv.org/abs/1511.05219" target="_blank">How much does your data exploration overfit? Controlling bias via information usage</a> by D Russo and J Zou</li>
</ul>
<hr />
<p>This article benefited from many thought-provoking discussions with Bert Gunter, who believes that replication and initial exploratory analysis are even more important than I do. Chris Tong also provided valuable ideas. Misconceptions are solely mine.</p>
<p>Footnotes:</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Whenmanyvariab">When many variables are involved, a statistical model is often the best descriptive tool, even when it’s not used for inference. <a class="footnotereturn" href="#fnref:Whenmanyvariab"><sup>^</sup></a></li>
<li id="fn:Everystatistica">Every statistical test is using a model. For example, the Wilcoxon two-sample test is a special case of the proportional odds model and requires the proportional odds assumption to hold to achieve maximum power. <a class="footnotereturn" href="#fnref:Everystatistica"><sup>^</sup></a></li>
<li id="fn:Thefrequentist">The frequentist paradigm does not provide confidence intervals or p-values when parameters are penalized. <a class="footnotereturn" href="#fnref:Thefrequentist"><sup>^</sup></a></li>
<li id="fn:Thisimpliestha">This implies that the exact experimental design that is <strong>in effect</strong> is known so that the p-value can be computed by rerunning that exact design indefinitely often to compute the probability of finding a larger effect in those repeated experiments than the effect originally observed. <a class="footnotereturn" href="#fnref:Thisimpliestha"><sup>^</sup></a></li>
<li id="fn:Thisespecially">This especially pertains to prediction, and is less applicable to randomized trials. <a class="footnotereturn" href="#fnref:Thisespecially"><sup>^</sup></a></li>
</ol>
</div>

Is Medicine Mesmerized by Machine Learning?
http://fharrell.com/post/medml/
Thu, 01 Feb 2018 00:00:00 +0000
http://fharrell.com/post/medml/
<p>BD Horne et al wrote an important paper <a href="http://www.amjmed.com/article/S00029343(09)00103X/pdf" target="_blank">Exceptional mortality prediction by risk scores from common laboratory tests</a> that apparently garnered little attention, perhaps because it used older technology: standard clinical lab tests and logistic regression. Yet even putting themselves at a significant predictive disadvantage by binning all the continuous lab values into fifths, the authors were able to achieve a validated c-index (AUROC) of 0.87 in predicting death within 30d in a mixed inpatient, outpatient, and emergency department patient population. Their model also predicted 1y and 5y mortality very well, and performed well in a completely independent NHANES cohort<sup class="footnoteref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. It also performed very well when evaluated just in outpatients, a group with very low mortality.</p>
<p>The above model, called by the authors the Intermountain Risk Score, used the following predictors: age, sex, hematocrit, hemoglobin, red cell distribution width, mean corpuscular volume, red blood cell count, platelet count, mean platelet volume, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, total white blood count, sodium, potassium, chloride, bicarbonate, calcium, glucose, creatinine, and BUN<sup class="footnoteref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. The model is objective, transparent, and needs only one-time and not historical information. It did not need the EHR (other than to get age and sex) but rather used the clinical lab data system. How predicted risks are arrived at is obvious, i.e., a physician can easily see which patient factors were contributing to overall risk of mortality. The predictive factors are measured at obvious times. One can be certain that the model did not use information it shouldn’t, such as the use of certain treatments and procedures that may create a kind of circularity with death. It is important to note however that inter-lab variation has created challenges in analyzing lab data from multiple health systems.</p>
<p>Contrast the above under-hyped approach with machine learning (ML). Consider Avati et al’s paper <a href="https://arxiv.org/abs/1711.06402" target="_blank">Improving palliative care with deep learning</a> which was publicized <a href="https://spectrum.ieee.org/thehumanos/biomedical/diagnostics/stanfordsaipredictsdeathforbetterendoflifecare" target="_blank">here</a>. The Avati paper addresses an important area and is well motivated. Palliative care (e.g., hospice) is often sought at the wrong time and relies on individual physician referrals. An automatic screening method may yield a list of candidate patients near end of life who should be evaluated by a physician for the possibility of recommending palliative rather than curative care. A method designed to screen for such patients needs to be able to estimate either mortality risk or life expectancy accurately.</p>
<p>Avati et al’s analysis used a year’s worth of prior data on each patient and was based on 13,654 candidate features from the EHR. As with any retrospective study not based on an inception cohort with a well-defined “time zero”, it is tricky to define a time zero and somewhat easy to have survival bias and other sampling biases sneak into the analysis. The ML algorithm, in order to use a binary outcome, required division of patients into “positive” and “negative” cases, something not required by regression models for time until an event<sup class="footnoteref" id="fnref:Thereexistneur"><a rel="footnote" href="#fn:Thereexistneur">3</a></sup>. “Positive” cases must have at least 12 months of previous data in the health system, weeding out patients who died quickly. “Negative” cases must have been alive for at least 12 months from the <em>prediction date</em>. It is also not clear how variable censoring times were handled. In a standard statistical model, patients entering the system just before the data analysis have short follow-up and are right-censored early, but still contribute some information.</p>
<p>Avati et al used deep learning on the 13,654 features to achieve a validated c-index of 0.93. To the authors’ credit, they constructed an unbiased calibration curve, although it used binning and is very low resolution. Like many applications of ML where few statistical principles are incorporated into the algorithm, the result is a failure to make accurate predictions on the absolute risk scale. The calibration curve is far from the line of identity as shown below.</p>
<figure >
<img src="http://fharrell.com/img/ava17impCal.png" width="60%" />
</figure>
<p>The authors interpreted the above figure as “reasonably calibrated.” It is not. For example, a patient with a predicted probability of 0.2 had an actual risk < 0.1. The gain in c-index from ML over simpler approaches has been more than offset by worse calibration accuracy than the other approaches achieved.</p>
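<p>A higher-resolution alternative to a binned calibration curve can be sketched in base R as follows. The data are simulated under the assumption that a model overstates the odds of the event by a constant factor; nothing here uses the Avati et al data. The curve is estimated by smoothing the binary outcome against the predicted probability with <code>lowess</code>, which avoids binning altogether.</p>
<pre class="r"><code># Simulated predictions that overstate risk; all numbers are illustrative.
set.seed(2)
n     <- 20000
phat  <- runif(n, 0.01, 0.6)        # predicted probabilities
ptrue <- plogis(qlogis(phat) - 1)   # true risks: predicted odds are too high by a factor of e
y     <- rbinom(n, 1, ptrue)        # observed binary outcomes

# Smooth, unbinned calibration curve: nonparametric regression of the
# outcome on the predicted probability (iter = 0 turns off the outlier
# down-weighting, which is not appropriate for a 0/1 response).
cal <- lowess(phat, y, iter = 0)

# Estimated actual risk at selected predicted risks
at  <- c(0.1, 0.2, 0.4)
est <- approx(cal$x, cal$y, xout = at)$y
round(cbind(predicted = at, actual = est), 3)</code></pre>
<p>A well calibrated model would show the <code>actual</code> column close to the <code>predicted</code> column; in this simulation the estimated actual risks fall well below the predictions across the range.</p>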
<p>Importantly, some of the hype over ML comes from journals and professional societies and not so much from the researchers themselves. That is the case for the Avati et al deep learning algorithm, which is not actually being used in production mode at Stanford. A much better calibrated and somewhat more statistically-based algorithm is currently being used.</p>
<p>Like many ML algorithms, the focus is on development of “classifiers”. As detailed <a href="http://fharrell.com/post/classification/" target="_blank">here</a>, classifiers are far from optimal in medical decision support where decisions are not to be made in a paper but only once utilities/costs are known. Utilities and costs only become known during the physician/patient interaction. Unlike statistical models which directly estimate risk or life expectancy, the majority of ML algorithms start by using classification, then if a probability is needed they try to convert the patterns into a probability (this is sometimes called a “probability machine”). As judged by Avati et al’s calibration plot, this conversion may not be reliable.</p>
<p>Besides showing us what is needed and consistent with forward prediction (the calibration plot), Avati et al also reported a number of problematic measures. As detailed <a href="http://fharrell.com/post/classdamage/" target="_blank">here</a>, the use of improper probability accuracy scoring rules is very common in the ML world, because of the hope that one can actually make a decision (classification) using the data without needing to incorporate costs of incorrect decisions (utilities). Improper accuracy scores have a number of problems, such as</p>
<ul>
<li>reversing information flow, i.e., conditioning on outcomes and examining tendencies of inputs</li>
<li>inviting dichotomization of inputs</li>
<li>being optimized by choosing the wrong features and giving them the wrong weights</li>
</ul>
<p>Proportion classified correctly, sensitivity, specificity, precision, and recall are all improper accuracy scoring rules and should not play a role in a forward prediction mode when risk or life expectancy estimation are the real goals. A poker player wins consistently because she is able to estimate the probability she will ultimately win with her current hand, not because she recalls how often she’s had such a hand when she won.</p>
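<p>One symptom of the problem can be shown with a small base R sketch. The outcomes and the two sets of predicted risks below are hypothetical. Proportion classified correctly at an arbitrary 0.5 cutoff cannot tell the two sets of predictions apart, whereas the Brier score, a proper scoring rule, rewards the predictions that are closer to what actually happened.</p>
<pre class="r"><code># Hypothetical outcomes and two sets of predicted risks (illustrative only)
y  <- c(0, 0, 0, 1, 1)
pa <- c(0.10, 0.20, 0.30, 0.70, 0.90)   # confident and well separated
pb <- c(0.45, 0.45, 0.45, 0.55, 0.55)   # barely informative

# Proportion classified correctly at an arbitrary 0.5 cutoff
accuracy <- function(p, y, cut = 0.5) mean((p > cut) == (y == 1))
# Brier score: mean squared error of the predicted probabilities (lower is better)
brier    <- function(p, y) mean((p - y)^2)

rbind(accuracy = c(a = accuracy(pa, y), b = accuracy(pb, y)),
      brier    = c(a = brier(pa, y),    b = brier(pb, y)))</code></pre>
<p>Both sets of predictions classify every case correctly, so the cutoff-based measure is blind to the large difference in how informative the predicted risks are.</p>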
<p>One additional point: the ML deep learning algorithm is a black box, not provided by Avati et al, and apparently not usable by others. And the algorithm is so complex (especially with its extreme usage of procedure codes) that one can’t be certain that it didn’t use proxies for private insurance coverage, raising a possible ethics flag. In general, any bias that exists in the health system may be represented in the EHR, and an EHR-wide ML algorithm has a chance of perpetuating that bias in future medical decisions. On a separate note, I would favor using comprehensive comorbidity indexes and severity of disease measures over doing a free-range exploration of ICD-9 codes.</p>
<p>It may also be useful to contrast the ML approach with another carefully designed traditional and transparent statistical approach used in the <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.15325415.2000.tb03126.x/full" target="_blank">HELP study</a> of JM Teno, FE Harrell, et al. A validated parametric survival model was turned into an easy-to-use nomogram for obtaining a variety of predictions on older hospitalized adults:</p>
<figure >
<img src="http://fharrell.com/img/HELPnomogram.png" alt="Nomogram for obtaining predicted 1- and 2-year survival probabilities and the 10th, 25th, 50th, 75th, and 90th percentiles of survival time (in months) for individual patients in HELP. Disease class abbreviations: a=ARF/MOSF/Coma, b=all others, c=CHF, d=Cancer, e=Orthopedic. To use the nomogram, place a ruler vertically such that it touches the appropriate value on the axis for each predictor. Read off where the ruler intersects the 'Points' axis at the top of the diagram. Do this for each predictor, making a listing of the points. Add up all these points and locate this value on the 'Total Points' axis with a vertical ruler. Follow the ruler down and read off any of the predicted values of interest. APS is the APACHE III Acute Physiology Score." width="95%" />
<figcaption>
<p>
Nomogram for obtaining predicted 1- and 2-year survival probabilities and the 10th, 25th, 50th, 75th, and 90th percentiles of survival time (in months) for individual patients in HELP. Disease class abbreviations: a=ARF/MOSF/Coma, b=all others, c=CHF, d=Cancer, e=Orthopedic. To use the nomogram, place a ruler vertically such that it touches the appropriate value on the axis for each predictor. Read off where the ruler intersects the 'Points' axis at the top of the diagram. Do this for each predictor, making a listing of the points. Add up all these points and locate this value on the 'Total Points' axis with a vertical ruler. Follow the ruler down and read off any of the predicted values of interest. APS is the APACHE III Acute Physiology Score.
</p>
</figcaption>
</figure>
<p>Importantly, patients’ actual preferences for care were also studied in HELP. A different validated prognostic tool for end-of-life decision making, derived primarily from ICU patients, is the <a href="http://annals.org/aim/articleabstract/708396/supportprognosticmodelobjectiveestimatessurvivalseriouslyillhospitalizedadults" target="_blank">SUPPORT prognostic model</a>.</p>
<p>In the rush to use ML and large EHR databases to accelerate learning from data, researchers often forget about the advantages of statistical models and of using more compact, cleaner, and better defined data. They also sometimes forget how to measure absolute predictive accuracy, or that utilities must be incorporated to make optimum decisions. Utilities are applied to predicted risks; classifiers are at odds with optimum decision making and with incorporating utilities at the appropriate time, which is usually at the last minute just before the medical decision is made and not when a classifier is being built.</p>
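<p>The idea that utilities are applied to predicted risks at decision time can be sketched as an expected-loss rule in base R. The function name <code>decide</code> and all cost values are hypothetical, and the two-action setup is a deliberate simplification: the action is taken when the expected loss from acting unnecessarily is smaller than the expected loss from failing to act.</p>
<pre class="r"><code># Expected-loss decision rule; the cost values below are hypothetical and
# would in practice come from the physician/patient discussion.
decide <- function(risk, cost_fp, cost_fn) {
  loss_act  <- (1 - risk) * cost_fp   # acting when the event would not occur
  loss_wait <- risk * cost_fn         # not acting when the event does occur
  ifelse(loss_act < loss_wait, "act", "wait")
}

risk <- 0.15   # predicted risk from a model, estimated once

# The same predicted risk leads to different decisions for two patients
# whose costs differ (implied risk thresholds 1/(1+9) = 0.10 and 1/(1+2) = 0.33).
decide(risk, cost_fp = 1, cost_fn = 9)
decide(risk, cost_fp = 1, cost_fn = 2)</code></pre>
<p>A classifier with a cutoff fixed at model-building time cannot reproduce both decisions, since the implied threshold <code>cost_fp / (cost_fp + cost_fn)</code> changes from patient to patient.</p>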
<hr />
<h2 id="referencesguidelinesforreportingpredictivemodels">References: Guidelines for Reporting Predictive Models</h2>
<ul>
<li><a href="http://annals.org/aim/fullarticle/2088549/transparentreportingmultivariablepredictionmodelindividualprognosisdiagnosistripodtripod" target="_blank">TRIPOD Statement</a></li>
<li><a href="http://annals.org/aim/fullarticle/2088542/transparentreportingmultivariablepredictionmodelindividualprognosisdiagnosistripodexplanation" target="_blank">TRIPOD Explanation and Elaboration</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5238707" target="_blank">Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research</a></li>
</ul>
<h2 id="otherrelevantarticles">Other Relevant Articles</h2>
<ul>
<li><a href="https://jamanetwork.com/journals/jama/fullarticle/2675024" target="_blank">Big Data and Machine Learning in Health Care</a></li>
<li><a href="https://publications.parliament.uk/pa/ld201719/ldselect/ldai/100/10002.htm" target="_blank">UK Parliament AI Report</a></li>
<li><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194889" target="_blank">Statistical and Machine Learning Forecasting Methods: Concerns and Ways Forward</a> by S Makridakis, E Spiliotis, V Assimakopoulos
<ul>
<li>Excellent discussion of overfitting, measuring accuracy, and lack of rigor in published machine learning studies in financial time series forecasting. Simple statistical methods outperformed complex machine learning algorithms. Previous researchers refused to share data.</li>
</ul></li>
<li><a href="https://jamanetwork.com/journals/jama/articleabstract/2683125" target="_blank">Big Data and Predictive Analytics: Recalibrating Expectations</a> by ND Shah, EW Steyerberg, DM Kent</li>
<li><a href="https://medium.com/@jrzech/whatareradiologicaldeeplearningmodelsactuallylearningf97a546c5b98" target="_blank">What are radiological deep learning models actually learning?</a> by John Zech</li>
<li><a href="https://www.bmj.com/content/361/bmj.k1479" target="_blank">Biases in electronic health record data due to processes within the healthcare system: retrospective observational study</a> by Denis Agniel et al.
<ul>
<li>Data about timing of medical test ordering was more predictive of survival than the actual test results</li>
</ul></li>
<li><a href="https://onlinelibrary.wiley.com/toc/15214036/2014/56/4" target="_blank">Special issue on probability estimation and machine learning</a> of Biometrical Journal, including discussion articles comparing ML and SM</li>
<li><a href="https://jamanetwork.com/journals/jama/fullarticle/2718456?guestAccessKey=353313c067cc4b8f9df7f516a12eacc7&utm_source=silverchair&utm_medium=email&utm_campaign=article_alertjama&utm_content=olf&utm_term=121018" target="_blank">Questions for artificial intelligence in health care</a> by Maddox et al</li>
<li><a href="https://www.nature.com/articles/s4159101803007" target="_blank">High-performance medicine: the convergence of human and artificial intelligence</a> by Eric Topol</li>
</ul>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">The authors failed to present a high-resolution validated calibration to demonstrate the absolute predictive accuracy of the model. They also needlessly dealt with sensitivity and specificity.
<a class="footnotereturn" href="#fnref:1"><sup>^</sup></a></li>
<li id="fn:2">Hemoglobin, red blood count, mean corpuscular hemoglobin, chloride, and BUN were excluded because their information was redundant once all the other predictors were known.
<a class="footnotereturn" href="#fnref:2"><sup>^</sup></a></li>
<li id="fn:Thereexistneur">There exist neural network algorithms for censored time-to-event data. <a class="footnotereturn" href="#fnref:Thereexistneur"><sup>^</sup></a></li>
</ol>
</div>

Information Gain From Using Ordinal Instead of Binary Outcomes
http://fharrell.com/post/ordinalinfo/
Sun, 28 Jan 2018 00:00:00 +0000
http://fharrell.com/post/ordinalinfo/
<p>As discussed in <a href="http://hbiostat.org/doc/bbr.pdf#nameddest=sec:overviewychoice">BBR Section 3.5</a>, a binary dependent variable Y has minimum statistical information, giving rise to minimal statistical power and precision. This can easily be demonstrated by power or sample size calculations. Consider a pain outcome as an example. Instead of having as an outcome the presence or absence of pain, one can significantly increase power by having several levels of pain severity with the lowest level representing “none”; the more levels the better.</p>
<p>The point about the increase in power can also be made by, instead of varying the effect size, varying the effect that can be detected with a fixed power of 0.9 when the degree of granularity in Y is increased. This is all about breaking ties in Y. The more ties there are, the less statistical information is present. Why is this important in study planning? Here’s an all–too–commmon example. A study is designed to compare the fraction of “clinical responders” between two treatments. The investigator knows that the power of a binary endpoint is limited, and has a fixed budget. So she chooses a more impressive effect size for the power calculation—one that is more than clinically relevant. After the data are in, she finds an apparent clinically relevant improvement due to one of the treatments, but because the study was sized only to detect a superclinical improvement, the pvalue is large and the confidence interval for the effect is wide. Little new knowledge is gained from the study except for how to spend money.</p>
<p>Consider a two-group comparison, with an equal sample size per group. Suppose we want to detect an odds ratio of 0.5 (OR=1.0 means no group effect) for binary Y. Suppose that the probability that Y=1 in the control group is 0.2. The required sample size is computed below.</p>
<pre class="r"><code>require(Hmisc)</code></pre>
<pre class="r"><code>knitrSet(lang='blogdown')
dor <- 0.5      # OR to detect
tpower <- 0.9   # target power
# Apply OR to p1=0.2 to get p2
p2 <- plogis(qlogis(0.2) + log(dor))
n1 <- round(bsamsize(0.2, p2, power=tpower)['n1'])
n <- 2 * n1</code></pre>
<p>The OR of 0.5 corresponds to an event probability of 0.111 in the second group, and the number of subjects required per group is 347 to achieve a power of 0.9 of detecting OR=0.5.</p>
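<p>As a rough cross-check that does not depend on <code>Hmisc</code>, the same calculation can be sketched with base R's <code>power.prop.test</code>. This is only an illustration under the same assumptions as above (two-sided α of 0.05, the default of both functions); the variable names <code>p_ctrl</code>, <code>p_trt</code>, and <code>n_per_group</code> are mine, and small differences from <code>bsamsize</code> are possible because the two functions need not use identical approximations.</p>
<pre class="r"><code># Control-group probability and odds ratio from the text
p_ctrl <- 0.2
or     <- 0.5
# Apply the OR on the odds scale to get the second group's probability
odds_trt <- or * p_ctrl / (1 - p_ctrl)
p_trt    <- odds_trt / (1 + odds_trt)
# Per-group sample size for power 0.9 (two-sided alpha = 0.05)
n_per_group <- power.prop.test(p1 = p_ctrl, p2 = p_trt, power = 0.9)$n
c(p_trt = p_trt, n_per_group = n_per_group)</code></pre>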
<p>Let’s now turn to using an ordinal response variable Y for our study. The proportional odds ordinal logistic model is the most widely used ordinal response model. It includes both the Wilcoxon-Mann-Whitney two-sample rank test and binary logistic regression as special cases.
If ties in Y could be broken, the proportional odds assumption satisfied, and the sample size per group were fixed at 347, what odds ratio would be detectable with the same power of 0.9?</p>
<p>Before proceeding, let’s see how close to 0.9 the power is when computed using proportional odds model machinery with binary Y. The vector of cell probabilities needed by the R <code>popower</code> function is the average of the cell probabilities over the two study groups. We write a front-end to <code>popower</code> that computes this average given the odds ratio and the cell probabilities for group 1.</p>
<pre class="r"><code>popow <- function(p, or, n) {
  # Compute cell probabilities for group 2 using Hmisc::pomodm
  p2 <- pomodm(p=p, odds.ratio=or)
  pavg <- (p + p2) / 2
  popower(pavg, odds.ratio=or, n=n)
}
z <- popow(c(0.8, 0.2), or=dor, n=2 * n1)
z</code></pre>
<pre><code>Power: 0.911
Efficiency of design compared with continuous response: 0.394 </code></pre>
<pre class="r"><code>binpopower <- z$power</code></pre>
<p>The approximation to the binary case isn’t perfect since the PO model method’s power is a little above 0.9. But it’s not bad.</p>
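<p>To make the averaging step concrete, here is a minimal base-R sketch of what the front-end needs from <code>pomodm</code>: shifting a vector of cell probabilities by an odds ratio under proportional odds. It assumes the odds ratio is applied to the exceedance probabilities P(Y ≥ j), which agrees with the binary example above; <code>shift_cells</code> is a hypothetical helper written for this illustration, not part of <code>Hmisc</code>.</p>
<pre class="r"><code>shift_cells <- function(p, or) {
  stopifnot(abs(sum(p) - 1) < 1e-8, or > 0)
  # Exceedance probabilities P(Y >= j) for j = 2, ..., k
  exceed <- rev(cumsum(rev(p)))[-1]
  # Multiply the odds of each exceedance probability by the odds ratio
  exceed2 <- plogis(qlogis(exceed) + log(or))
  # Difference the shifted exceedance probabilities back into cell probabilities
  -diff(c(1, exceed2, 0))
}
p1   <- c(0.8, 0.1, 0.1)
p2   <- shift_cells(p1, or = 0.5)
pavg <- (p1 + p2) / 2   # average cell probabilities over the two groups
round(rbind(p1, p2, pavg), 3)</code></pre>
<p>With two cells, <code>shift_cells(c(0.8, 0.2), 0.5)</code> reproduces the event probability of 0.111 quoted earlier.</p>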
<p>Let’s write an R function that given everything else computes the OR needed to achieve a given power and configuration of cell probabilities in the control group.</p>
<pre class="r"><code>g <- function(p, n=2 * n1, power=binpopower) {
  f <- function(or) popow(p, or=or, n = n)$power - power
  round(uniroot(f, c(dor - 0.1, 1))$root, 3)
}
# Check that we can recover the original detectable OR
g(c(0.8, 0.2))</code></pre>
<pre><code>[1] 0.5</code></pre>
<p>To break ties in Y we’ll try a number of configurations of the cell probabilities for the control group, and for each configuration compute the OR that can be detected with the same power as computed for the binary Y case using the PO model. We will mainly vary the number of levels of Y. For example, to compute the detectable effect size when the probability that Y=1 of 0.2 is divided into two values of Y with equal probability we use <code>g(c(0.8, 0.1, 0.1), n)</code>. Results are shown in the table below.</p>
<pre class="r"><code># Function to draw spike histograms of probabilities as html base64 insert
h <- function(p) tobase64image(pngNeedle(p, w=length(p)))</code></pre>
<table>
<thead>
<tr class="header">
<th>Distinct Y Values</th>
<th></th>
<th>Cell Probabilities</th>
<th>Detectable OR</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>2</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAUAAAASCAMAAACkYuT0AAAAIVBMVEUAAAAyMjJmZmaEhISPj4+UlJSZmZng4ODr6+vx8fH///8zi/NfAAAAL0lEQVQImWPgAAEGLhBgYOIEkQxsYJIFB8kMUg8lsalhZwSRrAwgkgVOMnMwMwAAhSUCJ+oP35EAAAAASUVORK5CYII=" alt="image" /></td>
<td>.8 .2</td>
<td>0.5</td>
</tr>
<tr class="even">
<td>2</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAUAAAASCAMAAACkYuT0AAAAFVBMVEUAAACEhISPj4+qqqrLy8vr6+v///9wEjDrAAAAKElEQVQImWNgBQEGNhAgQLIACRYGBmZWZgYGBkZWRiDJxMZEJAlSDwBXcgFHSfw5xwAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.5 .5</td>
<td>0.603</td>
</tr>
<tr class="odd">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAAMFBMVEUAAAAEBAQUFBReXl5nZ2dwcHBxcXF5eXmhoaGwsLCzs7Pk5OTr6+vy8vL39/f///9bJARqAAAAPUlEQVQImZXMuQHAIAzAQIUQPgPef1tw49Ci6irRPdTjFSf553fB6N+TN4fNObeGUUIQbY+xQtWEfQuUHllXyQfwiPT7LAAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.8 .2/2 x 2</td>
<td>0.501</td>
</tr>
<tr class="even">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAAMFBMVEUAAAAQEBAlJSUqKipnZ2dwcHBycnJ0dHShoaGvr6+ysrLBwcHNzc3q6urr6+v///8dAOhqAAAAQElEQVQImXXMNxIAIQwEwcF79P/fUkp0F8BGHWwN24bYHgzTSPmYrozW/fP+fXF5v2Q4ZYcuGWWDJgntVqg7cgCMUwgjOT+CdAAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.7 .3/2 x 2</td>
<td>0.562</td>
</tr>
<tr class="odd">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAALVBMVEUAAABDQ0NYWFhnZ2dwcHB2dnaMjIyhoaGvr6+2trbDw8PZ2dnr6+v29vb///9GxaMQAAAAP0lEQVQImZ3MOw7AIAwE0Qnmmxjf/7iIxhYFTaZ6xWpRD/P+8g1S/RcJ5hgcnClN+57NAcMKmx265StFGzQVFlGrB0OcmqrtAAAAAElFTkSuQmCC" alt="image" /></td>
<td>.5 .5/2 x 2</td>
<td>0.615</td>
</tr>
<tr class="even">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAAIVBMVEUAAAADAwNnZ2dwcHCYmJihoaGvr6/Dw8PV1dXr6+v///+HXnldAAAAOElEQVQIma2MMQoAIAzEUrXa3v8fbCdxFcxwBAJHHtDhg16/l65KETULhsIsNICuCVP9UVs6eDY2Z1gFMW1yd0UAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/3 x 3</td>
<td>0.629</td>
</tr>
<tr class="odd">
<td>4</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABEAAAASCAMAAACKJ8VmAAAAM1BMVEUAAAAGBgY+Pj5DQ0NNTU1XV1dYWFh1dXWKioqUlJS9vb3Ozs7n5+fr6+vw8PD4+Pj///9mhRhjAAAASUlEQVQYla3OORLAIAxD0Q9ZiHEMvv9pyVA6lKjSvEbCYvAYLo3C+ZNjk+T4ZyG7tpbSpX61Snd/nylCat4S4n4z/xRQUyhmmQGLQQ+0Y/BHOwAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.8 .2/3 x 3</td>
<td>0.502</td>
</tr>
<tr class="even">
<td>4</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABEAAAASCAMAAACKJ8VmAAAAM1BMVEUAAAA9PT0+Pj5DQ0NGRkZJSUl1dXV2dnZ/f3+FhYW+vr7Ozs7g4ODh4eHm5ubr6+v///8q6XZ8AAAAR0lEQVQYlb2OORKAMBSFyL7nv/ufVuvYOBZSMVRgJ+jk1/L8+XbourRiubXEJXUHSWqwtaFJCQhShakJVQovizfLMGxANvNcUU8Ptx5vX+wAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/4 x 4</td>
<td>0.638</td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAARVBMVEUAAAABAQEJCQkKCgoODg4WFhYlJSVkZGRtbW14eHh+fn5/f3+Pj4+ZmZmhoaG0tLS7u7vLy8vo6Ojr6+vw8PD8/Pz///8rrl2bAAAAXElEQVQYlbXPSw7AIAgE0NH+1KpVUe5/1JK4pguTzgIS3oIMSA1YzSKdt0rYddqWyKq9vmjt1y/UXZI5Qhiykusyn2tShGnMBSjMzSDK6cDs5YFKlIFMVAEvJ4sXJpIi9jGdA84AAAAASUVORK5CYII=" alt="image" /></td>
<td>0.7 .3/4 x 4</td>
<td>0.563</td>
</tr>
<tr class="even">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAAS1BMVEUAAAABAQEICAgJCQkKCgoLCwsUFBQrKytSUlJ4eHiEhIShoaGwsLDDw8PLy8vd3d3k5OTr6+vw8PDz8/P39/f6+vr7+/v+/v7///+cI/wTAAAAZ0lEQVQYlbXPNw7AMAxD0Z/ei1J5/5PGhmdlSjhw4AMECHOD3HxP2+1S1rpE5VPh/vVGpX/wH9qv2McR+9pj31uiKW9Oaa3rVTqbfApTmyXqYZFGGKUF+jBVpL86mM0GGMxm6MJU8AA+UiSVSXJMqwAAAABJRU5ErkJggg==" alt="image" /></td>
<td>0.6 .4/4 x 4</td>
<td>0.597</td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAASFBMVEUAAAADAwMJCQkKCgocHBwvLy94eHh/f3+CgoKJiYmUlJSZmZmhoaGvr6+wsLC/v7/BwcHHx8fLy8vc3Nzr6+vu7u739/f///9EU4FiAAAAYklEQVQYlc2PSQqAMBAEy32JS9TR/v9PTch5joJ9aJgqGGjMDXLzF7XfrqJzd1H7qvEffqKeactHCLm36Ul9DEXNVCZFiJJVzAm1FDXCJa2wSheMCTWUXT2cZgssZif0CdW8SMckNGs501MAAAAASUVORK5CYII=" alt="image" /></td>
<td>0.5 .5/4 x 4</td>
<td>0.618</td>
</tr>
<tr class="even">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAAS1BMVEUAAAAJCQkKCgoMDAwVFRUgICAlJSVBQUFoaGh4eHiBgYGCgoKFhYWhoaGvr6+ysrLAwMDLy8vOzs7c3Nze3t7q6urr6+v8/Pz////PcL6tAAAAYklEQVQYldXPOQ6AQAwEwV7u+8Y7/38pFsRLRsCEXZIlY8mh5H5AL391S5IokwfJvqCtqKN0VtUpxbrYPDX5QwMc0gyzdMDgyeWmHnZpgknaofeUOQV/oYXVbITRbIXWU+ACggkmJL9LJTEAAAAASUVORK5CYII=" alt="image" /></td>
<td>0.4 .6/4 x 4</td>
<td>0.631</td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAARVBMVEUAAAAJCQkKCgoODg4lJSUuLi4/Pz9aWlp4eHh/f3+CgoKHh4ehoaGvr6+zs7PLy8vZ2dnc3Nzg4ODr6+vw8PD+/v7///90efXjAAAAYUlEQVQYldWPOQ6AMBADhyMcCRCSJf7/U9kPQImEC1saV4M9Bj3mB9eL15fK4+p9hrlJ1zRdUpvD6WgdYfBNUKUDDqlCcjQAvW+EIu2wSwWio/796lxhgWy2wWaWYXHUcQPYJSJNV3+J0QAAAABJRU5ErkJggg==" alt="image" /></td>
<td>1/5 x 5</td>
<td>0.641</td>
</tr>
<tr class="even">
<td>6</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAACUAAAASCAMAAADrP+ckAAAAUVBMVEUAAAABAQECAgIEBAQGBgYICAiWlpaenp6jo6OlpaWrq6usrKyzs7O7u7vCwsLLy8vPz8/R0dHU1NTW1tbY2NjZ2dnb29ve3t7j4+Pr6+v///+oW0wbAAAAbklEQVQokeXQRw6AMAxE0R96Cb0kcP+DMvgEWSLhzbPk8Wa4UoY7ZX6RSuvrq63Op7kcxroZ22oci3HOuDJqmVwR3myW72LPs/ceCjeJWDpg1NbAIDrwwkMnBmjECMkpr0Iq6EUNrWihFj1UQj8PgfE8BYbAjVoAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/6 x 6</td>
<td>0.643</td>
</tr>
<tr class="odd">
<td>7</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAADIAAAASCAMAAAAuTX21AAAAY1BMVEUAAAAxMTE3Nzc+Pj5GRkZSUlJgYGBnZ2dwcHBxcXF0dHR6enqEhISJiYmOjo6ampqfn5+hoaGvr6+0tLS1tbXExMTJycnR0dHa2trd3d3f39/p6enr6+vy8vL5+fn+/v7////0uzH+AAAAgElEQVQoke3RSRKCQBBE0Y/MyiCTCAp4/1PK9wa41dy82mR1RBfPw+F1OP/K4coXd/nxU+bBqFtymnSNw4cuUbToI4xXnU7JpmOQA2fHGUq9wVU76PQKNy1h1jMmc7xDoQPU2kKrNQxawF2zTyX1s/ftF+2h0gYaraDXy/6apvAGzIpknXhcLgUAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/7 x 7</td>
<td>0.644</td>
</tr>
<tr class="even">
<td>10</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAGUAAAASCAMAAAB1heCEAAAAdVBMVEUAAAAHBwcICAgJCQknJycpKSkqKiorKystLS0uLi4vLy8yMjIzMzM3NzdycnJ1dXV2dnZ6enp9fX1+fn6BgYGGhoaKiorAwMDExMTFxcXIyMjJycnOzs7X19fb29vg4ODr6+vv7+/z8/P39/f5+fn7+/v////RKynaAAAAqUlEQVQ4je3Tyw6CMBRF0Y0WQRB5KiAvReX/P1EGB+Ym2hEdraZNdtL0crexmGysrbJV/l6xMy92KnZebKt8X3msm/e4cnyv/ME5U855ueQ7pXjbm6eYkkgvd1eJ1c59iQmp9DT7m1g6/pI5k8+VCE/7AQrxCp14IpB6uIgX6MWAk9TBVSxgED2ieSpDjH51C7GYQS0eOUgNpGIKjXjgKNWQiTG0oiG8fwCPVfnlhbkCxQAAAABJRU5ErkJggg==" alt="image" /></td>
<td>1/10 x 10</td>
<td>0.646</td>
</tr>
<tr class="odd">
<td>694</td>
<td></td>
<td>1/694 x 694</td>
<td>0.647</td>
</tr>
</tbody>
</table>
<p>The last row corresponds to analyzing a continuous variable with the Wilcoxon test with 347 observations in each of the two groups.</p>
<p>When high values of Y (e.g., Y=1 in the binary case) denote an event, and when the control group has a low probability of the event, splitting the high Y-level into multiple ordinal levels does not increase power very much. The real gain in power comes from splitting the more frequent non-event subjects into, for example, “no event” and “mild event”. The best power (detectable OR closer to 1.0) comes from having equal probabilities in the cells when averaged over treatment groups, and with at least 5 distinct Y values.</p>
<p>When designing a study, choose a maximum information dependent variable, and try not to have more than, say, 0.7 of the sample in any one category. But even if the proportion of non-events is large, it does not hurt to break ties among the events. In some cases it will even help, e.g., when the treatment has a larger effect on the more severe events.</p>
<hr />
<p>The first few lines of <code>Rmarkdown knitr</code> markup used to produce the above table are given below.</p>
<pre><code>
|Distinct Y Values| |Cell Probabilities|Detectable OR|
|--|--|--|--|
| 2 | `r h(c(.8, .2))` | .8 .2 | `r g(c(.8, .2))` |
| 2 | `r h(c(.5, .5))` | .5 .5 | `r g(c(.5, .5))` |</code></pre>
<hr />
<div id="furtherreading" class="section level2">
<h2>Further Reading</h2>
<ul>
<li><a href="https://ccforum.biomedcentral.com/track/pdf/10.1186/cc10240">The added value of ordinal analysis in clinical trials: an example in traumatic brain injury</a> by B Roozenbeek et al.</li>
</ul>
</div>

Why I Don't Like Percents
http://fharrell.com/post/percent/
Fri, 19 Jan 2018 00:00:00 +0000
http://fharrell.com/post/percent/
<p>The numbers zero and one are special; zero because it is a minimum or center point for many measurements and because it is the addition identity constant (x + 0 = x), and one because it is the multiplication identity constant (x × 1 = x) and corresponds to units of measurements. Many important quantities are between 0 and 1, including proportions of a whole and probabilities. One hundred is not special in the same sense as unity, so percent (per 100) doesn’t do anything for me (why not per thousand?).
<style>
img {
height: auto;
max-width: 70px;
margin-left: auto;
margin-right: auto;
display: block;
}
</style></p>
<p>When a quantity doubles, it gets back to its original value by halving. When it increases by 100% it gets back to its original value by decreasing 50%. Case almost closed. Whereas an increase of 33.33% is balanced by a decrease of 25%, an increase by a factor of <sup>4</sup>⁄<sub>3</sub> is balanced by a decrease to a factor of <sup>3</sup>⁄<sub>4</sub>. If you put 100 dollars into an account that yields 3% interest annually, you will have 100 * (1.03<sup>10</sup>) or 134 dollars after 10 years. To get back to your original value you’d have to lose 2.91% per year for 10 years.</p>
<p>I like fractions like <sup>3</sup>⁄<sub>4</sub>, or the decimal equivalent 0.75. I like ratios, because they are symmetric. Chaining together relative increases is simple with ratios. An increase by a factor of 1.5 followed by an increase by a factor of 1.4 is an increase by a factor of 1.5 * 1.4 or 2.1. A 50% increase followed by a 40% increase is an increase of 110%. To get the right answer with percent increase you have to convert back to ratios, do the multiplication, then convert back to percent.</p>
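<p>The arithmetic in the last two paragraphs is easy to check in R. The following sketch only restates the numbers already given above; the variable names are mine.</p>
<pre class="r"><code># Compound growth: 3% per year for 10 years on 100 dollars
balance <- 100 * 1.03^10
# Annual proportional loss needed to undo one year of 3% growth
annual_loss <- 1 - 1 / 1.03
# Chaining relative increases: multiply the ratios,
# then convert to percent only if you must
ratio_total   <- 1.5 * 1.4
percent_total <- 100 * (ratio_total - 1)
round(c(balance = balance, annual_loss = annual_loss,
        ratio_total = ratio_total, percent_total = percent_total), 4)</code></pre>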
<p>Many numbers that we quote are probabilities, and a probability is formally a number between 0 and 1. So I don’t like “the chance of rain is 10%” but prefer “the chance of rain is 0.1 or <sup>1</sup>⁄<sub>10</sub>”. When discussing statistical analyses it is especially irksome to see statements such as “significance levels of 5% or power of 90%”. Probabilities are being discussed, so I prefer 0.05 and 0.9.</p>
<p>I have seen clinicians confused over statements such as “the chance of a stroke is 0.5%”, interpreting this as 50%. If we say “the chance of a stroke is 0.005” such confusion is less likely. And I don’t need percent signs everywhere.</p>
<p>Percent change has even more problems than percent. I have often witnessed confusion from statements such as “the chance of stroke increased by 50%”. If the base stroke probability was 0.02 does the speaker mean that it is now 0.52? Not very likely, but you can’t be sure. More likely she meant that the chance of stroke is now 0.02 + 0.5 * 0.02 = 0.03. It would always be clear to instead say one of the following:</p>
<ul>
<li>The chance of stroke went from 0.02 to 0.03</li>
<li>The chance of stroke increased by 0.01 (or the <em>absolute</em> chance of stroke increased by 0.01)</li>
<li>The chance of stroke increased by a factor of 1.5</li>
</ul>
<p>We need to achieve clarity by settling on a convention for wording fold-change decreases. If the chance of stroke decreases from 0.03 to 0.02 and we feel compelled to summarize the <em>relative</em> decrease in risk, we could say that risk of stroke decreased by a factor of 1.5. But even though it looks a bit awkward, I think it would be clearest to say the following, if 0.02 corresponded to treatment A and 0.03 corresponded to treatment B: treatment A multiplied the risk of stroke by <sup>2</sup>⁄<sub>3</sub> in comparison to treatment B. Or you could say that treatment A modified the risk of stroke by a factor of <sup>2</sup>⁄<sub>3</sub>, or that the A:B risk ratio is <sup>2</sup>⁄<sub>3</sub> or 0.667.</p>
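<p>As a small illustration of reporting the pieces separately, the sketch below summarizes a change in risk by its endpoints, its absolute difference, and its ratio. <code>describe_change</code> is a hypothetical helper written for this example.</p>
<pre class="r"><code>describe_change <- function(from, to) {
  c(from = from, to = to,
    difference = to - from,   # absolute change
    ratio      = to / from)   # fold change; a reversal gives the reciprocal
}
describe_change(0.02, 0.03)   # increase: ratio of 1.5
describe_change(0.03, 0.02)   # decrease: ratio of 2/3</code></pre>
<p>The two ratios are reciprocals of each other, which is the symmetry that percent changes lack.</p>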
<p>Many quantities reported in the scientific literature are naturally ratios. For example, odds ratios and hazard ratios are commonly used. If the ratio of stroke hazard rates for treatment B compared to treatment A is 0.75, I prefer to report “the B:A stroke hazard ratio was 0.75.” There’s no need to say that there was a 25% reduction in stroke hazard rate.</p>
<p>Percents have perhaps one good use. When they represent fractions and we don’t care to present but two decimal places of accuracy, i.e., the percents you calculate are all whole numbers, percents may be OK. But I would still prefer numbers like 0.02, 0.86 and to avoid a symbol (%) when just dealing with numbers.</p>
<h2 id="linkstootherresources">Links to Other Resources</h2>
<ul>
<li><a href="https://www.bmj.com/content/358/bmj.j3663" target="_blank">What is a percentage difference?</a> by TJ Cole and DG Altman</li>
</ul>

How Can Machine Learning be Reliable When the Sample is Adequate for Only One Feature?
http://fharrell.com/post/mlsamplesize/
Thu, 11 Jan 2018 00:00:00 +0000
http://fharrell.com/post/mlsamplesize/
<p>The ability to estimate how one continuous variable relates to another continuous variable is basic to the ability to create good predictions. Correlation coefficients are unitless, but estimating them requires similar sample sizes to estimating parameters we directly use in prediction such as slopes (regression coefficients). When the shape of the relationship between X and Y is not known to be linear, a little more sample size is needed than if we knew that linearity held so that all we had to estimate was a slope and an intercept. This will be addressed later.</p>
<p>Consider <a href="http://hbiostat.org/doc/bbr.pdf#nameddest=sec:corrn">BBR Section 8.5.2</a>
where it is shown that the sample size needed to estimate a correlation coefficient to within a margin of error as bad as ±0.2 with 0.95 confidence is about 100 subjects, and to achieve a better margin of error of ±0.1 requires about 400 subjects. Let’s reproduce that plot for the “hardest to estimate” case where the true correlation is 0.</p>
<style>
p.caption {
font-size: 0.6em;
}
pre code {
overflow: auto;
word-wrap: normal;
white-space: pre;
}
</style>
<pre class="r"><code>require(Hmisc)</code></pre>
<pre class="r"><code>knitrSet(lang='blogdown')</code></pre>
<pre class="r"><code>plotCorrPrecision(rho=0, n=seq(10, 1000, length=100), ylim=c(0, .4), method='none')
abline(h=seq(0, .4, by=0.025), v=seq(25, 975, by=25), col=gray(.9))</code></pre>
<div class="figure"><span id="fig:plotprec"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/plotprec1.png" alt="Margin for error (length of longer side of asymmetric 0.95 confidence interval) for r in estimating ρ, when ρ=0. Calculations are based on the Fisher z transformation of r." width="672" />
<p class="caption">
Figure 1: Margin for error (length of longer side of asymmetric 0.95 confidence interval) for r in estimating ρ, when ρ=0. Calculations are based on the Fisher z transformation of r.
</p>
</div>
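<p>For readers without <code>Hmisc</code>, the margins of error quoted above can be approximated in base R. This sketch assumes the usual Fisher z approximation, in which <code>atanh(r)</code> has standard error <code>1/sqrt(n - 3)</code>; the function name <code>corr_moe</code> is mine, and this is not the code behind <code>plotCorrPrecision</code>.</p>
<pre class="r"><code>corr_moe <- function(n, rho = 0, conf = 0.95) {
  zcrit <- qnorm(1 - (1 - conf) / 2)
  half  <- zcrit / sqrt(n - 3)        # half-width on the Fisher z scale
  lower <- tanh(atanh(rho) - half)    # back-transform to the correlation scale
  upper <- tanh(atanh(rho) + half)
  pmax(upper - rho, rho - lower)      # longer side of the interval
}
round(corr_moe(c(100, 400)), 3)</code></pre>
<p>The two values come out close to 0.2 and 0.1, in line with the sample sizes of about 100 and 400 cited above.</p>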
<p>I have seen many papers in the biomedical research literature in which investigators “turned loose” a machine learning or deep learning algorithm with hundreds of candidate features and a sample size that by the above logic is inadequate had there only been one candidate feature. How can ML possibly learn how hundreds of predictors combine to predict an outcome when our knowledge of statistics would say this is impossible? The short answer is that it can’t. Researchers claiming to have developed a useful predictive instrument with ML in the limited sample size case seldom do a rigorous internal validation that demonstrates the relationship between predicted and observed values (i.e., the calibration curve) to be a straight 45° line through the origin. I have worked with a colleague who had previously worked with a ML group who found a predictive signal (high R<sup>2</sup>) with over 1000 candidate features and N=50 subjects. In trying to check their results on new subjects we appear to be finding an R<sup>2</sup> about 1/4 as large as originally claimed.</p>
<p><span class="citation">van der Ploeg, Austin, and Steyerberg (<a href="#refplo14mod">2014</a>)</span> in their article <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471228814137">Modern modelling techniques are data hungry</a> estimated that to have a very high chance of rigorously validating, many machine learning algorithms require 200 events per <em>candidate</em> feature (they found that logistic regression requires 20 events per candidate feature). So it seems that “big data” methods sometimes create the need for “big data” when traditional statistical methods may not require such huge sample sizes (at least when the dimensionality is not extremely high). [Note: in higher dimensional situations it is possible to specify a traditional statistical model for the prespecified “important” predictors and to add in principal components and other summaries of the remaining features.] For more about “data hunger” in machine learning see <a href="https://stats.stackexchange.com/questions/345737">this</a>. Machine learning algorithms do seem to have unique advantages in high signal:noise ratio situations such as image and sound pattern recognition problems. Medical diagnosis and outcome prediction problems involve a low signal:noise ratio, i.e., the R<sup>2</sup> are typically low and the outcome variable Y is typically measured with error.</p>
<p>I’ve shown the sample size needed to estimate a correlation coefficient with a certain precision. What about the sample size needed to estimate the whole relationship between a single continuous predictor and the probability of a binary outcome? Similar to what is presented in <a href="http://hbiostat.org/doc/rms.pdf#nameddest=sec:lrmn">RMS Notes Section 10.2.3</a>, let’s simulate the average maximum (over a range of X) absolute prediction error (on the probability scale). The following R program does this, for various sample sizes. 1000 simulated datasets are analyzed for each sample size considered.</p>
<pre class="r"><code># X = universe of X values if X considered fixed, in random order
# xp = grid of x values at which to obtain and judge predictions
require(rms)</code></pre>
<pre class="r"><code>sim <- function(assume = c('linear', 'smooth'),
                X,
                ns=seq(25, 300, by=25), nsim=1000,
                xp=seq(-1.5, 1.5, length=200), sigma=1.5) {
  assume  <- match.arg(assume)
  maxerr  <- numeric(length(ns))
  pactual <- plogis(xp)
  xfixed  <- ! missing(X)
  j       <- 0
  worst   <- nsim
  for(n in ns) {
    j    <- j + 1
    maxe <- 0
    if(xfixed) x <- X[1 : n]
    nsuccess <- 0
    for(k in 1 : nsim) {
      if(! xfixed) x <- rnorm(n, 0, sigma)
      P <- plogis(x)
      y <- ifelse(runif(n) <= P, 1, 0)
      f <- switch(assume,
                  linear = lrm(y ~ x),
                  smooth = lrm(y ~ rcs(x, 4)))
      if(length(f$fail) && f$fail) next
      nsuccess <- nsuccess + 1
      phat <- predict(f, data.frame(x=xp), type='fitted')
      maxe <- maxe + max(abs(phat - pactual))
    }
    maxe      <- maxe / nsuccess
    maxerr[j] <- maxe
    worst     <- min(worst, nsuccess)
  }
  if(worst < nsim) cat('For at least one sample size, could only run', worst, 'simulations\n')
  list(x=ns, y=maxerr)
}
plotsim <- function(object, xlim=range(ns), ylim=c(0.04, 0.2)) {
  ns <- object$x; maxerr <- object$y
  plot(ns, maxerr, type='l', xlab='N', xlim=xlim, ylim=ylim,
       ylab=expression(paste('Average Maximum ', abs(hat(P) - P))))
  minor.tick()
  abline(h=c(.05, .1, .15), col=gray(.85))
}
set.seed(1)
X <- rnorm(300, 0, sd=1.5) # Allows use of same X's for both simulations
simrun <- TRUE
# If blogdown handled caching, would not need to manually cache with Load and Save
if(simrun) Load(errLinear) else {
  errLinear <- sim(assume='linear', X=X)
  Save(errLinear)
}
plotsim(errLinear)</code></pre>
<div class="figure"><span id="fig:logisticsim"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/logisticsim1.png" alt="Simulated expected maximum error in estimating probabilities for x ∈ [-1.5, 1.5] with a single normally distributed X with mean zero. The true relationship between X and P(Y=1 | X) is assumed to be logit(Y=1) = X. The logistic model fits that are repeated in the simulation assume the relationship is linear, but estimate the slope and intercept. In reality, we wouldn't know that a relationship is linear, and if we allowed it to be nonlinear there would be a bit more variance to the estimated curve, resulting in larger average absolute errors than what are shown in the figure (see below)." width="672" />
<p class="caption">
Figure 2: Simulated expected maximum error in estimating probabilities for x ∈ [-1.5, 1.5] with a single normally distributed X with mean zero. The true relationship between X and P(Y=1 | X) is assumed to be logit(Y=1) = X. The logistic model fits that are repeated in the simulation assume the relationship is linear, but estimate the slope and intercept. In reality, we wouldn’t know that a relationship is linear, and if we allowed it to be nonlinear there would be a bit more variance to the estimated curve, resulting in larger average absolute errors than what are shown in the figure (see below).
</p>
</div>
<p>But wait: the above simulation assumes that we already knew that the relationship was linear. In practice, most relationships are nonlinear but we don’t know the true transformation. Assuming the relationship between X and logit(Y=1) is smooth, we can estimate the relationship reliably with a restricted cubic spline function. Here we use 4 knots, which gives rise to the addition of two nonlinear terms to the model for a total of 3 parameters to estimate, not counting the intercept. By estimating these parameters we are estimating the smooth transformation of X, and by simulating this process repeatedly we are allowing for “transformation uncertainty”.</p>
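<p>To make the parameter count concrete, here is a sketch that swaps in a natural cubic spline from base R's <code>splines</code> package for <code>rms::rcs</code>. The two are not identical (knot placement differs), so treat this as an analogy rather than the model fitted in this post: <code>ns(x, df = 3)</code> uses two interior and two boundary knots, is linear beyond the boundary knots, and contributes three coefficients besides the intercept. The simulated data and the seed are made up for the illustration.</p>
<pre class="r"><code>library(splines)
set.seed(2)
n <- 300
x <- rnorm(n, 0, 1.5)
y <- rbinom(n, 1, plogis(x))            # true logit is linear in x
fit <- glm(y ~ ns(x, df = 3), family = binomial)
length(coef(fit))                       # intercept + 3 spline coefficients
# Predicted probabilities over the range of X examined in this post
xp   <- seq(-1.5, 1.5, length.out = 200)
phat <- predict(fit, data.frame(x = xp), type = "response")
max(abs(phat - plogis(xp)))             # maximum absolute error for this one sample</code></pre>
<p>A single fit like this gives one draw of the maximum error; averaging it over many simulated datasets is what the simulation function does.</p>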
<pre class="r"><code>set.seed(1)
if(simrun) Load(errSmooth) else {
  errSmooth <- sim(assume='smooth', X=X, ns=seq(50, 300, by=25))
Save(errSmooth)
}
plotsim(errSmooth, xlim=c(25, 300))
lines(errLinear, col=gray(.8))</code></pre>
<div class="figure"><span id="fig:simrcs"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/simrcs1.png" alt="Estimated mean maximum (over X) absolute errors in estimating P(Y=1) when X is not assumed to predict the logit linearly (black line). The earlier estimates when linearity was assumed are shown with a gray scale line. Restricted cubic splines could not be fitted for n=25." width="672" />
<p class="caption">
Figure 3: Estimated mean maximum (over X) absolute errors in estimating P(Y=1) when X is not assumed to predict the logit linearly (black line). The earlier estimates when linearity was assumed are shown with a gray scale line. Restricted cubic splines could not be fitted for n=25.
</p>
</div>
<p>You can see that the sample size must exceed 300 just to have sufficient reliability in estimating probabilities just over the range of X of [-1.5, 1.5] when we do not know that the relationship is linear and we allow it to be nonlinear.</p>
<p>The morals of the story are</p>
<ul>
<li>Beware of claims of good predictive ability for ML algorithms when sample sizes are not huge in relationship to the number of candidate features</li>
<li>For any problem, whether using machine learning or regression, compute the sample size needed to obtain highly reliable predictions with only a single prespecified predictive feature</li>
<li>If you are not sure that relationships are simple so that you allow various transformations to be attempted, uncertainty increases and so does the expected absolute prediction error</li>
<li>If your sample size is not much bigger than the above minimum, beware of doing any high-dimensional analysis unless you have very clean data and a high signal:noise ratio</li>
<li>Also remember that when Y is binary, the minimum sample size necessary just to estimate the intercept in a logistic regression model (equivalent to estimating a single proportion) is 96 (see <a href="http://hbiostat.org/doc/bbr.pdf#nameddest=sec:htestpn">BBR Section 5.6.3</a>).
So it is impossible with binary Y to accurately estimate P(Y=1 | X) when there are <em>any</em> candidate predictors if n < 96 (and n=96 only achieves a margin of error of ±0.1 in estimating risk).</li>
<li>When the number of candidate features is huge and the sample size is not, expect the list of “selected” features to be volatile, predictive discrimination to be overstated, and absolute predictive accuracy (calibration curve) to be very problematic</li>
<li>In general, know how many observations are required to allow you to reliably learn from the number of candidate features you have</li>
</ul>
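<p>The figure of 96 in the list above follows from the usual normal approximation for a proportion at its worst case of 0.5. A minimal sketch, with <code>n_for_proportion</code> a hypothetical helper written for this illustration:</p>
<pre class="r"><code>n_for_proportion <- function(moe, p = 0.5, conf = 0.95) {
  zcrit <- qnorm(1 - (1 - conf) / 2)
  # Solve moe = zcrit * sqrt(p * (1 - p) / n) for n
  (zcrit / moe)^2 * p * (1 - p)
}
n_for_proportion(0.1)    # about 96
n_for_proportion(0.05)   # halving the margin of error quadruples n</code></pre>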
<p>See <a href="http://hbiostat.org/doc/bbr.pdf#nameddest=chap:hdata">BBR Chapter 20</a> for an approach to estimating the needed sample size for a given sample size and number of candidate predictors.</p>
<div id="references" class="section level1 unnumbered">
<h1>References</h1>
<div id="refs" class="references">
<div id="refplo14mod">
<p>Ploeg, Tjeerd van der, Peter C. Austin, and Ewout W. Steyerberg. 2014. “Modern Modelling Techniques Are Data Hungry: A Simulation Study for Predicting Dichotomous Endpoints.” <em>BMC Medical Research Methodology</em> 14 (1). BioMed Central Ltd: 137+. <a href="https://doi.org/10.1186/1471228814137" class="uri">https://doi.org/10.1186/1471228814137</a>.</p>
</div>
</div>
</div>

New Year Goals
http://fharrell.com/post/newyeargoals/
Fri, 29 Dec 2017 00:00:00 +0000
http://fharrell.com/post/newyeargoals/
<p>Here are some goals related to scientific research and clinical medicine that I’d like to see accomplished in 2018. <a href="#u2019">Here</a> are some updates for 2019.</p>
<ul>
<li>Physicians come to know that precision/personalized medicine for the most part is based on a false premise</li>
<li>Machine learning/deep learning is understood to not find previously
unknown information in data in the majority of cases, and tends to
work better than traditional statistical models only when dominant
nonadditive effects are present and the signal:noise ratio is
decently high</li>
<li>Practitioners will make more progress in correctly using “old”
statistical tools such as regression models</li>
<li>Medical diagnosis is finally understood as a task in probabilistic
thinking, and sensitivity and specificity (which are characteristics
not only of tests but also of patients) are seldom used</li>
<li>Practitioners using cutpoints/thresholds for inherently continuous
measurements will finally go back to primary references and find
that the thresholds were never supported by data</li>
<li>Dichotomania is seen as a failure to understand utility/loss/cost
functions and as a tragic loss of information</li>
<li>Clinical quality improvement initiatives will rely on randomized
trial evidence and deemphasize purely observational evidence;
learning health systems will learn things that are actually true</li>
<li>Clinicians will give up on the idea that randomized clinical trials
do not generalize to real-world settings</li>
<li>Fewer pre-post studies will be done</li>
<li>More research will be reproducible with sounder sample size
calculations, all data manipulation and analysis fully scripted, and
data available for others to analyze in different ways</li>
<li>Fewer sample size calculations will be based on a ‘miracle’ effect
size</li>
<li>Noninferiority studies will no longer use noninferiority margins
that are far beyond clinically significant</li>
<li>Fewer sample size calculations will be undertaken and more
sequential experimentation done</li>
<li>More Bayesian studies will be designed and executed</li>
<li>Classification accuracy will be mistrusted as a measure of
predictive accuracy</li>
<li>More researchers will realize that estimation rather than hypothesis
testing is the goal</li>
<li>Change from baseline will seldom be <em>computed</em>, not to mention not
used in an analysis</li>
<li>Percents will begin to be replaced with fractions and ratios</li>
<li>Fewer researchers will draw <strong>any</strong> conclusion from large p-values
other than “the money was spent”</li>
<li>Fewer researchers will draw conclusions from small p-values</li>
</ul>
<p>Some wishes expressed by others on Twitter:</p>
<ul>
<li>No more ROC curves</li>
<li>No more bar plots</li>
<li>Ban the terms ‘statistical significance’ and ‘statistically
insignificant’</li>
</ul>
<h1 id="updatesfor2019">Updates for 2019</h1>
<p><a class="anchor" id="u2019"></a>
My goals for 2018 were lofty so it’s not surprising that I’m disappointed overall with how little progress has been made on many of the fronts. But I am heartened by seven things:</p>
<ul>
<li>Clinicians are getting noticeably more dubious about personalized/precision medicine</li>
<li>Researchers and clinicians are more dubious about benefits of machine learning</li>
<li>Researchers are more enlightened about problems with p-values and the dichotomous thinking that usually comes with them, and are especially starting to understand what’s wrong with “significant”</li>
<li>Researchers are more enlightened about harm caused by dichotomania in general</li>
<li>We successfully launched <a href="http://datamethods.org" target="_blank">datamethods.org</a> and have created in-depth discussion in the community about many of the issues listed under goals for 2018</li>
<li>More researchers are seeing what a waste of ink ROC curves are</li>
<li>More high-profile Bayesian analyses of clinical trials are being published</li>
</ul>
<p>Areas that remain particularly frustrating are:</p>
<ul>
<li>Too many clinicians still believe that randomized clinical trials do not provide valuable efficacy data outside of the types of patients enrolled in the trials</li>
<li>Clinical researchers are still computing change from baseline</li>
<li>Sequential clinical trials are not being done (trials in which the sample size is not pretended to be known)</li>
<li>A failure to understand conditioning (as in what is assumed when computing a conditional probability)</li>
</ul>
<p>If I had to make just one plea for 2019, a general one is this: Recognize that actionable statistical information comes from thinking in a predictive mode. Condition on what you already know to predict what you don’t. Use forward-time, complete conditioning, as opposed to type I errors, p-values, sensitivity, specificity, and marginal (sample-averaged) estimates.</p>

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction
http://fharrell.com/post/scoredatareduction/
Tue, 21 Nov 2017 15:40:00 +0000
<p>This post will grow to cover questions about data reduction methods, also known as <em>unsupervised learning</em> methods. These are intended primarily for two
purposes:</p>
<ul>
<li>collapsing correlated variables into an overall score so that one
does not have to disentangle correlated effects, which is a
difficult statistical task</li>
<li>reducing the effective number of variables to use in a regression or
other predictive model, so that fewer parameters need to be
estimated</li>
</ul>
<p>The latter example is the “too many variables too few subjects” problem.
Data reduction methods are covered in Chapter 4 of my book <em>Regression
Modeling Strategies</em>, and in some of the book’s case studies.</p>
<hr />
<h3 id="sachavarinwrites">Sacha Varin writes</h3>
<p><small>
I want to add/sum some variables having different units. I decide to
standardize (Z-scores) the values and then, once transformed in
Z-scores, I can sum them. The problem
is that my variables distributions are non Gaussian (my distributions
are not symmetrical (skewed), they are long-tailed, I have all types of
weird distributions, I guess we can say the distributions are
intractable). I know that my distributions don’t need
to be gaussian to calculate Z-scores, however, if the distributions are
not close to gaussian or at least symmetrical enough, I guess the
classical Z-score transformation: (Value - Mean)/SD is not valid, that’s why I decide, because my distributions are skewed and long-tailed to use the Gini’s mean difference (robust and efficient
estimator).</p>
<ol>
<li>If the distributions are skewed and long-tailed, can I standardize
the values using that formula (Value - Mean)/GiniMd ? Or the mean is not a good estimator in presence of skewed and long-tailed distributions? What
about (Value - Median)/GiniMd ? Or what else with
GiniMd for a formula to standardize?</li>
<li>In presence of outliers, skewed and long-tailed distributions, for
standardization, what formula is better to use
between (Value - Median)/MAD (=median
absolute deviation) or (Value - Mean)/GiniMd ? And
why? My situation is not the predictive modeling case, but I want to sum the variables.
</small></li>
</ol>
<hr />
<p>These are excellent questions and touch on an interesting side issue.
My opinion is that standard deviations (SDs) are not very applicable to
asymmetric (skewed) distributions, and that they are not very robust
measures of dispersion. I’m glad you mentioned <a href="https://arxiv.org/pdf/1405.5027.pdf" target="_blank">Gini’s mean
difference</a>, which is the mean of
all absolute differences of pairs of observations. It is highly robust
and is surprisingly efficient as a measure of dispersion when compared
to the SD, even when normality
holds.</p>
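<p>To make the definition concrete, here is a minimal Python sketch of Gini’s mean difference. The function names are illustrative and do not come from any package; in R, the Hmisc package provides <code>GiniMd</code> for the same purpose. The second function relies on a standard order-statistics identity to avoid looping over all pairs.</p>

```python
from itertools import combinations


def gini_mean_difference(values):
    """Mean of |x_i - x_j| over all distinct pairs of observations."""
    x = list(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(abs(a - b) for a, b in combinations(x, 2))
    return total / (n * (n - 1) / 2)


def gini_mean_difference_sorted(values):
    """Same quantity computed from the sorted data in O(n log n) time.

    For sorted x (0-based index i), the sum of all pairwise absolute
    differences equals sum_i (2*i - n + 1) * x[i].
    """
    x = sorted(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum((2 * i - n + 1) * xi for i, xi in enumerate(x))
    return total / (n * (n - 1) / 2)
```

<p>For the three values 1, 2, and 4, the pairwise absolute differences are 1, 3, and 2, so both functions return 2.</p>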
<p>The questions also touch on the fact that when normalizing more than
one variable so that the variables may be combined, there is no magic
normalization method in statistics. I believe that Gini’s mean
difference is as good as any and better than the SD. It is also more
precise than the mean absolute difference from the mean or median, and
the mean may not be robust enough in some instances. But we have a rich
history of methods, such as principal components (PCs), that use
SDs.</p>
<p>What I’m about to suggest is a bit more
applicable to the case where you ultimately want to form a predictive
model, but it can also apply when the goal is to just combine several
variables. When the variables are continuous and are on different
scales, scaling them by SD or Gini’s mean difference will allow one to
create unitless quantities that may possibly be added. But the fact
that they are on different scales raises the question of whether they are
already “linear” or need separate nonlinear transformations to
be “combinable”.</p>
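<p>As a rough illustration of the scaling step only, the following Python sketch centers each variable, divides by its Gini’s mean difference, and sums the unitless results for each subject. It assumes the variables are already “linear” enough to be added; the function names, the dictionary layout, and the default of centering at the median are choices made here for illustration, not recommendations.</p>

```python
from itertools import combinations
from statistics import median


def gini_md(values):
    """Gini's mean difference: mean absolute difference over all pairs."""
    x = list(values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    return sum(abs(a - b) for a, b in combinations(x, 2)) / (n * (n - 1) / 2)


def scaled_sum(columns, center=median):
    """Sum of (value - center) / GiniMd across variables, one total per subject.

    `columns` maps a variable name to its list of values; all lists must
    have the same length and be ordered by subject.
    """
    scaled = []
    for name, values in columns.items():
        spread = gini_md(values)
        if spread == 0:
            raise ValueError(f"variable {name!r} has no variation")
        c = center(values)
        scaled.append([(v - c) / spread for v in values])
    return [sum(row) for row in zip(*scaled)]
```

<p>Because each variable is divided by its own measure of spread, multiplying a variable by a constant (for example, changing its units) leaves the score unchanged.</p>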
<p>I think that nonlinear PCs may be a better choice than just adding
scaled variables. When the predictor variables are correlated,
nonlinear PCs learn from the interrelationships, even occasionally
learning how to optimally transform each predictor to ultimately better
predict Y. The transformations (e.g., fitted spline functions) are
solved for so as to maximize how well each predictor can be predicted from the other
predictors or PCs of them. Sometimes the way the predictors move
together is the same way they relate to some ultimate outcome variable
that this unsupervised learning method does not have access to. An
example of this is in Section 4.7.3 of my book.</p>
<p>With a little bit of luck, the transformed predictors have more
symmetric distributions, so ordinary PCs computed on these transformed
variables, with their implied SD normalization, work pretty well. PCs
take into account that some of the component variables are highly
correlated with each other, and so are partially redundant and should
not receive the same weights (“loadings”) as other
variables.</p>
<p>The R <code>transcan</code> function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R
<a href="https://cran.r-project.org/web/packages/homals" target="_blank">homals</a>
package.</p>
<p>How do we handle the case where the number of candidate predictors p is
large in comparison to the effective sample size n? Penalized maximum
likelihood estimation (e.g., ridge regression) and Bayesian regression
typically have the best performance, but data reduction methods are
competitive and sometimes more interpretable. For example, one can use
variable clustering and redundancy analysis as detailed in the RMS book
and course notes. Principal components (linear or nonlinear) can also
be an excellent approach to lowering the number of variables that need
to be related to the outcome variable Y. Two example approaches
are:</p>
<ol>
<li>Use the 15:1 rule of thumb to estimate how many predictors can
reliably be related to Y. Suppose that number is k. Use the first
k principal components to predict Y.</li>
<li>Enter PCs in decreasing order of variation (of the system of Xs)
explained and choose the number of PCs to retain using AIC. This is
far from stepwise regression, which enters variables according to
their p-values with Y. We are effectively entering variables in a
prespecified order with incomplete principal component
regression.</li>
</ol>
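<p>The bookkeeping behind these two approaches can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the effective sample size is supplied by the caller, and the model likelihood ratio chi-square statistics for the nested models using the first 1, 2, … PCs are assumed to have been computed elsewhere. AIC is expressed on the chi-square scale (likelihood ratio chi-square minus twice the number of PCs), so larger is better. The function names are hypothetical.</p>

```python
def max_components(effective_n, per_parameter=15):
    """Approach 1: number of PCs allowed by the 15:1 rule of thumb."""
    if effective_n < 0:
        raise ValueError("effective sample size cannot be negative")
    return effective_n // per_parameter


def choose_by_aic(lr_chisq):
    """Approach 2: number of leading PCs that maximizes LR chi-square - 2k.

    lr_chisq[k - 1] is the model likelihood ratio chi-square when the
    first k PCs (in decreasing order of variance explained) are used.
    Returns 0 if no model beats the model with no PCs.
    """
    best_k, best_score = 0, 0.0
    for k, lr in enumerate(lr_chisq, start=1):
        score = lr - 2 * k  # AIC on the chi-square scale
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

<p>For example, with an effective sample size of 150 the first function allows 10 components. With chi-square values of 10, 15, 16, and 16.5 for one through four PCs, the penalized scores are 8, 11, 10, and 8.5, so the second function retains two PCs.</p>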
<p>Once the PC model is formed, one may attempt to interpret the model by
studying how raw predictors relate to the principal components or to the
overall predicted values.</p>
<p>Returning to Sacha’s original setting,
if linearity is assumed for all variables, then scaling by Gini’s mean
difference is reasonable. But psychometric properties should be
considered, and often the scale factors need to be derived from subject
matter rather than statistical
considerations.</p>

Statistical Criticism is Easy; I Need to Remember That Real People are Involved
http://fharrell.com/post/criticismeasy/
Sun, 05 Nov 2017 21:07:00 +0000
<p>I have been critical of a number of articles, authors, and journals in
<a href="http://fharrell.com/post/errmed/" target="_blank">this</a>
growing blog article. Linking the blog with Twitter is a way to expose
the blog to more readers. It is far too easy to slip into hyperbole on
the blog and even easier on Twitter with its space limitations.
Importantly, many of the statistical problems pointed out in my article
are very, very common, and I dwell on recent publications to get the
point across that inadequate statistical review at medical journals
remains a serious problem. Equally important, many of the issues I
discuss, from p-values and null hypothesis testing to issues with change
scores, are not well covered in medical education (of authors and
referees), and p-values have caused a phenomenal amount of damage to the
research enterprise. Still, journals insist on emphasizing p-values. I
spend a lot of time educating biomedical researchers about statistical
issues and serving as a reviewer for many medical journals, but am still on a
quest to impact journal editors.</p>
<p>Besides statistical issues, there are very real human issues, and
challenges in keeping clinicians interested in academic clinical
research when there are so many pitfalls, complexities, and compliance
issues. In the many clinical trials with which I have been involved,
I’ve always been glad to be the statistician and not the clinician
responsible for protocol logistics, informed consent, recruiting,
compliance, etc.</p>
<p>A recent case discussed
<a href="http://fharrell.com/post/errmed/#pcisham" target="_blank">here</a>
has brought the human issues home, after I came to know of the
extraordinary efforts made by the
<a href="http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32714-9/fulltext" target="_blank">ORBITA</a>
study’s first author, Rasha Al-Lamee, to make this study a reality.
Placebo-controlled device trials are very difficult to conduct and to
recruit patients into, and this was Rasha’s first effort to launch and
conduct a randomized clinical trial. I very much admire Rasha’s bravery
and perseverance in conducting this trial of PCI, when it is possible
that many past trials of PCI vs. medical therapy were affected by placebo
effects.</p>
<p>Professor of Cardiology at Imperial College London, a coauthor on the
above paper, and Rasha’s mentor,
<a href="https://www.imperial.ac.uk/people/d.francis" target="_blank">Darrel Francis</a>, elegantly pointed
out to me that there is a real person on the receiving end of my
criticism, and I heartily agree with him that none of us would ever want
to discourage a clinical researcher from ever conducting her second
randomized trial. This is especially true when the investigator has a
burning interest to tackle difficult unanswered clinical questions. I
don’t mind criticizing statistical designs and analyses, but I can do a
better job of respecting the sincere efforts and hard work of biomedical
researchers.</p>
<p>I note in passing that I had the honor of being a coauthor with Darrel
on <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081699" target="_blank">this paper</a>
of which I am extremely proud.</p>
<p>Dr Francis gave me permission to include his thoughts, which are below.
After that I list some ideas for making the path to presenting clinical
research findings a more pleasant journey.</p>
<hr />
<p><strong>As the PI for ORBITA, I apologise for this trial being 40 years late,
due to a staffing issue. I had to wait for the lead investigator, Rasha
Al-Lamee, to be born, go to school, study Medicine at Oxford University,
train in interventional cardiology, and start as a consultant in my
hospital, before she could begin the trial.</strong></p>
<p>Rasha had just finished her fellowship. She had experience in clinical
research, but this was her first leadership role in a trial. She was
brave to choose for her PhD a rigorous placebo-controlled trial in this
controversial but important area.</p>
<p>Funding was difficult: grant reviewers, presumably interventional
cardiologists, said the trial was (a) unethical and (b) unnecessary.
This trial only happened because Rasha was able to convince our
colleagues that the question was important and the patients would not be
without stenting for long. Recruitment was challenging because it
required interventionists to not apply the oculostenotic reflex. In the
end the key was Rasha keeping the message at the front of all our
colleagues’ minds with her boundless energy and enthusiasm.
Interestingly, when the concept was explained to patients, they agreed
to participate more easily than we thought, and dropped out less
frequently than we feared. This means we should indeed acquire
placebo-controlled data on interventional procedures.</p>
<p>Incidentally, I advocate the term “placebo” over “sham” for these
trials, for two reasons. First, placebo control is well recognised as
essential for assessing drug efficacy, and this helps people understand
the need for it with devices. Second, “sham” is a pejorative word,
implying deception. There is no deception in a placebo-controlled trial,
only pre-agreed withholding of information.</p>
<hr />
<p>There are several ways to improve the system that I believe would foster
clinical research and make peer review more objective and productive.</p>
<ul>
<li>Have journals conduct reviews of background and methods without
knowledge of results.</li>
<li>Abandon journals and use researcher-led online systems that invite
open post-“publication” peer review and give researchers the
opportunities to improve their “paper” in an ongoing fashion.</li>
<li>If not publishing the entire paper online, deposit the background
and methods sections for open pre-journal submission review.</li>
<li>Abandon null hypothesis testing and p-values. Before that, always
keep in mind that a large p-value means nothing more than “we don’t
yet have evidence against the null hypothesis”, and emphasize
confidence limits.</li>
<li>Embrace Bayesian methods that provide safer and more actionable
evidence, including measures that quantify clinical significance.
And if one is trying to amass evidence that the effects of two
treatments are similar, compute the direct probability of similarity
using a Bayesian model.</li>
<li>Improve statistical education of researchers, referees, and journal
editors, and strengthen statistical review for journals.</li>
<li>Until everyone understands the most important statistical concepts,
better educate researchers and peer reviewers on
<a href="http://biostat.mc.vanderbilt.edu/ManuscriptChecklist" target="_blank">statistical problems to avoid</a>.</li>
</ul>
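<p>The Bayesian item in the list above mentions computing the direct probability of similarity. As a rough sketch, suppose the posterior distribution of the treatment difference is approximately normal; that assumption, the function name, and the similarity margin are all introduced here for illustration. The probability that the two treatments are similar is then the posterior probability that the difference lies within the margin of zero.</p>

```python
from statistics import NormalDist


def prob_similarity(post_mean, post_sd, margin):
    """Posterior P(-margin < difference < margin) under a normal posterior.

    post_mean and post_sd describe the (assumed normal) posterior of the
    treatment difference; margin is the largest difference regarded as
    clinically negligible and must be chosen on subject-matter grounds.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    posterior = NormalDist(post_mean, post_sd)
    return posterior.cdf(margin) - posterior.cdf(-margin)
```

<p>Unlike a large p-value, this quantity directly addresses the question of interest: a value near 1 is evidence for similarity, while a value near 0.5 says the data have not settled the matter.</p>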
<p>On a final note, I regularly review clinical trial design papers for
medical journals. I am often shocked at design flaws that authors state
are “too late to fix” in their response to the reviews. This includes
problems caused by improper endpoint variables that necessitated the
randomization of triple the number of patients actually needed to
establish efficacy. Such papers have often been through statistical
review before the journal submission. This points out two challenges:
(1) there is a lot of between-statistician variation that statisticians
need to address, and (2) there are many fundamental statistical concepts
that are not known to many statisticians (witness the widespread use of
change scores and dichotomization of variables even when senior
statisticians are among a paper’s authors).</p>