Posts on Statistical Thinking
http://fharrell.com/post/
Recent content in Posts on Statistical Thinking

Data Methods Discussion Site
http://fharrell.com/post/disc/
Tue, 19 Jun 2018 00:00:00 +0000
<p>I have learned more from Twitter than I ever thought possible, from those I follow and from my followers. Quick pointers to useful resources have been invaluable. I have also gotten involved in longer discussions. Some of those, particularly discussions of the design and interpretation of newly published studies (especially randomized clinical trials, RCTs), have become very involved and controversial. Twitter is not designed for in-depth discourse, and I soon lose track of the discussion and others’ previous points, particularly if I’m away from a discussion for more than 24 hours. Also, some Twitter discussions would have been more civil had there been a moderator.</p>
<p>There are excellent discussion boards related to statistics, e.g. <a href="http://stats.stackexchange.com" target="_blank">stats.stackexchange.com</a>, <a href="https://groups.google.com/forum/#!forum/medstats" target="_blank">medstats</a>, and <a href="http://talkstats.com" target="_blank">talkstats</a>, and a variety of sites related to medical research (including clinical trials), epidemiology, and machine learning. An informal <a href="https://twitter.com/f2harrell/status/989486563947098112" target="_blank">Twitter poll</a> conducted on 2018-04-26 and 2018-04-27 drew 242 responses from my Twitter sphere. Of those, 71% favored creating a new site vs. 29% who wanted to use Twitter alone for discussions on the intended topics.</p>
<p>After much research, I’ve chosen <a href="http://discourse.org" target="_blank">discourse.org</a> as the platform for the new discussion board. This will require putting up a server to host the site. Fortunately all of the needed software (Linux, Ruby, Discourse, etc.) is free. After the site is up and running, more moderators will be needed. The site name will be <code>datamethods.org</code>. We hope to have it running in July 2018.</p>
<p>Because communication and collaboration between quantitative experts and clinical/translational researchers is a key function of the <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5263220" target="_blank">Biostatistics Epidemiology and Research Design</a> (BERD) program of the national <a href="https://ncats.nih.gov/ctsa" target="_blank">Clinical and Translational Science Award</a> (CTSA) program of NIH/NCATS, the Vanderbilt BERD program will support this discussion site under its CTSA-funded <a href="https://victr.vanderbilt.edu" target="_blank">VICTR</a> center, and the national CTSA BERD consortium is likely to be involved as well. This has the potential to bring dozens of experienced statisticians and epidemiologists to the table to assist clinical and translational investigators and research consumers with their study design, analysis, and interpretation questions.</p>
<p>Discourse recognizes participation and helpfulness; a good example may be found <a href="http://discourse.mcstan.org" target="_blank">here</a>. The software also makes it very easy to find your place in a large number of discussions and to upvote answers to questions.</p>
<p>The areas to be emphasized on the new discussion site are listed below. The overall emphasis is on fostering communication between quantitatively skilled persons and researchers who do not specialize in the mathematical side of things.</p>
<ul>
<li>quantitative methods in general, including enhancing numeracy of those participants who are not into math or statistics</li>
<li>general statistical issues such as analysis of change scores and categorization of continuous variables</li>
<li>measurement issues</li>
<li>interpretation of published statistical analyses</li>
<li>statistical design of particular studies/clinical trials</li>
<li>statistical analysis issues in published biomedical and epidemiologic research papers</li>
<li>choosing optimal statistical graphics for presenting study results</li>
<li>discussing statistical models and machine learning for biomedical and epidemiologic problems</li>
</ul>
<p>The site will be organized into the following major categories, with many tags available to further distinguish and cross-reference topics (e.g., cardiology, cancer).</p>
<ul>
<li>measurement</li>
<li>non-statistical quantitative methods & numeracy</li>
<li>statistics and data analysis (general)</li>
<li>statistical models</li>
<li>machine learning and when to use it vs. statistical models</li>
<li>regression modeling strategies</li>
<li>probability</li>
<li>study design</li>
<li>study interpretation</li>
<li>design and interpretation of particular RCTs</li>
<li>epidemiology</li>
<li>health policy</li>
<li>psychology</li>
<li>graphics</li>
<li>causal inference</li>
<li>education</li>
<li>news (courses, webcasts, meetings, etc.)</li>
<li>cardiology</li>
<li>cancer</li>
<li>nutrition</li>
<li>meta (discussion, pointers, etiquette about the site)</li>
</ul>
<p>More major categories can be added as needed.</p>
<p>To discuss this proposal, post a tweet mentioning @f2harrell, or use the commenting facility at the end of this post.</p>
<h2 id="linkstoresources">Links to Resources</h2>
<ul>
<li><a href="http://discourse.mcstan.org/faq" target="_blank">discourse.org civility guidelines</a></li>
<li><a href="https://blog.discourse.org/2018/04/effectivelyusingdiscoursetogetherwithgroupchat" target="_blank">Using Discourse effectively with group chat</a></li>
</ul>
<h2 id="discourseinformation">Discourse Information</h2>
<h3 id="waystocreateanaccount">Ways to Create an Account</h3>
<ul>
<li>Link to your Google, Facebook, Twitter, Yahoo, or Github account</li>
<li>Specify your email address and password</li>
</ul>

Viewpoints on Heterogeneity of Treatment Effect and Precision Medicine
http://fharrell.com/post/hteview/
Mon, 04 Jun 2018 00:00:00 +0000
<p class="rquote">
To translate the results of clinical trials into practice may require a lot of work involving modelling and further background information. 'Additive at the point of analysis but relevant at the point of application' should be the motto. <br>
— Stephen Senn, <a href="http://errorstatistics.com/2013/04/19/stephensennwhenrelevanceisirrelevant">When Relevance is Irrelevant</a>
<br><br>
The simple idea of risk magnification has more potential to improve medical decision making and cut costs than "omics" precision medicine approaches. Risk magnification uses standard statistical tools and standard clinical variables. Maybe it's not sexy enough or expensive enough to catch on.
</p>
<h2 id="notesonthepcorituftpacemeeting">Notes on the PCORI/Tuft PACE Meeting</h2>
<p>This is a reflection on what I heard and didn’t hear at the 2018-05-31 meeting <a href="https://nam.edu/event/evidenceandtheindividualpatientunderstandingheterogeneoustreatmenteffectsforpatientcenteredcare" target="_blank">Evidence and the Individual Patient: Understanding Heterogeneous Treatment Effects (HTE) for Patient-Centered Care</a>, sponsored by <a href="https://pcori.org" target="_blank">PCORI</a>, the Tufts University <a href="https://www.tuftsmedicalcenter.org/ResearchClinicalTrials/InstitutesCentersLabs/InstituteforClinicalResearchandHealthPolicyStudies/ResearchPrograms/CenterforPredictiveAnalyticsandComparativeEffectivness.aspx" target="_blank">PACE Center</a>, and the <a href="http://nam.edu" target="_blank">National Academy of Medicine</a>. I learned a lot and found the meeting well organized and very intellectually stimulating. Hats off to David Kent for being the meeting’s mastermind. Some of the high points of the meeting for me personally were</p>
<ul>
<li>Meeting critical care clinical trial guru Derek Angus for the first time<sup class="footnoteref" id="fnref:Onenitpickyco"><a rel="footnote" href="#fn:Onenitpickyco">1</a></sup></li>
<li>Being with my longtime esteemed colleague Ewout Steyerberg</li>
<li>Hearing Cecile Janssens’ sanguine description of the information yield of genetics to date</li>
<li>Listening to Michael Pencina convey a big picture understanding of predictive modeling</li>
<li>Hearing Steve Goodman’s wisdom about risk applying to individuals but only being estimable from groups</li>
<li>Seeing Patrick Heagerty clearly describe exactly what can be learned about an individual treatment effect when a patient can undergo only one treatment, and clearly distinguish individual from population analysis</li>
<li>Hearing inspiring stories from two patient stakeholders who happen to also be researchers</li>
<li>Getting reminded of the pharmacogenomic side of the equation from my Vanderbilt colleague Josh Peterson</li>
<li>Watching John Spertus give a spirited report about how smart clinicians in one cardiovascular treatment setting are more likely to use a treatment for patients who stand to benefit the least, from the standpoint of predicted risk</li>
<li>Watching Rodney Hayward give an even more spirited talk about how medical performance incentives often do not achieve their intended effects</li>
<li>Hearing extreme criticism of one-variable-at-a-time subgrouping from several speakers, which was gratifying</li>
<li>Seeing some speakers divide predicted risk into quantile groups, which was worrying: risk is a continuous variable, and quantile interval boundaries are driven by demographics (and are arbitrarily manipulated by changing study inclusion criteria) rather than by biomedicine</li>
</ul>
<h2 id="backgroundissuesinprecisionmedicine">Background Issues in Precision Medicine</h2>
<p>Besides the physician talking with and listening to the patient, I can think of five ways to achieve personalized/precision medicine:</p>
<ul>
<li>Development of new diagnostic tests that contain validated, new information</li>
<li>Breakthroughs in treatments for well-defined patient subpopulations</li>
<li>Finding strong evidence of patient-specific treatment random effects using randomized crossover studies<sup class="footnoteref" id="fnref:StephenSennhas"><a rel="footnote" href="#fn:StephenSennhas">2</a></sup>, and finding actionable treatment plans once such heterogeneity of treatment effects (HTE) is demonstrated and understood</li>
<li>Finding strong evidence of interaction between treatment and patient characteristics, which I’ll call differential treatment effects (DTE)</li>
<li>Giving treatments to patients with the largest expected absolute benefit of the treatment (largest absolute risk reduction)</li>
</ul>
<p>The last approach has little to do with HTE and is mainly a mathematical issue arising from the fact that there is only room to move a probability (risk) when the patient’s risk is in the middle. Patients who are at very low risk of a clinical outcome cannot have a large absolute risk reduction. I’ll call this phenomenon <em>risk magnification</em> (RM) because the absolute risk difference is magnified by having a higher baseline risk.</p>
<p>The conference focused more on RM than on HTE. RM is the simplest and most universal approach to medical decision making, and requires the least amount of information<sup class="footnoteref" id="fnref:Atleastatthe"><a rel="footnote" href="#fn:Atleastatthe">3</a></sup>. Before contrasting RM with HTE, we must define relative and absolute treatment effects. For a continuous variable such as blood pressure that is approximately linearly related to clinical outcome (at least over the normal-to-hypertensive range of blood pressure), reduction in mean blood pressure as estimated in a randomized clinical trial (RCT) is both an absolute and a relative measure. For binary and time-to-event endpoints, an absolute difference might be the difference in cumulative incidence of the event at, say, 2 years, or the difference in life expectancy. A relative effect may be an odds ratio, a hazard ratio, or, in an accelerated failure time model, a survival time ratio.</p>
<h2 id="riskmagnification">Risk Magnification</h2>
<p>There are two stages in the understanding and implementation of RM:</p>
<ul>
<li>In an RCT, estimate the relative treatment effect and try to find evidence for constancy of this relative effect. If there is evidence for interaction on the relative scale, then the relative treatment effect is a function of patient characteristics.</li>
<li>When making patient decisions, one can easily (in most situations) convert the relative effect from the first step into an absolute risk reduction if one has an estimate of the current patient’s absolute risk. This estimate may come from the same trial that produced the relative efficacy estimate, if the RCT enrolled a sufficient variety of subjects. Or it can come from a purely observational study if that study contains a large number of subjects given usual care or some other appropriate reference set.</li>
</ul>
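The second step above can be sketched in a few lines. This is a minimal illustration (shown in Python; the post's own code is in R) with hypothetical numbers: the analyst supplies an estimated baseline risk and assumes a constant odds ratio from the RCT.

```python
# Convert a constant treatment odds ratio from an RCT into an absolute
# risk reduction (ARR) for a specific patient, given that patient's
# estimated baseline risk under usual care.  Numbers are hypothetical.

def arr_from_or(baseline_risk: float, odds_ratio: float) -> float:
    """Absolute risk reduction implied by a treatment odds ratio."""
    odds = baseline_risk / (1 - baseline_risk)        # risk -> odds
    treated_odds = odds * odds_ratio                  # apply relative effect
    treated_risk = treated_odds / (1 + treated_odds)  # odds -> risk
    return baseline_risk - treated_risk

# A low-risk and a higher-risk patient under the same odds ratio of 0.6
print(arr_from_or(0.05, 0.6))  # small absolute benefit
print(arr_from_or(0.40, 0.6))  # much larger absolute benefit
```

The same relative effect translates into very different absolute benefits, which is the essence of risk magnification.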
<p>These issues are discussed <a href="http://fharrell.com/post/ehrsrcts">here</a> and <a href="http://fharrell.com/post/rctmimic">here</a>, in <a href="https://jamanetwork.com/journals/jama/articleabstract/209767" target="_blank">Kent and Hayward’s paper</a>, and in Stephen Senn’s <a href="http://slideshare.net/StephenSenn1/realworldmodified" target="_blank">presentation</a>. An early application is in <a href="https://www.ahjonline.com/article/S00028703(97)701649/abstract" target="_blank">Califf et al</a>.</p>
<p>In most cases one can compute the absolute benefit as a function of (known or unknown) patient baseline risk using simple math, without requiring any data, once the relative efficacy is estimated. It is only at the decision point for the patient at hand that the risk estimate is needed.</p>
<p>Here is an example for a binary endpoint in which the treatment effect is given by a constant odds ratio. The graph below exemplifies two possible odds ratios: 0.8 and 0.6. One can see that the absolute risk reduction by treatment is strongly a function of baseline risk (no matter how this risk arose), and this reduction can be estimated even without a risk model, under certain assumptions.</p>
<pre><code class="language-r">require(Hmisc)
</code></pre>
<pre><code class="language-r">knitrSet(lang='blogdown')
plot(0, 0, type="n", xlab="Patient Risk Under Usual Care",
     ylab="Absolute Risk Reduction",
     xlim=c(0,1), ylim=c(0,.15))
i  <- 0
or <- c(0.8, 0.6)
for(h in or) {
  i     <- i + 1
  p     <- seq(.0001, .9999, length=200)
  logit <- log(p/(1 - p))       # same as qlogis(p)
  logit <- logit + log(h)       # modify by odds ratio
  p2    <- 1/(1 + exp(-logit))  # same as plogis(logit)
  d     <- p - p2
  lines(p, d)
  maxd  <- max(d)
  smax  <- p[d == maxd]
  text(smax, maxd + .005, paste0('OR=', format(h)), cex=.8)
}
</code></pre>
<p><img src="http://fharrell.com/post/hteview_files/figurehtml/setup1.png" width="672" /></p>
<p>For an example analysis where the relative treatment effect varies with patient characteristics, see <a href="http://fharrell.com/doc/bbr.pdf" target="_blank">BBR Section 13.6.2</a>.</p>
<h2 id="heterogeneityoftreatmenteffects">Heterogeneity of Treatment Effects</h2>
<p>The conference did not emphasize the underpinnings of HTE, but this article gives me an excuse to describe my beliefs about it. In what follows I’m actually referring to DTE, because I’m assuming that estimates come from parallel-group studies, but I’ll slip into the HTE nomenclature.</p>
<p>It is only meaningful to define HTE on a relative treatment-effect scale, because otherwise HTE is always present (because of RM) and the concept becomes meaningless. A relative scale such as log odds or log relative hazard is one on which it is mathematically possible for the treatment effect to be constant over the whole patient spectrum<sup class="footnoteref" id="fnref:Notethatonan"><a rel="footnote" href="#fn:Notethatonan">4</a></sup>. It is only on the relative scale that differences in treatment effectiveness have the potential to be related to mechanisms of action. By contrast, absolute risk reduction comes from <strong>generalized risk</strong>, and generalized risk can come from any source, including advanced age, greater extent of disease, and comorbidities. Researchers frequently make the mistake of examining variation in absolute risk reduction by subgrouping, one day shouting “older patients get more benefit” and another day concluding “patients with comorbidities get more benefit”, but these findings are illusory. It is often the case that <strong>anything</strong> giving the patient more risk will appear to enhance absolute treatment benefit, and it is an error in labeling to attribute these effects to a specific variable<sup class="footnoteref" id="fnref:DavidKentmenti"><a rel="footnote" href="#fn:DavidKentmenti">5</a></sup>.</p>
<p>Though the PCORI/Tufts meeting did not intend to cover the following topic, it would be useful at some point to have indepth discussions about HTE/DTE, to address at least two general points:</p>
<ul>
<li>Which sorts of treatment/disease combinations should be selected for examining HTE?</li>
<li>What happens when we quantify the outcome variation explained by HTE?</li>
</ul>
<p>On the first point, I submit that the situations with the most promise for finding and validating HTE/DTE are trials in which the average treatment effect is large (and in the right direction). It is tempting to try to find HTE in a trial with a small overall difference, but there are two problems in doing so. First, the statistical signal, or information content of the data, is unlikely to be sufficient to estimate differential effects<sup class="footnoteref" id="fnref:Underthebesto"><a rel="footnote" href="#fn:Underthebesto">6</a></sup>. Second, to say that HTE exists when the average treatment effect is close to zero implies that there must exist patient subgroups in which the treatment does significant harm. The plausibility of this assumption should be questioned.</p>
<p>On the second point, about quantification of non-constancy of the relative treatment effect, a very fruitful area of research could be developing strategies for “proof of concept” studies of DTE that parallel how principal component analysis has been used in gene microarray and GWAS research to show that a possible predictive signal exists in the genome. The same approach could be used to quantify the signal strength of differential treatment effects by patient characteristics. This would address a common problem: factors that potentially interact with treatment can be correlated, diminishing the statistical power of individual interaction tests. By reducing a large number of potential interacting factors to several principal components (or other summary scores) and computing a “chunk test” for the joint interaction of those variables with treatment, one could show that something is there without spending statistical power “naming names.”</p>
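A chunk test of this kind can be sketched under simplifying assumptions (simulated data, a Gaussian outcome, and an ordinary nested-model F test standing in for a likelihood ratio chi-square). Everything here is illustrative, not part of any real analysis.

```python
# Sketch of a "chunk test" for differential treatment effects: reduce many
# correlated candidate interaction covariates to two principal components
# and jointly test their interaction with treatment, instead of testing
# each covariate-by-treatment interaction one at a time.
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 10
latent = rng.normal(size=(n, 3))                  # shared structure
X = latent @ rng.normal(size=(3, p)) + 0.5 * rng.normal(size=(n, p))
t = rng.integers(0, 2, n)                         # randomized treatment
y = t + X[:, 0] + 0.5 * t * X[:, 0] + rng.normal(size=n)

# first two principal components of the centered covariates
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc = Xc @ Vt[:2].T

def rss(design):
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return resid @ resid

ones = np.ones((n, 1))
additive = np.hstack([ones, t[:, None], X])
with_int = np.hstack([additive, t[:, None] * pc])  # + treatment-by-PC columns

# 2-df chunk F test for the joint treatment-by-PC interaction
df_resid = n - with_int.shape[1]
F = ((rss(additive) - rss(with_int)) / 2) / (rss(with_int) / df_resid)
print(round(F, 1))
```

Two degrees of freedom are spent on the joint interaction rather than one per covariate, preserving power when the candidate interacting factors are correlated.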
<p>This relates to what I perceive is a major need in HTE research: to quantify the amount of patient outcome variation that is explained by treatment interactions in comparison to the variation explained by just using an additive model that includes treatment and a long list of covariates. A powerful index for quantifying such things is the “adequacy index” described in the maximum likelihood estimation chapter in <em><a href="http://fharrell.com/links">Regression Modeling Strategies</a></em>. This index answers the question “what proportion of the explainable outcome variation as measured by the model likelihood ratio chi-square statistic is explainable by ignoring all the interaction effects?” One minus this is the fraction of predictive information provided by DTE. In my experience, the outcome variation explained by main effects swamps that explained by adding interaction effects to models. I predict that clinical researchers will be surprised how little differential treatment effects matter when compared to outcomes associated with patient risk factors, and when compared to RM.</p>
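A minimal sketch of the adequacy-index idea follows, using Gaussian linear models and simulated data for simplicity (the index in Regression Modeling Strategies is defined for general likelihood-based models; this toy version uses the Gaussian likelihood-ratio chi-square).

```python
# Adequacy index sketch: what fraction of the full model's likelihood-ratio
# chi-square is captured by the additive (no-interaction) model?
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)                  # a strong risk factor
t = rng.integers(0, 2, n)               # randomized treatment
y = 2 * x + 0.8 * t + 0.2 * t * x + rng.normal(size=n)  # weak interaction

def lr_chisq(design):
    """-2 log likelihood ratio vs. the intercept-only model (Gaussian)."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    rss0 = np.sum((y - y.mean()) ** 2)
    return n * np.log(rss0 / rss)

ones = np.ones((n, 1))
additive = np.hstack([ones, x[:, None], t[:, None]])
full = np.hstack([additive, (t * x)[:, None]])      # adds the interaction

adequacy = lr_chisq(additive) / lr_chisq(full)
print(round(adequacy, 3))   # near 1 when interactions add little
```

One minus `adequacy` is the fraction of predictive information attributable to the differential treatment effect; with main effects this dominant it is tiny.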
<p>My suggestions for developing statistical analysis plans for testing and estimating DTE/HTE are in <a href="http://fharrell.com/doc/bbr.pdf">BBR Section 13.6</a>.</p>
<h2 id="averagesvscustomizedestimates">Averages vs. Customized Estimates</h2>
<p class="rquote">
Advocates of precision medicine are required to show that customized treatment selection results in better patient outcomes than optimizing average outcomes.
</p>
<p>An unspoken issue occurred to me during the meeting. We need to be talking much more about mean squared error (MSE) of estimates of individualized treatment effects. MSE equals the variance of an estimate plus the square of the estimate’s bias. Variance is reduced by increasing the sample size or by being able to explain more outcome variation (having a higher signal:noise ratio). Bias can come from a problematic study design that misestimated the average treatment effect, or by assuming that the effect for the patient at hand is the same as the average relative treatment effect when in fact the treatment effect interacted with one or more patient characteristics. But when one allows for interactions, the variance of estimates increases substantially (especially for patient types that are not well represented in the sample). So interaction effects must be fairly large for it to be worthwhile to include these effects in the model, i.e., for MSE to be reduced (i.e., for the square of bias to decrease more than the variance increases).</p>
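The MSE trade-off can be illustrated with a toy simulation (hypothetical effect sizes, not from any trial): with a modest true interaction, the biased pooled estimate can still beat the unbiased subgroup-specific estimate in mean squared error.

```python
# MSE = variance + bias^2 for two estimators of a subgroup-B patient's
# treatment effect: the pooled average effect (biased when an interaction
# exists, lower variance) vs. the subgroup-B estimate (unbiased, noisier).
import numpy as np

rng = np.random.default_rng(3)
n_sims, n_sub = 5000, 50              # simulations; patients/arm/subgroup
eff_a, eff_b = 1.0, 1.2               # true effects in subgroups A and B
se = np.sqrt(2 / n_sub)               # SE of a subgroup estimate (sd = 1)

d_a = rng.normal(eff_a, se, n_sims)   # estimated effect in subgroup A
d_b = rng.normal(eff_b, se, n_sims)   # estimated effect in subgroup B
pooled = (d_a + d_b) / 2              # overall average treatment effect

mse_pooled = np.mean((pooled - eff_b) ** 2)   # biased, lower variance
mse_sub = np.mean((d_b - eff_b) ** 2)         # unbiased, higher variance
print(round(mse_pooled, 4), round(mse_sub, 4))
```

Here the pooled estimator's squared bias (0.01) is smaller than the variance it saves, so the "wrong" but stabler estimate wins; a larger interaction would reverse this.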
<p>To really understand HTE, one must understand how a patient disagrees with herself over time, even when the treatment doesn’t change. Stephen Senn has written extensively about this, and a new paper entitled <a href="http://www.pnas.org/content/early/2018/06/15/1711978115" target="_blank">Lack of grouptoindividual generalizability is a threat to human subjects research</a> by Fisher, Medaglia, and Jeronimus is a worthwhile read. Also see this excellent article: <a href="https://academic.oup.com/ije/article/40/3/537/747708" target="_blank">Epidemiology, epigenetics and the ‘Gloomy Prospect’: embracing randomness in population health research and practice</a> by George Davey Smith.</p>
<p>In the absence of knowledge about patient-specific treatment effects, the best estimate of the relative treatment effect for an individual patient is the average relative treatment effect<sup class="footnoteref" id="fnref:Thiscanbeeasi"><a rel="footnote" href="#fn:Thiscanbeeasi">7</a></sup>. Selecting the treatment that provides the best average relative effect will be the best decision for an individual unless DTEs are large. Personalizing the decision further, beyond accounting for absolute risk (a different issue, and one that may objectively address costs), requires abundant data on DTE.</p>
<h2 id="generalizabilityofrcts">Generalizability of RCTs</h2>
<p>At the meeting I heard a couple of comments implying that randomized trials are not generalizable to patient populations that are different from the clinical trial patients. This thought comes largely from a misunderstanding of what RCTs are intended to do, as described in detail <a href="http://fharrell.com/post/rctmimic">here</a>: to estimate relative efficacy. Even though absolute efficacy varies greatly from patient to patient due to RM, evidence for variation in relative efficacy has been minimal outside of the molecular tumor signature world.</p>
<p>The beauty of the conference’s concentration on risk magnification is that RM always exists whenever risk is an issue (not so much in a pure blood pressure trial), RM is easier to deal with, and accounting for RM does not require increasing the sample size. RM does benefit, however, from having large observational cohorts on which to estimate risk models for computing absolute risk reduction given patient characteristics and relative treatment effects. RM does not require crossover studies, and can be actualized even without a risk model if the treating physician has a good gestalt of her patient’s outcome risk. In my view, RM should be emphasized more than HTE because of its practicality. To do RM “right” and obtain personalized estimates of absolute treatment benefit, we do need to spend more effort checking that risk models are absolutely calibrated.</p>
<h2 id="otherthingsidliketohaveheardorfurtherdiscussed">Other Things I’d Like to Have Heard or Further Discussed</h2>
<ul>
<li>There was some discussion of multiple endpoints and tradeoffs between safety and efficacy. Patient utility analysis and the use of ordinal clinical outcomes would have been a nice addition, though there’s not time for everything.</li>
<li>The ACCORD BP trial was described as “negative”, but it was a frequentist trial, so all one can say is that it did not amass enough information to reject the null hypothesis.</li>
<li>I heard someone mention at one point that subgroup analysis “breaks the randomization.” I don’t think that’s strictly true. It’s just that subgroup analysis is not statistically competitive, and is usually misleading because of noise, arbitrariness, and collinearities.</li>
<li>Someone mentioned tree methods but single trees require 100,000 patients to work adequately and even then are not competitive with regression.</li>
<li>There needs to be more discussion about the choice of outcome measures in trials. DTE/HTE analysis requires high-information outcomes to have much hope; binary outcomes have low information content.</li>
<li>It may have been Fan Li and Michael Pencina who mentioned the use of penalized maximum likelihood estimation (e.g., lasso, elastic net) for estimating DTE. These methods do not provide any statistical inference capabilities (as opposed to Bayesian penalization through skeptical priors).</li>
</ul>
<p><a class="anchor" id="evidence"></a></p>
<h2 id="evidenceforhomogeneityoftreatmenteffects">Evidence for <strong>Homo</strong>geneity of Treatment Effects</h2>
<p>For continuous outcome variables Y for which the variance of measurements can be disconnected from the mean, one way to estimate the magnitude of HTE is to compare the variance of Y in the active treatment group with that in the control group. If HTE exists, it cannot affect a pure control group, but it should increase the variance of Y in the treatment group because the treatment effect varies across types of subjects. Two meta-analyses have examined this issue. The first, on weight-loss treatments, found that “evidence is limited for the notion that there are clinically important differences in exercise-mediated weight change.” The second reviewed 208 studies and found evidence in the opposite direction from HTE: the average treated:control variance ratio of Y was 0.89.</p>
<ul>
<li><a href="https://doi.org/10.1111/obr.12682" target="_blank">Inter‐individual differences in weight change following exercise interventions: a systematic review and meta‐analysis of randomized controlled trials</a> by Williamson, Atkinson, Batterham</li>
<li><a href="https://f1000research.com/articles/730" target="_blank">Does evidence support the high expectations placed in precision medicine? A bibliographic review</a> by Cortés et al.</li>
</ul>
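The variance-comparison idea can be illustrated with simulated data (hypothetical effect sizes chosen for clarity):

```python
# Under a constant treatment effect, treatment shifts the mean of Y but
# leaves its variance alone; under HTE, each patient's individual effect
# adds variance in the treated arm, inflating the treated:control ratio.
import numpy as np

rng = np.random.default_rng(4)
n = 2000
control = rng.normal(0, 1, n)                        # control-arm outcomes
# constant effect: every treated patient shifted by -0.5
treated_const = rng.normal(0, 1, n) - 0.5
# heterogeneous effect: each patient draws an individual effect (sd 0.7)
treated_hte = rng.normal(0, 1, n) + rng.normal(-0.5, 0.7, n)

ratio_const = np.var(treated_const) / np.var(control)
ratio_hte = np.var(treated_hte) / np.var(control)
print(round(ratio_const, 2), round(ratio_hte, 2))
```

A ratio near (or, as in the meta-analysis above, below) 1 is evidence against substantial HTE; a ratio well above 1 is consistent with it.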
<h4 id="footnotes">Footnotes</h4>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Onenitpickyco">One nitpicky comment about a small point in Derek’s presentation: He described an analysis in which a risk model was developed in placebo patients and applied to active arm patients. This approach has the possibility of creating a bias caused by fitting idiosyncrasies of placebo patients, in a way that may exaggerate treatment effect estimates. <a class="footnotereturn" href="#fnref:Onenitpickyco"><sup>^</sup></a></li>
<li id="fn:StephenSennhas">Stephen Senn has <a href="https://www.bmj.com/content/329/7472/966" target="_blank">shown</a> how one may estimate patient random effects representing individual response to therapy in a 6period 2treatment crossover study. See also <a href="http://journals.sagepub.com/doi/abs/10.1177/0962280210379174" target="_blank">this</a>. <a class="footnotereturn" href="#fnref:StephenSennhas"><sup>^</sup></a></li>
<li id="fn:Atleastatthe">At least at the analysis phase, if not at the implementation stage. <a class="footnotereturn" href="#fnref:Atleastatthe"><sup>^</sup></a></li>
<li id="fn:Notethatonan">Note that on an additive risk scale, interactions must be present to prevent risks from going outside the legal range of [0,1]. <a class="footnotereturn" href="#fnref:Notethatonan"><sup>^</sup></a></li>
<li id="fn:DavidKentmenti">David Kent mentioned to me that he had some strong examples where <em>relative</em> treatment benefit was a function of <em>absolute</em> baseline risk. I need to know more about this. <a class="footnotereturn" href="#fnref:DavidKentmenti"><sup>^</sup></a></li>
<li id="fn:Underthebesto">Under the best of situations, the sample size needed to estimate an interaction effect is four times that needed to estimate the average treatment effect. <a class="footnotereturn" href="#fnref:Underthebesto"><sup>^</sup></a></li>
<li id="fn:Thiscanbeeasi">This can be easily translated into a customized absolute risk reduction estimate as discussed earlier. <a class="footnotereturn" href="#fnref:Thiscanbeeasi"><sup>^</sup></a></li>
</ol>
</div>

Navigating Statistical Modeling and Machine Learning
http://fharrell.com/post/statml2/
Mon, 14 May 2018 00:00:00 +0000
<p>Drew Levy<br><small><tt>drew@dogoodscience.com</tt></small><br><small><tt> <a href="http://linkedin.com/in/drewglevy" target="_blank">Linkedin:drewglevy</a> </tt></small><br><small><tt> <a href="http://www.DoGoodScience.com" target="_blank">DoGoodScience.com</a> </tt></small><br><br></p>
<p class="rquote">
... the art of data analysis is about choosing and using multiple tools.<br><a href="http://biostat.mc.vanderbilt.edu/rms"> —Regression Modeling Strategies</a>, p. vii
</p>
<p>Frank Harrell’s post, <a href="http://fharrell.com/post/statml/">Road Map for Choosing Between Statistical Modeling and Machine Learning</a>, does us the favor of contrasting statistical modeling (SM) and machine learning (ML) in terms of fundamental attributes (signal:noise and data requirements, dependence on assumptions and structure, interest in “special” parameters, accounting of uncertainties, and predictive accuracy). This is a clarifying perspective. Despite the prevalent conflation of SM and ML under the rubric of ‘data science’, Frank’s post underscores that SM and ML differ in important ways, and the individual considerations in this contrast should help us make deliberate decisions about when and how to apply each approach. This cogent set of criteria helps us select tools that are fit for purpose and serve our particular ends with the best means. Getting clarity about what our real ends are might be the harder part.</p>
<p>To extend the analogy, the guideposts identified by Frank could be illustrated as a route map if put into the format of a series of junctures (and termini). Here is an example:</p>
<ol>
<li>Do you want to isolate the effect of special variables or have an interpretable model? If yes, turn left toward SM; if no, keep driving …</li>
<li>Is your sample size less than huge? If yes, park in the space designated “SM”; if no, …</li>
<li>Is your signal:noise low? If yes, take the ramp toward “SM”; if no, …</li>
<li>Is there interest in estimating the uncertainty in forecasts? If yes, merge into SM lane; if no, …</li>
<li>Is non-additivity/complexity expected to be strong? If yes, gun the pedal toward ML; if no, … you can continue the journey with SM.</li>
</ol>
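The junctures above could be caricatured as a toy decision function. This is purely illustrative; every threshold (and the very idea of a binary answer at each juncture) is a judgment call, not part of the original road map.

```python
# A toy encoding of the five junctures: take the first exit toward SM that
# applies; reach ML only if every ML-favoring condition holds.
def sm_or_ml(interpretability_needed: bool,
             huge_sample: bool,
             high_signal_to_noise: bool,
             need_forecast_uncertainty: bool,
             strong_nonadditivity: bool) -> str:
    if interpretability_needed:          # juncture 1: turn left toward SM
        return "SM"
    if not huge_sample:                  # juncture 2: park at SM
        return "SM"
    if not high_signal_to_noise:         # juncture 3: ramp toward SM
        return "SM"
    if need_forecast_uncertainty:        # juncture 4: merge into SM lane
        return "SM"
    # juncture 5: strong non-additivity favors ML; otherwise stay with SM
    return "ML" if strong_nonadditivity else "SM"

print(sm_or_ml(False, True, True, False, True))   # -> ML
print(sm_or_ml(True, True, True, False, True))    # -> SM
```

As the next paragraph notes, the real situation is far more nuanced, but even this crude sketch is more systematic than "I have lots of data, therefore ML."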
<p>This allegorical cartoon is simplistic: the situation is certainly much more nuanced than this. But it is more systematic thinking than is often employed (such as, ‘I have lots of data, therefore ML’). There are other maps that people could draw, and junctures to consider. The route illustrated above is intended to encourage others to plot a course thoughtfully. And the allegory is certainly narratively thin: there are surprises lurking in the landscape along the highway.</p>
<p>Frank’s contrast between SM and ML exposes an essential question: “who/what is actually learning?” For the most part, in ML only the machine is learning. Little or no understanding escapes from the black box into human knowledge, which means that ML is purely instrumental. In some ways ML is like operant conditioning, or the automatic System 1 thinking process in humans (Kahneman’s <a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow" target="_blank">Thinking, Fast and Slow</a>). Both take in information and produce behavioral outputs, but operate below the level of conscious awareness. The machine is largely ‘dumb’ and cannot tell you very well what it has learned; nor can it be aware of when and how it may be fundamentally wrong. While ML can serve many purposes, there are potential risks and costs associated with mechanical opacity (viz., machine trading).</p>
<p>The Scientific Method demonstrates that with controlled experiments and strong claims you can understand causation and predict events. SMs blur the boundary between rigorous causal understanding and purely instrumental utility: they are predictive tools that are also comprehensible to humans (<a href="http://www.informationphilosopher.com/knowledge/best_explanation.html" target="_blank">Inference to the Best Explanation</a>), and the random component of the model captures the sources of uncontrolled variation. ML shows that if you relax the first two requirements to weak claims, you can still predict events, but perhaps not understand them [special thanks to <a href="https://www.linkedin.com/in/garrettgrolemund49328411" target="_blank">Garrett Grolemund</a> for his thinking and language about these issues]. We have seen how ML vs. SM can be reframed as positions on a spectrum (e.g., the spectrum of human intermediation, in <a href="https://jamanetwork.com/journals/jama/articleabstract/2675024" target="_blank">Big Data and Machine Learning in Health Care</a>, by AL Beam and IS Kohane). This suggests yet another spectrum:<br>
<p class="rquote">
Experimental Science → clear causal understanding and predictions<br>
Statistical Models → understanding that holds under a set of assumptions, and supplies predictions and uncertainty estimates<br>
Machine Learning → predictions
</p>
This spectrum invites consideration of the various ways we use predictions: from corroborating or refuting theory (as in the scientific method), to calibrating fit and positing structure, to utilitarian prognostication.</p>
<p>For several medical applications a black-box prediction tool would appear to be entirely suitable, such as reading pathology, predicting treatment nonadherence, or some high-complexity nonlinear systems biology problems. Predicting accurately in such applications may be entirely enough, whether or not you know why the predictions are accurate. We don’t need to be mechanics for a car to get us to our destinations. In this sense, ML may best be construed literally as a ‘tool’ in the instrumental sense: a form of augmentation of effective human capacity. Generically, AI/ML is, to date, primarily about building systems that address a discrete and specific problem by processing enormous volumes of data and providing answers to highly structured questions in an automated way, very quickly.</p>
<p>So, it may be that one of the first forks in the road map for choosing between ML and SM should be whether or not one wants to claim to be doing formal science. For the endeavor to be scientific, you must have, and empirically assess, hypotheses or theories about how some aspect of the world works; these are minimal or absent in ML. If learning, in the sense of accruing knowledge about how the world works, is not a predicate of ML, then however highly technical ML may be, it should not be misconstrued as scientific. Despite being a central feature of the current Data Science meme, ML should surrender any pretensions about being science. But it is a potentially highly effective technology.</p>
<p>This reasoning also exposes an obverse issue in how SM is sometimes used in medicine. While SM provides prediction based on evaluation of specific hypotheses about nature, it is very frequently used to rationalize a simplistic heuristic approach to clinical decision making, inadvertently forsaking the full probabilistic information available for the decision. Ultimately, real-world medical decision-making is a forecast: conditional on a set of premises provided in data, it is a prediction about which course of action is likely to yield the best result, especially for individual patient-level decision-making (e.g., Precision Medicine, Personalized Medicine). Traditional rigorous causal inference has led to a reductionist focus on particular independent effects, and has encouraged a selective focus on a limited set of terms on the right-hand side (r.h.s.) of the equation. With SM a prevalent tendency is to focus, after adjustment, on selected variables and just use these ‘risk factors’. Frequently, only categorical classes of the selected variables are used in making decisions about care, further reducing these to heuristics for decision making—much as we tend to use p-values as facile surrogates for richer evidence. This is also similar to promoting the value of a new biomarker that in isolation provides less information than the basic clinical data available. We have a strong tendency to reduce the information for decisions to singular and simple binary inputs. This is entropic dissipation of information, due largely to our stubborn preference for cognitive ease in decision making.</p>
<p>Models that make accurate predictions of responses for future observations by incorporating relevant information perform the correct calculus of integrating information, and provide correct output for informing decisions, with explicit probability and uncertainty estimates (the left-hand side of the equation: l.h.s.). You will hear remarks that reflect resistance to probability-based clinical decision making: complaints that probabilities are too complex, and emphasis on what physicians want or need. I think this objective is misplaced at a fundamental level. The correct objective and focus is what ultimately leads to the best outcomes for patients. This should not be about making things easy for physicians; it is about finding and adopting the best process for decision making that serves the interest of patients, no matter how difficult, awkward, or inconvenient for physicians. I have sympathy for clinicians: they are only human, with limited cognitive capacity (information bandwidth) like the rest of us. Because thinking consumes our limited energy, human cognition is prone to take the path of least resistance, and we are generally unaware of this as we are doing it. Cognitive laziness is built deep into our nature. But the real value to be served in clinical decision making is the quality of care and outcomes for patients. Where individual patients are involved, rich multivariable information rigorously integrated for individual patient-level decision-making leads to much greater acuity in predicting the consequences of health care actions, and ultimately to better decisions and outcomes.</p>
<p>Well-formulated real-world posterior conditional probabilities (i.e., the l.h.s.) are high-value information about both potential outcomes and uncertainties. Decision-making based on the left-hand side maps observations to actions, better informs care-related decisions, and potentially improves outcomes for patients. Paradoxically, while we may have learned something specific and scientific from the data with SM, we are also not using the predictive capacity of SM—the l.h.s.—optimally. Prediction generalizes estimation and, to some extent, hypothesis testing (<a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank">Regression Modeling Strategies</a>, p. 1); for SM—like ML—overall prediction remains a major goal.</p>
<p>Our tendency toward cognitive ease (our allergy to complexity) may explain part of the sex appeal of ML: the allure of outsourcing cognitive effort to the machine. The perceived value of our technology lies in removing difficulty and uncertainty from our lives; this is a source of its seductive power. Part of what is attractive about ML is that it appears to absolve humans of the need to think hard, as if solutions will appear out of the machine ‘automagically’. ML appeals to our bias for cognitive ease, and risks beguiling “magical thinking” (a term I borrowed from <a href="https://mitpress.mit.edu/books/whatalgorithmswant" target="_blank">What Algorithms Want</a> by Ed Finn). There is a prevalent fantasy about “the killer app”, and how it will liberate us from our cognitive limitations and the effort of hard thought. And this “killer app” fantasy (in combination with our lazy thinking) reinforces the notion that success is all about the technology—about the algorithm.</p>
<p>Judging from the prevalence of articles and advertisements in the vocational literature and the lay press, the requirement for ML experience among job postings, the emphasis on ML at professional meetings, etc., you might think that SM has gone the way of the horse and buggy, or is an endangered species occupying a precarious ecological niche. But while in this epoch we are carried away in a tsunami of data, and ML requires big data, it does not follow that doing ML should now be obligatory. We need to think more carefully than that. An important initial reflection should be on the temptation to do ‘Big Data Science’ for the sake of doing ‘Big Data Science’. This is a prevalent confusion of means and ends: solutions in search of a problem; instruments confused with objectives. While there are many useful technologies, wisdom resides in knowing which to use, and when to (and not to) use them. True value lies in the quality of the results, not in being able to claim pride of place on the Data Science bandwagon. Notwithstanding the rare lucky shots, arbitrary applications of a technology more often than not have underwhelming results. “Give somebody a hammer, and he will treat everything as a nail” very often leads to “This hammer is no good at pounding this screw!” There are many and diverse sources of knowledge about individual statistical methods and applications, but “… the art of data analysis is about choosing and using multiple tools” (<a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank">Regression Modeling Strategies</a>, p. vii). True value will emerge from the judicious and appropriate application of tools for settled purposes. This is where the road map for choosing between ML and SM is useful.</p>
<p>The issue of a false dichotomy is moot: ML and SM are different. A better question may be: are there conditions and ways in which ML and SM can be complementary for specific purposes? Are there ways they can be combined? Are they compatible within the domain of modern applied practice? In the general domain of practice, SM and ML fully displace one another only from a perspective of chauvinistic zero-sum domination: they appear to compete only when their respective advantages under specific conditions and for specific purposes are not understood. Frank’s road map does much to resolve this.</p>

Road Map for Choosing Between Statistical Modeling and Machine Learning
http://fharrell.com/post/statml/
Mon, 30 Apr 2018 00:00:00 +0000
http://fharrell.com/post/statml/
<p class="rquote">
Machine learning (ML) may be distinguished from statistical models (SM) using any of three considerations:<br><b>Uncertainty</b>: SMs explicitly take uncertainty into account by specifying a probabilistic model for the data.<br><b>Structural</b>: SMs typically start by assuming additivity of predictor effects when specifying the model.<br><b>Empirical</b>: ML is more empirical, including allowance for high-order interactions that are not pre-specified, whereas SMs have identified parameters of special interest.<br><br>There is a growing number of hybrid methods combining characteristics of traditional SMs and ML, especially in the Bayesian world. Both SMs and ML can handle high-dimensional situations.
<br><br>
It is often good to let the data speak. But you must be comfortable in assuming that the data are speaking rationally. Data can fool you.<br><br>Whether using SM or ML, work with a methodologist who knows what she is doing, and don't begin an analysis without ample subject matter input.
</p>
<p>Data analysis methods may be described by their areas of applications, but for this article I’m using definitions that are strictly methods-oriented. A statistical model (SM) is a data model that incorporates probabilities for the data generating mechanism and has identified unknown parameters that are usually interpretable and of special interest, e.g., effects of predictor variables and distributional parameters about the outcome variable. The most commonly used SMs are regression models, which potentially allow for a separation of the effects of competing predictor variables. SMs include ordinary regression, Bayesian regression, semiparametric models, generalized additive models, longitudinal models, time-to-event models, penalized regression, and others. Penalized regression includes ridge regression, lasso, and elastic net. Contrary to what some machine learning (ML) researchers believe, SMs easily allow for complexity (nonlinearity and second-order interactions) and an unlimited number of candidate features (if penalized maximum likelihood estimation or Bayesian models with sharp skeptical priors are used). It is especially easy, using regression splines, to allow every continuous predictor to have a smooth nonlinear effect.</p>
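<p>The post mentions regression splines without showing their construction. The sketch below (my illustration, in Python; function names are mine) implements the standard restricted cubic spline basis: cubic between knots, but constrained to be linear beyond the outer knots, which keeps tail behavior stable.</p>

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for one continuous predictor.

    Returns [x, s_1(x), ..., s_{k-2}(x)] for k knots.  A regression on
    these columns gives the predictor a smooth nonlinear effect that
    is linear beyond the outer knots.
    """
    k = len(knots)
    t_last, t_penult = knots[-1], knots[-2]

    def pos3(u):                 # truncated cubic: max(u, 0)^3
        return max(u, 0.0) ** 3

    cols = [x]
    for t_j in knots[:k - 2]:
        cols.append(pos3(x - t_j)
                    - pos3(x - t_penult) * (t_last - t_j) / (t_last - t_penult)
                    + pos3(x - t_last) * (t_penult - t_j) / (t_last - t_penult))
    return cols
```

<p>With 4 knots this yields 3 columns per predictor (one linear, two nonlinear), so flexibility costs only a couple of extra parameters per continuous variable.</p>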
<p>ML is taken to mean an algorithmic approach that does not use traditional identified statistical parameters, and for which a preconceived structure is not imposed on the relationships between predictors and outcomes. ML usually does not attempt to isolate the effect of any single variable. ML includes random forests, recursive partitioning (CART), bagging, boosting, support vector machines, neural networks, and deep learning. ML does not model the data generating process but rather attempts to learn from the dataset at hand. ML is more a part of computer science than it is part of statistics. Perhaps the simplest way to distinguish ML from SMs is that SMs (at least in the regression subset of SM) favor additivity of predictor effects while ML usually does not give additivity of effects any special emphasis.</p>
<p>ML and AI have had their greatest successes in high signal:noise situations, e.g., visual and sound recognition, language translation, and playing games with concrete rules. What distinguishes these is quick feedback while training, and availability of <strong>the</strong> answer. Things are different in the low signal:noise world of medical diagnosis and human outcomes. A great use of ML is in pattern recognition to mimic radiologists’ expert image interpretations. For estimating the probability of a positive biopsy given symptoms, signs, risk factors, and demographics, not so much.</p>
<p>There are many published comparisons of predictive performance of SM and ML. In many of the comparisons, only naive regression methods are used (e.g., everything is assumed to operate linearly), so the SM comparator is nothing but a straw man. And not surprisingly, ML wins. The reverse also happens, where the ML comparator algorithm uses poorly chosen default parameters or the particular ML methods chosen for comparison are out of date. As a side note, when the SM method is just a straw man, the outcry from the statistical community is relatively muted compared with the outcry from ML advocates when the “latest and greatest” ML algorithm is not used in the comparison with SMs. ML seems to require more tweaking than SMs. But SMs often require a time-consuming data reduction step (unsupervised learning) when the number of candidate predictors is very large and penalization (lasso or otherwise) is not desired.</p>
<p>Note that there are ML algorithms that provide superior <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184" target="_blank">predictive discrimination</a> but that pay insufficient attention to <a href="http://fharrell.com/post/medml" target="_blank">calibration</a> (absolute accuracy).</p>
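<p>The discrimination/calibration distinction can be made concrete with a small synthetic simulation (my illustration; all names and numbers are arbitrary). Two sets of predicted probabilities with identical rank ordering have identical concordance (discrimination), yet one can be badly miscalibrated:</p>

```python
import random

def c_index(pred, y):
    """Concordance probability: chance that a randomly chosen event
    receives a higher prediction than a randomly chosen non-event."""
    pairs = conc = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] == 1 and y[j] == 0:
                pairs += 1
                if pred[i] > pred[j]:
                    conc += 1
                elif pred[i] == pred[j]:
                    conc += 0.5
    return conc / pairs

random.seed(1)
p_true = [random.uniform(0.05, 0.95) for _ in range(400)]
y = [1 if random.random() < p else 0 for p in p_true]

p_cal = p_true                     # well-calibrated predictions
p_mis = [p ** 3 for p in p_true]   # same ranking, poorly calibrated

# Concordance is rank-based, so it is identical for the two sets;
# calibration-in-the-large (mean prediction minus event rate) is not.
cal_err = sum(p_cal) / len(y) - sum(y) / len(y)
mis_err = sum(p_mis) / len(y) - sum(y) / len(y)
```

<p>An algorithm optimized only for a rank-based measure can therefore look excellent while its probabilities are systematically wrong.</p>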
<p>Because SMs favor additivity as a default assumption, when additive effects dominate, SM requires far lower sample sizes (typically 20 events per candidate predictor) than ML, which typically requires <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471228814137" target="_blank">200 events</a> per candidate predictor. Thus ML can sometimes create a demand for “big data” when small-to-moderate sized datasets will do. I sometimes dislike ML solutions for particular medical problems because of ML’s <strong>lack</strong> of assumptions. But SMs are not very good at reliably finding non-prespecified interactions; SM typically requires interactions to be prespecified. On the other hand, <a href="https://www.ahrq.gov" target="_blank">AHRQ</a>sponsored research I did on large medical outcomes datasets in the 1990s with the amazing University of Nevada Reno physician-statistician <a href="https://www.legacy.com/obituaries/rgj/obituary.aspx?n=philgoodman&pid=144885798" target="_blank">Phil Goodman</a>, whom we lost at an all-too-early age, demonstrated that important nonadditive effects are rare when predicting patient mortality. As a result, neural networks were no better than logistic regression in terms of predictive discrimination in these datasets.</p>
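<p>The events-per-predictor guideline above reduces to quick arithmetic. A rough back-of-envelope sketch (a heuristic, not a formal sample-size calculation; the study numbers are hypothetical):</p>

```python
import math

def events_needed(n_candidates, events_per_candidate):
    """Rough rule of thumb from the post: ~20 events per candidate
    predictor for a mostly additive SM vs ~200 for ML."""
    return n_candidates * events_per_candidate

def subjects_needed(events, event_rate):
    """Convert a required event count to subjects, given the event rate."""
    return math.ceil(events / event_rate)

# Hypothetical study: 30 candidate predictors, outcome rate 0.2.
sm_n = subjects_needed(events_needed(30, 20), 0.2)    # SM: 3,000 subjects
ml_n = subjects_needed(events_needed(30, 200), 0.2)   # ML: 30,000 subjects
```

<p>The tenfold gap is why ML can create a demand for “big data” where a moderate dataset would have sufficed for a regression model.</p>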
<p>There are many current users of ML algorithms who falsely believe that one can <a href="http://fharrell.com/post/mlsamplesize" target="_blank">make reliable predictions from complex datasets with a small number of observations</a>. Statisticians are pretty good at knowing the limitations imposed by the effective sample size, and at stopping short of incorporating model complexity that is not supported by the information content of the sample.</p>
<p>Here are some rough guidelines that attempt to help researchers choose between the two approaches, for a prediction problem<sup class="footnoteref" id="fnref:Notethatasdes"><a rel="footnote" href="#fn:Notethatasdes">1</a></sup>.</p>
<p><strong>A statistical model may be the better choice if</strong></p>
<ul>
<li>Uncertainty is inherent and the signal:noise ratio is not large—even with identical twins, one twin may get colon cancer and the other not; one should model tendencies instead of doing classification when there is randomness in the outcome</li>
<li>One doesn’t have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error</li>
<li>One wants to isolate effects of a small number of variables</li>
<li>Uncertainty in an overall prediction or the effect of a predictor is sought</li>
<li>Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be prespecified</li>
<li>The sample size isn’t huge</li>
<li>One wants to isolate (with a predominantly additive effect) the effects of “special” variables such as treatment or a risk factor</li>
<li>One wants the entire model to be interpretable</li>
</ul>
<p><strong>Machine learning may be the better choice if</strong></p>
<ul>
<li>The signal:noise ratio is large and the outcome being predicted doesn’t have a strong component of randomness; e.g., in visual pattern recognition an object must be an <code>E</code> or not an <code>E</code></li>
<li>The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)</li>
<li>Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)</li>
<li>One is not very interested in estimating uncertainty in forecasts or in effects of selected predictors</li>
<li>Non-additivity is expected to be strong and can’t be isolated to a few prespecified variables (e.g., in visual pattern recognition the letter <code>L</code> must have both a dominating vertical component <strong>and</strong> a dominating horizontal component <strong>and</strong> these two must intersect at their endpoints)</li>
<li>The sample size is <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471228814137" target="_blank">huge</a></li>
<li>One does not need to isolate the effect of a special variable such as treatment</li>
<li>One does not care that the model is a “black box”</li>
</ul>
<h2 id="editorialcomment">Editorial Comment</h2>
<p>Some readers have <a href="https://twitter.com/samfin55/status/991031725189984258" target="_blank">commented on Twitter</a> that I’ve created a false dichotomy of SMs vs. ML. There is some truth in this claim. The motivations for my approach to the presentation are</p>
<ul>
<li>to clarify that regression models are <strong>not</strong> ML<sup class="footnoteref" id="fnref:Thereisaninte"><a rel="footnote" href="#fn:Thereisaninte">2</a></sup></li>
<li>to sharpen the discussion by having a somewhat concrete definition of ML as a method without “specialness” of the parameters, that does not make many assumptions about the structure of predictors in relation to the outcome being predicted, and that does not explicitly incorporate uncertainty (e.g., probability distributions) into the analysis</li>
<li>to recognize that the bulk of machine learning being done today, especially in biomedical research, seems to be completely uninformed by statistical principles (much to its detriment IMHO), even to the point of many ML users not properly understanding predictive accuracy. It is impossible to have good predictions that address the problem at hand without a thorough understanding of measures of predictive accuracy when choosing the measure to optimize.</li>
</ul>
<p>Some definitions of ML and discussions about the definitions may be found <a href="https://www.techemergence.com/whatismachinelearning" target="_blank">here</a>, <a href="https://machinelearningmastery.com/whatismachinelearning" target="_blank">here</a>, and <a href="https://stackoverflow.com/questions/2620343" target="_blank">here</a>. I like the following definition from <a href="http://www.amazon.com/dp/0070428077?tag=inspiredalgor20" target="_blank">Tom Mitchell</a>: <em>The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.</em></p>
<p>The two fields may also be defined by how their practitioners spend their time. Someone engaged in ML will mainly spend her time choosing algorithms, writing code, specifying tuning parameters, waiting for the algorithm to run on a computer or cluster, and analyzing the accuracy of the resulting predictions. Someone engaged mainly in SMs will tend to spend time choosing a statistical model family, specifying the model, checking goodness of fit, analyzing accuracy of predictions, and interpreting estimated effects.</p>
<p>See <a href="https://twitter.com/f2harrell/status/990991631900921857" target="_blank">this</a> for more twitter discussions.</p>
<h2 id="furtherreading">Further Reading</h2>
<ul>
<li>Follow-up Article by Drew Levy: <a href="http://fharrell.com/post/statml2">Navigating Statistical Modeling and Machine Learning</a></li>
<li><a href="http://www2.math.uu.se/~thulin/mm/breiman.pdf" target="_blank">Statistical Modeling: The Two Cultures</a> by Leo Breiman <br><small>Note: I very much disagree with Breiman’s view that data models are not important. How would he handle truncated/censored data for example? I do believe that data models need to be flexible. This is facilitated by Bayesian modeling.</small></li>
<li><a href="https://jamanetwork.com/journals/jama/articleabstract/2675024" target="_blank">Big Data and Machine Learning in Health Care</a> by AL Beam and IS Kohane</li>
<li>Harvard Business Review article <a href="https://hbr.org/2016/12/whyyourenotgettingvaluefromyourdatascience" target="_blank">Why You’re Not Getting Value From Your Data Science</a>, about regression vs. machine learning in business applications</li>
<li><a href="http://www.sharpsightlabs.com/blog/differencemachinelearningstatisticsdatamining" target="_blank">What’s the Difference Between Machine Learning, Statistics, and Data Mining?</a></li>
<li><a href="https://jamanetwork.com/journals/jama/fullarticle/2683125" target="_blank">Big Data and Predictive Analytics: Recalibrating Expectations</a> by Shah, Steyerberg, Kent</li>
<li><a href="https://matloff.wordpress.com/2018/06/20/neuralnetworksareessentiallypolynomialregression" target="_blank">Neural Networks are Essentially Polynomial Regression</a> by Norman Matloff</li>
</ul>
<h4 id="footnotes">Footnotes</h4>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Notethatasdes">Note that as described <a href="http://fharrell.com/post/classification" target="_blank">here</a>, it is not appropriate to cast a prediction problem as a classification problem except in special circumstances that usually entail instant visual or sound pattern recognition requirements in a high signal:noise situation where the utility/cost/loss function cannot be specified. ML practitioners frequently misunderstand this, leading them to use <a href="http://www.fharrell.com/post/classdamage" target="_blank">improper accuracy scoring rules</a>. <a class="footnotereturn" href="#fnref:Notethatasdes"><sup>^</sup></a></li>
<li id="fn:Thereisaninte">There is an intersection of ML and regression in neural networks. See <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.4780140108" target="_blank">this article</a> for more. <a class="footnotereturn" href="#fnref:Thereisaninte"><sup>^</sup></a></li>
</ol>
</div>

Musings on Multiple Endpoints in RCTs
http://fharrell.com/post/ymult/
Mon, 26 Mar 2018 00:00:00 +0000
http://fharrell.com/post/ymult/
<p class="rquote">
Learning is more productive than avoiding mistakes. And if one wishes to just avoid mistakes, make sure the mistakes are real. Question whether labeling endpoints is productive, and whether type I error risks are valuable in quantifying evidence for effects and should interfere with asking questions.
</p>
<p>The <a href="https://www.nhlbi.nih.gov" target="_blank">NHLBI</a>-funded <a href="https://www.ischemiatrial.org" target="_blank">ISCHEMIA</a> multinational randomized clinical trial<sup class="footnoteref" id="fnref:DisclosureIlea"><a rel="footnote" href="#fn:DisclosureIlea">1</a></sup> is designed to assess the effect of a cardiac catheterization-guided coronary revascularization strategy (which includes optimal medical management) compared to optimal medical management alone (with cardiac cath reserved for failure of medical therapy) for patients with stable coronary artery disease. It is unique in that the use of cardiac catheterization is randomized, so that the entire “strategy pipeline” can be studied. Previous studies performed randomization after catheterization results were known, allowing the so-called “oculostenotic reflex” of cardiologists to influence adherence to randomization to a revascularization procedure.</p>
<p>As well summarized <a href="https://www.tctmd.com/news/ischemiafracasamidchargesmovinggoalpostsinvestigatorscomeoutswinging" target="_blank">here</a> and <a href="http://circoutcomes.ahajournals.org/content/11/4/e004744" target="_blank">here</a>, the ISCHEMIA trial recently created a great deal of discussion in the cardiology community when the primary outcome was changed from cardiovascular death or nonfatal myocardial infarction to a 5-category endpoint that also includes hospitalization for unstable angina or heart failure, and resuscitated cardiac arrest. The 5-component endpoint was the trial’s original primary endpoint and was the basis for the NIH grant funding. The possibility and procedure for reverting to this 5-component endpoint was thought out even before the study began. The change was pragmatic, as is usually the case: the accrual and event rates seldom go as hoped. The main concern in the cardiology community is the use of so-called “soft” endpoints. The original two-component endpoint is now an important secondary endpoint.</p>
<p>The purpose of this article is not to discuss ISCHEMIA but to discuss the general study design, endpoint selection, and analysis issues ISCHEMIA raises that apply to a multitude of trials.</p>
<h1 id="powervoodooevenwithonlyoneendpointsizingastudyischallenging">Power Voodoo: Even With Only One Endpoint, Sizing a Study is Challenging</h1>
<p>Before discussing power, recall that the type I error α is the probability (risk) of making an assertion of a nonzero effect when the true effect is zero. In any given study we don’t know if a type I error has been committed. A type I error is not an error in the usual sense; it is a long-run operating characteristic, i.e., the chance of someone observing data <strong>more</strong> extreme than ours if they could indefinitely repeat our experiment but with a treatment effect of exactly zero magically inserted. Type I error is the chance of making an <strong>assertion</strong> of efficacy <em>in general</em>, when there is no efficacy.</p>
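<p>The “long-run operating characteristic” interpretation can be demonstrated directly by simulation (my illustrative sketch, using a simple one-sample z-test): with the true effect fixed at exactly zero, assertions of efficacy occur at close to the nominal α over many repetitions.</p>

```python
import math
import random

def z_test_p(sample, sd=1.0):
    """Two-sided p-value for H0: mean = 0 with known sd (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) / (sd / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
reps, n, alpha = 4000, 25, 0.05

# The true effect is exactly zero in every repetition, so every
# assertion of efficacy below is, by construction, a type I error.
false_assertions = sum(
    z_test_p([random.gauss(0, 1) for _ in range(n)]) < alpha
    for _ in range(reps)
)
type1_rate = false_assertions / reps   # long-run rate, close to alpha
```

<p>Nothing in any single repetition tells you whether its assertion was an error; only the ensemble has the α property.</p>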
<p>Power calculations, and sample size calculations based on power, have long been thought by statisticians to be more voodoo than science. Besides all the problems related to null hypothesis testing in general, and arbitrariness in the setting of α and power (1  type II error β), a significant difficulty and chance for arbitrariness is the choice of the effect size δ to detect with probability 1  β. δ is invariably manipulated, at least partly, to result in a sample size that meets budget constraints. What if instead a fully <a href="http://fharrell.com/post/bayesseq">sequential trial</a> were done and budgeting were incremental depending on the promise shown by current results? δ could be held at the original effect size determined by clinical experts, and a Bayesian approach could be used in which no single δ was assumed. Promising evidence for a more-than-clinically-trivial effect could result in the release of more funds<sup class="footnoteref" id="fnref:Thissequential"><a rel="footnote" href="#fn:Thissequential">2</a></sup>. Total program costs could even be reduced, by more quickly stopping studies with a high risk of being futile. A sequential approach makes it less necessary to change an endpoint for pragmatic reasons once the study begins. So would adoption of a Bayesian approach to evidence generation, as a replacement for null hypothesis significance testing. If one “stuck it out” with the original endpoint no matter what the accrual and event frequency, and found that the treatment efficacy assessment is not “definitive” but that the posterior probability of efficacy was 0.93 at the planned study end, many would regard the result as providing good evidence (i.e., a betting person would not make money by betting against the new treatment).
On the other hand, p > 0.05 would traditionally be seen as “the study is uninformative since statistical significance was not achieved<sup class="footnoteref" id="fnref:Interpretingthe"><a rel="footnote" href="#fn:Interpretingthe">3</a></sup>.” To some extent the perceived need to change endpoints in a study<sup class="footnoteref" id="fnref:Whichisethical"><a rel="footnote" href="#fn:Whichisethical">4</a></sup> occurs because study leaders and especially sponsors are held hostage by the null hypothesis significance testing/power paradigm.</p>
<p>Speaking of Bayes and sample size calculations, the Bayesian philosophy is to not have any unknowns in any calculation. Posterior probabilities are conditional on current cumulative data and do not use a single value for δ. An entire prior distribution is used for δ. By allowing for uncertainty in δ, Bayesian power calculations are more honest than frequentist calculations.
Some useful references are <a href="http://www.citeulike.org/search/username?q=tag%3Asamplesize*+%26%26+tag%3Abayes*&search=Search+library&username=harrelfe" target="_blank">here</a>.</p>
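<p>As a concrete, purely hypothetical illustration of allowing uncertainty in δ: one can average the frequentist power curve over a prior distribution for δ, an approach sometimes called <em>assurance</em>. All numbers below are made up; this is a sketch, not a design tool.</p>

```python
import numpy as np
from math import erf, sqrt

# Standard normal CDF, vectorized so it accepts arrays of effect sizes
phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0))))

def power(delta, n, z=1.959963984540054):
    """Two-sided two-sample z-test power, outcome SD 1, n per arm."""
    se = sqrt(2.0 / n)
    return phi(delta / se - z) + phi(-delta / se - z)

rng = np.random.default_rng(1)
n = 175
fixed_power = float(power(0.3, n))          # power at a single assumed delta
deltas = rng.normal(0.3, 0.15, 100_000)     # prior uncertainty about delta
assurance = float(power(deltas, n).mean())  # power averaged over the prior
print(round(fixed_power, 3), round(assurance, 3))
```

<p>When the fixed-δ power is high, the prior-averaged power is typically lower, which is one reason single-δ calculations tend to flatter the design.</p>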
<p>One of the challenges in power and sample size calculations, and knowing when to stop a study, is that there are competing goals. One might be interested in concluding any of the following:</p>
<ul>
<li>the treatment is beneficial (working in the right direction)</li>
<li>the treatment is more than trivially beneficial</li>
<li>the estimate of the magnitude of the treatment effect has sufficient precision (e.g., the multiplicative margin of error in a hazard ratio)</li>
</ul>
<p>In the frequentist domain, planning studies around <a href="http://www.citeulike.org/search/username?q=tag%3Aprecision&search=Search+library&username=harrelfe" target="_blank">precision</a> frees the researcher from having to choose δ. The ISCHEMIA study, in addition to doing traditional power calculations, also emphasized having a sufficient sample size to estimate the hazard ratio for the two most important endpoints to within an adequate multiplicative margin of error with 0.95 confidence. Bayesian precision can likewise be determined using the half width of the 0.95 credible interval for the treatment effect<sup class="footnoteref" id="fnref:Iftherearemul"><a rel="footnote" href="#fn:Iftherearemul">5</a></sup>.</p>
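<p>As a rough illustration of precision-based planning, the number of events needed for a given multiplicative margin of error (MMOE) of a hazard ratio can be obtained from the standard approximation that the variance of the log hazard ratio is about 4/d for d total events under 1:1 randomization. The function name below is mine and the calculation is a sketch, not a full design:</p>

```python
from math import ceil, log

def events_for_hr_mmoe(mmoe, z=1.959963984540054):
    """Total events d so that a 0.95 CI for a hazard ratio has
    multiplicative margin of error <= mmoe, using the approximation
    var(log HR) ~= 4/d for 1:1 randomization.  The CI half-width on
    the log scale is z * 2/sqrt(d), so we solve z*2/sqrt(d) <= log(mmoe)."""
    return ceil((2 * z / log(mmoe)) ** 2)

for m in (1.5, 1.25, 1.2, 1.1):
    print(m, events_for_hr_mmoe(m))
```

<p>Tighter margins are expensive: halving the margin on the log scale roughly quadruples the required number of events.</p>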
<h1 id="multipleendpointsandendpointprioritization">Multiple Endpoints and Endpoint Prioritization</h1>
<p>To a large extent, the perceived need to adjust/penalize for asking multiple questions (about multiple endpoints) or at least the need for prioritization of endpoints arises from the perceived need to control overall type I error (also known as α spending). The chance of making an “effectiveness” assertion if any of three endpoints shows evidence against a null hypothesis is greater than α for any one endpoint. As an aside, <a href="http://www.citeulike.org/user/harrelfe/article/13263921" target="_blank">Cook and Farewell</a> give a persuasive argument for prioritization of endpoints but not adjusting their p-values for multiplicity when one is asking separate questions regarding the endpoints<sup class="footnoteref" id="fnref:Thatiswhenthe"><a rel="footnote" href="#fn:Thatiswhenthe">6</a></sup>. Think of prioritization of endpoints as prespecification of the order for publication and how the study results are publicized. It is OK to announce a “significant” third endpoint as long as the “insignificant” first and second endpoints are announced first, and the context for the third endpoint is preserved.</p>
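<p>The arithmetic behind the α-spending concern is simple: with independent endpoints each tested at level α, the chance of at least one “effectiveness” assertion when every null hypothesis is true grows quickly with the number of endpoints:</p>

```python
def p_any_assertion(k, alpha=0.05):
    """Chance of at least one 'significant' result among k independent
    endpoint tests when all null hypotheses are true."""
    return 1 - (1 - alpha) ** k

for k in (1, 2, 3, 5):
    print(k, round(p_any_assertion(k), 4))
```

<p>With three independent endpoints at α = 0.05 the family-wise chance is about 0.14, nearly triple the per-endpoint α.</p>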
<p>Having been privy to dozens of hours of discussions among clinical trialists during protocol writing for many randomized clinical trials, I can confidently say that the reasoning for the final choices comes from a mixture of practical, clinical, and patient-oriented considerations, perhaps with too much emphasis on the pragmatic statistical question “for which endpoint that the treatment possibly affects are we likely to have a sufficient number of events?”. Though statistical considerations are important, this approach is not fully satisfying because</p>
<ul>
<li>the final choices remain too arbitrary and are not purely clinically/public health motivated</li>
<li>binary endpoints <a href="http://fharrell.com/post/ordinalinfo">are not statistically efficient anyway</a></li>
<li>using separate binary endpoints does not combine the endpoints into an overall patient-utility scale<sup class="footnoteref" id="fnref:Anordinalwhat"><a rel="footnote" href="#fn:Anordinalwhat">7</a></sup>.</li>
</ul>
<p>Having multiple prespecified endpoints also sets the stage for a blinded committee to change the endpoint priority for pragmatic reasons, related to the “slavery to statistical power and null hypothesis testing” discussed above.</p>
<p>It is important to note for ISCHEMIA and in general that having a primary endpoint does not prevent anyone interpreting the study’s final result from emphasizing a secondary or tertiary endpoint.</p>
<h1 id="jointmodelingofmultipleendpoints">Joint Modeling of Multiple Endpoints</h1>
<p>Joint modeling of multiple outcomes allows uncovering relationships of multiple outcome variables, and quantifying joint evidence for all outcomes simultaneously, while providing the usual marginal outcome evidence (for each outcome separately). As discussed <a href="http://fharrell.com/post/bayesfreqstmts">here</a> and <a href="http://fharrell.com/post/journey">here</a>, Bayesian posterior inference has many advantages in this context. For example, the final analysis of a clinical trial with three endpoints E<sub>1</sub>, E<sub>2</sub>, E<sub>3</sub> might be based on posterior probabilities of the following forms:</p>
<table>
<thead>
<tr>
<th>Posterior probability</th>
<th>Interpretation</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prob(E<sub>1</sub> > 0 or E<sub>2</sub> > 0 or E<sub>3</sub> > 0)</td>
<td>Prob(efficacy) on <strong>any</strong> endpoint</td>
</tr>
<tr>
<td>Prob(E<sub>1</sub> > 0 and E<sub>2</sub> > 0)</td>
<td>Prob(efficacy) on both of the first two endpoints</td>
</tr>
<tr>
<td>Prob(E<sub>1</sub> > 0 or (E<sub>2</sub> > 3 and E<sub>3</sub> > 4))</td>
<td>Prob(any mortality reduction or large reductions on two nonfatal endpoints)</td>
</tr>
<tr>
<td>Prob(at least two of E<sub>1</sub> > 0, E<sub>2</sub> > 0, E<sub>3</sub> > 0)</td>
<td>Prob(hitting any two of the three efficacy targets)</td>
</tr>
<tr>
<td>Prob(−1 < E<sub>1</sub> < 1)</td>
<td>Prob(similarity of E<sub>1</sub> outcome)</td>
</tr>
</tbody>
</table>
<p>One can readily see that once you get away from null hypothesis testing, many clinically relevant possibilities exist, and multiplicity considerations are cast aside. A reasonable strategy would be to demand an extra-high probability of hitting any one of three targets, or a somewhat lower probability of hitting any two of the three targets. More about this way of thinking may be found <a href="http://fharrell.com/post/bayesfreqstmts">here</a><sup class="footnoteref" id="fnref:Justasonecan"><a rel="footnote" href="#fn:Justasonecan">8</a></sup>.</p>
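<p>Given draws from the joint posterior distribution of the three effects (however they were obtained), each row of the table above is a one-line computation. A sketch using simulated, purely illustrative posterior draws (means, SDs, and correlation are made up):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint posterior for three treatment effects E1, E2, E3:
# an approximately multivariate normal posterior with common correlation 0.6
mean = np.array([1.0, 2.5, 3.0])
sd = np.array([0.8, 1.5, 2.0])
corr = 0.6 * np.ones((3, 3)) + 0.4 * np.eye(3)
cov = corr * np.outer(sd, sd)
E1, E2, E3 = rng.multivariate_normal(mean, cov, size=200_000).T

p_any   = np.mean((E1 > 0) | (E2 > 0) | (E3 > 0))           # efficacy on any
p_both  = np.mean((E1 > 0) & (E2 > 0))                      # first two jointly
p_mixed = np.mean((E1 > 0) | ((E2 > 3) & (E3 > 4)))         # compound assertion
p_two   = np.mean(((E1 > 0).astype(int) + (E2 > 0) + (E3 > 0)) >= 2)
p_sim   = np.mean((-1 < E1) & (E1 < 1))                     # similarity of E1
print(p_any, p_both, p_mixed, p_two, p_sim)
```

<p>No adjustment machinery is needed: each quantity is simply the posterior mass of the region of parameter space that defines the assertion.</p>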
<p>Posterior probabilities also provide the direct forward predictive type of evidence that leads to optimum decisions. Barring cost considerations, a treatment that has a 0.93 chance of reducing mortality may be deemed worthwhile, especially if a skeptical prior was used.</p>
<p>Joint Bayesian modeling of multiple endpoints also allows one to uncover interrelationships among the endpoints as described in the recent paper by <a href="https://onlinelibrary.wiley.com/doi/abs/10.1002/pst.1852" target="_blank">Costa and Drury</a>. One of the methods proposed by the authors, the one based on a multivariate copula, has several advantages. First, one obtains the usual marginal treatment effects on each endpoint separately. Second, the Bayesian analysis they describe allows one to estimate the amount of dependence between two endpoints, which is interesting in its own right and will help in estimating power when planning future studies. Third, the amount of such dependence can be allowed to vary by treatment. For example, if one endpoint is an efficacy endpoint (or continuous measurement) and another is the occurrence of an adverse event, placebo subjects may randomly experience the adverse event such that there is no within-person correlation between it and the efficacy response. On the other hand, subjects on the active drug may experience the efficacy and safety outcomes together. E.g., subjects getting the best efficacy response may be those with more adverse events. Estimation of between-outcome dependencies is of real clinical interest.</p>
<p>Most importantly, Bayesian analysis of clinical trials, when multiple endpoints are involved, allows the results for each endpoint to be properly interpreted marginally. That is because the prior state of knowledge, which may reasonably be encapsulated into a skeptical prior (i.e., a prior distribution that assumes large treatment effects are unlikely) leads to a posterior probability of efficacy for each endpoint that is straightforwardly interpreted regardless of context. Because Bayes deals with <a href="http://fharrell.com/post/pvalprobs">forward probabilities</a>, these posterior probabilities of efficacy are calibrated by their priors. For example, the skepticism with which we view efficacy of a treatment on endpoint E<sub>2</sub> comes from the data about the E<sub>2</sub> effect and the prior skepticism about the E<sub>2</sub> effect, no matter what the effect on E<sub>1</sub>. This way of thinking shows clearly the value of trying to learn more from a study by asking multiple questions. One should not be penalized for curiosity.</p>
<h3 id="furtherreading">Further Reading</h3>
<ul>
<li><a href="https://jamanetwork.com/journals/jama/articleabstract/185214?redirect=true" target="_blank">Composite end points in randomized trials: There is no free lunch</a></li>
<li><a href="http://www.citeulike.org/user/harrelfe/tag/multipleendpoints" target="_blank">Miscellaneous papers</a> on multiple endpoints</li>
</ul>
<h3 id="footnotes">Footnotes</h3>
<div class="footnotes">
<hr />
<ol>
<li id="fn:DisclosureIlea">Disclosure: I lead the independent statistical team at Vanderbilt that supports the DSMB for the ISCHEMIA trial and was involved in the trial design. As the DSMB reporting statistician I am unblinded to treatment assignment and outcomes. I had no interaction with the blinded independent advisory committee that recommended the primary endpoint change or with the process that led to that recommendation. <a class="footnotereturn" href="#fnref:DisclosureIlea"><sup>^</sup></a></li>
<li id="fn:Thissequential">This sequential funding approach assumes that outcomes occur quickly enough to influence the assessment. <a class="footnotereturn" href="#fnref:Thissequential"><sup>^</sup></a></li>
<li id="fn:Interpretingthe">Interpreting the p-value in conjunction with a 0.95 confidence interval would help, but there are two problems. First, most users of frequentist theory are hung up on the p-value. Second, the confidence interval has endpoints that are not controllable by the user, in contrast to Bayesian posterior probabilities of treatment effects being available for any user-specified interval endpoints. For example, one may want to compute Prob(blood pressure reduction > 3 mmHg). <a class="footnotereturn" href="#fnref:Interpretingthe"><sup>^</sup></a></li>
<li id="fn:Whichisethical">Which is ethically done, by decision makers using only pooled treatment data. <a class="footnotereturn" href="#fnref:Whichisethical"><sup>^</sup></a></li>
<li id="fn:Iftherearemul">If there are multiple data looks, the traditional frequentist confidence interval is no longer valid and a complicated adjustment is needed. The adjusted confidence interval would be seen by Bayesians as conservative. <a class="footnotereturn" href="#fnref:Iftherearemul"><sup>^</sup></a></li>
<li id="fn:Thatiswhenthe">That is, when the endpoint comparisons are to be interpreted marginally. <a class="footnotereturn" href="#fnref:Thatiswhenthe"><sup>^</sup></a></li>
<li id="fn:Anordinalwhat">An ordinal “what’s the worst thing that happened to the patient” scale would have few assumptions, would increase power, and would give credit to a treatment that has more effect on more serious outcomes than it has on less serious ones. <a class="footnotereturn" href="#fnref:Anordinalwhat"><sup>^</sup></a></li>
<li id="fn:Justasonecan">Just as one can compute the probability of rolling a six on either of two dice, or rolling a total greater than 9, direct predictive-mode probabilities may be computed as often as desired with no multiplicity. Multiplicity with backwards probabilities comes from giving data more chances to be extreme (the frequentist sample space) and not from the chances you give more efficacy parameters to be positive. <a class="footnotereturn" href="#fnref:Justasonecan"><sup>^</sup></a></li>
</ol>
</div>

Improving Research Through Safer Learning from Data
http://fharrell.com/post/improveresearch/
Thu, 08 Mar 2018 00:00:00 +0000
http://fharrell.com/post/improveresearch/
<h1 id="overview">Overview</h1>
<p>There are two broad classes of data analysis. The first class, exploratory data analysis, attempts to understand the data at hand, i.e., to understand <em>what happened</em>, and can use descriptive statistics, graphics, and other tools, including multivariable statistical models<sup class="footnoteref" id="fnref:Whenmanyvariab"><a rel="footnote" href="#fn:Whenmanyvariab">1</a></sup>. The second broad class of data analysis is inferential analysis which aims to provide evidence and assist in judgments about the process generating the one dataset. Here the interest is in generalizability, and a statistical model is not optional<sup class="footnoteref" id="fnref:Everystatistica"><a rel="footnote" href="#fn:Everystatistica">2</a></sup>. Sometimes this is called population inference but it can be thought of less restrictively as understanding the data generating process. Also there is prediction, which is a mode of inference distinct from judgment and decision making.</p>
<p>The following discussion concentrates on inference, although several of the concepts, especially measurement accuracy, fully pertain to exploratory data analysis.</p>
<p>The key elements of learning from data using statistical inference involve the following:</p>
<ol>
<li>prespecification if doing formal inference, intending to publish, or intending to be reviewed by regulatory authorities</li>
<li>choosing an experimental design</li>
<li>considering the spectrum of per-subject information available</li>
<li>considering information content, bias, and precision in measurements</li>
<li>understanding variability in measurements</li>
<li>specification of the statistical model</li>
<li>incorporating beliefs of judges/regulators/consumers into the model parameters if Bayesian</li>
<li>incorporating beliefs of judges/regulators/consumers into model interpretation if frequentist</li>
<li>using the model to quantify evidence</li>
<li>replication/validation, when needed</li>
<li>translating the evidence to a decision or an action</li>
</ol>
<h1 id="prespecification">Prespecification</h1>
<p>Prespecification of the study design and analysis is an incredibly important component of reproducible research. It is necessary unless one is engaging in exploratory learning (especially in the initial phase of research) and not intending for the results to be considered confirmatory. Prespecification controls investigator degrees of freedom (see below) and keeps the investigator from entering the <a href="http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf" target="_blank">garden of forking paths</a>. A large fraction of studies that failed to validate can be traced to the nonexistence of a prospective, specific data transformation and statistical analysis plan. Randomized clinical trials require almost complete prespecification. Animal and observational human subjects research does not enjoy the same protections, and many an experiment has resulted in statistical disappointment that tempted the researcher to modify the analysis, choice of response variable, sample membership, computation of derived variables, normalization method, etc. The use of cutoffs on p-values causes a large portion of this problem.</p>
<p>Frequentist and Bayesian analysis are alike with regard to the need for prespecification. But a Bayesian approach has an advantage here: you can include parameters for what you don’t know and hope you don’t need (but are not sure). For example, one could specify a model in which a dose-response relationship is linear, but add a parameter that allows a departure from linearity. One can hope that interaction between treatment and race is absent, but <a href="https://www.ncbi.nlm.nih.gov/pubmed/9192445" target="_blank">include parameters allowing for such interactions</a>. In these two examples, skeptical prior distributions for the “extra” parameters would favor a linear dose-response or absence of interaction, but as the sample size increases the data would allow these parameters to “float” as needed. Bayes still provides accurate inferences when one is not sure of the model. This is discussed further below. Two-stage analyses as typically employed in the frequentist paradigm (e.g., pretesting for linearity of dose-response) do not control type I error.</p>
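<p>The shrinkage-then-“float” behavior described above can be seen in the simplest conjugate normal setting. The numbers are made up; the point is only that a skeptical prior dominates when the estimate of the extra parameter (say, an interaction) is imprecise, and wears off as the estimate becomes precise:</p>

```python
def posterior_normal(prior_mean, prior_sd, est, se):
    """Conjugate normal-normal update: combine a skeptical prior on an
    'extra' parameter (e.g., a treatment-by-race interaction) with its
    approximately normal estimate.  Returns (posterior mean, posterior SD)."""
    prec = 1 / prior_sd**2 + 1 / se**2          # precisions add
    post_var = 1 / prec
    post_mean = post_var * (prior_mean / prior_sd**2 + est / se**2)
    return post_mean, post_var**0.5

# Small study: interaction estimate 0.8 with SE 0.6, skeptical prior N(0, 0.25)
m_small, s_small = posterior_normal(0.0, 0.25, 0.8, 0.6)
# Large study: same estimate but SE 0.1 -- the data now dominate the prior
m_large, s_large = posterior_normal(0.0, 0.25, 0.8, 0.1)
print(m_small, m_large)
```

<p>In the small study the posterior mean is shrunk most of the way toward zero; in the large study it sits close to the raw estimate, exactly the “float as needed” behavior.</p>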
<h1 id="experimentaldesign">Experimental Design</h1>
<p>The experimental design is all important, and is what allows interpretations to be causal. For example, in comparing two treatments there are two types of questions:</p>
<ol>
<li>Did treatment B work better in the group of patients receiving it in comparison to those patients who happened to receive treatment A?</li>
<li>Would this patient fare better were <em>she</em> given treatment B vs. were <em>she</em> given treatment A?</li>
</ol>
<p>The first question is easy to answer using statistical models (estimation or prediction), not requiring any understanding of physicians’ past treatment choices. The second question is one of causal inference, and it is impossible for observational data to answer that question without additional unverifiable assumptions. (Compare this to a randomized crossover study where the causal question can be almost directly answered.)</p>
<p>In a designed experiment, the experimenter usually knows exactly which variables to measure, and some of the variables are completely controlled. For example, in a 3×3 randomized factorial design, two factors are each experimentally set to three different levels, giving rise to 9 controlled combinations. The experiment can block on yet other factors to explain outcome variation caused by them. In a randomized crossover study, an investigator can estimate causal treatment differences per subject if carryover effects are washed out. In an observational therapeutic effectiveness study it is imperative to measure a long list of relevant variables that explain outcomes <strong>and</strong> treatment choices. Even this does not guarantee an ability to answer the causal therapeutic question, but having a wide spectrum of accurately collected baseline data is required to begin the process. Other design elements of observational studies are extremely important, including such aspects as when variables are measured, which subjects are included, what is the meaning of “time zero”, and how one avoids losses to follow-up.</p>
<h1 id="measurementsandunderstandingvariability">Measurements and Understanding Variability</h1>
<p>Understanding what measurements really mean, what they do not capture, minimizing systematic bias, minimizing measurement error, and maximizing data resolution are key to optimizing statistical power and soundness of inference. Resolution is related to data acquisition, variable definitions, and measurement errors. Optimal statistical information comes from continuous measurements whose measurement errors are small.</p>
<p>Understanding sources of variability and incorporating those into the experimental design and the statistical model are important. What is the disagreement in technical replicates (e.g. splitting one blood sample into two and running both through a blood analyzer)? Are there batch effects? Edge effects in a gene microarray? Variation due to different temperatures in the lab each day? Do patients admitted on Friday night inherently have longer hospital stays? Other day of week effects? Seasonal variation and other longterm time effects? How about region, country, and lab variation?</p>
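<p>Many of these questions come down to estimating variance components. A minimal method-of-moments sketch for technical replicates nested within batches, using simulated data with made-up standard deviations:</p>

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated assay: 10 batches, 5 technical replicates each
batches, reps = 10, 5
batch_effect = rng.normal(0, 2.0, size=(batches, 1))   # between-batch SD 2
y = 50 + batch_effect + rng.normal(0, 1.0, size=(batches, reps))  # within SD 1

# One-way ANOVA method-of-moments variance components:
ms_within = y.var(axis=1, ddof=1).mean()        # estimates sigma_w^2
ms_between = reps * y.mean(axis=1).var(ddof=1)  # estimates sigma_w^2 + reps*sigma_b^2
var_within = ms_within
var_between = max((ms_between - ms_within) / reps, 0.0)
print(round(var_within, 2), round(var_between, 2))
```

<p>If the between-batch component dominates, adding replicates within a batch buys little; the design needs more batches (or the batch effect in the model).</p>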
<h1 id="beliefsmatterwheninterpretingresultsorquantifyingabsoluteevidence">Beliefs Matter When Interpreting Results or Quantifying Absolute Evidence</h1>
<p>Notice the inclusion of <em>beliefs</em> in the original list. Frequentists operate under an air of objectivity and believe that beliefs are not relevant. This is an illusion, for four reasons.</p>
<ol>
<li>IJ Good showed that all probabilities are subjective because they depend on the knowledge of the observer. One of his examples is that a card player who knows that a certain card is sticky will know a different probability that the card will be at the top of the deck than will a player who doesn’t know that.<br /></li>
<li>To compute p-values, one <strong>must</strong> know the <em>intentions</em> of the investigator. Did she intend to study 90 patients and happened to observe 10 bad outcomes, or did she intend to sample patients until 10 outcomes happened? Did she intend to do an early data look? Did she actually do an early data look but first wrote an affidavit affirming that she would not take any action as a result of the look? Did she intend to analyze three dependent variables and was the one reported the one she would have reported even had she looked at the data for all three? All of these issues factor into computation of a p-value.</li>
<li>The choice of the statistical model is always subjective (more below).</li>
<li>Interpretations are subjective. Do you multiplicity-adjust a p-value? Using which of the competing approaches? What if other studies have results that are inconsistent with the new study? How do we discount the current p-value for that? But most importantly, the necessary conversion of a frequentist probability of data given a hypothesis into evidence about the hypothesis is entirely subjective.</li>
</ol>
<p>Bayesian inference gets criticized for being subjective when in fact its distinguishing feature is that it is stating subjective assumptions clearly.</p>
<h1 id="specificationofthestatisticalmodel">Specification of the Statistical Model</h1>
<p>The statistical model is the prior belief about the <em>structure</em> of the problem. Outside of mathematics and physics, its choice is all too arbitrary, and statistical results and their interpretation depend on this choice. This applies equally to Bayesian and frequentist inference. The model choice has more impact than the choice of a prior distribution; the model choice does not “wear off” nearly as much as the prior does as the sample size gets large.</p>
<p><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5122713" target="_blank">Investigator degrees of freedom</a> greatly affect the reliability and generalizability of scientific findings. This applies to measurement, experimental design, choice of variables, the statistical model, and other facets of the research process. Turning to just the statistical model, there are aspects of modeling about which we continually delude ourselves in such a way as to have false confidence in results. This happens primarily in two ways. Either the investigator “plays with the data” to try different models and uses only the apparently best-fitting one, making confidence and credible intervals too narrow and p-values and standard errors too small, or she selects a model a priori and hopes that it fits “well enough”. The latter occurs even in confirmatory studies with rigorous prespecification of the analysis. Whenever we use a model that makes an assumption about data structure, including assumptions about linearity, interactions, which variables to include, the shape of the distribution of the response given covariates, constancy of variance, etc., the inference is conditional on all those assumptions being true. The Bayesian approach provides an out: make all the assumptions you want, but allow for departures from those assumptions. If the model contains a parameter for everything we know we don’t know (e.g., a parameter for the ratio of variances in a two-sample t-test), the resulting posterior distribution for the parameter of interest will be flatter, credible intervals wider, and confidence intervals wider. This makes them more likely to lead to the correct interpretation, and makes the result more likely to be reproducible.</p>
<p>Consider departures from the normality assumption.
<a href="http://onlinelibrary.wiley.com/book/10.1002/9781118033197" target="_blank">Box and Tiao</a> show how to elegantly allow for non-normality in a Bayesian two-sample t-test. This is done by allowing the data distribution to have a kurtosis (tail heaviness) different from what the normal curve allows. They place a prior distribution on the kurtosis parameter favoring normality, but as the sample size increases, less and less normality is assumed. When the data indicate that the tails are heavier than Gaussian, they showed that the resulting point estimates of the two means are very similar to trimmed means. In the same way, one could specify a prior distribution for the ratio of variances that favors 1.0, a prior for the degree of interaction between treatment and a baseline variable that favors no interaction, and a prior for the degree of non-linearity in an age effect that favors linearity at very small sample sizes. The posterior distribution for the main parameter of interest will reflect all of these uncertainties in an honest fashion. This is related to penalized maximum likelihood estimation or shrinkage in the frequentist domain<sup class="footnoteref" id="fnref:Thefrequentist"><a rel="footnote" href="#fn:Thefrequentist">3</a></sup>.</p>
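<p>The flavor of the Box and Tiao result can be seen with a deliberately simplified version: a fixed heavy-tailed (Student-t) likelihood with known tail weight and scale, rather than their full model with a prior on kurtosis. Even this stripped-down sketch (made-up data) downweights a gross outlier in the way a trimmed mean would:</p>

```python
import numpy as np

# Seven well-behaved observations plus one gross outlier
y = np.array([0.1, -0.4, 0.3, 0.2, -0.2, 0.0, 0.4, 8.0])

def t_loglik(mu, nu=4.0, scale=1.0):
    """Log-likelihood of a location mu under a Student-t(nu) data
    distribution with fixed scale (constants dropped)."""
    z = (y - mu) / scale
    return -0.5 * (nu + 1) * np.log1p(z**2 / nu).sum()

# Maximize over a grid of candidate locations
grid = np.linspace(-2, 8, 4001)
mu_t = grid[np.argmax([t_loglik(m) for m in grid])]
print(y.mean(), mu_t)   # the mean is dragged toward the outlier; mu_t is not
```

<p>The sample mean is pulled to about 1.05 by the single outlier, while the heavy-tailed location estimate stays near the bulk of the data.</p>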
<p>Besides the ability to handle more uncertainties, the Bayesian paradigm provides <a href="http://fharrell.com/post/bayesfreqstmts">direct evidentiary statements</a> such as the probability that the treatment reduces blood pressure. This is in contrast with the frequentist paradigm, which results in a probability of getting observed effects greater than what we observed were the true effect exactly zero, the model correct, and the same experiment (other than changing H<sub>0</sub> to be true) were to be repeated indefinitely<sup class="footnoteref" id="fnref:Thisimpliestha"><a rel="footnote" href="#fn:Thisimpliestha">4</a></sup>.</p>
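<p>Such direct evidentiary statements are simple tail areas of the posterior distribution. For example, if the posterior for a blood-pressure reduction were approximately normal (illustrative numbers only), both “any benefit” and “worthwhile benefit” probabilities fall out of the same draws:</p>

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical posterior for blood-pressure reduction (mmHg): N(4, 2^2)
draws = rng.normal(4.0, 2.0, 500_000)
p_benefit = float(np.mean(draws > 0))      # P(treatment reduces BP at all)
p_worthwhile = float(np.mean(draws > 3))   # P(reduction exceeds 3 mmHg)
print(round(p_benefit, 3), round(p_worthwhile, 3))
```

<p>Unlike a p-value, the threshold (0, 3 mmHg, or anything else clinically meaningful) is chosen by the consumer of the analysis, not fixed by the testing machinery.</p>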
<h1 id="usingstatisticalmodelstoquantifyevidence">Using Statistical Models to Quantify Evidence</h1>
<p>Since the model choice is subjective, if we want our quantified evidence for effects to be accurate and not overstated, we should use Bayesian models acknowledging what we don’t know. Nate Silver in <a href="https://www.amazon.com/SignalNoiseManyPredictionsFailbut/dp/0143125087" target="_blank">The Signal and the Noise</a> eloquently wrote in detail about a different example of “what we don’t know”, related to causal inference from observational data. In his description of the controversy about cigarette smoking and lung cancer he pointed out that many people believed Ronald Fisher when Fisher said that since one can’t randomize cigarette exposure there is no way to draw trustworthy inference; therefore we should draw no conclusions (Fisher also had a significant conflict of interest, as he was a consultant to the tobacco industry). Silver showed that even an extremely skeptical prior distribution about the effect of cigarette smoking on lung cancer would be overridden by the data. Because only the Bayesian approach allows insertion of skepticism at precisely the right point in the logic flow, one can think of a full Bayesian solution (prior + model) as a way to “get the model right”, taking the design and context into account, to obtain reliable scientific evidence. Note that possible failure to have all confounders measured can be somewhat absorbed into the skeptical prior distribution with a Bayesian approach.</p>
<p>In some tightly controlled experiments, the statistical model is somewhat less relevant.</p>
<h1 id="replicationandvalidation">Replication and Validation</h1>
<p>Finally, turn to the complex issue of replication and/or validation. First of all, it is important to know what replication is <em>not</em>. <a href="http://fharrell.com/post/splitval">This article</a> discusses split-sample validation as a data-wasting form of <em>internal validation</em>. It does not demonstrate that other investigators with different measurement and survey techniques, different data cleaning procedures, and different subtle ways to “cheat” would arrive at the same answer. It turns geographical differences and time trends into surprises rather than useful covariates. A far better form of internal validation is the bootstrap, which has the added advantage (and burden) of requiring the researcher to completely specify all analytic steps. Now contrast internal validation with true independent replication. The latter has the advantages of validating the following:</p>
<ol>
<li>the investigators and their hidden biases</li>
<li>the specificity of the statistical analysis plan</li>
<li>the technologies on which measurements are based (e.g., gene or protein expression)</li>
<li>the survey techniques including how subjects are interviewed (with respect to leading questions, etc.)</li>
<li>subject inclusion/exclusion criteria</li>
<li>subtle decisions that bias estimates such as treatment effects (e.g., deleting outliers, avoiding blinding and blinded data correction, remeasuring something when its value is suspect, etc.)</li>
<li>other systematic biases that one suspects would be different for different research teams</li>
</ol>
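<p>The bootstrap internal validation recommended above is often done as Efron-Gong optimism correction: estimate how much an apparent performance index overstates itself by refitting the whole modeling process in bootstrap resamples. A minimal sketch using simulated data and apparent R²:</p>

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated data: 10 candidate predictors, only the first carries signal
n, p = 100, 10
X = rng.normal(size=(n, p))
y = X[:, 0] * 0.5 + rng.normal(size=n)

def fit(Xm, ym):
    return np.linalg.lstsq(Xm, ym, rcond=None)[0]

def r2(Xm, ym, beta):
    resid = ym - Xm @ beta
    return 1 - resid.var() / ym.var()

beta = fit(X, y)
apparent = r2(X, y, beta)                 # optimistic: evaluated on training data

optimism = []
for _ in range(300):
    idx = rng.integers(0, n, n)           # bootstrap resample of rows
    b = fit(X[idx], y[idx])
    # optimism = performance on the resample minus performance on the original
    optimism.append(r2(X[idx], y[idx], b) - r2(X, y, b))
corrected = apparent - np.mean(optimism)  # optimism-corrected estimate
print(round(apparent, 3), round(corrected, 3))
```

<p>Crucially, the entire analysis (here just the least-squares fit) must be repeated inside the loop; any data-driven step left outside understates the optimism.</p>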
<p>When is an independent replication or model validation warranted? This is difficult to say, but is related to the potential impact of the result, on subjects and on future researchers.</p>
<p>The quickest and cheapest form of partial validation is to validate the investigators and code, in the following sense. Have the investigators provide the prespecified data manipulation (including computation of derived variables) and statistical analysis or machine learning plan, along with the raw data, to an independent team. The independent team executes the data manipulation and analysis plan and compares the results to the results obtained by the original team. Ideally the independent researchers would run the original code on their systems and also do some independent coding. This process verifies code, computations, and specificity of the analysis plan and verifies that once the paper is published others will also be able to replicate the findings. This approach would have entirely prevented the <a href="https://en.wikipedia.org/wiki/Anil_Potti" target="_blank">Duke University Potti scandal</a> had the cancer biomarker investigators at Duke been interested in collaborating with an outside team.</p>
<p>If rigorous internal validation fails, or outside attempts to duplicate the results fail, there is no need to acquire new independent data to validate the original approach<sup class="footnoteref" id="fnref:Thisespecially"><a rel="footnote" href="#fn:Thisespecially">5</a></sup>.</p>
<h1 id="stepstoenhancethescientificprocess">Steps to Enhance the Scientific Process</h1>
<ol>
<li>Choose an experimental design that is appropriate for the question of interest, taking into account whether association or causation is central to the question</li>
<li>Choose the right measurements and measure them accurately or at least without a systematic bias favoring your viewpoint</li>
<li>Understand sources of variability and incorporate those into the design and the model</li>
<li>Formulate prior distributions for effect parameters that are informed by the subject matter and other reliable data. Even if you use pvalues this process will benefit the research.</li>
<li>Formulate a data model that is informed by subject matter, knowledge about the measurements, and experience with similar data</li>
<li>Add parameters to the model for what you don’t know, putting priors on those parameters so as to favor your favorite model (e.g., normal distribution with equal variances for the t-test; absence of interactions) but not rule out departures from it. If using a frequentist approach, parameters must be “all in”, which will make confidence intervals honest but <a href="https://www.ncbi.nlm.nih.gov/pubmed/9192445" target="_blank">wider than Bayesian credible intervals</a>.</li>
<li>Independently validate code and calculations while verifying the specificity of the statistical analysis or machine learning plan</li>
<li>In many situations, especially when large scale policies are at stake, independently replicate the findings from scratch before believing them</li>
</ol>
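<p>As a small frequentist illustration of step 6, the following R sketch (simulated data; all numbers invented) contrasts the classical equal-variance t-test with Welch’s t-test, which does not rule out unequal variances:</p>
<pre class="r"><code>set.seed(1)
g1 <- rnorm(50, mean = 0,   sd = 1)  # control group
g2 <- rnorm(50, mean = 0.5, sd = 2)  # treatment group with larger variance

classical <- t.test(g1, g2, var.equal = TRUE)   # hard-wires equal variances
welch     <- t.test(g1, g2, var.equal = FALSE)  # allows departures from it

# The Welch interval is slightly wider: the price of assuming less
diff(classical$conf.int)
diff(welch$conf.int)</code></pre>
<p>With equal group sizes the two standard errors coincide, so the widening comes entirely from the smaller Welch degrees of freedom.</p>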
<hr />
<h2 id="someusefulreferences">Some Useful References</h2>
<ul>
<li><a href="https://arxiv.org/abs/1511.05219" target="_blank">How much does your data exploration overfit? Controlling bias via information usage</a> by D Russo and J Zou</li>
</ul>
<hr />
<p>This article benefited from many thought-provoking discussions with Bert Gunter, who believes that replication and initial exploratory analysis are even more important than I do. Chris Tong also provided valuable ideas. Misconceptions are solely mine.</p>
<p>Footnotes:</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Whenmanyvariab">When many variables are involved, a statistical model is often the best descriptive tool, even when it’s not used for inference. <a class="footnotereturn" href="#fnref:Whenmanyvariab"><sup>^</sup></a></li>
<li id="fn:Everystatistica">Every statistical test is using a model. For example, the Wilcoxon two-sample test is a special case of the proportional odds model and requires the proportional odds assumption to hold to achieve maximum power. <a class="footnotereturn" href="#fnref:Everystatistica"><sup>^</sup></a></li>
<li id="fn:Thefrequentist">The frequentist paradigm does not provide confidence intervals or p-values when parameters are penalized. <a class="footnotereturn" href="#fnref:Thefrequentist"><sup>^</sup></a></li>
<li id="fn:Thisimpliestha">This implies that the exact experimental design that is <strong>in effect</strong> is known so that the p-value can be computed by rerunning that exact design indefinitely often to compute the probability of finding a larger effect in those repeated experiments than the effect originally observed. <a class="footnotereturn" href="#fnref:Thisimpliestha"><sup>^</sup></a></li>
<li id="fn:Thisespecially">This especially pertains to prediction, and is less applicable to randomized trials. <a class="footnotereturn" href="#fnref:Thisespecially"><sup>^</sup></a></li>
</ol>
</div>

Is Medicine Mesmerized by Machine Learning?
http://fharrell.com/post/medml/
Thu, 01 Feb 2018 00:00:00 +0000
http://fharrell.com/post/medml/
<p>BD Horne et al wrote an important paper <a href="http://www.amjmed.com/article/S00029343(09)00103X/pdf" target="_blank">Exceptional mortality prediction by risk scores from common laboratory tests</a> that apparently garnered little attention, perhaps because it used older technology: standard clinical lab tests and logistic regression. Yet even putting themselves at a significant predictive disadvantage by binning all the continuous lab values into fifths, the authors were able to achieve a validated c-index (AUROC) of 0.87 in predicting death within 30d in a mixed inpatient, outpatient, and emergency department patient population. Their model also predicted 1y and 5y mortality very well, and performed well in a completely independent NHANES cohort<sup class="footnoteref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup>. It also performed very well when evaluated just in outpatients, a group with very low mortality.</p>
<p>The above model, called by the authors the Intermountain Risk Score, used the following predictors: age, sex, hematocrit, hemoglobin, red cell distribution width, mean corpuscular volume, red blood cell count, platelet count, mean platelet volume, mean corpuscular hemoglobin, mean corpuscular hemoglobin concentration, total white blood count, sodium, potassium, chloride, bicarbonate, calcium, glucose, creatinine, and BUN<sup class="footnoteref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup>. The model is objective, transparent, and needs only one-time and not historical information. It did not need the EHR (other than to get age and sex) but rather used the clinical lab data system. How predicted risks are arrived at is obvious, i.e., a physician can easily see which patient factors were contributing to overall risk of mortality. The predictive factors are measured at obvious times. One can be certain that the model did not use information it shouldn’t, such as the use of certain treatments and procedures that may create a kind of circularity with death. It is important to note, however, that inter-lab variation has created challenges in analyzing lab data from multiple health systems.</p>
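<p>The spirit of such a transparent model is easy to convey in a short R sketch. The data, the variable list, and the coefficients below are entirely invented for illustration; this is not the published Intermountain Risk Score:</p>
<pre class="r"><code>set.seed(2)
n    <- 1000
age  <- runif(n, 40, 90)
na   <- rnorm(n, 140, 3)    # serum sodium (hypothetical)
cr   <- rlnorm(n, 0, 0.3)   # creatinine (hypothetical)
# Simulated 30-day mortality driven by age and creatinine
p    <- plogis(-9 + 0.07 * age + 1.2 * log(cr))
dead <- rbinom(n, 1, p)

fit  <- glm(dead ~ age + na + log(cr), family = binomial)
# Predicted risks are directly interpretable absolute probabilities
risk <- predict(fit, type = "response")
summary(risk)</code></pre>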
<p>Contrast the above underhyped approach with machine learning (ML). Consider Avati et al’s paper <a href="https://arxiv.org/abs/1711.06402" target="_blank">Improving palliative care with deep learning</a>, which was publicized <a href="https://spectrum.ieee.org/thehumanos/biomedical/diagnostics/stanfordsaipredictsdeathforbetterendoflifecare" target="_blank">here</a>. The Avati paper addresses an important area and is well motivated. Palliative care (e.g., hospice) is often sought at the wrong time and relies on individual physician referrals. An automatic screening method may yield a list of candidate patients near end of life who should be evaluated by a physician for the possibility of recommending palliative rather than curative care. A method designed to screen for such patients needs to be able to estimate either mortality risk or life expectancy accurately.</p>
<p>Avati et al’s analysis used a year’s worth of prior data on each patient and was based on 13,654 candidate features from the EHR. As with any retrospective study not based on an inception cohort with a well-defined “time zero”, it is tricky to define a time zero and somewhat easy to have survival bias and other sampling biases sneak into the analysis. The ML algorithm, in order to use a binary outcome, required division of patients into “positive” and “negative” cases, something not required by regression models for time until an event<sup class="footnoteref" id="fnref:Thereexistneur"><a rel="footnote" href="#fn:Thereexistneur">3</a></sup>. “Positive” cases must have at least 12 months of previous data in the health system, weeding out patients who died quickly. “Negative” cases must have been alive for at least 12 months from the <em>prediction date</em>. It is also not clear how variable censoring times were handled. In a standard statistical model, patients entering the system just before the data analysis have short follow-up and are right-censored early, but still contribute some information.</p>
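<p>The standard setup alluded to here is simple to state. A minimal base-R sketch with simulated entry dates (all numbers invented) shows how late entrants get short, right-censored follow-up rather than being excluded:</p>
<pre class="r"><code>set.seed(3)
n      <- 200
entry  <- as.Date("2015-01-01") + sample(0:900, n, replace = TRUE)  # staggered entry
death  <- entry + round(rexp(n, rate = 1/800))  # latent death time, in days
cutoff <- as.Date("2017-12-31")                 # analysis date

time  <- pmin(as.numeric(death - entry), as.numeric(cutoff - entry))
event <- as.numeric(death <= cutoff)  # 1 = death observed, 0 = right-censored
table(event)  # censored patients keep their partial follow-up information</code></pre>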
<p>Avati et al used deep learning on the 13,654 features to achieve a validated c-index of 0.93. To the authors’ credit, they constructed an unbiased calibration curve, although it used binning and is very low resolution. Like many applications of ML where few statistical principles are incorporated into the algorithm, the result is a failure to make accurate predictions on the absolute risk scale. The calibration curve is far from the line of identity as shown below.</p>
<figure >
<img src="http://fharrell.com/img/ava17impCal.png" width="60%" />
</figure>
<p>The authors interpreted the above figure as “reasonably calibrated.” It is not. For example, a patient with a predicted probability of 0.2 had an actual risk below 0.1. The gain in c-index from ML over simpler approaches has been more than offset by worse calibration accuracy than the other approaches achieved.</p>
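<p>The kind of binned calibration check under discussion is straightforward to compute. A minimal base-R sketch, using simulated predictions that are deliberately miscalibrated (all numbers invented):</p>
<pre class="r"><code>set.seed(4)
n    <- 5000
phat <- runif(n)                 # hypothetical predicted risks
y    <- rbinom(n, 1, phat^1.3)   # true risks are lower than predicted

bins     <- cut(phat, quantile(phat, 0:10/10), include.lowest = TRUE)
pred_avg <- tapply(phat, bins, mean)  # mean prediction per decile
obs_frac <- tapply(y,    bins, mean)  # observed event fraction per decile

# A well-calibrated model would have these two columns nearly equal
round(cbind(pred_avg, obs_frac), 3)</code></pre>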
<p>Importantly, some of the hype over ML comes from journals and professional societies and not so much from the researchers themselves. That is the case for the Avati et al deep learning algorithm, which is not actually being used in production mode at Stanford. A much better calibrated and somewhat more statistically based algorithm is currently being used.</p>
<p>As with many ML algorithms, the focus is on development of “classifiers”. As detailed <a href="http://fharrell.com/post/classification/" target="_blank">here</a>, classifiers are far from optimal in medical decision support, where decisions should not be made in a paper but only once utilities/costs are known. Utilities and costs only become known during the physician/patient interaction. Unlike statistical models, which directly estimate risk or life expectancy, the majority of ML algorithms start by using classification; then, if a probability is needed, they try to convert the patterns into a probability (this is sometimes called a “probability machine”). As judged by Avati et al’s calibration plot, this conversion may not be reliable.</p>
<p>Avati et al, besides showing us what is needed for forward prediction (the calibration plot), also reported a number of problematic measures. As detailed <a href="http://fharrell.com/post/classdamage/" target="_blank">here</a>, the use of improper probability accuracy scoring rules is very common in the ML world, because of the hope that one can actually make a decision (classification) using the data without needing to incorporate costs of incorrect decisions (utilities). Improper accuracy scores have a number of problems, such as</p>
<ul>
<li>reversing information flow, i.e., conditioning on outcomes and examining tendencies of inputs</li>
<li>inviting dichotomization of inputs</li>
<li>being optimized by choosing the wrong features and giving them the wrong weights</li>
</ul>
<p>Proportion classified correctly, sensitivity, specificity, precision, and recall are all improper accuracy scoring rules and should not play a role in a forward prediction mode when risk or life expectancy estimation is the real goal. A poker player wins consistently because she is able to estimate the probability she will ultimately win with her current hand, not because she recalls how often she’s had such a hand when she won.</p>
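<p>The difference between proper and improper scores is easy to demonstrate in a few lines of R. In this toy example, two sets of predicted probabilities that classify identically differ greatly in probability accuracy:</p>
<pre class="r"><code>y      <- c(1, 1, 0, 0, 1, 0)
honest <- c(0.9, 0.8, 0.2, 0.1, 0.7, 0.3)  # well-separated probabilities
hedged <- c(0.6, 0.6, 0.4, 0.4, 0.6, 0.4)  # same ranking, vaguer probabilities

acc   <- function(p, y) mean((p > 0.5) == (y == 1))  # improper score
brier <- function(p, y) mean((p - y)^2)              # proper score

c(acc(honest, y), acc(hedged, y))      # identical classification accuracy
c(brier(honest, y), brier(hedged, y))  # the Brier score tells them apart</code></pre>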
<p>One additional point: the ML deep learning algorithm is a black box, not provided by Avati et al, and apparently not usable by others. And the algorithm is so complex (especially with its extreme usage of procedure codes) that one can’t be certain that it didn’t use proxies for private insurance coverage, raising a possible ethics flag. In general, any bias that exists in the health system may be represented in the EHR, and an EHR-wide ML algorithm has a chance of perpetuating that bias in future medical decisions. On a separate note, I would favor using comprehensive comorbidity indexes and severity of disease measures over doing a free-range exploration of ICD-9 codes.</p>
<p>It may also be useful to contrast the ML approach with another carefully designed traditional and transparent statistical approach used in the <a href="http://onlinelibrary.wiley.com/doi/10.1111/j.15325415.2000.tb03126.x/full" target="_blank">HELP study</a> of JM Teno, FE Harrell, et al. A validated parametric survival model was turned into an easy-to-use nomogram for obtaining a variety of predictions on older hospitalized adults:</p>
<figure >
<img src="http://fharrell.com/img/HELPnomogram.png" alt="Nomogram for obtaining predicted 1- and 2-year survival probabilities and the 10th, 25th, 50th, 75th, and 90th percentiles of survival time (in months) for individual patients in HELP. Disease class abbreviations: a=ARF/MOSF/Coma, b=all others, c=CHF, d=Cancer, e=Orthopedic. To use the nomogram, place a ruler vertically such that it touches the appropriate value on the axis for each predictor. Read off where the ruler intersects the 'Points' axis at the top of the diagram. Do this for each predictor, making a listing of the points. Add up all these points and locate this value on the 'Total Points' axis with a vertical ruler. Follow the ruler down and read off any of the predicted values of interest. APS is the APACHE III Acute Physiology Score." width="95%" />
<figcaption>
<p>
Nomogram for obtaining predicted 1- and 2-year survival probabilities and the 10th, 25th, 50th, 75th, and 90th percentiles of survival time (in months) for individual patients in HELP. Disease class abbreviations: a=ARF/MOSF/Coma, b=all others, c=CHF, d=Cancer, e=Orthopedic. To use the nomogram, place a ruler vertically such that it touches the appropriate value on the axis for each predictor. Read off where the ruler intersects the 'Points' axis at the top of the diagram. Do this for each predictor, making a listing of the points. Add up all these points and locate this value on the 'Total Points' axis with a vertical ruler. Follow the ruler down and read off any of the predicted values of interest. APS is the APACHE III Acute Physiology Score.
</p>
</figcaption>
</figure>
<p>Importantly, patients’ actual preferences for care were also studied in HELP. A different validated prognostic tool for end-of-life decision making, derived primarily from ICU patients, is the <a href="http://annals.org/aim/articleabstract/708396/supportprognosticmodelobjectiveestimatessurvivalseriouslyillhospitalizedadults" target="_blank">SUPPORT prognostic model</a>.</p>
<p>In the rush to use ML and large EHR databases to accelerate learning from data, researchers often forget about the advantages of statistical models and of using more compact, cleaner, and better defined data. They also sometimes forget how to measure absolute predictive accuracy, or that utilities must be incorporated to make optimum decisions. Utilities are applied to predicted risks; classifiers are at odds with optimum decision making and with incorporating utilities at the appropriate time, which is usually at the last minute just before the medical decision is made and not when a classifier is being built.</p>
<hr />
<h2 id="referencesguidelinesforreportingpredictivemodels">References: Guidelines for Reporting Predictive Models</h2>
<ul>
<li><a href="http://annals.org/aim/fullarticle/2088549/transparentreportingmultivariablepredictionmodelindividualprognosisdiagnosistripodtripod" target="_blank">TRIPOD Statement</a></li>
<li><a href="http://annals.org/aim/fullarticle/2088542/transparentreportingmultivariablepredictionmodelindividualprognosisdiagnosistripodexplanation" target="_blank">TRIPOD Explanation and Elaboration</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5238707" target="_blank">Guidelines for Developing and Reporting Machine Learning Predictive Models in Biomedical Research</a></li>
</ul>
<h2 id="otherrelevantarticles">Other Relevant Articles</h2>
<ul>
<li><a href="https://jamanetwork.com/journals/jama/fullarticle/2675024" target="_blank">Big Data and Machine Learning in Health Care</a></li>
<li><a href="https://publications.parliament.uk/pa/ld201719/ldselect/ldai/100/10002.htm" target="_blank">UK Parliament AI Report</a></li>
<li><a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0194889" target="_blank">Statistical and Machine Learning forecasting methods: Concerns and ways forward</a> by S Makridakis, E Spiliotis, V Assimakopoulos
<ul>
<li>Excellent discussion of overfitting, measuring accuracy, and lack of rigor in published machine learning studies in financial time series forecasting. Simple statistical methods outperformed complex machine learning algorithms. Previous researchers refused to share data.</li>
</ul></li>
</ul>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">The authors failed to present a high-resolution validated calibration curve to demonstrate the absolute predictive accuracy of the model. They also needlessly dealt with sensitivity and specificity.
<a class="footnotereturn" href="#fnref:1"><sup>^</sup></a></li>
<li id="fn:2">Hemoglobin, red blood count, mean corpuscular hemoglobin, chloride, and BUN were excluded because their information was redundant once all the other predictors were known.
<a class="footnotereturn" href="#fnref:2"><sup>^</sup></a></li>
<li id="fn:Thereexistneur">There exist neural network algorithms for censored time-to-event data. <a class="footnotereturn" href="#fnref:Thereexistneur"><sup>^</sup></a></li>
</ol>
</div>

Information Gain From Using Ordinal Instead of Binary Outcomes
http://fharrell.com/post/ordinalinfo/
Sun, 28 Jan 2018 00:00:00 +0000
http://fharrell.com/post/ordinalinfo/
<p>As discussed in <a href="http://fharrell.com/doc/bbr.pdf#nameddest=sec:overviewychoice">BBR Section 3.5</a>, a binary dependent variable Y has minimum statistical information, giving rise to minimal statistical power and precision. This can easily be demonstrated by power or sample size calculations. Consider a pain outcome as an example. Instead of having as an outcome the presence or absence of pain, one can significantly increase power by having several levels of pain severity with the lowest level representing “none”; the more levels the better.</p>
<p>The point about the increase in power can also be made by, instead of varying the effect size, varying the effect that can be detected with a fixed power of 0.9 when the degree of granularity in Y is increased. This is all about breaking ties in Y. The more ties there are, the less statistical information is present. Why is this important in study planning? Here’s an all-too-common example. A study is designed to compare the fraction of “clinical responders” between two treatments. The investigator knows that the power of a binary endpoint is limited, and has a fixed budget. So she chooses a more impressive effect size for the power calculation—one that is more than clinically relevant. After the data are in, she finds an apparent clinically relevant improvement due to one of the treatments, but because the study was sized only to detect a super-clinical improvement, the p-value is large and the confidence interval for the effect is wide. Little new knowledge is gained from the study except for how to spend money.</p>
<p>Consider a twogroup comparison, with an equal sample size per group. Suppose we want to detect an odds ratio of 0.5 (OR=1.0 means no group effect) for binary Y. Suppose that the probability that Y=1 in the control group is 0.2. The required sample size is computed below.</p>
<pre class="r"><code>require(Hmisc)</code></pre>
<pre class="r"><code>knitrSet(lang='blogdown')
dor    <- 0.5  # OR to detect
tpower <- 0.9  # target power
# Apply OR to p1=0.2 to get p2
p2 <- plogis(qlogis(0.2) + log(dor))
n1 <- round(bsamsize(0.2, p2, power=tpower)['n1'])
n  <- 2 * n1</code></pre>
<p>The OR of 0.5 corresponds to an event probability of 0.111 in the second group, and the number of subjects required per group is 347 to achieve a power of 0.9 of detecting OR=0.5.</p>
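<p>As a rough cross-check that does not require <code>Hmisc</code>, base R’s normal-approximation routine gives nearly the same per-group sample size:</p>
<pre class="r"><code>p1 <- 0.2
p2 <- plogis(qlogis(p1) + log(0.5))  # apply OR = 0.5 on the odds scale
# Per-group n; a normal-approximation method, so it agrees only
# approximately with Hmisc::bsamsize (about 347)
power.prop.test(p1 = p1, p2 = p2, power = 0.9)$n</code></pre>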
<p>Let’s now turn to using an ordinal response variable Y for our study. The proportional odds ordinal logistic model is the most widely used ordinal response model. It includes both the Wilcoxon-Mann-Whitney two-sample rank test and binary logistic regression as special cases.
If ties in Y could be broken, the proportional odds assumption satisfied, and the sample size per group were fixed at 347, what odds ratio would be detectable with the same power of 0.9?</p>
<p>Before proceeding let’s see how close to 0.9 is the power computed using proportional odds model machinery when Y is binary. The vector of cell probabilities needed by the R <code>popower</code> function is the average of the cell probabilities over the two study groups. We write a front-end to <code>popower</code> that computes this average given the odds ratio and the cell probabilities for group 1.</p>
<pre class="r"><code>popow <- function(p, or, n) {
  # Compute cell probabilities for group 2 using Hmisc::pomodm
  p2   <- pomodm(p=p, odds.ratio=or)
  pavg <- (p + p2) / 2
  popower(pavg, odds.ratio=or, n=n)
}
z <- popow(c(0.8, 0.2), or=dor, n=2 * n1)
z</code></pre>
<pre><code>Power: 0.911
Efficiency of design compared with continuous response: 0.394 </code></pre>
<pre class="r"><code>binpopower <- z$power</code></pre>
<p>The approximation to the binary case isn’t perfect since the PO model method’s power is a little above 0.9. But it’s not bad.</p>
<p>Let’s write an R function that given everything else computes the OR needed to achieve a given power and configuration of cell probabilities in the control group.</p>
<pre class="r"><code>g <- function(p, n=2 * n1, power=binpopower) {
  f <- function(or) popow(p, or=or, n=n)$power - power
  round(uniroot(f, c(dor - 0.1, 1))$root, 3)
}
# Check that we can recover the original detectable OR
g(c(0.8, 0.2))</code></pre>
<pre><code>[1] 0.5</code></pre>
<p>To break ties in Y we’ll try a number of configurations of the cell probabilities for the control group, and for each configuration compute the OR that can be detected with the same power as computed for the binary Y case using the PO model. We will mainly vary the number of levels of Y. For example, to compute the detectable effect size when the probability of 0.2 that Y=1 is divided into two values of Y with equal probability, we use <code>g(c(0.8, 0.1, 0.1), n)</code>. Results are shown in the table below.</p>
<pre class="r"><code># Function to draw spike histograms of probabilities as html base64 insert
h <- function(p) tobase64image(pngNeedle(p, w=length(p)))</code></pre>
<table>
<thead>
<tr class="header">
<th>Distinct Y Values</th>
<th></th>
<th>Cell Probabilities</th>
<th>Detectable OR</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>2</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAUAAAASCAMAAACkYuT0AAAAIVBMVEUAAAAyMjJmZmaEhISPj4+UlJSZmZng4ODr6+vx8fH///8zi/NfAAAAL0lEQVQImWPgAAEGLhBgYOIEkQxsYJIFB8kMUg8lsalhZwSRrAwgkgVOMnMwMwAAhSUCJ+oP35EAAAAASUVORK5CYII=" alt="image" /></td>
<td>.8 .2</td>
<td>0.5</td>
</tr>
<tr class="even">
<td>2</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAUAAAASCAMAAACkYuT0AAAAFVBMVEUAAACEhISPj4+qqqrLy8vr6+v///9wEjDrAAAAKElEQVQImWNgBQEGNhAgQLIACRYGBmZWZgYGBkZWRiDJxMZEJAlSDwBXcgFHSfw5xwAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.5 .5</td>
<td>0.603</td>
</tr>
<tr class="odd">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAAMFBMVEUAAAAEBAQUFBReXl5nZ2dwcHBxcXF5eXmhoaGwsLCzs7Pk5OTr6+vy8vL39/f///9bJARqAAAAPUlEQVQImZXMuQHAIAzAQIUQPgPef1tw49Ci6irRPdTjFSf553fB6N+TN4fNObeGUUIQbY+xQtWEfQuUHllXyQfwiPT7LAAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.8 .2/2 x 2</td>
<td>0.501</td>
</tr>
<tr class="even">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAAMFBMVEUAAAAQEBAlJSUqKipnZ2dwcHBycnJ0dHShoaGvr6+ysrLBwcHNzc3q6urr6+v///8dAOhqAAAAQElEQVQImXXMNxIAIQwEwcF79P/fUkp0F8BGHWwN24bYHgzTSPmYrozW/fP+fXF5v2Q4ZYcuGWWDJgntVqg7cgCMUwgjOT+CdAAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.7 .3/2 x 2</td>
<td>0.562</td>
</tr>
<tr class="odd">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAALVBMVEUAAABDQ0NYWFhnZ2dwcHB2dnaMjIyhoaGvr6+2trbDw8PZ2dnr6+v29vb///9GxaMQAAAAP0lEQVQImZ3MOw7AIAwE0Qnmmxjf/7iIxhYFTaZ6xWpRD/P+8g1S/RcJ5hgcnClN+57NAcMKmx265StFGzQVFlGrB0OcmqrtAAAAAElFTkSuQmCC" alt="image" /></td>
<td>.5 .5/2 x 2</td>
<td>0.615</td>
</tr>
<tr class="even">
<td>3</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAoAAAASCAMAAABVab95AAAAIVBMVEUAAAADAwNnZ2dwcHCYmJihoaGvr6/Dw8PV1dXr6+v///+HXnldAAAAOElEQVQIma2MMQoAIAzEUrXa3v8fbCdxFcxwBAJHHtDhg16/l65KETULhsIsNICuCVP9UVs6eDY2Z1gFMW1yd0UAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/3 x 3</td>
<td>0.629</td>
</tr>
<tr class="odd">
<td>4</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABEAAAASCAMAAACKJ8VmAAAAM1BMVEUAAAAGBgY+Pj5DQ0NNTU1XV1dYWFh1dXWKioqUlJS9vb3Ozs7n5+fr6+vw8PD4+Pj///9mhRhjAAAASUlEQVQYla3OORLAIAxD0Q9ZiHEMvv9pyVA6lKjSvEbCYvAYLo3C+ZNjk+T4ZyG7tpbSpX61Snd/nylCat4S4n4z/xRQUyhmmQGLQQ+0Y/BHOwAAAABJRU5ErkJggg==" alt="image" /></td>
<td>.8 .2/3 x 3</td>
<td>0.502</td>
</tr>
<tr class="even">
<td>4</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABEAAAASCAMAAACKJ8VmAAAAM1BMVEUAAAA9PT0+Pj5DQ0NGRkZJSUl1dXV2dnZ/f3+FhYW+vr7Ozs7g4ODh4eHm5ubr6+v///8q6XZ8AAAAR0lEQVQYlb2OORKAMBSFyL7nv/ufVuvYOBZSMVRgJ+jk1/L8+XbourRiubXEJXUHSWqwtaFJCQhShakJVQovizfLMGxANvNcUU8Ptx5vX+wAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/4 x 4</td>
<td>0.638</td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAARVBMVEUAAAABAQEJCQkKCgoODg4WFhYlJSVkZGRtbW14eHh+fn5/f3+Pj4+ZmZmhoaG0tLS7u7vLy8vo6Ojr6+vw8PD8/Pz///8rrl2bAAAAXElEQVQYlbXPSw7AIAgE0NH+1KpVUe5/1JK4pguTzgIS3oIMSA1YzSKdt0rYddqWyKq9vmjt1y/UXZI5Qhiykusyn2tShGnMBSjMzSDK6cDs5YFKlIFMVAEvJ4sXJpIi9jGdA84AAAAASUVORK5CYII=" alt="image" /></td>
<td>0.7 .3/4 x 4</td>
<td>0.563</td>
</tr>
<tr class="even">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAAS1BMVEUAAAABAQEICAgJCQkKCgoLCwsUFBQrKytSUlJ4eHiEhIShoaGwsLDDw8PLy8vd3d3k5OTr6+vw8PDz8/P39/f6+vr7+/v+/v7///+cI/wTAAAAZ0lEQVQYlbXPNw7AMAxD0Z/ei1J5/5PGhmdlSjhw4AMECHOD3HxP2+1S1rpE5VPh/vVGpX/wH9qv2McR+9pj31uiKW9Oaa3rVTqbfApTmyXqYZFGGKUF+jBVpL86mM0GGMxm6MJU8AA+UiSVSXJMqwAAAABJRU5ErkJggg==" alt="image" /></td>
<td>0.6 .4/4 x 4</td>
<td>0.597</td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAASFBMVEUAAAADAwMJCQkKCgocHBwvLy94eHh/f3+CgoKJiYmUlJSZmZmhoaGvr6+wsLC/v7/BwcHHx8fLy8vc3Nzr6+vu7u739/f///9EU4FiAAAAYklEQVQYlc2PSQqAMBAEy32JS9TR/v9PTch5joJ9aJgqGGjMDXLzF7XfrqJzd1H7qvEffqKeactHCLm36Ul9DEXNVCZFiJJVzAm1FDXCJa2wSheMCTWUXT2cZgssZif0CdW8SMckNGs501MAAAAASUVORK5CYII=" alt="image" /></td>
<td>0.5 .5/4 x 4</td>
<td>0.618</td>
</tr>
<tr class="even">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAAS1BMVEUAAAAJCQkKCgoMDAwVFRUgICAlJSVBQUFoaGh4eHiBgYGCgoKFhYWhoaGvr6+ysrLAwMDLy8vOzs7c3Nze3t7q6urr6+v8/Pz////PcL6tAAAAYklEQVQYldXPOQ6AQAwEwV7u+8Y7/38pFsRLRsCEXZIlY8mh5H5AL391S5IokwfJvqCtqKN0VtUpxbrYPDX5QwMc0gyzdMDgyeWmHnZpgknaofeUOQV/oYXVbITRbIXWU+ACggkmJL9LJTEAAAAASUVORK5CYII=" alt="image" /></td>
<td>0.4 .6/4 x 4</td>
<td>0.631</td>
</tr>
<tr class="odd">
<td>5</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAABoAAAASCAMAAAByxz6RAAAARVBMVEUAAAAJCQkKCgoODg4lJSUuLi4/Pz9aWlp4eHh/f3+CgoKHh4ehoaGvr6+zs7PLy8vZ2dnc3Nzg4ODr6+vw8PD+/v7///90efXjAAAAYUlEQVQYldWPOQ6AMBADhyMcCRCSJf7/U9kPQImEC1saV4M9Bj3mB9eL15fK4+p9hrlJ1zRdUpvD6WgdYfBNUKUDDqlCcjQAvW+EIu2wSwWio/796lxhgWy2wWaWYXHUcQPYJSJNV3+J0QAAAABJRU5ErkJggg==" alt="image" /></td>
<td>1/5 x 5</td>
<td>0.641</td>
</tr>
<tr class="even">
<td>6</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAACUAAAASCAMAAADrP+ckAAAAUVBMVEUAAAABAQECAgIEBAQGBgYICAiWlpaenp6jo6OlpaWrq6usrKyzs7O7u7vCwsLLy8vPz8/R0dHU1NTW1tbY2NjZ2dnb29ve3t7j4+Pr6+v///+oW0wbAAAAbklEQVQokeXQRw6AMAxE0R96Cb0kcP+DMvgEWSLhzbPk8Wa4UoY7ZX6RSuvrq63Op7kcxroZ22oci3HOuDJqmVwR3myW72LPs/ceCjeJWDpg1NbAIDrwwkMnBmjECMkpr0Iq6EUNrWihFj1UQj8PgfE8BYbAjVoAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/6 x 6</td>
<td>0.643</td>
</tr>
<tr class="odd">
<td>7</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAADIAAAASCAMAAAAuTX21AAAAY1BMVEUAAAAxMTE3Nzc+Pj5GRkZSUlJgYGBnZ2dwcHBxcXF0dHR6enqEhISJiYmOjo6ampqfn5+hoaGvr6+0tLS1tbXExMTJycnR0dHa2trd3d3f39/p6enr6+vy8vL5+fn+/v7////0uzH+AAAAgElEQVQoke3RSRKCQBBE0Y/MyiCTCAp4/1PK9wa41dy82mR1RBfPw+F1OP/K4coXd/nxU+bBqFtymnSNw4cuUbToI4xXnU7JpmOQA2fHGUq9wVU76PQKNy1h1jMmc7xDoQPU2kKrNQxawF2zTyX1s/ftF+2h0gYaraDXy/6apvAGzIpknXhcLgUAAAAASUVORK5CYII=" alt="image" /></td>
<td>1/7 x 7</td>
<td>0.644</td>
</tr>
<tr class="even">
<td>10</td>
<td><img src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAGUAAAASCAMAAAB1heCEAAAAdVBMVEUAAAAHBwcICAgJCQknJycpKSkqKiorKystLS0uLi4vLy8yMjIzMzM3NzdycnJ1dXV2dnZ6enp9fX1+fn6BgYGGhoaKiorAwMDExMTFxcXIyMjJycnOzs7X19fb29vg4ODr6+vv7+/z8/P39/f5+fn7+/v////RKynaAAAAqUlEQVQ4je3Tyw6CMBRF0Y0WQRB5KiAvReX/P1EGB+Ym2hEdraZNdtL0crexmGysrbJV/l6xMy92KnZebKt8X3msm/e4cnyv/ME5U855ueQ7pXjbm6eYkkgvd1eJ1c59iQmp9DT7m1g6/pI5k8+VCE/7AQrxCp14IpB6uIgX6MWAk9TBVSxgED2ieSpDjH51C7GYQS0eOUgNpGIKjXjgKNWQiTG0oiG8fwCPVfnlhbkCxQAAAABJRU5ErkJggg==" alt="image" /></td>
<td>1/10 x 10</td>
<td>0.646</td>
</tr>
<tr class="odd">
<td>694</td>
<td></td>
<td>1/694 x 694</td>
<td>0.647</td>
</tr>
</tbody>
</table>
<p>The last row corresponds to analyzing a continuous variable with the Wilcoxon test with 347 observations per each of the two groups.</p>
<p>When high values of Y (e.g., Y=1 in the binary case) denote an event, and when the control group has a low probability of the event, splitting the high Y-level into multiple ordinal levels does not increase power very much. The real gain in power comes from splitting the more frequent non-event subjects into, for example, “no event” and “mild event”. The best power (detectable OR closer to 1.0) comes from having equal probabilities in the cells when averaged over treatment groups, and with at least 5 distinct Y values.</p>
<p>When designing a study, choose a maximum-information dependent variable, and attempt not to have more than, say, 0.7 of the sample in any one category. But even if the proportion of non-events is large, it does not hurt to break ties among the events. In some cases it will even help, e.g., when the treatment has a larger effect on the more severe events.</p>
<hr />
<p>The first few lines of <code>Rmarkdown knitr</code> markup used to produce the above table are given below.</p>
<pre><code>| Distinct Y Values |   | Cell Probabilities | Detectable OR       |
|-------------------|---|--------------------|---------------------|
| 2 | `r h(c(.8, .2))` | .8 .2 | `r g(c(.8, .2))` |
| 2 | `r h(c(.5, .5))` | .5 .5 | `r g(c(.5, .5))` |</code></pre>

Why I Don't Like Percents
http://fharrell.com/post/percent/
Fri, 19 Jan 2018 00:00:00 +0000
http://fharrell.com/post/percent/
<p>The numbers zero and one are special; zero because it is a minimum or center point for many measurements and because it is the addition identity constant (x + 0 = x), and one because it is the multiplication identity constant (x × 1 = x) and corresponds to units of measurements. Many important quantities are between 0 and 1, including proportions of a whole and probabilities. One hundred is not special in the same sense as unity, so percent (per 100) doesn’t do anything for me (why not per thousand?).
<style>
img {
height: auto;
max-width: 70px;
margin-left: auto;
margin-right: auto;
display: block;
}
</style></p>
<p>When a quantity doubles, it gets back to its original value by halving. When it increases by 100% it gets back to its original value by decreasing 50%. Case almost closed. Whereas an increase of 33.33% is balanced by a decrease of 25%, an increase by a factor of <sup>4</sup>⁄<sub>3</sub> is balanced by a decrease to a factor of <sup>3</sup>⁄<sub>4</sub> . If you put 100 dollars into an account that yields 3% interest annually, you will have 100 * (1.03<sup>10</sup>) or 134 dollars after 10 years. To get back to your original value you’d have to lose 2.91% per year for 10 years.</p>
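<p>The interest arithmetic is worth checking in R. The symmetric way to undo growth by a ratio is to multiply by its reciprocal:</p>
<pre class="r"><code>100 * 1.03^10               # about 134 dollars after 10 years at 3%
r <- 1 - 1/1.03             # annual loss rate that undoes 3% annual growth
round(100 * r, 2)           # 2.91 (percent per year)
100 * 1.03^10 * (1 - r)^10  # back to 100, up to rounding</code></pre>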
<p>I like fractions like <sup>3</sup>⁄<sub>4</sub>, or the decimal equivalent 0.75. I like ratios, because they are symmetric. Chaining together relative increases is simple with ratios. An increase by a factor of 1.5 followed by an increase by a factor of 1.4 is an increase by a factor of 1.5 * 1.4 or 2.1. A 50% increase followed by a 40% increase is an increase of 110%. To get the right answer with percent increase you have to convert back to ratios, do the multiplication, then convert back to percent.</p>
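<p>Chaining relative increases on the ratio scale versus the percent scale, in two lines of R:</p>
<pre class="r"><code>ratio_total <- 1.5 * 1.4  # increases chain by multiplication: 2.1
(ratio_total - 1) * 100   # i.e., a 110% increase, not 90%</code></pre>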
<p>Many numbers that we quote are probabilities, and a probability is formally a number between 0 and 1. So I don’t like “the chance of rain is 10%” but prefer “the chance of rain is 0.1 or <sup>1</sup>⁄<sub>10</sub>”. When discussing statistical analyses it is especially irksome to see statements such as “significance levels of 5% or power of 90%”. Probabilities are being discussed, so I prefer 0.05 and 0.9.</p>
<p>I have seen clinicians confused over statements such as “the chance of a stroke is 0.5%”, interpreting this as 50%. If we say “the chance of a stroke is 0.005” such confusion is less likely. And I don’t need percent signs everywhere.</p>
<p>Percent change has even more problems than percent. I have often witnessed confusion from statements such as “the chance of stroke increased by 50%”. If the base stroke probability was 0.02 does the speaker mean that it is now 0.52? Not very likely, but you can’t be sure. More likely she meant that the chance of stroke is now 0.02 + 0.5 * 0.02 = 0.03. It would always be clear to instead say one of the following:</p>
<ul>
<li>The chance of stroke went from 0.02 to 0.03</li>
<li>The chance of stroke increased by 0.01 (or the <em>absolute</em> chance of stroke increased by 0.01)</li>
<li>The chance of stroke increased by a factor of 1.5</li>
</ul>
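The three unambiguous phrasings above all describe the same change, and reconciling them is simple arithmetic. A small Python sketch using the text’s example (base risk 0.02, factor 1.5):

```python
base = 0.02      # chance of stroke before
factor = 1.5     # relative (fold-change) increase

new = base * factor
print(round(new, 3))          # 0.03: "went from 0.02 to 0.03"
print(round(new - base, 3))   # 0.01: the absolute increase
print(round(new / base, 3))   # 1.5: "increased by a factor of 1.5"

# The misreading of "increased by 50%" as an absolute jump would give
print(round(base + 0.5, 2))   # 0.52 -- almost surely not what was meant
```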
<p>We need to achieve clarity by settling on a convention for wording fold-change decreases. If the chance of stroke decreases from 0.03 to 0.02 and we feel compelled to summarize the <em>relative</em> decrease in risk, we could say that risk of stroke decreased by a factor of 1.5. But even though it looks a bit awkward, I think it would be clearest to say the following, if 0.02 corresponded to treatment A and 0.03 corresponded to treatment B: treatment A multiplied the risk of stroke by <sup>2</sup>⁄<sub>3</sub> in comparison to treatment B. Or you could say that treatment A modified the risk of stroke by a factor of <sup>2</sup>⁄<sub>3</sub>, or that the A:B risk ratio is <sup>2</sup>⁄<sub>3</sub> or 0.667.</p>
<p>Many quantities reported in the scientific literature are naturally ratios. For example, odds ratios and hazard ratios are commonly used. If the ratio of stroke hazard rates for treatment B compared to treatment A is 0.75, I prefer to report “the B:A stroke hazard ratio was 0.75.” There’s no need to say that there was a 25% reduction in stroke hazard rate.</p>
<p>Percents have perhaps one good use. When they represent fractions and we don’t care to present more than two decimal places of accuracy, i.e., when the percents you calculate are all whole numbers, percents may be OK. But I would still prefer numbers like 0.02 and 0.86, and to avoid a symbol (%) when just dealing with numbers.</p>
<h2 id="linkstootherresources">Links to Other Resources</h2>
<ul>
<li><a href="https://www.bmj.com/content/358/bmj.j3663" target="_blank">What is a percentage difference?</a> by TJ Cole and DG Altman</li>
</ul>

How Can Machine Learning be Reliable When the Sample is Adequate for Only One Feature?
http://fharrell.com/post/mlsamplesize/
Thu, 11 Jan 2018 00:00:00 +0000
http://fharrell.com/post/mlsamplesize/
<p>The ability to estimate how one continuous variable relates to another continuous variable is basic to the ability to create good predictions. Correlation coefficients are unitless, but estimating them requires similar sample sizes to estimating parameters we directly use in prediction such as slopes (regression coefficients). When the shape of the relationship between X and Y is not known to be linear, a little more sample size is needed than if we knew that linearity held so that all we had to estimate was a slope and an intercept. This will be addressed later.</p>
<p>Consider <a href="http://fharrell.com/doc/bbr.pdf#nameddest=sec:corrn">BBR Section 8.5.2</a>
where it is shown that the sample size needed to estimate a correlation coefficient to within a margin of error as bad as ±0.2 with 0.95 confidence is about 100 subjects, and to achieve a better margin of error of ±0.1 requires about 400 subjects. Let’s reproduce that plot for the “hardest to estimate” case where the true correlation is 0.</p>
<style>
p.caption {
  font-size: 0.6em;
}
pre code {
  overflow: auto;
  word-wrap: normal;
  white-space: pre;
}
</style>
<pre class="r"><code>require(Hmisc)</code></pre>
<pre class="r"><code>knitrSet(lang='blogdown')</code></pre>
<pre class="r"><code>plotCorrPrecision(rho=0, n=seq(10, 1000, length=100), ylim=c(0, .4), method='none')
abline(h=seq(0, .4, by=0.025), v=seq(25, 975, by=25), col=gray(.9))</code></pre>
<div class="figure"><span id="fig:plotprec"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/plotprec1.png" alt="Margin for error (length of longer side of asymmetric 0.95 confidence interval) for r in estimating ρ, when ρ=0. Calculations are based on the Fisher z transformation of r." width="672" />
<p class="caption">
Figure 1: Margin for error (length of longer side of asymmetric 0.95 confidence interval) for r in estimating ρ, when ρ=0. Calculations are based on the Fisher z transformation of r.
</p>
</div>
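The two quoted sample sizes can be verified outside R. Below is a plain-Python analogue of the calculation behind the plot (my sketch, not the blog’s code), using the Fisher z transformation with standard error 1/sqrt(n - 3) for the hardest case ρ=0:

```python
import math

def corr_margin_of_error(n):
    """Approximate 0.95 margin of error for r when the true correlation is 0.
    Fisher z: z = atanh(r) is approximately normal with SE = 1/sqrt(n - 3)."""
    z975 = 1.959963984540054   # 0.975 normal quantile, hard-coded to stay stdlib-only
    return math.tanh(z975 / math.sqrt(n - 3))

print(round(corr_margin_of_error(100), 3))   # about 0.196, i.e. roughly +/- 0.2
print(round(corr_margin_of_error(400), 3))   # about 0.098, i.e. roughly +/- 0.1
```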
<p>I have seen many papers in the biomedical research literature in which investigators “turned loose” a machine learning or deep learning algorithm with hundreds of candidate features and a sample size that by the above logic would be inadequate had there been only one candidate feature. How can ML possibly learn how hundreds of predictors combine to predict an outcome when our knowledge of statistics says this is impossible? The short answer is that it can’t. Researchers claiming to have developed a useful predictive instrument with ML in the limited-sample-size case seldom do a rigorous internal validation demonstrating that the relationship between predicted and observed values (i.e., the calibration curve) is a straight 45° line through the origin. I have worked with a colleague who had previously worked with an ML group that found a predictive signal (high R<sup>2</sup>) with over 1000 candidate features and N=50 subjects. In trying to check their results on new subjects we appear to be finding an R<sup>2</sup> about 1/4 as large as originally claimed.</p>
<p><span class="citation">Ploeg, Austin, and Steyerberg (<a href="#refplo14mod">2014</a>)</span> in their article <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-137">Modern modelling techniques are data hungry</a> estimated that to have a very high chance of rigorously validating, many machine learning algorithms require 200 events per <em>candidate</em> feature (they found that logistic regression requires 20 events per candidate feature). So it seems that “big data” methods sometimes create the need for “big data” when traditional statistical methods may not require such huge sample sizes (at least when the dimensionality is not extremely high). [Note: in higher-dimensional situations it is possible to specify a traditional statistical model for the prespecified “important” predictors and to add in principal components and other summaries of the remaining features.] For more about “data hunger” in machine learning see <a href="https://stats.stackexchange.com/questions/345737">this</a>. Machine learning algorithms do seem to have unique advantages in high signal:noise ratio situations such as image and sound pattern recognition problems. Medical diagnosis and outcome prediction problems involve a low signal:noise ratio, i.e., R<sup>2</sup> values are typically low and the outcome variable Y is typically measured with error.</p>
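The events-per-candidate-feature estimates above translate directly into minimum event counts. A back-of-envelope sketch in Python (the feature count and event rate below are hypothetical, chosen only to illustrate the scale):

```python
# Implication of the events-per-candidate-feature estimates from
# van der Ploeg, Austin, and Steyerberg (2014): about 20 for logistic
# regression, about 200 for many ML algorithms.
candidate_features = 100   # hypothetical

events_needed_logistic = 20 * candidate_features
events_needed_ml = 200 * candidate_features
print(events_needed_logistic)    # 2000 events
print(events_needed_ml)          # 20000 events

# If the outcome occurs in 1 of every 10 subjects, the ML figure implies
print(events_needed_ml * 10)     # 200000 subjects
```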
<p>I’ve shown the sample size needed to estimate a correlation coefficient with a certain precision. What about the sample size needed to estimate the whole relationship between a single continuous predictor and the probability of a binary outcome? Similar to what is presented in <a href="http://fharrell.com/doc/rms.pdf#nameddest=sec:lrmn">RMS Notes Section 8.2.3</a>, let’s simulate the average maximum (over a range of X) absolute prediction error (on the probability scale). The following R program does this, for various sample sizes. 1000 simulated datasets are analyzed for each sample size considered.</p>
<pre class="r"><code># X = universe of X values if X considered fixed, in random order
# xp = grid of x values at which to obtain and judge predictions
require(rms)</code></pre>
<pre class="r"><code>sim <- function(assume = c('linear', 'smooth'),
                X,
                ns=seq(25, 300, by=25), nsim=1000,
                xp=seq(-1.5, 1.5, length=200), sigma=1.5) {
  assume  <- match.arg(assume)
  maxerr  <- numeric(length(ns))
  pactual <- plogis(xp)
  xfixed  <- ! missing(X)
  j     <- 0
  worst <- nsim
  for(n in ns) {
    j    <- j + 1
    maxe <- 0
    if(xfixed) x <- X[1 : n]
    nsuccess <- 0
    for(k in 1 : nsim) {
      if(! xfixed) x <- rnorm(n, 0, sigma)
      P <- plogis(x)
      y <- ifelse(runif(n) <= P, 1, 0)
      f <- switch(assume,
                  linear = lrm(y ~ x),
                  smooth = lrm(y ~ rcs(x, 4)))
      if(length(f$fail) && f$fail) next
      nsuccess <- nsuccess + 1
      phat <- predict(f, data.frame(x=xp), type='fitted')
      maxe <- maxe + max(abs(phat - pactual))
    }
    maxe <- maxe / nsuccess
    maxerr[j] <- maxe
    worst <- min(worst, nsuccess)
  }
  if(worst < nsim) cat('For at least one sample size, could only run', worst, 'simulations\n')
  list(x=ns, y=maxerr)
}
plotsim <- function(object, xlim=range(ns), ylim=c(0.04, 0.2)) {
  ns <- object$x; maxerr <- object$y
  plot(ns, maxerr, type='l', xlab='N', xlim=xlim, ylim=ylim,
       ylab=expression(paste('Average Maximum ', abs(hat(P) - P))))
  minor.tick()
  abline(h=c(.05, .1, .15), col=gray(.85))
}
set.seed(1)
X <- rnorm(300, 0, sd=1.5)   # Allows use of same X's for both simulations
simrun <- TRUE
# If blogdown handled caching, would not need to manually cache with Load and Save
if(simrun) Load(errLinear) else {
  errLinear <- sim(assume='linear', X=X)
  Save(errLinear)
}
plotsim(errLinear)</code></pre>
<div class="figure"><span id="fig:logisticsim"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/logisticsim1.png" alt="Simulated expected maximum error in estimating probabilities for x ∈ [-1.5, 1.5] with a single normally distributed X with mean zero. The true relationship between X and P(Y=1 | X) is assumed to be logit(Y=1) = X. The logistic model fits that are repeated in the simulation assume the relationship is linear, but estimate the slope and intercept. In reality, we wouldn't know that a relationship is linear, and if we allowed it to be nonlinear there would be a bit more variance to the estimated curve, resulting in larger average absolute errors than what are shown in the figure (see below)." width="672" />
<p class="caption">
Figure 2: Simulated expected maximum error in estimating probabilities for x ∈ [-1.5, 1.5] with a single normally distributed X with mean zero. The true relationship between X and P(Y=1 | X) is assumed to be logit(Y=1) = X. The logistic model fits that are repeated in the simulation assume the relationship is linear, but estimate the slope and intercept. In reality, we wouldn’t know that a relationship is linear, and if we allowed it to be nonlinear there would be a bit more variance to the estimated curve, resulting in larger average absolute errors than what are shown in the figure (see below).
</p>
</div>
<p>But wait—the above simulation assumed that we already knew the relationship was linear. In practice, most relationships are nonlinear and we don’t know the true transformation. Assuming the relationship between X and logit(Y=1) is smooth, we can estimate the relationship reliably with a restricted cubic spline function. Here we use 4 knots, which adds two nonlinear terms to the model for a total of 3 parameters to estimate, not counting the intercept. By estimating these parameters we are estimating the smooth transformation of X, and by simulating this process repeatedly we are allowing for “transformation uncertainty”.</p>
<pre class="r"><code>set.seed(1)
if(simrun) Load(errSmooth) else {
  errSmooth <- sim(assume='smooth', X=X, ns=seq(50, 300, by=25))
Save(errSmooth)
}
plotsim(errSmooth, xlim=c(25, 300))
lines(errLinear, col=gray(.8))</code></pre>
<div class="figure"><span id="fig:simrcs"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/simrcs1.png" alt="Estimated mean maximum (over X) absolute errors in estimating P(Y=1) when X is not assumed to predict the logit linearly (black line). The earlier estimates when linearity was assumed are shown with a gray scale line. Restricted cubic splines could not be fitted for n=25." width="672" />
<p class="caption">
Figure 3: Estimated mean maximum (over X) absolute errors in estimating P(Y=1) when X is not assumed to predict the logit linearly (black line). The earlier estimates when linearity was assumed are shown with a gray scale line. Restricted cubic splines could not be fitted for n=25.
</p>
</div>
<p>You can see that the sample size must exceed 300 just to achieve sufficient reliability in estimating probabilities over the X range [-1.5, 1.5] when we do not know that the relationship is linear and we allow it to be nonlinear.</p>
<p>The morals of the story are</p>
<ul>
<li>Beware of claims of good predictive ability for ML algorithms when sample sizes are not huge in relationship to the number of candidate features</li>
<li>For any problem, whether using machine learning or regression, compute the sample size needed to obtain highly reliable predictions with only a single prespecified predictive feature</li>
<li>If you are not sure that relationships are simple, so that you allow various transformations to be attempted, uncertainty increases and so does the expected absolute prediction error</li>
<li>If your sample size is not much bigger than the above minimum, beware of doing any highdimensional analysis unless you have very clean data and a high signal:noise ratio</li>
<li>Also remember that when Y is binary, the minimum sample size necessary just to estimate the intercept in a logistic regression model (equivalent to estimating a single proportion) is 96 (see <a href="http://fharrell.com/doc/bbr.pdf#nameddest=sec:htestpn">BBR Section 5.6.3</a>). So it is impossible with binary Y to accurately estimate P(Y=1 | X) when there are <em>any</em> candidate predictors if n < 96 (and n=96 only achieves a margin of error of ±0.1 in estimating risk).</li>
<li>When the number of candidate features is huge and the sample size is not, expect the list of “selected” features to be volatile, predictive discrimination to be overstated, and absolute predictive accuracy (calibration curve) to be very problematic</li>
<li>In general, know how many observations are required to allow you to reliably learn from the number of candidate features you have</li>
</ul>
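The n=96 figure in the last bullet follows from the usual 0.95 margin-of-error formula for a single proportion at the worst case p=0.5. A quick Python check (my sketch):

```python
import math

def n_for_proportion(margin, p=0.5):
    """Sample size so that a 0.95 confidence interval for a proportion has
    half-width `margin` in the worst case p = 0.5 (normal approximation):
    n = z^2 * p * (1 - p) / margin^2."""
    z975 = 1.959963984540054
    return round(z975**2 * p * (1 - p) / margin**2)

print(n_for_proportion(0.1))   # 96
```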
<p>See <a href="http://fharrell.com/doc/bbr.pdf#nameddest=chap:hdata">BBR Chapter 20</a> for an approach to judging whether a given sample size is adequate for the number of candidate predictors at hand.</p>
<div id="references" class="section level1 unnumbered">
<h1>References</h1>
<div id="refs" class="references">
<div id="refplo14mod">
<p>Ploeg, Tjeerd van der, Peter C. Austin, and Ewout W. Steyerberg. 2014. “Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints.” <em>BMC Medical Research Methodology</em> 14 (1). BioMed Central Ltd: 137+. <a href="http://dx.doi.org/10.1186/1471-2288-14-137" class="uri">http://dx.doi.org/10.1186/1471-2288-14-137</a>.</p>
</div>
</div>
</div>

New Year Goals
http://fharrell.com/post/newyeargoals/
Fri, 29 Dec 2017 00:00:00 +0000
http://fharrell.com/post/newyeargoals/
<p>Here are some goals related to scientific research and clinical medicine that I’d like to see accomplished in 2018.</p>
<ul>
<li>Physicians come to know that precision/personalized medicine for the most part is based on a false premise</li>
<li>Machine learning/deep learning is understood to not find previously
unknown information in data in the majority of cases, and tends to
work better than traditional statistical models only when dominant
non-additive effects are present and the signal:noise ratio is
decently high</li>
<li>Practitioners will make more progress in correctly using “old”
statistical tools such as regression models</li>
<li>Medical diagnosis is finally understood as a task in probabilistic
thinking, and sensitivity and specificity (which are characteristics
not only of tests but also of patients) are seldom used</li>
<li>Practitioners using cutpoints/thresholds for inherently continuous
measurements will finally go back to primary references and find
that the thresholds were never supported by data</li>
<li>Dichotomania is seen as a failure to understand utility/loss/cost
functions and as a tragic loss of information</li>
<li>Clinical quality improvement initiatives will rely on randomized
trial evidence and deemphasize purely observational evidence;
learning health systems will learn things that are actually true</li>
<li>Clinicians will give up on the idea that randomized clinical trials
do not generalize to real-world settings</li>
<li>Fewer pre-post studies will be done</li>
<li>More research will be reproducible with sounder sample size
calculations, all data manipulation and analysis fully scripted, and
data available for others to analyze in different ways</li>
<li>Fewer sample size calculations will be based on a ‘miracle’ effect
size</li>
<li>Non-inferiority studies will no longer use non-inferiority margins
that are far beyond clinically significant</li>
<li>Fewer sample size calculations will be undertaken and more
sequential experimentation done</li>
<li>More Bayesian studies will be designed and executed</li>
<li>Classification accuracy will be mistrusted as a measure of
predictive accuracy</li>
<li>More researchers will realize that estimation rather than hypothesis
testing is the goal</li>
<li>Change from baseline will seldom be <em>computed</em>, much less
used in an analysis</li>
<li>Percents will begin to be replaced with fractions and ratios</li>
<li>Fewer researchers will draw <strong>any</strong> conclusion from large p-values
other than “the money was spent”</li>
<li>Fewer researchers will draw conclusions from small p-values</li>
</ul>
<p>Some wishes expressed by others on Twitter:</p>
<ul>
<li>No more ROC curves</li>
<li>No more bar plots</li>
<li>Ban the term ‘statistical significance’ and ‘statistically
insignificant’</li>
</ul>

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction
http://fharrell.com/post/scoredatareduction/
Tue, 21 Nov 2017 15:40:00 +0000
http://fharrell.com/post/scoredatareduction/
<p>This post will grow to cover questions about data reduction methods, also known as <em>unsupervised learning</em> methods. These are intended primarily for two
purposes:</p>
<ul>
<li>collapsing correlated variables into an overall score so that one
does not have to disentangle correlated effects, which is a
difficult statistical task</li>
<li>reducing the effective number of variables to use in a regression or
other predictive model, so that fewer parameters need to be
estimated</li>
</ul>
<p>The latter example is the “too many variables, too few subjects” problem.
Data reduction methods are covered in Chapter 4 of my book <em>Regression
Modeling Strategies</em>, and in some of the book’s case studies.</p>
<hr />
<h3 id="sachavarinwrites">Sacha Varin writes</h3>
<p><small>
I want to add/sum some variables having different units. I decided to
standardize the values (Z-scores) so that, once transformed, I can sum
them. The problem is that the distributions of my variables are
non-Gaussian: they are asymmetric (skewed), long-tailed, all types of
weird distributions; I guess we can say the distributions are
intractable. I know that my distributions don’t need to be Gaussian to
calculate Z-scores. However, if the distributions are not close to
Gaussian, or at least reasonably symmetric, I guess the classical
Z-score transformation (Value - Mean)/SD is not valid. That’s why,
because my distributions are skewed and long-tailed, I decided to use
Gini’s mean difference (a robust and efficient
estimator).</p>
<ol>
<li>If the distributions are skewed and long-tailed, can I standardize
the values using the formula (Value - Mean)/GiniMd? Or is the mean not
a good estimator in the presence of skewed and long-tailed
distributions? What about (Value - Median)/GiniMd? Or what other
formula with GiniMd could be used to standardize?</li>
<li>In the presence of outliers and skewed, long-tailed distributions,
which formula is better to use for standardization:
(Value - Median)/MAD (median absolute deviation) or
(Value - Mean)/GiniMd? And why? Mine is not the predictive modeling
case; I just want to sum the variables.
</small></li>
</ol>
<hr />
<p>These are excellent questions and touch on an interesting side issue.
My opinion is that standard deviations (SDs) are not very applicable to
asymmetric (skewed) distributions, and that they are not very robust
measures of dispersion. I’m glad you mentioned <a href="https://arxiv.org/pdf/1405.5027.pdf" target="_blank">Gini’s mean
difference</a>, which is the mean of
all absolute differences of pairs of observations. It is highly robust
and is surprisingly efficient as a measure of dispersion when compared
to the SD, even when normality
holds.</p>
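For concreteness, Gini’s mean difference as just defined (the mean of all pairwise absolute differences; `GiniMd` in the Hmisc R package) can be sketched in a few lines of plain Python, along with the kind of robust standardization being discussed (the toy data are hypothetical):

```python
from itertools import combinations

def gini_mean_difference(x):
    """Mean of |xi - xj| over all pairs of observations."""
    pairs = list(combinations(x, 2))
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

print(gini_mean_difference([1, 2, 3, 4]))   # 10/6, about 1.67

# A robust standardization along the lines Sacha asks about:
x = [1, 2, 3, 4, 40]                        # long-tailed toy data
med = sorted(x)[len(x) // 2]                # sample median (odd n)
g = gini_mean_difference(x)
z = [(v - med) / g for v in x]              # (Value - Median)/GiniMd
```

For a normal sample, Gini’s mean difference estimates 2σ/√π, about 1.128σ, so it is on a scale comparable to the SD.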
<p>The questions also touch on the fact that when normalizing more than
one variable so that the variables may be combined, there is no magic
normalization method in statistics. I believe that Gini’s mean
difference is as good as any and better than the SD. It is also more
precise than the mean absolute difference from the mean or median, and
the mean may not be robust enough in some instances. But we have a rich
history of methods, such as principal components (PCs), that use
SDs.</p>
<p>What I’m about to suggest is a bit more
applicable to the case where you ultimately want to form a predictive
model, but it can also apply when the goal is to just combine several
variables. When the variables are continuous and are on different
scales, scaling them by SD or Gini’s mean difference will allow one to
create unitless quantities that may possibly be added. But the fact
that they are on different scales raises the question of whether they are
already “linear” or do they need separate nonlinear transformations to
be “combinable”.</p>
<p>I think that nonlinear PCs may be a better choice than just adding
scaled variables. When the predictor variables are correlated,
nonlinear PCs learn from the interrelationships, even occasionally
learning how to optimally transform each predictor to ultimately better
predict Y. The transformations (e.g., fitted spline functions) are
solved for to maximize predictability of a predictor, from the other
predictors or PCs of them. Sometimes the way the predictors move
together is the same way they relate to some ultimate outcome variable
that this unsupervised learning method does not have access to. An
example of this is in Section 4.7.3 of my book.</p>
<p>With a little bit of luck, the transformed predictors have more
symmetric distributions, so ordinary PCs computed on these transformed
variables, with their implied SD normalization, work pretty well. PCs
take into account that some of the component variables are highly
correlated with each other, and so are partially redundant and should
not receive the same weights (“loadings”) as other
variables.</p>
<p>The R transcan function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R
<a href="https://cran.r-project.org/web/packages/homals" target="_blank">homals</a>
package.</p>
<p>How do we handle the case where the number of candidate predictors p is
large in comparison to the effective sample size n? Penalized maximum
likelihood estimation (e.g., ridge regression) and Bayesian regression
typically have the best performance, but data reduction methods are
competitive and sometimes more interpretable. For example, one can use
variable clustering and redundancy analysis as detailed in the RMS book
and course notes. Principal components (linear or nonlinear) can also
be an excellent approach to lowering the number of variables that need
to be related to the outcome variable Y. Two example approaches
are:</p>
<ol>
<li>Use the 15:1 rule of thumb to estimate how many predictors can
reliably be related to Y. Suppose that number is k. Use the first
k principal components to predict Y.</li>
<li>Enter PCs in decreasing order of variation (of the system of Xs)
explained and choose the number of PCs to retain using AIC. This is
far from stepwise regression which enters variables according to
their pvalues with Y. We are effectively entering variables in a
prespecified order with incomplete principal component
regression.</li>
</ol>
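Approach 1 can be sketched with plain numpy (the data here are simulated purely for illustration; in R this would be done with principal components plus `lrm`/`ols` from the rms package):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 10, 3                # subjects, correlated predictors, PCs kept

# Simulated correlated predictors and outcome (illustration only)
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, p)) + 0.5 * rng.normal(size=(n, p))
y = latent[:, 0] + rng.normal(scale=0.5, size=n)

# Principal components via SVD of the standardized predictor matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[:k].T               # scores on the first k components

# Incomplete principal component regression: regress y on the k PCs
A = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta.shape)                   # (4,): intercept plus k coefficients
```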
<p>Once the PC model is formed, one may attempt to interpret the model by
studying how raw predictors relate to the principal components or to the
overall predicted values.</p>
<p>Returning to Sacha’s original setting,
if linearity is assumed for all variables, then scaling by Gini’s mean
difference is reasonable. But psychometric properties should be
considered, and often the scale factors need to be derived from subject
matter rather than statistical
considerations.</p>

Statistical Criticism is Easy; I Need to Remember That Real People are Involved
http://fharrell.com/post/criticismeasy/
Sun, 05 Nov 2017 21:07:00 +0000
http://fharrell.com/post/criticismeasy/
<p>I have been critical of a number of articles, authors, and journals in
<a href="http://fharrell.com/post/errmed/" target="_blank">this</a>
growing blog article. Linking the blog with Twitter is a way to expose
the blog to more readers. It is far too easy to slip into hyperbole on
the blog and even easier on Twitter with its space limitations.
Importantly, many of the statistical problems pointed out in my article
are very, very common, and I dwell on recent publications to get the
point across that inadequate statistical review at medical journals
remains a serious problem. Equally important, many of the issues I
discuss, from p-values and null hypothesis testing to issues with change
scores, are not well covered in medical education (of authors and
referees), and p-values have caused a phenomenal amount of damage to the
research enterprise. Still, journals insist on emphasizing p-values. I
spend a lot of time educating biomedical researchers about statistical
issues and as a reviewer for many medical journals, but still am on a
quest to impact journal editors.</p>
<p>Besides statistical issues, there are very real human issues, and
challenges in keeping clinicians interested in academic clinical
research when there are so many pitfalls, complexities, and compliance
issues. In the many clinical trials with which I have been involved,
I’ve always been glad to be the statistician and not the clinician
responsible for protocol logistics, informed consent, recruiting,
compliance, etc.</p>
<p>A recent case discussed
<a href="http://fharrell.com/post/errmed/#pcisham" target="_blank">here</a>
has brought the human issues home, after I came to know of the
extraordinary efforts made by the
<a href="http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32714-9/fulltext" target="_blank">ORBITA</a>
study’s first author, Rasha Al-Lamee, to make this study a reality.
Placebo-controlled device trials are very difficult to conduct and to
recruit patients into, and this was Rasha’s first effort to launch and
conduct a randomized clinical trial. I very much admire Rasha’s bravery
and perseverance in conducting this trial of PCI, when it is possible
that many past trials of PCI vs. medical therapy were affected by placebo
effects.</p>
<p>Professor of Cardiology at Imperial College London, a coauthor on the
above paper, and Rasha’s mentor,
<a href="https://www.imperial.ac.uk/people/d.francis" target="_blank">Darrel Francis</a>, elegantly pointed
out to me that there is a real person on the receiving end of my
criticism, and I heartily agree with him that none of us would ever want
to discourage a clinical researcher from ever conducting her second
randomized trial. This is especially true when the investigator has a
burning interest to tackle difficult unanswered clinical questions. I
don’t mind criticizing statistical designs and analyses, but I can do a
better job of respecting the sincere efforts and hard work of biomedical
researchers.</p>
<p>I note in passing that I had the honor of being a coauthor with Darrel
on <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081699" target="_blank">this paper</a>
of which I am extremely proud.</p>
<p>Dr Francis gave me permission to include his thoughts, which are below.
After that I list some ideas for making the path to presenting clinical
research findings a more pleasant journey.</p>
<hr />
<p><strong>As the PI for ORBITA, I apologise for this trial being 40 years late,
due to a staffing issue. I had to wait for the lead investigator, Rasha
Al-Lamee, to be born, go to school, study Medicine at Oxford University,
train in interventional cardiology, and start as a consultant in my
hospital, before she could begin the trial.</strong></p>
<p>Rasha had just finished her fellowship. She had experience in clinical
research, but this was her first leadership role in a trial. She was
brave to choose for her PhD a rigorous placebo-controlled trial in this
controversial but important area.</p>
<p>Funding was difficult: grant reviewers, presumably interventional
cardiologists, said the trial was (a) unethical and (b) unnecessary.
This trial only happened because Rasha was able to convince our
colleagues that the question was important and the patients would not be
without stenting for long. Recruitment was challenging because it
required interventionists to not apply the oculostenotic reflex. In the
end the key was Rasha keeping the message at the front of all our
colleagues’ minds with her boundless energy and enthusiasm.
Interestingly, when the concept was explained to patients, they agreed
to participate more easily than we thought, and dropped out less
frequently than we feared. This means we should indeed acquire
placebo-controlled data on interventional procedures.</p>
<p>Incidentally, I advocate the term “placebo” over “sham” for these
trials, for two reasons. First, placebo control is well recognised as
essential for assessing drug efficacy, and this helps people understand
the need for it with devices. Second, “sham” is a pejorative word,
implying deception. There is no deception in a placebo controlled trial,
only pre-agreed withholding of information.</p>
<hr />
<p>There are several ways to improve the system that I believe would foster
clinical research and make peer review more objective and productive.</p>
<ul>
<li>Have journals conduct reviews of background and methods without
knowledge of results.</li>
<li>Abandon journals and use researcher-led online systems that invite
open post-“publication” peer review and give researchers the
opportunity to improve their “paper” in an ongoing fashion.</li>
<li>If not publishing the entire paper online, deposit the background
and methods sections for open pre-journal-submission review.</li>
<li>Abandon null hypothesis testing and p-values. Before that, always
keep in mind that a large p-value means nothing more than “we don’t
yet have evidence against the null hypothesis”, and emphasize
confidence limits.</li>
<li>Embrace Bayesian methods that provide safer and more actionable
evidence, including measures that quantify clinical significance.
And if one is trying to amass evidence that the effects of two
treatments are similar, compute the direct probability of similarity
using a Bayesian model.</li>
<li>Improve statistical education of researchers, referees, and journal
editors, and strengthen statistical review for journals.</li>
<li>Until everyone understands the most important statistical concepts,
better educate researchers and peer reviewers on
<a href="http://biostat.mc.vanderbilt.edu/ManuscriptChecklist" target="_blank">statistical problems to avoid</a>.</li>
</ul>
<p>On a final note, I regularly review clinical trial design papers for
medical journals. I am often shocked at design flaws that authors state
are “too late to fix” in their response to the reviews. This includes
problems caused by improper endpoint variables that necessitated the
randomization of triple the number of patients actually needed to
establish efficacy. Such papers have often been through statistical
review before the journal submission. This points out two challenges:
(1) there is a lot of between-statistician variation that statisticians
need to address, and (2) there are many fundamental statistical concepts
that are not known to many statisticians (witness the widespread use of
change scores and dichotomization of variables even when senior
statisticians are among a paper’s authors).</p>

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired
http://fharrell.com/post/bayesseq/
Mon, 09 Oct 2017 00:00:00 +0000
http://fharrell.com/post/bayesseq/
<p class="rquote">
(In a Bayesian analysis) It is entirely appropriate to collect data
until a point has been proven or disproven, or until the data collector
runs out of time, money, or patience.<br>— <a href="http://psycnet.apa.org/doi/10.1037/h0044139">Edwards, Lindman, Savage (1963)</a>
</p>
<h1 id="introduction">Introduction</h1>
<p>Bayesian inference, which follows the <em>likelihood principle</em>, is not
affected by the experimental design or intentions of the investigator.
P-values can only be computed if both of these are known, and as has been
described by
<a href="http://amstat.tandfonline.com/doi/abs/10.1080/00031305.1987.10475458" target="_blank">Berry</a>
(1987) and others, it is almost never the case that the computation of
the p-value at the end of a study takes into account all the changes in
design that were necessitated when pure experimental designs encounter
the real world.</p>
<p>When performing multiple data looks as a study progresses, one can
accelerate learning by more quickly abandoning treatments that do not
work, by sometimes stopping early for efficacy, and frequently by
arguing to extend a promising but as-yet-inconclusive study by adding
subjects over the originally intended sample size. Indeed the whole
exercise of computing a single sample size is thought to be voodoo by
most practicing statisticians. It has become almost comical to listen to
rationalizations for choosing larger detectable effect sizes so that
smaller sample sizes will yield adequate power.</p>
<p>Multiplicity and resulting inflation of type I error when using
frequentist methods is real. While Bayesians concern themselves with
“what did happen?”, frequentists must consider “what might have
happened?” because of the backwards time and information flow used in
their calculations. Frequentist inference must envision an indefinitely
long string of identical experiments and must consider extremes of data
over potential studies and over multiple looks within each study if
multiple looks were intended. Multiplicity comes from the chances (over
study repetitions and data looks) you give data to be more extreme (if
the null hypothesis holds), not from the chances you give an effect to
be real. It is only the latter that is of concern to a Bayesian.
Bayesians entertain only one dataset at a time, and if one computes
posterior probabilities of efficacy multiple times, it is only the last
value calculated that matters.</p>
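<p>The inflation from repeated frequentist looks is easy to see in a small simulation. The following Python sketch (not from the original post; sample sizes and number of looks are arbitrary) runs trials under a true null effect and applies an unadjusted two-sided z-test after each block of subjects; the chance that <em>some</em> look crosses the 1.96 boundary is well above 0.05:</p>

```python
import math
import random

def rejects_at_any_look(n_per_look=25, looks=5, z_crit=1.96, rng=random):
    """One trial under the null (true mean 0, SD 1): run an unadjusted
    two-sided z-test after each block of subjects and report whether
    ANY interim look crosses the critical value."""
    total, n = 0.0, 0
    for _ in range(looks):
        for _ in range(n_per_look):
            total += rng.gauss(0, 1)
            n += 1
        z = (total / n) * math.sqrt(n)   # z-statistic for known SD = 1
        if abs(z) >= z_crit:
            return True
    return False

random.seed(1)
nsim = 10000
one_look = sum(rejects_at_any_look(looks=1) for _ in range(nsim)) / nsim
five_looks = sum(rejects_at_any_look(looks=5) for _ in range(nsim)) / nsim
print(one_look, five_looks)   # one look stays near 0.05; five looks inflate it
```

<p>With five equally spaced unadjusted looks, the overall type I error is roughly 0.14 rather than 0.05 — the price of considering “what might have happened”.</p>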
<p>To better understand the last point, consider a probabilistic pattern
recognition system for identifying enemy targets in combat. Suppose the
initial assessment when the target is distant is a probability of 0.3 of
being an enemy vehicle. Upon coming closer the probability rises to 0.8.
Finally the target is close enough (or the air clears) so that the
pattern analyzer estimates a probability of 0.98. The fact that the
probability was < 0.98 earlier is of no consequence as the gunner
prepares to fire a cannon. Even though the probability may actually
decrease while the shell is in the air due to new information, the
probability at the time of firing was completely valid based on then
available information.</p>
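<p>The target-recognition analogy can be made concrete with Bayes’ rule on the odds scale. In this hypothetical Python sketch (not part of the original post), each look at the target is summarized by a likelihood ratio; the LR values are invented so that the posterior traces the 0.3 → 0.8 → 0.98 path in the narrative, and each update simply replaces the previous probability:</p>

```python
def update(prob, likelihood_ratio):
    """Bayes' rule on the odds scale: posterior odds = prior odds * LR."""
    odds = prob / (1 - prob) * likelihood_ratio
    return odds / (1 + odds)

# Hypothetical likelihood ratios P(reading | enemy) / P(reading | friend),
# chosen to reproduce the 0.3 -> 0.8 -> 0.98 sequence in the story
p = 0.3                        # initial long-range assessment
for lr in (28 / 3, 12.25):     # two successively closer, clearer looks
    p = update(p, lr)
print(round(p, 2))             # the only number that matters when firing
```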
<p>This is very much how an experimenter would work in a Bayesian clinical
trial. The stopping rule is unimportant when interpreting the final
evidence. Earlier data looks are irrelevant. The only ways a Bayesian
would cheat would be to ignore a later look if it is less favorable than
an earlier look, or to try to pull the wool over reviewers’ eyes by
changing the prior distribution once data patterns emerge.</p>
<p>The meaning and accuracy of posterior probabilities of efficacy in a
clinical trial are mathematical necessities that follow from Bayes’
rule, if the data model is correctly specified (this model is needed
just as much by frequentist methods). So no simulations are needed to
demonstrate these points. But for the non-mathematically minded,
simulations can be comforting. For everyone, simulation code exposes the
logic flow in the Bayesian analysis paradigm.</p>
<p>One other thing: when the frequentist does a sequential trial with
possible early termination, the sampling distribution of the statistics
becomes extremely complicated, but must be derived to allow one to
obtain proper point estimates and confidence limits. It is almost never
the case that the statistician actually performs these complex
adjustments in a clinical trial with multiple looks. One example of the
harm of ignoring this problem is that if the trial stops fairly early
for efficacy, efficacy will be overestimated. On the other hand, the
Bayesian posterior mean/median/mode of the efficacy parameter will be
perfectly calibrated by the prior distribution you assume. If the prior
is skeptical and one stops early, the posterior mean will be “pulled
back” by a perfect amount, as shown in the simulation below.</p>
<p>We consider the simplest clinical trial design for illustration. The
efficacy measure is assumed to be normally distributed with mean μ and
variance 1.0, μ=0 indicates no efficacy, and μ<0 indicates a
detrimental effect. Our inferential jobs are to see if evidence may be
had for a positive effect and to see if further there is evidence for a
clinically meaningful effect (except for the futility analysis, we will
ignore the latter in what follows). Our business task is to not spend
resources on treatments that have a low chance of having a meaningful
benefit to patients. The latter can also be an ethical issue: we’d like
not to expose too many patients to an ineffective treatment. In the
simulation, we stop for futility when the probability that μ<0.05
exceeds 0.9, considering μ=0.05 to be a minimal clinically important
effect.</p>
<p>The logic flow in the simulation exposes what is assumed by the Bayesian
analysis.</p>
<ol>
<li>The prior distribution for the unknown effect μ is taken as a
mixture of two normal distributions, each with mean zero. This is a
skeptical prior that gives an equal chance for detriment as for
benefit from the treatment. Any prior would have done.</li>
<li>In the next step it is seen that the Bayesian does not consider a
stream of identical trials but instead (and only when studying
performance of Bayesian operating characteristics) considers a
stream of trials with <strong>different</strong> efficacies of treatment, by
drawing a single value of μ from the prior distribution. This is
done independently for 50,000 simulated studies. Posterior
probabilities are not informed by this value of μ. Bayesians operate
in a predictive mode, trying for example to estimate Prob(μ>0) no
matter what the value of μ.</li>
<li>For the current value of μ, simulate an observation from a normal
distribution with mean μ and SD=1.0. [In the code below all n=500
subjects’ data are simulated at once then revealed one-at-a-time.]</li>
<li>Compute the posterior probability of efficacy (μ>0) and of
futility (μ<0.05) using the original prior and latest data.</li>
<li>Stop the study if the probability of efficacy ≥0.95 or the
probability of futility ≥0.9.</li>
<li>Repeat the last 3 steps, sampling one more subject each time and
performing analyses on the accumulated set of subjects to date.</li>
<li>Stop the study when 500 subjects have entered.</li>
</ol>
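<p>The steps above can be sketched in a few lines. This hypothetical Python version simplifies the prior to a single normal N(0, τ²) so the posterior has a closed conjugate form (the post itself uses a two-component mixture via <code>gbayesMixPost</code>), draws each trial’s true μ from that same prior, and checks the calibration claim that the mean posterior probability at stopping for efficacy matches the proportion of those trials with μ > 0:</p>

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf   # standard normal CDF

def run_trial(mu, tau=0.5, n_max=500, eff_cut=0.95, fut_cut=0.9,
              mcid=0.05, rng=random):
    """One sequential Bayesian trial with a single N(0, tau^2) prior
    (the post uses a two-normal mixture; one component keeps the
    conjugate normal update simple).  Data are N(mu, 1); analyses are
    run after every subject.  Returns (reason, n, prob at stopping)."""
    total = 0.0
    prec0 = 1.0 / tau ** 2            # prior precision
    for n in range(1, n_max + 1):
        total += rng.gauss(mu, 1)     # one more subject's response
        prec = prec0 + n              # posterior precision for known-SD data
        pmean, psd = total / prec, math.sqrt(1.0 / prec)
        p_eff = PHI(pmean / psd)             # P(mu > 0 | data)
        p_fut = PHI((mcid - pmean) / psd)    # P(mu < 0.05 | data)
        if p_eff >= eff_cut:
            return 'efficacy', n, p_eff
        if p_fut >= fut_cut:
            return 'futility', n, p_fut
    return 'completed', n_max, p_eff

random.seed(2)
tau = 0.5
# Step 2: each simulated trial gets its own true effect, drawn from the
# same prior used in the analysis
truths = [random.gauss(0, tau) for _ in range(2000)]
results = [run_trial(m, tau=tau) for m in truths]

# Calibration check for trials stopped early for efficacy: the mean
# posterior probability at stopping should match the proportion with mu > 0
eff = [(r[2], m > 0) for r, m in zip(results, truths) if r[0] == 'efficacy']
mean_pp = sum(p for p, _ in eff) / len(eff)
prop_true = sum(t for _, t in eff) / len(eff)
print(round(mean_pp, 3), round(prop_true, 3))
```

<p>Because the data-generating prior and the analysis prior agree, the two printed numbers agree up to simulation error — the same phenomenon the R simulation below demonstrates with the mixture prior.</p>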
<p>What is it that the Bayesian must demonstrate to the frequentist and
reviewers? She must demonstrate that the posterior probabilities
computed as stated above are accurate, i.e., they are well calibrated.
From our simulation design, the final posterior probability will either
be the posterior probability computed after the last (500th) subject has
entered, the probability of futility at the time of stopping for
futility, or the probability of efficacy at the time of stopping for
efficacy. How do we tell if the posterior probability is accurate? By
comparing it to the value of μ (unknown to the posterior probability
calculation) that generated the sequence of data points that were
analyzed. We can compute a smooth nonparametric calibration curve for
each of (efficacy, futility) where the binary events are μ>0 and μ<0.05, respectively. For the subset of the 50,000 studies that were
stopped early, the range of probabilities is limited so we can just
compare the mean posterior probability at the moment of stopping with
the proportion of such stopped studies for which efficacy (futility) was
the truth. The mathematics of Bayes dictates the mean probability and
the proportion must be the same (if enough trials are run so that
simulation error approaches zero). This is what happened in the
simulations.</p>
<p>For the smaller set of studies not stopping early, the posterior
probability of efficacy is uncertain and will have a much wider range.
The calibration accuracy of these probabilities is checked using a
nonparametric calibration curve estimator just as we do in validating
risk models, by fitting the relationship between the posterior
probability and the binary event μ>0.</p>
<p>The simulations also demonstrated that the posterior mean efficacy at
the moment of stopping is perfectly calibrated as an estimator of the
true unknown μ.</p>
<p>Simulations were run in R and used functions in the R Hmisc and rms
package. The results are below. Feel free to take the code and alter it
to run any simulations you’d like.</p>
<pre><code class="language-r">require(rms)
</code></pre>
<pre><code class="language-r">knitrSet(lang='blogdown', echo=TRUE)
gmu  <- htmlGreek('mu')
half <- htmlSpecial('half')
geq  <- htmlTranslate('>=')
knitr::read_chunk('fundefs.r')
</code></pre>
<h1 id="specificationofprior">Specification of Prior</h1>
<p>The prior distribution is skeptical against large values of efficacy, and assumes that detriment is equally likely as benefit of treatment. The prior favors small effects. It is a 1:1 mixture of two normal distributions, each with mean 0. The SD of the first distribution is chosen so that P(μ > 1) = 0.1, and the SD of the second distribution is chosen so that P(μ > 0.25) = 0.05. Posterior probabilities upon early stopping would have the same accuracy no matter which prior is chosen as long as the same prior generating μ is used to generate the data.</p>
<pre><code class="language-r">sd1 <- 1 / qnorm(1 - 0.1)
sd2 <- 0.25 / qnorm(1 - 0.05)
wt  <- 0.5   # 1:1 mixture
pdensity <- function(x) wt * dnorm(x, 0, sd1) + (1 - wt) * dnorm(x, 0, sd2)
x <- seq(-3, 3, length=200)
plot(x, pdensity(x), type='l', xlab='Efficacy', ylab='Prior Degree of Belief')
</code></pre>
<p><img src="http://fharrell.com/post/bayesseq_files/figurehtml/skepprior1.png" width="672" /></p>
<h1 id="sequentialtestingsimulation">Sequential Testing Simulation</h1>
<pre><code class="language-r">simseq <- function(N, prior.mu=0, prior.sd, wt, mucut=0, mucutf=0.05,
                   postcut=0.95, postcutf=0.9,
                   ignore=20, nsim=1000) {
  prior.mu <- rep(prior.mu, length=2)
  prior.sd <- rep(prior.sd, length=2)
  sd1 <- prior.sd[1]; sd2 <- prior.sd[2]
  v1 <- sd1 ^ 2
  v2 <- sd2 ^ 2
  j <- 1 : N
  cmean <- Mu <- PostN <- Post <- Postf <- postfe <- postmean <- numeric(nsim)
  stopped <- stoppedi <- stoppedf <- stoppedfu <- stopfe <- status <-
    integer(nsim)
  notignored <- - (1 : ignore)
  # Derive function to compute posterior mean
  pmean <- gbayesMixPost(NA, NA, d0=prior.mu[1], d1=prior.mu[2],
                         v0=v1, v1=v2, mix=wt, what='postmean')
  for(i in 1 : nsim) {
    # See http://stats.stackexchange.com/questions/70855
    component <- if(wt == 1) 1 else sample(1 : 2, size=1, prob=c(wt, 1. - wt))
    mu <- prior.mu[component] + rnorm(1) * prior.sd[component]
    # mu <- rnorm(1, mean=prior.mu, sd=prior.sd) if only 1 component
    Mu[i] <- mu
    y <- rnorm(N, mean=mu, sd=1)
    ybar <- cumsum(y) / j # all N means for N sequential analyses
    pcdf <- gbayesMixPost(ybar, 1. / j,
                          d0=prior.mu[1], d1=prior.mu[2],
                          v0=v1, v1=v2, mix=wt, what='cdf')
    post  <- 1 - pcdf(mucut)
    PostN[i] <- post[N]
    postf <- pcdf(mucutf)
    s <- stopped[i] <-
      if(max(post) < postcut) N else min(which(post >= postcut))
    Post[i]  <- post[s] # posterior at stopping
    cmean[i] <- ybar[s] # observed mean at stopping
    # If want to compute posterior median at stopping:
    # pcdfs <- pcdf(mseq, x=ybar[s], v=1. / s)
    # postmed[i] <- approx(pcdfs, mseq, xout=0.5, rule=2)$y
    # if(abs(postmed[i]) == max(mseq)) stop(paste('program error', i))
    postmean[i] <- pmean(x=ybar[s], v=1. / s)
    # Compute stopping time if ignore the first "ignore" looks
    stoppedi[i] <- if(max(post[notignored]) < postcut) N
                   else
                     ignore + min(which(post[notignored] >= postcut))
    # Compute stopping time if also allow to stop for futility:
    # posterior probability mu < 0.05 > 0.9
    stoppedf[i] <- if(max(post) < postcut & max(postf) < postcutf) N
                   else
                     min(which(post >= postcut | postf >= postcutf))
    # Compute stopping time for pure futility analysis
    s <- if(max(postf) < postcutf) N else min(which(postf >= postcutf))
    Postf[i] <- postf[s]
    stoppedfu[i] <- s
    ## Another way to do this: find first look that stopped for either
    ## efficacy or futility. Record status: 0:not stopped early,
    ## 1:stopped early for futility, 2:stopped early for efficacy
    ## Stopping time: stopfe, post prob at stop: postfe
    stp <- post >= postcut | postf >= postcutf
    s <- stopfe[i] <- if(any(stp)) min(which(stp)) else N
    status[i] <- if(any(stp)) ifelse(postf[s] >= postcutf, 1, 2) else 0
    postfe[i] <- if(any(stp)) ifelse(status[i] == 2, post[s],
                                     postf[s]) else post[N]
  }
  list(mu=Mu, post=Post, postn=PostN, postf=Postf,
       stopped=stopped, stoppedi=stoppedi,
       stoppedf=stoppedf, stoppedfu=stoppedfu,
       cmean=cmean, postmean=postmean,
       postfe=postfe, status=status, stopfe=stopfe)
}
</code></pre>
<pre><code class="language-r">set.seed(1)
z <- simseq(500, prior.mu=0, prior.sd=c(sd1, sd2), wt=wt, postcut=0.95,
            postcutf=0.9, nsim=50000)
mu       <- z$mu
post     <- z$post
postn    <- z$postn
st       <- z$stopped
sti      <- z$stoppedi
stf      <- z$stoppedf
stfu     <- z$stoppedfu
cmean    <- z$cmean
postmean <- z$postmean
postf    <- z$postf
status   <- z$status
postfe   <- z$postfe
rmean <- function(x) formatNP(mean(x), digits=3)
k  <- status == 2
kf <- status == 1
</code></pre>
<ul>
<li>Run 50,000 <b>different</b> clinical trials (differ on amount of efficacy)</li>
<li>For each, generate μ (true efficacy) from the prior</li>
<li>Generate data (n=500) under this truth</li>
<li>½ of the trials have zero or negative efficacy</li>
<li>Do analysis after 1, 2, …, 500 subjects studied</li>
<li>Stop the study when 0.95 sure efficacy > 0, i.e., stop the instant the posterior prob. that the unknown mean μ is positive is ≥ 0.95</li>
<li><p>Also stop for futility: the instant P(μ < 0.05) ≥ 0.9</p></li>
<li><p>20393 trials stopped early for efficacy</p></li>
<li><p>28438 trials stopped early for futility</p></li>
<li><p>1169 trials went to completion (n=500)</p></li>
<li><p>Average posterior prob. of efficacy at stopping for efficacy: 0.961</p></li>
<li><p>Of trials stopped early for efficacy, proportion with μ > 0: 0.960</p></li>
<li><p>Average posterior prob. of futility at stopping for futility: 0.920</p></li>
<li><p>Of trials stopped early for futility, proportion with μ < 0.05: 0.923</p></li>
</ul>
<p>The simulations took about 25 seconds in total.</p>
<h1 id="calibrationofposteriorprobabilitiesofefficacyforstudiesgoingtocompletion">Calibration of Posterior Probabilities of Efficacy for Studies Going to Completion</h1>
<p>Above we saw perfect calibration of the probabilities of efficacy and futility upon stopping. Let’s now examine the remaining probabilities, for the 1169 trials going to completion. For this we use the same type of nonparametric calibration curve estimation as used for validating risk prediction models. This curve estimates the relationship between the estimated probability of efficacy (Bayesian posterior probability) and the true probability of efficacy.</p>
<pre><code class="language-r">k <- status == 0
pp <- postfe[k]
truly.efficacious <- mu[k] > 0
v <- val.prob(pp, truly.efficacious)
</code></pre>
<p><img src="http://fharrell.com/post/bayesseq_files/figurehtml/cal1.png" width="672" /></p>
<p>The posterior probabilities of efficacy tended to be between 0.45 (had they been much lower the trial would have been stopped for futility) and 0.95 (the cutoff for stopping for efficacy). Where there are data, the nonparametric calibration curve estimate is very close to the line of identity. Had we done even more simulations we would have had many more nonstopped studies and the calibration estimates would be even closer to the ideal. For example, when the posterior probability of efficacy is 0.6, the true probability that the treatment was effective (μ actually > 0) is 0.6.</p>
<h1 id="calibrationofposteriormeanatstoppingforefficacy">Calibration of Posterior Mean at Stopping for Efficacy</h1>
<p>When stopping early because of evidence that μ > 0, the sample mean will overestimate the true mean. But with the Bayesian analysis, where the prior favors smaller treatment effects, the posterior mean/median/mode is pulled back by a perfect amount, as shown in the plot below.</p>
<pre><code class="language-r">plot(0, 0, xlab='Estimated Efficacy',
     ylab='True Efficacy', type='n', xlim=c(-2, 4), ylim=c(-2, 4))
abline(a=0, b=1, col=gray(.9), lwd=4)
lines(supsmu(cmean, mu))
lines(supsmu(postmean, mu), col='blue')
text(2, .4, 'Sample mean')
text(1, .8, 'Posterior mean', col='blue')
</code></pre>
<p><img src="http://fharrell.com/post/bayesseq_files/figurehtml/estmu1.png" width="672" /></p>
<h1 id="exampletexttocommunicatestudydesignoverviewtoasponsor">Example Text to Communicate Study Design Overview to a Sponsor</h1>
<p>It is always the case that estimating a single fixed sample size is problematic, because a number of assumptions must be made, and the veracity of those assumptions is not known until the study is completed. A sequential Bayesian approach allows for a lower expected sample size if some allowance can be made for the possibility that the study reaches a certain landmark with equivocal results and is then extended. The idea is to compute the (Bayesian) probability of efficacy as often as desired. The study could be terminated early for futility or harm, and less likely, for efficacy. Such early termination would save more resources than one would spend to extend a promising but equivocal study, on the average. The intended sample size would still be set. At that point, if results are equivocal but promising (e.g. Bayesian posterior probability of efficacy is > 0.8), the sponsor would have the option to decide to extend the study by adding more patients, perhaps in blocks of 50.</p>
<!--
# Useful References
Berry[@ber87int], Edwards, Lindman and Savage[@edw63bay]
-->
<h1 id="computingenvironment">Computing Environment</h1>
<!--html_preserve--><pre>
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

attached base packages:
[1] methods   stats     graphics  grDevices utils     datasets  base

other attached packages:
[1] rms_5.1-3       SparseM_1.77    Hmisc_4.1-2     ggplot2_2.2.1
[5] Formula_1.2-3   survival_2.42-3 lattice_0.20-35
</pre>
<p>To cite R in publications use:</p>
<p>R Core Team (2018).
<em>R: A Language and Environment for Statistical Computing</em>.
R Foundation for Statistical Computing, Vienna, Austria.
<a href="https://www.R-project.org/">https://www.R-project.org/</a>.
</p>
<!--/html_preserve-->

Bayesian vs. Frequentist Statements About Treatment Efficacy
http://fharrell.com/post/bayesfreqstmts/
Wed, 04 Oct 2017 00:00:00 +0000
http://fharrell.com/post/bayesfreqstmts/
<p><p class="rquote">
To avoid “false positives” do away with “positive”.<br><br>
A good poker player plays the odds by thinking to herself “The probability I can win with this hand is 0.91” and not “I’m going to win this game” when deciding the next move.<br><br>
State conclusions honestly, completely deferring judgments and actions to the ultimate decision makers. Just as it is better to <a href="http://fharrell.com/post/classification">make predictions than classifications</a> in prognosis and diagnosis, use the word “probably” liberally, and avoid thinking “the evidence against the null hypothesis is strong, so we conclude the treatment works” which creates the opportunity of a false positive.<br><br>
Propagation of uncertainties throughout research, reporting, and implementation will result in better decision making and getting more data when needed. Imagine a physician saying to a patient “The chance this drug will lower your blood pressure by more than 3mmHg is 0.93.”
</p>
The following examples are intended to show the advantages of Bayesian reporting of
treatment efficacy analysis, as well as to provide examples contrasting
with frequentist reporting. As detailed
<a href="http://fharrell.com/post/pvallitany/" target="_blank">here</a>,
there are many problems with p-values, and some of those problems will
be apparent in the examples below. Many of the advantages of Bayes are
summarized <a href="http://fharrell.com/post/journey/" target="_blank">here</a>.
As seen below, Bayesian posterior probabilities prevent one from
concluding equivalence of two treatments on an outcome when the data do
not support that (i.e., the <a href="http://fharrell.com/post/errmed/" target="_blank">“absence of evidence is not evidence of
absence”</a> error).</p>
<p>Suppose that a parallel group randomized clinical trial is conducted to
gather evidence about the relative efficacy of new treatment B to a
control treatment A. Suppose there are two efficacy endpoints: systolic
blood pressure (SBP) and time until cardiovascular/cerebrovascular
event. Treatment effect on the first endpoint is assumed to be
summarized by the B-A difference in true mean SBP. The second endpoint
is assumed to be summarized as a true B:A hazard ratio (HR). For the
Bayesian analysis, assume that prespecified skeptical prior
distributions were chosen as follows. For the unknown difference in mean
SBP, the prior was normal with mean 0 with SD chosen so that the
probability that the absolute difference in SBP between A and B exceeds
10mmHg was only 0.05. For the HR, the log HR was assumed to have a
normal distribution with mean 0 and SD chosen so that the prior
probability that the HR > 2 or HR < ½ was 0.05. Both priors
specify that it is equally likely that treatment B is effective as it is
detrimental. The two prior distributions will be referred to as p1 and
p2.</p>
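<p>Because both priors are mean-zero normals, each SD is a one-line calculation from the stated tail probabilities. A minimal Python sketch (the numbers follow from the specification above, not from the article):</p>

```python
import math
from statistics import NormalDist

z975 = NormalDist().inv_cdf(0.975)   # splits 0.05 into two 0.025 tails

# p1: N(0, sd) prior for the B-A difference in mean SBP, with
# P(|difference| > 10 mmHg) = 0.05
sd_sbp = 10 / z975

# p2: N(0, sd) prior for the log hazard ratio, with
# P(HR > 2 or HR < 1/2) = 0.05, i.e. P(|log HR| > log 2) = 0.05
sd_loghr = math.log(2) / z975

print(round(sd_sbp, 3), round(sd_loghr, 3))   # 5.102 0.354
```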
<h3 id="example1socallednegativetrialconsideringonlysbp">Example 1: So-called “Negative” Trial (Considering only SBP)</h3>
<p>Frequentist Statement</p>
<ul>
<li>Incorrect Statement: Treatment B did not improve SBP when compared
to A (p=0.4)</li>
<li>Confusing Statement: Treatment B was not significantly different
from treatment A (p=0.4)</li>
<li>Accurate Statement: We were unable to find evidence against the
hypothesis that A=B (p=0.4). More data will be needed. As the
statistical analysis plan specified a frequentist approach, the
study did not provide evidence of similarity of A and B (but see the
confidence interval below).</li>
<li>Supplemental Information: The observed B-A difference in means was
4mmHg with a 0.95 confidence interval of [-5, 13]. If this study
could be indefinitely replicated and the same approach used to
compute a confidence interval each time, 0.95 of such varying
confidence intervals would contain the unknown true difference in
means. Based on the current study, the probability that the true difference is within [-5, 13] is either zero or one, i.e., we don’t really know how to interpret the interval.</li>
</ul>
<p>Bayesian Statement</p>
<ul>
<li>Assuming prior distribution p1 for the mean difference of SBP, the
probability that SBP with treatment B is lower than treatment A is
0.67. Alternative statement: SBP is probably (0.67) reduced with
treatment B. The probability that B is inferior to A is 0.33.
Assuming a minimally clinically important difference in SBP of
3mmHg, the probability that the mean for A is within 3mmHg of the
mean for B is 0.53, so the study is uninformative about the question
of similarity of A and B.</li>
<li>Supplemental Information: The posterior mean difference in SBP was
3.3mmHg and the 0.95 credible interval is [-4.5, 10.5]. The
probability is 0.95 that the true treatment effect is in the
interval [-4.5, 10.5]. [could include the posterior density
function here, with a shaded right tail with area 0.67.]</li>
</ul>
<h3 id="example2socalledpositivetrial">Example 2: So-called “Positive” Trial</h3>
<p>Frequentist Statement</p>
<ul>
<li>Incorrect Statement: The probability that there is no difference in
mean SBP between A and B is 0.02</li>
<li>Confusing Statement: There was a statistically significant
difference between A and B (p=0.02).</li>
<li>Correct Statement: There is evidence against the null hypothesis of
no difference in mean SBP (p=0.02), and the observed difference
favors B. Had the experiment been exactly replicated indefinitely,
0.02 of such repetitions would result in more impressive results if
A=B.</li>
<li>Supplemental Information: Similar to above.</li>
<li>Second Outcome Variable, If the p-value is Small: Separate
statement, of same form as for SBP.</li>
</ul>
<p>Bayesian Statement</p>
<ul>
<li>Assuming prior p1, the probability that B lowers SBP when compared
to A is 0.985. Alternative statement: SBP is probably (0.985)
reduced with treatment B. The probability that B is inferior to A is
0.015.</li>
<li>Supplemental Information: Similar to above, plus evidence about
clinically meaningful effects, e.g.: The probability that B lowers
SBP by more than 3mmHg is 0.81.</li>
<li>Second Outcome Variable: Bayesian approach allows one to make a
separate statement about the clinical event HR and to state evidence
about the joint effect of treatment on SBP and HR. Examples:
Assuming prior p2, HR is probably (0.79) lower with treatment B.
Assuming priors p1 and p2, the probability that treatment B both
decreased SBP and decreased event hazard was 0.77. The probability
that B improved <strong>either</strong> of the two endpoints was 0.991.</li>
</ul>
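<p>With posterior draws in hand, joint statements like these are just proportions over the samples. A hypothetical Python sketch (invented normal draws stand in for real MCMC output from the fitted model, with negative values of each quantity favoring treatment B):</p>

```python
import random

random.seed(3)
# Stand-ins for joint posterior draws of (SBP difference, log HR);
# real draws would come from the fitted Bayesian model
draws = [(random.gauss(-3.3, 3.8), random.gauss(-0.1, 0.15))
         for _ in range(50000)]

p_sbp = sum(d < 0 for d, _ in draws) / len(draws)        # B lowers SBP
p_hr = sum(h < 0 for _, h in draws) / len(draws)         # B lowers hazard
p_both = sum(d < 0 and h < 0 for d, h in draws) / len(draws)
p_either = sum(d < 0 or h < 0 for d, h in draws) / len(draws)
print(p_sbp, p_hr, p_both, p_either)
```

<p>The same draws support any event of interest (e.g., SBP lowered by more than 3mmHg) with no multiplicity adjustment — each probability is a direct summary of the one joint posterior.</p>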
<p>One would also report basic results. For SBP, frequentist results might
be chosen as the mean difference and its standard error. Basic Bayesian
results could be said to be the entire posterior distribution of the SBP
mean difference.</p>
<p>Note that if multiple looks were made as the trial progressed, the
frequentist estimates (including the observed mean difference) would
have to undergo complex adjustments. Bayesian results require no
modification whatsoever, but just involve reporting the latest available
cumulative evidence.</p>