tag:blogger.com,1999:blog-53761393226965034422017-09-19T05:56:32.862-05:00Statistical ThinkingThis blog is devoted to statistical thinking and its impact on science and everyday life. Emphasis is given to maximizing the use of information, avoiding statistical pitfalls, describing problems caused by the frequentist approach to statistical inference, describing advantages of Bayesian and likelihood methods, and discussing intended and unintended differences between statistics and data science. I'll also cover regression modeling strategies, clinical trials, and drug evaluation.Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.comBlogger17125tag:blogger.com,1999:blog-5376139322696503442.post-26668194490177253302017-08-01T10:12:00.002-05:002017-08-02T06:06:57.378-05:00Integrating Audio, Video, and Discussion Boards with Course NotesAs a biostatistics teacher I've spent a lot of time thinking about inverting the classroom and adding multimedia content. My first thought was to create YouTube videos corresponding to sections in my lecture notes. This typically entails recording the computer screen while going through slides, adding a voiceover. I realized that the maintenance of such videos is difficult, and this also creates a barrier to adding new content. In addition, the quality of the video image is lower than just having the student use a pdf viewer on the original notes. For those reasons I decided to create audio narration for the sections in the notes to largely capture what I would say during a live lecture. The audio <tt>mp3</tt> files are stored on a local server and are streamed on demand when a student clicks on the audio icon in a section of the notes. The audio recordings can also be downloaded one-at-a-time or in a batch. <p>The notes themselves are created using <tt>LaTeX, R</tt>, and <tt>knitr</tt> using a <tt>LaTeX</tt> style I created that is a compromise format between projecting slides and printing notes. 
In the future I will explore using <tt>bookdown</tt> for creating content in <tt>html</tt> instead of <tt>pdf</tt>. In either case, the notes can change significantly when R commands within them are re-executed by <tt>knitr</tt> in <tt>R</tt>. <p>An example of a page of <tt>pdf</tt> notes with icons that link to audio or video content is in Section 10.5 of <a href="http://biostat.mc.vanderbilt.edu/tmp/bbr.pdf">BBR</a>. I add red letters in the right margin for each subsection in the text, and occasionally call out these letters in the audio so that the student will know where I am. <p>There are several student activities for which the course would benefit by recording information. Two of them are students pooling notes taken during class sessions, and questions and answers between sessions. The former might be handled by simultaneous editing or wiki curation on the cloud, and I haven't thought very much about how to link this with the course notes to in effect expand the notes for the next class of students. Let's consider the Q&A aspect. It would be advantageous for questions and answers to "grow", and for future students to take advantage of the Q&As from past students. Being able to look at a subsection in the course notes and quickly link to the cumulative Q&A on that topic is a plus. My first attempt at this was to set up a <a href="http://slack.com">slack.com</a> team for courses in our department, and then to set up a channel for each of the two courses I teach. As <tt>slack</tt> does not allow sub-channels, the discussions need to be organized in some way. I went about this by assigning a mnemonic in the course notes that should be mentioned when a threaded discussion is started in <tt>slack</tt>. Students can search for discussions about a subsection in the notes by searching for that mnemonic. 
I have put hyperlinks from the notes to a slack search expression that is supposed to bring up discussions related to the mnemonic in the course's <tt>slack</tt> channel. The problem is that <tt>slack</tt> doesn't have a formal URL construction that guarantees that a hyperlink to a URL with that expression will cause the correct discussions to pop up in the user's browser. This is a work in progress, and other ideas are welcomed. See Section 10.5.1 of <a href="http://biostat.mc.vanderbilt.edu/tmp/bbr.pdf">BBR</a> for an example where an icon links to slack (see the mnemonic <tt>reg-simple</tt>). <p>Besides being hard to figure out how to create URLs to get the student and instructor directly into a specific discussion, <tt>slack</tt> has the disadvantage that users need to be invited to join the team. If every team member is to be from the instructor's university, you can configure <tt>slack</tt> so that anyone with an email address in the instructor's domain can be added to the team automatically. <p>I have entertained another approach of using <a href="http://disqus.com">disqus</a> for linking student comments to sections of notes. This is very easy to set up, but when one wants to have a separate discussion about each notes subsection, I haven't figured out how to have <tt>disqus</tt> use keywords or some other means to separate the discussions. <p><a href="http://stats.stackexchange.com">stats.stackexchange.com</a> is the world's most active Q&A and discussion board for statistics. Its ability to format questions, answers, comments, math equations, and images is unsurpassed. Perhaps every discussion about a statistical issue should be started in <tt>stackexchange</tt> and then linked to from the course notes. This has the disadvantage of needing to link to multiple existing <tt>stackexchange</tt> questions related to one topic, but has the great advantage of gathering input from statisticians around the world, not just those in the class. 
<p>No matter which method for entering Q & A is used, I think that such comments need to be maintained separately from the course notes because of the dynamic, reproducible nature of the notes using <tt>knitr</tt>. Just as important, when I add new static content to the notes I want the existing student comments to just move appropriately with these changes. Hyperlinking to Q & A does that. There is one more issue not discussed above - students often annotate the <tt>pdf</tt> file, but their annotations are undone when I produce an update to the notes. It would be nice to have some sort of dynamic annotation capability. This is likely to work better as I use <tt>R bookdown</tt> for new notes I develop. <p>I need your help in refining the approach or discovering completely new approaches to coordination of information using the course notes as a hub. Please add comments to this post below, or short suggestions to <tt>@f2harrell</tt> on <tt>twitter</tt>. Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com11tag:blogger.com,1999:blog-5376139322696503442.post-12940417679748867562017-06-01T14:18:00.001-05:002017-07-17T06:42:04.522-05:00EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection<div align="center">Frank Harrell<br>Professor of Biostatistics<br>Vanderbilt University School of Medicine<br><br>Laura Lazzeroni<br>Professor of Psychiatry and, by courtesy, of Medicine (Cardiovascular Medicine) and of Biomedical Data Science<br>Stanford University School of Medicine<br><span style="font-size: 70%;"><b>Revised July 17, 2017</b></span></div> <p>It is often said that randomized clinical trials (RCTs) are the gold standard for learning about therapeutic effectiveness. This is because the treatment is assigned at random so no variables, measured or unmeasured, will be truly related to treatment assignment. The result is an unbiased estimate of treatment effectiveness. 
On the other hand, observational data arising from clinical practice has all the biases of physicians and patients in who gets which treatment. Some treatments are indicated for certain types of patients; some are reserved for very sick ones. The fact is that the selection of treatment is often chosen on the basis of patient characteristics that influence patient outcome, some of which may be unrecorded. When the outcomes of different groups of patients receiving different treatments are compared, without adjustment for patient characteristics related to treatment selection and outcome, the result is a bias called <i>confounding by indication</i>. <p>To set the stage for our discussion of the challenges caused by confounding by indication, incomplete data, and unreliable data, first consider a nearly ideal observational treatment study then consider an ideal RCT. First, consider a potentially optimal observational cohort design that has some possibility of providing an accurate treatment outcome comparison. Suppose that an investigator has obtained $2M in funding to hire trained research nurses to collect data completely and accurately, and she has gone to the trouble of asking five expert clinicians in the disease/treatment area to each independently list the patient characteristics they perceive are used to select therapies for patients. The result is a list of 18 distinct patient characteristics, for which a data dictionary is written and case report forms are collected. Data collectors are instructed to obtain these 18 variables on every patient with very few exceptions, and other useful variables, especially strong prognostic factors, are collected in addition. Details about treatment are also captured, including the start and ending dates of treatment, doses, and dose schedule. Outcomes are well defined and never missing. 
The sample size is adequate, and when data collection is complete, analysis of covariance is used to estimate the outcome difference for treatment A vs. treatment B. Then the study PI discovers that there is a strong confounder that none of the five experts thought of, and a sensitivity analysis indicates that the original treatment effect estimate might have been washed away by the additional confounder had it been collected. The study results in no reliable knowledge about the treatments. <p>The study just described represents a high level of observational study quality, and still needed some luck to be useful. The treatments, entry criteria, and follow-up clock were well defined, and there were almost no missing data. Contrast that with the electronic health record (EHR). If questions of therapeutic efficacy are so difficult to answer with nearly perfect observational data how can they be reliably answered from EHR data alone? <p>To complete our introduction to the discussion, envision a well-conducted parallel-group RCT with complete follow-up and highly accurate and relevant baseline data capture. Study inclusion criteria allowed for a wide range of age and severity of disease. The endpoint is time until a devastating clinical event. The treatment B:treatment A covariate-adjusted hazard ratio is 0.8 with 0.95 credible interval of [0.65, 0.93]. The authors, avoiding unreliable subgroup analysis, perform a careful but comprehensive assessment of interaction between patient types and treatment effect, finding no evidence for heterogeneity of treatment effect (HTE). The hazard ratio of 0.8 is widely generalizable, even to populations with much different baseline risk. A simple nomogram is drawn to show how to estimate absolute risk reduction by treatment B at 3 years, given a patient's baseline 3y risk. 
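<p>The arithmetic behind such a nomogram can be sketched in a few lines. This is only an illustration, not the nomogram from the hypothetical trial: it assumes proportional hazards, so that survival at 3 years under treatment B equals survival under treatment A raised to the power of the hazard ratio. The function name <tt>absolute_risk_reduction</tt> and the baseline risks in the loop are made up for the example; only the hazard ratio of 0.8 comes from the text.

```python
def absolute_risk_reduction(baseline_risk, hazard_ratio):
    """Absolute risk reduction at a fixed time point (e.g., 3 years).

    Assumes proportional hazards: S_B(t) = S_A(t) ** hazard_ratio,
    where S_A(t) = 1 - baseline_risk is survival under treatment A.
    """
    if not 0.0 <= baseline_risk <= 1.0:
        raise ValueError("baseline_risk must be a probability in [0, 1]")
    if hazard_ratio <= 0.0:
        raise ValueError("hazard_ratio must be positive")
    surv_a = 1.0 - baseline_risk
    surv_b = surv_a ** hazard_ratio
    # Risk under A minus risk under B equals survival under B minus survival under A
    return surv_b - surv_a


# Hypothetical baseline 3-year risks; hazard ratio 0.8 as in the example trial
for risk in (0.05, 0.10, 0.25, 0.50):
    arr = absolute_risk_reduction(risk, 0.8)
    print(f"baseline 3y risk {risk:.2f} -> absolute risk reduction {arr:.4f}")
```

<p>The same relative effect translates into very different absolute benefits depending on baseline risk, which is why a constant hazard ratio can still support individualized treatment decisions.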
<p><hr><p>There is an alarming trend in advocates of learning from the EHR saying that statistical inference can be bypassed because (1) large numbers overcome all obstacles, (2) the EHR reflects actual clinical practice and patient populations, and (3) if you can predict outcomes for individual patients you can just find out for which treatment the predicted outcomes are optimal. Advocates of such "logic" often go on to say that RCTs are not helpful because the proportion of patients seen in practice that would qualify for the trial is very small with randomized patients being unrepresentative of the clinical population, because the trial only estimates the average treatment effect, because there must be HTE, and because treatment conditions are unrepresentative. Without HTE, precision medicine would have no basis. But evidence of substantial HTE has yet to be generally established and its existence in particular cases can be an artifact of the outcome scale used for the analysis. See <a href="http://www.fharrell.com/2017/01/randomized-clinical-trials-do-not-mimic.html">this</a> for more about the first two complaints about RCTs. Regarding (1), researchers too often forget that measurement or sample bias does not diminish no matter how large the sample size. Often, large sample sizes only provide precise estimates of the wrong quantity. <p>To illustrate this problem, suppose that one is interested in estimating and testing the treatment effect, B-A, of a certain blood pressure lowering medication (drug B) when compared to another drug (A). Assume a relevant subset of the EHR can be extracted in which patients started initial monotherapy at a defined date and systolic blood pressure (SBP) was measured routinely at useful follow-up intervals. Suppose that the standard deviation (SD) of SBP across patients is 8 mmHg regardless of treatment group. 
Suppose further that minor confounding by indication is present due to the failure to adjust for an unstated patient feature involved in the drug choice, which creates a systematic unidirectional bias of 2 mmHg in estimating the true B-A difference in mean SBP. If the EHR has m patients in each treatment group, the variance of the estimated mean difference is the sum of the variances of the two individual means or 64/m + 64/m = 128/m. But the variance only tells us about how close our sample estimate is to the incorrect value, B-A + 2 mmHg. It is the mean squared error, the variance plus the square of the bias or 128/m + 4, that relates to the probability that the estimate is close to the true treatment effect B-A. As m gets larger, the variance goes to zero indicating a stable estimate has been achieved. But the bias is constant so the mean squared error remains at 4 (root mean squared error = 2 mmHg). <p>Now consider an RCT that is designed not to estimate the mean SBP for A or the mean SBP for B but, as with all randomized trials, is designed to estimate the B-A difference (treatment effect). If the trial randomized m subjects per treatment group, the variance of the mean difference is 128/m and the mean squared error is also 128/m. The comparison of the square root of mean squared errors for an EHR study and an equal-sized RCT is depicted in the figure below. Here, we have even given the EHR study the benefit of the doubt in assuming that SBP is measured as accurately as would be the case in the RCT. This is unlikely, and so in reality the results presented below are optimistic for the performance of the EHR. 
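<p>The comparison can be reproduced with a short calculation. The sketch below uses only the quantities stated above (SD of 8 mmHg, bias of 2 mmHg, m patients per group); the function names <tt>rmse_ehr</tt> and <tt>rmse_rct</tt> and the sample sizes in the loop are chosen just for this illustration.

```python
import math

SD = 8.0    # SD of SBP across patients in mmHg (from the example)
BIAS = 2.0  # bias in the EHR estimate of B-A in mmHg (from the example)


def rmse_rct(m, sd=SD):
    """Root mean squared error of the B-A estimate with m patients per arm, no bias."""
    return math.sqrt(2.0 * sd ** 2 / m)


def rmse_ehr(m, sd=SD, bias=BIAS):
    """Root mean squared error when a fixed bias is added: sqrt(variance + bias^2)."""
    return math.sqrt(2.0 * sd ** 2 / m + bias ** 2)


# Illustrative per-group sample sizes
for m in (32, 100, 1000, 10000, 1000000):
    print(f"m={m:>8}  RCT: {rmse_rct(m):.3f} mmHg   EHR: {rmse_ehr(m):.3f} mmHg")
```

<p>With these inputs the RCT reaches a root mean squared error of 2 mmHg at m = 32 per group, which is the floor that the EHR estimate approaches but never goes below, no matter how large m becomes.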
<p><a href="https://2.bp.blogspot.com/-zSGx9OiXWH0/WUwehdXfzLI/AAAAAAAAJsY/uNFhbNvp0VwCgOI1AdQvDgpQU99iKjxCQCLcBGAs/s1600/mse.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-zSGx9OiXWH0/WUwehdXfzLI/AAAAAAAAJsY/uNFhbNvp0VwCgOI1AdQvDgpQU99iKjxCQCLcBGAs/s400/mse.png" width="500" height="375" data-original-width="600" data-original-height="450" /></a><p>EHR studies have the potential to provide far larger sample sizes than RCTs, but note that an RCT with a total sample size of 64 subjects is as informative as an EHR study with infinitely many patients. <b>Bigger is not better</b>. What if the SBP measurements from the EHR, not collected under any protocol, are less accurate than those collected under the RCT protocol? Let’s exemplify that by setting the SD for SBP to 10 mmHg for the EHR while leaving it as 8 mmHg for the RCT. For very large sample sizes, bias trumps variance so the breakeven point of 64 subjects remains, but for non-large EHRs the increased variability of measured SBPs harms the margin of error of the EHR estimate of mean SBP difference. <p> We have addressed estimation error for the treatment effect, but note that while an EHR-based statistical test for any treatment difference will have excellent power for large n, this comes at the expense of being far from preserving the type I error, which is essentially 1.0 due to the estimation bias causing the two-sample statistical test to be biased. <p> Interestingly, bias decreases the benefits achieved by larger sample sizes to the extent that, in contrast to an unbiased RCT, the mean squared error for an EHR of size 3000 in our example is nearly identical to what it would be with an infinite sample size. 
While this disregards the need for larger samples to target multiple treatments or distinct patient populations, it does suggest that overcoming the specific resource-intensive challenges associated with handling huge EHR samples may yield fewer advances in medical treatment than anticipated by some, if the effects of bias are considered. <p>There is a mantra heard in data science that you just need to "let the data speak." You can indeed learn much from observational data if quality and completeness of data are high (this is for another discussion; EHRs have major weaknesses in just these two aspects). But data frequently teach us things that are <a href="https://www.youtube.com/watch?v=TGGGDpb04Yc">just plain wrong</a>. This is due to a variety of reasons, including seeing trends and patterns that can be easily explained by pure noise. Moreover, treatment group comparisons in EHRs can reflect both the effects of treatment and the effects of specific prior patient conditions that led to the choice of treatment in the first place - conditions that may not be captured in the EHR. The latter problem is confounding by indication, and this can only be overcome by randomization, strong assumptions, or having high-quality data on all the potential confounders (patient baseline characteristics related to treatment selection and to outcome--rarely if ever possible). Many clinical researchers relying on EHRs do not take the time to even list the relevant patient characteristics before rationalizing that the EHR is adequate. To make matters worse, EHRs frequently do not provide accurate data on when patients started and stopped treatment. Furthermore, the availability of patient outcomes can depend on the very course of treatment and treatment response under study. For example, when a trial protocol is not in place, lab tests are not ordered at pre-specified times but because of a changing patient condition. 
If the EHR cannot provide a reliable estimate of the average treatment effect how could it provide reliable estimates of differential treatment benefit (HTE)? <p>Regarding the problem with signal vs. noise in "let the data speak", we envision a clinician watching someone playing a slot machine in Las Vegas. The clinician observes that a small jackpot was hit after 17 pulls of the lever, and now has a model for success: go to a random slot machine with enough money to make 17 pulls. Here the problem is not a biased sample but pure noise. <p>Observational data, when complete and accurate, can form the basis for accurate predictions. But what are predictions really good for? Generally speaking, predictions can be used to estimate likely patient outcomes given prevailing clinical practice and treatment choices, with typical adherence to treatment. Prediction is good for natural history studies and for counseling patients about their likely outcomes. What is needed for selecting optimum treatments is an answer to the "what if" question: what is the likely outcome of this patient were she given treatment A vs. were she given treatment B? This is inherently a problem of causal inference, which is why such questions are best answered using experimental designs, such as RCTs. When there is evidence that the complete, accurate observational data captured and eliminated confounding by indication, then and only then can observational data be a substitute for RCTs in making smart treatment choices. <p>What is a good global strategy for making optimum decisions for individual patients? Much more could be said, but for starters consider the following steps: <ul><li>Obtain the best covariate-adjusted estimate of relative treatment effect (e.g., odds ratio, hazard ratio) from an RCT. Check whether this estimate is constant or whether it depends on patient characteristics (i.e., whether heterogeneity of treatment effect exists on the relative scale). 
One possible strategy, using fully specified interaction tests adjusted for all main effects, is in <i><a href="http://biostat.mc.vanderbilt.edu/ClinStat">Biostatistics for Biomedical Research</a></i> in the Analysis of Covariance chapter.</li><li>Develop a predictive model from complete, accurate observational data, and perform strong internal validation using the bootstrap to verify absolute calibration accuracy. Use this model to handle risk magnification whereby absolute treatment benefits are greater for sicker patients in most cases.</li><li>Apply the relative treatment effects from the RCT, separately for treatment A and treatment B, to the estimated outcome risk from the observational data to obtain estimates of absolute treatment benefit (B vs. A) for the patient. See the first figure below which relates a hazard ratio to absolute improvement in survival probability.</li><li>Develop a nomogram using the RCT data to estimate absolute treatment benefit for an individual patient. See the second figure below whose bottom axis is the difference between two logistic regression models. 
(Both figures are from <a href="http://www.fharrell.com/p/blog-page.html">BBR</a> Chapter 13)</li><li>For more about such strategies, see Stephen Senn's <a href="https://www.slideshare.net/StephenSenn1/real-world-modified"> presentation</a>.</li></ul><p> <a href="https://3.bp.blogspot.com/-yAY1ywPyu-g/WTBoTF59TFI/AAAAAAAAJj4/SABvbBe0cLkDoa-u__cXOGk-O9_oMt-DwCLcB/s1600/hr.png" imageanchor="1" ><img border="0" src="https://3.bp.blogspot.com/-yAY1ywPyu-g/WTBoTF59TFI/AAAAAAAAJj4/SABvbBe0cLkDoa-u__cXOGk-O9_oMt-DwCLcB/s400/hr.png" width="400" height="400" data-original-width="1000" data-original-height="1000" /></a> <a href="https://2.bp.blogspot.com/-3FsCQSnmPOI/WTBoafP-7xI/AAAAAAAAJj8/y6keBXrKzt8LxpZvDm094MxR8KBPfEivACLcB/s1600/nom.png" imageanchor="1" ><img border="0" src="https://2.bp.blogspot.com/-3FsCQSnmPOI/WTBoafP-7xI/AAAAAAAAJj8/y6keBXrKzt8LxpZvDm094MxR8KBPfEivACLcB/s400/nom.png" width="400" height="400" data-original-width="1000" data-original-height="1000" /></a>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com6tag:blogger.com,1999:blog-5376139322696503442.post-36233537246790302852017-04-08T08:36:00.001-05:002017-08-25T12:03:09.795-05:00Statistical Errors in the Medical Literature<div style="text-align:center"><span style="font-size: 80%;"><em><a href="#dietqual">Updated</a> 2017-08-25 </em></span></div><p><ol><li><a href="#pval">Misinterpretation of P-values and Main Study Results</a></li><li><a href="#catg">Dichotomania</a></li><li><a href="#change">Problems With Change Scores</a></li><li><a href="#subgroup">Improper Subgrouping</a></li><li><a href="#serial">Serial Data and Response Trajectories</a></li></ol><hr />As Doug Altman famously wrote in his <a href="http://www.bmj.com/content/308/6924/283">Scandal of Poor Medical Research</a> in BMJ in 1994, the quality of how statistical principles and analysis methods are applied in medical research is quite poor. 
According to Doug and to many others such as <a href="http://blogs.bmj.com/bmj/2014/01/31/richard-smith-medical-research-still-a-scandal/">Richard Smith</a>, the problems have only gotten worse. The purpose of this blog article is to contain a running list of new papers in major medical journals that are statistically problematic, based on my random encounters with the literature.<br /><br />One of the most pervasive problems in the medical literature (and in other subject areas) is misuse and misinterpretation of p-values as detailed <a href="http://www.fharrell.com/2017/02/a-litany-of-problems-with-p-values.html">here</a>, and chief among these issues is perhaps the <a href="http://www.bmj.com/content/311/7003/485">absence of evidence is not evidence of absence</a> error written about so clearly by Altman and Bland. The following thought will likely rattle many biomedical researchers but I've concluded that most of the gross misinterpretation of large p-values by falsely inferring that a treatment is not effective is caused by (1) the investigators not being brave enough to conclude "We haven't learned anything from this study", i.e., they feel compelled to believe that their investments of time and money must be worth something, (2) journals accepting such papers without demanding a proper statistical interpretation in the conclusion. One example of proper wording would be "This study rules out, with 0.95 confidence, a reduction in the odds of death that is more than by a factor of 2." Ronald Fisher, when asked how to interpret a large p-value, said "Get more data." <p>Adoption of Bayesian methods would <a href="http://www.fharrell.com/2017/02/my-journey-from-frequentist-to-bayesian.html">solve many problems</a> including this one. 
Whether a p-value is small or large, a Bayesian can compute the posterior probability of similarity of outcomes of two treatments (e.g., Prob(0.85 < odds ratio < 1/0.85)), and the researcher will often find that this probability is not large enough to draw a conclusion of similarity. On the other hand, what if even under a skeptical prior distribution the Bayesian posterior probability of efficacy were 0.8 in a "negative" trial? Would you choose for yourself the standard therapy when it had a 0.2 chance of being better than the new drug? [Note: I am not talking here about regulatory decisions.] Imagine a Bayesian world where it is standard to report the results for the primary endpoint using language such as: <ul><li>The probability of any efficacy is 0.94 (so the probability of non-efficacy is 0.06).</li><li>The probability of efficacy greater than a factor of 1.2 is 0.78 (odds ratio < 1/1.2).</li><li>The probability of similarity to within a factor of 1.2 is 0.3.</li><li>The probability that the true odds ratio is between [0.6, 0.99] is 0.95 (credible interval; doesn't use the long-run tendency of confidence intervals to include the true value for 0.95 of confidence intervals computed).</li></ul>In a so-called "negative" trial we frequently see the phrase "treatment B was not significantly different from treatment A" without thinking out how little information that carries. Was the power really adequate? Is the author talking about an observed statistic (probably yes) or the true unknown treatment effect? Why should we care more about statistical significance than clinical significance? The phrase "was not significantly different" seems to be a way to avoid the real issues of interpretation of large p-values. 
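<p>Statements like those in the list above are simple to compute once a posterior distribution for the treatment effect is in hand. The sketch below is a simplification and not a full Bayesian analysis: it assumes the posterior for the log odds ratio is approximately normal, and the posterior mean and SD passed in at the bottom are hypothetical numbers chosen only to show the mechanics (they are not meant to reproduce the probabilities quoted in the list). The function name <tt>posterior_summaries</tt> is made up for this example.

```python
from math import exp, log
from statistics import NormalDist


def posterior_summaries(post_mean_log_or, post_sd_log_or, factor=1.2):
    """Posterior probabilities for an odds ratio (OR), where OR < 1 means efficacy.

    Assumes (as a simplification) a normal posterior for log(OR) with the
    given mean and SD.
    """
    post = NormalDist(mu=post_mean_log_or, sigma=post_sd_log_or)
    log_f = log(factor)
    lower = exp(post.inv_cdf(0.025))
    upper = exp(post.inv_cdf(0.975))
    return {
        # P(OR < 1): any efficacy
        "any_efficacy": post.cdf(0.0),
        # P(OR < 1/factor): efficacy greater than the given factor
        "efficacy_beyond_factor": post.cdf(-log_f),
        # P(1/factor < OR < factor): similarity to within the given factor
        "similarity": post.cdf(log_f) - post.cdf(-log_f),
        # 0.95 credible interval for the OR
        "credible_interval_95": (lower, upper),
    }


# Hypothetical posterior: mean log(0.75), SD 0.2 on the log odds ratio scale
for name, value in posterior_summaries(log(0.75), 0.2).items():
    print(name, value)
```

<p>Each quantity answers a direct question about the unknown odds ratio, and any number of such assertions can be evaluated from the same posterior.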
<p>Since my #1 area of study is statistical modeling, especially predictive modeling, I pay a lot of attention to model development and model validation as done in the medical literature, and I routinely encounter published papers where the authors do not have a basic understanding of the statistical principles involved. This seems to be especially true when a statistician is not among the paper's authors. I'll be commenting on papers in which I encounter statistical modeling, validation, or interpretation problems. <p><h3><a name="pval">Misinterpretation of P-values and of Main Study Results</a></h3>One of the most problematic examples I've seen is in the March 2017 paper <a href="http://www.nejm.org/doi/full/10.1056/nejmoa1616218#t=article">Levosimendan in Patients with Left Ventricular Dysfunction Undergoing Cardiac Surgery</a> by Rajendra Mehta in the New England Journal of Medicine. The study was designed to detect a miracle - a 35% relative odds reduction with drug compared to placebo, and used a power requirement of only 0.8 (type II error a whopping 0.2). [The study also used some questionable alpha-spending that Bayesians would find quite odd.] For the primary endpoint, the adjusted odds ratio was 1.00 with 0.99 confidence interval [0.66, 1.54] and p=0.98. Yet the authors concluded "Levosimendan was not associated with a rate of the composite of death, renal-replacement therapy, perioperative myocardial infarction, or use of a mechanical cardiac assist device that was lower than the rate with placebo among high-risk patients undergoing cardiac surgery with the use of cardiopulmonary bypass." Their own data are consistent with a 34% reduction (as well as a 54% increase)! Almost nothing was learned from this underpowered study. It may have been too disconcerting for the authors and the journal editor to have written "We were only able to rule out a massive benefit of drug." 
[Note: two treatments can have agreement in outcome probabilities by chance just as they can have differences by chance.] It would be interesting to see the Bayesian posterior probability that the true unknown odds ratio is in [0.85, 1/0.85]. <p>The primary endpoint is the union of death, dialysis, MI, or use of a cardiac assist device. This counts these four endpoints as equally bad. An ordinal response variable would have yielded more statistical information/precision and perhaps increased power. And instead of dealing with multiplicity issues and alpha-spending, the multiple endpoints could have been dealt with more elegantly with a Bayesian analysis. For example, one could easily compute the joint probability that the odds ratio for the primary endpoint is less than 0.8 and the odds ratio for the secondary endpoint is less than 1 [the secondary endpoint was death or assist device and is harder to demonstrate because of its lower incidence, and is perhaps more of a "hard endpoint"]. In the Bayesian world of forward directly relevant probabilities there is no need to consider multiplicity. There is only a need to state the assertions for which one wants to compute current probabilities. <p>The paper also contains inappropriate assessments of interactions with treatment using subgroup analysis with arbitrary cutpoints on continuous baseline variables and failure to adjust for other main effects when doing the subgroup analysis. <p>This paper had a fine statistician as a co-author. I can only conclude that the pressure to avoid disappointment with a conclusion of spending a lot of money with little to show for it was in play. <p>Why was such an underpowered study launched? Why do researchers attempt "hail Mary passes"? Is a study that is likely to be futile fully ethical? Do medical journals allow this to happen because of some vested interest?<br /><h4>Similar Examples</h4>Perhaps the above example is no worse than many. 
Examples of "absence of evidence" misinterpretations abound. Consider the <a href="http://jamanetwork.com/journals/jama/article-abstract/2612911">JAMA</a> paper by Kawazoe et al published 2017-04-04. They concluded that "Mortality at 28 days was not significantly different in the dexmedetomidine group vs the control group (19 patients [22.8%] vs 28 patients [30.8%]; hazard ratio, 0.69; 95% CI, 0.38-1.22; P = .20)." The point estimate was a reduction in hazard of death by 31% and the data are consistent with the reduction being as large as 62%! <p>Or look at <a href="http://jamanetwork.com/journals/jama/article-abstract/2613159">this</a> 2017-03-21 JAMA article in which the authors concluded "Among healthy postmenopausal older women with a mean baseline serum 25-hydroxyvitamin D level of 32.8 ng/mL, supplementation with vitamin D<sub>3</sub> and calcium compared with placebo did not result in a significantly lower risk of all-type cancer at 4 years." even though the observed hazard ratio was 0.7, with lower confidence limit of a whopping 53% reduction in the incidence of cancer. And the 0.7 was an <i>unadjusted</i> hazard ratio; the hazard ratio could well have been more impressive had covariate adjustment been used to account for outcome heterogeneity within each treatment arm. <p><h3><a name="catg">Dichotomania</a></h3>Dichotomania, as discussed by <a href="https://www.researchgate.net/profile/Stephen_Senn/publication/221689734_Dichotomania_an_obsessive_compulsive_disorder_that_is_badly_affecting_the_quality_of_analysis_of_pharmaceutical_trials/links/0fcfd5109734cb6268000000.pdf?origin=publication_list">Stephen Senn</a>, is a very prevalent problem in medical and epidemiologic research. Categorization of continuous variables for analysis is inefficient at best and misleading at worst. 
This JAMA paper by <a href="http://jamanetwork.com/journals/jama/article-abstract/2620089">VISION study investigators</a> "Association of Postoperative High-Sensitivity Troponin Levels With Myocardial Injury and 30-Day Mortality Among Patients Undergoing Noncardiac Surgery" is an excellent example of bad statistical practice that limits the amount of information provided by the study. The authors categorized high-sensitivity troponin T levels measured post-op and related these to the incidence of death. They used four intervals of troponin, and there is important heterogeneity of patients within these intervals. This is especially true for the last interval (> 1000 ng/L). Mortality may be much higher for troponin values that are much larger than 1000. The relationship should have been analyzed with a continuous analysis, e.g., logistic regression with a regression spline for troponin, nonparametric smoother, etc. The final result could be presented in a simple line graph with confidence bands. <p>An example of dichotomania that may not be surpassed for some time is <a href="http://qualitysafety.bmj.com/content/early/2017/04/17/bmjqs-2016-006239">Simplification of the HOSPITAL Score for Predicting 30-day Readmissions</a> by Carole E Aubert, et al in <i>BMJ Quality and Safety</i> 2017-04-17. The authors arbitrarily dichotomized several important predictors, resulting in a major loss of information, then dichotomized their resulting predictive score, sacrificing much of what information remained. The authors failed to grasp probabilities, resulting in risk of 30-day readmission of "unlikely" and "likely". The categorization of predictor variables leaves demonstrable outcome heterogeneity within the intervals of predictor values. Then taking an already oversimplified predictive score and dichotomizing it is essentially saying to the reader "We don't like the integer score we just went to the trouble to develop." 
I now have serious doubts about the thoroughness of reviews at <i>BMJ Quality and Safety</i>. <p><a name="alcohol">A very high-profile paper</a> was published in BMJ on 2017-06-06: <a href="http://www.bmj.com/content/357/bmj.j2353">Moderate alcohol consumption as risk factor for adverse brain outcomes and cognitive decline: longitudinal cohort study</a> by Anya Topiwala et al. The authors had a golden opportunity to estimate the dose-response relationship between amount of alcohol consumed and quantitative brain changes. Instead the authors squandered the data by doing analyses that either assumed that responses are linear in alcohol consumption or worse, by splitting consumption into 6 heterogeneous intervals when in fact consumption was shown in their Figure 3 to have a nice continuous distribution. How much more informative (and statistically powerful) it would have been to fit a quadratic or a restricted cubic spline function to consumption to estimate the continuous dose-response curve. <p><a name="dbpcut">The NEJM</a> keeps giving us great teaching examples with its 2017-08-03 edition. In <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1704154">Angiotensin II for the treatment of vasodilatory shock</a> by Ashish Khanna et al, the authors constructed a bizarre response variable: "The primary end point was a response with respect to mean arterial pressure at hour 3 after the start of infusion, with response defined as an increase from baseline of at least 10 mm Hg or an increase to at least 75 mm Hg, without an increase in the dose of background vasopressors." This form of dichotomania has been discredited by <a href="http://www.citeulike.org/user/harrelfe/article/13265588">Stephen Senn</a> who provided a similar example in which he decoded the response function to show that the lucky patient is one (in the NEJM case) who has a starting blood pressure of 74mmHg. 
His example is below: <p><a href="https://1.bp.blogspot.com/-YKXz4zVec7I/WYOabfTAkQI/AAAAAAAAJ4Q/sQzai2o8WosaaIfiYDxeIye4Sl3a7z67QCLcBGAs/s1600/dichotomaniaFig3.png" imageanchor="1" ><img border="0" src="https://1.bp.blogspot.com/-YKXz4zVec7I/WYOabfTAkQI/AAAAAAAAJ4Q/sQzai2o8WosaaIfiYDxeIye4Sl3a7z67QCLcBGAs/s320/dichotomaniaFig3.png" width="320" height="246" data-original-width="520" data-original-height="400" /></a><p>When a clinical trial's response variable is one that is arbitrary, loses information and power, is difficult to interpret, and means different things for different patients, expect trouble. <h3><a name="change">Change from Baseline</a></h3>Many authors and pharmaceutical clinical trialists make the mistake of analyzing change from baseline instead of making the raw follow-up measurements the primary outcomes, covariate-adjusted for baseline. To compute change scores requires many assumptions to hold, e.g.: <ol><li>the variable is not used as an inclusion/exclusion criterion for the study, otherwise regression to the mean will be strong</li><li>if the variable is used to select patients for the study, a second post-enrollment baseline is measured and this baseline is the one used for all subsequent analysis</li><li>the post value must be linearly related to the pre value</li><li>the variable must be perfectly transformed so that subtraction "works" and the result is not baseline-dependent</li><li>the variable must not have floor and ceiling effects</li><li>the variable must have a smooth distribution</li><li>the slope of the pre value vs. the follow-up measurement must be close to 1.0 when both variables are properly transformed (using the same transformation on both)</li></ol>Details about problems with analyzing change may be found <a href="http://biostat.mc.vanderbilt.edu/MeasureChange">here</a>. A general problem with the approach is that when Y is ordinal but not interval-scaled, differences in Y may no longer be ordinal. 
So analysis of change loses the opportunity to do a robust, powerful analysis using a covariate-adjusted ordinal response model such as the proportional odds or proportional hazards model. Such ordinal response models do not require one to be correct in how to transform Y. <p>Regarding 3. above, if pre is not linearly related to post, there is no transformation that can make a change score work. <p>Regarding 7. above, often the baseline is not as relevant as thought and the slope will be less than 1. When the treatment can cure every patient, the slope will be zero. Sometimes the relationship between baseline and follow-up Y is not even linear, as in one example I've seen based on the Hamilton D depression scale. <p>The purpose of a parallel-group randomized clinical trial is to compare the parallel groups, not to compare a patient with herself at baseline. The central question is for two patients with the same pre measurement value of x, one given treatment A and the other treatment B, will the patients tend to have different post-treatment values? This is exactly what analysis of covariance assesses. Within-patient change is affected strongly by regression to the mean and measurement error. When the baseline value is one of the patient inclusion/exclusion criteria, the only meaningful change score requires one to have a second baseline measurement post patient qualification to cancel out much of the regression to the mean effect. It is the second baseline that would be subtracted from the follow-up measurement. <p>The savvy researcher knows that analysis of covariance is required to "rescue" a change score analysis. This effectively cancels out the change score and gives the right answer even if the slope of post on pre is not 1.0. But this works only in the linear model case, and it can be confusing to have the pre variable on both the left and right hand sides of the statistical model. 
And if Y is ordinal but not interval-scaled, the difference in two ordinal variables is no longer even ordinal. Think of how meaningless differences from baseline in ordinal pain categories are. A <b>major problem</b> in the use of change score summaries, even when a correct analysis of covariance has been done, is that many papers and drug product labels still quote change scores out of context. <p>Patient-reported outcome scales are particularly problematic. An article published 2017-05-07 in JAMA, doi:10.1001/jama.2017.5103 like many other articles makes the error of trusting change from baseline as an appropriate analysis variable. Mean change from baseline may not apply to anyone in the trial. Consider a 5-point ordinal pain scale with values Y=1,2,3,4,5. Patients starting with no pain (Y=1) cannot improve, so their mean change must be zero. Patients starting at Y=5 have the most opportunity to improve, so their mean change will be large. A treatment that improves pain scores by an average of one point may average a two point improvement for patients for whom any improvement is possible. Stating mean changes out of context of the baseline state can be meaningless. <p>The NEJM paper <a name="endom" href="http://www.nejm.org/doi/full/10.1056/NEJMoa1700089">Treatment of Endometriosis-Associated Pain with Elagolix, an Oral GnRH Antagonist</a> by Hugh Taylor et al is based on a disastrous set of analyses, combining all the problems above. The authors computed change from baseline on variables that do not have the correct properties for subtraction, engaged in dichotomania by doing responder analysis, and in addition used last observation carried forward to handle dropouts. A proper analysis would have been a longitudinal analysis using all available data that avoided imputation of post-dropout values and used raw measurements as the responses. 
Most importantly, the twin clinical trials randomized 872 women, and had proper analyses been done the required sample size to achieve the same power would have been far less. Besides the ethical issue of randomizing an unnecessarily large number of women to inferior treatment, the approach used by the investigators maximized the cost of these positive trials. <p>The NEJM paper <a name="glucpct" href="http://www.nejm.org/doi/full/10.1056/NEJMoa1703501">Oral Glucocorticoid–Sparing Effect of Benralizumab in Severe Asthma</a> by Parameswaran Nair et al not only takes the problematic approach of using change scores from baseline in a parallel group design but they used percent change from baseline as the raw data in the analysis. This is an asymmetric measure for which arithmetic doesn't work. For example, suppose that one patient increases from 1 to 2 and another decreases from 2 to 1. The corresponding percent changes are 100% and -50%. The overall summary should be 0% change, not +25% as found by taking the simple average. Doing arithmetic on percent change can essentially involve adding ratios; ratios that are not proportions are never added; they are multiplied. What was needed was an analysis of covariance of raw oral glucocorticoid dose values adjusted for baseline after taking an appropriate transformation of dose, or using a more robust transformation-invariant ordinal semi-parametric model on the raw follow-up doses (e.g., proportional odds model). <p>In <a name="dravet" href="http://www.nejm.org/doi/full/10.1056/NEJMoa1611618">Trial of Cannabidiol for Drug-Resistant Seizures in the Dravet Syndrome</a> in NEJM 2017-05-25, Orrin Devinsky et al take seizure frequency, which might have a nice distribution such as the Poisson, and compute its change from baseline, which is likely to have a hard-to-model distribution. Once again, authors failed to recognize that the purpose of a parallel group design is to compare the parallel groups. 
Then the authors engaged in improper subtraction, improper use of percent change, dichotomania, and loss of statistical power simultaneously: "The percentage of patients who had at least a 50% reduction in convulsive-seizure frequency was 43% with cannabidiol and 27% with placebo (odds ratio, 2.00; 95% CI, 0.93 to 4.30; P=0.08)." The authors went on to analyze the change in a discrete ordinal scale, where change (subtraction) cannot have a meaning independent of the starting point at baseline. <p><a name="trop">Troponins</a> (T) are myocardial proteins that are released when the heart is damaged. A high-sensitivity T assay is a high-information cardiac biomarker used to diagnose myocardial infarction and to assess prognosis. I have been hoping to find a well-designed study with standardized serially measured T that is optimally analyzed, to provide answers to the following questions: <ol> <li>What is the shape of the relationship between the latest T measurement and time until a clinical endpoint?</li> <li>How does one use a continuous T to estimate risk?</li> <li>If T were measured previously, does the previous measurement add any predictive information to the current T?</li> <li>If both the earlier and current T measurement are needed to predict outcome, how should they be combined? Is what's important the difference of the two? Is it the ratio? 
Is it the difference in square roots of T?</li> <li>Is the 99<sup>th</sup> percentile of T for normal subjects useful as a prognostic threshold?</li></ol>The 2017-05-16 <i>Circulation</i> paper <a href="http://circ.ahajournals.org/content/135/20/1911">Serial Measurement of High-Sensitivity Troponin I and Cardiovascular Outcomes in Patients With Type 2 Diabetes Mellitus in the EXAMINE Trial</a> by Matthew Cavender et al was based on a well-designed cardiovascular safety study of diabetes in which uniformly measured high-sensitivity troponin I measurements were made at baseline and six months after randomization to the diabetes drug Alogliptin. [Note: I was on the DSMB for this study] The authors nicely envisioned a landmark analysis based on six-month survivors. But instead of providing answers to the questions above, the authors engaged in dichotomania and never checked whether changes in T or changes in log T possessed the appropriate properties to be used as a valid change score, i.e., they did not plot change in T vs. baseline T or log T ratio vs. baseline T and demonstrate a flat line relationship. Their statistical analysis used statistical methods from 50 years ago, even doing the notorious "test for trend" that tests for a linear correlation between an outcome and an integer category interval number. The authors seem to be unaware of the many flexible tools developed (especially starting in the mid 1980s) for statistical modeling that would answer the questions posed above. <p>Cavender et al stratified T into <1.9 ng/L, 1.9-<10 ng/L, 10-<26 ng/L, and ≥26 ng/L. Fully 1/2 of the patients were in the second interval. Except for the first interval (T below the lower detection limit) the groups are heterogeneous with regard to outcome risks. And there are no data from this study or previous studies that validate these cutpoints. 
To validate them, the relationship between T and outcome risk would have to be shown to be discontinuous at the cutpoints, and flat between them. <p>From their paper we still don't know how to use T continuously, and we don't know whether baseline T is informative once a clinician has obtained an updated T. The inclusion of a 3-D block diagram in the supplemental material is symptomatic of the data presentation problems in this paper. <p> It's not as though T hasn't been analyzed correctly. In a 1996 <a href="http://www.nejm.org/doi/full/10.1056/NEJM199610313351801">NEJM paper</a>, Ohman et al used a nonparametric smoother to estimate the continuous relationship between T and 30-day risk. Instead, Cavender, et al created arbitrary heterogeneous intervals of both baseline and 6m T, then created various arbitrary ways to look at change from baseline and its relationship to risk. <p> An analysis that would have answered my questions would have been to <ol> <li>Fit a standard Cox proportional hazards time-to-event model with the usual baseline characteristics</li> <li>Add to this model a tensor spline in the baseline and 6m T levels, i.e., a smooth 3-D relationship between baseline T, 6m T, and log hazard, allowing for interaction, and restricting the 3-D surface to be smooth. See for example <a href="http://www.fharrell.com/p/blog-page.html">BBR Figure 4.23</a>. One can do this by using restricted cubic splines in both T's and by computing cross-products of these terms for the interactions. 
By fitting a flexible smooth surface, the data would be able to speak for themselves without imposing linearity or additivity assumptions and without assuming that change or change in log T is how these variables combine.</li> <li>Do a formal test of whether baseline T (as either a main effect or as an effect modifier of the 6m T effect, i.e., interaction effect) is associated with outcome when controlling for 6m T and ordinary baseline variables</li> <li>Quantify the prognostic value added by baseline T by computing the fraction of likelihood ratio chi-square due to both T's combined that is explained by baseline T. Do likewise to show the added value of 6m T. Details about these methods may be found in <a href="http://biostat.mc.vanderbilt.edu/rms">Regression Modeling Strategies</a>, <i>2<sup>nd</sup> edition</i></li> </ol> Without proper analyses of T as a continuous variable, the reader is left with confusion as to how to really use T in practice, and is given no insight into whether changes are relevant or the baseline T can be ignored when a later T is obtained. In all the clinical outcome studies I've analyzed (including repeated LV ejection fractions and serum creatinines), the latest measurement has been what really mattered, and it hasn't mattered very much how the patient got there. <p> As long as continuous markers are categorized, clinicians are going to get suboptimal risk prediction and are going to find that more markers need to be added to the model to recover the information lost by categorizing the original markers. They will also continue to be surprised that other researchers find different "cutpoints", not realizing that when things don't exist, people will forever argue about their manifestations. 
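<p>The spline-plus-cross-products step in the proposed analysis can be sketched in a few lines. What follows is only an illustration, not the code behind any analysis mentioned here: it builds the truncated-power form of a restricted cubic spline basis (linear in both tails) for one marker value, then forms all cross-products between the baseline and 6m bases. The function names <tt>rcs_basis</tt> and <tt>tensor_terms</tt> and the knot locations are hypothetical, and the resulting columns would still have to be handed to a Cox model fitter of one's choice.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a single value x.

    Returns [x, nonlinear_1, ..., nonlinear_{k-2}] for k knots, using the
    truncated-power form whose fitted curve is linear beyond the outer knots.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def cube_plus(u):
        # (u)_+^3: the cube of u when u is positive, otherwise zero
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # puts nonlinear terms on roughly the scale of x
    gap = t[-1] - t[-2]           # distance between the last two knots
    basis = [float(x)]
    for j in range(k - 2):
        term = (cube_plus(x - t[j])
                - cube_plus(x - t[-2]) * (t[-1] - t[j]) / gap
                + cube_plus(x - t[-1]) * (t[-2] - t[j]) / gap)
        basis.append(term / scale)
    return basis


def tensor_terms(t_base, t_6m, knots_base, knots_6m):
    """Main-effect spline columns for both markers plus every cross-product."""
    a = rcs_basis(t_base, knots_base)
    b = rcs_basis(t_6m, knots_6m)
    cross = [ai * bj for ai in a for bj in b]
    return a + b + cross


# Hypothetical knot locations (ng/L), chosen only for illustration
knots = [2.0, 6.0, 15.0, 40.0]
row = tensor_terms(5.0, 12.0, knots, knots)
print(len(row))   # 3 + 3 main-effect columns and 9 cross-products
```

<p>With 4 knots per marker this gives 15 columns for the two T measurements. A more parsimonious variant would drop the products of two nonlinear terms; which version is appropriate depends on the sample size and number of events.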
<h3><a name="subgroup">Improper Subgrouping</a></h3>The JAMA Internal Medicine Paper <a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2628971">Effect of Statin Treatment vs Usual Care on Primary Cardiovascular Prevention Among Older Adults</a> by Benjamin Han et al makes the classic statistical error of attempting to learn about differences in treatment effectiveness by subgrouping rather than by correctly modeling interactions. They compounded the error by not adjusting for covariates when comparing treatments in the subgroups, and even worse, by subgrouping on a variable for which grouping is ill-defined and information-losing: age. They used age intervals of 65-74 and 75+. A proper analysis would have been, for example, modeling age as a smooth nonlinear function (e.g., using a restricted cubic spline) and interacting this function with treatment to allow for a high-resolution, non-arbitrary analysis that allows for nonlinear interaction. Results could be displayed by showing the estimated treatment hazard ratio and confidence bands (y-axis) vs. continuous age (x-axis). The authors' analysis avoids the question of a dose-response relationship between age and treatment effect. A full strategy for interaction modeling for assessing heterogeneity of treatment effect (AKA <i>precision medicine</i>) may be found in the analysis of covariance chapter in <a href="http://biostat.mc.vanderbilt.edu/ClinStat">Biostatistics for Biomedical Research</a>. <p>To make matters worse, the above paper included patients with a sharp cutoff of 65 years of age as the lower limit. How much more informative it would have been to have a linearly increasing (in age) enrollment function that reaches a probability of 1.0 at 65y. Assuming that something magic happens at age 65 with regard to cholesterol reduction is undoubtedly a mistake. 
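<p>To make the idea of a graded enrollment function concrete, here is a minimal sketch. Only the point at which the probability reaches 1.0 (65y) comes from the text above; the age at which the ramp starts (<tt>ramp_start</tt>, set to 50 here) and the function name are assumptions made purely for illustration.

```python
def enrollment_probability(age, ramp_start=50.0, full_age=65.0):
    """Probability of enrolling a candidate of a given age.

    Rises linearly from 0 at ramp_start (a hypothetical value) to 1 at
    full_age, replacing a sharp inclusion cutoff at full_age.
    """
    if age <= ramp_start:
        return 0.0
    if age >= full_age:
        return 1.0
    return (age - ramp_start) / (full_age - ramp_start)


for age in (45, 55, 60, 65, 80):
    print(age, round(enrollment_probability(age), 3))
```

<p>Under such a scheme some younger patients are enrolled, so the treatment effect can be estimated as a smooth function of age on both sides of 65 instead of being assumed to switch on at the cutoff.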
<h3><a name="serial">Serial Data and Response Trajectories</a></h3>Serial data (aka longitudinal data) with multiple follow-up assessments per patient presents special challenges and opportunities. My preferred analysis strategy uses full likelihood or Bayesian continuous-time analysis, using generalized least squares or mixed effects models. This allows each patient to have different measurement times, analysis of the data using actual days since randomization instead of clinic visit number, and non-random dropouts as long as the missing data are missing at random. Missing at random here means that given the baseline variables and the previous follow-up measurements the current measurement is missing completely at random. Imputation is not needed. <p>In the <i>Hypertension</i> July 2017 article <a href="https://doi.org/10.1161/HYPERTENSIONAHA.117.09221">Heterogeneity in Early Responses in ALLHAT (Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial)</a> by Sanket Dhruva et al, the authors did advanced statistical analysis that is a level above the papers discussed elsewhere in this article. However, their claim of avoiding dichotomania is unfounded. The authors were primarily interested in the relationship between blood pressures measured at randomization, 1m, 3m, 6m with post-6m outcomes, and they correctly envisioned the analysis as a landmark analysis of patients who were event-free at 6m. They did a careful cluster analysis of blood pressure trajectories from 0-6m. But their chosen method assumes that the variety of trajectories falls into two simple homogeneous trajectory classes (immediate responders and all others). Trajectories of continuous measurements, like the continuous measurements themselves, rarely fall into discrete categories with shape and level homogeneity within the categories. The analyses would in my opinion have been better, and would have been simpler, had everything been considered on a continuum. 
<p>With landmark analysis we now have 4 baseline measurements: the new baseline (previously called the 6m blood pressure) and 3 historical measurements. One can use these as 4 covariates to predict time until clinical post-6m outcome using a standard time-to-event model such as the Cox proportional hazards model. In doing so, we are estimating the prognosis associated with every possible trajectory and we can solve for the trajectory that yields the best outcome. We can also do a formal statistical test for whether the trajectories can be summarized more simply than with a 4-dimensional construct, e.g., whether the final blood pressure contains all the prognostic information. Besides specifying the model with baseline covariates (in addition to other original baseline covariates), one also has the option of creating a tall and thin dataset with 4 records per patient (if correlations are accounted for, e.g., cluster sandwich or cluster bootstrap covariance estimates) and modeling outcome using updated covariates and possible interactions with time to allow for time-varying blood pressure effects. <p>A logistic regression trick described in my book <i>Regression Modeling Strategies</i> comes in handy for modeling how baseline characteristics such as sex, age, or randomized treatment relate to the trajectories. Here one predicts the baseline variable of interest using the four blood pressures. By studying the 4 regression coefficients one can see exactly how the trajectories differ between patients grouped by the baseline variable. This includes studying differences in trajectories by treatment with no dichotomization. For example, if there is a significant association (using a composite (chunk) test) between treatment and any of the 4 blood pressures in the logistic model predicting treatment, that implies that the reverse is true: one or more of the blood pressures is associated with treatment. Suppose for example that a 4 d.f. 
test demonstrates some association, the 1 d.f. for the first blood pressure is very significant, and the 3 d.f. test for the last 3 blood pressures is not. This would be interpreted as the treatment having an early effect that wears off shortly thereafter. [For this particular study, with the first measurement being made pre-randomization, such a result would indicate failure of randomization and no blood-pressure response to treatment of any kind.] Were the 4 regression coefficients to be negative and in descending order, this would indicate a progressive reduction in blood pressure due to treatment. <p>Returning to the originally stated preferred analysis when blood pressure is the outcome of interest (and not time to clinical events), one can use generalized least squares to predict the longitudinal blood pressure trends from treatment. This will be more efficient and also allows one to adjust for baseline variables other than treatment. It would probably be best to make the original baseline blood pressure a baseline variable and to have 3 serial measurements in the longitudinal model. Time would usually be modeled continuously (e.g., using a restricted cubic spline function). But in the Dhruva article the measurements were made at a small number of discrete times, so time could be considered a categorical variable with 3 levels. <p><a name="dietqual">I have had misgivings</a> for many years about the quality of statistical methods used by the Channing Lab at Harvard, as well as misgivings about the quality of nutritional epidemiology research in general. My misgivings were again confirmed in the 2017-07-13 NEJM publication <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1613502">Association of Changes in Diet Quality with Total and Cause-Specific Mortality</a> by Mercedes Sotos-Prieto et al. 
There are the usual concerns about confounding and possible alternate explanations, which the authors did not fully deal with (and why did the authors not include an analysis that characterized which types of subjects tended to have changes in their dietary quality?). But this paper manages to combine dichotomania with probably improper change score analysis and hard-to-interpret results. It started off as a nicely conceptualized landmark analysis in which dietary quality scores were measured during both an 8-year and a 16-year period, and these scores were related to total and cause-specific mortality following those landmark periods. But then things went seriously wrong. The authors computed change in diet scores from the start to the end of the qualification period, did not validate that these are proper change scores (see above for more about that), and engaged in percentiling as if the number of neighbors with worse diets than you is what predicts your mortality rather than the absolute quality of your own diet. They then grouped the changes into quintile groups without justification, and examined change quintile score group effects in Cox time-to-event models. It is telling that the baseline dietary scores varied greatly over the change quintiles. The authors emphasized the 20-percentile increase in each score when interpreting results. What does that mean? How is it related to absolute diet quality scores? <p>The high quality dataset available to the authors could have been used to answer real questions of interest using statistical analyses that did not have hidden assumptions. From their analyses we have no idea of how the subjects' diet trajectories affected mortality, or indeed whether the change in diet quality was as important as the most recent diet quality for the subject, ignoring how the subject arrived at that point at the end of the qualification period. What would be an informative analysis? 
Start with the simpler one: use a smooth tensor spline interaction surface to estimate relative log hazard of mortality, and construct a 3-D plot with initial diet quality on the x-axis, final (landmark) diet quality on the y-axis, and relative log hazard on the z-axis. Then the more in-depth modeling analysis can be done in which one uses multiple measures of diet quality over time and relates the trajectory (its shape, average level, etc.) to hazard of death. Suppose that absolute diet quality was measured at four baseline points. These four variables could be related to outcome and one could solve for the trajectory that was associated with the lowest mortality. For a study that is almost wholly statistical, it is a shame that modern statistical methods appeared to not even be considered. And for heaven's sake <b>analyze the raw diet scales and do not percentile them</b>. Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com48tag:blogger.com,1999:blog-5376139322696503442.post-53483888705676718762017-03-16T11:53:00.004-05:002017-03-27T07:32:40.018-05:00Subjective Ranking of Quality of Research by Subject Matter AreaWhile being engaged in biomedical research for a few decades and watching reproducibility of research as a whole, I've developed my own ranking of reliability/quality/usefulness of research across several subject matter areas. This list is far from complete. Let's start with a subjective list of what I perceive as the areas in which published research is least likely to be both true and useful. The following list is ordered in ascending order of quality, with the most problematic area listed first. You'll notice that there is a vast number of areas not listed for which I have minimal experience.<br /><span style="color: blue;"><br /></span><span style="color: blue;">Some excellent research is done in all subject areas. 
This list is based on my perception of the <b>proportion</b> of publications in the indicated area that are rigorously scientific, reproducible, and useful.</span><br /><h4>Subject Areas With Least Reliable/Reproducible/Useful Research</h4><div><ol><li>any area where there is no pre-specified statistical analysis plan and the analysis can change on the fly when initial results are disappointing</li><li>behavioral psychology</li><li>studies of corporations to find characteristics of "winners"; regression to the mean kicks in making predictions useless for changing your company</li><li>animal experiments on fewer than 30 animals</li><li><span style="color: blue;">discovery genetics not making use of biology while doing large-scale variant/gene screening</span></li><li>nutritional epidemiology</li><li>electronic health record research reaching clinical conclusions without understanding confounding by indication and other limitations of data</li><li>pre-post studies with no randomization</li><li>non-nutritional epidemiology <span style="color: blue;">not having a fully pre-specified statistical analysis plan</span> <span style="color: blue;">[few epidemiology papers use state-of-the-art statistical methods and have a sensitivity analysis related to unmeasured confounders]</span></li><li>prediction studies based on dirty and inadequate data</li><li>personalized medicine</li><li>biomarkers</li><li>observational treatment comparisons that do not qualify for the second list (below)</li><li>small adaptive dose-finding cancer trials (3+3 etc.)</li></ol><h4>Subject Areas With Most Reliable/Reproducible/Useful Research</h4><div>The most reliable and useful research areas are listed first. 
All of the following are assumed to (1) have a prospective pre-specified statistical analysis plan and (2) purposeful prospective quality-controlled data acquisition (yes this applies to high-quality non-randomized observational research).</div><ol><li>randomized crossover studies</li><li>multi-center randomized experiments</li><li>single-center randomized experiments <span style="color: blue;">with non-overly-optimistic sample sizes</span></li><li>adaptive randomized clinical trials with large sample sizes</li><li><span style="color: blue;">physics</span></li><li><span style="color: blue;">pharmaceutical industry research that is overseen by FDA</span></li><li>cardiovascular research</li><li>observational research <span style="color: blue;">[however only a very small minority of observational research projects have a prospective analysis plan and high enough data quality to qualify for this list]</span></li></ol><div><br /></div><h4><span style="color: blue;">Some Suggested Remedies</span></h4></div><div><span style="color: blue;">Peer review of research grants and manuscripts is done primarily by experts in the subject matter area under study. Most journal editors and grant reviewers are not expert in biostatistics. Every grant application and submitted manuscript should undergo rigorous methodologic peer review by methodologic experts such as biostatisticians and epidemiologists. All data analyses should be driven by a prospective statistical analysis plan, and the entire self-contained data manipulation and analysis code should be submitted to journals so that potential reproducibility and adherence to the statistical analysis plan can be confirmed. 
Readers should have access to the data in most cases and should be able to reproduce all study findings using the authors' code, plus run their own analyses on the authors' data to check robustness of findings.</span><br /><span style="color: blue;"><br /></span><span style="color: blue;">Medical journals are reluctant to (1) publish critical letters to the editor and (2) retract papers. This has to change.</span><br /><span style="color: blue;"><br /></span><span style="color: blue;">In academia, too much credit is still given to the quantity of publications and not to their quality and reproducibility. This too must change. The pharmaceutical industry has FDA to validate their research. The NIH does not serve this role for academia.</span><br /><span style="color: blue;"><br /></span><span style="color: blue;">Rochelle Tractenberg, Chair of the American Statistical Association Committee on Professional Ethics and a biostatistician at Georgetown University, said in a 2017-02-22 interview with <i>The Australian</i> that many questionable studies would not have been published had formal statistical reviews been done. When she reviews a paper she starts with the premise that the statistical analysis was incorrectly executed. She stated that "Bad statistics is bad science."</span><br /><span style="color: blue;"><br /></span></div><div></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com14tag:blogger.com,1999:blog-5376139322696503442.post-27641736403136926482017-03-01T07:30:00.002-06:002017-07-09T14:13:20.136-05:00Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring RulesIn <a href="http://www.fharrell.com/2017/01/classification-vs-prediction.html">this article</a> I discussed the many advantages of probability estimation over classification. Here I discuss a particular problem related to classification, namely the harm done by using improper accuracy scoring rules. 
Accuracy scores are used to drive feature selection, parameter estimation, and for measuring predictive performance on models derived using any optimization algorithm. For this discussion let Y denote a no/yes false/true 0/1 event being predicted, and let Y=0 denote a non-event and Y=1 the event occurred.<br /><br />As discussed <a href="https://en.wikipedia.org/wiki/Scoring_rule">here</a> and <a href="http://psiexp.ss.uci.edu/research/papers/MerkleSteyvers.pdf">here</a>, a <i>proper accuracy scoring</i> rule is a metric applied to probability forecasts. It is a metric that is optimized when the forecasted probabilities are identical to the true outcome probabilities. A <i>continuous</i> accuracy scoring rule is a metric that makes full use of the entire range of predicted probabilities and does not have a large jump because of an infinitesimal change in a predicted probability. The two most commonly used proper scoring rules are the quadratic error measure, i.e., mean squared error or <a href="https://en.wikipedia.org/wiki/Brier_score">Brier score</a>, and the logarithmic scoring rule, which is a linear translation of the log likelihood for a binary outcome model (Bernoulli trials). The logarithmic rule gives more credit to extreme predictions that are "right", but a single prediction of 1.0 when Y=0 or 0.0 when Y=1 will result in infinity no matter how accurate were all the other predictions. Because of the optimality properties of maximum likelihood estimation, the logarithmic scoring rule is in a sense the gold standard, but we more commonly use the Brier score because of its easier interpretation and its ready decomposition into various metrics measuring calibration-in-the-small, calibration-in-the-large, and discrimination.<br /><br /><i>Classification accuracy</i> is a discontinuous scoring rule. 
It implicitly or explicitly uses thresholds for probabilities, and moving a prediction from 0.0001 below the threshold to 0.0001 above the threshold results in a full accuracy change of 1/N. Classification accuracy is also an improper scoring rule. It can be optimized by choosing the wrong predictive features and giving them the wrong weights. This is best shown by a simple example that appears in <a href="http://biostat.mc.vanderbilt.edu/ClinStat">Biostatistics for Biomedical Research</a> Chapter 18 in which 400 simulated subjects have an overall fraction of Y=1 of 0.57. Consider the use of binary logistic regression to predict the probability that Y=1 given a certain set of covariates, and classify a subject as having Y=1 if the predicted probability exceeds 0.5. We simulate values of age and sex and simulate binary values of Y according to a logistic model with strong age and sex effects; the true log odds of Y=1 are <span style="font-family: "courier new" , "courier" , monospace; font-size: x-small;">(age-50)*.04 + .75*(sex=m)</span>. Fit four binary logistic models in order: a model containing only age as a predictor, one containing only sex, one containing both age and sex, and a model containing no predictors (i.e., it only has an intercept parameter). The results are in the following table:<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-SlRTo_OLpZA/WLQjYnNwGTI/AAAAAAAAI3Q/WFauwVRfBeYYlLfRL04Z29S6b0uLbWqTACLcB/s1600/z.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="233" src="https://4.bp.blogspot.com/-SlRTo_OLpZA/WLQjYnNwGTI/AAAAAAAAI3Q/WFauwVRfBeYYlLfRL04Z29S6b0uLbWqTACLcB/s320/z.png" width="320" /></a></div>Both the gold standard likelihood ratio chi-square statistic and the improper pure discrimination c-index (AUROC) indicate that both age and sex are important predictors of Y. 
Yet the highest proportion correct (classification accuracy) occurs when sex is ignored. According to the improper score, the sex variable has negative information. It is telling that a model that predicted Y=1 for every observation, i.e., one that completely ignored age and sex and only has the intercept in the model, would be 0.573 accurate, only slightly above the accuracy of using sex alone to predict Y.<br /><br />The use of a discontinuous improper accuracy score such as proportion "classified" "correctly" has led to countless misleading findings in bioinformatics, machine learning, and data science. In some extreme cases the machine learning expert failed to note that their claimed predictive accuracy was less than that achieved by ignoring the data, e.g., by just predicting Y=1 when the observed prevalence of Y=1 was 0.98 whereas their extensive data analysis yielded an accuracy of 0.97. As discussed <a href="http://www.fharrell.com/2017/01/classification-vs-prediction.html">here</a>, fans of "classifiers" sometimes subsample from observations in the most frequent outcome category (here Y=1) to get an artificial 50/50 balance of Y=0 and Y=1 when developing their classifier. Fans of such deficient notions of accuracy fail to realize that their classifier will not apply to a population with a prevalence of Y=1 much different from 0.5.<br /><br /><i>Sensitivity</i> and <i>specificity</i> are one-sided or conditional versions of classification accuracy. As such they are also discontinuous improper accuracy scores, and optimizing them will result in the wrong model.<br /><br /><a href="http://biostat.mc.vanderbilt.edu/rms">Regression Modeling Strategies</a> Chapter 10 goes into more problems with classification accuracy, and discusses many measures of the quality of probability estimates. 
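<br /><br />As a small numerical aside on the scoring rules contrasted in this post, the following Python sketch computes the Brier score, the logarithmic score, and classification accuracy for a toy set of four predictions. The data, the function names, and the sign convention for the logarithmic score (mean negative log likelihood, so smaller is better) are illustrative assumptions and do not come from the simulation described above.

```python
import math


def brier_score(y, p):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_score(y, p):
    """Mean negative log likelihood of the observed outcomes (smaller is better).

    Returns infinity if any outcome was given predicted probability zero.
    """
    total = 0.0
    for yi, pi in zip(y, p):
        prob_of_observed = pi if yi == 1 else 1.0 - pi
        if prob_of_observed == 0.0:
            return math.inf
        total -= math.log(prob_of_observed)
    return total / len(y)


def accuracy(y, p, threshold=0.5):
    """Proportion classified correctly when Y=1 is predicted if p exceeds threshold."""
    return sum((pi > threshold) == (yi == 1) for yi, pi in zip(y, p)) / len(y)


# Toy data (made up): the third prediction sits just below or just above 0.5.
y = [1, 1, 1, 0]
p_below = [0.9, 0.8, 0.4999, 0.2]
p_above = [0.9, 0.8, 0.5001, 0.2]

for label, p in (("just below", p_below), ("just above", p_above)):
    print(label, accuracy(y, p), round(brier_score(y, p), 5), round(log_score(y, p), 5))
```

Moving one prediction by 0.0002 across the threshold changes accuracy by a full 1/N (here 0.25), while the Brier and logarithmic scores barely move. A single prediction of 1.0 when Y=0 makes the logarithmic score infinite, as noted earlier.<br /><br />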
The text contains suggested measures to emphasize such as Brier score, pseudo R-squared (a simple function of the logarithmic scoring rule), c-index, and especially smooth nonparametric calibration plots to demonstrate absolute accuracy of estimated probabilities.<br /><br /><br />Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com4tag:blogger.com,1999:blog-5376139322696503442.post-2350316649043495732017-02-19T10:23:00.000-06:002017-07-27T09:06:50.541-05:00My Journey From Frequentist to Bayesian Statistics<div style="text-align: center;"><small><i>Type I error for smoke detector: probability of alarm given no fire=0.05<br>Bayesian: probability of fire given current air data<br><br>Frequentist smoke alarm designed as most research is done:<br>Set the alarm trigger so as to have a 0.8 chance of detecting an inferno<br><br>Advantage of actionable evidence quantification:<br>Set the alarm to trigger when the posterior probability of a fire exceeds 0.02 while at home and at 0.01 while away </i></small></div><br><br> <div style="text-align: justify;">If I had been taught Bayesian modeling before being taught the frequentist paradigm, I'm sure I would have always been a Bayesian. I started becoming a Bayesian about 1994 because of an <a href="http://www.citeulike.org/user/harrelfe/article/13264891">influential paper</a> by David Spiegelhalter and because I worked in the same building at Duke University as Don Berry. Two other things strongly contributed to my thinking: difficulties explaining p-values and confidence intervals (especially the latter) to clinical researchers, and difficulty of learning group sequential methods in clinical trials. When I talked with Don and learned about the flexibility of the Bayesian approach to clinical trials, and saw Spiegelhalter's embrace of Bayesian methods because of its problem-solving abilities, I was hooked. 
[Note: I've heard Don say that he became Bayesian after multiple attempts to teach statistics students the exact definition of a confidence interval. He decided the concept was defective.]</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">At the time I was working on clinical trials at Duke and started to see that multiplicity adjustments were arbitrary. This started with a clinical trial coordinated by Duke in which low dose and high dose of a new drug were to be compared to placebo, using an alpha cutoff of 0.03 for each comparison to adjust for multiplicity. The comparison of high dose with placebo resulted in a p-value of 0.04 and the trial was labeled completely "negative" which seemed problematic to me. [Note: the p-value was two-sided and thus didn't give any special "credit" for the treatment effect coming out in the right direction.]</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I began to see that the hypothesis testing framework wasn't always the best approach to science, and that in biomedical research the typical hypothesis was an artificial construct designed to placate a reviewer who believed that an NIH grant's specific aims must include null hypotheses. I saw the contortions that investigators went through to achieve this, came to see that questions are more relevant than hypotheses, and estimation was even more important than questions. With Bayes, estimation is emphasized. I very much like Bayesian modeling instead of hypothesis testing. I saw that a large number of clinical trials were incorrectly interpreted when p>0.05 because the investigators involved failed to realize that a p-value can only provide evidence against a hypothesis. Investigators are motivated by "we spent a lot of time and money and must have gained something from this experiment." 
The classic "<a href="http://www.bmj.com/content/311/7003/485">absence of evidence is not evidence of absence</a>" error results, whereas with Bayes it is easy to estimate the probability of similarity of two treatments. Investigators will be surprised to know how little we have learned from clinical trials that are not huge when p>0.05.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I listened to many discussions of famous clinical trialists debating what should be the primary endpoint in a trial, the co-primary endpoint, the secondary endpoints, co-secondary endpoints, etc. This was all because of their paying attention to alpha-spending. I realized this was all a game.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I came to not believe in the possibility of infinitely many repetitions of identical experiments, as required to be envisioned in the frequentist paradigm. When I looked more thoroughly into the multiplicity problem, and sequential testing, and I looked at Bayesian solutions, I became more of a believer in the approach. I learned that posterior probabilities have a simple interpretation independent of the stopping rule and frequency of data looks. I got involved in working with the FDA and then consulting with pharmaceutical companies, and started observing how multiple clinical endpoints were handled. I saw a case where a company was seeking a superiority claim for a new drug, and if there was insufficient evidence for such a claim, they wanted to seek a non-inferiority claim on another endpoint. They developed a closed testing procedure that when diagrammed truly looked like a train wreck. I felt there had to be a better approach, so I sought to see how far posterior probabilities could be pushed. 
I found that with MCMC simulation of Bayesian posterior draws I could quite simply compute probabilities such as P(any efficacy), P(efficacy more than trivial), P(non-inferiority), P(efficacy on endpoint A and on either endpoint B or endpoint C), and P(benefit on more than 2 of 5 endpoints). I realized that frequentist multiplicity problems came from the chances you give data to be more extreme, not from the chances you give assertions to be true.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I enjoy the fact that posterior probabilities define their own error probabilities, and that they count not only inefficacy but also harm. If P(efficacy)=0.97, P(no effect or harm)=0.03. This is the "regulator's regret", and type I error is not the error of major interest (is it really even an 'error'?). One minus a p-value is P(data in general are less extreme than that observed if H0 is true) which is the probability of an event I'm not that interested in.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">The extreme amount of time I spent analyzing data led me to understand other problems with the frequentist approach. Parameters are either in a model or not in a model. We test for interactions with treatment and hope that the p-value is not between 0.02 and 0.2. We either include the interactions or exclude them, and the power for the interaction test is modest. Bayesians have a prior for the differential treatment effect and can easily have interactions "half in" the model. Dichotomous irrevocable decisions are at the heart of many of the statistical modeling problems we have today. I really like penalized maximum likelihood estimation (which is really empirical Bayes) but once we have a penalized model all of our frequentist inferential framework fails us. No one can interpret a confidence interval for a biased (shrunken; penalized) estimate. 
On the other hand, the Bayesian posterior probability density function, after shrinkage is accomplished using skeptical priors, is just as easy to interpret as had the prior been flat. For another example, consider a categorical predictor variable that we hope is predicting in an ordinal (monotonic) fashion. We tend to either model it as ordinal or as completely unordered (using k-1 indicator variables for k categories). A Bayesian would say "let's use a prior that favors monotonicity but allows larger sample sizes to override this belief."</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Now that adaptive and sequential experiments are becoming more popular, and a formal mechanism is needed to use data from one experiment to inform a later experiment (a good example being the use of adult clinical trial data to inform clinical trials on children when it is difficult to enroll a sufficient number of children for the child data to stand on their own), Bayes is needed more than ever. It took me a while to realize something that is quite profound: A Bayesian solution to a simple problem (e.g., 2-group comparison of means) can be embedded into a complex design (e.g., adaptive clinical trial) <b>without modification</b>. Frequentist solutions require highly complex modifications to work in the adaptive trial setting.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">I met likelihoodist <a href="http://biostat.mc.vanderbilt.edu/JeffreyBlume">Jeffrey Blume</a> in 2008 and started to like the likelihood approach. It is more Bayesian than frequentist. I plan to learn more about this paradigm. </div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">Several readers have asked me how I could believe all this and publish a frequentist-based book such as <i>Regression Modeling Strategies</i>. There are two primary reasons. 
First, I started writing the book before I knew much about Bayes. Second, I performed a lot of simulation studies that showed that purely empirical model-building had a low chance of capturing clinical phenomena correctly and of validating on new datasets. I worked extensively with cardiologists such as Rob Califf, Dan Mark, Mark Hlatky, David Prior, and Phil Harris who gave me the ideas for injecting clinical knowledge into model specification. From that experience I wrote <i>Regression Modeling Strategies</i> in the most Bayesian way I could without actually using specific Bayesian methods. I did this by emphasizing subject-matter-guided model specification. The section in the book about specification of interaction terms is perhaps the best example. When I teach the full-semester version of my course I interject Bayesian counterparts to many of the techniques covered.</div><div style="text-align: justify;"><br /></div><div style="text-align: justify;">There are challenges in moving more to a Bayesian approach. The ones I encounter most frequently are:</div><ol><li style="text-align: justify;">Teaching clinical trialists to embrace Bayes when they already do in spirit but not operationally. Unlearning things is much more difficult than learning things.</li><li style="text-align: justify;">How to work with sponsors, regulators, and NIH principal investigators to specify the (usually skeptical) prior up front, and to specify the amount of applicability assumed for previous data.</li><li style="text-align: justify;">What is a Bayesian version of the multiple degree of freedom "chunk test"? Partitioning sums of squares or the log likelihood into components, e.g., combined test of interaction and combined test of nonlinearities, is very easy and natural in the frequentist setting.</li><li style="text-align: justify;">How do we specify priors for complex entities such as the degree of monotonicity of the effect of a continuous predictor in a regression model? 
The Bayesian approach to this will ultimately be more satisfying, but operationalizing this is not easy.</li></ol><div style="text-align: justify;">With new tools such as <a href="http://mc-stan.org/">Stan</a> and well written accessible books such as <a href="http://www.citeulike.org/user/harrelfe/article/14172337">Kruschke's</a> it's getting to be easier to be Bayesian each day. The R <a href="https://cran.r-project.org/web/packages/brms">brms</a> package, which uses Stan, makes a large class of regression models even more accessible.<br /><br /><br /><br /><hr><a name='more'></a>See the following for discussions about this article that are not on this blog.<br /><br /><ul><li>https://news.ycombinator.com/item?id=13684429</li></ul></div><div style="text-align: justify;"><br /></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com37tag:blogger.com,1999:blog-5376139322696503442.post-21027066199349958892017-02-05T09:43:00.000-06:002017-02-26T06:48:43.203-06:00Interactive Statistical Graphics: Showing More By Showing LessVersion 4 of the R <a href="http://biostat.mc.vanderbilt.edu/Hmisc">Hmisc</a> package and version 5 of the R <a href="http://biostat.mc.vanderbilt.edu/Rrms">rms</a> package interface with interactive <a href="https://plot.ly/r/">plotly</a> graphics, which is an interface to the D3 JavaScript graphics library. This allows various results of statistical analyses to be viewed interactively, with pre-programmed drill-down information. More examples will be added here. We start with a video showing a new way to display survival curves. <br /><br />Note that plotly graphics are best used with RStudio Rmarkdown html notebooks, and are distributed to reviewers as self-contained (but somewhat large) html files. 
Printing is discouraged, but possible, using snapshots of the interactive graphics.<br /><br />Concerning the second bullet point below, boxplots have a high ink:information ratio and hide bimodality and other data features. Many statisticians prefer to use dot plots and violin plots. I liked those methods for a while, then started to have trouble with the choice of a smoothing bandwidth in violin plots, and found that dot plots do not scale well to very large datasets, whereas spike histograms are useful for all sample sizes. Users of dot charts have to have a dot stand for more than one observation if N is large, and I found the process too arbitrary. For spike histograms I typically use 100 or 200 bins. When the number of distinct data values is below the specified number of bins, I just do a frequency tabulation for all distinct data values, rounding only when two of the values are very close to each other. A spike histogram approximately reduces to a rug plot when there are no ties in the data, and I very much like rug plots.<br /><br /><ul><li>rms survplotp <a href="https://youtu.be/EoIB_Obddrk">video</a>: plotting survival curves</li><li>Hmisc histboxp <a href="http://data.vanderbilt.edu/fh/R/Hmisc/examples.nb.html#better_demonstration_of_boxplot_replacement">interactive html example</a>: spike histograms plus selected quantiles, mean, and Gini's mean difference - replacement for boxplots - show all the data! 
Note bimodal distributions and zero blood pressure values for patients having a cardiac arrest.</li></ul><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-hsRxcCS4HMs/WKTJXR7PO5I/AAAAAAAAIVo/9bzR4N5u_aUCHWJ2m2kqNvpGLKhQzfDcACK4B/s1600/histboxp.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="107" src="https://3.bp.blogspot.com/-hsRxcCS4HMs/WKTJXR7PO5I/AAAAAAAAIVo/9bzR4N5u_aUCHWJ2m2kqNvpGLKhQzfDcACK4B/s320/histboxp.png" width="320" /></a></div><div><br /></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com5tag:blogger.com,1999:blog-5376139322696503442.post-58368734038944632682017-02-05T08:39:00.000-06:002017-02-05T09:00:02.045-06:00A Litany of Problems With p-valuesIn my opinion, null hypothesis testing and p-values have done significant harm to science. The purpose of this note is to catalog the many problems caused by p-values. As readers post new problems in their comments, more will be incorporated into the list, so this is a work in progress.<br /><br />The American Statistical Association has done a great service by issuing its <a href="http://www.amstat.org/asa/files/pdfs/P-ValueStatement.pdf">Statement on Statistical Significance and P-values</a>. Now it's time to act. To create the needed motivation to change, we need to fully describe the depth of the problem.<br /><br />It is important to note that no statistical paradigm is perfect. Statisticians should choose paradigms that solve the greatest number of real problems and have the fewest number of faults. This is why I believe that the Bayesian and likelihood paradigms should replace frequentist inference.<br /><br />Consider an assertion such as "the coin is fair", "treatment A yields the same blood pressure as treatment B", "B yields lower blood pressure than A", or "B lowers blood pressure at least 5mmHg before A." 
Consider also a compound assertion such as "A lowers blood pressure by at least 3mmHg and does not raise the risk of stroke."<br /><br /><h3>A. Problems With Conditioning</h3><div><ol><li>p-values condition on what is unknown (the assertion of interest; H<sub>0</sub>) and do not condition on what is known (the data).</li><li>This conditioning does not respect the flow of time and information; p-values are backward probabilities.</li></ol><div><h3>B. Indirectness</h3></div><ol><li>Because of A above, p-values provide only indirect evidence and are problematic as evidence metrics. They are sometimes monotonically related to the evidence (e.g., when the prior distribution is flat) we need but are not properly calibrated for decision making.</li><li>p-values are used to bring indirect evidence against an assertion but cannot bring evidence in favor of the assertion. </li><li>As detailed <a href="http://www.fharrell.com/2017/01/null-hypothesis-significance-testing.html">here</a>, the idea of proof by contradiction is a stretch when working with probabilities, so trying to quantify evidence for an assertion by bringing evidence against its complement is on shaky ground.</li><li>Because of A, p-values are difficult to interpret and very few non-statisticians get it right. The best article on misinterpretations I've found is <a href="http://www.citeulike.org/user/harrelfe/article/14042559">here</a>.</li></ol><div><h3>C. Problem Defining the Event Whose Probability is Computed</h3></div><ol><li>In the continuous data case, the probability of getting a result as extreme as that observed with our sample is zero, so the p-value is the probability of getting a result <i>more extreme</i> than that observed. Is this the correct point of reference?</li><li>How does <i>more extreme</i> get defined if there are sequential analyses and multiple endpoints or subgroups? 
For sequential analyses do we consider planned analyses or analyses intended to be run even if they were not?</li></ol><div><h3>D. Problems Actually Computing p-values</h3></div><ol><li>In some discrete data cases, e.g., comparing two proportions, there is tremendous disagreement among statisticians about how p-values should be calculated. In a famous 2x2 table from an ECMO adaptive clinical trial, 13 p-values have been computed from the same data, ranging from 0.001 to 1.0. And many statisticians do not realize that Fisher's so-called "exact" test is not very accurate in many cases.</li><li>Outside of binomial, exponential, and normal (with equal variance) and a few other cases, p-values are actually very difficult to compute exactly, and many p-values computed by statisticians are of unknown accuracy (e.g., in logistic regression and mixed effects models). The more non-quadratic the log likelihood function the more problematic this becomes in many cases. </li><li>One can compute (sometimes requiring simulation) the type-I error of many multi-stage procedures, but actually computing a p-value that can be taken out of context can be quite difficult and sometimes impossible. One example: one can control the false discovery probability (usually incorrectly referred to as a rate), and ad hoc modifications of nominal p-values have been proposed, but these are not necessarily in line with the real definition of a p-value.</li></ol><div><h3>E. The Multiplicity Mess</h3></div><ol><li>Frequentist statistics does not have a recipe or blueprint leading to a unique solution for multiplicity problems, so when many p-values are computed, the way they are penalized for multiple comparisons results in endless arguments. A Bonferroni multiplicity adjustment is consistent with a Bayesian prior distribution specifying that the probability that all null hypotheses are true is a constant no matter how many hypotheses are tested. 
By contrast, Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A), P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions about a true effect.</li><li>There remains controversy over the choice of 1-tailed vs. 2-tailed tests. The 2-tailed test can be thought of as a multiplicity penalty for being potentially excited about either a positive effect or a negative effect of a treatment. But few researchers want to bring evidence that a treatment harms patients; a pharmaceutical company would not seek a licensing claim of harm. So when one computes the probability of obtaining an effect larger than that observed if there is no true effect, why do we too often ignore the sign of the effect and compute the (2-tailed) p-value?</li><li>Because it is a very difficult problem to compute p-values when the assertion is compound, researchers using frequentist methods do not attempt to provide simultaneous evidence regarding such assertions and instead rely on ad hoc multiplicity adjustments.</li><li>Because of A1, statistical testing with multiple looks at the data, e.g., in sequential data monitoring, is ad hoc and complex. Scientific flexibility is discouraged. The p-value for an early data look must be adjusted for future looks. The p-value at the final data look must be adjusted for the earlier inconsequential looks. Unblinded sample size re-estimation is another case in point. If the sample size is expanded to gain more information, there is a multiplicity problem and some of the methods commonly used to analyze the final data effectively discount the first wave of subjects. How can that make any scientific sense?</li><li>Most practitioners of frequentist inference do not understand that multiplicity comes from chances you give data to be extreme, not from chances you give true effects to be present.</li></ol><div><h3>F. 
Problems With Non-Trivial Hypotheses</h3></div><ol><li>It is difficult to test non-point hypotheses such as "drug A is similar to drug B".</li><li>There is no straightforward way to test compound hypotheses coming from logical unions and intersections. </li></ol><div><h3>G. Inability to Incorporate Context and Other Information</h3></div><ol><li>Because extraordinary claims require extraordinary evidence, there is a serious problem with the p-value's inability to incorporate context or prior evidence. A Bayesian analysis of the existence of ESP would no doubt start with a very skeptical prior that would require extraordinary data to overcome, but the bar for getting a "significant" p-value is fairly low. Frequentist inference has a greater risk for getting the direction of an effect wrong (see <a href="http://andrewgelman.com/">here</a> for more).</li><li>p-values are unable to incorporate outside evidence. As a converse to 1, strong prior beliefs are unable to be handled by p-values, and in some cases this results in a lack of progress. Nate Silver in <i>The Signal and the Noise</i> beautifully details how the conclusion that cigarette smoking causes lung cancer was greatly delayed (with a large negative effect on public health) because scientists (especially Fisher) were caught up in the frequentist way of thinking, dictating that only randomized trial data would yield a valid p-value for testing cause and effect. A Bayesian prior that was very strongly against the belief that smoking was causal is obliterated by the incredibly strong observational data. Only by incorporating prior skepticism could one make a strong conclusion with non-randomized data in the smoking-lung cancer debate.</li><li>p-values require subjective input from the producer of the data rather than from the consumer of the data.</li></ol><h3>H. 
Problems Interpreting and Acting on "Positive" Findings</h3><ol><li>With a large enough sample, a trivial effect can cause an impressively small p-value (statistical significance ≠ clinical significance).</li><li>Statisticians and subject matter researchers (especially the latter) sought a "seal of approval" for their research by naming a cutoff on what should be considered "statistically significant", and a cutoff of p=0.05 is most commonly used. Any time there is a threshold there is a motive to game the system, and gaming (p-hacking) is rampant. Hypotheses are exchanged if the original H<sub>0</sub> is not rejected, subjects are excluded, and because statistical analysis plans are not pre-specified as required in clinical trials and regulatory activities, researchers and their all-too-accommodating statisticians play with the analysis until something "significant" emerges.</li><li>When the p-value is small, researchers act as though the point estimate of the effect is a population value.</li><li>When the p-value is small, researchers believe that their conceptual framework has been validated. </li></ol><div><h3>I. Problems Interpreting and Acting on "Negative" Findings</h3><ol><li>Because of B2, large p-values are uninformative and do not assist the researcher in decision making (Fisher said that a large p-value means "get more data").</li></ol><div><br /></div></div><div></div></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com82tag:blogger.com,1999:blog-5376139322696503442.post-42968695135969555792017-01-27T07:03:00.002-06:002017-07-17T06:39:13.445-05:00Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank GoodnessRandomized clinical trials (RCT) have long been held as the gold standard for generating evidence about the effectiveness of medical and surgical treatments, and for good reason. 
But I commonly hear clinicians lament that the results of RCTs are not generalizable to medical practice, primarily for two reasons:<br /><ol><li>Patients in clinical practice are different from those enrolled in RCTs</li><li>Drug adherence in clinical practice is likely to be lower than that achieved in RCTs, resulting in lower efficacy.</li></ol><div>Point 2 is hard to debate because RCTs are run under protocol and research personnel are watching and asking about patients' adherence (but more about this below). But point 1 is a misplaced worry in the majority of trials. The explanation requires getting to the heart of what RCTs are really intended to do: provide evidence for <b>relative</b> treatment effectiveness. There are some trials that provide evidence for both relative and absolute effectiveness. This is especially true when the efficacy measure employed is absolute as in measuring blood pressure reduction due to a new treatment. But many trials use binary or time-to-event endpoints and the resulting efficacy measure is on a relative scale such as the odds ratio or hazard ratio.</div><div><br /></div><div>RCTs of even drastically different patients can provide estimates of relative treatment benefit on odds or hazard ratio scales that are highly transportable. This is most readily seen in subgroup analyses provided by the trials themselves - so called forest plots that demonstrate remarkable constancy of relative treatment benefit. When an effect ratio is applied to a population with a much different risk profile, that relative effect can still fully apply. It is only likely that the absolute treatment benefit will change, and it is easy to estimate the absolute benefit (e.g., risk difference) for a patient given the relative benefit and the absolute baseline risk for the subject. This is covered in detail in <a href="http://biostat.mc.vanderbilt.edu/tmp/bbr.pdf">Biostatistics for Biomedical Research</a>, Section 13.6. 
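<br /><br />To make the conversion from relative to absolute benefit concrete, here is a minimal sketch in Python. It assumes the relative effect is an odds ratio that stays constant across baseline risks. The odds ratio of 0.7, the baseline risks, and the function names are hypothetical illustrations, not values taken from any trial or from BBR.

```python
def risk_on_treatment(baseline_risk, odds_ratio):
    """Risk under treatment when a constant odds ratio is applied to the baseline risk."""
    baseline_odds = baseline_risk / (1.0 - baseline_risk)
    treated_odds = baseline_odds * odds_ratio
    return treated_odds / (1.0 + treated_odds)


def absolute_risk_reduction(baseline_risk, odds_ratio):
    """Risk difference (control minus treated) implied by the odds ratio."""
    return baseline_risk - risk_on_treatment(baseline_risk, odds_ratio)


# Hypothetical odds ratio, assumed constant across patient types.
ODDS_RATIO = 0.7

for baseline in (0.02, 0.10, 0.30, 0.50):
    arr = absolute_risk_reduction(baseline, ODDS_RATIO)
    print(f"baseline risk {baseline:.2f}: absolute risk reduction {arr:.4f}")
```

The relative effect is identical in every row, yet the implied absolute benefit differs greatly between low-risk and high-risk patients, which is the sense in which only the absolute benefit needs to be re-estimated for a new population.<br /><br />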
See also Stephen Senn's excellent <a href="https://www.slideshare.net/StephenSenn1/real-world-modified">presentation</a>.</div><div><br /></div><div>Clinical practice provides anecdotal evidence that biases clinicians. What a clinician sees in her practice is patient i on treatment A and patient j on treatment B. She may remember how patient i fared in comparison to patient j, not appreciate confounding by indication, and suppose this provides a valid estimate of the difference in effectiveness of treatment A vs. B. But the real therapeutic question is how the outcome of a patient, were she given treatment A, compares to her outcome were she given treatment B. The gold standard design is thus the randomized crossover design, when the treatment is short acting. Stephen Senn eloquently <a href="http://onlinelibrary.wiley.com/doi/10.1002/sim.6739/abstract">writes</a> about how a 6-period 2-treatment crossover study can even do what proponents of personalized medicine mistakenly think they can do with a parallel-group randomized trial: estimate treatment effectiveness for individual patients.<br /><br />For clinical practice to provide the evidence really needed, the clinician would have to see patients and assign treatments using one of the top four approaches listed in the hierarchy of evidence below. Entries are ordered from the strongest evidence, requiring the fewest assumptions, to the weakest evidence. 
Note that crossover studies, when feasible, even surpass randomized studies of matched identical twins in the quality and relevance of information they provide.<br /><br /><span style="font-size: medium;">Let </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;"><span style="font-style: italic;">i</span></span></sub><span style="font-size: medium;"> denote patient </span><span style="font-size: medium;"><span style="font-style: italic;">i</span></span><span style="font-size: medium;"> and the treatments be denoted by </span><span style="font-size: medium;"><span style="font-style: italic;">A</span></span><span style="font-size: medium;"> and </span><span style="font-size: medium;"><span style="font-style: italic;">B</span></span><span style="font-size: medium;">. Thus </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">2</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;"> represents patient 2 on treatment </span><span style="font-size: medium;"><span style="font-style: italic;">B</span></span><span style="font-size: medium;">. </span><span style="font-size: medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">1</span></sub><span style="font-size: medium;"> represents the average outcome over a sample of patients from which patient 1 was selected. 
HTE is heterogeneity of treatment effect.</span><br /><span style="font-size: medium;"><br /></span><br /><table class="cellpading0" style="border-spacing: 6px;"><tbody><tr><td style="white-space: nowrap;"><span style="font-size: large;"><span style="font-weight: bold;">Design</span></span></td><td style="white-space: nowrap;"><span style="font-size: large;"><span style="font-weight: bold;">Patients Compared</span></span><span style="font-size: large;"></span></td></tr><tr><td class="hbar" colspan="2"><span style="font-size: medium;"></span></td></tr><tr><td style="white-space: nowrap;"><span style="font-size: medium;">6-period crossover</span></td><td style="white-space: nowrap;"><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;"> (directly measure HTE)</span></td></tr><tr><td style="white-space: nowrap;"><span style="font-size: medium;">2-period crossover</span></td><td style="white-space: nowrap;"><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;"></span></td></tr><tr><td style="white-space: nowrap;"><span 
style="font-size: medium;">RCT in identical twins</span></td><td style="white-space: nowrap;"><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;"></span></td></tr><tr><td style="white-space: nowrap;"><span style="font-size: medium;">∥</span><span style="font-size: medium;"> group RCT</span></td><td style="white-space: nowrap;"><span style="font-size: medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">2</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;">, </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><span style="font-size: medium;">=</span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">2</span></sub><span style="font-size: medium;"> on avg</span></td></tr><tr><td style="white-space: nowrap;"><span style="font-size: medium;">Observational, good artificial control</span></td><td style="white-space: nowrap;"><span style="font-size: 
medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">2</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;">, </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><span style="font-size: medium;">=</span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">2</span></sub><span style="font-size: medium;"> hopefully on avg</span></td></tr><tr><td style="white-space: nowrap;"><span style="font-size: medium;">Observational, poor artificial control</span></td><td style="white-space: nowrap;"><span style="font-size: medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="text-decoration: overline;"><span style="font-style: italic;">P</span></span></span><sub><span style="font-size: medium;">2</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;">, </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><span style="font-size: medium;">≠ </span><span style="font-size: medium;"><span 
style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">2</span></sub><span style="font-size: medium;"> on avg</span></td></tr><tr><td style="white-space: nowrap;"><span style="font-size: medium;">Real-world physician practice</span></td><td style="white-space: nowrap;"><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">1</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">A</span></span></sup><span style="font-size: medium;"> vs </span><span style="font-size: medium;"><span style="font-style: italic;">P</span></span><sub><span style="font-size: medium;">2</span></sub><sup><span style="font-size: medium;"><span style="font-style: italic;">B</span></span></sup><span style="font-size: medium;"></span></td></tr><tr><td class="hbar" colspan="2"><span style="font-size: medium;"></span></td></tr></tbody></table></div><div><br /></div><div>The best experimental designs yield the best evidence a clinician needs to answer the "what if" therapeutic question for the one patient in front of her.<br /><br />Regarding adherence, proponents of "real world" evidence advocate estimating treatment effects under adherence that is as low as in clinical practice. This would result in lower efficacy and the abandonment of many treatments. It is hard to argue that a treatment should not be available for a potentially adherent patient because her fellow patients were poor adherers. Note that an RCT is the best hope for estimating efficacy as a function of adherence, through for example an instrumental variable analysis (the randomization assignment is a truly valid instrument). Much more needs to be said about how to handle treatment adherence and what should be the target adherence in an RCT, but overall it is a good thing that RCTs do not mimic clinical practice. We are entering a new era of pragmatic clinical trials. 
Pragmatic trials are worthy of in-depth discussion, but it is not a stretch to say that the chief advantage of pragmatic trials is not that they provide results that are more relevant to clinical practice but that they are cheaper and faster than traditional randomized trials.<br /><br /></div><hr><span style="font-size: medium;">Updated 2017-06-25 (last paragraph regarding adherence) </span>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com3tag:blogger.com,1999:blog-5376139322696503442.post-86293762733253603672017-01-25T06:16:00.001-06:002017-01-25T06:16:46.539-06:00Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I ErrorImagine watching a baseball game, seeing the batter get a hit, and hearing the announcer say "The chance that the batter is left handed is now 0.2!" No one would care. Baseball fans are interested in the chance that a batter will get a hit conditional on his being right handed (handedness being already known to the fan), the handedness of the pitcher, etc. Unless one is an archaeologist or medical examiner, the interest is in forward probabilities conditional on current and past states. We are interested in the probability of the unknown given the known and the probability of a future event given past and present conditions and events.<br /><br />Clinicians are people trained in the science and practice of medicine, and most of them are very good at it. They are also very good at many aspects of research. But they are generally not taught probability, and this can limit their research skills. Many excellent clinicians even let their limitations in understanding probability make them believe that their clinical decision making is worse than it actually is. I have taught many clinicians who say "I need a hard and fast rule so I know how to diagnose or treat patients. I need a hard cutoff on blood pressure, HbA1c, etc. 
so that I know what to do, and the fact that I either treat or do not treat the patient means that I don't want to consider a probability of disease but desire a simple classification rule." This makes the clinician try to influence the statistician to use inefficient, arbitrary methods such as categorization, stratification, and matching.<br /><br />In reality, clinicians do not act that way when treating patients. They are smart enough to know that if a patient has cholesterol just over someone's arbitrary threshold they may not start statin therapy right away if the patient has no other risk factors (e.g., smoking) going against him. They know that sometimes you start a patient on a lower dose and see how she responds, or start one drug and try it for a while and then switch drugs if the efficacy is unacceptable or there is a significant side effect.<br /><br />So I emphasize the need to understand probabilities when I'm teaching clinicians. A probability is a self-contained summary of the current information, except for the patient's risk aversion and other utilities. Clinicians need to be comfortable with a probability of 0.5 meaning "we don't know much" and should not request a classification of disease/normal that does nothing but cover up the problem. A classification does not account for gray zones or patient and physician utility functions.<br /><br />Even physicians who understand the meaning of a probability often do not understand conditioning. Conditioning is all-important, and conditioning on different things massively changes the meaning of the probabilities being computed. Every physician I've known has been taught probabilistic medical diagnosis by first learning about sensitivity (sens) and specificity (spec). These are probabilities that are in backwards time- and information-flow order. How did this happen? Sensitivity, specificity, and receiver operating characteristic curves were developed for radar and radio research in the military. 
It was important to receive radio signals from distant aircraft, and to detect an incoming aircraft on radar. The ability to detect something that is really there is definitely important. In the 1950s, virologists appropriated these concepts to measure the performance of viral cultures. Virus needs to be detected when it's present, and not detected when it's not. Sensitivity is the probability of detecting a condition when it is truly present, and specificity is the probability of not detecting it when it is truly absent. One can see how these probabilities would be useful outside of virology and bacteriology when the samples are retrospective, as in case-control studies. But I believe that clinicians and researchers would be better off if backward probabilities were not taught or were mentioned only to illustrate how <b>not</b> to think about a problem.<br /><br />But the way medical students are educated, they assume that sens and spec are what you first consider in a prospective cohort of patients! This gives the professor the opportunity of teaching Bayes' rule and requires the use of a supposedly unconditional probability known as <i>prevalence</i> which is actually not very well defined. The student plugs everything into Bayes' rule and fails to notice that several quantities cancel out. The result is the following: the proportion of patients with a positive test who have disease, and the proportion with a negative test who have disease. These are trivially calculated from the cohort data without knowing anything about sens, spec, and Bayes. This way of thinking harms the student's understanding for years to come and influences those who later engage in clinical and pharmaceutical research to believe that type I error and p-values are directly useful.<br /><br />The situation in medical diagnosis gets worse when referral bias (also called workup bias) is present. When certain types of patients do not get a final diagnosis, sens and spec are biased. 
For example, younger women with a negative test may not get the painful procedure that yields the final diagnosis. There are formulas that must be used to correct sens and spec. But wait! When Bayes' rule is used to obtain the probability of disease we needed in the first place, these corrections completely cancel out when the usual correction methods are used! Using forward probabilities in the first place means that one just conditions on age, sex, and result of the initial diagnostic test and no special methods other than (sometimes) logistic regression are required.<br /><br />There is an analogy to statistical testing. p-values and type I error are affected by sequential testing and a host of other factors, but forward-time probabilities (Bayesian posterior probabilities) are not. Posterior probabilities condition on what is known and do not have to imagine alternate paths to getting to what is known (as do sens and spec when workup bias exists). p-values and type I errors are backwards-information-flow measures, and clinical researchers and regulators come to believe that type I error is the error of interest. They also very frequently misinterpret p-values. The p-value is one minus spec, and power is sens. The posterior probability is exactly analogous to the probability of disease.<br /><br />Sens and spec are so pervasive in medicine, bioinformatics, and biomarker research that we don't question how silly they would be in other contexts. Do we dichotomize a response variable so that we can compute the probability that a patient is on treatment B given a "positive" response? On the contrary we want to know the full continuous distribution of the response given the assigned treatment. 
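<br /><br />The earlier point about diagnosis, that plugging sens, spec, and prevalence into Bayes' rule only reproduces proportions that can be read straight off a prospective cohort, can be checked with a small Python sketch. The 2x2 counts and variable names below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical prospective cohort cross-classified by test result and true disease status.
true_pos, false_neg = 90, 10      # diseased patients: test positive / test negative
false_pos, true_neg = 180, 720    # non-diseased patients: test positive / test negative

diseased = true_pos + false_neg
healthy = false_pos + true_neg

# Forward probabilities, read directly from the cohort.
p_disease_given_pos = true_pos / (true_pos + false_pos)
p_disease_given_neg = false_neg / (false_neg + true_neg)

# Backward route: compute sens, spec, and prevalence, then apply Bayes' rule.
sens = true_pos / diseased
spec = true_neg / healthy
prev = diseased / (diseased + healthy)

bayes_pos = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
bayes_neg = (1 - sens) * prev / ((1 - sens) * prev + spec * (1 - prev))

print(p_disease_given_pos, bayes_pos)
print(p_disease_given_neg, bayes_neg)
print(math.isclose(p_disease_given_pos, bayes_pos),
      math.isclose(p_disease_given_neg, bayes_neg))
```

Both routes give the same two probabilities because the group totals cancel algebraically, so in a cohort like this one the detour through sens and spec adds nothing.<br /><br />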
Again this represents forward probabilities.<br /><br />Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com10tag:blogger.com,1999:blog-5376139322696503442.post-71852801308958021812017-01-23T07:47:00.002-06:002017-01-25T12:18:25.892-06:00Split-Sample Model ValidationMethods used to obtain unbiased estimates of future performance of statistical prediction models and classifiers include data splitting and resampling. The two most commonly used resampling methods are cross-validation and bootstrapping. To be as good as the bootstrap, about 100 repeats of 10-fold cross-validation are required.<br /><br />As discussed in more detail in Section 5.3 of <a href="http://biostat.mc.vanderbilt.edu/rms">Regression Modeling Strategies Course Notes</a> and the same section of the RMS book, data splitting is an unstable method for validating models or classifiers, especially when the number of subjects is less than about 20,000 (fewer if signal:noise ratio is high). This is because were you to split the data again, develop a new model on the training sample, and test it on the holdout sample, the results are likely to vary significantly. Data splitting requires a significantly larger sample size than resampling to work acceptably well. See also Section 10.11 of <a href="http://biostat.mc.vanderbilt.edu/tmp/bbr.pdf">BBR</a>.<br /><br />There are also very subtle problems:<br /><br /><ol><li>When feature selection is done, data splitting validates just one of a myriad of potential models. In effect it validates an example model. Resampling (repeated cross-validation or the bootstrap) validates the process that was used to develop the model. 
Resampling is honest in reporting the results because it depicts the uncertainty in feature selection, e.g., the disagreements in which variables are selected from one resample to the next.</li><li>It is not uncommon for researchers to be disappointed in the test sample validation and to ask for a "re-do" whereby another split is made or the modeling starts over, or both. When reporting the final result they sometimes neglect to mention that the result was the third attempt at validation.</li><li>Users of split-sample validation are wise to recombine the two samples to get a better model once the first model is validated. But then they have no validation of the new combined data model.</li></ol><div>There is a less subtle problem but one that is ordinarily not addressed by investigators: unless both the training and test samples are huge, split-sample validation is not nearly as accurate as the bootstrap. See for example the section <i>Studies of Methods Used in the Text </i><a href="http://biostat.mc.vanderbilt.edu/rms">here</a>. As shown in a simulation appearing there, bootstrapping is typically more accurate than data splitting and cross-validation that does not use a large number of repeats. This is shown by estimating the "true" performance, e.g., the R-squared or c-index on an infinitely large dataset (infinite here means 50,000 subjects for practical purposes). The performance of an accuracy estimate is taken as the mean squared error of the estimate against the model's performance in the 50,000 subjects.</div><div><br /></div><div>Data are too precious to not be used in model development/parameter estimation. Resampling methods allow the data to be used for both development and validation, and they do a good job in estimating the likely future performance of a model. 
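<br /><br />As a rough illustration of the contrast, the Python sketch below repeats a 50/50 data split many times and also computes an optimism-corrected bootstrap estimate of R-squared for a one-predictor linear model. The choice of the optimism bootstrap is an assumption about which bootstrap variant is meant, and the simulated data, sample size, seeds, and function names are all hypothetical. This is not the simulation referred to above.

```python
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(model, xs, ys):
    """1 - SSE/SST of an already fitted line, evaluated on (xs, ys)."""
    intercept, slope = model
    my = statistics.fmean(ys)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1.0 - sse / sst


def split_sample_r2(xs, ys, rng):
    """Fit on a random half of the data and report R-squared on the other half."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    half = len(idx) // 2
    train, test = idx[:half], idx[half:]
    model = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return r_squared(model, [xs[i] for i in test], [ys[i] for i in test])


def bootstrap_corrected_r2(xs, ys, rng, reps=200):
    """Apparent R-squared minus the average bootstrap optimism."""
    n = len(xs)
    apparent = r_squared(fit_line(xs, ys), xs, ys)
    optimism = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        model = fit_line(bx, by)
        # Optimism: performance on the resample minus performance on the original data.
        optimism.append(r_squared(model, bx, by) - r_squared(model, xs, ys))
    return apparent, apparent - statistics.fmean(optimism)


# Hypothetical simulated dataset: 60 subjects, one predictor, modest signal.
rng = random.Random(2017)
xs = [rng.gauss(0, 1) for _ in range(60)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

split_estimates = [split_sample_r2(xs, ys, rng) for _ in range(100)]
apparent, corrected = bootstrap_corrected_r2(xs, ys, rng)

print("split-sample R2 range:", min(split_estimates), max(split_estimates))
print("apparent R2:", apparent, "bootstrap-corrected R2:", corrected)
```

The spread of the split-sample estimates shows how much the answer depends on which split happened to be drawn, while the bootstrap estimate uses all 60 subjects for both fitting and validation. The particular numbers depend entirely on the simulated data.<br /><br />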
Data splitting only has an advantage when the test sample is held by another researcher to ensure that the validation is unbiased.<br /><br /><h3>Update 2017-01-25</h3>Many investigators have been told that they must do an "external" validation, and they split the data by time or geographical location. They are sometimes surprised that the model developed in one country or time does not validate in another. They should not be; this is an indirect way of saying there are time or country effects. Far better would be to learn about and estimate time and location effects by including them in a unified model. Then perform rigorous internal validation using the bootstrap, accounting for time and location all along the way. The end result is a model that is useful for prediction at times and locations that were at least somewhat represented in the original dataset, but without assuming that time and location effects are nil.<br /><br /></div><div><br /></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com6tag:blogger.com,1999:blog-5376139322696503442.post-9469918730797607892017-01-18T18:33:00.000-06:002017-03-04T12:17:07.654-06:00Fundamental Principles of StatisticsThere are many principles involved in the theory and practice of statistics, but here are the ones that guide my practice the most.<br /><ol><li>Use methods grounded in theory or extensive simulation</li><li>Understand uncertainty</li><li>Design experiments to maximize information</li><li>Understand the measurements you are analyzing and don't hesitate to question how the underlying information was captured</li><li>Be more interested in questions than in null hypotheses, and be more interested in estimation than in answering narrow questions</li><li>Use all information in data during analysis</li><li>Use discovery and estimation procedures not likely to claim that noise is signal</li><li>Strive for optimal quantification of evidence about effects</li><li>Give decision makers the 
inputs (<i>other</i> than the utility function) that optimize decisions</li><li>Present information in ways that are intuitive, maximize information content, and are correctly perceived</li><li>Give the client what she needs, not what she wants</li><li>Teach the client to want what she needs</li></ol><div><i><span style="font-size: x-small;"><br /></span></i><i><span style="font-size: x-small;">... the statistician must be instinctively and primarily a logician and a scientist in the broader sense, and only secondarily a user of the specialized statistical techniques.</span></i><br /><i><span style="font-size: x-small;"><br /></span></i><i><span style="font-size: x-small;">In considering the refinements and modifications of the scientific method which particularly apply to the work of the statistician, the first point to be emphasized is that the statistician is always dealing with probabilities and degrees of uncertainty. He is, in effect, a Sherlock Holmes of figures, who must work mainly, or wholly, from circumstantial evidence.</span></i><br /><br /><div style="text-align: right;"><span style="font-size: x-small;">Malcolm C Rorty: Statistics and the Scientific Method. JASA 26:1-10, 1931.</span></div></div><div><span style="font-size: x-small;"><br /></span></div><br /><div><br /></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com0tag:blogger.com,1999:blog-5376139322696503442.post-12827695989229990502017-01-16T10:52:00.002-06:002017-01-24T07:28:15.871-06:00Ideas for Future Articles<div class="MsoNormal" style="background-color: white; margin: 0cm 0cm 10pt;"></div><div>Suggestions for future articles are welcomed as comments to this entry. Some topics I intend to write about are listed below.</div><ol><li><span style="color: #222222; font-family: "calibri";">The litany of problems with p-values - catalog of all the problems I can think of</span></li><li><span style="color: #222222; font-family: "calibri";">Matching vs. 
covariate adjustment (see below from Arne Warnke)</span></li><li><span style="color: #222222; font-family: "calibri";">Statistical strategy for propensity score modeling and usage</span></li><li><span style="color: #222222; font-family: "calibri";">Analysis of change: why so many things go wrong</span></li><li><span style="color: #222222; font-family: "calibri";">What exactly is a type I error and should we care? (analogy: worrying about the chance of a false positive diagnostic test vs. computing current probability of disease given whatever the test result was). Alternate title: Why Clinicians' Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error.</span></li><li><span style="color: #222222; font-family: "calibri";">Forward vs. backwards probabilities and why forward probabilities serve as their own error probabilities (we have been fed backwards probabilities such as p-values, sensitivity, and specificity for so long it's hard to look forward)</span></li><li><span style="color: #222222; font-family: "calibri";">What is the full meaning of a posterior probability?</span></li><li><span style="color: #222222; font-family: "calibri";">Posterior probabilities can be computed as often as desired</span></li><li><span style="color: #222222; font-family: "calibri";">Statistical critiques of published articles in the biomedical literature</span></li><li><span style="color: #222222; font-family: "calibri";">New dynamic graphics capabilities using R plotly in the R Hmisc package: Showing more by initially showing less</span></li><li><span style="color: #222222; font-family: "calibri";">Moving from pdf to html for statistical reporting</span></li><li><span style="color: #222222; font-family: "calibri";">Is machine learning statistics or computer science?</span></li><li><span style="color: #222222; font-family: "calibri";">Sample size calculation: Is it voodoo?</span></li><li><span style="color: #222222; 
font-family: "calibri";">Difference between Bayesian modeling and frequentist inference</span></li><li><span style="color: #222222; font-family: "calibri";">Proper accuracy scoring rules and why improper scores such as proportion "classified" "correctly" give misleading results.</span></li></ol><br /><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-size: small;"><span style="font-family: "calibri";"></span></span></div><a name='more'></a><span style="font-size: small;"><span style="font-family: "calibri";"><br /></span></span><br /><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-size: small;"><span style="font-family: "calibri";">A few weeks ago we had a small discussion at CrossValidated about the pros and cons of matching.<u></u><u></u></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-size: small;"><span style="font-family: "calibri";"><a data-saferedirecturl="https://www.google.com/url?hl=en&q=http://stats.stackexchange.com/questions/248676/analysis-strategy-for-rare-outcome-with-matching&source=gmail&ust=1484667382479000&usg=AFQjCNGeqIzlveltVWYDrMAJ7NOTMDrWjw" href="http://stats.stackexchange.com/questions/248676/analysis-strategy-for-rare-outcome-with-matching" style="color: #1155cc;" target="_blank">http://stats.stackexchange.<wbr></wbr>com/questions/248676/analysis-<wbr></wbr>strategy-for-rare-outcome-<wbr></wbr>with-matching</a></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-size: small;"><span style="font-family: "calibri";">I am sorry that I did not have enough time to elaborate 
further on the support of matching procedures (in my field researchers do not focus much on a bias-variance tradeoff but they prioritize minimizing biases. For that reason, they like matching procedures).<u></u><u></u></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-family: "calibri";"><span style="font-size: small;">Now, I have seen that you started a blog recently (congratulations!). </span></span><span style="font-family: "calibri";"><span style="font-size: small;">I would like to encourage you to take up the topic of matching because it is probably interesting for many applied researchers.<br />I think in your ‘philosophy’, this would belong to the point “Preserve all the information in the data”.<u></u><u></u></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-size: small;"><span style="font-family: "calibri";">Here, perhaps some input for a blog post. 
Back then, you wrote:<u></u><u></u></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span class="m_-6858845881952282685comment-copy"><i><span style="font-size: small;"><span style="font-family: "calibri";">Matching on continuous variables results in an incomplete adjustment because the variables have to be binned.<u></u><u></u></span></span></i></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span class="m_-6858845881952282685comment-copy"><span style="font-size: small;"><span style="font-family: "calibri";">What about propensity score matching?<u></u><u></u></span></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span class="m_-6858845881952282685comment-copy"><i><span style="font-size: small;"><span style="font-family: "calibri";">Matching throws away good data from observations that would be good matches.<u></u><u></u></span></span></i></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span class="m_-6858845881952282685comment-copy"><span style="font-size: small;"><span style="font-family: "calibri";">I agree<u></u><u></u></span></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span class="m_-6858845881952282685comment-copy"><i><span style="font-size: small;"><span style="font-family: "calibri";">Extrapolation bias is only a significant problem if there is a covariate by group interaction, and users of matching methods ignore interactions anyway.<u></u><u></u></span></span></i></span></div><div class="MsoNormal" 
style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span class="m_-6858845881952282685comment-copy"><span style="font-size: small;"><span style="font-family: "calibri";">Here, you go too far (in my view). You can add interactions, again for example with propensity score matching. Imbens and Rubin (2015) suggest a procedure using quadratic and interaction terms of the covariates.<u></u><u></u></span></span></span><br /><br /> Comment: Nice to know this exists but I've never seen a paper that used matching attempt to explore interactions.</div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-size: small;"><span style="font-family: "calibri";"><span class="m_-6858845881952282685comment-copy"><i>If you don't want to make regression assumptions that are unverifiable, remove observations outside the overlap region just as with matching.</i></span><i><u></u><u></u></i></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "segoe ui"; margin: 0cm 0cm 10pt;"><div style="font-size: 13.3333px;"><span style="font-size: small;"><span style="font-family: "calibri";">Which assumptions do you refer to? I think that treating everyone the same (statistically) is also an unverifiable assumption (do you disagree?). What is your opinion about weighted least squares?<u></u><u></u></span></span></div><div style="font-size: 13.3333px;"><span style="font-size: small;"><span style="font-family: "calibri";"><br /></span></span></div><span style="font-size: x-small;"><span style="font-family: "calibri";"><span style="font-size: 13.3333px;"> </span><span style="font-size: x-small;">Comment: This is the no-interaction assumption. If you assume additivity then it's more OK to have a no-overlap region, otherwise throw-away non-overlap regions and do a conditional analysis. 
Not clear on the need for weighting here. In general I like conditioning over weighting.</span></span></span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-family: "calibri"; font-size: small;">Arne Jonas Warnke</span></div><div class="MsoNormal" style="background-color: white; color: #222222; font-family: "Segoe UI"; font-size: 13.3333px; margin: 0cm 0cm 10pt;"><span style="font-family: "calibri"; font-size: small;">Labour Markets, Human Resources and Social Policy<br />Internet: <a data-saferedirecturl="https://www.google.com/url?hl=en&q=http://www.zew.de&source=gmail&ust=1484667382500000&usg=AFQjCNEWgTCixFOJMtuAG_jF2Z6OHrpGmA" href="http://www.zew.de/" style="color: #1155cc;" target="_blank">www.zew.de</a> <a data-saferedirecturl="https://www.google.com/url?hl=en&q=http://www.zew.eu&source=gmail&ust=1484667382500000&usg=AFQjCNGpB0ievH9X1dB4ZfmY9XFeqhsA-A" href="http://www.zew.eu/" style="color: #1155cc;" target="_blank">www.zew.eu</a></span></div>Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com37tag:blogger.com,1999:blog-5376139322696503442.post-40024397928327588772017-01-15T11:51:00.002-06:002017-07-09T14:35:40.206-05:00Classification vs. Prediction<div style="text-align:center"><span style="font-size: 80%;"><em>Classification combines prediction and decision making and usurps the decision maker in specifying costs of wrong decisions. The classification rule must be reformulated if costs/utilities change. Predictions are separate from decisions and can be used by any decision maker. </em></span></div><p>The field of machine learning arose somewhat independently of the field of statistics. As a result, machine learning experts tend not to emphasize probabilistic thinking. Probabilistic thinking and understanding uncertainty and variation are hallmarks of statistics. 
By the way, one of the best books about probabilistic thinking is Nate Silver's <i>The Signal and The Noise: Why So Many Predictions Fail But Some Don't</i>. In the medical field, a classic paper is David Spiegelhalter's <a href="http://www.citeulike.org/user/harrelfe/article/13264888">Probabilistic Prediction in Patient Management and Clinical Trials</a>.<br /><br />By not thinking probabilistically, machine learning advocates frequently utilize classifiers instead of using risk prediction models. The situation has gotten acute: many machine learning experts actually label logistic regression as a classification method (it is not). It is important to think about what classification really implies. Classification is in effect a decision. Optimum decisions require making full use of available data, developing predictions, and applying a loss/utility/cost function to make a decision that, for example, minimizes expected loss or maximizes expected utility. Different end users have different utility functions. In risk assessment this leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function and that the utility function implied by the classification system is<i> that</i> utility function.<br /><br />Classification is a <i>forced choice</i>. In marketing where the advertising budget is fixed, analysts generally know better than to try to classify a potential customer as someone to ignore or someone to spend resources on. They do this by modeling probabilities and creating a <i>lift curve</i>, whereby potential customers are sorted in decreasing order of estimated probability of purchasing a product. To get the "biggest bang for the buck", the marketer who can afford to advertise to n persons picks the n highest-probability customers as targets. 
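<br /><br />As a rough illustration of the lift-curve idea, here is a minimal Python sketch. The customer names and estimated probabilities are invented, and the helper names <tt>pick_targets</tt> and <tt>cumulative_expected_purchases</tt> are hypothetical; in practice the probabilities would come from a fitted risk model.

```python
# Hypothetical customers with estimated purchase probabilities from some
# previously fitted risk model (all numbers are invented for illustration).
estimated_prob = {
    "ann": 0.05, "bob": 0.40, "cho": 0.22,
    "dee": 0.71, "eli": 0.12, "fay": 0.33,
}


def pick_targets(probs, n):
    """Return the n customers with the highest estimated probability."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:n]


def cumulative_expected_purchases(probs):
    """Lift-curve ordinates: expected purchases among the top k, k = 1..N."""
    ranked = sorted(probs.values(), reverse=True)
    total, curve = 0.0, []
    for p in ranked:
        total += p
        curve.append(total)
    return curve


# A marketer who can afford to advertise to 3 persons.
targets = pick_targets(estimated_prob, 3)
curve = cumulative_expected_purchases(estimated_prob)
print(targets)
print([round(c, 2) for c in curve])
```

Nothing in the sketch assigns a customer to a class; the ranking and the budget do all the work.<br /><br />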
This is rational, and classification is not needed here.<br /><br />A frequent argument from data users, e.g., physicians, is that ultimately they need to make a binary decision, so binary classification is needed. This is simply not true. First of all, it is often the case that the best decision is "no decision; get more data" when the probability of disease is in the middle. In many other cases, the decision is revocable, e.g., the physician starts the patient on a drug at a lower dose and decides later whether to change the dose or the medication. In surgical therapy the decision to operate is irrevocable, but the choice of <i>when</i> to operate is up to the surgeon and the patient and depends on severity of disease and symptoms. At any rate, if binary classification is needed, it must be done <b>at the point of care when all utilities are known</b>, not in a data analysis.<br /><br />When are forced choices appropriate? I think that one needs to consider whether the problem is mechanistic or stochastic/probabilistic. Machine learning advocates often want to apply methods made for the former to problems where biologic variation, sampling variability, and measurement errors exist. It may be best instead to apply classification techniques just to high signal:noise ratio situations, such as those in which there is a known gold standard and one can replicate the experiment and get almost the same result each time. An example is pattern recognition - visual, sound, chemical composition, etc. If one creates an optical character recognition algorithm, the algorithm can be trained by exposing it to any number of replicates of attempts to classify an image as the letters A, B, ... The user of such a classifier may not have time to consider whether any of the classifications were "close calls." And the signal:noise ratio is extremely high. 
In addition, there is a single "right" answer for each character.<br /><br />When close calls are possible, probability estimates are called for. One beauty of probabilities is that they are their own error measures. If the probability of disease is 0.1 and the current decision is not to treat the patient, the probability of this being an error is by definition 0.1. A probability of 0.4 may lead the physician to run another lab test or do a biopsy. When the signal:noise ratio is small, classification is usually not a good goal; there one must model <i>tendencies</i>, i.e., probabilities.<br /><br />The U.S. Weather Service has always phrased rain forecasts as probabilities. I do not want a classification of "it will rain today." There is a slight loss/disutility of carrying an umbrella, and I want to be the one to make the tradeoff.<br /><br />Whether engaging in credit risk scoring, weather forecasting, climate forecasting, marketing, diagnosing a patient's disease, or estimating a patient's prognosis, I do not want to use a classification method. I want risk estimates with credible intervals or confidence intervals. My opinion is that machine learning classifiers are best used in mechanistic high signal:noise ratio situations, and that probability models should be used in most other situations.<br /><br />This is related to a subtle point that has been lost on many analysts. Complex machine learning algorithms, which allow for complexities such as high-order interactions, require an <a href="http://www.citeulike.org/user/harrelfe/article/13467382">enormous amount of data</a> unless the signal:noise ratio is high, another reason for reserving some machine learning techniques for such situations. Regression models which capitalize on additivity assumptions (when they are true, and this is approximately the case much of the time) can yield accurate probability models without having massive datasets. 
And when the outcome variable being predicted has more than two levels, a single regression model fit can be used to obtain all kinds of interesting quantities, e.g., predicted mean, quantiles, exceedance probabilities, and instantaneous hazard rates.<br /><br />A special problem with classifiers illustrates an important issue. Users of machine classifiers know that a highly imbalanced sample with regard to a binary outcome variable Y results in a strange classifier. For example, if the sample has 1000 diseased patients and 1,000,000 non-diseased patients, the best classifier may classify everyone as non-diseased; you will be correct 0.999 of the time. For this reason the odd practice of subsampling the controls is used in an attempt to balance the frequencies and get some variation that will lead to sensible looking classifiers (users of regression models would never exclude good data to get an answer). Then they have to, in some ill-defined way, construct the classifier to make up for biasing the sample. It is simply the case that a classifier trained to a 1/1000 prevalence situation will not be applicable to a population with a vastly different prevalence. The classifier would have to be re-trained on the new sample, and the patterns detected may change greatly. Logistic regression on the other hand elegantly handles this situation by either (1) having as predictors the variables that made the prevalence so low, or (2) recalibrating the intercept (only) for another dataset with much higher prevalence. Classifiers' extreme dependence on prevalence may be enough to make some researchers always use probability estimators instead.<br /><br />One of the key elements in choosing a method is having a sensitive accuracy scoring rule with the correct statistical properties. 
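<br /><br />To hint at why this matters, the following Python sketch compares proportion classified correctly with the Brier score (the mean squared difference between the predicted probability and the 0/1 outcome), which is one proper scoring rule. Choosing the Brier score is an assumption made for illustration, and the data and function names are invented.

```python
def proportion_correct(probs, outcomes, cutoff=0.5):
    """Improper score: fraction 'classified' 'correctly' at a fixed cutoff."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)


def brier_score(probs, outcomes):
    """Proper score: mean squared error of the predicted probabilities."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


# Invented outcomes (1 = event) and two hypothetical sets of predictions.
outcomes = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
sharp = [0.90, 0.80, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
vague = [0.55, 0.51, 0.49, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45, 0.45]

acc_sharp = proportion_correct(sharp, outcomes)
acc_vague = proportion_correct(vague, outcomes)
brier_sharp = brier_score(sharp, outcomes)
brier_vague = brier_score(vague, outcomes)
print(acc_sharp, acc_vague)        # identical for the two sets of predictions
print(brier_sharp, brier_vague)    # lower is better
```

Both sets of predictions land on the right side of the 0.5 cutoff every time, so proportion correct cannot tell them apart, while the Brier score favors the predictions that are closer to what happened.<br /><br />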
Experts in machine classification seldom have the background to understand this enormously important issue, and choosing an improper accuracy score such as proportion classified correctly will result in a bogus model. This will be discussed in a future blog.<br /><br />Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com21tag:blogger.com,1999:blog-5376139322696503442.post-56405130006316528232017-01-14T08:14:00.001-06:002017-08-31T08:09:54.087-05:00p-values and Type I Errors are Not the Probabilities We NeedIn trying to guard against false conclusions, researchers often attempt to minimize the risk of a "false positive" conclusion. In the field of assessing the efficacy of medical and behavioral treatments for improving subjects' outcomes, falsely concluding that a treatment is effective when it is not is an important consideration. Nowhere is this more important than in the drug and medical device regulatory environments, because a treatment thought not to work can be given a second chance as better data arrive, but a treatment judged to be effective may be approved for marketing, and if later data show that the treatment was actually not effective (or was only trivially effective) it is difficult to remove the treatment from the market if it is safe. The probability of a treatment not being effective is the probability of "regulator's regret." One must be very clear on what is conditioned upon (assumed) in computing this probability. Does one condition on the true effectiveness or does one condition on the available data? Type I error conditions on the treatment having no effect and does not entertain the possibility that the treatment actually worsens the patients' outcomes. Can one quantify evidence for making a wrong decision if one assumes that all conclusions of non-zero effect are wrong up front because H<sub>0</sub> was assumed to be true? 
Aren't useful error probabilities the ones that are not based on assumptions about what we are assessing but rather just on the data available to us?<br /><br />Statisticians have convinced regulators that long-run operating characteristics of a testing procedure should rule the day, e.g., if we did 1000 clinical trials where efficacy was always zero, we want no more than 50 of these trials to be judged as "positive." Never mind that this type I error operating characteristic does not refer to making a correct judgment for the clinical trial at hand. Still, there is a belief that type I error is the probability of regulator's regret (a false positive), i.e., that the treatment is not effective when the data indicate it is. In fact, clinical trialists have been sold a bill of goods by statisticians. No probability derived from an assumption that the treatment has zero effect can provide evidence about that effect. Nor does it measure the chance of the error actually in question. All probabilities are conditional on <i>something</i>, and to be useful they must condition on the <i>right thing</i>. This usually means that what is conditioned upon must be knowable.<br /><br />The probability of regulator's regret is the probability that a treatment doesn't work given the data. So the probability we really seek is the probability that the treatment has no effect <i>or that it has a backwards effect</i>. This is precisely one minus the Bayesian posterior probability of efficacy.<br /><br />In reality, there is unlikely to exist a treatment that has exactly zero effect. As <a href="http://www.citeulike.org/user/harrelfe/article/10529649">Tukey argued in 1991</a>, the effects of treatments A and B are always different, to some decimal place. 
So the null hypothesis is always false and the type I error could be said to be always zero.<br /><br />The best paper I've read about the many ways in which p-values are misinterpreted is <a href="http://www.citeulike.org/user/harrelfe/article/14042559">Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations</a> written by a group of renowned statisticians. One of my favorite quotes from this paper is<br /><br /><blockquote class="tr_bq">Thus to claim that the null P value is the probability that chance alone produced the observed association is completely backwards: The P value is a probability computed assuming chance was operating alone. The absurdity of the common backwards interpretation might be appreciated by pondering how the P value, which is a probability deduced from a set of assumptions (the statistical model), can possibly refer to the probability of those assumptions.</blockquote>In 2016 the American Statistical Association took a <a href="https://matloff.wordpress.com/2016/03/07/after-150-years-the-asa-says-no-to-p-values">stand</a> against over-reliance on p-values. This would have made a massive impact on all branches of science had it been issued 50 years ago but better late than never.<br /><br /><h3>Update 2017-01-19</h3>Though believed to be true by many non-statisticians, p-values are not the probability that H<sub>0</sub> is true, and to turn them into such probabilities requires Bayes' rule. If you are going to use Bayes' rule you might as well formulate the problem as a full Bayesian model. This has many benefits, not the least of them being that you can select an appropriate prior distribution and you will get exact inference. 
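<br /><br />A minimal sketch of such a model, under strong simplifying assumptions that are not part of any particular trial: the treatment effect estimate is normally distributed with a known standard error, and the prior for the true effect is a continuous normal distribution centered at zero. The function names and all numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def posterior_prob_positive(estimate, se, prior_sd):
    """P(effect > 0 | data) for a normal likelihood and a N(0, prior_sd^2) prior."""
    shrink = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)
    post_mean = shrink * estimate          # estimate shrunk toward the prior mean of 0
    post_sd = math.sqrt(shrink) * se       # posterior variance is shrink * se^2
    return normal_cdf(post_mean / post_sd)


def one_sided_p_value(estimate, se):
    """P(an estimate at least this large | the effect is exactly zero)."""
    return 1.0 - normal_cdf(estimate / se)


estimate, se, prior_sd = 2.0, 1.0, 2.0     # invented numbers
prob_efficacy = posterior_prob_positive(estimate, se, prior_sd)
prob_regret = 1.0 - prob_efficacy          # P(no effect or harm | data, prior)
p_value = one_sided_p_value(estimate, se)
print(prob_efficacy, prob_regret, p_value)
```

The two quantities answer different questions: <tt>p_value</tt> conditions on a zero effect, whereas <tt>prob_regret</tt> conditions on the data and the prior.<br /><br />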
Attempts by several authors to convert p-values to probabilities of interest (just as sensitivity and specificity are converted to probability of disease once one knows the prevalence of disease) have taken the prior to be discontinuous, putting a high probability on H<sub>0</sub> being exactly true. In my view it is much more sensible to believe that there is no discontinuity in the prior at the point represented by H<sub>0</sub>, encapsulating prior knowledge instead by saying that values near H<sub>0</sub> are more likely if no relevant prior information is available.<br /><br />Returning to the non-relevance of type I error as discussed above, and ignoring for the moment that long-run operating characteristics do not directly assist us in making judgments about the current experiment, there is a subtle problem that leads researchers to believe that by controlling type I "error" they have quantified the probability of misleading evidence. As discussed at length by my colleague <a href="http://www.citeulike.org/user/harrelfe/author/Blume">Jeffrey Blume</a>, once an experiment is done the probability that positive evidence is misleading is <b>not</b> type I error. And what exactly does "error" mean in "type I error?" It is the probability of rejecting H<sub>0</sub> when H<sub>0</sub> is exactly true, just as the p-value is the probability of obtaining data more impressive than that observed given H<sub>0</sub> is true. Are these really error probabilities? Perhaps ... if you have been misled earlier into believing that we should base conclusions on how unlikely the observed data would have been under H<sub>0</sub>. Part of the problem is in the loaded word "reject." 
Rejecting H<sub>0</sub> by seeing data that are unlikely if H<sub>0</sub> is true is perhaps the real error.<br /><br />The "error quantification" truly needed is the probability that a treatment doesn't work given all the current evidence, which as stated above is simply one minus the Bayesian posterior probability of positive efficacy.<br /><br /><h3>Update 2017-01-20</h3>Type I error control is an indirect way of being careful about claims of effects. It should never have been the preferred method for achieving that goal. Seen another way, we would choose type I error as the quantity to be controlled if we wanted to:<br /><br /><ul><li>require the experimenter to visualize an infinite number of experiments that might have been run, and assume that the current experiment could be exactly replicated</li><li>be interested in long-run operating characteristics vs. judgments needing to be made for the one experiment at hand</li><li>be interested in the probability that other replications result in data more extreme than mine if there is no treatment effect</li><li>require early looks at the data to be discounted for future looks</li><li>require past looks at the data to be discounted for earlier inconsequential looks</li><li>create other multiplicity considerations, all of them arising from the chances you give data to be extreme as opposed to the chances that you give effects to be positive</li><li>data can be more extreme for a variety of reasons such as trying to learn faster by looking more often or trying to learn more by comparing more doses or more drugs</li></ul><div>The Bayesian approach focuses on the chances you give effects to be positive and does not have multiplicity issues (potential issues such as examining treatment effects in multiple subgroups are handled by the shrinkage that automatically results when you use the 'right' Bayesian hierarchical model).</div><div><br /></div><div>The p-value is the chance that someone else would observe data more 
extreme than mine if the effect is truly zero (if they could exactly replicate my experiment) and not the probability of no (or a negative) effect of treatment given my data.</div><div><br /></div><br /><br /><h3>Update 2017-05-10</h3>As discussed in Gamalo-Siebers et al. (DOI: 10.1002/pst.1807), the type I error is the probability of making an assertion of an effect when no such effect exists. It is <b>not</b> the probability of regret for a decision maker, e.g., it is not the probability of a drug regulator's regret. The probability of regret is the probability that the drug doesn't work or is harmful when the decision maker had decided it was helpful. It is the probability of harm or no benefit when an assertion of benefit is made. This is best thought of as the probability of harm or no benefit given the data, which is one minus the probability of efficacy. Prob(assertion|no benefit) is not equal to 1-Prob(benefit|data). Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com11tag:blogger.com,1999:blog-5376139322696503442.post-58513343866216728782017-01-14T07:15:00.001-06:002017-01-15T20:25:45.952-06:00Null Hypothesis Significance Testing Never WorkedMuch has been written about problems with our most-used statistical paradigm: frequentist null hypothesis significance testing (NHST), p-values, type I and type II errors, and confidence intervals. Rejection of straw-man null hypotheses leads researchers to believe that their theories are supported, and the unquestioning use of a threshold such as p<0.05 has resulted in hypothesis substitution, search for subgroups, and other gaming that has badly damaged science. But we seldom examine whether the original idea of NHST actually delivered on its goal of making good decisions about effects, given the data.<br /><br />NHST is based on something akin to proof by contradiction. 
The best non-mathematical definition of the p-value I've ever seen is due to <a href="http://www.citeulike.org/user/harrelfe/article/14166520">Nicholas Maxwell</a>: "the degree to which the data are embarrassed by the null hypothesis." p-values provide evidence against something, never in favor of something, and are the basis for NHST. But proof by contradiction is only fully valid in the context of rules of logic where assertions are true or false without any uncertainty. The classic paper <a href="http://www.citeulike.org/user/harrelfe/article/10529649">The Earth is Round (p<.05)</a> by Jacob Cohen has a beautiful example pointing out the fallacy of combining probabilistic ideas with proof by contradiction in an attempt to make decisions about an effect.<br /><blockquote class="tr_bq"><br />The following is almost but not quite the reasoning of null hypothesis rejection: </blockquote><blockquote class="tr_bq">If the null hypothesis is correct, then this datum (D) can not occur.<br />It has, however, occurred.<br />Therefore the null hypothesis is false. </blockquote><blockquote class="tr_bq">If this were the reasoning of H<sub>0</sub> testing, then it would be formally correct. … But this is not the reasoning of NHST. Instead, it makes this reasoning probabilistic, as follows: </blockquote><blockquote class="tr_bq">If the null hypothesis is correct, then these data are highly unlikely.<br />These data have occurred.<br />Therefore, the null hypothesis is highly unlikely. </blockquote><blockquote class="tr_bq">By making it probabilistic, it becomes invalid. … the syllogism becomes formally incorrect and leads to a conclusion that is not sensible: </blockquote><blockquote class="tr_bq">If a person is an American, then he is probably not a member of Congress. (TRUE, RIGHT?)<br />This person is a member of Congress.<br />Therefore, he is probably not an American. 
(Pollard & Richardson, 1987) </blockquote><blockquote class="tr_bq">… The illusion of attaining improbability or the Bayesian Id's wishful thinking error …<br /> </blockquote><blockquote class="tr_bq">Induction has long been a problem in the philosophy of science. Meehl (1990) attributed to the distinguished philosopher Morris Raphael Cohen the saying "All logic texts are divided into two parts. In the first part, on deductive logic, the fallacies are explained; in the second part, on inductive logic, they are committed."</blockquote>Sometimes when an approach leads to numerous problems, the approach itself is OK and the problems can be repaired. But besides all the other problems caused by NHST (including need for arbitrary multiplicity adjustments, need for consideration of investigator intentions and not just her actions, rejecting H<sub>0</sub> for trivial effects, incentivizing gaming, interpretation difficulties, etc.) it may be the case that the overall approach is defective and should not have been adopted.<br /><br />With all of the amazing things Ronald Fisher gave us, and even though he recommended against the unthinking rejection of H<sub>0</sub>, his frequentist approach and dislike of the Bayesian approach did us all a disservice. He called the Bayesian method invalid and was possibly intellectually dishonest when he labeled it as "inverse probability." 
In fact the p-value is an indirect inverse probability and Bayesian posterior probabilities are direct forwards probabilities that do not condition on a hypothesis, and the Bayesian approach has not only been shown to be valid, but it actually delivers on its promise.<br /><br />Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com21tag:blogger.com,1999:blog-5376139322696503442.post-75934167279084927472017-01-13T07:49:00.001-06:002017-09-10T09:51:18.604-05:00IntroductionStatistics is a field that is a science unto itself and that benefits all other fields and everyday life. What is unique about statistics is its proven tools for decision making in the face of uncertainty, understanding sources of variation and bias, and most importantly, <i>statistical thinking</i>. Statistical thinking is a different way of thinking that is part detective work and part skepticism, and that involves alternate takes on a problem. An excellent example of statistical thinking is statistician Abraham Wald's <a href="https://en.wikipedia.org/wiki/Abraham_Wald">analysis</a> of British bombers surviving to return to their base in World War II: his conclusion was to reinforce bombers in areas in which no damage was observed. 
For other great examples watch my colleague Chris Fonnesbeck's <a href="https://www.youtube.com/watch?v=TGGGDpb04Yc">Statistical Thinking for Data Science</a>.<br /><br />Some of my personal philosophy of statistics can be summed up in the list below:<br /><ul style="background-color: white; font-family: arial, verdana, sans-serif; font-size: 14px;"><li style="background-color: transparent;">Statistics needs to be fully integrated into research; experimental design is all important</li><li style="background-color: transparent;">Don't be afraid of using modern methods</li><li style="background-color: transparent;">Preserve all the information in the data; Avoid categorizing continuous variables and predicted values at all costs</li><li style="background-color: transparent;">Don't assume that anything operates linearly</li><li style="background-color: transparent;">Account for model uncertainty and avoid it when possible by using subject matter knowledge</li><li style="background-color: transparent;">Use the bootstrap routinely</li><li style="background-color: transparent;">Make the sample size a <a href="https://stats.stackexchange.com/questions/256623">random variable</a> when possible</li><li style="background-color: transparent;">Use Bayesian methods whenever possible</li><li style="background-color: transparent;">Use excellent graphics, liberally</li><li style="background-color: transparent;">To be trustworthy research must be reproducible</li><li style="background-color: transparent;">All data manipulation and statistical analysis <b>must</b> be reproducible (one ramification being that I advise against the use of point and click software in most cases)</li></ul><div><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 14px;">Statistics has multiple challenges today, which I break down into three major sources:</span></span></div><div><ol><li><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 
14px;">Statistics has been and continues to be taught in a traditional way, leading to statisticians believing that our historical approach to estimation, prediction, and inference was good enough.</span></span></li><li><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 14px;">Statisticians do not receive sufficient training in computer science and computational methods, too often leaving those areas to others who get so good at dealing with vast quantities of data that they assume they can be self-sufficient in statistical analysis and not seek involvement of statisticians. Many persons who analyze data do not have sufficient training in statistics.</span></span></li><li><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 14px;">Subject matter experts (e.g., clinical researchers and epidemiologists) try to avoid statistical complexity by "dumbing down" the problem using dichotomization, and statisticians, always trying to be helpful, fail to argue the case that dichotomization of continuous or ordinal variables is almost never an appropriate way to view or analyze data. 
Statisticians in general do not sufficiently involve themselves in measurement issues.</span></span></li></ol><div><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 14px;">I will be discussing several of these issues in future blog posts, especially item 1 above and items 2 and 4 below.</span></span><br /><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 14px;"><br /></span></span><span style="font-family: "arial" , "verdana" , sans-serif;"><span style="font-size: 14px;">Complacency in the field of statistics and in statistical education has resulted in </span></span></div><div><ol><li>reliance on large-sample theory so that inaccurate normal distribution-based tools can be used, as opposed to tailoring the analyses to data characteristics using the bootstrap and semiparametric models</li><li>belief that null hypothesis significance testing ever answered the scientific question and that p-values are useful</li><li>avoidance of the likelihood school of inference (relative likelihood, likelihood support intervals, likelihood ratios, etc.)</li><li>avoidance of Bayesian methods (posterior distributions, credible intervals, predictive distributions, etc.)</li></ol><div><br /></div></div></div><div><br /></div>I was interviewed by Kevin Gray in July 2017; more of my opinions about statistics may be <a href="http://www.greenbookblog.org/2017/08/02/vital-statistics-you-never-learned-because-theyre-never-taught/">found</a> in that interview. Frank Harrellhttp://www.blogger.com/profile/15263496257600444093noreply@blogger.com5