Posts on Statistical Thinking
http://fharrell.com/post/
Recent content in Posts on Statistical Thinking
Hugo -- gohugo.io
en-us
© 2018
Sun, 01 Jan 2017 00:00:00 +0000

Why I Don't Like Percents
http://fharrell.com/post/percent/
Fri, 19 Jan 2018 00:00:00 +0000
http://fharrell.com/post/percent/
<p>The numbers zero and one are special; zero because it is a minimum or center point for many measurements and because it is the additive identity (x + 0 = x), and one because it is the multiplicative identity (x × 1 = x) and corresponds to units of measurement. Many important quantities are between 0 and 1, including proportions of a whole and probabilities. One hundred is not special in the same sense as unity, so percent (per 100) doesn’t do anything for me (why not per thousand?).
<style>
img {
  height: auto;
  max-width: 70px;
  margin-left: auto;
  margin-right: auto;
  display: block;
}
</style></p>
<p>When a quantity doubles, it gets back to its original value by halving. When it increases by 100% it gets back to its original value by decreasing 50%. Case almost closed. Whereas an increase of 33.33% is balanced by a decrease of 25%, an increase by a factor of <sup>4</sup>⁄<sub>3</sub> is balanced by a decrease by a factor of <sup>3</sup>⁄<sub>4</sub>. If you put 100 dollars into an account that yields 3% interest annually, you will have 100 × 1.03<sup>10</sup> or 134 dollars after 10 years. To get back to your original value you’d have to lose 2.91% per year for 10 years.</p>
<p>I like fractions like <sup>3</sup>⁄<sub>4</sub>, or the decimal equivalent 0.75. I like ratios, because they are symmetric. Chaining together relative increases is simple with ratios. An increase by a factor of 1.5 followed by an increase by a factor of 1.4 is an increase by a factor of 1.5 * 1.4 or 2.1. A 50% increase followed by a 40% increase is an increase of 110%. To get the right answer with percent increase you have to convert back to ratios, do the multiplication, then convert back to percent.</p>
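<p>As a quick R illustration of the arithmetic above:</p>
<pre class="r"><code># Chaining relative increases with ratios: just multiply
1.5 * 1.4              # 2.1: a 50% increase followed by a 40% increase

# With percents you must convert to ratios, multiply, then convert back
p1 <- 50; p2 <- 40
((1 + p1 / 100) * (1 + p2 / 100) - 1) * 100   # 110, not 90
</code></pre>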
<p>Many numbers that we quote are probabilities, and a probability is formally a number between 0 and 1. So I don’t like “the chance of rain is 10%” but prefer “the chance of rain is 0.1 or <sup>1</sup>⁄<sub>10</sub>”. When discussing statistical analyses it is especially irksome to see statements such as “significance levels of 5% or power of 90%”. Probabilities are being discussed, so I prefer 0.05 and 0.9.</p>
<p>I have seen clinicians confused over statements such as “the chance of a stroke is 0.5%”, interpreting this as 50%. If we say “the chance of a stroke is 0.005” such confusion is less likely. And I don’t need percent signs everywhere.</p>
<p>Percent change has even more problems than percent. I have often witnessed confusion from statements such as “the chance of stroke increased by 50%”. If the base stroke probability was 0.02 does the speaker mean that it is now 0.52? Not very likely, but you can’t be sure. More likely she meant that the chance of stroke is now 0.02 + 0.5 * 0.02 = 0.03. It would always be clear to instead say one of the following:</p>
<ul>
<li>The chance of stroke went from 0.02 to 0.03</li>
<li>The chance of stroke increased by 0.01 (or the <em>absolute</em> chance of stroke increased by 0.01)</li>
<li>The chance of stroke increased by a factor of 1.5</li>
</ul>
<p>We need to achieve clarity by settling on a convention for wording fold-change decreases. If the chance of stroke decreases from 0.03 to 0.02 and we feel compelled to summarize the <em>relative</em> decrease in risk, we could say that risk of stroke decreased by a factor of 1.5. But even though it looks a bit awkward, I think it would be clearest to say the following, if 0.02 corresponded to treatment A and 0.03 corresponded to treatment B: treatment A multiplied the risk of stroke by <sup>2</sup>⁄<sub>3</sub> in comparison to treatment B. Or you could say that treatment A modified the risk of stroke by a factor of <sup>2</sup>⁄<sub>3</sub>, or that the A:B risk ratio is <sup>2</sup>⁄<sub>3</sub> or 0.667.</p>
<p>Many quantities reported in the scientific literature are naturally ratios. For example, odds ratios and hazard ratios are commonly used. If the ratio of stroke hazard rates for treatment B compared to treatment A is 0.75, I prefer to report “the B:A stroke hazard ratio was 0.75.” There’s no need to say that there was a 25% reduction in stroke hazard rate.</p>
<p>Percents have perhaps one good use. When they represent fractions and we care to present only two decimal places of accuracy, i.e., the percents you calculate are all whole numbers, percents may be OK. But I would still prefer numbers like 0.02 and 0.86, and to avoid a symbol (%) when just dealing with numbers.</p>

How Can Machine Learning be Reliable When the Sample is Adequate for Only One Feature?
http://fharrell.com/post/mlsamplesize/
Thu, 11 Jan 2018 00:00:00 +0000
http://fharrell.com/post/mlsamplesize/
<p>The ability to estimate how one continuous variable relates to another continuous variable is basic to the ability to create good predictions. Correlation coefficients are unitless, but estimating them requires similar sample sizes to estimating parameters we directly use in prediction such as slopes (regression coefficients). When the shape of the relationship between X and Y is not known to be linear, a little more sample size is needed than if we knew that linearity held so that all we had to estimate was a slope and an intercept. This will be addressed later.</p>
<p>Consider <a href="http://fharrell.com/doc/bbr.pdf#nameddest=sec:corrn">BBR Section 8.5.2</a> where it is shown that the sample size needed to estimate a correlation coefficient to within a margin of error as bad as ±0.2 with 0.95 confidence is about 100 subjects, and to achieve a better margin of error of ±0.1 requires about 400 subjects. Let’s reproduce that plot for the “hardest to estimate” case where the true correlation is 0.</p>
<style>
p.caption {
  font-size: 0.6em;
}
</style>
<pre class="r"><code>require(Hmisc)</code></pre>
<pre class="r"><code>knitrSet(lang='blogdown')</code></pre>
<pre class="r"><code>plotCorrPrecision(rho=0, n=seq(10, 1000, length=100), ylim=c(0, .4), method='none')
abline(h=seq(0, .4, by=0.025), v=seq(25, 975, by=25), col=gray(.9))</code></pre>
<div class="figure"><span id="fig:plotprec"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/plotprec1.png" alt="Margin for error (length of longer side of asymmetric 0.95 confidence interval) for r in estimating ρ, when ρ=0. Calculations are based on the Fisher z transformation of r." width="672" />
<p class="caption">
Figure 1: Margin for error (length of longer side of asymmetric 0.95 confidence interval) for r in estimating ρ, when ρ=0. Calculations are based on the Fisher z transformation of r.
</p>
</div>
<p>I have seen many papers in the biomedical research literature in which investigators “turned loose” a machine learning or deep learning algorithm with hundreds of candidate features and a sample size that by the above logic is inadequate had there been only one candidate feature. How can ML possibly learn how hundreds of predictors combine to predict an outcome when our knowledge of statistics would say this is impossible? The short answer is that it can’t. Researchers claiming to have developed a useful predictive instrument with ML in the limited sample size case seldom do a rigorous internal validation that demonstrates the relationship between predicted and observed values (i.e., the calibration curve) to be a straight 45° line through the origin. I have worked with a colleague who had previously worked with an ML group that found a predictive signal (high R<sup>2</sup>) with over 1000 candidate features and N=50 subjects. In trying to check their results on new subjects we appear to be finding an R<sup>2</sup> about 1/4 as large as originally claimed.</p>
<p><span class="citation">Ploeg, Austin, and Steyerberg (<a href="#refplo14mod">2014</a>)</span> in their article <a href="https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-14-137">Modern modelling techniques are data hungry</a> estimated that to have a very high chance of rigorously validating, many machine learning algorithms require 200 events per <em>candidate</em> feature (they found that logistic regression requires 20 events per candidate feature). So it seems that “big data” methods sometimes create the need for “big data” when traditional statistical methods may not require such huge sample sizes (at least when the dimensionality is not extremely high). [Note: in higher-dimensional situations it is possible to specify a traditional statistical model for the prespecified “important” predictors and to add in principal components and other summaries of the remaining features.] Machine learning algorithms do seem to have unique advantages in high signal:noise ratio situations such as image and sound pattern recognition problems. Medical diagnosis and outcome prediction problems involve a low signal:noise ratio, i.e., the R<sup>2</sup> are typically low and the outcome variable Y is typically measured with error.</p>
<p>I’ve shown the sample size needed to estimate a correlation coefficient with a certain precision. What about the sample size needed to estimate the whole relationship between a single continuous predictor and the probability of a binary outcome? Similar to what is presented in <a href="http://fharrell.com/doc/rms.pdf#nameddest=sec:lrmn">RMS Notes Section 10.2.3</a>, let’s simulate the average maximum (over a range of X) absolute prediction error (on the probability scale). The following R program does this, for various sample sizes. 1000 simulated datasets are analyzed for each sample size considered.</p>
<pre class="r"><code># X = universe of X values if X considered fixed, in random order
# xp = grid of x values at which to obtain and judge predictions
require(rms)</code></pre>
<pre class="r"><code>sim <- function(assume = c('linear', 'smooth'),
                X,
                ns=seq(25, 300, by=25), nsim=1000,
                xp=seq(-1.5, 1.5, length=200), sigma=1.5) {
  assume  <- match.arg(assume)
  maxerr  <- numeric(length(ns))
  pactual <- plogis(xp)
  xfixed  <- ! missing(X)
  j     <- 0
  worst <- nsim
  for(n in ns) {
    j    <- j + 1
    maxe <- 0
    if(xfixed) x <- X[1 : n]
    nsuccess <- 0
    for(k in 1 : nsim) {
      if(! xfixed) x <- rnorm(n, 0, sigma)
      P <- plogis(x)
      y <- ifelse(runif(n) <= P, 1, 0)
      f <- switch(assume,
                  linear = lrm(y ~ x),
                  smooth = lrm(y ~ rcs(x, 4)))
      if(length(f$fail) && f$fail) next
      nsuccess <- nsuccess + 1
      phat <- predict(f, data.frame(x=xp), type='fitted')
      maxe <- maxe + max(abs(phat - pactual))
    }
    maxe      <- maxe / nsuccess
    maxerr[j] <- maxe
    worst     <- min(worst, nsuccess)
  }
  if(worst < nsim) cat('For at least one sample size, could only run', worst, 'simulations\n')
  list(x=ns, y=maxerr)
}
plotsim <- function(object, xlim=range(ns), ylim=c(0.04, 0.2)) {
  ns <- object$x; maxerr <- object$y
  plot(ns, maxerr, type='l', xlab='N', xlim=xlim, ylim=ylim,
       ylab=expression(paste('Average Maximum ', abs(hat(P) - P))))
  minor.tick()
  abline(h=c(.05, .1, .15), col=gray(.85))
}
set.seed(1)
X <- rnorm(300, 0, sd=1.5)   # Allows use of same X's for both simulations
simrun <- TRUE
# If blogdown handled caching, would not need to manually cache with Load and Save
if(simrun) Load(errLinear) else {
  errLinear <- sim(assume='linear', X=X)
  Save(errLinear)
}
plotsim(errLinear)</code></pre>
<div class="figure"><span id="fig:logisticsim"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/logisticsim1.png" alt="Simulated expected maximum error in estimating probabilities for x ∈ [-1.5, 1.5] with a single normally distributed X with mean zero. The true relationship between X and P(Y=1 | X) is assumed to be logit(Y=1) = X. The logistic model fits that are repeated in the simulation assume the relationship is linear, but estimate the slope and intercept. In reality, we wouldn't know that a relationship is linear, and if we allowed it to be nonlinear there would be a bit more variance to the estimated curve, resulting in larger average absolute errors than what are shown in the figure (see below)." width="672" />
<p class="caption">
Figure 2: Simulated expected maximum error in estimating probabilities for x ∈ [-1.5, 1.5] with a single normally distributed X with mean zero. The true relationship between X and P(Y=1 | X) is assumed to be logit(Y=1) = X. The logistic model fits that are repeated in the simulation assume the relationship is linear, but estimate the slope and intercept. In reality, we wouldn’t know that a relationship is linear, and if we allowed it to be nonlinear there would be a bit more variance to the estimated curve, resulting in larger average absolute errors than what are shown in the figure (see below).
</p>
</div>
<p>But wait—the above simulation assumes that we already knew that the relationship was linear. In practice, most relationships are nonlinear but we don’t know the true transformation. Assuming the relationship between X and logit(Y=1) is smooth, we can estimate the relationship reliably with a restricted cubic spline function. Here we use 4 knots, which gives rise to the addition of two nonlinear terms to the model for a total of 3 parameters to estimate, not counting the intercept. By estimating these parameters we are estimating the smooth transformation of X, and by simulating this process repeatedly we are allowing for “transformation uncertainty”.</p>
<pre class="r"><code>set.seed(1)
if(simrun) Load(errSmooth) else {
  errSmooth <- sim(assume='smooth', X=X, ns=seq(50, 300, by=25))
  Save(errSmooth)
}
plotsim(errSmooth, xlim=c(25, 300))
lines(errLinear, col=gray(.8))</code></pre>
<div class="figure"><span id="fig:simrcs"></span>
<img src="http://fharrell.com/post/mlsamplesize_files/figurehtml/simrcs1.png" alt="Estimated mean maximum (over X) absolute errors in estimating P(Y=1) when X is not assumed to predict the logit linearly (black line). The earlier estimates when linearity was assumed are shown with a gray scale line. Restricted cubic splines could not be fitted for n=25." width="672" />
<p class="caption">
Figure 3: Estimated mean maximum (over X) absolute errors in estimating P(Y=1) when X is not assumed to predict the logit linearly (black line). The earlier estimates when linearity was assumed are shown with a gray scale line. Restricted cubic splines could not be fitted for n=25.
</p>
</div>
<p>You can see that the sample size must exceed 300 just to have sufficient reliability in estimating probabilities over the range of X of [-1.5, 1.5] when we do not know that the relationship is linear and we allow it to be nonlinear.</p>
<p>The morals of the story are</p>
<ul>
<li>Beware of claims of good predictive ability for ML algorithms when sample sizes are not huge in relationship to the number of candidate features</li>
<li>For any problem, whether using machine learning or regression, compute the sample size needed to obtain highly reliable predictions with only a single prespecified predictive feature</li>
<li>If you are not sure that relationships are simple and you allow various transformations to be attempted, uncertainty increases and so does the expected absolute prediction error</li>
<li>If your sample size is not much bigger than the above minimum, beware of doing any high-dimensional analysis unless you have very clean data and a high signal:noise ratio</li>
<li>Also remember that when Y is binary, the minimum sample size necessary just to estimate the intercept in a logistic regression model (equivalent to estimating a single proportion) is 96 (see <a href="http://fharrell.com/doc/bbr.pdf#nameddest=sec:htestpn">BBR Section 5.6.3</a>). So it is impossible with binary Y to accurately estimate P(Y=1 | X) when there are <em>any</em> candidate predictors if n < 96 (and n=96 only achieves a margin of error of ±0.1 in estimating risk).</li>
<li>When the number of candidate features is huge and the sample size is not, expect the list of “selected” features to be volatile, predictive discrimination to be overstated, and absolute predictive accuracy (calibration curve) to be very problematic</li>
<li>In general, know how many observations are required to allow you to reliably learn from the number of candidate features you have</li>
</ul>
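<p>As a quick R check of the 96 figure mentioned above (the worst case is a true proportion of 0.5):</p>
<pre class="r"><code># n needed for a ±0.1 margin of error in estimating a proportion, worst case p=0.5
round(qnorm(0.975) ^ 2 * 0.5 * 0.5 / 0.1 ^ 2)   # 96
</code></pre>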
<p>See <a href="http://fharrell.com/doc/bbr.pdf#nameddest=chap:hdata">BBR Chapter 20</a> for an approach to estimating the sample size needed for a given number of candidate predictors.</p>
<div id="references" class="section level1 unnumbered">
<h1>References</h1>
<div id="refs" class="references">
<div id="refplo14mod">
<p>Ploeg, Tjeerd van der, Peter C. Austin, and Ewout W. Steyerberg. 2014. “Modern modelling techniques are data hungry: a simulation study for predicting dichotomous endpoints.” <em>BMC Medical Research Methodology</em> 14 (1). BioMed Central Ltd: 137+. <a href="http://dx.doi.org/10.1186/1471-2288-14-137" class="uri">http://dx.doi.org/10.1186/1471-2288-14-137</a>.</p>
</div>
</div>
</div>

New Year Goals
http://fharrell.com/post/newyeargoals/
Fri, 29 Dec 2017 00:00:00 +0000
http://fharrell.com/post/newyeargoals/
<p>Here are some goals related to scientific research and clinical medicine that I’d like to see accomplished in 2018.</p>
<ul>
<li>Physicians come to know that precision/personalized medicine for the most part is based on a false premise</li>
<li>Machine learning/deep learning is understood to not find previously
unknown information in data in the majority of cases, and tends to
work better than traditional statistical models only when dominant
non-additive effects are present and the signal:noise ratio is
decently high</li>
<li>Practitioners will make more progress in correctly using “old”
statistical tools such as regression models</li>
<li>Medical diagnosis is finally understood as a task in probabilistic
thinking, and sensitivity and specificity (which are characteristics
not only of tests but also of patients) are seldom used</li>
<li>Practitioners using cutpoints/thresholds for inherently continuous
measurements will finally go back to primary references and find
that the thresholds were never supported by data</li>
<li>Dichotomania is seen as a failure to understand utility/loss/cost
functions and as a tragic loss of information</li>
<li>Clinical quality improvement initiatives will rely on randomized
trial evidence and deemphasize purely observational evidence;
learning health systems will learn things that are actually true</li>
<li>Clinicians will give up on the idea that randomized clinical trials
do not generalize to real-world settings</li>
<li>Fewer pre-post studies will be done</li>
<li>More research will be reproducible with sounder sample size
calculations, all data manipulation and analysis fully scripted, and
data available for others to analyze in different ways</li>
<li>Fewer sample size calculations will be based on a ‘miracle’ effect
size</li>
<li>Non-inferiority studies will no longer use non-inferiority margins
that are far beyond clinically significant differences</li>
<li>Fewer sample size calculations will be undertaken and more
sequential experimentation done</li>
<li>More Bayesian studies will be designed and executed</li>
<li>Classification accuracy will be mistrusted as a measure of
predictive accuracy</li>
<li>More researchers will realize that estimation rather than hypothesis
testing is the goal</li>
<li>Change from baseline will seldom be <em>computed</em>, let alone
used in an analysis</li>
<li>Percents will begin to be replaced with fractions and ratios</li>
<li>Fewer researchers will draw <strong>any</strong> conclusion from large p-values
other than “the money was spent”</li>
<li>Fewer researchers will draw conclusions from small p-values</li>
</ul>
<p>Some wishes expressed by others on Twitter:</p>
<ul>
<li>No more ROC curves</li>
<li>No more bar plots</li>
<li>Ban the terms ‘statistical significance’ and ‘statistically
insignificant’</li>
</ul>

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction
http://fharrell.com/post/scoredatareduction/
Tue, 21 Nov 2017 15:40:00 +0000
http://fharrell.com/post/scoredatareduction/
<p>This post will grow to cover questions about data reduction methods, also known as <em>unsupervised learning</em> methods. These are intended primarily for two
purposes:</p>
<ul>
<li>collapsing correlated variables into an overall score so that one
does not have to disentangle correlated effects, which is a
difficult statistical task</li>
<li>reducing the effective number of variables to use in a regression or
other predictive model, so that fewer parameters need to be
estimated</li>
</ul>
<p>The latter example is the “too many variables too few subjects” problem.
Data reduction methods are covered in Chapter 4 of my book <em>Regression
Modeling Strategies</em>, and in some of the book’s case studies.</p>
<hr />
<h3 id="sachavarinwrites">Sacha Varin writes</h3>
<p><small>
I want to add/sum some variables having different units. I decided to
standardize the values (Z-scores) and then, once transformed into
Z-scores, sum them. The problem is that my variables’ distributions are
non-Gaussian: they are asymmetrical (skewed) and long-tailed, with all
types of weird distributions; I guess we can say the distributions are
intractable. I know that my distributions don’t need to be Gaussian to
calculate Z-scores. However, if the distributions are not close to
Gaussian, or at least symmetrical enough, I guess the classical Z-score
transformation (Value - Mean)/SD is not valid. That’s why, because my
distributions are skewed and long-tailed, I decided to use Gini’s mean
difference (a robust and efficient estimator).</p>
<ol>
<li>If the distributions are skewed and long-tailed, can I standardize
the values using the formula (Value - Mean)/GiniMd? Or is the mean not
a good estimator in the presence of skewed and long-tailed
distributions? What about (Value - Median)/GiniMd? Or what other
formula with GiniMd could be used to standardize?</li>
<li>In the presence of outliers and skewed, long-tailed distributions,
which formula is better to use for standardization:
(Value - Median)/MAD (MAD = median absolute deviation) or
(Value - Mean)/GiniMd? And why? My situation is not the predictive
modeling case, but I want to sum the variables.
</small></li>
</ol>
<hr />
<p>These are excellent questions and touch on an interesting side issue.
My opinion is that standard deviations (SDs) are not very applicable to
asymmetric (skewed) distributions, and that they are not very robust
measures of dispersion. I’m glad you mentioned <a href="https://arxiv.org/pdf/1405.5027.pdf" target="_blank">Gini’s mean
difference</a>, which is the mean of
all absolute differences of pairs of observations. It is highly robust
and is surprisingly efficient as a measure of dispersion when compared
to the SD, even when normality
holds.</p>
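<p>A minimal R sketch of such a robust standardization, using simulated skewed and long-tailed variables for illustration (<code>GiniMd</code> is in the <code>Hmisc</code> package):</p>
<pre class="r"><code>require(Hmisc)
set.seed(1)
x1 <- rexp(200)        # skewed variable
x2 <- rt(200, df=3)    # long-tailed variable
z1 <- (x1 - median(x1)) / GiniMd(x1)
z2 <- (x2 - median(x2)) / GiniMd(x2)
score <- z1 + z2       # unitless quantities that may now be summed
</code></pre>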
<p>The questions also touch on the fact that when normalizing more than
one variable so that the variables may be combined, there is no magic
normalization method in statistics. I believe that Gini’s mean
difference is as good as any and better than the SD. It is also more
precise than the mean absolute difference from the mean or median, and
the mean may not be robust enough in some instances. But we have a rich
history of methods, such as principal components (PCs), that use
SDs.</p>
<p>What I’m about to suggest is a bit more
applicable to the case where you ultimately want to form a predictive
model, but it can also apply when the goal is to just combine several
variables. When the variables are continuous and are on different
scales, scaling them by SD or Gini’s mean difference will allow one to
create unitless quantities that may possibly be added. But the fact
that they are on different scales begs the question of whether they are
already “linear” or do they need separate nonlinear transformations to
be “combinable”.</p>
<p>I think that nonlinear PCs may be a better choice than just adding
scaled variables. When the predictor variables are correlated,
nonlinear PCs learn from the interrelationships, even occasionally
learning how to optimally transform each predictor to ultimately better
predict Y. The transformations (e.g., fitted spline functions) are
solved for to maximize predictability of a predictor, from the other
predictors or PCs of them. Sometimes the way the predictors move
together is the same way they relate to some ultimate outcome variable
that this unsupervised learning method does not have access to. An
example of this is in Section 4.7.3 of my book.</p>
<p>With a little bit of luck, the transformed predictors have more
symmetric distributions, so ordinary PCs computed on these transformed
variables, with their implied SD normalization, work pretty well. PCs
take into account that some of the component variables are highly
correlated with each other, and so are partially redundant and should
not receive the same weights (“loadings”) as other
variables.</p>
<p>The R transcan function in the Hmisc package has various options for nonlinear PCs, and these ideas are generalized in the R
<a href="https://cran.r-project.org/web/packages/homals" target="_blank">homals</a>
package.</p>
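<p>A hedged sketch of the <code>transcan</code> approach, with hypothetical correlated predictors <code>x1, x2, x3</code> (see the function’s documentation for its many options):</p>
<pre class="r"><code>require(Hmisc)
# Estimate (possibly nonlinear) transformations of each predictor from
# the interrelationships among the predictors, then take ordinary PCs
# of the transformed variables
w  <- transcan(~ x1 + x2 + x3, transformed=TRUE, pl=FALSE)
pc <- prcomp(w$transformed, scale.=TRUE)
</code></pre>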
<p>How do we handle the case where the number of candidate predictors p is
large in comparison to the effective sample size n? Penalized maximum
likelihood estimation (e.g., ridge regression) and Bayesian regression
typically have the best performance, but data reduction methods are
competitive and sometimes more interpretable. For example, one can use
variable clustering and redundancy analysis as detailed in the RMS book
and course notes. Principal components (linear or nonlinear) can also
be an excellent approach to lowering the number of variables that need
to be related to the outcome variable Y. Two example approaches
are:</p>
<ol>
<li>Use the 15:1 rule of thumb to estimate how many predictors can
reliably be related to Y. Suppose that number is k. Use the first
k principal components to predict Y.</li>
<li>Enter PCs in decreasing order of variation (of the system of Xs)
explained and choose the number of PCs to retain using AIC. This is
far from stepwise regression, which enters variables according to
their p-values with Y. We are effectively entering variables in a
prespecified order with incomplete principal component
regression.</li>
</ol>
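<p>The second approach might be sketched as follows, assuming a hypothetical binary outcome <code>y</code> and predictor matrix <code>X</code> (<code>glm</code> is used here for concreteness; <code>rms::lrm</code> could be used similarly):</p>
<pre class="r"><code>pcs  <- prcomp(X, scale.=TRUE)$x   # PCs in decreasing order of variance explained
aics <- numeric(10)
for(k in 1 : 10) {
  f       <- glm(y ~ pcs[, 1 : k, drop=FALSE], family=binomial)
  aics[k] <- AIC(f)
}
kbest <- which.min(aics)           # number of PCs to retain
</code></pre>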
<p>Once the PC model is formed, one may attempt to interpret the model by
studying how raw predictors relate to the principal components or to the
overall predicted values.</p>
<p>Returning to Sacha’s original setting,
if linearity is assumed for all variables, then scaling by Gini’s mean
difference is reasonable. But psychometric properties should be
considered, and often the scale factors need to be derived from subject
matter rather than statistical
considerations.</p>

Statistical Criticism is Easy; I Need to Remember That Real People are Involved
http://fharrell.com/post/criticismeasy/
Sun, 05 Nov 2017 21:07:00 +0000
http://fharrell.com/post/criticismeasy/
<p>I have been critical of a number of articles, authors, and journals in
<a href="http://fharrell.com/post/errmed/" target="_blank">this</a>
growing blog article. Linking the blog with Twitter is a way to expose
the blog to more readers. It is far too easy to slip into hyperbole on
the blog and even easier on Twitter with its space limitations.
Importantly, many of the statistical problems pointed out in my article
are very, very common, and I dwell on recent publications to get the
point across that inadequate statistical review at medical journals
remains a serious problem. Equally important, many of the issues I
discuss, from p-values and null hypothesis testing to issues with change
scores, are not well covered in medical education (of authors and
referees), and p-values have caused a phenomenal amount of damage to the
research enterprise. Still, journals insist on emphasizing p-values. I
spend a lot of time educating biomedical researchers about statistical
issues and as a reviewer for many medical journals, but still am on a
quest to impact journal editors.</p>
<p>Besides statistical issues, there are very real human issues, and
challenges in keeping clinicians interested in academic clinical
research when there are so many pitfalls, complexities, and compliance
issues. In the many clinical trials with which I have been involved,
I’ve always been glad to be the statistician and not the clinician
responsible for protocol logistics, informed consent, recruiting,
compliance, etc.</p>
<p>A recent case discussed
<a href="http://fharrell.com/post/errmed/#pcisham" target="_blank">here</a>
has brought the human issues home, after I came to know of the
extraordinary efforts made by the
<a href="http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32714-9/fulltext" target="_blank">ORBITA</a>
study’s first author, Rasha Al-Lamee, to make this study a reality.
Placebo-controlled device trials are very difficult to conduct and to
recruit patients into, and this was Rasha’s first effort to launch and
conduct a randomized clinical trial. I very much admire Rasha’s bravery
and perseverance in conducting this trial of PCI, when it is possible
that many past trials of PCI vs. medical therapy were affected by placebo
effects.</p>
<p>Professor of Cardiology at Imperial College London, a co-author on the
above paper, and Rasha’s mentor,
<a href="https://www.imperial.ac.uk/people/d.francis" target="_blank">Darrel Francis</a>, elegantly pointed
out to me that there is a real person on the receiving end of my
criticism, and I heartily agree with him that none of us would ever want
to discourage a clinical researcher from ever conducting her second
randomized trial. This is especially true when the investigator has a
burning interest to tackle difficult unanswered clinical questions. I
don’t mind criticizing statistical designs and analyses, but I can do a
better job of respecting the sincere efforts and hard work of biomedical
researchers.</p>
<p>I note in passing that I had the honor of being a co-author with Darrel
on <a href="http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0081699" target="_blank">this paper</a>
of which I am extremely proud.</p>
<p>Dr Francis gave me permission to include his thoughts, which are below.
After that I list some ideas for making the path to presenting clinical
research findings a more pleasant journey.</p>
<hr />
<p><strong>As the PI for ORBITA, I apologise for this trial being 40 years late,
due to a staffing issue. I had to wait for the lead investigator, Rasha
AlLamee, to be born, go to school, study Medicine at Oxford University,
train in interventional cardiology, and start as a consultant in my
hospital, before she could begin the trial.</strong></p>
<p>Rasha had just finished her fellowship. She had experience in clinical
research, but this was her first leadership role in a trial. She was
brave to choose for her PhD a rigorous placebo-controlled trial in this
controversial but important area.</p>
<p>Funding was difficult: grant reviewers, presumably interventional
cardiologists, said the trial was (a) unethical and (b) unnecessary.
This trial only happened because Rasha was able to convince our
colleagues that the question was important and the patients would not be
without stenting for long. Recruitment was challenging because it
required interventionists to not apply the oculostenotic reflex. In the
end the key was Rasha keeping the message at the front of all our
colleagues’ minds with her boundless energy and enthusiasm.
Interestingly, when the concept was explained to patients, they agreed
to participate more easily than we thought, and dropped out less
frequently than we feared. This means we should indeed acquire
placebo-controlled data on interventional procedures.</p>
<p>Incidentally, I advocate the term “placebo” over “sham” for these
trials, for two reasons. First, placebo control is well recognised as
essential for assessing drug efficacy, and this helps people understand
the need for it with devices. Second, “sham” is a pejorative word,
implying deception. There is no deception in a placebo-controlled trial,
only pre-agreed withholding of information.</p>
<hr />
<p>There are several ways to improve the system that I believe would foster
clinical research and make peer review more objective and productive.</p>
<ul>
<li>Have journals conduct reviews of background and methods without
knowledge of results.</li>
<li>Abandon journals and use researcher-led online systems that invite
open post-“publication” peer review and give researchers the
opportunity to improve their “paper” in an ongoing fashion.</li>
<li>If not publishing the entire paper online, deposit the background
and methods sections for open pre-journal submission review.</li>
<li>Abandon null hypothesis testing and p-values. Before that, always
keep in mind that a large p-value means nothing more than “we don’t
yet have evidence against the null hypothesis”, and emphasize
confidence limits.</li>
<li>Embrace Bayesian methods that provide safer and more actionable
evidence, including measures that quantify clinical significance.
And if one is trying to amass evidence that the effects of two
treatments are similar, compute the direct probability of similarity
using a Bayesian model.</li>
<li>Improve statistical education of researchers, referees, and journal
editors, and strengthen statistical review for journals.</li>
<li>Until everyone understands the most important statistical concepts,
better educate researchers and peer reviewers on
<a href="http://biostat.mc.vanderbilt.edu/ManuscriptChecklist" target="_blank">statistical problems to avoid</a>.</li>
</ul>
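One suggestion in the list above, computing a direct probability of similarity, can be sketched with a toy calculation. Assuming (hypothetically) that the posterior distribution of the treatment difference is approximately normal, the probability that the difference lies within a similarity margin ε is just a difference of two normal CDF values. The sketch below is in Python rather than the blog's usual R, and the numeric inputs are made up for illustration:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def prob_similar(post_mean, post_sd, eps):
    """P(-eps < delta < eps) for a normal posterior on the difference delta."""
    return phi((eps - post_mean) / post_sd) - phi((-eps - post_mean) / post_sd)

# Hypothetical posterior: mean difference 1.2, SD 2.0, similarity margin 3
print(round(prob_similar(1.2, 2.0, 3.0), 3))  # → 0.798
```

Unlike a large p-value, this number is a direct statement about the question of interest: the probability that the two treatments are similar.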
<p>On a final note, I regularly review clinical trial design papers for
medical journals. I am often shocked at design flaws that authors state
are “too late to fix” in their response to the reviews. This includes
problems caused by improper endpoint variables that necessitated the
randomization of triple the number of patients actually needed to
establish efficacy. Such papers have often been through statistical
review before the journal submission. This points out two challenges:
(1) there is a lot of between-statistician variation that statisticians
need to address, and (2) there are many fundamental statistical concepts
that are not known to many statisticians (witness the widespread use of
change scores and dichotomization of variables even when senior
statisticians are among a paper’s authors).</p>

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired
http://fharrell.com/post/bayesseq/
Mon, 09 Oct 2017 00:00:00 +0000
http://fharrell.com/post/bayesseq/
<p><div align="center"><span style="font-size:80%;">
(In a Bayesian analysis) It is entirely appropriate to collect data
until a point has been<br>proven or disproven, or until the data collector
runs out of time, money, or patience.<br><a href="http://psycnet.apa.org/doi/10.1037/h0044139" target="_blank">Edwards, Lindman,
Savage</a> (1963)
</span></div><br></p>
<h1 id="introduction">Introduction</h1>
<p>Bayesian inference, which follows the <em>likelihood principle</em>, is not
affected by the experimental design or intentions of the investigator.
P-values can only be computed if both of these are known, and as has been
described by
<a href="http://amstat.tandfonline.com/doi/abs/10.1080/00031305.1987.10475458" target="_blank">Berry</a>
(1987) and others, it is almost never the case that the computation of
the pvalue at the end of a study takes into account all the changes in
design that were necessitated when pure experimental designs encounter
the real world.</p>
<p>When performing multiple data looks as a study progresses, one can
accelerate learning by more quickly abandoning treatments that do not
work, by sometimes stopping early for efficacy, and frequently by
arguing to extend a promising but as-yet-inconclusive study by adding
subjects over the originally intended sample size. Indeed the whole
exercise of computing a single sample size is thought to be voodoo by
most practicing statisticians. It has become almost comical to listen to
rationalizations for choosing larger detectable effect sizes so that
smaller sample sizes will yield adequate power.</p>
<p>Multiplicity and resulting inflation of type I error when using
frequentist methods is real. While Bayesians concern themselves with
“what did happen?”, frequentists must consider “what might have
happened?” because of the backwards time and information flow used in
their calculations. Frequentist inference must envision an indefinitely
long string of identical experiments and must consider extremes of data
over potential studies and over multiple looks within each study if
multiple looks were intended. Multiplicity comes from the chances (over
study repetitions and data looks) you give data to be more extreme (if
the null hypothesis holds), not from the chances you give an effect to
be real. It is only the latter that is of concern to a Bayesian.
Bayesians entertain only one dataset at a time, and if one computes
posterior probabilities of efficacy multiple times, it is only the last
value calculated that matters.</p>
<p>To better understand the last point, consider a probabilistic pattern
recognition system for identifying enemy targets in combat. Suppose the
initial assessment when the target is distant is a probability of 0.3 of
being an enemy vehicle. Upon coming closer the probability rises to 0.8.
Finally the target is close enough (or the air clears) so that the
pattern analyzer estimates a probability of 0.98. The fact that the
probability was <0.98 earlier is of no consequence as the gunner
prepares to fire a cannon. Even though the probability may actually
decrease while the shell is in the air due to new information, the
probability at the time of firing was completely valid based on then
available information.</p>
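The gunner's updating is path-independent, which a small sketch makes concrete: on the odds scale, each new piece of evidence multiplies the current odds by a likelihood ratio, so the final posterior is the same whether the intermediate looks are computed or skipped. The sketch is in Python and the likelihood-ratio values are hypothetical, chosen only to roughly reproduce the 0.3 → 0.8 → 0.98 sequence above:

```python
from functools import reduce

def bayes_update(prob, lr):
    """One Bayes step: posterior odds = prior odds * likelihood ratio."""
    odds = prob / (1 - prob) * lr
    return odds / (1 + odds)

prior = 0.3         # initial assessment at long range
lrs = [9.3, 12.0]   # hypothetical likelihood ratios from two closer looks

# Update look by look
p = prior
for lr in lrs:
    p = bayes_update(p, lr)

# A single combined update gives the identical final probability:
p_once = bayes_update(prior, reduce(lambda a, b: a * b, lrs))
assert abs(p - p_once) < 1e-12
print(round(p, 3))  # → 0.98
```

The number of intermediate looks changes nothing about the validity of the final probability, which is exactly the point about multiple data looks in a Bayesian trial.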
<p>This is very much how an experimenter would work in a Bayesian clinical
trial. The stopping rule is unimportant when interpreting the final
evidence. Earlier data looks are irrelevant. The only ways a Bayesian
would cheat would be to ignore a later look if it is less favorable than
an earlier look, or to try to pull the wool over reviewers’ eyes by
changing the prior distribution once data patterns emerge.</p>
<p>The meaning and accuracy of posterior probabilities of efficacy in a
clinical trial are mathematical necessities that follow from Bayes’
rule, if the data model is correctly specified (this model is needed
just as much by frequentist methods). So no simulations are needed to
demonstrate these points. But for the nonmathematically minded,
simulations can be comforting. For everyone, simulation code exposes the
logic flow in the Bayesian analysis paradigm.</p>
<p>One other thing: when the frequentist does a sequential trial with
possible early termination, the sampling distribution of the statistics
becomes extremely complicated, but must be derived to allow one to
obtain proper point estimates and confidence limits. It is almost never
the case that the statistician actually performs these complex
adjustments in a clinical trial with multiple looks. One example of the
harm of ignoring this problem is that if the trial stops fairly early
for efficacy, efficacy will be overestimated. On the other hand, the
Bayesian posterior mean/median/mode of the efficacy parameter will be
perfectly calibrated by the prior distribution you assume. If the prior
is skeptical and one stops early, the posterior mean will be “pulled
back” by a perfect amount, as shown in the simulation below.</p>
<p>We consider the simplest clinical trial design for illustration. The
efficacy measure is assumed to be normally distributed with mean μ and
variance 1.0, μ=0 indicates no efficacy, and μ<0 indicates a
detrimental effect. Our inferential jobs are to see if evidence may be
had for a positive effect and to see if further there is evidence for a
clinically meaningful effect (except for the futility analysis, we will
ignore the latter in what follows). Our business task is to not spend
resources on treatments that have a low chance of having a meaningful
benefit to patients. The latter can also be an ethical issue: we’d like
not to expose too many patients to an ineffective treatment. In the
simulation, we stop for futility when the probability that μ<0.05
exceeds 0.9, considering μ=0.05 to be a minimal clinically important
effect.</p>
<p>The logic flow in the simulation exposes what is assumed by the Bayesian
analysis.</p>
<ol>
<li>The prior distribution for the unknown effect μ is taken as a
mixture of two normal distributions, each with mean zero. This is a
skeptical prior that gives an equal chance for detriment as for
benefit from the treatment. Any prior would have done.</li>
<li>In the next step it is seen that the Bayesian does not consider a
stream of identical trials but instead (and only when studying
performance of Bayesian operating characteristics) considers a
stream of trials with <strong>different</strong> efficacies of treatment, by
drawing a single value of μ from the prior distribution. This is
done independently for 50,000 simulated studies. Posterior
probabilities are not informed by this value of μ. Bayesians operate
in a predictive mode, trying for example to estimate Prob(μ>0) no
matter what the value of μ.</li>
<li>For the current value of μ, simulate an observation from a normal
distribution with mean μ and SD=1.0. [In the code below all n=500
subjects’ data are simulated at once then revealed one-at-a-time.]</li>
<li>Compute the posterior probability of efficacy (μ>0) and of
futility (μ<0.05) using the original prior and latest data.</li>
<li>Stop the study if the probability of efficacy ≥0.95 or the
probability of futility ≥0.9.</li>
<li>Repeat the last 3 steps, sampling one more subject each time and
performing analyses on the accumulated set of subjects to date.</li>
<li>Stop the study when 500 subjects have entered.</li>
</ol>
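The seven steps above can be sketched compactly. The sketch below is in Python rather than the R used later in this post, and it substitutes a single skeptical normal prior (SD ≈ 0.78, matching the first mixture component) for the two-component mixture so that the conjugate normal-normal posterior has a simple closed form; the logic flow is otherwise the same:

```python
import math, random

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def run_trial(rng, N=500, sd0=0.78, postcut=0.95, postcutf=0.90, mcid=0.05):
    mu = rng.gauss(0, sd0)            # step 2: true efficacy drawn from the prior
    total = 0.0
    for n in range(1, N + 1):
        total += rng.gauss(mu, 1.0)   # step 3: one more subject's response
        prec = 1 / sd0 ** 2 + n       # conjugate normal-normal posterior
        pm, psd = total / prec, math.sqrt(1 / prec)
        p_eff = 1 - phi(-pm / psd)    # step 4: P(mu > 0 | data)
        p_fut = phi((mcid - pm) / psd)  #        P(mu < 0.05 | data)
        if p_eff >= postcut:          # step 5: stop for efficacy
            return ('efficacy', n, p_eff, mu)
        if p_fut >= postcutf:         # step 5: stop for futility
            return ('futility', n, p_fut, mu)
    return ('complete', N, p_eff, mu) # step 7: ran to N subjects

rng = random.Random(1)
trials = [run_trial(rng) for _ in range(5000)]
eff = [t for t in trials if t[0] == 'efficacy']
# Calibration check: mean posterior probability at stopping vs. truth
print(sum(t[2] for t in eff) / len(eff), sum(t[3] > 0 for t in eff) / len(eff))
```

With enough simulated trials the two printed numbers agree: the average posterior probability at the moment of stopping for efficacy matches the proportion of those trials with μ actually > 0, which is the calibration property demonstrated at scale in the R simulation below.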
<p>What is it that the Bayesian must demonstrate to the frequentist and
reviewers? She must demonstrate that the posterior probabilities
computed as stated above are accurate, i.e., they are well calibrated.
From our simulation design, the final posterior probability will either
be the posterior probability computed after the last (500th) subject has
entered, the probability of futility at the time of stopping for
futility, or the probability of efficacy at the time of stopping for
efficacy. How do we tell if the posterior probability is accurate? By
comparing it to the value of μ (unknown to the posterior probability
calculation) that generated the sequence of data points that were
analyzed. We can compute a smooth nonparametric calibration curve for
each of (efficacy, futility) where the binary events are μ>0 and μ<0.05, respectively. For the subset of the 50,000 studies that were
stopped early, the range of probabilities is limited so we can just
compare the mean posterior probability at the moment of stopping with
the proportion of such stopped studies for which efficacy (futility) was
the truth. The mathematics of Bayes dictates the mean probability and
the proportion must be the same (if enough trials are run so that
simulation error approaches zero). This is what happened in the
simulations.</p>
<p>For the smaller set of studies not stopping early, the posterior
probability of efficacy is uncertain and will have a much wider range.
The calibration accuracy of these probabilities is checked using a
nonparametric calibration curve estimator just as we do in validating
risk models, by fitting the relationship between the posterior
probability and the binary event μ>0.</p>
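The smooth nonparametric calibration curve referred to here is produced by <code>val.prob</code> in the R code below. As a cruder but self-contained illustration of the same idea, one can bin the posterior probabilities and compare each bin's mean predicted probability with the observed frequency of the event; this is a sketch in Python, not the loess-type estimator the post actually uses:

```python
def binned_calibration(probs, events, bins=10):
    """Sort (probability, event) pairs, split into bins, and return
    (mean predicted probability, observed event rate) per bin."""
    pairs = sorted(zip(probs, events))
    step = max(1, len(pairs) // bins)
    out = []
    for i in range(0, len(pairs), step):
        chunk = pairs[i:i + step]
        p_mean = sum(p for p, _ in chunk) / len(chunk)
        e_rate = sum(e for _, e in chunk) / len(chunk)
        out.append((p_mean, e_rate))
    return out

# Toy demo: predictions that are perfectly calibrated within two groups
demo = binned_calibration([0.25] * 4 + [0.75] * 4,
                          [0, 0, 0, 1, 1, 1, 1, 0], bins=2)
print(demo)  # → [(0.25, 0.25), (0.75, 0.75)]
```

For well-calibrated probabilities the per-bin points fall on the line of identity, just as the smooth curve does in the plots below.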
<p>The simulations also demonstrated that the posterior mean efficacy at
the moment of stopping is perfectly calibrated as an estimator of the
true unknown μ.</p>
<p>Simulations were run in R and used functions in the R Hmisc and rms
packages. The results are below. Feel free to take the code and alter it
to run any simulations you’d like.</p>
<pre><code class="language-r">require(rms)
</code></pre>
<pre><code class="language-r">knitrSet(lang='blogdown', echo=TRUE)
gmu  <- htmlGreek('mu')
half <- htmlSpecial('half')
geq  <- htmlTranslate('>=')
knitr::read_chunk('fundefs.r')
</code></pre>
<h1 id="specificationofprior">Specification of Prior</h1>
<p>The prior distribution is skeptical against large values of efficacy, and assumes that detriment is as likely as benefit of treatment. The prior favors small effects. It is a 1:1 mixture of two normal distributions, each with mean 0. The SD of the first distribution is chosen so that P(μ > 1) = 0.1, and the SD of the second distribution is chosen so that P(μ > 0.25) = 0.05. Posterior probabilities upon early stopping would have the same accuracy no matter which prior is chosen, as long as the prior used in the analysis is the same prior that generated μ.</p>
<pre><code class="language-r">sd1 <- 1 / qnorm(1 - 0.1)
sd2 <- 0.25 / qnorm(1 - 0.05)
wt  <- 0.5   # 1:1 mixture
pdensity <- function(x) wt * dnorm(x, 0, sd1) + (1 - wt) * dnorm(x, 0, sd2)
x <- seq(-3, 3, length=200)
plot(x, pdensity(x), type='l', xlab='Efficacy', ylab='Prior Degree of Belief')
</code></pre>
<p><img src="http://fharrell.com/post/bayesseq_files/figurehtml/skepprior1.png" width="672" /></p>
<h1 id="sequentialtestingsimulation">Sequential Testing Simulation</h1>
<pre><code class="language-r">simseq <- function(N, prior.mu=0, prior.sd, wt, mucut=0, mucutf=0.05,
                   postcut=0.95, postcutf=0.9,
                   ignore=20, nsim=1000) {
  prior.mu <- rep(prior.mu, length=2)
  prior.sd <- rep(prior.sd, length=2)
  sd1 <- prior.sd[1]; sd2 <- prior.sd[2]
  v1 <- sd1 ^ 2
  v2 <- sd2 ^ 2
  j  <- 1 : N
  cmean <- Mu <- PostN <- Post <- Postf <- postfe <- postmean <- numeric(nsim)
  stopped <- stoppedi <- stoppedf <- stoppedfu <- stopfe <- status <-
    integer(nsim)
  notignored <- - (1 : ignore)
  # Derive function to compute posterior mean
  pmean <- gbayesMixPost(NA, NA, d0=prior.mu[1], d1=prior.mu[2],
                         v0=v1, v1=v2, mix=wt, what='postmean')
  for(i in 1 : nsim) {
    # See http://stats.stackexchange.com/questions/70855
    component <- if(wt == 1) 1 else sample(1 : 2, size=1, prob=c(wt, 1. - wt))
    mu <- prior.mu[component] + rnorm(1) * prior.sd[component]
    # mu <- rnorm(1, mean=prior.mu, sd=prior.sd) if only 1 component
    Mu[i] <- mu
    y    <- rnorm(N, mean=mu, sd=1)
    ybar <- cumsum(y) / j   # all N means for N sequential analyses
    pcdf <- gbayesMixPost(ybar, 1. / j,
                          d0=prior.mu[1], d1=prior.mu[2],
                          v0=v1, v1=v2, mix=wt, what='cdf')
    post  <- 1 - pcdf(mucut)
    PostN[i] <- post[N]
    postf <- pcdf(mucutf)
    s <- stopped[i] <-
      if(max(post) < postcut) N else min(which(post >= postcut))
    Post[i]  <- post[s]   # posterior at stopping
    cmean[i] <- ybar[s]   # observed mean at stopping
    # If want to compute posterior median at stopping:
    # pcdfs <- pcdf(mseq, x=ybar[s], v=1. / s)
    # postmed[i] <- approx(pcdfs, mseq, xout=0.5, rule=2)$y
    # if(abs(postmed[i]) == max(mseq)) stop(paste('program error', i))
    postmean[i] <- pmean(x=ybar[s], v=1. / s)
    # Compute stopping time if ignore the first "ignore" looks
    stoppedi[i] <- if(max(post[notignored]) < postcut) N
                   else
                     ignore + min(which(post[notignored] >= postcut))
    # Compute stopping time if also allow to stop for futility:
    # posterior probability mu < 0.05 > 0.9
    stoppedf[i] <- if(max(post) < postcut & max(postf) < postcutf) N
                   else
                     min(which(post >= postcut | postf >= postcutf))
    # Compute stopping time for pure futility analysis
    s <- if(max(postf) < postcutf) N else min(which(postf >= postcutf))
    Postf[i]     <- postf[s]
    stoppedfu[i] <- s
    ## Another way to do this: find first look that stopped for either
    ## efficacy or futility.  Record status: 0:not stopped early,
    ## 1:stopped early for futility, 2:stopped early for efficacy
    ## Stopping time: stopfe, post prob at stop: postfe
    stp <- post >= postcut | postf >= postcutf
    s   <- stopfe[i] <- if(any(stp)) min(which(stp)) else N
    status[i] <- if(any(stp)) ifelse(postf[s] >= postcutf, 1, 2) else 0
    postfe[i] <- if(any(stp)) ifelse(status[i] == 2, post[s],
                                     postf[s]) else post[N]
  }
  list(mu=Mu, post=Post, postn=PostN, postf=Postf,
       stopped=stopped, stoppedi=stoppedi,
       stoppedf=stoppedf, stoppedfu=stoppedfu,
       cmean=cmean, postmean=postmean,
       postfe=postfe, status=status, stopfe=stopfe)
}
</code></pre>
<pre><code class="language-r">set.seed(1)
z <- simseq(500, prior.mu=0, prior.sd=c(sd1, sd2), wt=wt, postcut=0.95,
            postcutf=0.9, nsim=50000)
mu       <- z$mu
post     <- z$post
postn    <- z$postn
st       <- z$stopped
sti      <- z$stoppedi
stf      <- z$stoppedf
stfu     <- z$stoppedfu
cmean    <- z$cmean
postmean <- z$postmean
postf    <- z$postf
status   <- z$status
postfe   <- z$postfe
rmean    <- function(x) formatNP(mean(x), digits=3)
k  <- status == 2
kf <- status == 1
</code></pre>
<ul>
<li>Run 50,000 <b>different</b> clinical trials (differ on amount of efficacy)</li>
<li>For each, generate μ (true efficacy) from the prior</li>
<li>Generate data (n=500) under this truth</li>
<li>½ of the trials have zero or negative efficacy</li>
<li>Do analysis after 1, 2, …, 500 subjects studied</li>
<li>Stop the study when 0.95 sure efficacy > 0, i.e., stop the instant the posterior prob. that the unknown mean μ is positive is ≥ 0.95</li>
<li><p>Also stop for futility: the instant P(μ < 0.05) ≥ 0.9</p></li>
<li><p>20393 trials stopped early for efficacy</p></li>
<li><p>28438 trials stopped early for futility</p></li>
<li><p>1169 trials went to completion (n=500)</p></li>
<li><p>Average posterior prob. of efficacy at stopping for efficacy: 0.961</p></li>
<li><p>Of trials stopped early for efficacy, proportion with μ > 0: 0.960</p></li>
<li><p>Average posterior prob. of futility at stopping for futility: 0.920</p></li>
<li><p>Of trials stopped early for futility, proportion with μ < 0.05: 0.923</p></li>
</ul>
<p>The simulations took about 25 seconds in total.</p>
<h1 id="calibrationofposteriorprobabilitiesofefficacyforstudiesgoingtocompletion">Calibration of Posterior Probabilities of Efficacy for Studies Going to Completion</h1>
<p>Above we saw perfect calibration of the probabilities of efficacy and futility upon stopping. Let’s now examine the remaining probabilities, for the 1169 trials going to completion. For this we use the same type of nonparametric calibration curve estimation as used for validating risk prediction models. This curve estimates the relationship between the estimated probability of efficacy (Bayesian posterior probability) and the true probability of efficacy.</p>
<pre><code class="language-r">k <- status == 0
pp <- postfe[k]
truly.efficacious <- mu[k] > 0
v <- val.prob(pp, truly.efficacious)
</code></pre>
<p><img src="http://fharrell.com/post/bayesseq_files/figurehtml/cal1.png" width="672" /></p>
<p>The posterior probabilities of efficacy tended to be between 0.45 (had they been much lower the trial would have been stopped for futility) and 0.95 (the cutoff for stopping for efficacy). Where there are data, the nonparametric calibration curve estimate is very close to the line of identity. Had we done even more simulations we would have had many more nonstopped studies and the calibration estimates would be even closer to the ideal. For example, when the posterior probability of efficacy is 0.6, the true probability that the treatment was effective (μ actually > 0) is 0.6.</p>
<h1 id="calibrationofposteriormeanatstoppingforefficacy">Calibration of Posterior Mean at Stopping for Efficacy</h1>
<p>When stopping early because of evidence that μ > 0, the sample mean will overestimate the true mean. But with the Bayesian analysis, where the prior favors smaller treatment effects, the posterior mean/median/mode is pulled back by a perfect amount, as shown in the plot below.</p>
<pre><code class="language-r">plot(0, 0, xlab='Estimated Efficacy',
     ylab='True Efficacy', type='n', xlim=c(-2, 4), ylim=c(-2, 4))
abline(a=0, b=1, col=gray(.9), lwd=4)
lines(supsmu(cmean, mu))
lines(supsmu(postmean, mu), col='blue')
text(2, .4, 'Sample mean')
text(1, .8, 'Posterior mean', col='blue')
</code></pre>
<p><img src="http://fharrell.com/post/bayesseq_files/figurehtml/estmu1.png" width="672" /></p>
<!--
# Useful References
Berry[@ber87int], Edwards, Lindman and Savage[@edw63bay]
-->
<h1 id="computingenvironment">Computing Environment</h1>
<p><!--html_preserve--><pre>
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.10</p>
<p>Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1</p>
<p>attached base packages:
[1] methods stats graphics grDevices utils datasets base</p>
<p>other attached packages:
[1] rms_5.1-2       SparseM_1.77    Hmisc_4.1-2     ggplot2_2.2.1<br />
[5] Formula_1.2-2   survival_2.41-3 lattice_0.20-35
</pre>
To cite R in publication use:
<p>R Core Team (2017).
<em>R: A Language and Environment for Statistical Computing</em>.
R Foundation for Statistical Computing, Vienna, Austria.
<a href="https://www.R-project.org/">https://www.R-project.org/</a>.
</p>
<!--/html_preserve--></p>

Bayesian vs. Frequentist Statements About Treatment Efficacy
http://fharrell.com/post/bayesfreqstmts/
Wed, 04 Oct 2017 07:00:00 +0000
http://fharrell.com/post/bayesfreqstmts/
<p>The following examples are intended to show the advantages of Bayesian reporting of
treatment efficacy analysis, as well as to provide examples contrasting
with frequentist reporting. As detailed
<a href="http://fharrell.com/post/pvallitany/" target="_blank">here</a>,
there are many problems with pvalues, and some of those problems will
be apparent in the examples below. Many of the advantages of Bayes are
summarized <a href="http://fharrell.com/post/journey/" target="_blank">here</a>.
As seen below, Bayesian posterior probabilities prevent one from
concluding equivalence of two treatments on an outcome when the data do
not support that (i.e., the <a href="http://fharrell.com/post/errmed/" target="_blank">“absence of evidence is not evidence of
absence”</a> error).</p>
<p>Suppose that a parallel group randomized clinical trial is conducted to
gather evidence about the relative efficacy of new treatment B to a
control treatment A. Suppose there are two efficacy endpoints: systolic
blood pressure (SBP) and time until cardiovascular/cerebrovascular
event. Treatment effect on the first endpoint is assumed to be
summarized by the B-A difference in true mean SBP. The second endpoint
is assumed to be summarized as a true B:A hazard ratio (HR). For the
Bayesian analysis, assume that prespecified skeptical prior
distributions were chosen as follows. For the unknown difference in mean
SBP, the prior was normal with mean 0 with SD chosen so that the
probability that the absolute difference in SBP between A and B exceeds
10 mmHg was only 0.05. For the HR, the log HR was assumed to have a
normal distribution with mean 0 and SD chosen so that the prior
probability that the HR>2 or HR<<sup>1</sup>⁄<sub>2</sub> was 0.05. Both priors
specify that it is equally likely that treatment B is effective as it is
detrimental. The two prior distributions will be referred to as p1 and
p2.</p>
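The two prior SDs implied by these tail constraints are easy to derive: a mean-zero normal with P(|X| > c) = 0.05 has SD = c / z<sub>0.975</sub>. A minimal sketch (in Python rather than R; the constants 10 mmHg and log 2 come from the text, everything else is standard normal arithmetic):

```python
import math
from statistics import NormalDist

z975 = NormalDist().inv_cdf(0.975)   # ≈ 1.96

sd_p1 = 10 / z975            # prior p1: SBP mean difference, P(|diff| > 10) = 0.05
sd_p2 = math.log(2) / z975   # prior p2: log HR, P(HR > 2 or HR < 1/2) = 0.05

# Verify that the tail constraints round-trip
assert abs(2 * (1 - NormalDist(0, sd_p1).cdf(10)) - 0.05) < 1e-9
assert abs(2 * (1 - NormalDist(0, sd_p2).cdf(math.log(2))) - 0.05) < 1e-9
print(round(sd_p1, 2), round(sd_p2, 3))  # → 5.1 0.354
```

The symmetry of both priors about zero is what encodes "equally likely that treatment B is effective as it is detrimental."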
<h3 id="example1socallednegativetrialconsideringonlysbp">Example 1: So-called “Negative” Trial (Considering only SBP)</h3>
<p>Frequentist Statement</p>
<ul>
<li>Incorrect Statement: Treatment B did not improve SBP when compared
to A (p=0.4)</li>
<li>Confusing Statement: Treatment B was not significantly different
from treatment A (p=0.4)</li>
<li>Accurate Statement: We were unable to find evidence against the
hypothesis that A=B (p=0.4). More data will be needed. As the
statistical analysis plan specified a frequentist approach, the
study did not provide evidence of similarity of A and B (but see the
confidence interval below).</li>
<li>Supplemental Information: The observed B-A difference in means was
4 mmHg with a 0.95 confidence interval of [-5, 13]. If this study
could be indefinitely replicated and the same approach used to
compute the confidence interval each time, 0.95 of such varying
confidence intervals would contain the unknown true difference in
means.</li>
</ul>
<p>Bayesian Statement</p>
<ul>
<li>Assuming prior distribution p1 for the mean difference of SBP, the
probability that SBP with treatment B is lower than treatment A is
0.67. Alternative statement: SBP is probably (0.67) reduced with
treatment B. The probability that B is inferior to A is 0.33.
Assuming a minimally clinically important difference in SBP of
3 mmHg, the probability that the mean for A is within 3 mmHg of the
mean for B is 0.53, so the study is uninformative about the question
of similarity of A and B.</li>
<li>Supplemental Information: The posterior mean difference in SBP was
3.3 mmHg and the 0.95 credible interval is [-4.5, 10.5]. The
probability is 0.95 that the true treatment effect is in the
interval [-4.5, 10.5]. [could include the posterior density
function here, with a shaded right tail with area 0.67.]</li>
</ul>
<h3 id="example2socalledpositivetrial">Example 2: So-called “Positive” Trial</h3>
<p>Frequentist Statement</p>
<ul>
<li>Incorrect Statement: The probability that there is no difference in
mean SBP between A and B is 0.02</li>
<li>Confusing Statement: There was a statistically significant
difference between A and B (p=0.02).</li>
<li>Correct Statement: There is evidence against the null hypothesis of
no difference in mean SBP (p=0.02), and the observed difference
favors B. Had the experiment been exactly replicated indefinitely,
0.02 of such repetitions would result in more impressive results if
A=B.</li>
<li>Supplemental Information: Similar to above.</li>
<li>Second Outcome Variable, If the pvalue is Small: Separate
statement, of same form as for SBP.</li>
</ul>
<p>Bayesian Statement</p>
<ul>
<li>Assuming prior p1, the probability that B lowers SBP when compared
to A is 0.985. Alternative statement: SBP is probably (0.985)
reduced with treatment B. The probability that B is inferior to A is
0.015.</li>
<li>Supplemental Information: Similar to above, plus evidence about
clinically meaningful effects, e.g.: The probability that B lowers
SBP by more than 3 mmHg is 0.81.</li>
<li>Second Outcome Variable: Bayesian approach allows one to make a
separate statement about the clinical event HR and to state evidence
about the joint effect of treatment on SBP and HR. Examples:
Assuming prior p2, HR is probably (0.79) lower with treatment B.
Assuming priors p1 and p2, the probability that treatment B both
decreased SBP and decreased event hazard was 0.77. The probability
that B improved <strong>either</strong> of the two endpoints was 0.991.</li>
</ul>
<p>One would also report basic results. For SBP, frequentist results might
be chosen as the mean difference and its standard error. Basic Bayesian
results could be said to be the entire posterior distribution of the SBP
mean difference.</p>
<p>Note that if multiple looks were made as the trial progressed, the
frequentist estimates (including the observed mean difference) would
have to undergo complex adjustments. Bayesian results require no
modification whatsoever, but just involve reporting the latest available
cumulative evidence.</p>

Integrating Audio, Video, and Discussion Boards with Course Notes
http://fharrell.com/post/coursemedia/
Tue, 01 Aug 2017 10:12:00 +0000
http://fharrell.com/post/coursemedia/
<p>As a biostatistics teacher I’ve spent a lot of time thinking about inverting
the classroom and adding multimedia content. My first thought was to
create YouTube videos corresponding to sections in my lecture notes.
This typically entails recording the computer screen while going through
slides, adding a voiceover. I realized that the maintenance of such
videos is difficult, and this also creates a barrier to adding new
content. In addition, the quality of the video image is lower than just
having the student use a pdf viewer on the original notes. For those
reasons I decided to create audio narration for the sections in the
notes to largely capture what I would say during a live lecture. The
audio <code>mp3</code> files are stored on a local server and are streamed on
demand when a student clicks on the audio icon in a section of the notes.
The audio recordings can also be downloaded one-at-a-time or in a batch.</p>
<p>The notes themselves are created using <code>LaTeX, R</code>, and <code>knitr</code> using a
<code>LaTeX</code> style I created that is a compromise format between projecting
slides and printing notes. In the future I will explore using <code>bookdown</code>
for creating content in <code>html</code> instead of <code>pdf</code>. In either case, the
notes can change significantly when R commands within them are
re-executed by <code>knitr</code> in <code>R</code>.</p>
<p>An example of a page of <code>pdf</code> notes with icons that link to audio or
video content is in Section 10.5 of
<a href="http://fharrell.com/links" target="_blank">BBR</a>. I add red letters
in the right margin for each subsection in the text, and occasionally
call out these letters in the audio so that the student will know where
I am.</p>
<p>There are several student activities for which the course would benefit
by recording information. Two of them are students pooling notes taken
during class sessions, and questions and answers between sessions. The
former might be handled by simultaneous editing or wiki curation on the
cloud, and I haven’t thought very much about how to link this with the
course notes to in effect expand the notes for the next class of
students. Let’s consider the Q&A aspect. It would be advantageous for
questions and answers to “grow”, and for future students to take
advantage of the Q&As from past students. Being able to be looking at a
subsection in the course notes and quickly linking to cumulative Q&A on
that topic is a plus. My first attempt at this was to set up a
<a href="http://slack.com" target="_blank">slack.com</a> team for courses in our department, and
then setting up a channel for each of the two courses I teach. As
<code>slack</code> does not allow subchannels, the discussions need to be
organized in some way. I went about this by assigning a mnemonic in the
course notes that should be mentioned when a threaded discussion is
started in <code>slack</code>. Students can search for discussions about a
subsection in the notes by searching for that mnemonic. I have put
hyperlinks from the notes to a slack search expression that is supposed
to bring up discussions related to the mnemonic in the course’s <code>slack</code>
channel. The problem is that <code>slack</code> doesn’t have a formal URL
construction that guarantees that a hyperlink to a URL with that
expression will cause the correct discussions to pop up in the user’s
browser. This is a work in progress, and other ideas are welcomed. See
Section 10.5.1 of <a href="http://fharrell.com/links" target="_blank">BBR</a>
for an example where an icon links to slack (see the mnemonic
<code>regsimple</code>).</p>
<p>Besides being hard to figure out how to create URLs to get the student
and instructor directly into a specific discussion, <code>slack</code> has the
disadvantage that users need to be invited to join the team. If every
team member is to be from the instructor’s university, you can configure
<code>slack</code> so that anyone with an email address in the instructor’s domain
can be added to the team automatically.</p>
<p>I have entertained another approach of using <a href="http://disqus.com" target="_blank">disqus</a>
for linking student comments to sections of notes. This is very easy to
set up, but when one wants to have a separate discussion about each
notes subsection, I haven’t figured out how to have <code>disqus</code> use
keywords or some other means to separate the discussions.</p>
<p><a href="http://stats.stackexchange.com" target="_blank">stats.stackexchange.com</a> is the world’s
most active Q&A and discussion board for statistics. Its ability to
format questions, answers, comments, math equations, and images is
unsurpassed. Perhaps every discussion about a statistical issue should
be started in <code>stackexchange</code> and then linked to from the course notes.
This has the disadvantage of needing to link to multiple existing
<code>stackexchange</code> questions related to one topic, but has the great
advantage of gathering input from statisticians around the world, not
just those in the class.</p>
<p>No matter which method for entering Q & A is used, I think that such
comments need to be maintained separately from the course notes because
of the dynamic, reproducible nature of the notes using <code>knitr</code>. Just as
important, when I add new static content to the notes I want the
existing student comments to just move appropriately with these changes.
Hyperlinking to Q & A does that. There is one more issue not discussed
above: students often annotate the <code>pdf</code> file, but their annotations
are undone when I produce an update to the notes. It would be nice to
have some sort of dynamic annotation capability. This is likely to work
better as I use <code>R bookdown</code> for new notes I develop.</p>
<p>I need your help in refining the approach or discovering completely new
approaches to coordination of information using the course notes as a
hub. Please add comments to this post below, or short suggestions to
<code>@f2harrell</code> on <code>twitter</code>.</p>

EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection
http://fharrell.com/post/ehrsrcts/
Thu, 01 Jun 2017 14:18:00 +0000
http://fharrell.com/post/ehrsrcts/
<div align="center">
Frank Harrell<br>
Professor of Biostatistics<br>
Vanderbilt University School of Medicine<br><br>
Laura Lazzeroni<br>
Professor of Psychiatry and, by courtesy, of Medicine (Cardiovascular
Medicine) and of Biomedical Data Science<br>
Stanford University School of Medicine<br>
<small>Revised July 17, 2017</small>
</div>
<p>It is often said that randomized clinical trials (RCTs) are the gold
standard for learning about therapeutic effectiveness. This is because
the treatment is assigned at random so no variables, measured or
unmeasured, will be truly related to treatment assignment. The result is
an unbiased estimate of treatment effectiveness. On the other hand,
observational data arising from clinical practice has all the biases of
physicians and patients in who gets which treatment. Some treatments are
indicated for certain types of patients; some are reserved for very sick
ones. The fact is that treatment is often chosen on the
basis of patient characteristics that influence patient outcome, some of
which may be unrecorded. When the outcomes of different groups of
patients receiving different treatments are compared, without adjustment
for patient characteristics related to treatment selection and outcome,
the result is a bias called <em>confounding by indication</em>.</p>
<p>To set the stage for our discussion of the challenges caused by
confounding by indication, incomplete data, and unreliable data, first
consider a nearly ideal observational treatment study, then consider an
ideal RCT. First, consider a potentially optimal observational cohort
design that has some possibility of providing an accurate treatment
outcome comparison. Suppose that an investigator has obtained $2M in
funding to hire trained research nurses to collect data completely and
accurately, and she has gone to the trouble of asking five expert
clinicians in the disease/treatment area to each independently list the
patient characteristics they perceive are used to select therapies for
patients. The result is a list of 18 distinct patient characteristics,
for which a data dictionary is written and case report forms are
collected. Data collectors are instructed to obtain these 18 variables
on every patient with very few exceptions, and other useful variables,
especially strong prognostic factors, are collected in addition. Details
about treatment are also captured, including the start and ending dates
of treatment, doses, and dose schedule. Outcomes are well defined and
never missing. The sample size is adequate, and when data collection is
complete, analysis of covariance is used to estimate the outcome
difference for treatment A vs. treatment B. Then the study PI discovers
that there is a strong confounder that none of the five experts thought
of, and a sensitivity analysis indicates that the original treatment
effect estimate might have been washed away by the additional confounder
had it been collected. The study results in no reliable knowledge about
the treatments.</p>
<p>The study just described represents a high level of observational study
quality, and still needed some luck to be useful. The treatments, entry
criteria, and follow-up clock were well defined, and there were almost
no missing data. Contrast that with the electronic health record (EHR).
If questions of therapeutic efficacy are so difficult to answer with
nearly perfect observational data, how can they be reliably answered from
EHR data alone?</p>
<p>To complete our introduction to the discussion, envision a
well-conducted parallel-group RCT with complete follow-up and highly
accurate and relevant baseline data capture. Study inclusion criteria
allowed for a wide range of age and severity of disease. The endpoint is
time until a devastating clinical event. The treatment B:treatment A
covariate-adjusted hazard ratio is 0.8 with 0.95 credible interval of
[0.65, 0.93]. The authors, avoiding unreliable subgroup analysis,
perform a careful but comprehensive assessment of interaction between
patient types and treatment effect, finding no evidence for
heterogeneity of treatment effect (HTE). The hazard ratio of 0.8 is
widely generalizable, even to populations with much different baseline
risk. A simple nomogram is drawn to show how to estimate absolute risk
reduction by treatment B at 3 years, given a patient’s baseline 3y risk.</p>
<hr />
<p>There is an alarming trend in advocates of learning from the EHR saying
that statistical inference can be bypassed because (1) large numbers
overcome all obstacles, (2) the EHR reflects actual clinical practice
and patient populations, and (3) if you can predict outcomes for
individual patients you can just find out for which treatment the
predicted outcomes are optimal. Advocates of such “logic” often go on to
say that RCTs are not helpful because the proportion of patients seen in
practice that would qualify for the trial is very small with randomized
patients being unrepresentative of the clinical population, because the
trial only estimates the average treatment effect, because there must be
HTE, and because treatment conditions are unrepresentative. Without HTE,
precision medicine would have no basis. But evidence of substantial HTE
has yet to be generally established and its existence in particular
cases can be an artifact of the outcome scale used for the analysis. See
<a href="http://fharrell.com/post/rctmimic/" target="_blank">this</a>
for more about the first two complaints about RCTs. Regarding (1),
researchers too often forget that measurement or sample bias does not
diminish no matter how large the sample size. Often, large sample sizes
only provide precise estimates of the wrong quantity.</p>
<p>To illustrate this problem, suppose that one is interested in estimating
and testing the treatment effect, B-A, of a certain blood pressure
lowering medication (drug B) when compared to another drug (A). Assume a
relevant subset of the EHR can be extracted in which patients started
initial monotherapy at a defined date and systolic blood pressure (SBP)
was measured routinely at useful follow-up intervals. Suppose that the
standard deviation (SD) of SBP across patients is 8 mmHg regardless of
treatment group. Suppose further that minor confounding by indication is
present due to the failure to adjust for an unstated patient feature
involved in the drug choice, which creates a systematic unidirectional
bias of 2 mmHg in estimating the true B-A difference in mean SBP. If the
EHR has m patients in each treatment group, the variance of the
estimated mean difference is the sum of the variances of the two
individual means or 64/m + 64/m = 128/m. But the variance only tells us
about how close our sample estimate is to the incorrect value, B-A + 2
mmHg. It is the mean squared error, the variance plus the square of the
bias or 128/m + 4, that relates to the probability that the estimate is
close to the true treatment effect B-A. As m gets larger, the variance
goes to zero indicating a stable estimate has been achieved. But the
bias is constant so the mean squared error remains at 4 (root mean
squared error = 2 mmHg).</p>
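<p>This mean squared error arithmetic is easy to verify numerically. A minimal sketch in Python, using the hypothetical SD of 8 mmHg and bias of 2 mmHg from the example:</p>

```python
# MSE of the estimated B-A mean SBP difference from a biased EHR comparison
sd = 8.0    # SD of SBP across patients (mmHg), as assumed in the example
bias = 2.0  # systematic bias from the unmeasured confounder (mmHg)

def ehr_mse(m):
    """Variance of the two-group mean difference (64/m + 64/m) plus squared bias."""
    variance = 2 * sd ** 2 / m   # 128/m
    return variance + bias ** 2  # 128/m + 4

for m in (100, 1000, 100000):
    print(m, round(ehr_mse(m), 3))
# The variance term 128/m vanishes as m grows, but the MSE approaches
# bias^2 = 4 (root mean squared error -> 2 mmHg), as described above.
```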
<p>Now consider an RCT that is designed not to estimate the mean SBP for A
or the mean SBP for B but, as with all randomized trials, is designed to
estimate the B-A difference (treatment effect). If the trial randomized
m subjects per treatment group, the variance of the mean difference is
128/m and the mean squared error is also 128/m. The comparison of the
square root of mean squared errors for an EHR study and an equal-sized
RCT is depicted in the figure below. Here, we have even given the EHR
study the benefit of the doubt in assuming that SBP is measured as
accurately as would be the case in the RCT. This is unlikely, and so in
reality the results presented below are optimistic for the performance
of the EHR.</p>
<figure >
<img src="http://fharrell.com/img/mse.png" width="80%" />
</figure>
<p>EHR studies have the potential to provide far larger sample sizes than
RCTs, but note that an RCT with a total sample size of 64 subjects is as
informative as an EHR study with infinitely many patients. <strong>Bigger is
not better</strong>. What if the SBP measurements from the EHR, not collected
under any protocol, are less accurate than those collected under the RCT
protocol? Let’s exemplify that by setting the SD for SBP to 10 mmHg for
the EHR while leaving it as 8 mmHg for the RCT. For very large sample
sizes, bias trumps variance so the break-even point of 64 subjects
remains, but for non-large EHRs the increased variability of measured
SBPs harms the margin of error of the EHR estimate of the mean SBP difference.</p>
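<p>The break-even point can be checked the same way (a sketch, using the SDs of 8 and 10 mmHg and the 2 mmHg bias from the discussion above):</p>

```python
def rct_mse(m, sd=8.0):
    # Unbiased RCT: MSE equals the variance of the mean difference, 128/m
    return 2 * sd ** 2 / m

def ehr_mse(m, sd=10.0, bias=2.0):
    # Biased EHR with noisier SBP measurements: 200/m + 4
    return 2 * sd ** 2 / m + bias ** 2

# An RCT with 32 patients per group (64 total) already matches the best an
# EHR of unlimited size can do, since the EHR's MSE can never fall below 4:
print(rct_mse(32))        # 4.0
print(ehr_mse(10 ** 9))   # just above 4
```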
<p>We have addressed estimation error for the treatment effect, but note
that while an EHR-based statistical test for any treatment difference
will have excellent power for large n, this comes at the expense of
being far from preserving the type I error, which is essentially 1.0 due
to the estimation bias causing the two-sample statistical test to be
biased.</p>
<p>Interestingly, bias decreases the benefits achieved by larger sample
sizes to the extent that, in contrast to an unbiased RCT, the mean
squared error for an EHR of size 3000 in our example is nearly identical
to what it would be with an infinite sample size. While this disregards
the need for larger samples to target multiple treatments or distinct
patient populations, it does suggest that overcoming the specific
resource-intensive challenges associated with handling huge EHR samples
may yield fewer advances in medical treatment than anticipated by some,
if the effects of bias are considered.</p>
<p>There is a mantra heard in data science that you just need to “let the
data speak.” You can indeed learn much from observational data if
quality and completeness of data are high (this is for another
discussion; EHRs have major weaknesses in just these two aspects). But
data frequently teach us things that are <a href="https://www.youtube.com/watch?v=TGGGDpb04Yc" target="_blank">just plain
wrong</a>. This is due to a
variety of reasons, including seeing trends and patterns that can be
easily explained by pure noise. Moreover, treatment group comparisons in
EHRs can reflect both the effects of treatment and the effects of
specific prior patient conditions that led to the choice of treatment in
the first place, conditions that may not be captured in the EHR. The
latter problem is confounding by indication, and this can only be
overcome by randomization, strong assumptions, or having highquality
data on all the potential confounders (patient baseline characteristics
related to treatment selection and to outcome, which is rarely if ever possible).
Many clinical researchers relying on EHRs do not take the time to even
list the relevant patient characteristics before rationalizing that the
EHR is adequate. To make matters worse, EHRs frequently do not provide
accurate data on when patients started and stopped treatment.
Furthermore, the availability of patient outcomes can depend on the very
course of treatment and treatment response under study. For example,
when a trial protocol is not in place, lab tests are not ordered at
prespecified times but because of a changing patient condition. If EHRs
cannot provide a reliable estimate of the average treatment effect, how
could it provide reliable estimates of differential treatment benefit
(HTE)?</p>
<p>Regarding the problem with signal vs. noise in “let the data speak”, we
envision a clinician watching someone playing a slot machine in Las
Vegas. The clinician observes that a small jackpot was hit after 17
pulls of the lever, and now has a model for success: go to a random slot
machine with enough money to make 17 pulls. Here the problem is not a
biased sample but pure noise.</p>
<p>Observational data, when complete and accurate, can form the basis for
accurate predictions. But what are predictions really good for?
Generally speaking, predictions can be used to estimate likely patient
outcomes given prevailing clinical practice and treatment choices, with
typical adherence to treatment. Prediction is good for natural history
studies and for counseling patients about their likely outcomes. What is
needed for selecting optimum treatments is an answer to the “what if”
question: what is the likely outcome of this patient were she given
treatment A vs. were she given treatment B? This is inherently a problem
of causal inference, which is why such questions are best answered using
experimental designs, such as RCTs. When there is evidence that the
complete, accurate observational data captured all confounders and eliminated
confounding by indication, then and only then can observational data be
a substitute for RCTs in making smart treatment choices.</p>
<p>What is a good global strategy for making optimum decisions for
individual patients? Much more could be said, but for starters consider
the following steps:</p>
<ul>
<li>Obtain the best covariate-adjusted estimate of relative treatment
effect (e.g., odds ratio, hazard ratio) from an RCT. Check whether
this estimate is constant or whether it depends on patient
characteristics (i.e., whether heterogeneity of treatment effect
exists on the relative scale). One possible strategy, using fully
specified interaction tests adjusted for all main effects, is in
<em><a href="http://fharrell.com/links" target="_blank">Biostatistics for Biomedical Research</a></em>
in the Analysis of Covariance chapter.</li>
<li>Develop a predictive model from complete, accurate observational
data, and perform strong internal validation using the bootstrap to
verify absolute calibration accuracy. Use this model to handle risk
magnification whereby absolute treatment benefits are greater for
sicker patients in most cases.</li>
<li>Apply the relative treatment effects from the RCT, separately for
treatment A and treatment B, to the estimated outcome risk from the
observational data to obtain estimates of absolute treatment benefit
(B vs. A) for the patient. See the first figure below which relates
a hazard ratio to absolute improvement in survival probability.</li>
<li>Develop a nomogram using the RCT data to estimate absolute treatment
benefit for an individual patient. See the second figure below whose
bottom axis is the difference between two logistic regression
models. (Both figures are from
<a href="http://fharrell.com/links" target="_blank">BBR</a> Chapter 13)</li>
<li>For more about such strategies, see Stephen Senn’s
<a href="https://www.slideshare.net/StephenSenn1/realworldmodified" target="_blank">presentation</a>.</li>
</ul>
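<p>The third step, applying a relative effect from an RCT to a patient's baseline risk, can be sketched as follows. This is only an illustration, not the BBR nomogram itself; the hazard ratio of 0.8 is the value from the RCT example above, and the conversion uses the proportional hazards relationship S<sub>B</sub>(t) = S<sub>A</sub>(t)<sup>HR</sup>:</p>

```python
# Translate a constant hazard ratio into absolute risk reduction at 3 years,
# given a patient's baseline (treatment A) 3y risk, using S_B = S_A ** HR
hr = 0.8  # covariate-adjusted B:A hazard ratio, as in the RCT example above

def absolute_benefit(baseline_risk, hr):
    surv_a = 1.0 - baseline_risk
    surv_b = surv_a ** hr       # proportional-hazards relationship
    return surv_b - surv_a      # = risk_A - risk_B, absolute benefit of B

for risk in (0.05, 0.2, 0.5):
    print(risk, round(absolute_benefit(risk, hr), 4))
# Risk magnification: the same relative effect yields a larger absolute
# benefit for patients at higher baseline risk.
```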
<figure >
<img src="http://fharrell.com/img/hrvssurv.png" width="90%" />
</figure>
<figure >
<img src="http://fharrell.com/img/gustonomogram.png" width="90%" />
</figure>

Statistical Errors in the Medical Literature
http://fharrell.com/post/errmed/
Sat, 08 Apr 2017 08:36:00 +0000
http://fharrell.com/post/errmed/
<ol>
<li><a href="#pval">Misinterpretation of P-values and Main Study Results</a></li>
<li><a href="#catg">Dichotomania</a></li>
<li><a href="#change">Problems With Change Scores</a></li>
<li><a href="#subgroup">Improper Subgrouping</a></li>
<li><a href="#serial">Serial Data and Response Trajectories</a></li>
</ol>
<hr />
<p>As Doug Altman famously wrote in his <em><a href="http://www.bmj.com/content/308/6924/283" target="_blank">Scandal of Poor Medical
Research</a></em> in BMJ in 1994, the
quality of how statistical principles and analysis methods are applied
in medical research is quite poor. According to Doug and to many others
such as <a href="http://blogs.bmj.com/bmj/2014/01/31/richardsmithmedicalresearchstillascandal" target="_blank">Richard Smith</a>,
the problems have only gotten worse. The purpose of this blog article
is to contain a running list of new papers in major medical journals
that are statistically problematic, based on my random encounters with
the literature.</p>
<p>One of the most pervasive problems in the medical literature (and in
other subject areas) is misuse and misinterpretation of p-values as
detailed <a href="http://www.fharrell.com/2017/02/alitanyofproblemswithpvalues.html" target="_blank">here</a>,
and chief among these issues is perhaps the <em><a href="http://www.bmj.com/content/311/7003/485" target="_blank">absence of evidence is not
evidence of absence</a> error</em>
written about so clearly by Altman and Bland. The following thought
will likely rattle many biomedical researchers but I’ve concluded that
most of the gross misinterpretation of large p-values by falsely
inferring that a treatment is not effective is caused by (1) the
investigators not being brave enough to conclude “We haven’t learned
anything from this study”, i.e., they feel compelled to believe that
their investments of time and money must be worth something, (2)
journals accepting such papers without demanding a proper statistical
interpretation in the conclusion. One example of proper wording would
be “This study rules out, with 0.95 confidence, a reduction in the odds
of death by more than a factor of 2.” Ronald Fisher, when asked
how to interpret a large p-value, said “Get more data.” Adoption of
Bayesian methods would <a href="http://www.fharrell.com/2017/02/myjourneyfromfrequentisttobayesian.html" target="_blank">solve many
problems</a>
including this one. Whether a p-value is small or large, a Bayesian can
compute the posterior probability of similarity of outcomes of two
treatments (e.g., Prob(0.85 < odds ratio < 1/0.85)), and the
researcher will often find that this probability is not large enough to
draw a conclusion of similarity. On the other hand, what if even under
a skeptical prior distribution the Bayesian posterior probability of
efficacy were 0.8 in a “negative” trial? Would you choose for yourself
the standard therapy when it had a 0.2 chance of being better than the
new drug? [Note: I am not talking here about regulatory decisions.]
Imagine a Bayesian world where it is standard to report the results for
the primary endpoint using language such as:</p>
<ul>
<li>The probability of any efficacy is 0.94 (so the probability of
non-efficacy is 0.06).</li>
<li>The probability of efficacy greater than a factor of 1.2 is 0.78
(odds ratio < 1/1.2).</li>
<li>The probability of similarity to within a factor of 1.2 is 0.3.</li>
<li>The probability that the true odds ratio is between [0.6, 0.99] is
0.95 (credible interval; doesn’t use the long-run tendency of
confidence intervals to include the true value for 0.95 of
confidence intervals computed).</li>
</ul>
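<p>Each of these statements is a one-liner once one has draws from the posterior distribution of the odds ratio. The sketch below uses simulated normal draws on the log odds ratio scale purely for illustration; a real analysis would use draws from the actual posterior:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical posterior draws of the log odds ratio (illustration only)
or_draws = np.exp(rng.normal(loc=np.log(0.78), scale=0.12, size=100_000))

p_efficacy = np.mean(or_draws < 1)              # P(any efficacy)
p_nonefficacy = 1 - p_efficacy                  # P(non-efficacy)
p_big_effect = np.mean(or_draws < 1 / 1.2)      # P(OR < 1/1.2)
p_similar = np.mean((or_draws > 1 / 1.2) & (or_draws < 1.2))
ci = np.quantile(or_draws, [0.025, 0.975])      # 0.95 credible interval
print(p_efficacy, p_big_effect, p_similar, ci)
```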
<p>In a so-called “negative” trial we frequently see the phrase “treatment
B was not significantly different from treatment A” without considering
how little information that carries. Was the power really adequate? Is
the author talking about an observed statistic (probably yes) or the
true unknown treatment effect? Why should we care more about
statistical significance than clinical significance? The phrase “was
not significantly different” seems to be a way to avoid the real issues
of interpretation of large p-values. Since my #1 area of study is
statistical modeling, especially predictive modeling, I pay a lot of
attention to model development and model validation as done in the
medical literature, and I routinely encounter published papers where the
authors do not have basic understanding of the statistical principles
involved. This seems to be especially true when a statistician is not
among the paper’s authors. I’ll be commenting on papers in which I
encounter statistical modeling, validation, or interpretation problems.</p>
<p><a name="pval"></a></p>
<h3 id="misinterpretationofpvaluesandofmainstudyresults">Misinterpretation of P-values and of Main Study Results</h3>
<p>One of the most problematic examples I’ve seen is in the March 2017
paper <a href="http://www.nejm.org/doi/full/10.1056/nejmoa1616218#t=article" target="_blank">Levosimendan in Patients with Left Ventricular Dysfunction
Undergoing Cardiac
Surgery</a> by
Rajendra Mehta in the New England Journal of Medicine. The study was
designed to detect a miracle: a 35% relative odds reduction with drug
compared to placebo, and used a power requirement of only 0.8 (type II
error a whopping 0.2). [The study also used some questionable
alpha-spending that Bayesians would find quite odd.] For the primary
endpoint, the adjusted odds ratio was 1.00 with 0.99 confidence interval
[0.66, 1.54] and p=0.98. Yet the authors concluded “Levosimendan was
not associated with a rate of the composite of death, renal-replacement
therapy, perioperative myocardial infarction, or use of a mechanical
cardiac assist device that was lower than the rate with placebo among
high-risk patients undergoing cardiac surgery with the use of
cardiopulmonary bypass.” Their own data are consistent with a 34%
reduction (as well as a 54% increase)! Almost nothing was learned from
this underpowered study. It may have been too disconcerting for the
authors and the journal editor to have written “We were only able to
rule out a massive benefit of drug.” [Note: two treatments can have
agreement in outcome probabilities by chance just as they can have
differences by chance.] It would be interesting to see the Bayesian
posterior probability that the true unknown odds ratio is in [0.85,
1/0.85]. The primary endpoint is the union of death, dialysis, MI, or
use of a cardiac assist device. This counts these four endpoints as
equally bad. An ordinal response variable would have yielded more
statistical information/precision and perhaps increased power. And
instead of dealing with multiplicity issues and alpha-spending, the
multiple endpoints could have been dealt with more elegantly with a
Bayesian analysis. For example, one could easily compute the joint
probability that the odds ratio for the primary endpoint is less than
0.8 and the odds ratio for the secondary endpoint is less than 1 [the
secondary endpoint was death or assist device, and is harder to
demonstrate because of its lower incidence, and is perhaps more of a
“hard endpoint”]. In the Bayesian world of forward directly relevant
probabilities there is no need to consider multiplicity. There is only
a need to state the assertions for which one wants to compute current
probabilities.</p>
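<p>The joint probability suggested here is equally direct given posterior draws for both odds ratios. The correlated normal draws below are purely illustrative (the means, covariance, and seed are invented for the sketch):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical correlated posterior draws of the two log odds ratios
# (primary and secondary endpoints); illustration only
cov = [[0.04, 0.02], [0.02, 0.09]]
draws = rng.multivariate_normal([np.log(0.85), np.log(0.90)], cov, size=100_000)
or_primary = np.exp(draws[:, 0])
or_secondary = np.exp(draws[:, 1])

# One directly relevant probability; no multiplicity adjustment is needed
p_joint = np.mean((or_primary < 0.8) & (or_secondary < 1))
print(p_joint)
```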
<p>The paper also contains inappropriate assessments of interactions with
treatment using subgroup analysis with arbitrary cutpoints on continuous
baseline variables and failure to adjust for other main effects when
doing the subgroup analysis.</p>
<p>This paper had a fine statistician as a coauthor. I can only conclude
that the pressure to avoid disappointment with a conclusion of spending
a lot of money with little to show for it was in play.
Why was such an underpowered study launched? Why do researchers attempt
“hail Mary passes”? Is a study that is likely to be futile fully
ethical? Do medical journals allow this to happen because of some
vested interest?</p>
<h4 id="similarexamples">Similar Examples</h4>
<p>Perhaps the above example is no worse than many. Examples of “absence
of evidence” misinterpretations abound. Consider the
<a href="http://jamanetwork.com/journals/jama/articleabstract/2612911" target="_blank">JAMA</a>
paper by Kawazoe et al published 2017-04-04. They concluded that
“Mortality at 28 days was not significantly different in the
dexmedetomidine group vs the control group (19 patients [22.8%] vs 28
patients [30.8%]; hazard ratio, 0.69; 95% CI, 0.38-1.22; P
= .20).” The point estimate was a reduction in hazard of death by
31% and the data are consistent with the reduction being as large as
62%! Or look at
<a href="http://jamanetwork.com/journals/jama/articleabstract/2613159" target="_blank">this</a>
2017-03-21 JAMA article in which the authors concluded “Among healthy
postmenopausal older women with a mean baseline serum 25-hydroxyvitamin
D level of 32.8 ng/mL, supplementation with vitamin D<sub>3</sub> and calcium
compared with placebo did not result in a significantly lower risk of
all-type cancer at 4 years.” even though the observed hazard ratio was
0.7, with lower confidence limit of a whopping 53% reduction in the
incidence of cancer. And the 0.7 was an <em>unadjusted</em> hazard ratio; the
hazard ratio could well have been more impressive had covariate
adjustment been used to account for outcome heterogeneity within each
treatment arm.</p>
<p>An <a name="pcisham"></a> incredibly high-profile
paper published online 2017-11-02 in <em>The Lancet</em> demonstrates a lack of
understanding of some statistical issues. In <a href="http://www.thelancet.com/journals/lancet/article/PIIS01406736(17)327149/fulltext" target="_blank">Percutaneous coronary
intervention in stable angina (ORBITA): a double-blind, randomised
controlled
trial</a>
by Rasha Al-Lamee et al, the authors (or was it the journal editor?)
boldly claimed “In patients with medically treated angina and severe
coronary stenosis, PCI did not increase exercise time by more than the
effect of a placebo procedure.” The authors are to be congratulated on
using a rigorous sham control, but the authors, reviewers, and editor
allowed a classic <em>absence of evidence is not evidence of absence</em> error
to be made in attempting to interpret p=0.2 for the primary analysis of
exercise time in this small (n=200) RCT. In doing so they ignored the
useful (but flawed; see below) 0.95 confidence interval of this effect
of [-8.9, 42] seconds of exercise time increase for PCI. Thus their
data are consistent with a 42 second increase in exercise time by real
PCI. It is also important to note that the authors fell into the <a href="#change">change
from baseline trap</a> by disrespecting their own parallel group design. They should have asked the covariateadjusted question: For two patients starting with the same exercise capacity, one assigned PCI and one assigned PCI sham, what is the average difference in followup exercise time?</p>
<p><strong>But</strong> there are other ways to view this study. Sham studies are
difficult to fund, and it is difficult to recruit large numbers of patients for them.
Criticizing the interpretation of the statistical analysis fails to
recognize the value of the study. One value is the study’s ruling out an
exercise time improvement greater than 42s (with 0.95 confidence). If,
as several cardiologists have told me, 42s is not very meaningful to the
patient, then the study is definitive and clinically relevant. I just
wish that authors and especially editors would use exactly correct
language in abstracts of articles. For this trial, suitable language
would have been along these lines: The study did not find evidence
against the null hypothesis of no change in exercise time (p=0.2), but
was able to (with 0.95 confidence) rule out an effect larger than 42s. A
Bayesian analysis would have been even more clinically useful. For
example, one might find that the posterior probability that the increase
in exercise time with PCI is less than 20s is 0.97. And our infatuation
with two-tailed p-values comes into play here. A Bayesian posterior
probability of <em>any</em> improvement might be around 0.88, far more
“positive” than what someone who misunderstands p-values would conclude
from an “insignificant” p-value. Other thoughts concerning the ORBITA
trial may be found
<a href="http://www.fharrell.com/2017/11/statisticalcriticismiseasyineedto.html" target="_blank">here</a>.</p>
<p><a name="catg"></a></p>
<h3 id="dichotomania">Dichotomania</h3>
<p>Dichotomania, as discussed by <a href="https://www.researchgate.net/profile/Stephen_Senn/publication/221689734_Dichotomania_an_obsessive_compulsive_disorder_that_is_badly_affecting_the_quality_of_analysis_of_pharmaceutical_trials/links/0fcfd5109734cb6268000000.pdf?origin=publication_list" target="_blank">Stephen
Senn</a>,
is a very prevalent problem in medical and epidemiologic research.
Categorization of continuous variables for analysis is inefficient at
best and <a href="https://www.ncbi.nlm.nih.gov/pubmed/24475020" target="_blank">misleading and arbitrary at
worst</a>. This JAMA paper
by <a href="http://jamanetwork.com/journals/jama/articleabstract/2620089" target="_blank">VISION study investigators</a> “Association
of Postoperative High-Sensitivity Troponin Levels With Myocardial Injury
and 30-Day Mortality Among Patients Undergoing Noncardiac Surgery” is an
excellent example of bad statistical practice that limits the amount of
information provided by the study. The authors categorized
highsensitivity troponin T levels measured postop and related these to
the incidence of death. They used four intervals of troponin, and there
is important heterogeneity of patients within these intervals. This is
especially true for the last interval (> 1000 ng/L). Mortality may
be much higher for troponin values that are much larger than 1000. The
relationship should have been analyzed with a continuous analysis, e.g.,
logistic regression with a regression spline for troponin, nonparametric
smoother, etc. The final result could be presented in a simple line
graph with confidence bands. An example of dichotomania that may not be
surpassed for some time is <a href="http://qualitysafety.bmj.com/content/early/2017/04/17/bmjqs2016006239" target="_blank">Simplification of the HOSPITAL Score for
Predicting 30-day Readmissions</a>
by Carole E Aubert, et al in <em>BMJ Quality and Safety</em> 2017-04-17. The
authors arbitrarily dichotomized several important predictors, resulting
in a major loss of information, then dichotomized their resulting
predictive score, sacrificing much of what information remained. The
authors failed to grasp probabilities, resulting in risk of 30-day
readmission of “unlikely” and “likely”. The categorization of predictor
variables leaves demonstrable outcome heterogeneity within the intervals
of predictor values. Then taking an already oversimplified predictive
score and dichotomizing it is essentially saying to the reader “We don’t
like the integer score we just went to the trouble to develop.” I now
have serious doubts about the thoroughness of reviews at <em>BMJ Quality
and Safety</em>.</p>
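<p>The continuous analysis recommended above for the troponin data can be sketched with simulated data. Everything below, including the data-generating model and knot placement, is hypothetical; in R this would be a one-liner with <code>lrm</code> and <code>rcs</code> from the <code>rms</code> package:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(5.0, 1.2, size=n)             # log troponin level, simulated
true_p = 1 / (1 + np.exp(-(-6 + 0.8 * x)))   # smooth hypothetical risk curve
death = rng.random(n) < true_p

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell's parameterization)."""
    k = np.asarray(knots, float)
    t1, tk, tk1 = k[0], k[-1], k[-2]
    cub = lambda u: np.clip(u, 0, None) ** 3
    cols = [x]
    for tj in k[:-2]:
        cols.append((cub(x - tj)
                     - cub(x - tk1) * (tk - tj) / (tk - tk1)
                     + cub(x - tk) * (tk1 - tj) / (tk - tk1)) / (tk - t1) ** 2)
    return np.column_stack(cols)

knots = np.quantile(x, [0.05, 0.35, 0.65, 0.95])
X = np.column_stack([np.ones(n), rcs_basis(x, knots)])

# Maximum-likelihood logistic fit via Newton-Raphson
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    w = p * (1 - p)
    beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (death - p))

p_hat = 1 / (1 + np.exp(-X @ beta))  # smooth fitted risk curve, no cutpoints
```

The fitted probabilities, plotted against troponin with confidence bands, would give the simple line graph described above without discarding within-interval heterogeneity.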
<p>A very highprofile paper <a name="alcohol"></a> was
published in BMJ on 2017-06-06: <a href="http://www.bmj.com/content/357/bmj.j2353" target="_blank">Moderate alcohol consumption as risk
factor for adverse brain outcomes and cognitive decline: longitudinal
cohort study</a> by Anya Topiwala
et al. The authors had a golden opportunity to estimate the
doseresponse relationship between amount of alcohol consumed and
quantitative brain changes. Instead the authors squandered the data by
doing analyzes that either assumed that responses are linear in alcohol
consumption or worse, by splitting consumption into 6 heterogeneous
intervals when in fact consumption was shown in their Figure 3 to have a
nice continuous distribution. How much more informative (and
statistically powerful) it would have been to fit a quadratic or a
restricted cubic spline function to consumption to estimate the
continuous doseresponse curve.</p>
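<p>As a sketch of what such a continuous fit involves, here is a minimal numpy construction of a restricted cubic spline basis fit by least squares. The simulated data and knot placement are hypothetical; a real analysis would use a regression package:</p>

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis: a linear term plus k-2 nonlinear
    terms, constrained to be linear beyond the boundary knots."""
    k = np.asarray(knots, dtype=float)
    t1, tk, tk1 = k[0], k[-1], k[-2]
    cub = lambda u: np.where(u > 0, u ** 3, 0.0)
    cols = [x]  # the linear term
    for tj in k[:-2]:
        term = (cub(x - tj)
                - cub(x - tk1) * (tk - tj) / (tk - tk1)
                + cub(x - tk) * (tk1 - tj) / (tk - tk1))
        cols.append(term / (tk - t1) ** 2)  # scale to the order of x
    return np.column_stack(cols)

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 500)
y = np.sin(x / 2) + rng.normal(0, 0.2, 500)   # hypothetical nonlinear truth
knots = np.quantile(x, [0.05, 0.275, 0.5, 0.725, 0.95])
X = np.column_stack([np.ones_like(x), rcs_basis(x, knots)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ beta   # smooth fitted dose-response curve
```

<p>The fitted values trace the curved relationship without any categorization, and by construction the fit is linear in the tails beyond the boundary knots.</p>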
<p>The NEJM <a name="dbpcut"></a> keeps giving us great
teaching examples with its 2017-08-03 edition. In <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1704154" target="_blank">Angiotensin II for
the treatment of vasodilatory
shock</a> by Ashish
Khanna et al, the authors constructed a bizarre response variable: “The
primary end point was a response with respect to mean arterial pressure
at hour 3 after the start of infusion, with response defined as an
increase from baseline of at least 10 mm Hg or an increase to at least
75 mm Hg, without an increase in the dose of background vasopressors.”
This form of dichotomania has been discredited by <a href="http://www.citeulike.org/user/harrelfe/article/13265588" target="_blank">Stephen
Senn</a>, who
provided a similar example in which he decoded the response function to
show that the lucky patient is one (in the NEJM case) who has a starting
blood pressure of 74 mm Hg. His example is below:</p>
<figure >
<img src="http://fharrell.com/img/dichotomaniaFig3.png" width="60%" />
</figure>
<p>When a clinical trial’s response variable is one that is arbitrary,
loses information and power, is difficult to interpret, and means
different things for different patients, expect trouble.</p>
<p><a name="change"></a></p>
<h3 id="changefrombaseline">Change from Baseline</h3>
<p>Many authors and pharmaceutical clinical trialists make the mistake of
analyzing change from baseline instead of making the raw follow-up
measurements the primary outcomes, covariate-adjusted for baseline. To
compute change scores requires many assumptions to hold, e.g.:</p>
<ol>
<li>the variable is not used as an inclusion/exclusion criterion for the
study, otherwise regression to the mean will be strong</li>
<li>if the variable is used to select patients for the study, a second
post-enrollment baseline is measured and this baseline is the one
used for all subsequent analysis</li>
<li>the post value must be linearly related to the pre value</li>
<li>the variable must be perfectly transformed so that subtraction
“works” and the result is not baseline-dependent</li>
<li>the variable must not have floor and ceiling effects</li>
<li>the variable must have a smooth distribution</li>
<li>the slope of the pre value vs. the follow-up measurement must be
close to 1.0 when both variables are properly transformed (using the
same transformation on both)</li>
</ol>
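<p>The regression-to-the-mean issue behind points 1 and 2 is easy to see by simulation. In this hypothetical sketch there is no treatment effect whatsoever, yet selecting patients on a noisy baseline manufactures an apparent change; a second, post-qualification baseline largely cancels the artifact:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
truth = rng.normal(0, 1, n)           # each patient's stable underlying value
pre = truth + rng.normal(0, 1, n)     # noisy baseline used for qualification
post = truth + rng.normal(0, 1, n)    # noisy follow-up; no treatment effect

qualified = pre > 1                   # inclusion criterion applied to pre
naive_change = (post - pre)[qualified].mean()    # biased toward "improvement"

# a second baseline measured after qualification cancels most of the bias
pre2 = truth + rng.normal(0, 1, n)
honest_change = (post - pre2)[qualified].mean()  # near the true value of 0
```

<p>With these settings the naive mean change is strongly negative even though nothing changed, while the change from the second baseline hovers near zero.</p>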
<p>Details about problems with analyzing change may be found in
<a href="http://fharrell.com/links" target="_blank">BBR</a> Section 14.4
and <a href="http://biostat.mc.vanderbilt.edu/MeasureChange" target="_blank">here</a>, and
references may be found
<a href="http://www.citeulike.org/user/harrelfe/tag/change" target="_blank">here</a>. See also
<a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3286439" target="_blank">this</a>. A general
problem with the approach is that when Y is ordinal but not
interval-scaled, differences in Y may no longer be ordinal. So analysis
of change loses the opportunity to do a robust, powerful analysis using
a covariateadjusted ordinal response model such as the proportional
odds or proportional hazards model. Such ordinal response models do not
require one to be correct in how to transform Y. Regarding 3. above, if
pre is not linearly related to post, there is no transformation that can
make a change score work.</p>
<p>Regarding 7. above, often the baseline is not as relevant as thought and
the slope will be less than 1. When the treatment can cure every
patient, the slope will be zero. Sometimes the relationship between
baseline and follow-up Y is not even linear, as in one example I’ve seen
based on the Hamilton D depression scale.</p>
<p>The purpose of a parallelgroup randomized clinical trial is to compare
the parallel groups, not to compare a patient with herself at baseline.
The central question is for two patients with the same pre measurement
value of x, one given treatment A and the other treatment B, will the
patients tend to have different post-treatment values? This is exactly
what analysis of covariance assesses. Within-patient change is affected
strongly by regression to the mean and measurement error. When the
baseline value is one of the patient inclusion/exclusion criteria, the
only meaningful change score requires one to have a second baseline
measurement post patient qualification to cancel out much of the
regression-to-the-mean effect. It is the second baseline that would be
subtracted from the follow-up measurement.</p>
<p>The savvy researcher knows that analysis of covariance is required to
“rescue” a change score analysis. This effectively cancels out the
change score and gives the right answer even if the slope of post on pre
is not 1.0. But this works only in the linear model case, and it can be
confusing to have the pre variable on both the left and right hand sides
of the statistical model. And if Y is ordinal but not interval-scaled,
the difference in two ordinal variables is no longer even ordinal. Think
of how meaningless differences from baseline in ordinal pain categories
are. A <strong>major problem</strong> in the use of change score summaries, even when
a correct analysis of covariance has been done, is that many papers and
drug product labels still quote change scores out of context.</p>
<p>Patient-reported outcome scales are particularly problematic. An
article published 2017-05-07 in <a href="https://doi.org/10.1001/jama.2017.5103" target="_blank">JAMA</a>, like
many other articles, makes the error of trusting change from baseline as
an appropriate analysis variable. Mean change from baseline may not
apply to anyone in the trial. Consider a 5-point ordinal pain scale
with values Y=1,2,3,4,5. Patients starting with no pain (Y=1) cannot
improve, so their mean change must be zero. Patients starting at Y=5
have the most opportunity to improve, so their mean change will be
large. A treatment that improves pain scores by an average of one point
may average a two-point improvement for patients for whom any
improvement is possible. Stating mean changes out of context of the
baseline state can be meaningless.</p>
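<p>A toy computation makes the point; the 5-point scale and the one-point improvement rule are hypothetical:</p>

```python
import numpy as np

y0 = np.repeat([1, 2, 3, 4, 5], 20)   # baseline pain for 100 patients
y1 = np.maximum(y0 - 1, 1)            # everyone improves 1 point where possible
change = y1 - y0

# mean change by baseline level: 0 for Y=1 (floor), -1 for everyone else,
# so the overall mean change describes the improvement of no patient group
by_baseline = {v: change[y0 == v].mean() for v in range(1, 6)}
overall = change.mean()
```

<p>Here the overall mean change of −0.8 applies to no one: patients at the floor cannot change, and everyone else changed by exactly one point.</p>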
<p>The NEJM paper <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1700089" target="_blank">Treatment of Endometriosis-Associated Pain with
Elagolix, an Oral GnRH Antagonist</a> <a name="endom"></a>
by Hugh Taylor et al is based on a disastrous set of analyses, combining
all the problems above. The authors computed change from baseline on
variables that do not have the correct properties for subtraction,
engaged in dichotomania by doing responder analysis, and in addition
used last observation carried forward to handle dropouts. A proper
analysis would have been a longitudinal analysis using all available
data that avoided imputation of post-dropout values and used raw
measurements as the responses. Most importantly, the twin clinical
trials randomized 872 women, and had proper analyses been done the
required sample size to achieve the same power would have been far less.
Besides the ethical issue of randomizing an unnecessarily large number
of women to inferior treatment, the approach used by the investigators
maximized the cost of these positive trials.</p>
<p>The NEJM paper <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1703501" target="_blank">Oral Glucocorticoid–Sparing Effect of Benralizumab in
Severe
Asthma</a> <a name="glucpct"></a> by
Parameswaran Nair et al. not only takes the problematic approach of using
change scores from baseline in a parallel-group design, but also uses
percent change from baseline as the raw data in the analysis. This is an
asymmetric measure for which arithmetic doesn’t work. For example,
suppose that one patient increases from 1 to 2 and another decreases
from 2 to 1. The corresponding percent changes are +100% and −50%. The
overall summary should be 0% change, not +25% as found by taking the
simple average. Doing arithmetic on percent change can essentially
involve adding ratios; ratios that are not proportions are never added;
they are multiplied. What was needed was an analysis of covariance of
raw oral glucocorticoid dose values adjusted for baseline after taking
an appropriate transformation of dose, or using a more robust
transformation-invariant ordinal semiparametric model on the raw
follow-up doses (e.g., proportional odds model).</p>
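<p>The 0%-vs-+25% arithmetic can be checked directly; averaging ratios on the log scale (i.e., taking a geometric mean) gives the symmetric answer:</p>

```python
import numpy as np

pre = np.array([1.0, 2.0])
post = np.array([2.0, 1.0])              # one patient doubles, one halves

pct_change = 100 * (post - pre) / pre    # +100% and -50%
naive_mean = pct_change.mean()           # +25%: the wrong summary

ratios = post / pre                      # 2.0 and 0.5
geo_mean = np.exp(np.log(ratios).mean()) # 1.0, i.e., 0% change overall
```

<p>Multiplying the ratios (2.0 × 0.5 = 1.0) and averaging on the log scale both recover the correct "no overall change" summary.</p>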
<p>In <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1611618" target="_blank">Trial of Cannabidiol for Drug-Resistant Seizures in the Dravet
Syndrome</a> <a name="dravet"></a>
in NEJM 2017-05-25, Orrin Devinsky et al take seizure frequency, which
might have a nice distribution such as the Poisson, and compute its
change from baseline, which is likely to have a hard-to-model
distribution. Once again, authors failed to recognize that the purpose
of a parallel group design is to compare the parallel groups. Then the
authors engaged in improper subtraction, improper use of percent change,
dichotomania, and loss of statistical power simultaneously: “The
percentage of patients who had at least a 50% reduction in
convulsive-seizure frequency was 43% with cannabidiol and 27% with
placebo (odds ratio, 2.00; 95% CI, 0.93 to 4.30; P=0.08).” The authors
went on to analyze the change in a discrete ordinal scale, where change
(subtraction) cannot have a meaning independent of the starting point at
baseline.</p>
<p>Troponins <a name="trop"></a> (T) are myocardial
proteins that are released when the heart is damaged. A high-sensitivity
T assay is a high-information cardiac biomarker used to diagnose
myocardial infarction and to assess prognosis. I have been hoping to
find a well-designed study with standardized serially measured T that is
optimally analyzed, to provide answers to the following questions:</p>
<ol>
<li>What is the shape of the relationship between the latest T
measurement and time until a clinical endpoint?</li>
<li>How does one use a continuous T to estimate risk?</li>
<li>If T were measured previously, does the previous measurement add any
predictive information to the current T?</li>
<li>If both the earlier and current T measurement are needed to predict
outcome, how should they be combined? Is what’s important the
difference of the two? Is it the ratio? Is it the difference in
square roots of T?</li>
<li>Is the 99<sup>th</sup> percentile of T for normal subjects useful as a
prognostic threshold?</li>
</ol>
<p>The 2017-05-16 <em>Circulation</em> paper <a href="http://circ.ahajournals.org/content/135/20/1911" target="_blank">Serial Measurement of
High-Sensitivity Troponin I and Cardiovascular Outcomes in Patients With
Type 2 Diabetes Mellitus in the EXAMINE
Trial</a> by Matthew
Cavender et al was based on a well-designed cardiovascular safety study
of diabetes in which uniformly measured high-sensitivity troponin I
measurements were made at baseline and six months after randomization to
the diabetes drug alogliptin. [Note: I was on the DSMB for this study]
The authors nicely envisioned a landmark analysis based on six-month
survivors. But instead of providing answers to the questions above, the
authors engaged in dichotomania and never checked whether changes in T
or changes in log T possessed the appropriate properties to be used as a
valid change score, i.e., they did not plot change in T vs. baseline T
or log T ratio vs. baseline T and demonstrate a flat line relationship.
Their statistical analysis used statistical methods from 50 years ago,
even doing the notorious “test for trend” that tests for a linear
correlation between an outcome and an integer category interval number.
The authors seem to be unaware of the many flexible tools developed
(especially starting in the mid 1980s) for statistical modeling that
would answer the questions posed above. Cavender et al. stratified T in
<1.9 ng/L, 1.9–<10 ng/L, 10–<26 ng/L, and ≥26 ng/L. Fully <sup>1</sup>⁄<sub>2</sub>
of the patients were in the second interval. Except for the first
interval (T below the lower detection limit) the groups are
heterogeneous with regard to outcome risks. And there are no data from
this study or previous studies that validate these cutpoints. To
validate them, the relationship between T and outcome risk would have to
be shown to be discontinuous at the cutpoints, and flat between them.</p>
<p>From their paper we still don’t know how to use T continuously, and we
don’t know whether baseline T is informative once a clinician has
obtained an updated T. The inclusion of a 3D block diagram in the
supplemental material is symptomatic of the data presentation problems
in this paper.</p>
<p>It’s not as though T hasn’t been analyzed correctly. In a 1996 <a href="http://www.nejm.org/doi/full/10.1056/NEJM199610313351801" target="_blank">NEJM
paper</a>, Ohman
et al used a nonparametric smoother to estimate the continuous
relationship between T and 30-day risk. Instead, Cavender et al. created
arbitrary heterogeneous intervals of both baseline and 6m T, then
created various arbitrary ways to look at change from baseline and its
relationship to risk.</p>
<p>An analysis that would have answered my questions would have been to</p>
<ol>
<li>Fit a standard Cox proportional hazards time-to-event model with the
usual baseline characteristics</li>
<li>Add to this model a tensor spline in the baseline and 6m T levels,
i.e., a smooth 3D relationship between baseline T, 6m T, and log
hazard, allowing for interaction, and restricting the 3D surface to
be smooth. See for example <a href="http://www.fharrell.com/links" target="_blank">BBR Figure
4.23</a>. One can do this by
using restricted cubic splines in both T’s and by computing
cross-products of these terms for the interactions. By fitting a
flexible smooth surface, the data would be able to speak for
themselves without imposing linearity or additivity assumptions and
without assuming that change or change in log T is how these
variables combine.</li>
<li>Do a formal test of whether baseline T (as either a main effect or
as an effect modifier of the 6m T effect, i.e., interaction effect)
is associated with outcome when controlling for 6m T and ordinary
baseline variables</li>
<li>Quantify the prognostic value added by baseline T by computing the
fraction of likelihood ratio chisquare due to both T’s combined
that is explained by baseline T. Do likewise to show the added value
of 6m T. Details about these methods may be found in <a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank">Regression
Modeling Strategies</a>, <em>2<sup>nd</sup>
edition</em></li>
</ol>
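<p>Step 2's tensor spline amounts to cross-products of two one-dimensional spline bases. A rough numpy sketch of the basis construction follows; the data are simulated, and ordinary least squares stands in for the Cox partial likelihood purely to keep the example self-contained:</p>

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis: a linear term plus k-2 nonlinear
    terms that are constrained to be linear beyond the boundary knots."""
    k = np.asarray(knots, dtype=float)
    t1, tk, tk1 = k[0], k[-1], k[-2]
    cub = lambda u: np.where(u > 0, u ** 3, 0.0)
    cols = [x]
    for tj in k[:-2]:
        cols.append((cub(x - tj)
                     - cub(x - tk1) * (tk - tj) / (tk - tk1)
                     + cub(x - tk) * (tk1 - tj) / (tk - tk1))
                    / (tk - t1) ** 2)
    return np.column_stack(cols)

rng = np.random.default_rng(2)
n = 1000
t0 = rng.lognormal(1, 1, n)      # hypothetical baseline troponin
t6 = rng.lognormal(1, 1, n)      # hypothetical 6m troponin
# a smooth "log hazard" depending nonlinearly on both measurements
y = np.log1p(t6) + 0.3 * np.log1p(t0) + rng.normal(0, 0.1, n)

q = [0.05, 0.35, 0.65, 0.95]
B0 = rcs_basis(np.log(t0), np.quantile(np.log(t0), q))
B6 = rcs_basis(np.log(t6), np.quantile(np.log(t6), q))
# tensor product: main effects plus all pairwise cross-products (interaction)
inter = np.column_stack([B0[:, i] * B6[:, j]
                         for i in range(B0.shape[1])
                         for j in range(B6.shape[1])])
X = np.column_stack([np.ones(n), B0, B6, inter])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - np.sum((y - X @ beta) ** 2) / np.sum((y - y.mean()) ** 2)
```

<p>The fitted surface recovers the smooth relationship without assuming linearity, additivity, or that change (or log ratio) is the right way to combine the two measurements.</p>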
<p>Without proper analyses of T as a continuous variable, the reader is
left with confusion as to how to really use T in practice, and is given
no insight into whether changes are relevant or the baseline T can be
ignored when a later T is obtained. In all the clinical outcome studies
I’ve analyzed (including repeated LV ejection fractions and serum
creatinines), the latest measurement has been what really mattered, and
it hasn’t mattered very much how the patient got there. As long as
continuous markers are categorized, clinicians are going to get
suboptimal risk prediction and are going to find that more markers need
to be added to the model to recover the information lost by categorizing
the original markers. They will also continue to be surprised that other
researchers find different “cutpoints”, not realizing that when things
don’t exist, people will forever argue about their manifestations.</p>
<p><a name="subgroup"></a></p>
<h3 id="impropersubgrouping">Improper Subgrouping</h3>
<p>The JAMA Internal Medicine Paper <a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2628971" target="_blank">Effect of Statin Treatment vs Usual
Care on Primary Cardiovascular Prevention Among Older
Adults</a>
by Benjamin Han et al makes the classic statistical error of attempting
to learn about differences in treatment effectiveness by subgrouping
rather than by correctly modeling interactions. They compounded the
error by not adjusting for covariates when comparing treatments in the
subgroups, and even worse, by subgrouping on a variable for which
grouping is ill-defined and information-losing: age. They used age
intervals of 65–74 and 75+. A proper analysis would have been, for
example, modeling age as a smooth nonlinear function (e.g., using a
restricted cubic spline) and interacting this function with treatment to
allow for a high-resolution, non-arbitrary analysis that allows for
nonlinear interaction. Results could be displayed by showing the
estimated treatment hazard ratio and confidence bands (y-axis) vs.
continuous age (x-axis). The authors’ analysis avoids the question of a
dose-response relationship between age and treatment effect. A full
strategy for interaction modeling for assessing heterogeneity of
treatment effect (AKA <em>precision medicine</em>) may be found in the analysis
of covariance chapter in <a href="http://biostat.mc.vanderbilt.edu/ClinStat" target="_blank">Biostatistics for Biomedical
Research</a>. To make matters
worse, the above paper included patients with a sharp cutoff of 65 years
of age as the lower limit. How much more informative it would have been
to have a linearly increasing (in age) enrollment function that reaches
a probability of 1.0 at 65y. Assuming that something magic happens at
age 65 with regard to cholesterol reduction is undoubtedly a mistake.</p>
<p><a name="serial"></a></p>
<h3 id="serialdataandresponsetrajectories">Serial Data and Response Trajectories</h3>
<p>Serial data (aka longitudinal data) with multiple follow-up assessments
per patient presents special challenges and opportunities. My preferred
analysis strategy uses full likelihood or Bayesian continuous-time
analysis, using generalized least squares or mixed effects models. This
allows each patient to have different measurement times, analysis of the
data using actual days since randomization instead of clinic visit
number, and non-random dropouts as long as the missing data are missing
at random. Missing at random here means that given the baseline
variables and the previous follow-up measurements, the current
measurement is missing completely at random. Imputation is not needed.
In the <em>Hypertension</em> July 2017 article <a href="https://doi.org/10.1161/HYPERTENSIONAHA.117.09221" target="_blank">Heterogeneity in Early
Responses in ALLHAT (Antihypertensive and LipidLowering Treatment to
Prevent Heart Attack
Trial)</a> by Sanket
Dhruva et al, the authors did advanced statistical analysis that is a
level above the papers discussed elsewhere in this article. However,
their claim of avoiding dichotomania is unfounded. The authors were
primarily interested in the relationship between blood pressures
measured at randomization, 1m, 3m, and 6m with post-6m outcomes, and they
correctly envisioned the analysis as a landmark analysis of patients who
were event-free at 6m. They did a careful cluster analysis of blood
pressure trajectories from 0–6m. But their chosen method assumes that
the variety of trajectories falls into two simple homogeneous trajectory
classes (immediate responders and all others). Trajectories of
continuous measurements, like the continuous measurements themselves,
rarely fall into discrete categories with shape and level homogeneity
within the categories. The analyses would in my opinion have been
better, and would have been simpler, had everything been considered on a
continuum.</p>
<p>With landmark analysis we now have 4 baseline measurements: the new
baseline (previously called the 6m blood pressure) and 3 historical
measurements. One can use these as 4 covariates to predict time until
a clinical post-6m outcome using a standard time-to-event model such as
the Cox proportional hazards model. In doing so, we are estimating the
prognosis associated with every possible trajectory and we can solve for
the trajectory that yields the best outcome. We can also do a formal
statistical test for whether the trajectories can be summarized more
simply than with a 4-dimensional construct, e.g., whether the final
blood pressure contains all the prognostic information. Besides
specifying the model with baseline covariates (in addition to other
original baseline covariates), one also has the option of creating a
tall and thin dataset with 4 records per patient (if correlations are
accounted for, e.g., cluster sandwich or cluster bootstrap covariance
estimates) and modeling outcome using updated covariates and possible
interactions with time to allow for time-varying blood pressure
effects.</p>
<p>A logistic regression trick described in my book <em>Regression Modeling
Strategies</em> comes in handy for modeling how baseline characteristics
such as sex, age, or randomized treatment relate to the trajectories.
Here one predicts the baseline variable of interest using the four blood
pressures. By studying the 4 regression coefficients one can see exactly
how the trajectories differ between patients grouped by the baseline
variable. This includes studying differences in trajectories by
treatment with no dichotomization. For example, if the composite (chunk)
test for an association between treatment and the 4 blood pressures in
the logistic model predicting treatment is significant, that implies
that the reverse is true: one or more
of the blood pressures is associated with treatment. Suppose for example
that a 4 d.f. test demonstrates some association, the 1 d.f. for the
first blood pressure is very significant, and the 3 d.f. test for the
last 3 blood pressures is not. This would be interpreted as the
treatment having an early effect that wears off shortly thereafter.
[For this particular study, with the first measurement being made
pre-randomization, such a result would indicate failure of randomization
and no blood-pressure response to treatment of any kind.] Were the 4
regression coefficients to be negative and in descending order, this
would indicate a progressive reduction in blood pressure due to
treatment.</p>
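<p>The trick can be sketched with plain numpy. Everything here is hypothetical simulated data (treatment affecting only the second measurement, i.e., an early effect that wears off), with a small Newton-Raphson routine standing in for a packaged logistic regression:</p>

```python
import numpy as np

def logistic_fit(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-np.clip(X @ beta, -30, 30)))
        w = p * (1 - p)
        beta += np.linalg.solve((X * w[:, None]).T @ X, X.T @ (y - p))
    return beta

rng = np.random.default_rng(3)
n = 5000
tx = rng.integers(0, 2, n).astype(float)   # randomized treatment indicator
bp = 140 + rng.normal(0, 10, (n, 4))       # 4 serial blood pressures
bp[:, 1] -= 8 * tx                         # treatment lowers only the 2nd BP

# reverse regression: predict treatment from the 4 pressures (centered)
X = np.column_stack([np.ones(n), bp - bp.mean(axis=0)])
beta = logistic_fit(X, tx)
# beta[2] (2nd pressure) should be clearly negative; the others near zero
```

<p>The pattern of the four coefficients, read together, describes exactly how the trajectories differ by treatment without any grouping of the trajectories.</p>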
<p>Returning to the originally stated preferred analysis when blood
pressure is the outcome of interest (and not time to clinical events),
one can use generalized least squares to predict the longitudinal blood
pressure trends from treatment. This will be more efficient and also
allows one to adjust for baseline variables other than treatment. It
would probably be best to make the original baseline blood pressure a
baseline variable and to have 3 serial measurements in the longitudinal
model. Time would usually be modeled continuously (e.g., using a
restricted cubic spline function). But in the Dhruva article the
measurements were made at a small number of discrete times, so time
could be considered a categorical variable with 3 levels.</p>
<p>I have had misgivings <a name="dietqual"></a> for
many years about the quality of statistical methods used by the Channing
Lab at Harvard, as well as misgivings about the quality of nutritional
epidemiology research in general. My misgivings were again confirmed in
the 2017-07-13 NEJM publication <a href="http://www.nejm.org/doi/full/10.1056/NEJMoa1613502" target="_blank">Association of Changes in Diet Quality
with Total and Cause-Specific
Mortality</a> by
Mercedes SotosPrieto et al. There are the usual concerns about
confounding and possible alternate explanations, which the authors did
not fully deal with (and why did the authors not include an analysis
that characterized which types of subjects tended to have changes in
their dietary quality?). But this paper manages to combine dichotomania
with probably improper change score analysis and hardtointerpret
results. It started off as a nicely conceptualized landmark analysis in
which dietary quality scores were measured during both an 8-year and a
16-year period, and these scores were related to total and cause-specific
mortality following those landmark periods. But then things went
seriously wrong. The authors computed change in diet scores from the
start to the end of the qualification period, did not validate that
these are proper change scores (see above for more about that), and
engaged in percentiling as if the number of neighbors with worse diets
than you is what predicts your mortality rather than the absolute
quality of your own diet. They then grouped the changes into quintile
groups without justification, and examined change quintile group
effects in Cox time-to-event models. It is telling that the baseline
dietary scores varied greatly over the change quintiles. The authors
emphasized the 20-percentile increase in each score when interpreting
results. What does that mean? How is it related to absolute diet quality
scores?</p>
<p>The high quality dataset available to the authors could have been used
to answer real questions of interest using statistical analyses that did
not have hidden assumptions. From their analyses we have no idea of how
the subjects’ diet trajectories affected mortality, or indeed whether
the change in diet quality was as important as the most recent diet
quality for the subject, ignoring how the subject arrived at that point
at the end of the qualification period. What would be an informative
analysis? Start with the simpler one: use a smooth tensor spline
interaction surface to estimate the relative log hazard of mortality, and
construct a 3D plot with initial diet quality on the x-axis, final
(landmark) diet quality on the y-axis, and relative log hazard on the
z-axis. Then the more in-depth modeling analysis can be done in which
one uses multiple measures of diet quality over time and relates the
trajectory (its shape, average level, etc.) to hazard of death. Suppose
that absolute diet quality was measured at four baseline points. These
four variables could be related to outcome and one could solve for the
trajectory that was associated with the lowest mortality. For a study
that is almost wholly statistical, it is a shame that modern statistical
methods appeared to not even be considered. And for heaven’s sake
<strong>analyze the raw diet scales and do not percentile them</strong>.</p>

Subjective Ranking of Quality of Research by Subject Matter Area
http://fharrell.com/post/rankqual/
Thu, 16 Mar 2017 11:53:00 +0000
http://fharrell.com/post/rankqual/
<p>While being engaged in biomedical research for a few decades and watching
reproducibility of research as a whole, I’ve developed my own ranking of
reliability/quality/usefulness of research across several subject matter
areas. This list is far from complete. Let’s start with a subjective
list of what I perceive as the areas in which published research is
least likely to be both true and useful. The following list is ordered
in ascending order of quality, with the most problematic area listed
first. You’ll notice that there is a vast number of areas not listed for
which I have minimal experience.
<strong>Some excellent research is done in all subject
areas.</strong> This list is based on my perception of the <em>proportion</em> of
publications in the indicated area that are rigorously scientific,
reproducible, and useful.</p>
<h4 id="subjectareaswithleastreliablereproducibleusefulresearch">Subject Areas With Least Reliable/Reproducible/Useful Research</h4>
<ol>
<li>any area where there is no pre-specified statistical analysis plan
and the analysis can change on the fly when initial results are
disappointing</li>
<li>behavioral psychology</li>
<li>studies of corporations to find characteristics of “winners”;
regression to the mean kicks in making predictions useless for
changing your company</li>
<li>animal experiments on fewer than 30 animals</li>
<li>discovery genetics not making use of biology while doing
large-scale variant/gene screening</li>
<li>nutritional epidemiology</li>
<li>electronic health record research reaching clinical conclusions
without understanding confounding by indication and other
limitations of data</li>
<li>pre-post studies with no randomization</li>
<li>non-nutritional epidemiology not having a fully pre-specified
statistical analysis plan [few epidemiology
papers use state-of-the-art statistical methods and have a
sensitivity analysis related to unmeasured
confounders]</li>
<li>prediction studies based on dirty and inadequate data</li>
<li>personalized medicine</li>
<li>biomarkers</li>
<li>observational treatment comparisons that do not qualify for the
second list (below)</li>
<li>small adaptive dose-finding cancer trials (3+3 etc.)</li>
</ol>
<h4 id="subjectareaswithmostreliablereproducibleusefulresearch">Subject Areas With Most Reliable/Reproducible/Useful Research</h4>
<p>The most reliable and useful research areas are listed first. All of
the following are assumed to (1) have a prospective pre-specified
statistical analysis plan and (2) purposeful, prospective,
quality-controlled data acquisition (yes, this applies to high-quality
non-randomized observational research).</p>
<ol>
<li>randomized crossover studies</li>
<li>multicenter randomized experiments</li>
<li>single-center randomized experiments with non-overly-optimistic
sample sizes</li>
<li>adaptive randomized clinical trials with large sample sizes</li>
<li>physics</li>
<li>pharmaceutical industry research that is overseen by
FDA</li>
<li>cardiovascular research</li>
<li>observational research [however only a very small minority of
observational research projects have a prospective analysis plan and
high enough data quality to qualify for this
list]</li>
</ol>
<h4 id="somesuggestedremedies">Some Suggested Remedies</h4>
<p>Peer review of research grants and manuscripts is done primarily by
experts in the subject matter area under study. Most journal editors
and grant reviewers are not expert in biostatistics. Every grant
application and submitted manuscript should undergo rigorous
methodologic peer review by methodologic experts such as
biostatisticians and epidemiologists. All data analyses should be
driven by a prospective statistical analysis plan, and the entire
self-contained data manipulation and analysis code should be submitted
to journals so that potential reproducibility and adherence to the
statistical analysis plan can be confirmed. Readers should have access
to the data in most cases and should be able to reproduce all study
findings using the authors’ code, plus run their own analyses on the
authors’ data to check robustness of findings. Medical journals are reluctant to (1) publish critical letters to the editor and (2) retract papers. This has to change.</p>
<p>In academia, too much credit is still given to
the quantity of publications and not to their quality and
reproducibility. This too must change. The pharmaceutical industry has
FDA to validate their research. The NIH does not serve this role for
academia.</p>
<p>Rochelle Tractenberg, Chair of the American
Statistical Association Committee on Professional Ethics and a
biostatistician at Georgetown University, said in a 2017-02-22 interview
with <em>The Australian</em> that many questionable studies would not have been
published had formal statistical reviews been done. When she reviews a
paper she starts with the premise that the statistical analysis was
incorrectly executed. She stated that “Bad statistics is bad
science.”</p>

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules
http://fharrell.com/post/classdamage/
Wed, 01 Mar 2017 07:30:00 +0000
http://fharrell.com/post/classdamage/
<p>I discussed the many advantages of probability estimation over
classification. Here I discuss a particular problem related to
classification, namely the harm done by using improper accuracy scoring
rules. Accuracy scores are used to drive feature selection, parameter
estimation, and for measuring predictive performance on models derived
using any optimization algorithm. For this discussion let Y denote a
binary (no/yes, false/true, 0/1) event being predicted, with Y=0
denoting a nonevent and Y=1 denoting that the event occurred.</p>
<p>As discussed <a href="https://en.wikipedia.org/wiki/Scoring_rule" target="_blank">here</a> and <a href="http://psiexp.ss.uci.edu/research/papers/MerkleSteyvers.pdf" target="_blank">here</a>,
a <em>proper accuracy scoring</em> rule is a metric applied to probability
forecasts. It is a metric that is optimized when the forecasted
probabilities are identical to the true outcome probabilities. A
<em>continuous</em> accuracy scoring rule is a metric that makes full use of
the entire range of predicted probabilities and does not have a large
jump because of an infinitesimal change in a predicted probability. The
two most commonly used proper scoring rules are the quadratic error
measure, i.e., mean squared error or <a href="https://en.wikipedia.org/wiki/Brier_score" target="_blank">Brier
score</a>, and the logarithmic
scoring rule, which is a linear translation of the log likelihood for a
binary outcome model (Bernoulli trials). The logarithmic rule gives
more credit to extreme predictions that are “right”, but a single
prediction of 1.0 when Y=0 or 0.0 when Y=1 results in an infinite
score no matter how accurate all the other predictions were. Because of the
optimality properties of maximum likelihood estimation, the logarithmic
scoring rule is in a sense the gold standard, but we more commonly use
the Brier score because of its easier interpretation and its ready
decomposition into various metrics measuring calibration-in-the-small,
calibration-in-the-large, and discrimination.</p>
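<p>As a quick illustration (my own, not from the original post), both rules can be computed directly from a set of probability forecasts; note how a forecast of exactly 0 or 1 that turns out wrong would make the logarithmic score infinite:</p>

```python
import math

def brier_score(p, y):
    """Quadratic proper scoring rule: mean squared error of the forecasts."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

def log_score(p, y):
    """Logarithmic proper scoring rule: mean negative log likelihood of a
    Bernoulli model (lower is better); infinite if a forecast of exactly
    0 or 1 turns out wrong."""
    return -sum(math.log(pi if yi == 1 else 1 - pi)
                for pi, yi in zip(p, y)) / len(y)

p = [0.9, 0.2, 0.7, 0.4]  # forecasted probabilities that Y=1
y = [1, 0, 1, 1]          # observed outcomes
print(brier_score(p, y))  # 0.125
print(log_score(p, y))    # about 0.400
```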
<p><em>Classification accuracy</em> is a discontinuous scoring rule. It
implicitly or explicitly uses thresholds for probabilities, and moving a
prediction from 0.0001 below the threshold to 0.0001 above the
threshold results in a full accuracy change of 1/N. Classification
accuracy is also an improper scoring rule. It can be optimized by
choosing the wrong predictive features and giving them the wrong
weights. This is best shown by a simple example that appears
in <a href="http://fharrell.com/links" target="_blank">Biostatistics for Biomedical Research</a> Chapter 18 in which
400 simulated subjects have an overall fraction of Y=1 of 0.57. Consider
the use of binary logistic regression to predict the probability that
Y=1 given a certain set of covariates, and classify a subject as having
Y=1 if the predicted probability exceeds 0.5. We simulate values of age
and sex and simulate binary values of Y according to a logistic model
with strong age and sex effects; the true log odds of Y=1 are
(age − 50) × 0.04 + 0.75 × [sex=m]. We fit four binary logistic models in order: a model containing only age
as a predictor, one containing only sex, one containing both age and
sex, and a model containing no predictors (i.e., it only has an
intercept parameter). The results are in the following table:</p>
<p><img src="http://fharrell.com/img/classaccexample.png" alt="" /></p>
<p>Both the gold standard likelihood ratio chi-square statistic and the
improper pure discrimination c-index (AUROC) indicate that both age and
sex are important predictors of Y. Yet the highest proportion correct
(classification accuracy) occurs when sex is ignored. According to the
improper score, the sex variable has negative information. It is
telling that a model that predicted Y=1 for every observation, i.e., one
that completely ignored age and sex and only has the intercept in the
model, would be 0.573 accurate, only slightly below the accuracy of
using sex alone to predict Y.</p>
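<p>The discontinuity described above is easy to see numerically (a minimal sketch with made-up forecasts, not the simulated data behind the table): nudging one predicted probability by 0.0002 across the 0.5 threshold changes classification accuracy by a full 1/N, while the continuous Brier score barely moves.</p>

```python
def accuracy(p, y, cutoff=0.5):
    """Proportion classified 'correctly': discontinuous in the forecasts."""
    return sum((pi > cutoff) == (yi == 1) for pi, yi in zip(p, y)) / len(y)

def brier(p, y):
    """Brier score: continuous in the forecasts (lower is better)."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y)) / len(y)

y  = [1, 0, 1, 0, 1]
p1 = [0.9, 0.1, 0.4999, 0.2, 0.8]  # third forecast just below the cutoff
p2 = [0.9, 0.1, 0.5001, 0.2, 0.8]  # same forecast nudged up by 0.0002

print(accuracy(p1, y), accuracy(p2, y))  # 0.8 vs 1.0: a jump of 1/N
print(brier(p1, y), brier(p2, y))        # differ only by about 4e-05
```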
<p>The use of a discontinuous improper accuracy score such as proportion
“classified” “correctly” has led to countless misleading findings in
bioinformatics, machine learning, and data science. In some extreme
cases the machine learning expert failed to note that their claimed
predictive accuracy was less than that achieved by ignoring the data,
e.g., by just predicting Y=1 when the observed prevalence of Y=1 was
0.98 whereas their extensive data analysis yielded an accuracy of 0.97.
As discussed <a href="http://fharrell.com/post/classification/" target="_blank">here</a>,
fans of “classifiers” sometimes subsample from observations in the most
frequent outcome category (here Y=1) to get an artificial <sup>50</sup>⁄<sub>50</sub> balance
of Y=0 and Y=1 when developing their classifier. Fans of such deficient
notions of accuracy fail to realize that their classifier will not apply
to a population with a much different prevalence of Y=1 than 0.5.</p>
<p><em>Sensitivity</em> and <em>specificity</em> are one-sided or conditional versions of
classification accuracy. As such they are also discontinuous improper
accuracy scores, and optimizing them will result in the wrong model.
<a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank">Regression Modeling Strategies</a> Chapter 10 goes into
more problems with classification accuracy, and discusses many measures
of the quality of probability estimates. The text contains suggested
measures to emphasize such as the Brier score, pseudo R-squared (a simple
function of the logarithmic scoring rule), c-index, and especially
smooth nonparametric calibration plots to demonstrate absolute accuracy
of estimated probabilities.</p>
<p>An excellent discussion with more information may be found
<a href="https://stats.stackexchange.com/questions/312780" target="_blank">here</a>.</p>

My Journey From Frequentist to Bayesian Statistics
http://fharrell.com/post/journey/
Sun, 19 Feb 2017 10:23:00 +0000
http://fharrell.com/post/journey/
<p><small>
Type I error for smoke detector: probability of alarm given no fire=0.05<br>
Bayesian: probability of fire given current air data</p>
<p>Frequentist smoke alarm designed as most research is done:<br>
Set the alarm trigger so as to have a 0.8 chance of detecting an
inferno</p>
<p>Advantage of actionable evidence quantification:<br>
Set the alarm to trigger when the posterior probability of a fire
exceeds 0.02 while at home and at 0.01 while away
</small></p>
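<p>The contrast above can be made concrete with Bayes’ rule. All numbers below are hypothetical, chosen only to show how different P(alarm | no fire) is from P(fire | alarm):</p>

```python
# Hypothetical smoke-detector numbers (for illustration only).
p_fire = 0.001      # prior probability that a fire is in progress
sens = 0.8          # P(alarm | fire): chance of detecting an inferno
false_alarm = 0.05  # P(alarm | no fire): the frequentist 'type I error'

# Bayes' rule: P(fire | alarm) = P(alarm | fire) P(fire) / P(alarm)
p_alarm = sens * p_fire + false_alarm * (1 - p_fire)
p_fire_given_alarm = sens * p_fire / p_alarm
print(round(p_fire_given_alarm, 4))  # 0.0158, nowhere near 1 - 0.05
```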
<p>If I had been taught Bayesian modeling before being taught the
frequentist paradigm, I’m sure I would have always been a Bayesian. I
started becoming a Bayesian about 1994 because of an <a href="http://www.citeulike.org/user/harrelfe/article/13264891" target="_blank">influential
paper</a> by David
Spiegelhalter and because I worked in the same building at Duke
University as Don Berry. Two other things strongly contributed to my
thinking: difficulties explaining p-values and confidence intervals
(especially the latter) to clinical researchers, and difficulty of
learning group sequential methods in clinical trials. When I talked
with Don and learned about the flexibility of the Bayesian approach to
clinical trials, and saw Spiegelhalter’s embrace of Bayesian methods
because of their problem-solving abilities, I was hooked. [Note: I’ve
heard Don say that he became Bayesian after multiple attempts to teach
statistics students the exact definition of a confidence interval. He
decided the concept was defective.]</p>
<p>At the time I was working on clinical trials at Duke and started to see
that multiplicity adjustments were arbitrary. This started with a
clinical trial coordinated by Duke in which low dose and high dose of a
new drug were to be compared to placebo, using an alpha cutoff of 0.03
for each comparison to adjust for multiplicity. The comparison of high
dose with placebo resulted in a p-value of 0.04 and the trial was
labeled completely “negative”, which seemed problematic to me. [Note:
the p-value was two-sided and thus didn’t give any special “credit” for
the treatment effect coming out in the right direction.]</p>
<p>I began to see that the hypothesis testing framework wasn’t always the
best approach to science, and that in biomedical research the typical
hypothesis was an artificial construct designed to placate a reviewer
who believed that an NIH grant’s specific aims must include null
hypotheses. I saw the contortions that investigators went through to
achieve this, came to see that questions are more relevant than
hypotheses, and estimation was even more important than questions.<br />
With Bayes, estimation is emphasized. I very much like Bayesian
modeling instead of hypothesis testing. I saw that a large number of
clinical trials were incorrectly interpreted when p>0.05 because the
investigators involved failed to realize that a p-value can only provide
evidence against a hypothesis. Investigators are motivated by “we spent
a lot of time and money and must have gained something from this
experiment.” The classic “<a href="http://www.bmj.com/content/311/7003/485" target="_blank">absence of evidence is not evidence of
absence</a>” error results,
whereas with Bayes it is easy to estimate the probability of similarity
of two treatments. Investigators will be surprised to know how little
we have learned from clinical trials that are not huge when p>0.05.</p>
<p>I listened to many discussions of famous clinical trialists debating
what should be the primary endpoint in a trial, the co-primary endpoint,
the secondary endpoints, co-secondary endpoints, etc. This was all
because of their paying attention to alpha-spending. I realized this
was all a game.</p>
<p>I came to not believe in the possibility of infinitely many repetitions
of identical experiments, as required to be envisioned in the
frequentist paradigm. When I looked more thoroughly into the
multiplicity problem, and sequential testing, and I looked at Bayesian
solutions, I became more of a believer in the approach. I learned that
posterior probabilities have a simple interpretation independent of the
stopping rule and frequency of data looks. I got involved in working
with the FDA and then consulting with pharmaceutical companies, and
started observing how multiple clinical endpoints were handled. I saw a
closed testing procedure where a company was seeking a superiority
claim for a new drug, and if there was insufficient evidence for such a
claim, they wanted to seek a noninferiority claim on another endpoint.
They developed a closed testing procedure that when diagrammed truly
looked like a train wreck. I felt there had to be a better approach, so
I sought to see how far posterior probabilities could be pushed. I
found that with MCMC simulation of Bayesian posterior draws I could
quite simply compute probabilities such as P(any efficacy), P(efficacy
more than trivial), P(noninferiority), P(efficacy on endpoint A and on
either endpoint B or endpoint C), and P(benefit on more than 2 of 5
endpoints). I realized that frequentist multiplicity problems came from
the chances you give data to be more extreme, not from the chances you
give assertions to be true.</p>
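<p>With posterior draws in hand, such probabilities are one-liners. A sketch, in which a normal sample stands in for real MCMC draws of a treatment effect (in practice these would come from a fitted Bayesian model, e.g., via Stan), and the “trivial” threshold of 0.2 is an assumed value:</p>

```python
import random

random.seed(1)
# Stand-in for MCMC output: 10,000 posterior draws of a treatment
# effect (positive = benefit). In practice these come from a fitted
# Bayesian model rather than from random.gauss.
draws = [random.gauss(0.5, 0.3) for _ in range(10_000)]

trivial = 0.2  # assumed smallest clinically meaningful effect

p_any_efficacy = sum(d > 0 for d in draws) / len(draws)
p_nontrivial = sum(d > trivial for d in draws) / len(draws)
print(p_any_efficacy)  # about 0.95
print(p_nontrivial)    # about 0.84
```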
<p>I enjoy the fact that posterior probabilities define their own error
probabilities, and that they count not only inefficacy but also harm.
If P(efficacy)=0.97, P(no effect or harm)=0.03. This is the
“regulator’s regret”, and type I error is not the error of major
interest (is it really even an ‘error’?). One minus a p-value is P(data
in general are less extreme than that observed if H0 is true) which is
the probability of an event I’m not that interested in.</p>
<p>The extreme amount of time I spent analyzing data led me to understand
other problems with the frequentist approach. Parameters are either in
a model or not in a model. We test for interactions with treatment and
hope that the p-value is not between 0.02 and 0.2. We either include
the interactions or exclude them, and the power for the interaction test
is modest. Bayesians have a prior for the differential treatment effect
and can easily have interactions “half in” the model. Dichotomous
irrevocable decisions are at the heart of many of the statistical
modeling problems we have today. I really like penalized maximum
likelihood estimation (which is really empirical Bayes) but once we have
a penalized model all of our frequentist inferential framework fails us.
No one can interpret a confidence interval for a biased (shrunken;
penalized) estimate. On the other hand, the Bayesian posterior
probability density function, after shrinkage is accomplished using
skeptical priors, is just as easy to interpret as had the prior been
flat. For another example, consider a categorical predictor variable
that we hope is predicting in an ordinal (monotonic) fashion. We tend
to either model it as ordinal or as completely unordered (using k − 1
indicator variables for k categories). A Bayesian would say “let’s use
a prior that favors monotonicity but allows larger sample sizes to
override this belief.”</p>
<p>Now that adaptive and sequential experiments are becoming more popular,
and a formal mechanism is needed to use data from one experiment to
inform a later experiment (a good example being the use of adult
clinical trial data to inform clinical trials on children when it is
difficult to enroll a sufficient number of children for the child data
to stand on their own), Bayes is needed more than ever. It took me a
while to realize something that is quite profound: A Bayesian solution
to a simple problem (e.g., a 2-group comparison of means) can be embedded
into a complex design (e.g., adaptive clinical trial) <strong>without
modification</strong>. Frequentist solutions require highly complex
modifications to work in the adaptive trial setting.</p>
<p>I met likelihoodist <a href="http://biostat.mc.vanderbilt.edu/JeffreyBlume" target="_blank">Jeffrey
Blume</a> in 2008 and
started to like the likelihood approach. It is more Bayesian than
frequentist. I plan to learn more about this paradigm. Jeffrey has an excellent <a href="http://statisticalevidence.com" target="_blank">web site</a>.</p>
<p>Several readers have asked me how I could believe all this and publish a
frequentist-based book such as <em>Regression Modeling Strategies</em>. There
are two primary reasons. First, I started writing the book before I
knew much about Bayes. Second, I performed a lot of simulation studies
that showed that purely empirical modelbuilding had a low chance of
capturing clinical phenomena correctly and of validating on new
datasets. I worked extensively with cardiologists such as Rob Califf,
Dan Mark, Mark Hlatky, David Prior, and Phil Harris, who gave me the
ideas for injecting clinical knowledge into model specification. From
that experience I wrote <em>Regression Modeling Strategies</em> in the most
Bayesian way I could without actually using specific Bayesian methods.
I did this by emphasizing subject-matter-guided model specification.
The section in the book about specification of interaction terms is
perhaps the best example. When I teach the full-semester version of my
course I interject Bayesian counterparts to many of the techniques
covered.</p>
<p>There are challenges in moving more to a Bayesian approach. The ones I
encounter most frequently are:</p>
<ol>
<li>Teaching clinical trialists to embrace Bayes when they already do in
spirit but not operationally. Unlearning things is much more
difficult than learning things.</li>
<li>How to work with sponsors, regulators, and NIH principal
investigators to specify the (usually skeptical) prior up front, and
to specify the amount of applicability assumed for previous data.</li>
<li>What is a Bayesian version of the multiple degree of freedom “chunk
test”? Partitioning sums of squares or the log likelihood into
components, e.g., combined test of interaction and combined test of
nonlinearities, is very easy and natural in the frequentist setting.</li>
<li>How do we specify priors for complex entities such as the degree of
monotonicity of the effect of a continuous predictor in a regression
model? The Bayesian approach to this will ultimately be more
satisfying, but operationalizing this is not easy.</li>
</ol>
<p>With new tools such as <a href="http://mc-stan.org/" target="_blank">Stan</a> and well written
accessible books such
as <a href="http://www.citeulike.org/user/harrelfe/article/14172337" target="_blank">Kruschke’s</a> it’s
getting to be easier to be Bayesian each day. The R
<a href="https://cran.r-project.org/web/packages/brms" target="_blank">brms</a> package, which uses
Stan, makes a large class of regression models even more accessible.</p>
<h3 id="update20171229">Update 2017-12-29</h3>
<p>Another reason for moving from frequentism to Bayes is that frequentist
ideas are so confusing that even expert statisticians frequently
misunderstand them, and are tricked into dichotomous thinking because of
the adoption of null hypothesis significance testing (NHST). The
<a href="http://www.blakemcshane.com/Papers/jasa_dichotomization.pdf" target="_blank">paper</a> by
BB McShane and D Gal in JASA demonstrates alarming errors in
interpretation by many authors of JASA papers. If those with a high
level of statistical training make frequent interpretation errors, could
frequentist statistics be fundamentally flawed? Yes! In McShane and
Gal’s paper they described two surveys sent to authors of JASA, as well
as to authors of articles not appearing in the statistical literature
(luckily for statisticians, the non-statisticians fared a bit worse).
Some of their key findings are as follows.</p>
<ol>
<li>When a p-value is present, (primarily frequentist) statisticians
confuse population vs. sample, especially if the p-value is large.
Even when directly asked whether patients <em>in this sample</em> fared
better on one treatment than the other, the respondents often
answered according to whether or not p<0.05. Dichotomous
thinking crept in.</li>
<li>When asked whether evidence from the data made it more or less
likely that a drug is beneficial in the population, many
statisticians again were swayed by the p-value and not by tendencies
indicated by the raw data. They failed to understand that your
chances are improved by “playing the odds”, and gave different
answers depending on whether one was playing the odds for an unknown
person vs. selecting treatment for themselves.</li>
<li>In previous studies by the authors, they found that “applied
researchers presented with not only a p-value but also with a
posterior probability based on a noninformative prior were less
likely to make dichotomization errors.”</li>
</ol>
<p>The authors also echoed Wasserstein, Lazar, and Cobb’s concern that we
are setting researchers up for failure: “we teach NHST because that’s
what the scientific community and journal editors use but they use NHST
because that’s what we teach them. Indeed, statistics at the
undergraduate level as well as at the graduate level in applied fields
is often taught in a rote and recipe-like manner that typically focuses
exclusively on the NHST paradigm.”</p>
<p>Some of the problems with frequentist statistics stem from the way in
which its methods are misused, especially with regard to dichotomization. But
an approach that is so easy to misuse and which sacrifices direct
inference in a futile attempt at objectivity still has fundamental
problems.</p>
<hr />
<p>Go <a href="https://news.ycombinator.com/item?id=13684429" target="_blank">here</a> for discussions about this article that are not on this blog.</p>

Interactive Statistical Graphics: Showing More By Showing Less
http://fharrell.com/post/interactivegraphicsless/
Sun, 05 Feb 2017 09:43:00 +0000
http://fharrell.com/post/interactivegraphicsless/
<p>Version 4 of the R <a href="http://biostat.mc.vanderbilt.edu/Hmisc" target="_blank"><code>Hmisc</code></a> package and version
5 of the R <a href="http://biostat.mc.vanderbilt.edu/rms" target="_blank"><code>rms</code></a> package
interface with interactive <a href="https://plot.ly/r/" target="_blank"><code>plotly</code></a> graphics, which
is an interface to the <code>D3</code> javascript graphics library. This allows
various results of statistical analyses to be viewed interactively, with
preprogrammed drill-down information. More examples will be added
here. We start with a video showing a new way to display survival
curves.</p>
<p>Note that plotly graphics are best used with RStudio Rmarkdown html
notebooks, and are distributed to reviewers as self-contained (but
somewhat large) html files. Printing is discouraged, but possible, using
snapshots of the interactive graphics.</p>
<p>Concerning the second bullet point below, boxplots have a high
ink:information ratio and hide bimodality and other data features. Many
statisticians prefer to use dot plots and violin plots. I liked those
methods for a while, then started to have trouble with the choice of a
smoothing bandwidth in violin plots, and found that dot plots do not
scale well to very large datasets, whereas spike histograms are useful
for all sample sizes. Users of dot charts have to have a dot stand for
more than one observation if N is large, and I found the process too
arbitrary. For spike histograms I typically use 100 or 200 bins. When
the number of distinct data values is below the specified number of
bins, I just do a frequency tabulation for all distinct data values,
rounding only when two of the values are very close to each other. A
spike histogram approximately reduces to a rug plot when there are no
ties in the data, and I very much like rug plots.</p>
<ul>
<li><code>rms survplotp</code> <a href="https://youtu.be/EoIB_Obddrk" target="_blank">video</a>: plotting
survival curves</li>
<li><code>Hmisc histboxp</code> <a href="http://data.vanderbilt.edu/fh/R/Hmisc/examples.html#better_demonstration_of_boxplot_replacement" target="_blank">interactive html
example</a>:
spike histograms plus selected quantiles, mean, and Gini’s mean
difference, a replacement for boxplots that shows all the data! Note
bimodal distributions and zero blood pressure values for patients
having a cardiac arrest.</li>
</ul>
<figure >
<img src="http://fharrell.com/img/histboxp.png" width="100%" />
</figure>
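<p>The binning rule described in the text above can be sketched as follows (my own reading of the described behavior, not the actual <code>Hmisc</code> code):</p>

```python
from collections import Counter

def spike_bins(x, nbins=100):
    """Spike-histogram tabulation: exact frequencies of distinct values
    when there are no more of them than bins; otherwise equal-width bins.
    (A sketch of the described behavior, not the Hmisc implementation.)"""
    distinct = sorted(set(x))
    if len(distinct) <= nbins:
        counts = Counter(x)
        return [(v, counts[v]) for v in distinct]
    lo, hi = min(x), max(x)
    width = (hi - lo) / nbins
    binned = Counter(min(int((v - lo) / width), nbins - 1) for v in x)
    return [(lo + (i + 0.5) * width, binned[i]) for i in sorted(binned)]

# Few distinct values: exact tabulation; with no ties at all this
# reduces to one spike per observation, i.e., a rug plot.
print(spike_bins([1, 2, 2, 3]))  # [(1, 1), (2, 2), (3, 1)]
```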

A Litany of Problems With p-values
http://fharrell.com/post/pvallitany/
Sun, 05 Feb 2017 08:39:00 +0000
http://fharrell.com/post/pvallitany/
<p>In my opinion, null hypothesis testing and p-values have done significant harm
to science. The purpose of this note is to catalog the many problems
caused by p-values. As readers post new problems in their comments,
more will be incorporated into the list, so this is a work in progress.</p>
<p>The American Statistical Association has done a great service by issuing
its <a href="http://www.amstat.org/asa/files/pdfs/PValueStatement.pdf" target="_blank">Statement on Statistical Significance and
P-values</a>.
Now it’s time to act. To create the needed motivation to change, we
need to fully describe the depth of the problem.</p>
<p>It is important to note that no statistical paradigm is perfect.
Statisticians should choose paradigms that solve the greatest number of
real problems and have the fewest faults. This is why I
believe that the Bayesian and likelihood paradigms should replace
frequentist inference.</p>
<p>Consider an assertion such as “the coin is fair”, “treatment A yields
the same blood pressure as treatment B”, “B yields lower blood pressure
than A”, or “B lowers blood pressure by at least 5mmHg more than A.” Consider
also a compound assertion such as “A lowers blood pressure by at least
3mmHg and does not raise the risk of stroke.”</p>
<h3 id="aproblemswithconditioning">A. Problems With Conditioning</h3>
<ol>
<li>p-values condition on what is unknown (the assertion of interest,
H<sub>0</sub>) and do not condition on what is known (the data).</li>
<li>This conditioning does not respect the flow of time and information;
p-values are backward probabilities.</li>
</ol>
<h3 id="bindirectness">B. Indirectness</h3>
<ol>
<li>Because of A above, p-values provide only indirect evidence and are
problematic as evidence metrics. They are sometimes monotonically
related to the evidence we need (e.g., when the prior distribution is
flat), but are not properly calibrated for decision making.</li>
<li>p-values are used to bring indirect evidence against an assertion
but cannot bring evidence in favor of the assertion.</li>
<li>As
detailed <a href="http://www.fharrell.com/2017/01/nullhypothesissignificancetesting.html" target="_blank">here</a>,
the idea of proof by contradiction is a stretch when working with
probabilities, so trying to quantify evidence for an assertion by
bringing evidence against its complement is on shaky ground.</li>
<li>Because of A, p-values are difficult to interpret and very few
non-statisticians get it right. The best article on
misinterpretations I’ve found
is <a href="http://www.citeulike.org/user/harrelfe/article/14042559" target="_blank">here</a>.</li>
</ol>
<h3 id="cproblemdefiningtheeventwhoseprobabilityiscomputed">C. Problem Defining the Event Whose Probability is Computed</h3>
<ol>
<li>In the continuous data case, the probability of getting a result as
extreme as that observed with our sample is zero, so the p-value is
the probability of getting a result <em>more extreme</em> than that
observed. Is this the correct point of reference?</li>
<li>How does <em>more extreme</em> get defined if there are sequential analyses
and multiple endpoints or subgroups? For sequential analyses, do we
consider only planned analyses, or also analyses intended to be run
even if they were not?</li>
</ol>
<h3 id="dproblemsactuallycomputingpvalues">D. Problems Actually Computing p-values</h3>
<ol>
<li>In some discrete data cases, e.g., comparing two proportions, there
is tremendous disagreement among statisticians about how p-values
should be calculated. In a famous 2x2 table from an ECMO adaptive
clinical trial, 13 p-values have been computed from the same data,
ranging from 0.001 to 1.0. And many statisticians do not realize
that Fisher’s so-called “exact” test is not very accurate in many
cases.</li>
<li>Outside of the binomial, exponential, and normal (with equal variance)
cases and a few others, p-values are actually very difficult to
compute exactly, and many p-values computed by statisticians are of
unknown accuracy (e.g., in logistic regression and mixed effects
models). The more non-quadratic the log likelihood function, the more
problematic this becomes in many cases.</li>
<li>One can compute (sometimes requiring simulation) the type I error of
many multistage procedures, but actually computing a p-value that
can be taken out of context can be quite difficult and sometimes
impossible. One example: one can control the false discovery
probability (usually incorrectly referred to as a rate), and ad hoc
modifications of nominal p-values have been proposed, but these are
not necessarily in line with the real definition of a p-value.</li>
</ol>
<h3 id="ethemultiplicitymess">E. The Multiplicity Mess</h3>
<ol>
<li>Frequentist statistics does not have a recipe or blueprint leading
to a unique solution for multiplicity problems, so when many
p-values are computed, the way they are penalized for multiple
comparisons results in endless arguments. A Bonferroni multiplicity
adjustment is consistent with a Bayesian prior distribution
specifying that the probability that all null hypotheses are true is
a constant no matter how many hypotheses are tested. By contrast,
Bayesian inference reflects the facts that P(A ∪ B) ≥ max(P(A),
P(B)) and P(A ∩ B) ≤ min(P(A), P(B)) when A and B are assertions
about a true effect.</li>
<li>There remains controversy over the choice of 1-tailed vs. 2-tailed
tests. The 2-tailed test can be thought of as a multiplicity
penalty for being potentially excited about either a positive effect
or a negative effect of a treatment. But few researchers want to
bring evidence that a treatment harms patients; a pharmaceutical
company would not seek a licensing claim of harm. So when one
computes the probability of obtaining an effect larger than that
observed if there is no true effect, why do we too often ignore the
sign of the effect and compute the (2-tailed) p-value?</li>
<li>Because it is a very difficult problem to compute p-values when the
assertion is compound, researchers using frequentist methods do not
attempt to provide simultaneous evidence regarding such assertions
and instead rely on ad hoc multiplicity adjustments.</li>
<li>Because of A1, statistical testing with multiple looks at the data,
e.g., in sequential data monitoring, is ad hoc and complex.
Scientific flexibility is discouraged. The p-value for an early
data look must be adjusted for future looks. The p-value at the
final data look must be adjusted for the earlier inconsequential
looks. Unblinded sample size re-estimation is another case in
point. If the sample size is expanded to gain more information,
there is a multiplicity problem and some of the methods commonly
used to analyze the final data effectively discount the first wave
of subjects. How can that make any scientific sense?</li>
<li>Most practitioners of frequentist inference do not understand that
multiplicity comes from chances you give data to be extreme, not
from chances you give true effects to be present.</li>
</ol>
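<p>Point 1 above can be made numerical: with k independent tests each run at level α, the chance of at least one false positive is 1 − (1 − α)<sup>k</sup>, and the Bonferroni adjustment tests each at α/k to restore control. A small sketch:</p>

```python
def familywise_error(alpha, k):
    """P(at least one false positive) among k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

k = 10
print(familywise_error(0.05, k))      # about 0.401: far above nominal 0.05
print(familywise_error(0.05 / k, k))  # about 0.049: Bonferroni restores control
```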
<h3 id="fproblemswithnontrivialhypotheses">F. Problems With NonTrivial Hypotheses</h3>
<ol>
<li>It is difficult to test nonpoint hypotheses such as “drug A is
similar to drug B”.</li>
<li>There is no straightforward way to test compound hypotheses coming
from logical unions and intersections.</li>
</ol>
<h3 id="ginabilitytoincorporatecontextandotherinformation">G. Inability to Incorporate Context and Other Information</h3>
<ol>
<li>Because extraordinary claims require extraordinary evidence, there
is a serious problem with the p-value’s inability to incorporate
context or prior evidence. A Bayesian analysis of the existence of
ESP would no doubt start with a very skeptical prior that would
require extraordinary data to overcome, but the bar for getting a
“significant” p-value is fairly low. Frequentist inference has a
greater risk of getting the direction of an effect wrong
(see <a href="http://andrewgelman.com/" target="_blank">here</a> for more).</li>
<li>p-values are unable to incorporate outside evidence. As a converse
to 1, strong prior beliefs are unable to be handled by p-values, and
in some cases this results in a lack of progress. Nate Silver
in <em>The Signal and the Noise</em> beautifully details how the conclusion
that cigarette smoking causes lung cancer was greatly delayed (with
a large negative effect on public health) because scientists
(especially Fisher) were caught up in the frequentist way of
thinking, dictating that only randomized trial data would yield a
valid p-value for testing cause and effect. A Bayesian prior that
was very strongly against the belief that smoking was causal is
obliterated by the incredibly strong observational data. Only by
incorporating prior skepticism could one make a strong conclusion
with nonrandomized data in the smoking-lung cancer debate.</li>
<li>p-values require subjective input from the producer of the data
rather than from the consumer of the data.</li>
</ol>
<h3 id="hproblemsinterpretingandactingonpositivefindings">H. Problems Interpreting and Acting on “Positive” Findings</h3>
<ol>
<li>With a large enough sample, a trivial effect can cause an
impressively small p-value (statistical significance ≠ clinical
significance).</li>
<li>Statisticians and subject matter researchers (especially the latter)
sought a “seal of approval” for their research by naming a cutoff on
what should be considered “statistically significant”, and a cutoff
of p=0.05 is most commonly used. Any time there is a threshold
there is a motive to game the system, and gaming (p-hacking) is
rampant. Hypotheses are exchanged if the original H<sub>0</sub> is not
rejected, subjects are excluded, and because statistical analysis
plans are not pre-specified as required in clinical trials and
regulatory activities, researchers and their all-too-accommodating
statisticians play with the analysis until something “significant”
emerges.</li>
<li>When the p-value is small, researchers act as though the point
estimate of the effect is a population value.</li>
<li>When the p-value is small, researchers believe that their conceptual
framework has been validated.</li>
</ol>
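<p>Point 1 above can be demonstrated with the normal approximation: a clinically trivial difference (here a hypothetical 0.2 units against a standard deviation of 10) yields an “impressive” p-value once the per-group sample size n is large enough:</p>

```python
import math

def two_sample_p(diff, sd, n):
    """Two-sided p-value for a two-sample z-test, n subjects per group."""
    z = diff / (sd * math.sqrt(2 / n))
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)

for n in (100, 10_000, 1_000_000):
    print(n, two_sample_p(0.2, 10, n))
# n = 100       -> p is about 0.89
# n = 10,000    -> p is about 0.16
# n = 1,000,000 -> p < 1e-40: 'significant' yet clinically meaningless
```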
<h3 id="iproblemsinterpretingandactingonnegativefindings">I. Problems Interpreting and Acting on “Negative” Findings</h3>
<ol>
<li>Because of B2, large p-values are uninformative and do not assist
the researcher in decision making (Fisher said that a large p-value
means “get more data”).</li>
</ol>
<hr />
<p>More recommended reading:</p>
<ul>
<li>William Briggs’ <a href="http://wmbriggs.com/post/9338" target="_blank">Everything Wrong With P-values Under One Roof</a></li>
</ul>