index – Statistical Thinking

Causal by Design

design

drug-development

drug-evaluation

evidence

inference

observational

RCT

reporting

2026

Well-controlled randomized experiments when analyzed in a randomization-respecting way are causal by design and need no causal calculus to infer causation. If an observational study has any hope of providing reliable causal inference regarding therapeutic comparisons, it must be prospectively designed. Incorporating target trial emulation in a non-designed retrospective observational study does not enable causal inference.

Measures of Central Tendency for an Asymmetric Distribution, and Confidence Intervals

bootstrap

computing

inference

r

2025

There are three widely applicable measures of central tendency for general continuous distributions: the mean, median, and pseudomedian (the mode is useful for describing smooth theoretical distributions but not so useful when attempting to estimate the mode empirically). Each measure has its own advantages and disadvantages, and the usual confidence intervals for the mean may be very inaccurate when the distribution is very asymmetric. The central limit theorem may be of no help. In this article I discuss tradeoffs of the three location measures and describe why the pseudomedian is perhaps the overall winner due to its combination of robustness, efficiency, and having an accurate confidence interval. I study CI coverage of 18 procedures for the mean, one exact and one approximate procedure for the median, and five procedures for the pseudomedian, for samples of size \(n=200\) drawn from a lognormal distribution. Various bootstrap procedures are included in the study. The goal of the confidence interval procedures is to achieve non-coverage probabilities that are close to the nominal 0.025 level in both tails. The usual standard deviation-based central limit theorem approach failed in both tails. The BCa bootstrap method was the most accurate for computing confidence limits for the mean, but the upper limit was too small with \(n=200\), having non-coverage probability of 0.086 for the right tail instead of the nominal 0.025. Three types of intervals for the more robust pseudomedian were extremely accurate, giving more reasons to use the pseudomedian as a primary location measure, whether or not the distribution is symmetric.
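
For readers unfamiliar with the pseudomedian, the following is a minimal Python sketch, assuming the usual Hodges-Lehmann definition (the median of all pairwise means, with each observation also paired with itself). The function name and the toy sample are illustrative assumptions and are not taken from the article, which works in R.

```python
import itertools
import statistics

def pseudomedian(x):
    """Hodges-Lehmann one-sample estimator: median of all Walsh averages."""
    # Walsh averages: (x_i + x_j) / 2 for every pair with i <= j
    walsh = [(a + b) / 2 for a, b in itertools.combinations_with_replacement(x, 2)]
    return statistics.median(walsh)

# Toy right-skewed sample (hypothetical values)
sample = [1, 2, 10]
print(statistics.mean(sample))    # pulled upward by the outlying 10
print(statistics.median(sample))  # 2
print(pseudomedian(sample))       # 3.75
```

In this toy sample the pseudomedian falls between the median and the mean. The all-pairs construction costs O(n²) memory, which is acceptable for a sketch but not for large samples.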

Bootstrap Confidence Limits for Bootstrap Overfitting-Corrected Model Performance

computing

prediction

r

regression

validation

2025

The Efron-Gong optimism bootstrap has been used for decades to obtain reliable estimates of likely performance of statistical models on new data. It accomplishes this by estimating the bias (optimism) from overfitting and subtracting that bias from apparent model performance indexes. No fast reliable method for computing confidence intervals for overfitting-corrected measures currently exists, so analysts may have false confidence in internal model validations, especially for small datasets. The purpose of this research is to empirically derive a satisfactory fast algorithm for computing the needed confidence intervals when the model is a binary logistic regression model. The approach is expected to work for a wide variety of models.
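
The optimism correction described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it substitutes a simple linear regression with R² as the performance index for the article's binary logistic model, it produces the point estimate only (not the confidence intervals the article is about), and every function name and simulated value is hypothetical.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def r_squared(coef, x, y):
    """R^2 of a fixed, already-fitted line evaluated on the data (x, y)."""
    a, b = coef
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

def optimism_bootstrap(x, y, reps=200, seed=1):
    """Return (apparent, estimated optimism, optimism-corrected) R^2."""
    rng = random.Random(seed)
    n = len(x)
    apparent = r_squared(fit_line(x, y), x, y)
    optimism = 0.0
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        coef = fit_line(xb, yb)
        # performance on the bootstrap sample minus performance on the original data
        optimism += r_squared(coef, xb, yb) - r_squared(coef, x, y)
    optimism /= reps
    return apparent, optimism, apparent - optimism

# Simulated data (hypothetical): weak linear signal plus noise
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + rng.gauss(0, 1) for xi in x]
apparent, optimism, corrected = optimism_bootstrap(x, y)
print(f"apparent R^2 = {apparent:.3f}, optimism = {optimism:.3f}, corrected = {corrected:.3f}")
```

The key design point is that each bootstrap model is scored twice, once on the sample it was fitted to and once on the original data, and the average difference is subtracted from the apparent index.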

Minimal-Assumption Estimation of Survival Probability vs. a Continuous Variable

computing

prediction

r

regression

survival-analysis

validation

2025

There is no straightforward nonparametric smoother for estimating a smooth relationship between a continuous variable and the probability of survival past a fixed time when censoring is present. Several flexible methods are compared with regard to estimation error, and recommendations are made on the basis of a simulation study for one data generating mechanism. The results are particularly applicable to estimation of smooth calibration curves with right-censored data.

Statistical Computing Approaches to Maximum Likelihood Estimation

computing

data-science

inference

likelihood

ordinal

prediction

r

regression

2024

Maximum likelihood estimation (MLE) is central to estimation and development of predictive models. Outside of linear models and simple estimators, MLE requires trial-and-error iterative algorithms to find the set of parameter values that maximizes the likelihood, i.e., makes the observed data most likely to have been observed under the statistical model. There are many iterative optimization algorithms and R programming paradigms to choose from. There are also many pre-processing steps to consider such as how initial parameter estimates are guessed and whether and how the design matrix of covariates is mean-centered or orthogonalized to remove collinearities. While re-writing the R rms package logistic regression function lrm I explored several of these issues. Comparisons of execution time in R vs. Fortran are given. Different coding styles in both R and Fortran are also explored. Hopefully some of these explorations will help others who may not have studied MLE optimization and related statistical computing algorithms.

Adjudication and Statistical Efficiency

classification

decision-making

diagnosis

endpoints

judgment

measurement

medical

design

RCT

accuracy-score

inference

ordinal

subgroup

2024

This article addresses some statistical issues related to adjudication of clinical conditions in clinical and epidemiologic studies, concentrating on maximizing statistical information, efficiency, and power. This has a lot to do with capturing disagreements between adjudicators and uncertainty within an adjudicator.

The Burden of Demonstrating Statistical Validity of Clusters

classification

data-reduction

diagnosis

medicine

personalized-medicine

subgroup

2024

Patient clustering, often described as the finding of new phenotypes, is being used with increasing frequency in the medical literature. Most of the applications of clustering of observations are not well thought out, not even considering whether observation clustering aligns with the clinical goals. And the resulting clusters are not validated even in a statistical way. This article describes some of the challenges of observation clustering, and challenges researchers to carefully check that found clusters are compact and contain the important statistical information in the variables on which clustering is based.

Hosting Web Content

computing

2024

This article is about lessons I’ve learned in building and maintaining web sites, with hopefully helpful recommendations.

Traditional Frequentist Inference Uses Unrealistic Priors

bayes

design

inference

hypothesis-testing

RCT

multiplicity

2024

Considering a simple fixed sample-size non-adaptive design in which standard frequentist inference agrees with non-informative prior-based Bayesian inference, it is argued that the implied assumption about the unknown effect made by frequentist inference (and Bayesian inference if the non-informative prior is actually used) is quite unrealistic.

Borrowing Information Across Outcomes

bayes

design

RCT

accuracy-score

inference

ordinal

2024

In randomized clinical trials, power can be greatly increased and sample size reduced by using an ordinal outcome instead of a binary one. The proportional odds model is the most popular model for analyzing ordinal outcomes, and it borrows treatment effect information across outcome levels to obtain a single overall treatment effect as an odds ratio. When deaths can occur, it is logical to have death as one of the ordinal categories. Consumers of the results frequently seek evidence of a mortality reduction even though they were not willing to fund a study large enough to be able to detect this with decent power. The same goes when assessing whether there is an increase in mortality, indicating a severe safety problem for the new treatment. The partial proportional odds model provides a continuous bridge between standalone evidence for a mortality effect and obtaining evidence using statistically richer information on the combination of nonfatal and fatal endpoints. A simulation demonstrates the relationship between the amount of borrowing of treatment effect across outcome levels and the Bayesian power for finding evidence for a mortality reduction.

Proportional Odds Model Power Calculations for Ordinal and Mixed Ordinal/Continuous Outcomes

inference

hypothesis-testing

regression

ordinal

change-scores

design

endpoints

medicine

sample-size

2024

This article has detailed examples with complete R code for computing frequentist power for ordinal, continuous, and mixed ordinal/continuous outcomes in two-group comparisons with equal sample sizes. Mixed outcomes allow one to easily handle clinical event overrides of continuous response variables. The proportional odds model is used throughout, and care is taken to convert odds ratios to differences in medians or means to aid in understanding effect sizes. Since the Wilcoxon test is a special case of the proportional odds model, the examples also show how to tailor sample size calculations to the Wilcoxon test, at least when there are no covariates.

The log-rank Test Assumes More Than the Cox Model

inference

hypothesis-testing

regression

2024

It is well known that the score test for comparing survival distributions between two groups without covariate adjustment, using a Cox proportional hazards (PH) model, is identical to the log-rank \(\chi^2\) test when there are no tied failure times. Yet there persists a belief that the log-rank test is somehow completely nonparametric and does not assume PH. Log-rank and Cox approaches can only disagree if the more commonly used likelihood ratio (LR) statistic from the Cox model disagrees with the log-rank statistic (and the Cox score statistic). This article shows that in fact the log-rank and Cox LR statistics agree to a remarkable degree, and furthermore the hazard ratio arising from the log-rank test also has remarkable agreement with the Cox model counterpart. Since both methods assume PH and the log-rank test assumes within-group homogeneity (because it doesn’t allow for covariate adjustment), the Cox model actually makes fewer assumptions than log-rank.

What Does a Statistical Method Assume?

inference

hypothesis-testing

regression

2024

Sometimes it is unclear exactly what a specific statistical estimator or analysis method is assuming. This is especially true for methods that at first glance appear to be nonparametric when in reality they are semiparametric. This article attempts to explain what it means to make different types of assumptions, and how to tell when a certain type of assumption is being made. It also describes the assumptions made by various commonly used statistical procedures.

Football Multiplicities

bayes

design

sequential

RCT

accuracy-score

backward-probability

decision-making

forward-probability

inference

multiplicity

prediction

probability

2024

Traditionally trained statisticians have much difficulty in accepting the absence of multiplicity issues with Bayesian sequential designs, i.e., that Bayesian posterior probabilities do not change interpretation or become miscalibrated just because a stopping rule is in effect. Most statisticians are used to dealing with backwards-information-flow probabilities which do have multiplicity issues, because they must deal with opportunities for data to be extreme. This leads them to believe that Bayesian methods must have some kind of hidden multiplicity problem. The chasm between forward and backwards probabilities is explored with a simple example involving continuous data looks where the ultimate truth is known. The stopping rule is the home NFL team having ≥ 0.9 probability of ultimately winning the game, and the correctness of the Bayesian-style forecast is readily checked.

Frank Harrell, Stephen Ruberg

How Does a Compound Symmetric Correlation Structure Translate to a Markov Model?

ordinal

prediction

regression

2023

Random intercepts in a model induce a compound symmetric correlation structure in which the correlation between two responses on the same subject have the same correlation…

Incorporating Historical Control Data Into an RCT

drug-evaluation

bayes

design

drug-development

inference

observational

posterior

prior

2023

Historical data (HD) are being used increasingly in Bayesian analyses when it is difficult to randomize enough patients to study effectiveness of a treatment. Such analyses summarize observational studies’ posterior effectiveness distribution (for two-arm HD) or standard-of-care outcome distribution (for one-arm HD) then turn that into a prior distribution for an RCT. The prior distribution is then flattened somewhat to discount the HD. Since Bayesian modeling makes it easy to fit multiple models at once, incorporation of the raw HD into the RCT analysis and discounting HD by explicitly modeling bias is perhaps a more direct approach than lowering the effective sample size of HD. Trust the HD sample size but not what the HD is estimating, and realize several benefits from using raw HD in the RCT analysis instead of relying on HD summaries that may hide uncertainties.

Wedding Bayesian and Frequentist Designs Created a Mess

2023

inference

RCT

bayes

design

evidence

multiplicity

posterior

prior

sequential

This article describes a real example in which use of a hybrid Bayesian-frequentist RCT design resulted in an analytical mess after overly successful participant recruitment.

Ordinal Models for Paired Data

2023

ordinal

hypothesis-testing

inference

regression

This article briefly discusses why the rank difference test is better than the Wilcoxon signed-rank test for paired data, then shows how to generalize the rank difference test using the proportional odds ordinal logistic semiparametric regression model. To make the regression model work for non-independent (paired) measurements, the robust cluster sandwich covariance estimator is used for the log odds ratio. Power and type I assertion \(\alpha\) probabilities are compared with the paired \(t\)-test for \(n=25\). The ordinal model yields \(\alpha=0.05\) under the null and has power that is virtually as good as the optimum paired \(t\)-test. For non-normal data the ordinal model power exceeds that of the parametric test.

Resources for Ordinal Regression Models

2022

2023

2024

endpoints

ordinal

regression

This article provides resources to assist researchers in understanding and using ordinal regression models, and provides arguments for their wider use.

Seven Common Errors in Decision Curve Analysis

decision-making

diagnosis

medicine

2023

I describe seven common errors in decision curve analysis. Avoidance of such errors will make decision curve analysis more reliable and useful.

Randomized Clinical Trials Do Not Mimic Clinical Practice, Thank Goodness

generalizability

design

medicine

RCT

drug-evaluation

personalized-medicine

evidence

2017

2023

Randomized clinical trials are successful because they do not mimic clinical practice. They remain highly clinically relevant despite this.

Biostatistical Modeling Plan

2023

accuracy-score

endpoints

ordinal

collaboration

data-reduction

design

medicine

prediction

regression

validation

bootstrap

This is an example statistical plan for project proposals where the goal is to develop a biostatistical model for prediction, and to do external or strong internal validation of the model.

How to Do Bad Biomarker Research

2022

big-data

bioinformatics

biomarker

bootstrap

data-science

decision-making

dichotomization

forward-probability

generalizability

medical-literature

multiplicity

personalized-medicine

prediction

principles

reporting

reproducible

responder-analysis

sample-size

sensitivity

This article covers some of the bad statistical practices that have crept into biomarker research, including setting the bar too low for demonstrating that biomarker information is new, believing that winning biomarkers are really “winners”, and improper use of continuous variables. Step-by-step guidance is given for ensuring that a biomarker analysis is not reproducible and does not provide clinically useful information.

R Workflow

2022

data-science

graphics

r

reproducible

An overview of R Workflow, which covers how to use R effectively all the way from importing data to analysis, and making use of Quarto for reproducible reporting.

Decision curve analysis for quantifying the additional benefit of a new marker

2022

biomarker

accuracy-score

decision-making

diagnosis

medicine

This article examines the benefits of decision curve analysis for assessing model performance when adding a new marker to an existing model. Decision curve analysis provides a clinically interpretable metric based on the number of events identified and interventions avoided.

Emily Vertosick and Andrew Vickers
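
As a rough illustration of the quantity plotted in a decision curve, here is a minimal Python sketch assuming the standard definition of net benefit at a risk threshold (true positives minus false positives weighted by the threshold odds, both per patient). The function name, the two sets of predicted risks, and the outcomes are hypothetical and are not taken from the article.

```python
def net_benefit(y, p, threshold):
    """Net benefit of intervening on everyone whose predicted risk is >= threshold."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    # false positives are down-weighted by the odds of the threshold probability
    return tp / n - fp / n * threshold / (1 - threshold)

# Hypothetical outcomes and predicted risks from a base model and a model with a new marker
outcome = [1, 1, 0, 0]
risk_base = [0.9, 0.4, 0.6, 0.1]
risk_marker = [0.9, 0.7, 0.3, 0.1]

for t in (0.2, 0.5):
    print(t, net_benefit(outcome, risk_base, t), net_benefit(outcome, risk_marker, t))
```

Comparing the two columns across a range of thresholds is the basic idea behind judging whether a new marker adds clinical value.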

Equivalence of Wilcoxon Statistic and Proportional Odds Model

2022

2024

endpoints

ordinal

drug-evaluation

hypothesis-testing

RCT

regression

In this article I provide much more extensive simulations showing the near perfect agreement between the odds ratio (OR) from a proportional odds (PO) model, and the Wilcoxon two-sample test statistic. The agreement is studied by degree of violation of the PO assumption and by the sample size. A refinement in the conversion formula between the OR and the Wilcoxon statistic scaled to 0-1 (concordance probability) is provided.
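
To make "the Wilcoxon statistic scaled to 0-1" concrete, the following Python sketch computes the concordance probability two ways: directly over all pairs, and from the rank-sum statistic with midranks for ties. It is an illustrative sketch only. The function names and data are assumptions, and the article's OR conversion formula is deliberately not reproduced here.

```python
def concordance_pairs(x, y):
    """P(Y > X) + 0.5 * P(Y = X), taken over all (x, y) pairs."""
    score = sum(1.0 if yj > xi else 0.5 if yj == xi else 0.0 for xi in x for yj in y)
    return score / (len(x) * len(y))

def concordance_ranks(x, y):
    """Same quantity obtained from the Wilcoxon rank-sum statistic of group y."""
    pooled = sorted(x + y)

    def midrank(v):
        # average of the 1-based positions occupied by v in the pooled sample
        first = pooled.index(v) + 1
        return first + (pooled.count(v) - 1) / 2

    rank_sum = sum(midrank(v) for v in y)
    n1, n2 = len(x), len(y)
    u = rank_sum - n2 * (n2 + 1) / 2  # Mann-Whitney U for group y
    return u / (n1 * n2)

# Hypothetical ordinal responses in two groups
control = [1, 2, 3]
treated = [2, 3, 4]
print(concordance_pairs(control, treated))
print(concordance_ranks(control, treated))
```

Both functions return 7/9 for these data, which shows that dividing the Mann-Whitney U statistic by the number of pairs gives the probability that a randomly chosen treated response exceeds a randomly chosen control response, counting ties as one half.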

Longitudinal Data: Think Serial Correlation First, Random Effects Second

drug-evaluation

endpoints

measurement

RCT

regression

2022

Most analysts automatically turn towards random effects models when analyzing longitudinal data. This may not always be the most natural or best-fitting approach.

Assessing the Proportional Odds Assumption and Its Impact

2022

accuracy-score

dichotomization

endpoints

ordinal

This article demonstrates how the proportional odds (PO) assumption and its impact can be assessed. General robustness to non-PO on either a main variable of interest or on an adjustment covariate is exemplified. Advantages of a continuous Bayesian blend of PO and non-PO are also discussed.

A Comparison of Decision Curve Analysis with Traditional Decision Analysis

decision-making

diagnosis

medicine

2021

We compare decision curve analysis and traditional decision analysis to illustrate their similarities and differences.

Commentary on Improving Precision and Power in Randomized Trials for COVID-19 Treatments Using Covariate Adjustment, for Binary, Ordinal, and Time-to-Event Outcomes

bayes

covid-19

design

generalizability

inference

metrics

ordinal

personalized-medicine

RCT

regression

reporting

2021

This is a commentary on the paper by Benkeser, Díaz, Luedtke, Segal, Scharfstein, and Rosenblum

Frank Harrell, Stephen Senn

Incorrect Covariate Adjustment May Be More Correct than Adjusted Marginal Estimates

2021

generalizability

RCT

regression

This article provides a demonstration that the perceived non-robustness of nonlinear models for covariate adjustment in randomized trials may be less of an issue than the non-transportability of marginal so-called robust estimators.

Avoiding One-Number Summaries of Treatment Effects for RCTs with Binary Outcomes

2021

generalizability

RCT

regression

This article presents an argument that for RCTs with a binary outcome the primary result should be a distribution and not any single number summary. The GUSTO-I study is used to exemplify risk difference distributions.

If You Like the Wilcoxon Test You Must Like the Proportional Odds Model

ordinal

hypothesis-testing

2021

accuracy-score

RCT

regression

metrics

Since the Wilcoxon test is a special case of the proportional odds (PO) model, if one likes the Wilcoxon test, one must like the PO model. This is made more convincing by showing examples of how one may accurately compute the Wilcoxon statistic from the PO model’s odds ratio.

Implementation of the PATH Statement

The recent PATH (Predictive Approaches to Treatment effect Heterogeneity) Statement outlines principles, criteria, and key considerations for applying predictive approaches to clinical trials to provide patient-centered evidence in support of decision making. Here challenges in implementing the PATH Statement are addressed with the GUSTO-I trial as a case study.

Ewout Steyerberg

Violation of Proportional Odds is Not Fatal

2020

ordinal

accuracy-score

RCT

regression

hypothesis-testing

metrics

Many researchers worry about violations of the proportional odds assumption when comparing treatments in a randomized study. Besides the fact that this frequently makes them turn to a much worse approach, the harm done by violations of the proportional odds assumption usually does not prevent the proportional odds model from providing a reasonable treatment effect assessment.

Unadjusted Odds Ratios are Conditional

2020

generalizability

RCT

regression

This article discusses issues with unadjusted effect ratios such as odds ratios and hazard ratios, showing a simple example of non-generalizability of unadjusted odds ratios.

RCT Analyses With Covariate Adjustment

2020

drug-evaluation

generalizability

medicine

personalized-medicine

prediction

RCT

regression

This article summarizes arguments for the claim that the primary analysis of treatment effect in an RCT should be with adjustment for baseline covariates. It reiterates some findings and statements from classic papers, with illustration on the GUSTO-I trial.

Ewout Steyerberg
@ESteyerberg

Bayesian Methods to Address Clinical Development Challenges for COVID-19 Drugs and Biologics

bayes

RCT

design

drug-evaluation

medicine

responder-analysis

covid-19

The COVID-19 pandemic has elevated the challenge for designing and executing clinical trials with vaccines and drug/device combinations within a substantially shortened time frame. Numerous challenges in designing COVID-19 trials include lack of prior data for candidate interventions / vaccines due to the novelty of the disease, evolving standard of care and sense of urgency to speed up development programmes. We propose sequential and adaptive Bayesian trial designs to help address the challenges inherent in COVID-19 trials. In the Bayesian framework, several methodologies can be implemented to address the complexity of the primary endpoint choice. Different options could be used for the primary analysis of the WHO Severity Scale, frequently used in COVID-19 trials. We propose the longitudinal proportional odds mixed effects model using the WHO Severity Scale ordinal scale. This enables efficient utilization of all clinical information to optimize sample sizes and maximize the rate of acquiring evidence about treatment effects and harms.

Natalia Muhlemann MD, Rajat Mukherjee PhD, Frank Harrell PhD

Implications of Interactions in Treatment Comparisons

RCT

drug-evaluation

generalizability

medicine

observational

personalized-medicine

prediction

subgroup

2020

This article explains how the generalizability of randomized trial findings depends primarily on whether and how patient characteristics modify (interact with) the treatment effect. For an observational study this will be related to overlap in the propensity to receive treatment.

The Burden of Demonstrating HTE

RCT

generalizability

medicine

metrics

personalized-medicine

subgroup

2019

Reasons are given for why heterogeneity of treatment effect must be demonstrated, not assumed. An example is presented that shows that HTE must exceed a certain level before personalizing treatment results in better decisions than using the average treatment effect for everyone.

Assessing Heterogeneity of Treatment Effect, Estimating Patient-Specific Efficacy, and Studying Variation in Odds ratios, Risk Ratios, and Risk Differences

RCT

generalizability

medicine

metrics

personalized-medicine

prediction

subgroup

accuracy-score

2019

This article shows an example formally testing for heterogeneity of treatment effect in the GUSTO-I trial, shows how to use penalized estimation to obtain patient-specific efficacy, and studies variation across patients in three measures of treatment effect.

Statistically Efficient Ways to Quantify Added Predictive Value of New Measurements

prediction

sample-size

validation

accuracy-score

biomarker

diagnosis

medicine

reporting

2018

Researchers have used contorted, inefficient, and arbitrary analyses to demonstrate added value in biomarkers, genes, and new lab measurements. Traditional statistical measures have always been up to the task, and are more powerful and more flexible. It’s time to revisit them, and to add a few slight twists to make them more helpful.

In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion

data-science

machine-learning

prediction

2018

The performance metrics chosen for prediction tools, and for Machine Learning in particular, have significant implications for health care, and a penetrating understanding of the AUROC will lead to better methods, greater ML value, and ultimately, benefit patients.

Drew Griffin Levy
@DrewLevy

Data Methods Discussion Site

collaboration

teaching

2018

This article lays out the rationale and overall design of a new discussion site about quantitative methods.

Viewpoints on Heterogeneity of Treatment Effect and Precision Medicine

RCT

biomarker

decision-making

drug-evaluation

generalizability

medicine

metrics

personalized-medicine

prediction

subgroup

2018

This article provides my reflections after the PCORI/PACE Evidence and the Individual Patient meeting on 2018-05-31. The discussion includes a high-level view of heterogeneity of treatment effect in optimizing treatment for individual patients.

Navigating Statistical Modeling and Machine Learning

data-science

machine-learning

prediction

2018

This article elaborates on Frank Harrell’s post providing guidance in choosing between machine learning and statistical modeling for a prediction project.

Drew Griffin Levy
@DrewLevy

Road Map for Choosing Between Statistical Modeling and Machine Learning

data-science

machine-learning

prediction

2018

This article provides general guidance to help researchers choose between machine learning and statistical modeling for a prediction project.

Musings on Multiple Endpoints in RCTs

RCT

bayes

design

drug-evaluation

evidence

hypothesis-testing

medicine

multiplicity

p-value

posterior

endpoints

2018

This article discusses issues related to alpha spending, effect sizes used in power calculations, multiple endpoints in RCTs, and endpoint labeling. Changes in endpoint priority are addressed. Included in the discussion is how Bayesian probabilities more naturally allow one to answer multiple questions without all-too-arbitrary designations of endpoints as “primary” and “secondary”. And we should not quit trying to learn.

Improving Research Through Safer Learning from Data

design

evidence

generalizability

inference

judgment

measurement

prior

bayes

2018

What are the major elements of learning from data that should inform the research process? How can we prevent having false confidence from statistical analysis? Does a Bayesian approach result in more honest answers to research questions? Is learning inherently subjective anyway, so we need to stop criticizing Bayesians’ subjectivity? How important and possible is pre-specification? When should replication be required? These and other questions are discussed.

Is Medicine Mesmerized by Machine Learning?

machine-learning

accuracy-score

classification

data-science

decision-making

medicine

prediction

validation

2018

Deep learning and other forms of machine learning are getting a lot of press in medicine. The reality doesn’t match the hype, and interpretable statistical models still have a lot to offer.

Information Gain From Using Ordinal Instead of Binary Outcomes

RCT

design

ordinal

dichotomization

inference

precision

responder-analysis

sample-size

2018

This article gives examples of information gained by using ordinal over binary response variables. This is done by showing that for the same sample size and power, smaller effects can be detected.

Why I Don’t Like Percents

metrics

2018

I prefer fractions and ratios over percents. Here are the reasons.

How Can Machine Learning be Reliable When the Sample is Adequate for Only One Feature?

prediction

machine-learning

sample-size

validation

precision

accuracy-score

2018

It is easy to compute the sample size N₁ needed to reliably estimate how one predictor relates to an outcome. It is next to impossible for a machine learning algorithm entertaining hundreds of features to yield reliable answers when the sample size < N₁.

New Year Goals

2018

2019

Methodologic goals and wishes for research and clinical practice for 2018

Scoring Multiple Variables, Too Many Variables and Too Few Observations: Data Reduction

variability

data-reduction

2017

This article addresses data reduction, also called unsupervised learning.

Statistical Criticism is Easy; I Need to Remember That Real People are Involved

RCT

2017

Criticism of medical journal articles is easy. I need to keep in mind that much good research is done even if there are some flaws in the design, analysis, or interpretation. I also need to remember that real people are involved.

Continuous Learning from Data: No Multiplicities from Computing and Using Bayesian Posterior Probabilities as Often as Desired

bayes

sequential

RCT

2017

This article describes the drastically different way that sequential data looks operate in a Bayesian setting compared to a classical frequentist setting.

Bayesian vs. Frequentist Statements About Treatment Efficacy

reporting

inference

p-value

RCT

bayes

drug-evaluation

evidence

hypothesis-testing

2017

This article contrasts language used when reporting a classical frequentist treatment comparison vs. a Bayesian one, and describes why Bayesian statements convey more actionable information.

Integrating Audio, Video, and Discussion Boards with Course Notes

collaboration

teaching

r

reproducible

2017

In this article I seek recommendations for integrating various media for teaching long courses.

EHRs and RCTs: Outcome Prediction vs. Optimal Treatment Selection

prediction

generalizability

drug-evaluation

evidence

subgroup

EHR

design

medicine

inference

big-data

RCT

personalized-medicine

2017

Observational data from electronic health records may contain biases that large sample sizes do not overcome. Moderate confounding by indication may render an infinitely large observational study less useful than a small randomized trial for estimating relative treatment effectiveness.

Frank Harrell, Laura Lazzeroni

Statistical Errors in the Medical Literature

prediction

logic

p-value

validation

bayes

evidence

subgroup

dichotomization

medicine

inference

change-scores

RCT

personalized-medicine

responder-analysis

hypothesis-testing

medical-literature

2017

This article catalogs several types of statistical problems that occur frequently in medical journal articles.

Subjective Ranking of Quality of Research by Subject Matter Area

2017

This is a subjective ranking of topical areas by the typical quality of research published in the area. Keep in mind that top-quality research can occur in any area when the research team is multi-disciplinary, team members are at the top of their game, and peer review is functional.

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

prediction

machine-learning

accuracy-score

dichotomization

probability

bioinformatics

validation

classification

data-science

2017

Estimating tendencies is usually a more appropriate goal than classification, and classification leads to the use of discontinuous accuracy scores which give rise to misleading results.
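A minimal sketch of the point about discontinuous accuracy scores follows. It is not taken from the article; the forecasts and outcomes are made up for illustration. Classification accuracy at a 0.5 cutoff cannot tell a sharp forecaster from one who barely leans the right way, while the Brier score (a proper scoring rule) can.

```python
def accuracy(probs, outcomes, cutoff=0.5):
    """Proportion classified correctly after dichotomizing at `cutoff`."""
    hits = [(p >= cutoff) == bool(y) for p, y in zip(probs, outcomes)]
    return sum(hits) / len(hits)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical data: four observed binary outcomes and two sets of forecasts
outcomes = [1, 1, 0, 0]
confident = [0.95, 0.90, 0.10, 0.05]  # sharp, well-separated probabilities
barely = [0.55, 0.51, 0.49, 0.45]     # barely on the right side of 0.5

for name, probs in [("confident", confident), ("barely", barely)]:
    print(name, accuracy(probs, outcomes), round(brier_score(probs, outcomes), 4))
```

Both forecasters get the same accuracy, so accuracy gives no credit for the more informative probabilities. The Brier score is lower (better) for the confident forecaster.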

My Journey from Frequentist to Bayesian Statistics

inference

p-value

likelihood

RCT

bayes

multiplicity

posterior

drug-evaluation

principles

evidence

hypothesis-testing

2017

This is the story of what influenced me to become a Bayesian statistician after being trained as a classical frequentist statistician, and practicing only that mode of statistics for many years.

Interactive Statistical Graphics: Showing More By Showing Less

survival-analysis

graphics

r

2017

With interactive graphics one can start by showing the most important data features, then drill down to see details.

A Litany of Problems With p-values

decision-making

bayes

multiplicity

p-value

hypothesis-testing

2017

p-values are very often misinterpreted, and p-values and null hypothesis significance testing have hurt science. This article attempts to catalog all the ways in which these problems arise.

Clinicians’ Misunderstanding of Probabilities Makes Them Like Backwards Probabilities Such As Sensitivity, Specificity, and Type I Error

specificity

probability

backward-probability

forward-probability

p-value

bayes

conditioning

diagnosis

decision-making

dichotomization

medicine

bioinformatics

biomarker

sensitivity

posterior

accuracy-score

classification

2017

The error of the transposed conditional is rampant in research. Conditioning on what is unknowable to predict what is already known leads to a host of complexities and interpretation problems.
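As a small illustration of the transposed conditional (a sketch that is not taken from the article, with hypothetical sensitivity, specificity, and prevalence), Bayes' rule turns the backward probabilities into the forward probability a clinician needs, P(disease | positive test). The two can differ greatly.

```python
def prob_disease_given_positive(sensitivity, specificity, prevalence):
    """Forward probability P(disease | test+) computed with Bayes' rule."""
    true_pos = sensitivity * prevalence                   # P(test+ and disease)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(test+ and no disease)
    return true_pos / (true_pos + false_pos)


# Hypothetical values chosen only for illustration
sens, spec, prev = 0.90, 0.90, 0.02
ppv = prob_disease_given_positive(sens, spec, prev)

print(f"P(test+ | disease) = {sens:.2f}")   # backward probability (sensitivity)
print(f"P(disease | test+) = {ppv:.3f}")    # forward probability
```

With these assumed inputs the sensitivity is 0.90, yet the probability of disease given a positive test is far lower, because the forward probability also depends on prevalence.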

Split-Sample Model Validation

prediction

bootstrap

validation

2017

The many disadvantages of split-sample validation, including subtle ones, are discussed.

Fundamental Principles of Statistics

design

measurement

principles

2017

2023

2026

This brief note catalogs what I feel are some of the most important principles to guide statistical practice.

Ideas for Future Articles

2017

Suggestions for future articles, by readers

Classification vs. Prediction

prediction

decision-making

machine-learning

accuracy-score

classification

data-science

2017

Classification involves a forced-choice premature decision, and is often misused in machine learning applications. Probability modeling involves the quantification of tendencies and usually addresses the real project goals.

Null Hypothesis Significance Testing Never Worked

logic

inference

bayes

p-value

hypothesis-testing

inductive-reasoning

2017

This article explains why, for decision making, the original idea of null hypothesis testing never delivered on its goal.

p-values and Type I Errors are Not the Probabilities We Need

judgment

inference

likelihood

bayes

multiplicity

p-value

prior

hypothesis-testing

2017

p-values are not what decision makers need, nor are they what most decision makers think they are getting.

Introduction

2017

principles

Introducing the Statistical Thinking Blog

---
title: ""
listing:
  - id: post
    contents: "*/index.qmd"
    type: default
    fields: [date, title, description, categories, author, reading-time]  
    sort: "date desc"
    categories: cloud
    sort-ui: true
    filter-ui: true
    page-size: 10
page-layout: full
title-block-banner: false
---



::: {#post}
:::