Thursday, March 16, 2017

Subjective Ranking of Quality of Research by Subject Matter Area

Having been engaged in biomedical research for a few decades and having watched the reproducibility of research as a whole, I've developed my own ranking of the reliability/quality/usefulness of research across several subject matter areas.  This list is far from complete.  Let's start with a subjective list of what I perceive as the areas in which published research is least likely to be both true and useful.  The list is in ascending order of quality, with the most problematic area listed first.  You'll notice that a vast number of areas are not listed; these are areas in which I have minimal experience.

Some excellent research is done in all subject areas.  This list is based on my perception of the proportion of publications in the indicated area that are rigorously scientific, reproducible, and useful.

Subject Areas With Least Reliable/Reproducible/Useful Research

  1. any area where there is no pre-specified statistical analysis plan and the analysis can change on the fly when initial results are disappointing
  2. behavioral psychology
  3. studies of corporations to find characteristics of "winners"; regression to the mean kicks in making predictions useless for changing your company
  4. animal experiments on fewer than 30 animals
  5. discovery genetics not making use of biology while doing large-scale variant/gene screening
  6. nutritional epidemiology
  7. electronic health record research reaching clinical conclusions without understanding confounding by indication and other limitations of data
  8. pre-post studies with no randomization
  9. non-nutritional epidemiology not having a fully pre-specified statistical analysis plan [few epidemiology papers use state-of-the-art statistical methods and have a sensitivity analysis related to unmeasured confounders]
  10. prediction studies based on dirty and inadequate data
  11. personalized medicine
  12. biomarkers
  13. observational treatment comparisons that do not qualify for the second list (below)
  14. small adaptive dose-finding cancer trials (3+3 etc.)

Subject Areas With Most Reliable/Reproducible/Useful Research

The most reliable and useful research areas are listed first.  All of the following are assumed to have (1) a prospective pre-specified statistical analysis plan and (2) purposeful prospective quality-controlled data acquisition (yes, this applies to high-quality non-randomized observational research).
  1. randomized crossover studies
  2. multi-center randomized experiments
  3. single-center randomized experiments with non-overly-optimistic sample sizes
  4. adaptive randomized clinical trials with large sample sizes
  5. physics
  6. pharmaceutical industry research that is overseen by FDA
  7. cardiovascular research
  8. observational research [however only a very small minority of observational research projects have a prospective analysis plan and high enough data quality to qualify for this list]

Some Suggested Remedies

Peer review of research grants and manuscripts is done primarily by experts in the subject matter area under study.  Most journal editors and grant reviewers are not expert in biostatistics.  Every grant application and submitted manuscript should undergo rigorous methodologic peer review by methodologic experts such as biostatisticians and epidemiologists.  All data analyses should be driven by a prospective statistical analysis plan, and the entire self-contained data manipulation and analysis code should be submitted to journals so that potential reproducibility and adherence to the statistical analysis plan can be confirmed.  Readers should have access to the data in most cases and should be able to reproduce all study findings using the authors' code, plus run their own analyses on the authors' data to check robustness of findings.

Medical journals are reluctant to (1) publish critical letters to the editor and (2) retract papers.  This has to change.

In academia, too much credit is still given to the quantity of publications and not to their quality and reproducibility.  This too must change.  The pharmaceutical industry has FDA to validate their research.  The NIH does not serve this role for academia.

Rochelle Tractenberg, Chair of the American Statistical Association Committee on Professional Ethics and a biostatistician at Georgetown University, said in a 2017-02-22 interview with The Australian that many questionable studies would not have been published had formal statistical reviews been done.  When she reviews a paper she starts with the premise that the statistical analysis was incorrectly executed.  She stated that "Bad statistics is bad science."

Wednesday, March 1, 2017

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

In this article I discussed the many advantages of probability estimation over classification.  Here I discuss a particular problem related to classification, namely the harm done by using improper accuracy scoring rules.  Accuracy scores are used to drive feature selection and parameter estimation, and to measure predictive performance of models derived using any optimization algorithm.  For this discussion let Y denote a no/yes false/true 0/1 event being predicted, with Y=0 denoting a non-event and Y=1 denoting that the event occurred.

As discussed here and here, a proper accuracy scoring rule is a metric applied to probability forecasts. It is a metric that is optimized when the forecasted probabilities are identical to the true outcome probabilities.  A continuous accuracy scoring rule is a metric that makes full use of the entire range of predicted probabilities and does not have a large jump because of an infinitesimal change in a predicted probability.  The two most commonly used proper scoring rules are the quadratic error measure, i.e., mean squared error or Brier score, and the logarithmic scoring rule, which is a linear translation of the log likelihood for a binary outcome model (Bernoulli trials).  The logarithmic rule gives more credit to extreme predictions that are "right", but a single prediction of 1.0 when Y=0 or 0.0 when Y=1 will result in infinity no matter how accurate were all the other predictions.  Because of the optimality properties of maximum likelihood estimation, the logarithmic scoring rule is in a sense the gold standard, but we more commonly use the Brier score because of its easier interpretation and its ready decomposition into various metrics measuring calibration-in-the-small, calibration-in-the-large, and discrimination.
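To make the idea of a proper scoring rule concrete, here is a minimal Python sketch of the two rules just described.  The function names are my own, and I assume the convention that the logarithmic score is the mean negative log likelihood, so that lower is better for both rules.  The last few lines check numerically, over a grid of candidate forecasts, that the expected value of each score is best when the forecast equals an assumed true probability of 0.7.

```python
import math

def brier_score(probs, outcomes):
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def log_score(probs, outcomes):
    """Mean negative Bernoulli log likelihood (lower is better).

    Returns infinity as soon as one forecast of exactly 0 or 1 is wrong.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        p_observed = p if y == 1 else 1.0 - p  # probability given to what happened
        if p_observed == 0.0:
            return math.inf
        total -= math.log(p_observed)
    return total / len(probs)

def expected_brier(forecast, true_p):
    """Expected Brier score of a constant forecast when P(Y=1) = true_p."""
    return true_p * (1.0 - forecast) ** 2 + (1.0 - true_p) * forecast ** 2

def expected_log(forecast, true_p):
    """Expected logarithmic score of a constant forecast when P(Y=1) = true_p."""
    return -(true_p * math.log(forecast) + (1.0 - true_p) * math.log(1.0 - forecast))

true_p = 0.7                              # assumed true outcome probability
grid = [i / 100 for i in range(1, 100)]   # candidate forecasts 0.01 ... 0.99
best_brier = min(grid, key=lambda f: expected_brier(f, true_p))
best_log = min(grid, key=lambda f: expected_log(f, true_p))
print(best_brier, best_log)

# One confident wrong forecast makes the logarithmic score infinite
print(log_score([1.0, 0.9, 0.8], [0, 1, 1]))
```

Both minimizations land on the true probability, which is what "proper" means.  The final line shows the sensitivity of the logarithmic rule to a single forecast of 1.0 when Y=0.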

Classification accuracy is a discontinuous scoring rule.  It implicitly or explicitly uses thresholds for probabilities, and moving a prediction from 0.0001 below the threshold to 0.0001 above the threshold results in a full accuracy change of 1/N.  Classification accuracy is also an improper scoring rule.  It can be optimized by choosing the wrong predictive features and giving them the wrong weights.  This is best shown by a simple example that appears in Biostatistics for Biomedical Research Chapter 18 in which 400 simulated subjects have an overall fraction of Y=1 of 0.57. Consider the use of binary logistic regression to predict the probability that Y=1 given a certain set of covariates, and classify a subject as having Y=1 if the predicted probability exceeds 0.5.  We simulate values of age and sex and simulate binary values of Y according to a logistic model with strong age and sex effects; the true log odds of Y=1 are (age-50)*.04 + .75*(sex=m).  Fit four binary logistic models in order: a model containing only age as a predictor, one containing only sex, one containing both age and sex, and a model containing no predictors (i.e., it only has an intercept parameter).  The results are in the following table:

Both the gold standard likelihood ratio chi-square statistic and the improper pure discrimination c-index (AUROC) indicate that both age and sex are important predictors of Y.  Yet the highest proportion correct (classification accuracy) occurs when sex is ignored.  According to the improper score, the sex variable has negative information.  It is telling that a model that predicted Y=1 for every observation, i.e., one that completely ignored age and sex and only has the intercept in the model, would be 0.573 accurate, only slightly above the accuracy of using sex alone to predict Y.
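A simulation of this kind can be sketched in plain Python.  This is not the code behind the example in the text: the age distribution (normal with mean 50 and SD 12), the 50/50 sex split, the random seed, and all function names are my assumptions, and the logistic fits use a small hand-written Newton-Raphson routine, so the numbers will differ from those in the original example.  Only the true log odds, (age-50)*.04 + .75*(sex=m), and N=400 come from the description above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * v for x, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_logistic(X, y, iterations=25):
    """Maximum likelihood by Newton-Raphson; each row of X starts with 1."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for row, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, row)))
            for i in range(k):
                grad[i] += (yi - p) * row[i]
                for j in range(k):
                    hess[i][j] += p * (1.0 - p) * row[i] * row[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def predict(beta, X):
    return [sigmoid(sum(b * x for b, x in zip(beta, row))) for row in X]

def log_likelihood(probs, y):
    return sum(math.log(p if yi == 1 else 1.0 - p) for p, yi in zip(probs, y))

def accuracy(probs, y, threshold=0.5):
    """Proportion correct when classifying Y=1 if the probability exceeds threshold."""
    return sum((p > threshold) == (yi == 1) for p, yi in zip(probs, y)) / len(y)

random.seed(1)  # arbitrary seed, for repeatability only
n = 400
age = [random.gauss(50, 12) for _ in range(n)]   # assumed age distribution
male = [random.randint(0, 1) for _ in range(n)]  # assumed 50/50 sex split
y = [1 if random.random() < sigmoid((a - 50) * 0.04 + 0.75 * s) else 0
     for a, s in zip(age, male)]

designs = {
    "age": [[1.0, a - 50] for a in age],
    "sex": [[1.0, float(s)] for s in male],
    "age+sex": [[1.0, a - 50, float(s)] for a, s in zip(age, male)],
    "none": [[1.0] for _ in range(n)],
}
results = {}
for name, X in designs.items():
    probs = predict(fit_logistic(X, y), X)
    results[name] = {"loglik": log_likelihood(probs, y),
                     "accuracy": accuracy(probs, y)}

for name, r in results.items():
    lr_chisq = 2.0 * (r["loglik"] - results["none"]["loglik"])
    print(f"{name:8s} LR chi-square {lr_chisq:7.2f}  proportion correct {r['accuracy']:.3f}")
```

The likelihood ratio chi-square can only go up when a predictor is added to a nested model, whereas the proportion correct carries no such guarantee.  Comparing the two columns across several seeds is a quick way to see how erratically classification accuracy ranks the models.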

The use of a discontinuous improper accuracy score such as proportion "classified" "correctly" has led to countless misleading findings in bioinformatics, machine learning, and data science.  In some extreme cases machine learning experts failed to note that their claimed predictive accuracy was less than that achieved by ignoring the data, e.g., by just predicting Y=1 when the observed prevalence of Y=1 was 0.98 whereas their extensive data analysis yielded an accuracy of 0.97.  As discussed here, fans of "classifiers" sometimes subsample from observations in the most frequent outcome category (here Y=1) to get an artificial 50/50 balance of Y=0 and Y=1 when developing their classifier.  Fans of such deficient notions of accuracy fail to realize that their classifier will not apply to a population with a much different prevalence of Y=1 than 0.5.
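Both points can be checked with a few lines of Python.  The sketch below uses hypothetical function names.  The first function gives the accuracy obtained by ignoring the data and always predicting the more frequent outcome.  The second re-expresses a probability estimated at one prevalence for a population with another prevalence, under the assumption that the predictors' likelihood ratio is the same in both settings (the usual case-control intercept argument).  It is shown only to illustrate how far off a balanced-sample probability can be, not as a recommended repair.

```python
def majority_class_accuracy(prevalence):
    """Accuracy from always predicting the more frequent outcome category."""
    return max(prevalence, 1.0 - prevalence)

def shift_prevalence(p, train_prev, target_prev):
    """Convert probability p, estimated where P(Y=1) was train_prev, to a
    population where P(Y=1) is target_prev.

    Assumes the likelihood ratio carried by the predictors is unchanged,
    so only the prior odds differ between the two settings.
    """
    odds = p / (1.0 - p)
    prior_ratio = (target_prev / (1.0 - target_prev)) / (train_prev / (1.0 - train_prev))
    odds *= prior_ratio
    return odds / (1.0 + odds)

# Ignoring the data beats the "extensive data analysis" in the example above
baseline = majority_class_accuracy(0.98)
claimed = 0.97
print(baseline, claimed, baseline > claimed)

# A subject scored at the 0.5 "threshold" by a classifier built on an
# artificially balanced 50/50 sample, viewed in a population with prevalence 0.98
p_population = shift_prevalence(0.5, train_prev=0.5, target_prev=0.98)
print(round(p_population, 4))
```

A subject sitting right at the 0.5 cutoff in the balanced sample has a probability of about 0.98 in the real population, so a threshold chosen under artificial balance says little about that population.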

Sensitivity and specificity are one-sided or conditional versions of classification accuracy.  As such they are also discontinuous improper accuracy scores, and optimizing them will result in the wrong model.
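The discontinuity is easy to see numerically.  In the following Python sketch, with made-up predictions for ten subjects and function names of my own choosing, a single predicted probability moves from 0.4999 to 0.5001.  Sensitivity and proportion correct each take a full jump, while the Brier score hardly moves.  The last lines show that sensitivity is trivially maximized by predicting Y=1 for everyone.

```python
def classify(probs, threshold=0.5):
    """Predict Y=1 when the probability exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probs]

def sensitivity(probs, outcomes, threshold=0.5):
    """Proportion of subjects with Y=1 who are classified as Y=1."""
    calls = [c for c, y in zip(classify(probs, threshold), outcomes) if y == 1]
    return sum(calls) / len(calls)

def specificity(probs, outcomes, threshold=0.5):
    """Proportion of subjects with Y=0 who are classified as Y=0."""
    calls = [c for c, y in zip(classify(probs, threshold), outcomes) if y == 0]
    return sum(1 - c for c in calls) / len(calls)

def accuracy(probs, outcomes, threshold=0.5):
    calls = classify(probs, threshold)
    return sum(c == y for c, y in zip(calls, outcomes)) / len(outcomes)

def brier_score(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

# Hypothetical data: 10 subjects, 6 of whom have Y=1
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
before = [0.4999, 0.2, 0.8, 0.7, 0.3, 0.9, 0.4, 0.6, 0.75, 0.1]
after = [0.5001] + before[1:]  # only the first prediction changes, by 0.0002

for label, score in [("sensitivity", sensitivity), ("specificity", specificity),
                     ("accuracy", accuracy), ("Brier", brier_score)]:
    print(f"{label:12s} before {score(before, outcomes):.5f}  after {score(after, outcomes):.5f}")

# Predicting Y=1 for everyone gives perfect sensitivity and zero specificity
everyone_positive = [1.0] * len(outcomes)
print(sensitivity(everyone_positive, outcomes), specificity(everyone_positive, outcomes))
```

Here sensitivity jumps by 1/6 (one of the six events) and proportion correct by 1/10, although the forecast itself changed by only 0.0002.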

Regression Modeling Strategies Chapter 10 goes into more problems with classification accuracy, and discusses many measures of the quality of probability estimates.  The text contains suggested measures to emphasize such as Brier score, pseudo R-squared (a simple function of the logarithmic scoring rule), c-index, and especially smooth nonparametric calibration plots to demonstrate absolute accuracy of estimated probabilities.