Road Map for Choosing Between Statistical Modeling and Machine Learning

Machine learning (ML) may be distinguished from statistical models (SMs) using any of three considerations:

  • Uncertainty: SMs explicitly take uncertainty into account by specifying a probabilistic model for the data.
  • Structural: SMs typically start by assuming additivity of predictor effects when specifying the model.
  • Empirical: ML is more empirical, allowing for high-order interactions that are not pre-specified, whereas SMs have identified parameters of special interest.

There is a growing number of hybrid methods combining characteristics of traditional SMs and ML, especially in the Bayesian world. Both SMs and ML can handle high-dimensional situations.

It is often good to let the data speak. But you must be comfortable in assuming that the data are speaking rationally. Data can fool you.

Whether using SM or ML, work with a methodologist who knows what she is doing, and don't begin an analysis without ample subject matter input.

Data analysis methods may be described by their areas of applications, but for this article I’m using definitions that are strictly methods-oriented. A statistical model (SM) is a data model that incorporates probabilities for the data generating mechanism and has identified unknown parameters that are usually interpretable and of special interest, e.g., effects of predictor variables and distributional parameters about the outcome variable. The most commonly used SMs are regression models, which potentially allow for a separation of the effects of competing predictor variables. SMs include ordinary regression, Bayesian regression, semiparametric models, generalized additive models, longitudinal models, time-to-event models, penalized regression, and others. Penalized regression includes ridge regression, lasso, and elastic net. Contrary to what some machine learning (ML) researchers believe, SMs easily allow for complexity (nonlinearity and second-order interactions) and an unlimited number of candidate features (if penalized maximum likelihood estimation or Bayesian models with sharp skeptical priors are used). It is especially easy, using regression splines, to allow every continuous predictor to have a smooth nonlinear effect.
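To make the spline point concrete, here is a minimal sketch in Python (NumPy only) of an additive regression in which every continuous predictor gets a cubic spline and all coefficients share a ridge penalty. Everything in it is an illustrative assumption rather than a recommended implementation: the truncated-power basis (restricted cubic splines are usually better behaved in the tails), knots placed at quantiles, the penalty value, the simulated data, and the function names `spline_basis`, `fit_additive_ridge`, and `predict`.

```python
import numpy as np

def spline_basis(x, knots):
    """Truncated-power cubic spline basis for one predictor (no intercept column)."""
    x = np.asarray(x, dtype=float)
    cols = [x, x**2, x**3] + [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def design(X, knots_per_col):
    """Stack one spline basis per predictor: the model stays additive."""
    return np.hstack([spline_basis(X[:, j], k) for j, k in enumerate(knots_per_col)])

def fit_additive_ridge(X, y, n_knots=3, penalty=0.1):
    """Fit y ~ f_1(x_1) + ... + f_p(x_p) with a ridge penalty on all spline terms."""
    probs = np.linspace(0, 1, n_knots + 2)[1:-1]          # interior quantiles
    knots_per_col = [np.quantile(X[:, j], probs) for j in range(X.shape[1])]
    B = design(X, knots_per_col)
    mean, sd = B.mean(axis=0), B.std(axis=0)
    Z = (B - mean) / sd                                    # standardize before penalizing
    coef = np.linalg.solve(Z.T @ Z + penalty * np.eye(Z.shape[1]),
                           Z.T @ (y - y.mean()))
    return {"knots": knots_per_col, "mean": mean, "sd": sd,
            "coef": coef, "intercept": y.mean()}

def predict(model, X):
    Z = (design(X, model["knots"]) - model["mean"]) / model["sd"]
    return model["intercept"] + Z @ model["coef"]

def r_squared(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Simulated data (an assumption): two predictors with smooth nonlinear, additive effects.
rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.5, size=n)

model = fit_additive_ridge(X, y)
r2_spline = r_squared(y, predict(model, X))

# Straw-man comparator: everything assumed to operate linearly.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_linear = r_squared(y, A @ beta)

print(f"apparent R^2, additive splines: {r2_spline:.2f}")
print(f"apparent R^2, straight lines:   {r2_linear:.2f}")
```

The R^2 values here are apparent (training-sample) values, so they are only useful for seeing that the additive spline fit captures structure the linear fit misses; an honest comparison would need resampling or a held-out sample.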

ML is taken to mean an algorithmic approach that does not use traditional identified statistical parameters, and for which a preconceived structure is not imposed on the relationships between predictors and outcomes. ML usually does not attempt to isolate the effect of any single variable. ML includes random forests, recursive partitioning (CART), bagging, boosting, support vector machines, neural networks, and deep learning. ML does not model the data generating process but rather attempts to learn from the dataset at hand. ML is more a part of computer science than it is part of statistics. Perhaps the simplest way to distinguish ML from SMs is that SMs (at least in the regression subset of SM) favor additivity of predictor effects while ML usually does not give additivity of effects any special emphasis.

ML and AI have had their greatest successes in high signal:noise situations, e.g., visual and sound recognition, language translation, and playing games with concrete rules. What distinguishes these is quick feedback while training, and availability of the answer. Things are different in the low signal:noise world of medical diagnosis and human outcomes. A great use of ML is in pattern recognition to mimic radiologists’ expert image interpretations. For estimating the probability of a positive biopsy given symptoms, signs, risk factors, and demographics, not so much.

There are many published comparisons of predictive performance of SM and ML. In many of the comparisons, only naive regression methods are used (e.g., everything is assumed to operate linearly), so the SM comparator is nothing but a straw man. And not surprisingly, ML wins. The reverse also happens, where the ML comparator algorithm uses poorly-chosen default parameters or the particular ML methods chosen for comparison are out of date. As a side note, when the SM method is just a straw man, the outcry from the statistical community is relatively muted compared with the outcry from ML advocates when the “latest and greatest” ML algorithm was not used in the comparison with SMs. ML seems to require more tweaking than SMs. But SMs often require a time-consuming data reduction step (unsupervised learning) when the number of candidate predictors is very large and penalization (lasso or otherwise) is not desired.

Note that there are ML algorithms that provide superior predictive discrimination but that pay insufficient attention to calibration (absolute accuracy).
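The distinction between discrimination and calibration can be seen in a small simulation. The sketch below (Python with NumPy) builds two sets of predictions that rank subjects identically, so they have the same c-index (AUC), yet one set is overconfident and therefore poorly calibrated. The simulated risks, the factor of 3 used to exaggerate the logits, and the helper names `c_index` and `brier` are all assumptions made for illustration.

```python
import numpy as np

def c_index(y, p):
    """Concordance probability (AUC) over all event/non-event pairs; ties count 1/2."""
    pos, neg = p[y == 1], p[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def brier(y, p):
    """Mean squared error of predicted probabilities (lower is better)."""
    return np.mean((p - y) ** 2)

rng = np.random.default_rng(1)
n = 2000
true_risk = rng.uniform(0.05, 0.95, size=n)            # simulated true probabilities
y = (rng.uniform(size=n) < true_risk).astype(int)      # outcomes drawn from those risks

# Overconfident predictions: same ranking, but pushed toward 0 and 1.
logit = np.log(true_risk / (1 - true_risk))
overconfident = 1 / (1 + np.exp(-3 * logit))

c_cal, c_over = c_index(y, true_risk), c_index(y, overconfident)
b_cal, b_over = brier(y, true_risk), brier(y, overconfident)

# Crude calibration check among subjects the overconfident model rates as very high risk.
high = overconfident >= 0.9
mean_predicted_high = overconfident[high].mean()
observed_high = y[high].mean()

print(f"c-index: calibrated {c_cal:.3f}, overconfident {c_over:.3f}")
print(f"Brier:   calibrated {b_cal:.3f}, overconfident {b_over:.3f}")
print(f"predicted >= 0.9: mean predicted {mean_predicted_high:.2f}, observed {observed_high:.2f}")
```

Because the c-index depends only on ranks, it cannot detect the miscalibration; a measure based on absolute accuracy, such as the Brier score, can.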

Because SMs favor additivity as a default assumption, when additive effects dominate, SM requires far lower sample sizes (typically 20 events per candidate predictor) than ML, which typically requires 200 events per candidate predictor. Thus ML can sometimes create a demand for “big data” when small-moderate sized datasets will do. I sometimes dislike ML solutions for particular medical problems because of ML’s lack of assumptions. But SMs are not very good at reliably finding non-pre-specified interactions; SM typically requires interactions to be pre-specified. On the other hand, AHRQ-sponsored research I did on large medical outcomes datasets in the 1990s with the amazing University of Nevada Reno physician-statistician Phil Goodman, whom we lost at an all-too-early age, demonstrated that important non-additive effects are rare when predicting patient mortality. As a result, neural networks were no better than logistic regression in terms of predictive discrimination in these datasets.
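The arithmetic behind these rules of thumb is simple enough to sketch. The Python snippet below uses only the two figures quoted above (20 and 200 events per candidate predictor); the function names and the example numbers (400 events, 30 candidate predictors) are hypothetical and chosen only for illustration.

```python
SM_EVENTS_PER_PREDICTOR = 20    # rule of thumb quoted in the text for SMs
ML_EVENTS_PER_PREDICTOR = 200   # rule of thumb quoted in the text for ML

def max_candidate_predictors(n_events, events_per_predictor):
    """How many candidate predictors a given number of events can support."""
    return n_events // events_per_predictor

def events_needed(n_predictors, events_per_predictor):
    """How many events are needed to support a given number of candidate predictors."""
    return n_predictors * events_per_predictor

n_events = 400  # hypothetical study with 400 events
print(max_candidate_predictors(n_events, SM_EVENTS_PER_PREDICTOR))  # 20
print(max_candidate_predictors(n_events, ML_EVENTS_PER_PREDICTOR))  # 2

n_predictors = 30  # hypothetical list of 30 candidate predictors
print(events_needed(n_predictors, SM_EVENTS_PER_PREDICTOR))  # 600
print(events_needed(n_predictors, ML_EVENTS_PER_PREDICTOR))  # 6000
```

The tenfold gap is what drives the demand for "big data": a dataset that comfortably supports an additive model with 20 candidate predictors supports only a couple under the ML rule of thumb.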

There are many current users of ML algorithms who falsely believe that one can make reliable predictions from complex datasets with a small number of observations. Statisticians are pretty good at knowing the limitations caused by the effective sample size, and at stopping short of trying to incorporate model complexity that is not supported by the information content of the sample.

Here are some rough guidelines that attempt to help researchers choose between the two approaches, for a prediction problem (see footnote 1).

A statistical model may be the better choice if

  • Uncertainty is inherent and the signal:noise ratio is not large—even with identical twins, one twin may get colon cancer and the other not; one should model tendencies instead of doing classification when there is randomness in the outcome
  • One doesn’t have perfect training data, e.g., cannot repeatedly test one subject and have outcomes assessed without error
  • One wants to isolate effects of a small number of variables
  • Uncertainty in an overall prediction or the effect of a predictor is sought
  • Additivity is the dominant way that predictors affect the outcome, or interactions are relatively small in number and can be pre-specified
  • The sample size isn’t huge
  • One wants to isolate (with a predominantly additive effect) the effects of “special” variables such as treatment or a risk factor
  • One wants the entire model to be interpretable

Machine learning may be the better choice if

  • The signal:noise ratio is large and the outcome being predicted doesn’t have a strong component of randomness; e.g., in visual pattern recognition an object must be an E or not an E
  • The learning algorithm can be trained on an unlimited number of exact replications (e.g., 1000 repetitions of each letter in the alphabet or of a certain word to be translated to German)
  • Overall prediction is the goal, without being able to succinctly describe the impact of any one variable (e.g., treatment)
  • One is not very interested in estimating uncertainty in forecasts or in effects of selected predictors
  • Non-additivity is expected to be strong and can’t be isolated to a few pre-specified variables (e.g., in visual pattern recognition the letter L must have both a dominating vertical component and a dominating horizontal component and these two must intersect at their endpoints)
  • The sample size is huge
  • One does not need to isolate the effect of a special variable such as treatment
  • One does not care that the model is a “black box”

Editorial Comment

Some readers have commented on Twitter that I’ve created a false dichotomy of SMs vs. ML. There is some truth in this claim. The motivations for my approach to the presentation are

  • to clarify that regression models are not ML (see footnote 2)
  • to sharpen the discussion by having a somewhat concrete definition of ML as a method without “specialness” of the parameters, that does not make many assumptions about the structure of predictors in relation to the outcome being predicted, and that does not explicitly incorporate uncertainty (e.g., probability distributions) into the analysis
  • to recognize that the bulk of machine learning being done today, especially in biomedical research, seems to be completely uninformed by statistical principles (much to its detriment IMHO), even to the point of many ML users not properly understanding predictive accuracy. It is impossible to have good predictions that address the problem at hand without a thorough understanding of measures of predictive accuracy when choosing the measure to optimize.
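As an illustration of why the choice of accuracy measure matters, the sketch below (Python with NumPy) compares an informative set of predicted probabilities with an uninformative constant prediction. Proportion classified correctly cannot tell them apart, whereas the logarithmic score, a proper scoring rule for which lower is better, can. The simulated risks, the 0.5 classification cutoff, and the function names `accuracy` and `log_score` are assumptions made for this example.

```python
import numpy as np

def accuracy(y, p, cutoff=0.5):
    """Proportion classified correctly after dichotomizing predictions at a cutoff."""
    return np.mean((p >= cutoff).astype(int) == y)

def log_score(y, p):
    """Negative mean log-likelihood of the outcomes (lower is better)."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(2)
n = 5000
true_risk = rng.uniform(0.02, 0.40, size=n)          # every subject's risk is below 0.5
y = (rng.uniform(size=n) < true_risk).astype(int)

informative = true_risk                  # predictions that track each subject's risk
uninformative = np.full(n, y.mean())     # the same prediction for everyone

acc_inf, acc_unif = accuracy(y, informative), accuracy(y, uninformative)
ls_inf, ls_unif = log_score(y, informative), log_score(y, uninformative)

print(f"accuracy:  informative {acc_inf:.3f}, uninformative {acc_unif:.3f}")
print(f"log score: informative {ls_inf:.3f}, uninformative {ls_unif:.3f}")
```

Both sets of predictions classify every subject as a non-event, so their accuracies are identical, yet only one of them carries any information about individual risk.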

Some definitions of ML and discussions about the definitions may be found here, here, and here. I like the following definition from Tom Mitchell: “The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”

The two fields may also be defined by how their practitioners spend their time. Someone engaged in ML will mainly spend her time choosing algorithms, writing code, specifying tuning parameters, waiting for the algorithm to run on a computer or cluster, and analyzing the accuracy of the resulting predictions. Someone engaged mainly in SMs will tend to spend time choosing a statistical model family, specifying the model, checking goodness of fit, analyzing accuracy of predictions, and interpreting estimated effects.

See this for more Twitter discussions.

Footnotes


  1. Note that as described here, it is not appropriate to cast a prediction problem as a classification problem except in special circumstances that usually entail instant visual or sound pattern recognition requirements in a high signal:noise situation where the utility/cost/loss function cannot be specified. ML practitioners frequently misunderstand this, leading them to use improper accuracy scoring rules.
  2. There is an intersection of ML and regression in neural networks. See this article for more.
