*Classification combines prediction and decision making and usurps the decision maker in specifying costs of wrong decisions. The classification rule must be reformulated if costs/utilities change. Predictions are separate from decisions and can be used by any decision maker.*

The field of machine learning arose somewhat independently of the field of statistics. As a result, machine learning experts tend not to emphasize probabilistic thinking. Probabilistic thinking and understanding uncertainty and variation are hallmarks of statistics. By the way, one of the best books about probabilistic thinking is Nate Silver's *The Signal and the Noise: Why So Many Predictions Fail But Some Don't*. In the medical field, a classic paper is David Spiegelhalter's *Probabilistic Prediction in Patient Management and Clinical Trials*.

By not thinking probabilistically, machine learning advocates frequently use classifiers instead of risk prediction models. The situation has gotten acute: many machine learning experts actually label logistic regression as a classification method (it is not). It is important to think about what classification really implies. Classification is in effect a decision. Optimum decisions require making full use of available data, developing predictions, and applying a loss/utility/cost function to make a decision that, for example, minimizes expected loss or maximizes expected utility. Different end users have different utility functions. In risk assessment this leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function and that the utility function implied by the classification system is *that* utility function.
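To make the separation of prediction and decision concrete, here is a minimal sketch in Python. It assumes a two-action problem ("treat" or "wait") and two hypothetical cost numbers supplied by the decision maker; the function names and costs are illustrative and do not come from any particular library or study.

```python
def expected_loss(p, action, cost_false_positive, cost_false_negative):
    """Expected loss of an action when the predicted P(disease) is p."""
    if action == "treat":
        # Treating is a mistake only when the patient is disease-free.
        return (1 - p) * cost_false_positive
    # Waiting is a mistake only when the patient has the disease.
    return p * cost_false_negative


def best_action(p, cost_false_positive, cost_false_negative):
    """Choose the action with the smaller expected loss."""
    treat = expected_loss(p, "treat", cost_false_positive, cost_false_negative)
    wait = expected_loss(p, "wait", cost_false_positive, cost_false_negative)
    return "treat" if treat < wait else "wait"


def risk_threshold(cost_false_positive, cost_false_negative):
    """Probability above which treating has the lower expected loss."""
    return cost_false_positive / (cost_false_positive + cost_false_negative)


p = 0.2  # one predicted risk, shared by both decision makers

# Hypothetical user A: a missed disease is 9 times as costly as overtreatment.
print(best_action(p, cost_false_positive=1, cost_false_negative=9))
# Hypothetical user B: both errors are equally costly.
print(best_action(p, cost_false_positive=1, cost_false_negative=1))
```

The same predicted risk leads the two users to different actions because their implied thresholds differ (0.1 versus 0.5 under these assumed costs). A classifier with a built-in cutoff would have chosen for both of them.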

Classification is a *forced choice*. In marketing where the advertising budget is fixed, analysts generally know better than to try to classify a potential customer as someone to ignore or someone to spend resources on. They do this by modeling probabilities and creating a *lift curve*, whereby potential customers are sorted in decreasing order of estimated probability of purchasing a product. To get the "biggest bang for the buck", the marketer who can afford to advertise to n persons picks the n highest-probability customers as targets. This is rational, and classification is not needed here.
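The ranking logic can be sketched in a few lines of Python. The customer names and probabilities below are made up for illustration, and the helper functions are hypothetical.

```python
# Hypothetical predicted purchase probabilities from some risk model.
predicted = {"ann": 0.62, "bob": 0.08, "cy": 0.31, "di": 0.45, "ed": 0.02}


def rank_customers(predicted):
    """Customer ids sorted by decreasing predicted probability."""
    return sorted(predicted, key=predicted.get, reverse=True)


def expected_purchases(predicted, n):
    """Expected number of purchases when advertising to the top n customers."""
    ranked = rank_customers(predicted)
    return sum(predicted[customer] for customer in ranked[:n])


# The budget, not the model, decides where the list is cut.
for n in (1, 2, 3):
    print(n, rank_customers(predicted)[:n], round(expected_purchases(predicted, n), 2))
```

No probability cutoff is built into the model output. A marketer with a different budget reuses the same ranking and simply stops at a different n.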

A frequent argument from data users, e.g., physicians, is that ultimately they need to make a binary decision, so binary classification is needed. This is simply not true. First of all, it is often the case that the best decision is "no decision; get more data" when the probability of disease is in the middle. In many other cases, the decision is revocable, e.g., the physician starts the patient on a drug at a lower dose and decides later whether to change the dose or the medication. In surgical therapy the decision to operate is irrevocable, but the choice of *when* to operate is up to the surgeon and the patient and depends on severity of disease and symptoms. At any rate, if binary classification is needed, it must be done **at the point of care when all utilities are known**, not in a data analysis.

When are forced choices appropriate? I think that one needs to consider whether the problem is mechanistic or stochastic/probabilistic. Machine learning advocates often want to apply methods made for the former to problems where biologic variation, sampling variability, and measurement errors exist. It may be best to apply classification techniques instead just to high signal:noise ratio situations such as those in which there is a known gold standard and one can replicate the experiment and get almost the same result each time. An example is pattern recognition - visual, sound, chemical composition, etc. If one creates an optical character recognition algorithm, the algorithm can be trained by exposing it to any number of replicates of attempts to classify an image as the letters A, B, ... The user of such a classifier may not have time to consider whether any of the classifications were "close calls." And the signal:noise ratio is extremely high. In addition, there is a single "right" answer for each character.

When close calls are possible, probability estimates are called for. One beauty of probabilities is that they are their own error measures. If the probability of disease is 0.1 and the current decision is not to treat the patient, the probability of this being an error is by definition 0.1. A probability of 0.4 may lead the physician to run another lab test or do a biopsy. When the signal:noise ratio is small, classification is usually not a good goal; there one must model *tendencies*, i.e., probabilities.

The U.S. Weather Service has always phrased rain forecasts as probabilities. I do not want a classification of "it will rain today." There is a slight loss/disutility of carrying an umbrella, and I want to be the one to make the tradeoff.

Whether engaging in credit risk scoring, weather forecasting, climate forecasting, marketing, diagnosing a patient's disease, or estimating a patient's prognosis, I do not want to use a classification method. I want risk estimates with credible intervals or confidence intervals. My opinion is that machine learning classifiers are best used in mechanistic high signal:noise ratio situations, and that probability models should be used in most other situations.

This is related to a subtle point that has been lost on many analysts. Complex machine learning algorithms, which allow for complexities such as high-order interactions, require an enormous amount of data unless the signal:noise ratio is high, another reason for reserving some machine learning techniques for such situations. Regression models that capitalize on additivity assumptions (when they are true, and this is approximately true much of the time) can yield accurate probability models without having massive datasets. And when the outcome variable being predicted has more than two levels, a single regression model fit can be used to obtain all kinds of interesting quantities, e.g., predicted mean, quantiles, exceedance probabilities, and instantaneous hazard rates.

A special problem with classifiers illustrates an important issue. Users of machine classifiers know that a highly imbalanced sample with regard to a binary outcome variable Y results in a strange classifier. For example, if the sample has 1000 diseased patients and 1,000,000 non-diseased patients, the best classifier may classify everyone as non-diseased; you will be correct 0.999 of the time. For this reason the odd practice of subsampling the controls is used in an attempt to balance the frequencies and get some variation that will lead to sensible looking classifiers (users of regression models would never exclude good data to get an answer). Then they have to, in some ill-defined way, construct the classifier to make up for biasing the sample. It is simply the case that a classifier trained to a 1/1000 prevalence situation will not be applicable to a population with a vastly different prevalence. The classifier would have to be re-trained on the new sample, and the patterns detected may change greatly. Logistic regression on the other hand elegantly handles this situation by either (1) having as predictors the variables that made the prevalence so low, or (2) recalibrating the intercept (only) for another dataset with much higher prevalence. Classifiers' extreme dependence on prevalence may be enough to make some researchers always use probability estimators instead. One could go so far as to say that classifiers should not be used at all when there is little variation in the outcome variable, and that only tendencies should be modeled.
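As a rough illustration of how little has to change in a probability model when prevalence changes, the sketch below shifts only the intercept, on the log odds scale, by the difference in log odds of the two prevalences. This is a well-known approximation, not the refit on new data described above, and it assumes the predictor effects are the same in both populations. The function names are hypothetical.

```python
import math


def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))


def expit(x):
    """Inverse of logit."""
    return 1 / (1 + math.exp(-x))


def shift_for_prevalence(p, prev_train, prev_target):
    """Move a predicted risk from the training prevalence to a target
    prevalence by changing only the intercept (approximate correction)."""
    delta = logit(prev_target) - logit(prev_train)
    return expit(logit(p) + delta)


# A patient at average risk in a 1/1000 prevalence sample...
p_train = 0.001
# ...maps to average risk in a hypothetical 50% prevalence population.
print(shift_for_prevalence(p_train, prev_train=0.001, prev_target=0.5))
```

Because the shift is the same for every patient, the ordering of patients by risk is unchanged. Only the absolute risks move, which is why no retraining of the predictor effects is involved.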

One of the key elements in choosing a method is having a sensitive accuracy scoring rule with the correct statistical properties. Experts in machine classification seldom have the background to understand this enormously important issue, and choosing an improper accuracy score such as proportion classified correctly will result in a bogus model. This will be discussed in a future blog.
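A small made-up example shows why proportion classified correctly is insensitive. The two sets of predictions below are invented for illustration. Both classify every case correctly at a 0.5 cutoff, yet one is far more informative, and only the proper scoring rules (Brier score and logarithmic score) detect that.

```python
import math


def accuracy(probs, outcomes, cutoff=0.5):
    """Proportion classified correctly after forcing a cutoff (improper score)."""
    hits = sum((p >= cutoff) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared error of the predicted probabilities (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_score(probs, outcomes):
    """Mean log likelihood of the observed outcomes (higher is better)."""
    return sum(math.log(p if y else 1 - p) for p, y in zip(probs, outcomes)) / len(outcomes)


outcomes = [1, 1, 0, 0]
sharp = [0.9, 0.8, 0.2, 0.1]    # hypothetical model A
vague = [0.6, 0.55, 0.45, 0.4]  # hypothetical model B

for name, probs in (("sharp", sharp), ("vague", vague)):
    print(name, accuracy(probs, outcomes), brier_score(probs, outcomes), log_score(probs, outcomes))
```

Accuracy cannot tell the two models apart, so optimizing it gives no reward for better probability estimates.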

> Probabilistic thinking and understanding uncertainty and variation are hallmarks of statistics.

I certainly think it should be and I do think there is a subset of the statistics discipline that understands statistics as primarily about conjecturing, assessing, and adopting idealized representations of reality, predominantly using probability generating models for both parameters and data.

Not sure if it's the majority - there is another perspective on statistics, as primarily being about discerning procedures with good properties that are uniform over a wide range of possible underlying realities and restricting use, especially in science, to just those procedures. Here the probability model is de-emphasized and its role can fade into background technicalities.

Also, starting with probability models and explicating their role in representing reality well enough so that we can act in ways that are not frustrated by reality, does seem hard for people. Perhaps more so with those going into machine learning and data science.

Hope you enjoy blogging.

Keith O'Rourke

Nice comments Keith - thanks. I didn't make this very clear, but probability has many roles including probability models for data and understanding individual calculated probabilities related to decision making and more. I was discussing more the latter.

Thank you Professor Harrell for this GREAT article. (I had never thought about it from this clear angle...).

btw: reached your Blog article via your (new) Twitter acct! :-)

@SF99, San Francisco

This is a thought provoking post. Thanks for writing it (and, more generally, thanks for creating this blog).

It seems to me that you're defining "classification" too narrowly here, though. For example, you write:

> To get the "biggest bang for the buck", the marketer who can afford to advertise to n persons picks the n highest-probability customers as targets. This is rational, and classification is not needed here.

This seems like classification with marketer-specific rules to me. The lift curve describes the range of values that could, in principle, be used to classify customers as targets or non-targets, and each marketer is free to implement a rule as desired.

My own training is primarily in statistics and mathematical psychology (focusing mostly on signal detection theory and various related models of perception and [statistical] decision making), and I've only fairly recently started to dig into the machine learning literature. So maybe I have an overly broad definition of what counts as classification.

In any case, I'd be curious to hear more of your thoughts on this.

The lift curve does not use classification in any way. It uses the predicted probability of purchasing, or anything monotonically related to that probability. And the point where one stops advertising to customers will vary with the advertising budget.

How is it functionally different from using a breakpoint other than 0.5 to convert probabilistic predictions to classifications? The marketer is saying, I want a classification model that classifies N people as 'market to' and all others as 'do not market to'. The break point for converting the probabilities to labels slides till they get what they want. This isn't terribly different (functionally) from adjusting a breakpoint to improve measures like Sensitivity/Specificity/F1.

It's obviously different if the marketer adjusts how much they spend on advertising to each person based on the probability. The example you provided suggested a fixed cost of marketing to a person and so attempting to maximize revenue by targeting the top N most likely.

I agree that the lift curve alone is not a classifier, but it supports classification. Or, put another way, it functions as part of a classifier, wherein each marketer's classification rule (at any given time) is determined by their budget (at that time), which in turn determines the number of potential customers they can target.

It may be just semantics but I don't see a lift curve as supporting classification. True you can solve for a cutpoint in predicted probabilities that yields the first n from a lift curve, but the lift curve can be based on miscalibrated probabilities, relative risks, relative odds, etc., and still work fine. But if you have a probability you have so much more. For example a marketer could change the form of advertisement when the probability of purchasing is lower but the customer is still worth pursuing. Classification just gets in the way of that.

The roots of machine learning are in settings where one wishes to write a program to make automated decisions (such as character recognition, speech recognition, or computer vision). Attempts to write such programs by hand failed. Machine learning applied to large data sets has succeeded very well. In these settings, there is no human in the loop to look at probabilities or confidences, and there is no desire to make statistical inferences or test scientific hypotheses. In such settings, methods that are trained 'end-to-end' to perform the task have generally given better results than methods based on probability models. This is "Vapnik's Principle" that one should not solve a harder problem (i.e., probability estimation) as an intermediate step to solving an easier problem (i.e., classification). There is also an interesting analysis by Shie Mannor and his students showing that the linear Support Vector Machine is a robust classifier, which is a property that few probabilistic methods share.

But of course as "machine learning experts" started looking at more subtle decision problems, they have reached the same conclusion: in many tasks it is important to estimate conditional probabilities. So today's deep neural networks are essentially multinomial logistic regressions (with very rich internal structure). And machine learning experts have been studying proper scoring rules to understand which loss functions give desired results. The ML Experts at Google and Microsoft are building causal models using propensity scores to make advertising decisions. Many of us employ Markov Decision Process models to understand optimal sequential decision making.

In short, your depiction of "machine learning experts" is a straw man that may be useful for your argument but is not representative of the good work in ML. Of course, anyone can call themselves an ML researcher (or a statistician) and apply tools naively. Given the hype around ML/Data Science, thousands of people are doing exactly this, unfortunately. --Tom Dietterich

Excellent points all. The "no human in the loop" type of ML classification in my view works best when the signal:noise ratio is high, and only works when one does not desire to use utilities or the utilities are unknowable but we have some vague belief that the classification is implicitly using a reasonable utility function. It also should be noted that many comparisons of performance by ML with probability estimators such as logistic regression have been hurt by the use of an improper accuracy scoring rule.

Many thanks for the interesting article. I am currently working on a data set obtained from a clinical trial in which the prevalence of disease (~50%) is by design significantly higher than that observed in the real world (~15%). I am using logistic regression. Your article made me wonder whether some calibration is in order to apply this model to real-world data. If yes, then I would be grateful if you could make a few suggestions.

There is an approximate way to correct the intercept based on relative odds of disease in the training and the target population, but I've forgotten the reference. The most rigorous way to do this is to have real-world data and to fit a model with just an intercept and with an offset term: the log odds from your model from the oversampled-disease dataset. The new intercept estimate will be the best available frequentist estimate of the correction you need for the intercept to apply your original model to the real world. You can always give up on the idea of estimating absolute risk and just provide relative odds, once you select a reference point (e.g. subject with covariates all equal to the median or mean).
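The intercept-plus-offset fit can be sketched without any modeling library, since it has a single unknown. The sketch below uses Newton's method to find the maximum likelihood intercept; the data are invented, and `offsets` stands for the log odds the original model assigns to each real-world subject. In practice one would use a standard logistic regression routine that accepts an offset.

```python
import math


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_intercept_with_offset(offsets, outcomes, iterations=50):
    """Maximum likelihood estimate of a in logit P(y = 1) = a + offset,
    found by Newton's method on the one-parameter log likelihood."""
    a = 0.0
    for _ in range(iterations):
        fitted = [expit(a + o) for o in offsets]
        score = sum(y - f for y, f in zip(outcomes, fitted))  # first derivative
        information = sum(f * (1 - f) for f in fitted)        # minus second derivative
        step = score / information
        a += step
        if abs(step) < 1e-12:
            break
    return a


# Hypothetical real-world sample: log odds from the original model, and observed disease.
offsets = [-1.0, -0.5, 0.0, 0.5, 1.0, -1.0, -0.5, 0.0, 0.5, 1.0]
outcomes = [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]

correction = fit_intercept_with_offset(offsets, outcomes)
recalibrated = [expit(correction + o) for o in offsets]
print(correction, sum(recalibrated), sum(outcomes))
```

At the solution the recalibrated risks sum to the observed number of events, which is what it means for the intercept to be corrected. The predictor effects carried in the offset are left untouched.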

Probability machines!

https://www.ncbi.nlm.nih.gov/pubmed/21915433

https://www.ncbi.nlm.nih.gov/pubmed/24581306

The methods for logistic regression are not well described in the paper, but I strongly suspect that they used logistic regression in a way that ignores every advance since logistic regression was invented by DR Cox in 1958. Some of the advances include regression splines, tensor splines, and penalized maximum likelihood estimation. The calibration curve they published for logistic regression is flat, making me suspect that they used a vanilla logistic regression with only linear terms when the data were generated to be highly nonlinear. That could have been fixed trivially. So perhaps they gave machine learning every advantage and logistic regression no advantage. If this is indeed the case, that paper is worse than useless.

P.S. I was referring to the first paper you listed. Haven't looked at the second yet.

I did a very quick read of the second paper for which you are a co-author. It doesn't seem to make the same mistake of underfitting logistic regression as the first paper made (if I'm reading it correctly). It possibly makes the opposite error because I didn't see appropriate penalization used in the logistic regression description. Logistic regression is often superior to machine learning for dealing with 2-way interactions, but you need to apply a penalty function. In my book *Regression Modeling Strategies* I show how to apply proper hierarchical penalties, e.g., least penalty on linear main effects, more penalty on nonlinear main effects, then on linear interactions, and most penalty on nonlinear interactions. In your case it would just involve putting a fairly heavy penalty (using effective AIC, etc.) on all the linear interaction terms. On a separate issue, the gold standards for comparing various models are the out-of-sample log likelihood (logarithmic probability scoring rule), the mean squared error of predicted logit, and mean absolute error of predicted logit and predicted probability. Precision, as you studied, is also important.

Thanks very much for this thought provoking piece! In general, I agree with your point that, in many contexts, predicted probabilities, and their error, will have greater utility than classifications, but acknowledge that at the end there will always be a binary decision. With regards to the umbrella example, I think most people want a recommendation of whether they should bring an umbrella out or not; only some will want the probability of rain. Your preference will likely be dictated by your understanding of risk and uncertainty, and whether you are at the extremes of being worried about getting your hair wet or have a particularly cumbersome umbrella! In the biomedical sciences, I am not clear on what the preference of clinicians would be, but I would hazard a guess that it would be context specific.

Also, I am interested in your assertion that the additivity assumption is approximately true ‘much of the time’. Is there a mathematical proof for this? I have a hunch that this is correct, given Taylor-type expansions of most data-generating functions, but given that real-world data comes from unknowable constructs, I am not clear how this can be justified.

Thanks for the nice comments. I could be proven wrong but I think that the majority of people do not want to be told to bring an umbrella. Classification assumes in effect that everyone has the same utility function, which I know is not the case. My experience with additivity is that I had a grant with Phil Goodman (PI) to study neural networks vs. logistic regression in large medical outcome databases. We found no important interactions in any of the variables in any of the databases.

A good example of the overclassification thinking occurs in Larose and Larose's textbook, *Data Mining and Predictive Analytics*, 2nd edition, on page 422. Since all four combinations of two binary predictors made the same classification prediction ("won't churn", although with different probabilities), they recommend undersampling the data so that some of the probabilities are now greater than .50 and "churn" can be the classification outcome.

As you note, few people in marketing would do this. The probability of losing a customer (churning) would not have to be more than 50% to trigger much concern.

Terrific and sickening example. Makes me wonder how many 'data scientists' understand data science.
