In Machine Learning Predictions for Health Care the Confusion Matrix is a Matrix of Confusion

Drew Griffin Levy, PhD
GoodScience, Inc.

Machine Learning (ML) has already transformed e-commerce, web search, advertising, finance, intelligence, media, and more. ML is becoming ubiquitous and its centripetal gravity draws health care into the swirl. ML will potentially impact all aspects of health care: prevention, diagnostics, drug discovery, drug development, therapeutics, safety, health care delivery, population health management, administration, etc. The question is not whether ML will disrupt health care; this is inexorable. The fundamental question is how ML will ultimately help patients and, more immediately, how it will help clinicians provide better care to patients.

Recent demonstrations of ML applications in health care (e.g., Rajkomar, et al., Scalable and accurate deep learning with electronic health records. 2018) feature advances for interoperability, scalability, and integrating all available digital health information to “harmonize inputs and predict medical events through direct feature learning with high accuracy for tasks such as predicting in-hospital mortality (area under the receiver operator curve [AUROC] across sites 0.93–0.94), predicting 30-day unplanned readmission (AUROC 0.75–0.76), predicting prolonged length of stay (AUROC 0.85–0.86), and predicting all of a patient’s final discharge diagnoses (frequency-weighted AUROC 0.90).” This may be an inflection point for ML in health care, prefiguring ML as a routine tool for efficiently integrating and making sense of health care data at scale.

These AUROC measures of performance feel reassuring, and results like these will surely encourage additional work. However, the quality and character of ML is exposed in the nature of the performance metrics chosen. We must carefully choose the goals we ask these systems to optimize.

Evaluation of models for use in health care should scrupulously take the intended purpose of the model into account. The AUROC may be the right answer, but to a question subtly, yet fundamentally, different from prediction. For prediction, the AUROC is only an oblique answer to the correct question: how well does the data-algorithm system make accurate predictions of responses for future observations? This problem exposes a general confusion in ML between classification and prediction, and several additional problems. These include categorization of inherently continuous variables in the interest of spurious prediction by classification, and misconstrued discrimination, among other issues. Calibration approaches appropriate for representing prediction accuracy are largely overlooked in ML. Rigorous calibration of prediction is important for model optimization, but also ultimately crucial for medical ethics. Finally, the amelioration and evolution of ML methodology is about more than just technical issues: it will require vigilance for our own human biases that make us see only what we want to see, and keep us from thinking critically and acting consistently.

With AUROCs such as those reported in Rajkomar, et al., we feel comforted that something is working as expected. These AUROCs feel very satisfying. Such is the power of cognitive comfort and the social force of convention that the receiver operating characteristic (ROC) curve has been reflexively used to evaluate model performance in almost all reports of statistical modeling and statistical learning. The ROC does indicate something about the capability of a tool for identifying a binary signal (such as for radar tuning for detecting incoming planes) and patterns in the errors of signal discrimination (noise). But the ROC and AUROC are predicated on sensitivity and specificity, and consequently do not provide correct direct information about the potential value of the prediction tool. The reasoning is so subtle as to be generally elusive. It is due to improper (inverse or perverse) conditioning: the error of transposed conditionals or affirming the consequent.

The Confusion Matrix

A confusion matrix is used to describe the performance of a classification model (a “classifier”) in binary data for which the true values are known as well. In its simplest and most typical presentation, it is a special contingency table with two dimensions used to evaluate the results of the test or algorithm. Each row of the matrix represents the ascribed or attributed class while each column represents the actual condition (the truth). The cells of the confusion matrix report the number of true positives and false positives, and false negatives and true negatives.

The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity (or probability of detection in machine learning). The false-positive rate is calculated as 1–specificity.

While sensitivity and specificity sound like the right thing and a good thing, there is an essential misdirection for prediction. The problem for prediction with focusing on sensitivity and specificity is that you are conditioning on the wrong thing: the true underlying condition; you are conditioning on the thing you actually want information about. In $Pr($true positive $\mid$ true condition positive$)$ and $Pr($true negative $\mid$ true condition negative$)$ these measures make fixed the aspect that should be free to vary to provide the information actually needed to assess something meaningful about the future performance of the algorithm. To understand how the algorithm will actually perform in new data, the measures required are $Pr($true positive $\mid$ ascribed class positive$)$ and $Pr($true negative $\mid$ ascribed class negative$)$, i.e., the other dimension of the confusion matrix. To measure something meaningful about future performance, the outcome of interest (or what will be found to be true) must not be fixed by conditioning.
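The two directions of conditioning can be made concrete with a small sketch. The function names, the cohort size, and the prevalence values below are illustrative assumptions, not figures from any study. The sketch builds the expected confusion matrix for a test with fixed sensitivity and specificity, then reads the matrix along both dimensions.

```python
def confusion_counts(n, prevalence, sensitivity, specificity):
    """Expected cell counts when a test is applied to n subjects (hypothetical cohort)."""
    pos = n * prevalence          # subjects who truly have the condition
    neg = n - pos                 # subjects who truly do not
    tp = sensitivity * pos
    fn = pos - tp
    tn = specificity * neg
    fp = neg - tn
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Read the same 2x2 table along both of its dimensions."""
    return {
        # Conditioning on the true condition (columns of the matrix)
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Conditioning on the ascribed class (rows of the matrix)
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Same test (sensitivity = specificity = 0.90), two assumed prevalences.
common = rates(*confusion_counts(10_000, 0.20, 0.90, 0.90))
rare = rates(*confusion_counts(10_000, 0.01, 0.90, 0.90))

for label, r in (("prevalence 20%", common), ("prevalence 1%", rare)):
    print(label, {k: round(v, 3) for k, v in r.items()})
```

Sensitivity and specificity are identical in both settings, yet the probability that a subject ascribed to the positive class is truly positive falls from roughly 0.69 to roughly 0.08. The measures conditioned on the truth stay fixed while the quantity a clinician needs varies.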

While sensitivity and specificity are widely employed as accuracy measures and have intuitive appeal, it is well established that our intuitions can mislead us. Sensitivity and specificity are properties of the measurement process. Sensitivity- and specificity-based measures are not meaningful probabilities for prediction per se, unless we are specifically interested in our informed guesses when we already know the outcomes—the retrospective view; for example, the probability of the antecedent test result given present knowledge of disease or outcome status. Sensitivity and specificity would be applicable in case-control studies, for example. It is generally the reverse that is useful information in providing health care. For prediction and decision making we need to directly forecast the likelihood that the patient has the disease, given the test result (and other available information). How good the test is among patients with and without disease is ancillary.

Sensitivity and specificity only tell you something obliquely about prediction. They tell you something about the observed error proportions for specific tests or algorithms, but nothing directly about uncertainties for future observations or events, or about the quality of the prognostication. Prediction requires conditional probabilities in which the frequency of the outcome or response variable of interest is random and depends on earlier events (e.g., test results or algorithm results). For decision making our uncertainty is generally and properly about prospective probabilities (likelihood of a future event of concern) given past and present conditions and events.

The information in the two axes of the confusion matrix is not symmetric. They are related, but not the same. This is why Bayes' theorem is so valuable. The conditional probability, $Pr(A|B)$, that event A occurs given that event B has occurred already, is not equal to $Pr(B|A)$. Falsely equating the two probabilities frequently causes various errors of reasoning.
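Written out for a binary test, the relationship looks as follows. The symbols are notation introduced here for illustration: $D^{+}$ and $D^{-}$ for the condition being truly present or absent, $T^{+}$ for the test or algorithm ascribing the positive class, $Se$ and $Sp$ for sensitivity and specificity, and $p$ for prevalence.

$$
\Pr(D^{+} \mid T^{+})
= \frac{\Pr(T^{+} \mid D^{+})\,\Pr(D^{+})}{\Pr(T^{+} \mid D^{+})\,\Pr(D^{+}) + \Pr(T^{+} \mid D^{-})\,\Pr(D^{-})}
= \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}
$$

As a worked example under assumed values $Se = Sp = 0.9$ and $p = 0.05$:

$$
\Pr(D^{+} \mid T^{+}) = \frac{0.9 \times 0.05}{0.9 \times 0.05 + 0.1 \times 0.95} = \frac{0.045}{0.140} \approx 0.32
$$

Here $\Pr(T^{+} \mid D^{+}) = 0.9$ while $\Pr(D^{+} \mid T^{+}) \approx 0.32$. The two conditionals differ by nearly a factor of three, and the gap depends on a quantity (prevalence) that sensitivity and specificity do not contain.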

The confusion matrix and its subtle informational asymmetries are a source of confusion. This is nuanced but not trivial, inconsequential, or negligible.

The problem with how the ROC and AUROC are used is in confusing signal detection with prediction. The confusion comes from the fact that both signal detection and prediction involve reckoning uncertainty, but these are different kinds of uncertainty: measurement error estimation vs. stochastic estimation. This confusion is exacerbated when the concept of prediction is re-defined as “filling in missing information” (“Prediction Machines: The Simple Economics of Artificial Intelligence,” 2018). This liberal epistemology discounts the fundamental importance of time for information and creates ambiguity. Whether the arrow is in the bow or on the target matters for epistemology and information. A proper accuracy scoring rule for prediction, by contrast, is a metric applied to probability forecasts. It is a metric that is optimized when the forecasted probabilities are identical to the true outcome probabilities.
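The optimization property can be checked directly for one familiar proper scoring rule, the quadratic (Brier) score. The notation is introduced here for illustration: $Y \sim \text{Bernoulli}(p)$ is the binary outcome with true probability $p$, and $q$ is the forecast probability.

$$
\mathbb{E}\big[(q - Y)^2\big] = p\,(q - 1)^2 + (1 - p)\,q^2 = q^2 - 2pq + p = (q - p)^2 + p\,(1 - p)
$$

The term $p(1 - p)$ is irreducible outcome variance, so the expected score is minimized exactly at $q = p$. The logarithmic score behaves the same way:

$$
\frac{d}{dq}\Big[p \log q + (1 - p)\log(1 - q)\Big] = \frac{p}{q} - \frac{1 - p}{1 - q} = 0 \quad\Longrightarrow\quad q = p
$$

By contrast, the expected proportion classified correctly at a 0.5 threshold is the same for every forecast $q$ on the same side of 0.5, so it cannot reward a forecaster for getting the probability right.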

The prevalence and stubborn persistence of the ROC and AUROC may be attributable to the complexity and nuance of the underlying statistical reasoning, and the assumed wisdom of existing practice. It is much like the zombie inertia of the use of the null-hypothesis statistical testing paradigm, the p-value and Type I error in epidemiology and the social sciences (and elsewhere). Sensitivity, specificity, the ROC and the c-index have a nice technical “truthiness” quality about them that makes them attractive and “sticky”. The relative merit of positive predictive value (PPV) and negative predictive value (NPV) over sensitivity and specificity for clinical purposes involves similar issues and is emphasized in current medical education, but neglected in ML. This is also very much like the inability of case-control studies to measure absolute risk because of conditioning on case status. And there are numerous psychological and social factors that also explain the ROC’s persistent attendance in the literature. For example, when confronted with a perplexing problem, question, or decision, we make life easier for ourselves by unwittingly answering a substitute, simpler question. Instead of estimating the probability of a certain complex outcome we subconsciously rely on an estimate of another, less complex or more accessible outcome. We never get around to answering the harder question (D. Kahneman, Thinking, Fast and Slow). It is also very difficult not to adopt or capitulate to what has become a norm in scientific communication.

The problem of using the ROC as a performance measure for ML prediction systems is further complicated by specious categorization of inherently continuous variables in the interest of accommodating inefficient discrimination measures, and moreover by spurious prediction by classification.

A Matrix of Confusion

The measures of sensitivity and specificity in the confusion matrix are binary or dichotomous, and a binary measure (the bit) is the most elemental and simplest form of information. Humans have a very strong bias for avoiding complexity and strong tendencies for reducing dimensionality whenever possible. This tendency is so strong that we often seek to satisfy it even if unconsciously we throw away information to do so. Models are frequently developed and reported which dichotomize predicted values post hoc. For instance, information-rich predicted probabilities from logistic regression and other models are frequently dichotomized at some threshold (e.g., arbitrarily at 0.5) to permit expression as categories (of 0’s and 1’s) just to supply the units that sensitivity and specificity measures and the ROC require for an index of discrimination. This is coercing prediction into a classification paradigm and confusing fundamentally different objectives. Regardless of the optimization algorithm, the practice of using categorical accuracy scores for measuring predictive performance and to drive feature selection for models has led to countless misleading findings in bioinformatics, machine learning, and data science (Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules).
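A toy sketch of what dichotomization discards. The six outcomes and both sets of predicted probabilities below are made up for illustration. One hypothetical model is confident and right, the other barely clears the 0.5 threshold, yet thresholded accuracy cannot tell them apart.

```python
import numpy as np

# Hypothetical outcomes and two hypothetical sets of predicted probabilities.
y = np.array([1, 1, 1, 0, 0, 0])
confident = np.array([0.95, 0.90, 0.85, 0.10, 0.05, 0.15])
hesitant = np.array([0.55, 0.60, 0.51, 0.45, 0.49, 0.40])


def accuracy_at(p, y, threshold=0.5):
    """Proportion classified correctly after dichotomizing at a threshold."""
    return float(np.mean((p >= threshold).astype(int) == y))


def brier(p, y):
    """Mean squared distance between predicted probability and outcome (lower is better)."""
    return float(np.mean((p - y) ** 2))


def log_score(p, y):
    """Average log-likelihood of the outcomes under the predictions (higher is better)."""
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))


for name, p in (("confident", confident), ("hesitant", hesitant)):
    print(name, accuracy_at(p, y), round(brier(p, y), 4), round(log_score(p, y), 4))
```

Both sets of predictions score 1.0 on thresholded accuracy, while the Brier score and the logarithmic score separate them clearly. The continuous scores use the information that the threshold throws away.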

Categorization of inherently continuous independent or predictor variables (the right-hand side of the equation) is also highly prevalent and associated with a host of problems (Problems Caused by Categorizing Continuous Variables). This is unnecessary as it is easy, using regression splines, to allow every continuous predictor in modeling to have a smooth nonlinear effect. Categorization of continuous variables, whether dependent or independent variables, is associated with waste of information at best; but more generally it leads to distortions of information and purposes. Categorization of continuous variables is entropic; and it only appears to help.

And vice versa: just as probabilistic prediction models are coerced into classification models, classification models are frequently misconstrued as prediction models. The confusion is that classification (a selection of alternative states) can be tantamount to a decision (a choice among alternative actions) using the data alone without incorporating subject-specific utilities (see Classification vs. Prediction). The presumption comes from the fallacious view that ultimately end-users need to make a binary decision, so binary classification is needed. Optimum decisions require making full use of available data, developing expectations quantitatively expressed as individual probabilities on a continuous scale, and applying an independently derived loss/utility/cost function to make a decision that minimizes expected loss or maximizes expected utility. Different end users have different utility functions, which leads to their having different risk thresholds for action. Classification assumes that every user has the same utility function, one only implicit in the classification system (though one wouldn’t know it from the literature). The author of a paper is not in possession of the real utilities for patients.
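The separation between a predicted probability and the decision made with it can be sketched in a few lines. Everything here is a hypothetical illustration: the two actions ("treat" and "wait"), the cost values, and the simplification that a correct action costs nothing.

```python
def expected_loss(risk, action, cost_false_pos, cost_false_neg):
    """Expected loss of an action at a given predicted risk (correct actions cost 0 here)."""
    if action == "treat":
        # Loss is incurred only if the event would not have occurred.
        return (1 - risk) * cost_false_pos
    # "wait": loss is incurred only if the event does occur.
    return risk * cost_false_neg


def best_action(risk, cost_false_pos, cost_false_neg):
    """Choose the action that minimizes expected loss for this user's costs."""
    losses = {
        action: expected_loss(risk, action, cost_false_pos, cost_false_neg)
        for action in ("treat", "wait")
    }
    return min(losses, key=losses.get)


def risk_threshold(cost_false_pos, cost_false_neg):
    """Risk above which treating has the lower expected loss."""
    return cost_false_pos / (cost_false_pos + cost_false_neg)


predicted_risk = 0.20  # the same model output for both hypothetical patients

# Patient A regards a missed event as nine times worse than unneeded treatment.
patient_a = best_action(predicted_risk, cost_false_pos=1.0, cost_false_neg=9.0)
# Patient B weighs the two errors equally.
patient_b = best_action(predicted_risk, cost_false_pos=1.0, cost_false_neg=1.0)

print(patient_a, risk_threshold(1.0, 9.0))  # treat 0.1
print(patient_b, risk_threshold(1.0, 1.0))  # wait 0.5
```

The same predicted risk of 0.20 leads to different optimal actions because the thresholds come from the users' utilities, not from the model. A classifier that hard-codes a single cutoff has silently chosen one utility function for everyone.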

For all applications it is well to distinguish and clearly differentiate prediction and classification. Formally, for ML, classification is using labels you have for data in hand to correctly label new data. This is feature recognition and class or category attribution. Strictly understood, it is about identification, and not about stochastic outcomes. Classification is best used with non-stochastic mechanistic or deterministic processes that yield outcomes that occur frequently. Classification should be used when outcomes are inherently distinct and predictors are strong enough to provide, for all subjects, a probability closely approximating 1.0 for one of the outcomes. A classification does not account well for gray zones. Classification techniques are appropriate in situations in which there is a known gold standard and replicate observations with approximately the same result each time, for instance in pattern recognition (e.g., optical character recognition algorithms, etc.). In such situations the processes generating the data are primarily non-stochastic, with high signal:noise ratios.

Classification is frequently not strictly understood (for the source of wisdom inspiring this thesis, see Road Map for Choosing Between Statistical Modeling and Machine Learning). Classification is inferior to probability modeling for driving the development of a predictive instrument. It is better to use the full information in the data to develop a probability model and to preserve gray zones. In ML, classification methods are frequently employed for ersatz prediction, or in ways that conflate prediction and decision-making, which generates more confusion.


The AUROC (or its equivalent for the case of a binary response variable, the c-index or c-statistic) is conventionally employed as a measure of the discrimination capacity of a model: the ability to correctly classify observations into categories of interest. Setting aside the question of the appropriateness of classification-focused measures (sensitivity, specificity and their summary in the ROC) of performance for prediction models, I speculate that the AUROC and the c-statistic do not really reflect what people generally think they do. And here again, nuances and behavioral economics (inconsistencies in perceptions, cognition, behavior and logic) are pertinent.

Discrimination literally indicates the ability to identify a meaningful difference between things and connotes the ability to put observations into groups correctly. As applied to a prediction model the area under the ROC curve or the c-statistic, however, is based on the ranks of the predicted probabilities and compares these ranks between observations in the classes of interest. The AUC is closely related to the Mann–Whitney U, which tests whether positives are ranked higher than negatives, and to the Wilcoxon rank-sum statistic. Because this is a rank-based statistic, the area under the curve is the probability that a randomly chosen subject from one outcome group will have a higher score than a randomly chosen subject from the other outcome group, and that is all.
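The rank-based interpretation can be verified numerically. The sketch below uses simulated scores (the data-generating choices are arbitrary assumptions) and computes the AUC two ways: as the fraction of positive/negative pairs in which the positive has the higher score, and from the Mann–Whitney U statistic built on ranks. The rank version assumes no tied scores.

```python
import numpy as np


def concordance_auc(scores, y):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count 1/2."""
    pos = scores[y == 1]
    neg = scores[y == 0]
    diff = pos[:, None] - neg[None, :]          # every positive against every negative
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


def rank_auc(scores, y):
    """Same quantity from the Mann-Whitney U statistic (assumes no tied scores)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return float(u / (n_pos * n_neg))


# Simulated data: positives have scores shifted upward by one standard deviation.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=500)
scores = rng.normal(loc=y.astype(float), scale=1.0)

auc_pairs = concordance_auc(scores, y)
auc_ranks = rank_auc(scores, y)
print(round(auc_pairs, 4), round(auc_ranks, 4))
```

Nothing in either calculation uses the magnitude of a score, only its order relative to the others.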

In health care, discrimination is frequently concerned with the ability of a test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups by some gold standard of pathology. To realize the AUROC measure, you randomly pick one patient from the disease group and one from the non-disease group and perform the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which the test correctly rank orders the test measures for the two patients in the random pair. It is something like accuracy, but not accuracy.

Also frequently important in health care is the ability of a model to prognosticate something like death. For the binary logistic prediction model the area under the curve is the probability that a randomly chosen patient who died will have a higher estimated probability of death than a randomly chosen survivor. This is only a corollary of what is desirable for evaluation of the performance of a prediction model: a measure of quantitative absolute agreement between observed and predicted mortality. The probability of correctly ranking a pair seems of secondary interest.

If you develop a model indicating that I am likely to develop a cancer, and you tell me that you assert this because the model has an AUROC of 0.9, you really have only told me something about the expected relative ranking of my predicted value; that on average people who do not go on to develop the cancer tend to have a lower predicted value. This seems like something less than strong inference. Whether the absolute risks are 0.19 vs. 0.17, or 0.9 vs. 0.2 does not enter into the information.

Various other rigorous interpretations of the AUC include the average value of sensitivity for all possible values of specificity, and vice versa (Zhou, Obuchowski, McClish, 2011); or the probability that a randomly selected subject with the condition has a test result indicating greater suspicion than that of a randomly chosen subject without the condition (Hanley & McNeil, 1982).

I would bet that there is a substantial disconnect between all these rigorous formal interpretations of the AUROC and how the audience for reported evaluations of prediction models thinks about what it means. I suspect that we interpret (confabulate) the AUROC or c-statistic as something like a fancy Proportion Classified Correctly accuracy measure or an $R^2$. We read more into it than there is. But the area under the curve is the probability that a randomly chosen subject from one outcome group will have a higher score than a randomly chosen subject from the other outcome group, nothing more. I do not feel that tells us enough.

And there are various ways we may be misled by this measure (Cook, 2007; Lobo, 2007). As the ROC curve does not use the estimated probabilities themselves, only ranks, it may be insensitive to absolute differences in predicted probabilities. Hence, a well discriminating model can have poor calibration. And perfect calibration is possible with poor discrimination when the range of predicted probabilities is small (as with a homogeneous population case-mix), as discrimination is sensitive to the variance in the predictor variables. Over-fitted models can show both poor discrimination and calibration when validated in new patients. Inferential tests for comparing AUROC are problematic (Seshan, 2013), and other disadvantages with the AUROC are noted (Halligan, 2015). For various reasons, the AUROC and the c-index or c-statistic are problematic and of limited value for comparing among tests or models, though unfortunately, still widely used for such.
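That a well-discriminating model can be poorly calibrated is easy to demonstrate by simulation. In the sketch below every modeling choice is an assumption made for illustration: risks drawn from a Beta(2, 5) distribution, outcomes drawn from those risks, and a "distorted" model that reports the cube of the true risk. Cubing is strictly increasing on $[0, 1]$, so it preserves every rank while changing every absolute probability.

```python
import numpy as np


def auc(scores, y):
    """Probability that a random positive outranks a random negative; ties count 1/2."""
    diff = scores[y == 1][:, None] - scores[y == 0][None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


def brier(p, y):
    """Mean squared distance between predicted probability and outcome."""
    return float(np.mean((p - y) ** 2))


rng = np.random.default_rng(1)
true_risk = rng.beta(2, 5, size=2000)   # simulated individual risks
y = rng.binomial(1, true_risk)          # outcomes generated from those risks

calibrated = true_risk                  # reports the true risk
distorted = true_risk ** 3              # same ordering, badly wrong absolute risks

print("AUC   ", round(auc(calibrated, y), 4), round(auc(distorted, y), 4))
print("Brier ", round(brier(calibrated, y), 4), round(brier(distorted, y), 4))
print("mean predicted vs. observed rate:",
      round(float(calibrated.mean()), 3),
      round(float(distorted.mean()), 3),
      round(float(y.mean()), 3))
```

The two models have identical AUC, but the distorted model's average predicted risk is far below the observed event rate and its Brier score is markedly worse. The AUROC is blind to exactly the error that matters for telling a patient what their risk is.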

There are no single model performance measures that are at once simple, intuitive, correct, and foolproof. Again, the AUROC has a nice technical “truthiness” quality; but this may be a chimera. As an assessment of performance for a prediction model there may be less there than meets the eye; and we too often tend to see only what we want to see. Again, nuances in statistical thinking combined with the complexities of human psychology are conditions conducive to confusion.

I feel we use discrimination measures somewhat indiscriminately, much as p-values are generally misinterpreted (hypothetical frequencies of data patterns under an arbitrary assumed statistical model interpreted as hypothesis probabilities). And like other forms of discrimination, because of circular reasoning the AUROC tends to confirm our biases or reinforce our prejudices about the merits of a model. For purposes of evaluating prediction models, discrimination may better be represented by the distributions of predicted probabilities rather than a facile single statistic that is a derivative of sensitivity and specificity. And the AUROC should not be the only criterion in assessment of model performance. Although the AUROC may be useful and sufficient under some circumstances (and even then, perhaps less well and less often than is generally thought), the evaluation of prognostic models should not rely solely on the ROC curve, but should assess both discrimination and calibration, perhaps with a much greater emphasis on calibration.


A prediction generator is optimized when the forecasted probabilities are identical to the true outcome probabilities. Calibration is assessed by directly comparing the model output with the corresponding measured values. The distance between the predicted outcomes and actual outcomes is central to quantifying overall performance for prediction models. From a statistical perspective what we want for confidence in a model is a measure reflecting uniformly negligibly small absolute distances between predicted and observed outcomes.

A continuous accuracy scoring rule is a metric that makes full use of the entire range of predicted probabilities and does not have a large categorical jump because of an arbitrary threshold marking an infinitesimal change in a predicted probability.

For the purposes of comparing predicted and observed outcomes, instead of transforming predicted probabilities to the discrete scale with dichotomization at an arbitrary threshold such as 0.5 (as is done for the categorical sensitivity- and specificity-based accuracy measures), the observed outcomes are transformed to the probability scale. The discrete (0,1) observed outcomes are mapped with fidelity on the continuous probability scale using non-parametric smoothers. High-resolution calibration or calibration-in-the-small assesses the absolute forecast accuracy of predictions at granular levels. When the calibration curve is linear, this can be summarized by the calibration slope and intercept. A more general visualization approach uses a loess smoother or spline function to estimate and illustrate the calibration curve. With such visualizations, lack of fit for individuals in any part of the range becomes evident. Several continuous summary metrics for calibration are available such as the various descriptive statistics for calibration error, the logarithmic proper scoring rule (average log-likelihood), or the Brier score, etc.
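A minimal sketch of a smoothed calibration curve, under stated assumptions. A Gaussian-kernel (Nadaraya–Watson) smoother stands in for the loess or spline smoother mentioned above because it fits in a few lines of NumPy. The data are simulated, the bandwidth of 0.05 is arbitrary, and the "overconfident" model is constructed by doubling the true log-odds.

```python
import numpy as np


def smoothed_calibration(pred, y, grid, bandwidth=0.05):
    """Kernel-weighted observed event rate at each grid value of predicted probability."""
    weights = np.exp(-0.5 * ((grid[:, None] - pred[None, :]) / bandwidth) ** 2)
    return (weights @ y) / weights.sum(axis=1)


rng = np.random.default_rng(42)
true_risk = rng.uniform(0.05, 0.95, size=5000)   # simulated individual risks
y = rng.binomial(1, true_risk)                   # observed 0/1 outcomes

logit = np.log(true_risk / (1 - true_risk))
calibrated = true_risk                            # reports the true risk
overconfident = 1 / (1 + np.exp(-2 * logit))      # predictions pushed toward 0 and 1

grid = np.linspace(0.1, 0.9, 9)
curve_cal = smoothed_calibration(calibrated, y, grid)
curve_over = smoothed_calibration(overconfident, y, grid)

# Mean absolute distance between the calibration curve and the diagonal.
err_cal = float(np.mean(np.abs(curve_cal - grid)))
err_over = float(np.mean(np.abs(curve_over - grid)))

for g, c, o in zip(grid, curve_cal, curve_over):
    print(f"predicted {g:.1f}  observed (calibrated) {c:.2f}  observed (overconfident) {o:.2f}")
print("mean absolute calibration error:", round(err_cal, 3), round(err_over, 3))
```

For the calibrated predictions the smoothed observed rate tracks the diagonal. For the overconfident predictions the curve is too flat: low predicted risks understate the observed rate and high predicted risks overstate it, which is the pattern a calibration slope below 1 summarizes. In practice a loess or spline smoother with confidence bands would replace this simple kernel.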

ML needs to be more circumspect about the methods employed in assessment of predictive performance and driving tuning and feature selection for models.

Ethics and Accuracy Validation

Artificial intelligence (AI) and ML are widely conceived as programs that learn from data to perform a task or make a decision automatically. Some aspects of AI/ML in health care may be different from AI/ML applications in other domains because of the nature of health care decision-making and the ultimate role of medical ethics. These differences are exposed in consideration of classification vs. prediction and how models are validated.

Much technical acumen is applied to developing systems such as those that can determine whether a potential customer will click a button and purchase something from a set of advertised offerings. Whether framed as a classification or a prediction, the negligible particular costs and risks involved in most commercial applications do not drive careful consideration of methodology in the same way that some health care activities might. As the power and appeal of ML in many fields leads ineluctably to technology transfer to health care, the ethics of technology usage involve examination of issues that are not always apparent or immediate.

From a medical ethics perspective, fundamental principles for consideration include regard for a trusting patient–physician relationship emphasizing beneficence (concern for the patient’s best interests, and the benefits the patient may derive from testing or a prognostication) and respect for autonomy (an appreciation that patients make choices about their medical care). Ethical issues of non-maleficence (using tests when the consequences of the test are uncertain) and justice also arise in providing health care. AI/ML will eventually have to address where and how decisions are made and the locus of responsibility in health care. The ethical principles of respect for autonomy, beneficence, and justice in health care should guide ML methods development in many cases. Concerns about classification vs. prediction for decision-making, and calibration for model evaluation will eventually impose themselves.

Apart from the ML methods themselves, any bias that exists in the health system may also be represented in the EHR data, and an EHR-ML algorithm may perpetuate that bias in future medical decisions. The ethics of autonomy and justice may be served when the data-model system informing care is transparent and the evidence and reasoning supporting a clinical decision are available for scrutiny by all stakeholders, especially the patient-provider partnership. Where data and models are inscrutable, potential ethical conflicts may emerge for patient autonomy and raise questions concerning the locus of responsibility in clinical decision making.

Model validation and accuracy are germane to how ML will ultimately help provide better care to patients. Proportion classified correctly, sensitivity, specificity, precision, and recall are all improper accuracy scoring rules for prediction and should not play a role when individual patient risk estimation is the real goal. The credibility of an ML tool for individual-level decision making will require assessment with calibration. For informing patients and medical decision making, a reassuring calibration should be a primary requirement. Emphasizing accurate calibration over discrimination serves the fundamental medical ethics of respecting patient individuality and autonomy, and moreover helps to optimize decision making under uncertainty.


Received wisdom or conventional and prevalent practice is not always a useful guide. ML and deep learning may well provide accurate predictions, but with the accuracy measures typically reported we still do not really know. The AUROC is a reasonable initial screen for exploratory propositions about data-algorithm systems. But without calibration expressed across the full range of outcome probabilities in the population of interest we will not know how reliable the probability-based predictions will ultimately be, no matter how they were generated (no matter how big the data or technical the algorithm). We should also be generally wary of inappropriate applications of classification, and of classification impersonating prediction, and other necessary distinctions.

Applications of Big Data and Data Science to health care will likely follow a path similar to that of other sectors (e.g., finance, marketing) and of other trends as they are adopted for commercial purposes. This process includes innovation to expose, create or capture value with high initial expectations; re-calibration as the gap between the original aspirations and the reality of discovered limitations is understood; consolidation and coalescing around new practices and standards; and, finally maturation in understanding how best to use this new resource; ultimately leading toward routine productive use in commercialization, operations and decision making. This is the natural history of innovation. And much time and fortune is lost in the interval between the peak of expectations and the plateau of productivity (see Gartner Hype Cycle; Technology Hype).

It is possible to ‘bend the curve’ of this cycle by leveraging knowledge and good strategy. There will be a “John Henry” moment between ML and conventional prediction modeling. The comparisons made to date are specious because the availability of high quality conventional prediction models employing modern applied methods (Harrell, 2015; Steyerberg, 2009) by experienced analysts is very, very limited (e.g., see Cui, 2009), and the use of AUROC in comparison is unsuitable. Building an ML algorithm that includes rigorous evaluation against the best alternative is highly valuable for bending the curve, as it will sharpen thinking, forfend surprise and obviate criticism. Such comparisons will eventually be compelled by issues concerning medical ethics.

For AI and ML to ultimately help clinicians provide better care to patients the technical issues in the performance metrics chosen for ML evaluation will eventually prove to be critically important. To develop AI/ML that delivers better care for patients will require rigorous thinking about what are often complex and nuanced issues, and require deep understanding of health data and the various forms of predictive and evaluative models. Understanding of what is real “information” (as inputs to, and outputs from, algorithms), including its quality, its value and usefulness, will not come automatically or easily. The path may be fraught with abstruse or inconvenient truths. Vigilance for our own human biases that make us see only what we want to see and keep us from thinking critically and acting consistently will help us navigate. The destination, though, is better care for patients.

