Clinicians are people trained in the science and practice of medicine, and most of them are very good at it. They are also very good at many aspects of research. But they are generally not taught probability, and this can limit their research skills. Many excellent clinicians even let their limitations in understanding probability make them believe that their clinical decision making is worse than it actually is. I have taught many clinicians who say "I need a hard and fast rule so I know how to diagnose or treat patients. I need a hard cutoff on blood pressure, HbA1c, etc. so that I know what to do, and the fact that I either treat or not treat the patient means that I don't want to consider a probability of disease but desire a simple classification rule." This makes the clinician try to influence the statistician to use inefficient, arbitrary methods such as categorization, stratification, and matching.

In reality, clinicians do not act that way when treating patients. They are smart enough to know that if a patient has cholesterol just over someone's arbitrary threshold they may not start statin therapy right away if the patient has no other risk factors (e.g., smoking) going against him. They know that sometimes you start a patient on a lower dose and see how she responds, or start one drug and try it for a while and then switch drugs if the efficacy is unacceptable or there is a significant side effect.

So I emphasize the need to understand probabilities when I'm teaching clinicians. A probability is a self-contained summary of the current information, except for the patient's risk aversion and other utilities. Clinicians need to be comfortable with a probability of 0.5 meaning "we don't know much" and not requesting a classification of disease/normal that does nothing but cover up the problem. A classification does not account for gray zones or patient and physician utility functions.

Even physicians who understand the meaning of a probability often do not understand conditioning. Conditioning is all important, and conditioning on different things massively changes the meaning of the probabilities being computed. Every physician I've known has been taught probabilistic medical diagnosis by first learning about sensitivity (sens) and specificity (spec). These are probabilities that are in backwards time- and information-flow order. How did this happen? Sensitivity, specificity, and receiver operating characteristic curves were developed for radar and radio research in the military. It was important to receive radio signals from distant aircraft, and to detect an incoming aircraft on radar. The ability to detect something that is really there is definitely important. In the 1950s, virologists appropriated these concepts to measure the performance of viral cultures. Virus needs to be detected when it's present, and not detected when it's not. Sensitivity is the probability of detecting a condition when it is truly present, and specificity is the probability of not detecting it when it is truly absent. One can see how these probabilities would be useful outside of virology and bacteriology when the samples are retrospective, as in case-control studies. But I believe that clinicians and researchers would be better off if backward probabilities were not taught or were mentioned only to illustrate how **not** to think about a problem.

But the way medical students are educated, they assume that sens and spec are what you first consider in a prospective cohort of patients! This gives the professor the opportunity of teaching Bayes' rule and requires the use of a supposedly unconditional probability known as *prevalence*, which is actually not very well defined. The student plugs everything into Bayes' rule and fails to notice that several quantities cancel out. The result is the following: the proportion of patients with a positive test who have disease, and the proportion with a negative test who have disease. These are trivially calculated from the cohort data without knowing anything about sens, spec, and Bayes. This way of thinking harms the student's understanding for years to come and influences those who later engage in clinical and pharmaceutical research to believe that type I error and p-values are directly useful.
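To make the cancellation concrete, here is a minimal sketch using a made-up 2x2 cohort table. The counts are purely illustrative and do not come from any study. It computes the probability of disease given the test result twice: once through sens, spec, prevalence, and Bayes' rule, and once directly from the cohort counts.

```python
# Hypothetical prospective cohort counts (illustrative only).
tp, fn = 90, 10      # diseased patients: test positive / test negative
fp, tn = 180, 720    # non-diseased patients: test positive / test negative

n = tp + fn + fp + tn
sens = tp / (tp + fn)        # P(test+ | disease)
spec = tn / (tn + fp)        # P(test- | no disease)
prev = (tp + fn) / n         # "prevalence" in this cohort

# Backward route: plug sens, spec, and prevalence into Bayes' rule.
p_dis_pos_bayes = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
p_dis_neg_bayes = (1 - sens) * prev / ((1 - sens) * prev + spec * (1 - prev))

# Forward route: simple proportions conditioned on the test result.
p_dis_pos_direct = tp / (tp + fp)
p_dis_neg_direct = fn / (fn + tn)

print(p_dis_pos_bayes, p_dis_pos_direct)
print(p_dis_neg_bayes, p_dis_neg_direct)
```

Both routes give the same two numbers (up to floating-point rounding), because the marginal totals that define sens, spec, and prevalence cancel when they are multiplied back together.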

The situation in medical diagnosis gets worse when referral bias (also called workup bias) is present. When certain types of patients do not get a final diagnosis, sens and spec are biased. For example, younger women with a negative test may not get the painful procedure that yields the final diagnosis. There are formulas that must be used to correct sens and spec. But wait! When Bayes' rule is used to obtain the probability of disease we needed in the first place, these corrections completely cancel out when the usual correction methods are used! Using forward probabilities in the first place means that one just conditions on age, sex, and result of the initial diagnostic test and no special methods other than (sometimes) logistic regression are required.
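A small sketch of this point, under a deliberately simplified assumption: the chance of getting the final diagnosis depends only on the initial test result (in practice it may also depend on age and sex, which one would then condition on as well). The counts and verification fractions are hypothetical, and expected counts are used in place of a random simulation.

```python
# Hypothetical full cohort in which everyone would get a final diagnosis.
full = {("D+", "T+"): 160, ("D+", "T-"): 40,
        ("D-", "T+"): 80,  ("D-", "T-"): 720}

# Assumed workup pattern: all test-positives are verified,
# but only a quarter of test-negatives are.
verify_frac = {"T+": 1.0, "T-": 0.25}

# Expected counts among patients who actually received a final diagnosis.
verified = {(d, t): n * verify_frac[t] for (d, t), n in full.items()}

def sens_spec(tab):
    """Backward probabilities: condition on true disease status."""
    sens = tab[("D+", "T+")] / (tab[("D+", "T+")] + tab[("D+", "T-")])
    spec = tab[("D-", "T-")] / (tab[("D-", "T-")] + tab[("D-", "T+")])
    return sens, spec

def p_disease_given(tab, t):
    """Forward probability: condition on the observed test result."""
    return tab[("D+", t)] / (tab[("D+", t)] + tab[("D-", t)])

print("sens, spec (full cohort):  ", sens_spec(full))
print("sens, spec (verified only):", sens_spec(verified))
for t in ("T+", "T-"):
    print(t, p_disease_given(full, t), p_disease_given(verified, t))
```

In this setup the sens and spec computed from verified patients differ from the full-cohort values, while the probability of disease given the test result is the same in both tables, because selection acted only on a variable that is already being conditioned on.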

There is an analogy to statistical testing. p-values and type I error are affected by sequential testing and a host of other factors, but forward-time probabilities (Bayesian posterior probabilities) are not. Posterior probabilities condition on what is known and do not have to imagine alternate paths to getting to what is known (as do sens and spec when workup bias exists). p-values and type I errors are backwards-information-flow measures, and clinical researchers and regulators come to believe that type I error is the error of interest. They also very frequently misinterpret p-values. The p-value is one minus spec, and power is sens. The posterior probability is exactly analogous to the probability of disease.

Sens and spec are so pervasive in medicine, bioinformatics, and biomarker research that we don't question how silly they would be in other contexts. Do we dichotomize a response variable so that we can compute the probability that a patient is on treatment B given a "positive" response? On the contrary, we want to know the full continuous distribution of the response given the assigned treatment. Again, this represents forward probabilities.

This merits a series of 4 or 5 lectures that tie all these ideas together and get our students past mere sensitivity, specificity, and the PPV. You've just nailed my foot to the floor for a couple of days this summer, writing those lectures.

The only thing I disagree with here is your implication that clinicians should not be taught about sensitivity and specificity. It really is important to know those concepts, along with prevalence and Bayes' theorem. The reason is that some tests are used in different populations and settings. The same test may be used to mass screen asymptomatic populations, and also as a confirmatory diagnostic test in people with symptoms suggestive of the disease in question. In the former setting the prior probability of disease is typically very low, and in the latter it is usually moderate or even high. The sensitivity and specificity are, under reasonably broad circumstances, the same in both populations as they are generally physico-chemical properties of the test itself. The clinician needs to understand that, and why, the same test result has very different implications in these two uses.

In fact, I think that this is not sufficiently emphasized in medical training. Most of the clinicians I deal with, even those relatively recently trained, are shocked to learn how low the positive predictive value of most screening tests is. But once they have mastered that, they also need to understand why the very same test can be very informative when applied to a patient who is likely to have the disease in question.
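The commenter's screening-versus-confirmation contrast can be sketched numerically. The function name, the sens and spec values, and the two pre-test probabilities below are all hypothetical choices for illustration, and the calculation assumes sens and spec really are the same in both settings.

```python
def post_test_probability(pre_test_prob, sens, spec, positive=True):
    """Update a pre-test probability with a binary test result via odds."""
    # Likelihood ratio for the observed result.
    lr = sens / (1 - spec) if positive else (1 - sens) / spec
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

# Same hypothetical test, two settings with different prior probabilities.
screening = post_test_probability(0.005, sens=0.9, spec=0.95)
confirmatory = post_test_probability(0.40, sens=0.9, spec=0.95)

print(f"positive result, screening setting:    {screening:.3f}")
print(f"positive result, confirmatory setting: {confirmatory:.3f}")
```

With these assumed numbers a positive result leaves the probability of disease below 10% in the screening setting but above 90% in the confirmatory one.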

I disagree. The confusion caused by sens and spec is massive and even more clearly they take valuable class and study time away from the real need: how to specify probability models and design studies that will yield the data needed to estimate their parameters. The only place I can think of for sens and spec is when you don't have data, or at least the needed data. Do we ever teach physicians how to compute the probability that a patient is male given his disease status? Why is a test different from the patient's sex?

Dr. Harrell, diagnostic tests have different uses in clinical practice. They can confirm or discard a disease when there's suspicion of disease, discard a disease in an asymptomatic patient (screening), they also have prognostic capabilities, and they can help us decide which therapy is better.

Now, we cannot mix them up, these are different questions, although some tests can be both diagnostic and prognostic.

Sensitivity and specificity help us to choose which test to order according to the clinical context. For example, an EKG could not be used to discard a myocardial infarction and a D-dimer could not be used to confirm pulmonary thromboembolism.

Regarding prevalence: it works as a starting point for prior probabilities, which take into account specific patient characteristics. This and the likelihood of the test result give us the posterior probabilities. Since rarely are we ever 100% sure of the diagnosis, based on some specified (subjective) thresholds we decide to treat, discard, or take another test.

So, I would have to disagree that sensitivity and specificity are not important in medical training.

Nothing you discussed motivates the need for sensitivity and specificity. What I need is the effect of each piece of information, e.g., the effect of age, the effect of having a test result of 17.35 on the log odds of disease. Give me a very specific example where sens and spec were needed and probability of disease or probability of outcome did not do the job.

If you were not diagnosing a patient but were interpreting RCT results, would you take time to compute the probability that a patient was on treatment B if she improved her blood pressure?

Thank you for your reply, Dr. Harrell.

Maybe we're talking past each other, and since this is an important topic, I'll try my best to understand your point of view.

To answer your question, in the context of an RCT it would not make any sense.

My question to you is, if sensitivity and specificity are not important, what would be the criteria a physician should have to choose a diagnostic test for a given clinical situation?

Regards.

Thanks for continuing this discussion because it will help me better understand the clinical issues. I should say that it's not that they are unimportant but that they are too indirect. Chapter 19 of BBR at http://biostat.mc.vanderbilt.edu/tmp/bbr.pdf gives my suggestions. Briefly, I want a partial effect plot for each predictor, whether it's a test output or a regular baseline variable. For a test output this is the contribution to the log odds of disease holding all the baseline variables constant. As shown in that chapter, once you know the regression fit (odds ratio for a binary test) you can show its impact on absolute risk of disease as a function of baseline risk in general (regardless of what gave the patient a high risk, for example). For the rare case where the test is binary, the odds ratio is the ratio of the likelihood ratio positive to the likelihood ratio negative.
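As a rough sketch of how an odds ratio translates into absolute risk at different baseline risks: the odds ratio of 4 below is an arbitrary illustrative value, not one taken from BBR, and "baseline risk" is assumed here to mean the risk of disease for a patient with a negative test. The function name is hypothetical.

```python
import math

def updated_risk(baseline_risk, odds_ratio):
    """Shift the log odds of disease by log(odds_ratio); return the new risk."""
    logit = math.log(baseline_risk / (1 - baseline_risk)) + math.log(odds_ratio)
    return 1 / (1 + math.exp(-logit))

odds_ratio = 4.0  # assumed effect of a positive binary test (illustrative)

# Absolute increase in risk due to a positive test, across baseline risks.
for baseline in (0.02, 0.10, 0.30, 0.50, 0.80, 0.95):
    risk = updated_risk(baseline, odds_ratio)
    print(f"baseline {baseline:.2f} -> test positive {risk:.3f} "
          f"(absolute increase {risk - baseline:.3f})")
```

The same odds ratio produces a small absolute change when the baseline risk is already very low or very high, and a larger change in between, which is why the test's contribution is best displayed as a function of baseline risk.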

Dr. Harrell, I read the chapter of your book and I think I'm getting the idea.

Please correct me if I'm wrong:

What you propose is to condition on other clinical variables (which are theoretically a part of the clinical diagnostic process) to "predict" the post-test log odds of the disease. A good diagnostic test would add information to a pre-test model (you picture this as the absolute diagnostic yield).

And that would be the way of knowing which test to choose in a specific clinical setting, instead of sensitivity and specificity (and LRs, I guess).

This model relies on an analytical cohort design. Most of the diagnostic studies I've read about are based on a cross-sectional design. I wouldn't know if this can change your proposal, but it doesn't look like it does.

My other concern is the translation into practice. I guess we would need the nomogram or clinical diagnostic rule (mobile app) for it to work. This can be cumbersome in some settings.

Regards.

PS: How can one interpret figure 19.6? The test is most useful with pretest probs < 80%? I'm having a hard time interpreting this figure.

I think you got what I was trying to convey, Martin. Diagnostic models require a cohort study or a non-oversampled cross-sectional study. You can also use an oversampled study such as case-control if you can apply a correction to the model's intercept (no other corrections usually needed). Yes, you need a nomogram or a prediction calculator to get the baseline risk and then to update it by the (usually continuous) test result. The technology is there now. I agree with your interpretation of the figure, until you know more about action thresholds the clinician/patient use.
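One common form of the intercept correction can be sketched as follows. It assumes that sampling depended only on disease status and that the target population's prevalence is known from an external source. The function name, coefficient values, and prevalence are hypothetical.

```python
import math

def corrected_intercept(intercept, sample_case_fraction, population_prevalence):
    """Shift a logistic intercept fitted on an oversampled study so that
    predictions are calibrated to the target population's prevalence."""
    sample_log_odds = math.log(sample_case_fraction / (1 - sample_case_fraction))
    population_log_odds = math.log(
        population_prevalence / (1 - population_prevalence))
    return intercept - sample_log_odds + population_log_odds

def predicted_risk(intercept, slope, test_value):
    """Probability of disease from a one-predictor logistic model."""
    return 1 / (1 + math.exp(-(intercept + slope * test_value)))

# Hypothetical fit from a 1:1 case-control study (half the sample are cases).
b0, b1 = -0.3, 0.8
b0_adj = corrected_intercept(b0, sample_case_fraction=0.5,
                             population_prevalence=0.10)

print("risk at test value 1.0, uncorrected:", predicted_risk(b0, b1, 1.0))
print("risk at test value 1.0, corrected:  ", predicted_risk(b0_adj, b1, 1.0))
```

Only the intercept moves. The slope, and therefore the test's contribution to the log odds, is left as estimated.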
