Thursday, March 16, 2017

Subjective Ranking of Quality of Research by Subject Matter Area

Having been engaged in biomedical research for a few decades, and having watched the reproducibility of research as a whole, I've developed my own ranking of the reliability/quality/usefulness of research across several subject matter areas.  This list is far from complete.  Let's start with a subjective list of the areas in which I perceive published research is least likely to be both true and useful.  The list is ordered in ascending order of quality, with the most problematic area first.  You'll notice that a vast number of areas are not listed, because I have minimal experience with them.

Some excellent research is done in all subject areas.  This list is based on my perception of the proportion of publications in the indicated area that are rigorously scientific, reproducible, and useful.

Subject Areas With Least Reliable/Reproducible/Useful Research

  1. any area where there is no pre-specified statistical analysis plan and the analysis can change on the fly when initial results are disappointing
  2. behavioral psychology
  3. studies of corporations to find characteristics of "winners"; regression to the mean kicks in, making predictions useless for changing your company
  4. animal experiments on fewer than 30 animals
  5. discovery genetics not making use of biology while doing large-scale variant/gene screening
  6. nutritional epidemiology
  7. electronic health record research reaching clinical conclusions without understanding confounding by indication and other limitations of data
  8. pre-post studies with no randomization
  9. non-nutritional epidemiology not having a fully pre-specified statistical analysis plan [few epidemiology papers use state-of-the-art statistical methods and have a sensitivity analysis related to unmeasured confounders]
  10. prediction studies based on dirty and inadequate data
  11. personalized medicine
  12. biomarkers
  13. observational treatment comparisons that do not qualify for the second list (below)
  14. small adaptive dose-finding cancer trials (3+3 etc.)
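The regression-to-the-mean problem in item 3 can be illustrated with a small simulation (all numbers hypothetical; firm "performance" is modeled as a stable true quality plus period-specific luck):

```python
import random
import statistics

random.seed(42)

# Hypothetical model: each firm's observed performance in a period is
# its stable true quality plus period-specific noise ("luck").
n_firms = 10_000
quality = [random.gauss(0, 1) for _ in range(n_firms)]
period1 = [q + random.gauss(0, 1) for q in quality]
period2 = [q + random.gauss(0, 1) for q in quality]

# Select the "winners": top 10% of firms by period-1 performance.
ranked = sorted(range(n_firms), key=lambda i: period1[i], reverse=True)
winners = ranked[: n_firms // 10]

p1 = statistics.mean(period1[i] for i in winners)
p2 = statistics.mean(period2[i] for i in winners)
print(f"winners' mean performance, period 1: {p1:.2f}")
print(f"winners' mean performance, period 2: {p2:.2f}")
```

Because half of the winners' period-1 variance here is luck, their period-2 mean falls roughly halfway back toward the overall average, even though nothing about the firms changed — so the "winning" characteristics predict little.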

Subject Areas With Most Reliable/Reproducible/Useful Research

The most reliable and useful research areas are listed first.  All of the following are assumed to (1) have a prospective, pre-specified statistical analysis plan and (2) use purposeful, prospective, quality-controlled data acquisition (yes, this applies to high-quality non-randomized observational research).
  1. randomized crossover studies
  2. multi-center randomized experiments
  3. single-center randomized experiments with non-overly-optimistic sample sizes
  4. adaptive randomized clinical trials with large sample sizes
  5. physics
  6. pharmaceutical industry research that is overseen by FDA
  7. cardiovascular research
  8. observational research [however only a very small minority of observational research projects have a prospective analysis plan and high enough data quality to qualify for this list]

Some Suggested Remedies

Peer review of research grants and manuscripts is done primarily by experts in the subject matter area under study.  Most journal editors and grant reviewers are not expert in biostatistics.  Every grant application and submitted manuscript should undergo rigorous methodologic peer review by methodologic experts such as biostatisticians and epidemiologists.  All data analyses should be driven by a prospective statistical analysis plan, and the entire self-contained data manipulation and analysis code should be submitted to journals so that potential reproducibility and adherence to the statistical analysis plan can be confirmed.  Readers should have access to the data in most cases and should be able to reproduce all study findings using the authors' code, plus run their own analyses on the authors' data to check robustness of findings.

Medical journals are reluctant to (1) publish critical letters to the editor and (2) retract papers.  This has to change.

In academia, too much credit is still given to the quantity of publications and not to their quality and reproducibility.  This too must change.  The pharmaceutical industry has the FDA to validate its research.  The NIH does not serve this role for academia.

Rochelle Tractenberg, Chair of the American Statistical Association's Committee on Professional Ethics and a biostatistician at Georgetown University, said in a 2017-02-22 interview with The Australian that many questionable studies would not have been published had formal statistical reviews been done.  When she reviews a paper she starts with the premise that the statistical analysis was incorrectly executed.  She stated that "Bad statistics is bad science."

8 comments:

  1. I cannot imagine any statistician disagreeing with your Suggested Remedies. A point of clarification on your second list: I presume that for the randomized trials on that list, a prerequisite is that they be adequately blinded and have sufficient power for realistic effect sizes. I have seen randomized crossover and single-center studies that were just too small to have sufficient power against any reasonable alternative.

  2. Great points. I was taking those facets to be understood but should not have. Thanks for the comment. Regarding blinding (masking) I'm less worried.

  3. The behavior of med journals regarding errors and statistical criticism is scandalous. But I disagree with the above comments about sample size and power. That a study is too small or is underpowered is by itself an invalid criticism. The fault lies instead with misinterpretation and misuse of its results. The results from a small or underpowered but otherwise well-done study can be valuable if focused on the confidence limits, not the P-value or "significance" - especially when considered in the larger research context. In particular, its results can be pooled with those from other studies to reveal more sharply what might be ambiguous from separate examinations of the studies.

    Replies
    1. It's wonderful to have you involved Sander. I have pointed hundreds of people to your amazing 2000 paper, which serves as a perfect example of how the statistical analysis strategy affects the reliability of results in nutritional epidemiology (and is applicable to many other fields). You are perfectly correct in your comments. It's really about misinterpretation of results and lack of emphasis on confidence intervals when a study is unbiased and well executed. I tell investigators, when they really want to launch an under-sized study, that the power and precision will be poor but the confidence intervals will tell most of the story and need to be presented first in the paper.

    2. In both pharma and non-pharma clinical trials, there is often a gray area regarding study design and sample size. The company or the investigator doesn't have the money to do a large Phase 3 trial but does have enough to run a non-trivial randomized trial. I always felt uncomfortable justifying a smaller sample size based on a large effect size, but if I didn't do that, colleagues said there was no chance the study would be approved by IRBs. In pharma, there is also a huge desire to do hypothesis tests in the hope of using the study for future registration of a drug.

    3. I've been in the same position. The IRBs need to know that a priori futile studies are not ethical, and statisticians need to be stronger. One incremental approach is to forbid the use of p-values and to only report confidence intervals for such studies, or Bayesian posterior probabilities (including the probability of similarity).

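A minimal sketch of the confidence-interval-first reporting discussed above, for a small two-arm study (all outcome data hypothetical; normal-theory pooled-variance interval with the 22-df t critical value hard-coded to stay stdlib-only):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical small trial: 12 patients per arm, made-up outcomes.
control   = [5, 6, 7, 5, 6, 7, 5, 6, 7, 5, 6, 7]
treatment = [6, 7, 8, 6, 7, 8, 6, 7, 8, 6, 7, 8]

n1, n2 = len(control), len(treatment)
diff = mean(treatment) - mean(control)

# Pooled standard deviation and standard error of the difference.
sp = sqrt(((n1 - 1) * stdev(control) ** 2 + (n2 - 1) * stdev(treatment) ** 2)
          / (n1 + n2 - 2))
se = sp * sqrt(1 / n1 + 1 / n2)

t_crit = 2.074  # 0.975 quantile of the t distribution, 22 df
lo, hi = diff - t_crit * se, diff + t_crit * se
print(f"difference: {diff:.2f}, 95% CI: ({lo:.2f}, {hi:.2f})")
```

The interval, not a significance verdict, is the reportable result: it shows both what the small study can and cannot rule out, and it is directly poolable with other studies.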
  4. Can you expand on what you mean by "overly optimistic sample sizes" for single-center randomized studies? Do you mean situations where the sample size is small but the researchers are overly optimistic about what they'll be able to learn from the data?

  5. Sure. Many single-center randomized trials do not have huge budgets but were approved because the frequentist statistical power was >= 0.8. However, the power was >= 0.8 only because a more-than-clinically-important effect size was used in the power calculation, in order to keep the sample size within budget. When you power a study to detect a miracle and all you get is a clinically meaningful effect, you are left with nothing.

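The mismatch described above can be sketched with a standard normal-approximation power calculation for a two-arm comparison of means (the effect sizes are hypothetical: a standardized 0.8 stands in for the "miracle" used to justify the budget, 0.4 for a clinically meaningful effect):

```python
from math import ceil, sqrt
from statistics import NormalDist

norm = NormalDist()
alpha, target_power = 0.05, 0.80
z_a = norm.inv_cdf(1 - alpha / 2)  # two-sided critical value, ~1.96
z_b = norm.inv_cdf(target_power)   # ~0.84

def n_per_arm(delta):
    """Sample size per arm to detect standardized effect delta
    with the target power (normal approximation)."""
    return ceil(2 * ((z_a + z_b) / delta) ** 2)

def achieved_power(delta, n):
    """Power at true standardized effect delta with n patients per arm."""
    return norm.cdf(delta * sqrt(n / 2) - z_a)

n_miracle = n_per_arm(0.8)            # budget-friendly "miracle" effect
power_realistic = achieved_power(0.4, n_miracle)
print(n_miracle)                      # 25 per arm
print(round(power_realistic, 2))      # ~0.29
```

Powering for the miracle buys a trial of about 25 per arm, but if the true effect is the merely clinically meaningful 0.4, the same trial has under 30% power — exactly the "left with nothing" scenario.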