This is a bit of a detour from the wine and food topics on which I usually write. But it is an important issue that concerns the credibility of science. The credibility of science, and of expertise in general, is now coming under constant attack. Yet no advanced technological society can survive for long without a science that is both accurate and believed to be accurate by the public that depends on it. Attacks on science threaten the very foundation of modern societies. Yet the skepticism about science is not coming only from opportunistic politicians or whack-job nut cases. There is good reason to be skeptical of our scientific institutions and how they function because of the replication crisis that is beginning to affect all scientific disciplines.
The “replication crisis” refers to the fact that an increasingly large number of studies in the sciences and the social sciences can’t be replicated. When independent researchers try to repeat a study, they come up with vastly different results that contradict the original findings. As Aubrey Clayton notes in a very useful article in Nautilus:
An analysis of preclinical cancer studies found that only 11 percent of results replicated; of 21 experiments in social science published in the journals Science and Nature, only 13 (62 percent) survived replication; in economics, a study of 18 frequently cited results found 11 (61 percent) that replicated; and an estimate for preclinical pharmacology trials is that only 50 percent of the positive results are reproducible, a situation that, given the immense size of the pharma industry, has been estimated to cost labs something like $28 billion per year in the U.S. alone.
This is not good. Clayton’s article is important because he explains why there is a replication crisis: it has to do with a long-simmering debate in statistical analysis. I’m no expert in statistics, but I will try to give a clear summary of what this debate is about; the fate of civilization may depend on it. Clayton uses several examples; I will focus on one of them.
Suppose an otherwise healthy woman in her forties notices a suspicious lump in her breast and goes in for a mammogram. The report comes back that the lump is malignant. She wants to know the chance of the diagnosis being wrong. Her doctor answers that, as diagnostic tools go, these scans are very accurate. Such a scan would find nearly 100 percent of true cancers and would only misidentify a benign lump as cancer about 5 percent of the time. Therefore, the probability of this being a false positive is very low, about 1 in 20.
This is an approach to statistics that uses significance testing to determine the validity of a result. The reasoning goes like this:
Suppose we scan 1 million similar women, and we tell everyone who tests positive that they have cancer. Then, among those who actually have cancer, we will be correct every single time. And among those who don’t have it, we will only be incorrect 5 percent of the time. So, overall our procedure will be incorrect less than 5 percent of the time.
The claim about validity is based on how often a patient would test positive if the condition were absent. In other words, if the lump were benign, a positive result like this would occur less than 5 percent of the time, so the finding counts as statistically significant.
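To see how that bookkeeping works out, here is a minimal sketch in Python. The one-million-woman population comes from the quote above; the 1 percent prevalence is a hypothetical base rate (the same figure the physician uses later in the post), and, tellingly, the frequentist error rate barely depends on it.

```python
# A sketch of the frequentist accounting behind "incorrect less than
# 5 percent of the time," using the illustrative numbers from the quote.

population = 1_000_000      # "Suppose we scan 1 million similar women"
prevalence = 0.01           # hypothetical share whose lump is actually malignant
false_positive_rate = 0.05  # benign lumps flagged as malignant 5% of the time

with_cancer = population * prevalence       # 10,000 women
without_cancer = population - with_cancer   # 990,000 women

errors_among_cancer = 0                                      # the scan finds essentially all true cancers
errors_among_healthy = without_cancer * false_positive_rate  # 49,500 false positives

overall_error_rate = (errors_among_cancer + errors_among_healthy) / population
print(f"Overall error rate: {overall_error_rate:.2%}")  # 4.95%, i.e. under 5 percent
```

Notice that this tally never asks, of the women told they have cancer, how many actually do; that is exactly the gap Clayton points to next.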
But as Clayton notes, there is a problem here. This reasoning doesn’t take into account the background rate of cancer among women with a suspicious lump.
For the breast cancer example, the doctor would need to consider the overall incidence rate of cancer among similar women with similar symptoms, not including the result of the mammogram. Maybe a physician would say from experience that about 99 percent of the time a similar patient finds a lump it turns out to be benign. So the low prior chance of a malignant tumor would balance the low chance of getting a false positive scan result. Here we would weigh the numbers:
(0.05) * (0.99) vs. (1) * (0.01)
We’d find there was about an 83 percent chance the patient doesn’t have cancer.
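That weighing of the two numbers is just Bayes’ rule applied to the scan. Here is a minimal sketch using the same hypothetical figures: a 1 percent prior chance of malignancy, a scan that finds essentially all true cancers, and a 5 percent false positive rate.

```python
# Bayes' rule for the mammogram example: weigh "benign lump, false positive"
# against "malignant lump, true positive," using Clayton's illustrative numbers.

prior_malignant = 0.01      # physician's prior: ~99% of similar lumps are benign
sensitivity = 1.0           # the scan finds essentially all true cancers
false_positive_rate = 0.05  # benign lumps flagged as malignant 5% of the time

# The two terms compared in the quote: (0.05)*(0.99) vs. (1)*(0.01)
weight_benign = false_positive_rate * (1 - prior_malignant)  # 0.0495
weight_malignant = sensitivity * prior_malignant             # 0.0100

p_benign_given_positive = weight_benign / (weight_benign + weight_malignant)
print(f"Chance the patient does not have cancer: {p_benign_given_positive:.0%}")  # ~83%
```

So even after a positive result from a “very accurate” scan, the low base rate means the diagnosis is more likely wrong than right.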
According to this analysis, the woman in the example was likely misdiagnosed. The replication crisis is happening for the same reason: many studies report a relationship between phenomena that does not actually exist, so researchers who repeat the experiments fail to find it.
The problem, according to Clayton, is that we fail to include in our calculations the prior probability of a theory before making an observation to test it. Including the prior is the approach of Bayesian probability theory, which has been around for decades but has not been widely accepted.
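In symbols (the notation here is mine, not Clayton’s: H is the hypothesis being tested and D is the observed data), Bayes’ rule says the probability of the hypothesis after seeing the data is:

$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \neg H)\,P(\neg H)}$$

In the mammogram example, H is “the lump is malignant,” D is the positive scan, P(H) is the physician’s 1 percent prior, and P(D | ¬H) is the 5 percent false positive rate. Plugging those numbers in gives about a 17 percent chance of malignancy, the complement of the 83 percent chance of no cancer computed above.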
Why hasn’t it been widely accepted?
The main reason scientists have historically been resistant to using Bayesian inference instead is that they are afraid of being accused of subjectivity. The prior probabilities required for Bayes’ rule feel like an unseemly breach of scientific ethics. Where do these priors come from? How can we allow personal judgment to pollute our scientific inferences, instead of letting the data speak for itself?
For Bayesians, we have to assign a probability before making the observations, and then we allow the observations to update that initial estimate. But the initial assignment is only going to be an educated guess, like the physician’s judgment above, based on her experience, that most lumps are benign. Might such an all-encompassing fear of subjectivity be irrational?
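One way to see what that educated guess actually does is to run the same calculation under several different hypothetical priors. The priors in this sketch are mine, chosen only for illustration; note that the last line, a 50-50 prior, roughly reproduces the doctor’s original “about 1 in 20” claim, so refusing to state a prior does not avoid one, it just hides it.

```python
# How the posterior "chance the lump is benign, given a positive scan"
# shifts under different hypothetical priors for malignancy.

sensitivity = 1.0           # the scan finds essentially all true cancers
false_positive_rate = 0.05  # benign lumps flagged as malignant 5% of the time

for prior_malignant in (0.01, 0.05, 0.10, 0.50):
    weight_benign = false_positive_rate * (1 - prior_malignant)
    weight_malignant = sensitivity * prior_malignant
    p_benign = weight_benign / (weight_benign + weight_malignant)
    print(f"prior chance of malignancy {prior_malignant:4.0%} -> "
          f"chance lump is benign after a positive scan: {p_benign:.0%}")
```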
This may have a happy ending. As Clayton notes:
Medical students are now routinely taught the diagnostic importance of base incidence rates. Bayes’ theorem helps them properly contextualize test results and avoid unnecessarily alarming patients who test positive for something rare.
Thankfully, science, when done honestly, is a self-correcting practice, and now that the replication crisis is well known, scientists are responding by taking a harder look at statistical relationships that seem implausible. But an awful lot of bad science has already entered the mainstream, giving aid and comfort to charlatans, quacks, and opportunistic politicians who would love to ignore inconvenient facts. I hope the self-correction is not too late.