
This question strikes at the heart of the tenuous link between statistics and causality in science
Writer: Dan Jacobson
Editor: Ceri Ngai
Artist: Zach Ng
How would you prove that smoking causes cancer?
You could say that heavy smokers are 25 times more likely to get lung cancer than non-smokers, but you’d be told that 30% of lung cancer patients are non-smokers. You could mention that the quadrupling of lung cancer rates in the early 20th century coincided with the explosion of the tobacco industry, but then be told that it also coincided with more pollution, road tarring, and two World Wars. You could run a randomised controlled trial, but asking thousands of healthy volunteers to smoke cigarettes for thirty years isn’t exactly ethically, or indeed logistically, viable.
Proving that smoking causes cancer requires not only navigating the lack of a one-to-one relationship between the two, but also convincing both a $1 trillion industry and one billion smokers worldwide. That tobacco causes lung cancer has since been proven, but confirming this took decades.
In her book p53: The Gene That Cracked The Cancer Code, Sue Armstrong credits the confirmation of this link to biochemists, who provided a mechanism by which it occurs. In the 1990s, Gerd Pfeifer and Mikhail Denissenko demonstrated that BPDE, the transformed product of a substance found in tar called benzo(a)pyrene, caused alterations in the p53 gene which were only observed in smokers, demonstrating that smoking causes cancerous mutations.
However, in The Book of Why, the computer scientist and philosopher Judea Pearl suggests that this causal association had been demonstrated decades earlier, by the statistician Jerome Cornfield. Until the 1950s, sceptics of the smoking-cancer hypothesis, most notably the UCL-based professor Ronald Fisher, attributed the link to unmeasured ‘confounders’ – factors which separately influence both smoking and cancer. Incidentally, Fisher also smoked like a chimney.
Cornfield’s genius was to challenge Fisher’s hypothesis from a mathematical perspective. Consider, for a moment, a hypothetical confounder, such as a ‘smoking gene’. For this gene to account for the whole link between smoking and cancer, it must be at least as strongly associated with both cancer and smoking individually as they are with each other. Because the correlation between smoking and cancer is so strong, a confounder powerful enough to explain it away becomes highly implausible.
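Cornfield’s reasoning can be checked with a little arithmetic. The sketch below is an illustration rather than Cornfield’s own calculation: it assumes a simple yes-or-no ‘smoking gene’, assumes that smoking itself has no effect on cancer, and uses invented carrier frequencies. The function names and all of the gene’s numbers are hypothetical; only the 25-fold figure comes from the article.

```python
def observed_relative_risk(gene_risk_ratio, carriers_among_smokers,
                           carriers_among_nonsmokers):
    """Cancer risk in smokers relative to non-smokers, produced by the gene alone.

    Assumption: smoking has no effect of its own. Carriers of the
    (hypothetical) gene have `gene_risk_ratio` times the cancer risk of
    non-carriers, and the gene is simply more common among smokers.
    Risks are measured relative to a non-carrier, so the baseline cancels.
    """
    excess = gene_risk_ratio - 1.0
    risk_smokers = 1.0 + excess * carriers_among_smokers
    risk_nonsmokers = 1.0 + excess * carriers_among_nonsmokers
    return risk_smokers / risk_nonsmokers


def gene_could_explain(target_rr, gene_risk_ratio, carriers_among_smokers,
                       carriers_among_nonsmokers):
    """True if the gene alone could produce a relative risk of at least `target_rr`."""
    produced = observed_relative_risk(
        gene_risk_ratio, carriers_among_smokers, carriers_among_nonsmokers
    )
    return produced >= target_rr


TARGET_RR = 25.0  # heavy smokers vs non-smokers, as quoted in the article

# Invented example: 90% of smokers and 10% of non-smokers carry the gene,
# and carrying it multiplies cancer risk by 50.
produced = observed_relative_risk(50.0, 0.90, 0.10)
print(f"Relative risk produced by the gene alone: {produced:.1f}")
print("Enough to explain the smoking link?",
      gene_could_explain(TARGET_RR, 50.0, 0.90, 0.10))
```

In this simplified model, the relative risk the gene can produce never exceeds either the gene’s own effect on cancer or the ratio of carrier frequencies between smokers and non-smokers. To mimic a 25-fold difference, the gene would need to be at least 25 times more common among smokers and to raise cancer risk at least 25-fold, which is the sense in which such a confounder is hard to believe.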
The issue is that common statistical techniques assess correlation, not causation, making raw numbers insufficient for proving and analysing these relationships. This is intriguing, as modern statistics arose from fundamentally causal questions being asked by the UCL-affiliated statisticians Francis Galton and Karl Pearson, even if these ‘questions’ amounted solely to eugenics and scientific racism.
We know that correlation does not imply causation, but the implications of this adage extend beyond how we interpret results. It should also inform the way researchers ask questions in the first place, so that research priorities are grounded less in the final results and more in the contexts which give them their significance.