THE CREDENTIALS OF SCIENTIFIC EVIDENCE: REBALANCING THE EPISTEMOLOGICAL SCALES
9.2.3 Probability, P-hunting and the null hypothesis
Abstract
This commentary investigates the epistemological flaws inherent in reliance on Fisherian frequentist statistics, particularly P-values, and the null hypothesis. It argues that the focus on refuting the null hypothesis (renamed here the ‘No Observed Difference State’, NODS) is a logical error that encourages confusion between the failure to observe a difference and the absence of a difference. Eminent critics (including Paul Meehl and John Tukey) have contended that this approach has led science ‘up the garden path’ of misdirection, where, inter alia, ‘P-hunting’ (or P-hacking) replaces genuine scientific analysis and reasoning. The commentary illustrates how Fisher’s P-values and Neyman-Pearson hypothesis testing create a disordered methodology that may provide exact answers, but to the wrong questions. It concludes that P-values have little place in biology or medical science, and that we need more quality thinking that prioritizes basic scientific knowledge and biological plausibility over mere statistical significance. It also discusses problems with the null hypothesis, including its epistemological inconsistency with empiricism.
Introduction
P-hunting: the illogical in pursuit of the indefensible [KG]
Foxhunting: the unspeakable in pursuit of the uneatable (Oscar Wilde)
Just how far ‘down the primrose path’ Fisher has propelled the scientific community over the last century is hard to assess. As the eminent researcher Meehl stated years ago, in a paper that has been cited over 3,000 times [1]:
I suggest to you that Sir Ronald [Fisher] has befuddled us, mesmerized us, and led us down the primrose path. I believe that the almost exclusive reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology. I am not making some nit-picking statistician’s correction. I am saying that the whole business is so radically defective as to be scientifically almost pointless.
The extent, breadth, and seriousness of the P value misdirection seem not yet to be sufficiently appreciated, at least by medical researchers. In practice, P values are inextricably linked with the null hypothesis, although P values and null hypothesis testing are separate ideas with separate origins: Neyman and Pearson (Egon, son of Karl, who was born ‘Carl’ and adopted the spelling ‘Karl’ after studying in Germany around 1880) developed the theory of hypothesis testing, and Fisher developed P values.
The Null hypothesis
Calling the null hypothesis a ‘hypothesis’ is the first step in an aggrandizing misdirection, since it hardly justifies the epithet: a hypothesis is a proposed explanation for something, which acts as the foundation for further investigation. The null hypothesis is merely a statement of the negative. It would be better to call it the no observed difference state (NODS); if that sounds silly, then it is a good name for a somewhat silly idea. Rejecting the null hypothesis when it is true is termed a type I error; accepting the null hypothesis when it is false is termed a type II error.
It is a misconceived term. First, the word ‘null’ is not commonly used and has several different meanings, so it contributes nothing to the understandability of the term. Second, as stated above, it is not a hypothesis, just a statement of the negative. Like all double negatives, it is inherently difficult to conceptualise, partly because most people recognise it as a denial of empiricism and hence contrary to common sense.
Thinking of it as the ‘no observed difference state’ illuminates its major logical problem, which is that not finding a difference is quite different from there not being a difference: just because one cannot see something does not mean it is not there. This is especially relevant to drug comparison trials, because rating scales do not, indeed cannot, measure changes that are not anticipated or not understood, or to which they are not sensitive. It is like trying to find a hot object on a dark night with a normal optical camera instead of an infrared heat-detecting camera. With the optical camera you will not see the object, but it is still there: it is a ‘no observed difference state’. We are using methods of assessment without even knowing whether they are capable of seeing what is there. That is even more problematic than not knowing the difference between an optical camera and a heat-sensitive camera or, worse still, not knowing whether what you are looking for is hotter than the surrounding environment.
This illustrates the logical flaw with P values and null hypothesis testing. The hidden and often forgotten presumptions are that a failure to observe a difference really does indicate that nothing is there, and that the P value is valid and meaningful. These two conditions are rarely met in trials of psychotropic medications, or in any other RCTs.
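The gap between ‘no observed difference’ and ‘no difference’ can be made concrete with a small simulation. The sketch below is illustrative only: it assumes normally distributed rating-scale scores, a hypothetical true drug effect of 2 points with a standard deviation of 8, and a simple normal-approximation test. The function names and all numbers are assumptions introduced here, not taken from any trial. It counts how often a trial fails to reach P < 0.05 even though a real difference exists.

```python
import math
import random
import statistics
from typing import Sequence


def two_sided_p_from_z(z: float) -> float:
    """Two-sided P value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def normal_approx_p_value(a: Sequence[float], b: Sequence[float]) -> float:
    """Approximate two-sided P value for a difference in means.

    Uses a normal approximation with sample variances; adequate for
    illustration, slightly anti-conservative for small samples.
    """
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return two_sided_p_from_z(z)


def miss_rate(true_diff: float, sd: float, n_per_arm: int,
              n_trials: int = 2000, alpha: float = 0.05,
              seed: int = 1) -> float:
    """Fraction of simulated trials that do NOT reach P < alpha,
    although a real difference of `true_diff` exists (hypothetical values)."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_trials):
        drug = [rng.gauss(true_diff, sd) for _ in range(n_per_arm)]
        comparator = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        if normal_approx_p_value(drug, comparator) >= alpha:
            misses += 1
    return misses / n_trials


if __name__ == "__main__":
    # The true difference is identical in every run; only the sample size changes.
    for n in (20, 80, 320):
        print(n, round(miss_rate(true_diff=2.0, sd=8.0, n_per_arm=n), 2))
```

Under these assumptions the small trials should miss the real difference most of the time, and the miss rate should fall as the sample size grows. The difference was there throughout; only the ability to observe it changed.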
The null hypothesis may be methodologically consistent with the deductive approach to science but, as Rothman [2] (cited >6,000 times) puts it:
To entertain the universal null hypothesis is, in effect, to suspend belief in the real world and thereby to question the premises of empiricism
Sir Michael Rawlins explains [3]:
the null hypothesis is inappropriate when [inter alia] previously published studies have already shown benefit. Yet surveys over the past 10 years show that 73% of RCTs, published in major journals, persistently fail to make any systematic attempt to set their results in the context of previous investigations [see [4]].
He goes on to say:
The null hypothesis is even more awkward for trials seeking to show whether there is no difference (equivalence), no less benefit (non-inferiority), or not less than a prespecified difference (futility), between treatment groups. All require prior assumptions to be made about the extent to which the differences between treatments might be relevant or important.
One might add that the null hypothesis becomes even more problematic when one considers another major problem: a lack of precision that leads to failing to see a difference that does exist. This can be a function of the instruments used to seek that difference, which may be unable to detect it (see below). In trials of antidepressant drugs this is crucial, because of the imprecision in the definition of symptoms and the small degree of change that is being detected.
For example, if we compare imipramine with clomipramine using the standard rating instrument, the Hamilton Depression Rating Scale (HDRS), then we are unlikely to find a difference. However, if we use a rating scale for OCD symptoms we will find a clear difference. Likewise, if we compare an SSRI with an NRI we might equally find no difference on a standard rating scale, but we would see a difference if we used a different measuring instrument.
For a Bayesian, it goes against pharmacological and scientific rationale to imagine that SSRIs and NRIs are having the same effect on the system. In addition, there is the question of the extent to which major depressive disorder (as defined in DSM) is heterogeneous, or is patho-physiologically comparable to melancholic depression. This means that the testing of most drugs for decades has been misleading and that the RCTs accepted by the EMA and the FDA for the registration of drugs have low validity. Factors affecting this, and their implications, are discussed in detail in other commentaries.
This constitutes a specific example of how RCTs are hamstrung through being disconnected from basic science and pre-existing scientific evidence — a major factor diminishing their credentials as a technique for scientific investigation of causes and mechanisms. Some might describe that as a fatal flaw.
Thus, the ‘null hypothesis’ becomes nonsensical in many real-world scenarios: we do not know what we do not know, so how can we know how to look for it? With a metal detector? Or an infrared camera?
The P value
The P value is not the probability that the null hypothesis is true; rather, it is the probability of obtaining a difference at least as large as the one observed, if the null hypothesis is true.
In 2016 the American Statistical Association published a criticism concerning P values, presented by Wasserstein [5], which has now been cited over 8,000 times. I quote its conclusion unabridged:
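The distinction can be shown with a short calculation. The sketch below uses hypothetical numbers throughout: 15 of 20 patients improving, a null improvement rate of 0.5, and assumed values for the prior probability of the null and for study power. It first computes a one-sided exact binomial P value, then uses Bayes’ rule to compute the quantity the P value is often mistaken for, namely the probability that the null is true given a ‘significant’ result.

```python
from math import comb


def binomial_p_value_one_sided(successes: int, n: int,
                               p_null: float = 0.5) -> float:
    """P(X >= successes) when X ~ Binomial(n, p_null).

    This is a P value: the probability of data at least this extreme,
    calculated on the assumption that the null is true.
    """
    return sum(comb(n, k) * p_null**k * (1 - p_null)**(n - k)
               for k in range(successes, n + 1))


def prob_null_given_significant(prior_null: float, alpha: float,
                                power: float) -> float:
    """Bayes' rule: P(null is true | result is significant at level alpha).

    prior_null and power are assumptions supplied by the analyst;
    neither can be read off the P value itself.
    """
    false_positives = prior_null * alpha
    true_positives = (1 - prior_null) * power
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Hypothetical data: 15 of 20 patients improve; null improvement rate 0.5.
    print(round(binomial_p_value_one_sided(15, 20), 4))            # 0.0207
    # Same significance threshold, different prior plausibility of the null.
    print(round(prob_null_given_significant(0.9, 0.05, 0.8), 2))   # 0.36
    print(round(prob_null_given_significant(0.5, 0.05, 0.8), 3))   # 0.059
```

With these assumed inputs, a result that is ‘significant at 0.05’ leaves the null with a probability of about 0.36 when it was implausible to reject beforehand, and about 0.06 when it was an even bet. The prior, which reflects existing scientific knowledge, does much of the work, and the P value alone says nothing about it.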
Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning
Regina Nuzzo published an article in Nature entitled ‘Scientific Method: Statistical Errors’ [6], which had also been cited more than 8,000 times as of the end of 2025. It states:
It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation
Altman and colleagues [7] stated, shortly before he died (this paper had nearly 4,000 citations as of January 2026):
Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power [P values] have been decried for decades, yet remain rampant.
Altman followed that with, in the tactful-speak required by medical journals:
Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists
He is saying that they are insufficiently informed, and insufficiently assiduous, to do it properly.
There is now a large body of literature, and statements by learned societies, discussing the widespread misuse of P values [7-10]. These papers have been cited thousands of times, yet the poor practice continues.
When these procedures have been misused for more than half a century already, a review [7] should not need to end with ‘We conclude with guidelines for improving statistical interpretation and reporting’ — could they not, should they not, have got it right by now? They have had 70 years of opportunity — to imagine that medical science is suddenly going to get it right after all that time is ‘a triumph of hope over experience’, as Samuel Johnson expressed it.
It is time for a change; indeed, it is well past time.
Another dimension of misuse is P-hacking, or P-hunting, an illegitimate exercise whereby experimenters examine their data post hoc, looking for variables that can be manipulated and for different outcome measures that can be arbitrarily selected to show more favourable P-values [11-15]. Andrade discusses these and similar questionable research practices in a concise article [12]. The common use of this practice illustrates two important points: first, the poor standard of refereeing for medical journals, which should prevent such papers ever being accepted for publication; second, how commonly P-values are misused, and how minimal their scientific value is. What matters is quality thinking behind the experiment and thoughtful interpretation of the result, irrespective of arbitrary P-values. Thoughtful interpretation usually involves an understanding of biological science, pharmacology, and Bayesian prior probabilities.
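The effect of selecting the most favourable outcome measure can be sketched in a simulation. The code below assumes a world in which the treatment does nothing at all, each trial records several outcome measures, and the experimenter reports only the smallest P value. The outcomes are assumed to be independent and normally distributed with known standard deviation; these are simplifications for illustration, since real outcome measures are usually correlated, which reduces but does not remove the inflation. All function names are hypothetical.

```python
import math
import random
from typing import Sequence


def z_test_p(a: Sequence[float], b: Sequence[float], sd: float = 1.0) -> float:
    """Two-sided z-test P value for a difference in means, sd assumed known."""
    se = sd * math.sqrt(1 / len(a) + 1 / len(b))
    z = (sum(a) / len(a) - sum(b) / len(b)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def false_positive_rate(n_outcomes: int, n_per_arm: int = 30,
                        n_trials: int = 2000, alpha: float = 0.05,
                        seed: int = 7) -> float:
    """Share of null trials declared 'significant' when only the
    smallest P value among n_outcomes measures is reported."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        best_p = min(
            z_test_p([rng.gauss(0.0, 1.0) for _ in range(n_per_arm)],
                     [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)])
            for _ in range(n_outcomes)
        )
        if best_p < alpha:
            hits += 1
    return hits / n_trials


def expected_rate(n_outcomes: int, alpha: float = 0.05) -> float:
    """Analytical rate for independent outcomes: 1 - (1 - alpha)^k."""
    return 1 - (1 - alpha) ** n_outcomes


if __name__ == "__main__":
    for k in (1, 5, 10):
        print(k, round(false_positive_rate(k), 3), round(expected_rate(k), 3))
```

Under these assumptions, with ten outcome measures to choose from, roughly four in ten trials of a completely ineffective treatment would yield at least one P value below 0.05, compared with the nominal one in twenty.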
The eminent American statistician John Tukey made severe criticisms of Fisher and described his frequency-based ideas with these words [16]:
…emanating from the world of infancy, the childhood of experimental statistics, the childhood spent in the school of agronomy … almost invariably when closely inspected data are found to violate the standard assumptions required by frequentists. … far better [is] an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise. … by and large, the great innovations in frequency-based statistics have not had correspondingly great effects on data analysis
That sums it up admirably: ‘far better [is] an approximate answer to the right question … than an exact answer to the wrong question.’
It is apposite to revisit Fisher’s classic exposition of P values in the ‘tearoom’ experiment, conceived in the very same tearoom that was the source of the joke about the collective noun for a group of statisticians being ‘a quarrel’. This is the classic ‘lady tasting tea’ experiment, as described in the review by Brereton [17]. The cogency of Tukey’s scathing criticism of Fisher’s approach is straightforwardly illustrated by understanding the chemistry, physiology, and pharmacology of mixing tea and milk. The effect of the lactose sugars in milk on the taste of tea works partly by modifying the perception of tannin, the concentration of which depends on the type and strength of the tea being brewed, the serving temperature (tea-drinking aficionados would brew different sorts of tea at different temperatures), the qualities and temperature of the milk, and so on. Furthermore, the proteins in saliva neutralise tannin, so tasting a number of cups in a row will alter the perception of their taste, because the amount of protein in saliva decreases following persistent stimulation of salivary flow. That is why oranges are pleasant when you have a dry mouth at the halftime interval of a sporting match (citrus fruits are acidic, and acid is the most potent stimulator of salivary flow), whereas cold tea would taste disgusting.
Thus, one can see how Fisher’s use of statistics in vacuo, without understanding or consideration of physiology and pharmacology, makes a complete nonsense of the experiment. It is a perfect example of what Tukey said: ‘an exact answer to the wrong question’. It also illustrates the fact that such experiments are scientifically useless because they do not enable one to answer the question of why and how (cause and mechanism): they merely answer a trivial ‘what/if’ question.
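The ‘exact answer’ in question is easy to reproduce. The sketch below assumes the commonly described design of the experiment: eight cups, four with milk poured first and four with tea poured first, with the taster told there are four of each and asked to pick the four milk-first cups. The function name is hypothetical. It computes the probability of a given number of correct picks by pure guessing.

```python
from math import comb


def p_at_least(k_correct: int, n_cups: int = 8, n_milk_first: int = 4) -> float:
    """Probability of identifying at least k_correct of the milk-first cups
    by guessing alone (hypergeometric counting over all possible selections)."""
    n_tea_first = n_cups - n_milk_first
    total_selections = comb(n_cups, n_milk_first)
    favourable = sum(
        comb(n_milk_first, k) * comb(n_tea_first, n_milk_first - k)
        for k in range(k_correct, n_milk_first + 1)
    )
    return favourable / total_selections


if __name__ == "__main__":
    print(round(p_at_least(4), 4))  # 0.0143, i.e. 1 in 70
    print(round(p_at_least(3), 4))  # 0.2429, i.e. 17 in 70
```

The arithmetic is exact: a perfect score has a chance probability of 1 in 70. Note, however, that nothing in the calculation involves tannin, temperature, milk, or saliva. It answers only whether the result could be a lucky guess, not why or how a difference in taste might arise.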
Lastly, note that even in the top medical journals most papers have no statistical review, as Altman reminded us in 1998 [18], and 20 years later the situation was still just as bad [19]. Even worse, most had no statistician involved at the planning stage of the trial in the first place, thereby impairing its ability to address the question in a way likely to provide a useful and valid answer. Furthermore, to add insult to injury, many journals severely limit the ability of experts to make post-publication criticisms [20].
It is reasonable to conclude that P values have little place in biology or medical science and that they have been responsible for much serious misdirection over the last half century.
References
1. Meehl, P.E., Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. 1992.
2. Rothman, K.J., No adjustments are needed for multiple comparisons. Epidemiology, 1990: p. 43-46.
3. Rawlins, M., De testimonio: on the evidence for decisions about the use of therapeutic interventions. Lancet, 2008. 372(9656): p. 2152-61.
4. Clarke, M., S. Hopewell, and I. Chalmers, Reports of clinical trials should begin and end with up-to-date systematic reviews of other relevant evidence: a status report. J R Soc Med, 2007. 100(4): p. 187-90.
5. Wasserstein, R.L. and N.A. Lazar, The ASA statement on p-values: context, process, and purpose. The American Statistician, 2016. 70: p. 129-133.
6. Nuzzo, R., Scientific method: statistical errors. Nature, 2014. 506(7487): p. 150-2.
7. Greenland, S., et al., Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol, 2016. 31(4): p. 337-50.
8. Ioannidis, J.P.A., The Proposal to Lower P Value Thresholds to .005. JAMA, 2018. 319(14): p. 1429-1430.
9. Krzywinski, M. and N. Altman, Significance, P values and t-tests. Nat Methods, 2013. 10(11): p. 1041-2.
10. Altman, D.G., The scandal of poor medical research. BMJ, 1994. 308(6924): p. 283-4.
11. Szucs, D., A Tutorial on Hunting Statistical Significance by Chasing N. Front Psychol, 2016. 7: p. 1444.
12. Andrade, C., HARKing, Cherry-Picking, P-Hacking, Fishing Expeditions, and Data Dredging and Mining as Questionable Research Practices. J Clin Psychiatry, 2021. 82(1).
13. Head, M.L., et al., The extent and consequences of p-hacking in science. PLoS Biol, 2015. 13(3): p. e1002106.
14. Stefan, A.M. and F.D. Schonbrodt, Big little lies: a compendium and simulation of p-hacking strategies. R Soc Open Sci, 2023. 10(2): p. 220346.
15. Chavalarias, D., et al., Evolution of Reporting P Values in the Biomedical Literature, 1990-2015. JAMA, 2016. 315(11): p. 1141-8.
16. McGrayne, S.B., The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code. 2011: Yale University Press.
17. Brereton, R.G., P-values and Ronald Fisher. Journal of Chemometrics, 2020. 34(9).
18. Goodman, S.N., D.G. Altman, and S.L. George, Statistical reviewing policies of medical journals: caveat lector? J Gen Intern Med, 1998. 13(11): p. 753-6.
19. Hardwicke, T.E. and S.N. Goodman, How often do leading biomedical journals use statistical experts to evaluate statistical methods? The results of a survey. PLoS One, 2020. 15(10): p. e0239598.
20. Hardwicke, T.E., et al., Post-publication critique at top-ranked journals across scientific disciplines: a cross-sectional assessment of policies and practice. R Soc Open Sci, 2022. 9(8): p. 220139.
