Homogeneity and Simpson’s paradox

The Credentials of Scientific Evidence: Rebalancing the Epistemological Scales 

9.2.2 Homogeneity and Simpson’s paradox

Abstract

This commentary critiques the reliance on Randomized Controlled Trials (RCTs) in psychiatric research by examining the problems relating to sample heterogeneity and Simpson’s paradox. It propounds that the validity of RCTs, particularly those detecting small effect sizes (as in antidepressant trials), is contingent on achieving a high level of homogeneity for ‘diagnosis’ and across relevant variables. However, because the full extent of variables influencing conditions like depression (e.g., genetics, history, metabolism) is uncertain, adequate stratification is difficult. The paper demonstrates how this leads to Simpson’s paradox, where trends observed in subgroups disappear or reverse when the data are aggregated, potentially invalidating trial results. Thus, statistical methods alone are insufficient for elucidating treatment responses. It advocates a shift toward Bayesian approaches and causal inference to better account for complex, unmeasured variables, and for variable between-patient differences in response.

Introduction

The validity of any RCT depends on achieving homogeneity (diagnostic or pathophysiological), not only in respect of a valid and discrete disease entity, but also in respect of the relevant variables that might be stratified. This is crucially important for trials that show minor differences, i.e., all antidepressant drug trials. Blinding, stratification, and randomisation, which attempt to ensure equal distribution of unknown variables, become increasingly relevant when the differences in outcomes are small (cf. Hill) and decreasingly relevant when the outcome differences are large. Incidentally, the same is true of the supposed placebo effect.

Homogeneity

Homogeneity of variables in depression treatment studies can rarely, if ever, be reliably achieved, because we do not know what they all are. This problem demonstrates the inevitable relevance of Simpson’s paradox, which reduces the validity of, or even invalidates, many RCTs. When dealing with an imprecisely defined condition like depression, it is not known to what extent variables might be relevant; these could include (inter alia) age, gender, ethnicity, response to previous treatment, number of previous episodes, family history, bipolarity, status of CYP450 enzymes, etc. Stratifying to create a homogeneous sample with respect to multiple variables requires an enormous sample (thousands), which is practically impossible for any real-world trial, and even then one could never be sure that all relevant variables had been stratified. When the differences being measured between groups are small, as they usually are in psychiatric research, these problems can and do completely reverse the results of a trial. This is Simpson’s paradox, which must be considered and understood to appreciate how differences in apparent treatment effects may be reversed by adding a new variable.
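The scale of the stratification problem can be sketched with a short calculation. The number of levels assigned to each variable below, the figure of 10 participants per cell, and the function names are all assumptions made for illustration; none of them comes from the commentary or from any trial protocol.

```python
from math import prod

# Hypothetical numbers of levels per stratification variable. These are
# illustrative assumptions only; the commentary does not specify any
# categorisation of these variables.
LEVELS = {
    "age_band": 3,
    "gender": 2,
    "ethnicity": 4,
    "previous_treatment_response": 2,
    "previous_episodes": 3,
    "family_history": 2,
    "bipolarity": 2,
    "cyp450_status": 3,
}


def strata_count(levels):
    """Number of distinct strata when every variable is fully crossed."""
    return prod(levels.values())


def minimum_sample(levels, per_cell, arms=2):
    """Participants needed to put `per_cell` people in every stratum of every arm."""
    return strata_count(levels) * per_cell * arms


print(strata_count(LEVELS))                 # 1728 strata
print(minimum_sample(LEVELS, per_cell=10))  # 34560 participants
```

Under these assumed categories, eight known variables already produce 1,728 strata, and each further variable multiplies that count again. Variables that nobody has identified cannot be entered into the calculation at all.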

When comparing drug treatments, homogeneity of pathophysiology becomes crucial.

When drugs with different pharmacological profiles are being compared, in a condition where the underlying pathophysiology is uncertain, serious misdirection and confusion are likely. This is especially relevant to trials like the notorious STAR*D trial, where the sequenced treatment alternatives magnify that difficulty further, probably to the extent of making a nonsense of the results (see commentary, ‘Stepped trials: magnifying methodological muddles — the supernatant effect’). The difficulty of defining pathologically valid sub-types of depression compounds these problems, because heterogeneity is an ever-present complicating factor; the other commentaries in this series discuss factors that are variably affected by heterogeneity.

Simpson’s paradox

Simpson’s paradox can be described as a result that is present when data are subdivided into groups, but that disappears or reverses when the data are combined. For detailed explanations of Simpson’s paradox see [1-4], and here is a good, straightforward illustration of Simpson’s paradox.

A famous example of Simpson’s paradox is explained in the above link. It describes the controversy concerning the admission of females to an Ivy League university in the USA that, it is stated, feared being sued for not offering equal opportunity. The data appeared to show a bias against women, because only 35% of female applicants were admitted to postgraduate courses whereas 45% of male applicants were admitted. That looks like a clear bias against women, but is it? Most departments claimed this was absurd, because they had a higher acceptance rate for females than for males. The explanation of this apparent contradiction is a quintessential example of Simpson’s paradox, and it is essential to understand it in order to appreciate how RCTs can be seriously misleading. I shall not offer a summary of the argument in the above link here, because the information there is already precise and concise: I cannot succinctly improve on it.
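The arithmetic of such a reversal can be sketched for a two-arm treatment comparison. The counts below are hypothetical, invented for illustration; they are not taken from the linked admissions example or from any trial. The arm and subgroup labels are likewise assumptions, with severity standing in for any relevant variable that was not stratified.

```python
from fractions import Fraction

# Hypothetical (responders, participants) counts, invented for illustration only.
# The arms are deliberately unbalanced on severity, standing in for a relevant
# variable that was not stratified.
COUNTS = {
    "drug": {"mild": (18, 20), "severe": (24, 80)},
    "comparator": {"mild": (64, 80), "severe": (4, 20)},
}


def subgroup_rate(arm, subgroup):
    """Response rate within one subgroup of one arm."""
    responders, participants = COUNTS[arm][subgroup]
    return Fraction(responders, participants)


def pooled_rate(arm):
    """Response rate for one arm with the subgroups combined."""
    responders = sum(r for r, _ in COUNTS[arm].values())
    participants = sum(n for _, n in COUNTS[arm].values())
    return Fraction(responders, participants)


for subgroup in ("mild", "severe"):
    drug = subgroup_rate("drug", subgroup)
    comparator = subgroup_rate("comparator", subgroup)
    print(f"{subgroup}: drug {float(drug):.0%} vs comparator {float(comparator):.0%}")

print(f"pooled: drug {float(pooled_rate('drug')):.0%} "
      f"vs comparator {float(pooled_rate('comparator')):.0%}")
```

With these invented counts the drug has the higher response rate in both subgroups (90% against 80% in mild cases, 30% against 20% in severe cases), yet the lower rate once the subgroups are pooled (42% against 68%). The reversal arises solely because most of the drug arm is severe and most of the comparator arm is mild.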

In summary

An understanding of homogeneity, of the presence of unknown variables, and of the effect of Simpson’s paradox on those factors is necessary for appreciating the pitfalls to which RCTs are susceptible. Causality and mechanisms are essential to understanding these issues; cf. Howick’s extended consideration of Hill’s criteria [5], as well as the discussion about Bayes and Pearl.

Drug comparison trials are performed in vacuo, are insufficiently anchored to underlying scientific knowledge, and are especially prone to generating misdirection (cf. the discussion of this by the philosopher Cartwright [6-8]). It is vital to think carefully about what questions you are asking, and why you are asking them (cf. Tukey). Such questions must be grounded in basic science and take heed of causes and mechanisms. Statistics alone will never suffice (cf. Fisher). Thus, understanding the difference between a Bayesian/Pearl approach and frequentist statistics with P-values gets us to the heart of the epistemological problems in the current concepts and approaches that are integral to the RCT enterprise.

References

  1. Rojanaworarit, C., Misleading Epidemiological and Statistical Evidence in the Presence of Simpson’s Paradox: An Illustrative Study Using Simulated Scenarios of Observational Study Designs. Journal of Medicine and Life, 2020. 13(1): p. 37-44.

  2. von Kügelgen, J., L. Gresele, and B. Schölkopf, Simpson’s Paradox in COVID-19 Case Fatality Rates: A Mediation Analysis of Age-Related Causal Effects. IEEE Transactions on Artificial Intelligence, 2021. 2(1): p. 18-27.

  3. Pearl, J., Comment: Understanding Simpson’s Paradox. The American Statistician, 2014. 68(1): p. 8-13.

  4. Fenton, N., M. Neil, and A. Constantinou, Simpson’s Paradox and the implications for medical trials. arXiv preprint arXiv:1912.01422, 2015.

  5. Howick, J., P. Glasziou, and J.K. Aronson, The evolution of evidence hierarchies: what can Bradford Hill’s ‘guidelines for causation’ contribute? Journal of the Royal Society of Medicine, 2009. 102(5): p. 186-194.

  6. Cartwright, N., Are RCTs the gold standard? BioSocieties, 2007. 2(1): p. 11-20.

  7. Cartwright, N., A philosopher’s view of the long road from RCTs to effectiveness. Lancet, 2011. 377(9775): p. 1400-1.

  8. Cartwright, N., What evidence should guidelines take note of? J Eval Clin Pract, 2018. 24(5): p. 1139-1144.