THE CREDENTIALS OF SCIENTIFIC EVIDENCE: REBALANCING THE EPISTEMOLOGICAL SCALES
9.2.4 Rating scales: Proxy measures, subjectiveness, and inter-rater reliability
Abstract
This commentary explores the scientific inadequacy of the rating scales widely used in psychiatric clinical trials, such as the Hamilton Rating Scale for Depression (HRSD). It argues that these tools are subjective ‘proxy’ measures that fail to reflect true illness pathology or long-term outcomes. It highlights the ‘crisis’ of inter-rater reliability, noting that rigorous testing of rater consistency has been largely abandoned, leading to measurement errors that exceed the effect sizes of the drugs being tested. Furthermore, it criticizes the reliance on unqualified raters employed by Contract Research Organizations (CROs). To remedy these flaws, the paper advocates for replacing subjective proxies with ‘hard’ objective outcomes — such as suicide rates and employment status. Using clomipramine as an example, the paper demonstrates how objective data can suggest superior efficacy that is missed by rating scales.
Introduction
Those performing clinical trials appear to have become inured to the fact that the assessment measures they use (HRSD, MADRS, etc.) are not objective measures of change in illness pathology or of outcome. They are proxies (surrogates), and short-term proxies at that. Rating scales are assumed to reflect long-term improvement and long-term treatment benefit in the illness, but that assumption is not well substantiated. These short-term proxies are not accurate or reliable predictors, and using them (almost) exclusively, as psychiatry does, is fraught with uncertainty [1].
Examined through the lens of scientific methodology, rating scales barely qualify as competent science: they are poorly designed and badly used.
Inter-rater reliability
The rating scales used are subjective measures with sometimes low inter-rater reliability. Rigorous inter-rater reliability studies have largely been abandoned, in large part because they are costly and time-consuming. Berendsen [2] found that barely 5% of studies reported an inter-rater reliability (IRR) coefficient. As someone who took part in training with the ‘Present State Examination’ in the 1970s, I can attest that even psychiatrists trained in these techniques can produce surprisingly divergent scores on rating scales. Many raters in CRO-managed studies are not qualified psychiatrists, and their scores may be more divergent still. Berendsen urged researchers to ‘conduct and report training procedures and reliability estimations’, but that is closing the stable door after the horse has bolted.
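For readers unfamiliar with what an IRR coefficient involves, the following is a minimal sketch of one common choice, the two-way random-effects intraclass correlation ICC(2,1). It is an assumption that this is a relevant coefficient; the text does not say which one the reviewed studies reported. The scores are invented purely for illustration and are not trial data.

```python
from statistics import mean

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one row per patient, one column per rater.
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of raters
    grand = mean([x for row in scores for x in row])
    row_means = [mean(row) for row in scores]
    col_means = [mean([row[j] for row in scores]) for j in range(k)]

    # Mean squares for patients (rows), raters (columns) and residual error.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sse = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# HYPOTHETICAL total scores for four patients, each rated by two raters.
hypothetical_scores = [[10, 13], [15, 18], [20, 23], [25, 28]]

icc = icc_2_1(hypothetical_scores)
mean_abs_diff = mean([abs(a - b) for a, b in hypothetical_scores])
print(f"ICC(2,1) = {icc:.2f}, mean absolute difference = {mean_abs_diff} points")
```

In this invented example the second rater scores every patient 3 points higher, yet the coefficient is still about 0.90. A respectable-looking coefficient can therefore coexist with a between-rater offset of several points, which is why reporting both the coefficient and the raw score differences is informative.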
The IRRs in most studies are sure to be in the same range as the 3-4 point difference in HRSD scores that is typically found [3-6].
A cursory inspection of the Hamilton rating scale will immediately illustrate that minor differences in the rating of items for sleep, appetite, and anxiety can produce a greater change in score (3 points on the HRSD) than the degree of difference needed to get a drug approved by the FDA as an antidepressant (AD).
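The arithmetic behind that observation can be sketched as follows. The item names and maximum scores follow the usual 17-item HRSD layout (insomnia and appetite items scored 0-2, anxiety items 0-4); the two sets of ratings are hypothetical and chosen only to show how one-step disagreements add up.

```python
# Maximum score per item on the sleep, appetite and anxiety items of the
# 17-item HRSD (insomnia and gastrointestinal items 0-2, anxiety items 0-4).
ITEM_MAX = {
    "insomnia_early": 2,
    "insomnia_middle": 2,
    "insomnia_late": 2,
    "appetite": 2,
    "anxiety_psychic": 4,
    "anxiety_somatic": 4,
}

# HYPOTHETICAL ratings of the same interview by two raters (not real data).
rater_a = {"insomnia_early": 1, "insomnia_middle": 1, "insomnia_late": 0,
           "appetite": 0, "anxiety_psychic": 2, "anxiety_somatic": 1}
rater_b = {"insomnia_early": 2, "insomnia_middle": 1, "insomnia_late": 0,
           "appetite": 1, "anxiety_psychic": 3, "anxiety_somatic": 1}

# Sanity check: every rating stays within the allowed range for its item.
for ratings in (rater_a, rater_b):
    assert all(0 <= ratings[item] <= ITEM_MAX[item] for item in ITEM_MAX)

# Items on which the raters disagree, and by how many steps.
disagreements = {
    item: rater_b[item] - rater_a[item]
    for item in ITEM_MAX
    if rater_a[item] != rater_b[item]
}
total_difference = sum(rater_b.values()) - sum(rater_a.values())

print(disagreements)
print(f"Difference in subtotal: {total_difference} points")
```

Here three single-step disagreements (early insomnia, appetite, psychic anxiety) shift the subtotal by 3 points, the same magnitude the text quotes.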
A recent analysis of the HRSD by Byrne [4] reached a harsh conclusion:
Neither full nor abbreviated HRSD models are suitable for use in clinical trial settings and the HRSD’s status as the gold standard should be reconsidered
On such a scale, sedative effects alone can qualify a drug as an antidepressant, cf. [7]. Quetiapine (low dose) and doxepin are good examples, both being selective H1 antagonists (producing increased appetite, improved sleep and reduced anxiety).
Long-term outcome: An antidote to proxy measures
Few studies look at long-term outcome measures, such as suicide rate, employment status, and other practical ‘objective’ measures of real-world outcome. Long-term objective outcome measures are an important part of the evidence that allows us to be confident that different treatments are, or are not, efficacious. Hence, Hengartner [8] concluded:
… researchers should use hard (objective) real-world outcomes on which the sample was initially not selected for, such as, for instance, employment rates or receipt of disability benefits
Relevant to the above is an important but little-discussed observation that has been lurking in the research literature for decades: the data strongly demonstrate that clomipramine (CMI) has a lower number of deaths per million scripts issued (~10 per million vs ~40-50 for other TCAs), i.e. around five times lower than any of the other tricyclics [9-12]. CMI's LD50 (in rats) is similar to that of the other TCAs, and in those who do take an overdose of CMI it is equally toxic, i.e., the case fatality (the ratio of deaths to self-poisonings) is about the same as for other TCAs [13]. So why are there fewer deaths per million scripts?
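One way to frame that question, offered only as a rough sketch: deaths per million scripts can be treated as the product of overdoses per million scripts and case fatality. This decomposition is a simplification introduced here, not a model taken from the cited papers. The death rates are the approximate figures quoted above; the case fatality value is a hypothetical placeholder, and the resulting ratio does not depend on it as long as it is the same for both groups.

```python
# Approximate figures quoted in the text: deaths per million scripts.
CMI_DEATHS_PER_MILLION = 10.0
OTHER_TCA_DEATHS_PER_MILLION = (40.0, 50.0)  # quoted range for other TCAs

# ASSUMPTION: a made-up case fatality (deaths per self-poisoning), taken to
# be equal for CMI and the other TCAs, as the text reports from [13].
ASSUMED_CASE_FATALITY = 0.02

def implied_overdoses_per_million(deaths_per_million, case_fatality):
    """Overdoses per million scripts implied by a death rate and case fatality."""
    return deaths_per_million / case_fatality

cmi_overdoses = implied_overdoses_per_million(
    CMI_DEATHS_PER_MILLION, ASSUMED_CASE_FATALITY
)

# How many times more overdoses per script the other TCAs would need to have.
overdose_ratios = [
    implied_overdoses_per_million(d, ASSUMED_CASE_FATALITY) / cmi_overdoses
    for d in OTHER_TCA_DEATHS_PER_MILLION
]
print([round(r, 1) for r in overdose_ratios])
```

Under these assumptions, equal case fatality means the whole difference in deaths per million scripts has to come from the number of overdoses per script, which would be four to five times lower for CMI.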
Conclusion
What is the most parsimonious interpretation of those data? A possible answer is that clomipramine is more effective at treating depression and preventing suicide (like lithium [14, 15]). Death by suicide is a definitive objective measure of the long-term outcome of depression treatment. Therefore, a substantially reduced death rate compared to other ADs must be considered strong prima facie evidence of greater antidepressant effectiveness, a notion echoed by a more recent review [16].
Lithium and clomipramine may be the only two drugs that have been demonstrated to reduce the suicide rate in patients with depression
References
1. Gotzsche, P.C., et al., Beware of surrogate outcome measures. International Journal of Technology Assessment in Health Care, 1996. 12(2): p. 238-46.
2. Berendsen, S., et al., An old but still burning problem: Inter-rater reliability in clinical trials with antidepressant medication. J Affect Disord, 2020. 276: p. 748-751.
3. Morriss, R., et al., Inter-rater reliability of the Hamilton Depression Rating Scale as a diagnostic and outcome measure of depression in primary care. Journal of Affective Disorders, 2008. 111(2-3): p. 204-213.
4. Byrne, D., et al., Evaluating the psychometric structure of the Hamilton Rating Scale for Depression pre- and post-treatment in antidepressant randomised trials: Secondary analysis of 6843 individual participants from 20 trials. Psychiatry Res, 2024. 339: p. 116057.
5. Rohan, K.J., et al., A protocol for the Hamilton Rating Scale for Depression: Item scoring rules, rater training, and outcome accuracy with data on its application in a clinical trial. J Affect Disord, 2016. 200: p. 111-8.
6. Bagby, R.M., et al., The Hamilton Depression Rating Scale: has the gold standard become a lead weight? Am J Psychiatry, 2004. 161(12): p. 2163-77.
7. Thase, M.E., et al., Efficacy of quetiapine monotherapy in bipolar I and II depression: a double-blind, placebo-controlled study (the BOLDER II study). J Clin Psychopharmacol, 2006. 26(6): p. 600-9.
8. Hengartner, M.P., Is there a genuine placebo effect in acute depression treatments? A reassessment of regression to the mean and spontaneous remission. BMJ Evidence-Based Medicine, 2020. 25(2): p. 46-48.
9. Farmer, R.D.T. and R.M. Pinder, Why do fatal overdose rates vary between antidepressants? Acta Psychiatrica Scandinavica, 1989. 80(S354): p. 25-35.
10. Buckley, N., Fatal toxicity of serotoninergic and other antidepressant drugs: analysis of United Kingdom mortality data. British Medical Journal, 2002. 325: p. 1332-1333.
11. Henry, J.A., C.A. Alexander, and E.K. Sener, Relative mortality from overdose of antidepressants. British Medical Journal, 1995. 310: p. 221-224.
12. Buckley, N.A. and P.R. McManus, Can the fatal toxicity of antidepressant drugs be predicted with pharmacological and toxicological data. Drug Safety, 1998. 18: p. 369-381.
13. Hawton, K., et al., Toxicity of antidepressants: rates of suicide relative to prescribing and non-fatal overdose. British Journal of Psychiatry, 2010. 196(5): p. 354-358.
14. Cipriani, A., et al., Lithium in the prevention of suicide in mood disorders: updated systematic review and meta-analysis. BMJ, 2013. 346: p. f3646.
15. Fitzgerald, C., et al., Effectiveness of medical treatment for bipolar disorder regarding suicide, self-harm and psychiatric hospital admission: between- and within-individual study on Danish national data. Br J Psychiatry, 2022: p. 1-9.
16. Taylor, D., S. Poulou, and I. Clark, The cardiovascular safety of tricyclic antidepressants in overdose and in clinical use. Therapeutic Advances in Psychopharmacology, 2024. 14: https://journals.sagepub.com/doi/epub/10.1177/20451253241243297.
