RCTs, and the meta-analyses that they spawn, do not have the superior epistemic validity, or reliability, or extrapolability, that many suppose them to have, even if they are carried out to exacting and ideal designs. Many are significantly flawed, and ghost writing, deceit, and bias are inadequately accounted for. The prominence, even predominance, of meta-analysis in the landscape of clinical decision-making is such that there is a methodological monoculture. Other methodologies and clinical experience have been mistakenly relegated to obscurity. The resultant pronouncements assume a dictatorial attire: they become a blinkered and inflexible approach to therapeutics, which stifles innovation. Innovation is the lifeblood of good science.
This ‘Lancet Cipriani 21 antidepressants meta-analysis’ paper, published recently (Feb 2018) in the Lancet (1), is yet another ‘meta-analysis (MA)’, adding to the 200 or so existing MA studies about AD drugs (2). Professor Ioannidis (one of the authors) recently published ‘The Mass Production of Redundant, Misleading, and Conflicted Systematic Reviews and Meta-analyses’. Does this one, perhaps the most thorough and prestigious thus far, usefully advance the cause? It will become a much-cited work, so it is important for it to be seen in a meaningful clinical treatment perspective.
However, I predict the caveats and limitations rightly expressed by the authors about the validity and generalisability of it will soon, as usual, be forgotten in the oversimplified implementation resulting from the insufficiently informed or critical use of ‘guidelines’.
We must remember that the trials considered do not include those with suicidal depression (perhaps the most important group to treat effectively), the young, the old, those with medical comorbidities, those on other medications, and that large group who fail to achieve remission (~70%) with the first-offered treatment (often an SSRI).
In effect, this means that such analyses are germane to only around 10% of ‘real-world’ patients who are considered for antidepressant drug treatment, and none of those seen in specialist practice.
This paper clearly entailed a great deal of work, even if there are uncertainties concerning its strength and usefulness. These uncertainties are highlighted in this commentary.
One can only express admiration for the industry and application of those concerned in researching and writing this paper: the following comments are not intended to be dismissive or disrespectful of these eminent researchers.
One of my recent commentaries is a detailed discussion of the many problems with guidelines, which emanate largely from over-reliance on meta-analysis and, therefore implicitly, on randomised controlled trials.
This is a complex issue (not amenable to a ‘postage-stamp’ précis) and those ‘coming to it’, with less knowledge and experience of science, will benefit from appreciating that there are extensive, serious, and relevant problems, in the practice and publishing of medical research, only some of which are mentioned here.
Some of these are expanded-on in my commentary ‘Guidelines: problems aplenty’ and in other commentaries in the menu heading above ‘Bias in science’. These cover the considerable complications caused by scientific fraud, ghost-writing, hiding and distorting of ‘raw’ [un-coded] patient data, and more.
That commentary, ‘Guidelines: problems aplenty’, extensively cited professor Ioannidis (one of the authors of this paper). I cannot immediately think of many other researchers whose work I admire and respect more, nor ones whose work I have cited more frequently. I wonder how he will feel about his contribution to this paper in years to come?
Hackneyed as this old computer programmer’s phrase may be, it is obligatory to start by repeating it: “garbage in, garbage out”. In layman’s language, you cannot make a silk purse out of a sow’s ear, nor build a castle on sand.
“On two occasions I have been asked, ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”
Charles Babbage, Passages from the Life of a Philosopher
What exactly is being assessed? How is it assessed? Who assessed it? Are they competent, honest, and trustworthy? Can we see the original [un-coded] data? Sadly, the answers to those questions are often ‘no’, or ‘we cannot be sure’.
Science requires scrupulous honesty, objectivity, accuracy, and reproducibility: otherwise, it is just ‘a castle built on sand’.
These and other questions are key issues to be dealt with before even the most tentative treatment recommendations can be considered based on such meta-analyses.
Fundamental is the question of whether all these drugs are meaningfully assigned the epithet of ‘antidepressant’, and then how well do they actually work, and on what symptoms, and finally their effect on long-term course and illness outcome. I will not even mention the failure to assess or discuss long-term adverse effects, or we will be here all day.
First, the average improvement in the Hamilton rating scale for depression (HAM-D) in these trials is small, only a few points out of fifty (roughly a 10% reduction in the ‘score’). Most patients come nowhere near getting completely better. Yet, to hear the talk about these results, one could be excused for thinking most patients were getting ‘better’.
If one were paying for such treatment one would be demanding one’s money back.
The HAM-D does not adequately assess the central symptoms of the illness, which are anergia and anhedonia. This key question of defining ‘biological’ depression is covered in various of my other commentaries, especially in relation to professor Parker’s work on ‘CORE’ symptoms (3-6).
The [Hamilton] rating-scale question is discussed in my above-mentioned commentary on guidelines. To compound the poor rating-scale problem, especially the significant influence of ‘sedative’ effects on scores (independent of any AD effect), there is a minimal assessment of functional capacity related to anergia [and anhedonia] — the assessment of social, work, and leisure activities, as a corroborating measure of lack of motivation and drive. These features are fundamental to understanding and assessing depression, and yet they are relatively poorly assessed and rated, or not assessed at all, by the rating scales on which these RCTs & meta-analyses are predicated (7-9).
Look at this online version of the HAM-D to see what I mean. Qs 4, 9, 10, 11 & 12 might all be improved by any anxiolytic/sedative: a one-gradation change in each of those produces a 5-point improvement in the score, more than double that needed to get a drug approved by the FDA as an AD.
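The arithmetic behind that point can be made explicit. The following is only a sketch: the item numbers are those named above, the one-gradation change per item is the hypothetical scenario just described, and the baseline of 25·7 is the mean HAM-D-17 score reported by Cipriani et al. (quoted under ‘Key points’ below). The function and variable names are illustrative labels, not anything taken from the paper.

```python
# Illustrative arithmetic only: how far a purely sedative/anxiolytic effect
# could move a HAM-D total, under the scenario described in the text.

# HAM-D items named in the text as likely to respond to any sedative/anxiolytic.
SEDATION_SENSITIVE_ITEMS = [4, 9, 10, 11, 12]

# Mean baseline HAM-D-17 score reported in the Cipriani paper.
MEAN_BASELINE_SCORE = 25.7


def total_shift(items, gradations_per_item=1):
    """Points removed from the total if each listed item improves by the given gradations."""
    return len(items) * gradations_per_item


shift = total_shift(SEDATION_SENSITIVE_ITEMS)
share_of_baseline = shift / MEAN_BASELINE_SCORE

print(f"Shift in total score: {shift} points")
print(f"As a share of the mean baseline score: {share_of_baseline:.0%}")
```

On these assumptions, a non-specific calming effect alone would account for roughly a fifth of the mean baseline score, without touching anergia or anhedonia.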
The question of inter-rater reliability is not even mentioned any more (9).
Cipriani et al. specifically state ‘we were not able to quantify some outcomes, such as global functioning’.
I do not intend to sound scathing, but not assessing anergia or functioning more fully is like not asking a patient with anaemia how many stairs they can manage without stopping to regain their breath, or whether they are back at work. Such omissions indicate poor clinical assessment skills.
And, the very use of the word antidepressant is a misrepresentation. You would not call a drug an antibiotic if it only slightly slowed down the growth of bacteria, without killing them. To label drugs as antidepressants when many of them merely produce a small change in symptoms, which are not even central to what we presume is the core of the illness, is an assumptive misconception and misrepresentation.
And, for those who favour using Bayesian reasoning to advance hypotheses, there is a simple, obvious, unaddressed, pharmacological contradiction that demands adequate explanation. The frequently unjustified epithet of ‘antidepressant’ glosses over the fact that various of these drugs work, or do not work, in different ways, and have different effects on neuro-transmission.
To suppose, prima facie, that ‘antidepressants’ working via different mechanisms and neurotransmitters are equally effective is implausible and without substantive justification.
Therefore, a methodology that appears to show that such is the case is inevitably suspect. Note also that various of these drugs are hardly used in clinical practice (e.g. moclobemide, despite its having the most benign SE profile, trazodone, and reboxetine), suggesting clinicians have given up on them. I certainly gave up on those three drugs pretty quickly; I regard moclobemide (and mirtazapine) as an ‘active placebo’.
Incidentally, what happened to moclobemide? It is absent from the Cipriani MS.
Questions a Bayesian would wish to address are various, prominent among them being: why is there an [apparent] significant difference between different SSRIs when they are all ‘known’ to be equally effective in that role? That applies especially to citalopram and escitalopram (its isomer), and to desvenlafaxine and venlafaxine. Bupropion (a pro-drug) has hydroxy-bupropion as the dominant active molecule, yet when hydroxy-bupropion was tested it failed, and was successfully buried (it was called Radafaxine, but you will not have even heard of it). Why is the anti-histamine mirtazapine (aka 6-aza-mianserin) [apparently] so good, but not its pharmacologically identical brother, mianserin (10)? The mirtazapine vs. mianserin mystery is yet another forgotten issue concerning deceitful data. What about the huge sibutramine AD trial? It was only ever reported as a (very brief) conference abstract. Sibutramine was the first ‘dual-action’ AD drug (again ‘buried’). Then there is the initial tranche of trials on duloxetine, which all failed, leading to it being ‘shelved’ (it is hard to see from the data presented whether the authors ‘found’ those older studies), although it was resurrected later.
These anomalies require explanations and cast a shadow over the validity of MA results. A parsimonious initial supposition must be that some/many of them do not work as true ADs at all [except via non-depression-specific mechanisms, such as sedation/anxiolysis].
And to discuss extrapolating the putative optimal AD agent for the general depressed population from the results of such meta-analyses, as this paper and the discussion around it appear to do despite the caveats given (11), is ‘heroic’. It seems to ignore other methodologies and decades of clinical experience, and thereby to assume or insinuate that they are of lesser value.
Evidently, an emphatic reminder that there is no epistemological justification for assigning qualitatively different validity to those two domains of evidence is required.
Indeed, an exaggerated view of the epistemic virtues of RCTs currently dominates thinking (12, 13) — Sir Austin Bradford Hill himself made a point of endorsing Claude Bernard’s view that there is ‘no qualitative epistemic difference between experiment [RCTs] and observation’ [clinical experience].
Hill also opined: ‘you need neither randomisation, nor statistics, to analyse the results, unless the treatment effect is very small (14)’.
And, for most patients, it is small.
And the issues of mis-coding data and hiding the ‘raw’ data (something Healy, Gotzsche, and others have written about (15, 16)), and of fraud and deceit, are not even mentioned or addressed (see also my commentary about the forgotten story of deceitful data presented on mirtazapine). There is no doubt that doctors are naive about these matters and tend to avoid calling out scientific misconduct even if they do recognise it.
The reality that fraud and deceit issues do seriously affect all meta-analyses receives insufficient consideration, for many different reasons.
Indeed, I wrote to one of the authors of this study (Leucht) about his previous meta-analysis of antipsychotic drugs (17), which failed to mention the seminal paper by Huston and Moher (18) detailing the blatant deceit involved in the work published about risperidone, even though Leucht was specifically discussing the question of bias (and cf. mirtazapine). Incidentally, Leucht stated in reply to me ‘they did not know about Huston’: evidently there are other things they do not know about.
It is concerning, but inevitable, that prestigious publications such as this will tend to dictate the kinds of treatment that ordinary doctors will use. This [excessive reliance on guidelines] has already produced a blinkered and stultified approach to antidepressant treatment, which virtually excludes, inter alia, the use of drugs like tranylcypromine and clomipramine. I know few experienced psycho-pharmacologists who do not agree that clomipramine is superior to amitriptyline.
In my career as a psycho-pharmacologist I have seen hundreds of patients who have alternated between amitriptyline and clomipramine. I cannot remember many who responded better on amitriptyline, but I would estimate that 19/20 would have stated unequivocally that they were better on clomipramine, having had only a partial response to amitriptyline (the difference being sufficiently great that any increased burden of side-effects was usually accepted with little complaint). That methodology, the A-B-A trial, is a powerful way of comparing drugs, especially when subjects are already known to suffer from a ‘biological’ depression by virtue of previous response to an established antidepressant (or ECT): they represent a ‘better’ and ‘purer’ (more homogeneous) sample.
And that brings us to another glossed-over but vital question. It is inevitable that the samples in these trials, frequently done in outpatients, include a significant proportion of people who are never going to respond to ADs because they do not have a ‘biological’ depression. It is a frequently forgotten fact that randomised controlled trials rest, for their methodological validity, on the presupposition that the sample in question is homogeneous. When that is palpably not the case the power of the RCT is much reduced, some would say invalidated, even more so when the treatment effect is small (cf. Hill).
And, these trials of 8 weeks duration are assessed with only subjective interim surrogate outcome measures (rating-scales).
Interim surrogate outcome measures have various major limitations, especially for predicting long-term treatment of chronic illnesses (cf. lithium, not a potent AD, but reduces suicide more than ‘ADs’ (19)).
Furthermore, concerning interim surrogate outcome measures, cf. the story of anti-hypertensive drugs, where early reliance on short-term BP reduction failed to show various nuances of outcome, such as the superior benefit of ACE-inhibitors on preserving kidney function (20, 21).
And tranylcypromine is not even mentioned (nor moclobemide, which some might consider to be an ‘active placebo’). There are of course few trials in the modern era that used it. Why is this? At least partly because the RCTs & meta-analyses that feed guidelines do not mention it, so nobody uses it, and nobody does trials with it. This is the classic example of my point above about how guidelines foster a circular, self-fulfilling, restricted and blinkered approach to the subject.
As the Australian professor Gordon Parker has said, there are major limitations to ‘level 1’ evidence derived from RCTs, which are no longer producing meaningful clinical results (22).
No longer? Did they ever?
Remember, the recognition of the undoubted powerful antidepressant effect of amitriptyline, clomipramine, and tranylcypromine owes nothing to double-blind trials. Nor did the discovery of penicillin, or the effectiveness of a host of other general medical drugs. Remember what Sir Austin Bradford Hill said about statistics being unnecessary if the effect was obvious.
Meta-analysis amplifies all these limitations into serious problems and mis-directions, made worse still because the prevailing ethos in the profession strongly discourages ‘non-guideline’ treatment.
The initial reaction to this paper from ‘the profession’ has been laudatory and uncritical. The quality and trustworthiness of the data, and the techniques and methodology, simply do not bear the degree of interpretation and extrapolation that they are being subjected to.
Attempting to make fine distinctions between drugs, when none of them are much better than placebo, is absurd and unscientific; this is even more so because ‘depression’ is not a homogeneous entity.
The very fact that such intricate manoeuvres are required to show these minor benefits is itself proof that their effect is minimal – it really is that simple.
If they had a good effect, and produced remission (wellness) in a majority of patients, we would not be occupying our time with this fruitless nit-picking and bickering.
Worse still, these ‘results’ (which are neither new nor definitive) will be accorded a degree of reliability and authority they in no way deserve. The lazy and unthinking will justify their practice by stating they ‘followed the guidelines’. The policy-makers and providers will decline to make nortriptyline, clomipramine, tranylcypromine etc. available.
Patients will suffer as a result.
One has to wonder if the authors of this study will remain comfortable with those consequences. They have assumed an onerous burden of ‘prophecy’ and history may not judge them kindly.
My parting suggestion is that some researchers get organised and produce a trial demonstrating that tranylcypromine treats psychotic depression, even when ECT has failed. The irony is, of course, that such a trial is unnecessary because the effect is so decisive that randomisation, blinding, assessment with rating scales, statistical analysis etc. are all unnecessary. Remember Sir Austin Bradford Hill?
Even good science does not change many people’s minds, and so I offer this poignant story about a professor of psychiatry with psychotic depression who was made well, cured, restored to full functioning, by tranylcypromine.
Key points from Cipriani paper
The Cipriani analysis was based on 522 double-blind studies (116 477 patients) involving 21 antidepressants (maximum 8 weeks duration).
The 21 antidepressants
agomelatine, bupropion, citalopram, desvenlafaxine, duloxetine, escitalopram, fluoxetine, fluvoxamine, levomilnacipran, milnacipran, mirtazapine, paroxetine, reboxetine, sertraline, venlafaxine, vilazodone, vortioxetine, amitriptyline, clomipramine, trazodone, and nefazodone [but not, oddly, moclobemide, imipramine, or nortriptyline].
‘Our assessment overall found few differences between antidepressants when all data were considered’.
‘All antidepressants were [a tiny bit] more effective than placebo, with ORs ranging between 2·13 (95% credible interval [CrI] 1·89–2·41) for amitriptyline and 1·37 (1·16–1·63) for reboxetine.’
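To see what odds ratios of this size mean in absolute terms, they can be converted into response rates. The sketch below does this for the two extremes quoted above. The placebo response rate of 35% is an assumption chosen purely for illustration (it is not a figure from the paper), and the function name is hypothetical.

```python
# Convert an odds ratio into an absolute response rate, given a placebo rate.
# ASSUMPTION: the 35% placebo response rate is illustrative, not from the paper.

ASSUMED_PLACEBO_RESPONSE = 0.35


def response_rate_from_odds_ratio(odds_ratio, placebo_rate):
    """Return the drug response rate implied by an odds ratio versus placebo."""
    placebo_odds = placebo_rate / (1 - placebo_rate)
    drug_odds = odds_ratio * placebo_odds
    return drug_odds / (1 + drug_odds)


# Odds ratios quoted from the Cipriani paper.
quoted_odds_ratios = {"amitriptyline": 2.13, "reboxetine": 1.37}

for drug, odds_ratio in quoted_odds_ratios.items():
    rate = response_rate_from_odds_ratio(odds_ratio, ASSUMED_PLACEBO_RESPONSE)
    difference = rate - ASSUMED_PLACEBO_RESPONSE
    print(f"{drug}: response {rate:.0%}, {difference:.0%} above the assumed placebo rate")
```

A different assumed placebo rate gives different absolute figures, so the output is a guide to scale only, and ‘response’ in these trials is itself defined by the rating scales criticised above.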
‘In our analyses, funding by industry was not associated with substantial differences in terms of response or dropout rates. However, non-industry funded trials were few and many trials did not report or disclose any funding.’
‘We did not cover important clinical issues that might inform treatment decision making in routine clinical practice (eg, specific adverse events, withdrawal symptoms, or combination with non-pharmacological treatments). Additionally, because of the paucity of information reported in the original studies, we were not able to quantify some outcomes, such as global functioning. It should also be noted that some of the adverse effects of antidepressants occur over a prolonged period, meaning that positive results need to be taken with great caution, because the trials in this network meta-analysis were of short duration.’
‘Given the modest effect sizes, non-response to antidepressants will occur. Our information unfortunately cannot guide next-step choices after failure of such a first step (ie, they do not apply to treatment resistant depression), for which well performed trials are scarce.’
‘Notwithstanding these limitations, the findings from this network meta-analysis represent the most comprehensive currently available evidence base to guide the initial choice about pharmacological treatment for acute major depressive disorder in adults. All statements comparing the merits of one antidepressant with another must be tempered by the potential limitations of the methodology, the complexity of specific patient populations, and the uncertainties that might result from choice of dose or treatment setting. We hope that these results will assist in shared decision making between patients, carers, and their clinicians.’
‘In total, 87 052 participants were randomly assigned to an active drug and 29 425 were randomly assigned to placebo. The mean age was 44 years (SD 9) for both men and women; 38 404 (62·3%) of 61 681 of the sample population were women. The median duration of the acute treatment was 8 weeks (IQR 6–8). 243 (47%) of 522 studies randomly assigned participants to three or more groups, and 304 (58%) of 522 were placebo controlled trials. 391 (83%) of 472 were multi-centre studies and 335 (77%) of 437 studies recruited outpatients only. 252 (48%) of 522 trials recruited patients from North America, 37 (7%) from Asia, and 140 (27%) from Europe (59 [11%] trials were cross-continental and the remaining 34 [7%] were either from other regions or did not specify). The great majority of patients had moderate-to-severe major depressive disorder, with a mean reported baseline severity score on the Hamilton Depression Rating Scale 17-item of 25·7 (SD 3·97) among 464 (89%) of 522 studies.’
‘The current report summarises evidence of differences between antidepressants when prescribed as an initial treatment. Given the modest effect sizes, non-response to antidepressants will occur. Our information unfortunately cannot guide next-step choices after failure of such a first step (ie, they do not apply to treatment-resistant depression), …’
I confidently predict that the cautions and caveats in the section I have placed in bold above, which are correct and appropriate, will be lost in the implementation. This is what frequently happens with diagnostic guidelines like the DSM, and with treatment guidelines: they become diktats, rather than advice to be implemented with caution, judgement, and consideration of individual circumstances.
Formula 1 Anti-Depressant starting grids: 2009 vs 2018
[Or the ‘Cipriani stakes’ for fillies?]
NB This is rather esoteric humour
Front row of the grid
2009 mirtazapine, escitalopram, sertraline, venlafaxine
2018 mirtazapine, escitalopram, sertraline, paroxetine, agomelatine
So, my bête noire, venlafaxine, has suffered ‘dual-traction’ issues in the rear differential and slipped off the front row of the grid! Paroxetine makes a comeback after several poor seasons (although poor hydraulic pressure in the central ram remains a concern over the full race distance). Several teams have appeals for using illegal slicks in qualifying before the stewards. Team Cipramil mysteriously failed to qualify (there are rumours about sandbagging and patents). Team Moclobemide appear to have lost all fans and all sponsors after the revelation that the engine supplier grossly exaggerated the power output. Team Clomipramine just never got noticed in F1 circles, after starting on the wrong test-circuit, where they just went round-and-round for days and couldn’t stop themselves, despite scrubbing the tyres down to the canvas. When they did turn up they had no fan base, and were overlooked, despite having the most powerful engine. Team ‘Tranyl-Tyrrell’, with their 6-wheel car, were disqualified on technical grounds long ago for needing special fuel and using too high turbo-boost pressures, but they are still consistently winning on the alternative ‘non-F1’ circuit, due, some say, to their inherently superior design and performance.
1. Cipriani, A, Furukawa, TA, and Salanti, G, Comparative efficacy and acceptability of 21 antidepressant drugs for the acute treatment of adults with major depressive disorder: a systematic review and network meta-analysis. Lancet, 2018: p. http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32802-7/fulltext.
2. Ioannidis, JP, The Mass Production of Redundant, Misleading, and Conflicted Systematic Reviews and Meta-analyses. Milbank Q., 2016. 94(3): p. 485-514.
3. Parker, G, Defining melancholia: the primacy of psychomotor disturbance. Acta Psychiatr Scand Suppl, 2007(433): p. 21-30.
4. Parker, G and McCraw, S, The properties and utility of the CORE measure of melancholia. J Affect Disord, 2017. 207: p. 128-135.
5. Parker, G, Roy, K, Hadzi-Pavlovic, D, Mitchell, P, et al., Subtyping depression by clinical features: the Australasian database. Acta Psychiatr. Scand., 2000. 101(1): p. 21-8.
6. Soria, V, Vives, M, Martinez-Amoros, E, Galvez, V, et al., The CORE system for sub-typing melancholic depression: Adaptation and psychometric properties of the Spanish version. Psychiatry Res., 2016. 239: p. 179-83.
7. Trajković, G, Starčević, V, Latas, M, Leštarević, M, et al., Reliability of the Hamilton Rating Scale for Depression: A meta-analysis over a period of 49 years. Psychiatry Res., 2011. 189: p. 1-9.
8. Tabuse, H, Kalali, A, Azuma, H, Ozaki, N, et al., The new GRID Hamilton Rating Scale for Depression demonstrates excellent inter-rater reliability for inexperienced and experienced raters before and after training. Psychiatry Res., 2007. 153(1): p. 61-7.
9. Bagby, RM, Ryder, AG, Schuller, DR, and Marshall, MB, The Hamilton Depression Rating Scale: has the gold standard become a lead weight? Am J Psychiatry, 2004. 161(12): p. 2163-77.
10. Gillman, PK, A systematic review of the serotonergic effects of mirtazapine: implications for its dual action status. Hum Psychopharmacol, 2006. 21(2): p. 117-25.
11. Parikh, S and Kennedy, S, More data, more answers: picking the optimal antidepressant. Lancet, 2018: p. http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(18)30421-5/fulltext.
12. Healy, D, Trussed in evidence? Ambiguities at the interface between clinical evidence and clinical practice. Transcultural psychiatry, 2009. 46(1): p. 16-37.
13. Worrall, J, Causality in medicine: getting back to the Hill top. Prev. Med., 2011. 53(4-5): p. 235-8.
14. Hill, AB, The Environment and Disease: Association or Causation? Proc. R. Soc. Med., 1965. 58: p. 295-300.
15. Le Noury, J, Nardo, JM, Healy, D, Jureidini, J, et al., Restoring Study 329: efficacy and harms of paroxetine and imipramine in treatment of major depression in adolescence. BMJ, 2015. 351: p. h4320.
16. Schroll, JB, Penninga, EI, and Gotzsche, PC, Assessment of Adverse Events in Protocols, Clinical Study Reports, and Published Papers of Trials of Orlistat: A Document Analysis. PLoS Med, 2016. 13(8): p. e1002101.
17. Leucht, S, Tardy, M, Komossa, K, Heres, S, et al., Antipsychotic drugs versus placebo for relapse prevention in schizophrenia: a systematic review and meta-analysis. Lancet, 2012. 379: p. 2063-2071.
18. Huston, P and Moher, D, Redundancy, disaggregation, and the integrity of medical research. Lancet, 1996. 347(9007): p. 1024-6.
19. Young, AH, Review: lithium reduces the risk of suicide compared with placebo in people with depression and bipolar disorder. Evid Based Ment Health, 2013. 16(4): p. 112.
20. Williams, B, Recent hypertension trials: implications and controversies. J. Am. Coll. Cardiol., 2005. 45(6): p. 813-27.
21. Wu, HY, Huang, JW, Lin, HJ, Liao, WC, et al., Comparative effectiveness of renin-angiotensin system blockers and other antihypertensive drugs in patients with diabetes: systematic review and bayesian network meta-analysis. BMJ, 2013. 347: p. f6008.
22. Parker, G, Evaluating treatments for the mood disorders: time for the evidence to get real. Aust NZ J Psychiatry, 2004. 38(6): p. 408-14.
– Dr Ken Gillman