Which depression measure is best?

A new paper published today in Lancet Psychiatry, led by Christopher Veal, reports findings from a systematic review of 450 clinical trials for unipolar and bipolar depression. Our results can be seen as trying to answer one of the oldest questions in the field of depression measurement: which of the over 280 measures is the best?

I want to thank the principal investigator Dr. Astrid Chevance for the fantastic work, and all the researchers and folks with lived experiences who helped with the study. Below, I briefly provide the rationale for the study, and then summarize the results.

1. Background

Depression affects millions globally and is one of the leading causes of disability. The efficacy of interventions like psycho- and pharmacotherapies is typically assessed through randomized controlled trials (RCTs) via so-called outcome measures: information we want to know to determine if patients improved. As we have highlighted in a recent review paper on theoretical and statistical challenges in depression measurement published in Nature Reviews Psychology (PDF), depression is measured in all sorts of different ways, which raises all sorts of questions. For example, do all these different measures for depression (at least 280!) actually measure the same construct? And do the differences between depression measures stand in the way of cumulative science? After all, it’s difficult to compare the results of studies if they used different measurements.

In today’s paper, Chris tackles a different question related to research waste, specifically: Do we measure the things we are actually interested in? For example, the famous MADRS depression scale was developed in the 1970s by following a very small group of 64 patients during a single clinical trial in which patients received an antidepressant. The developers gave patients a longer list of items they thought relevant for depression, but patients only improved on 10 of those, so the scale was reduced to 10 items. This leaves a lot of questions open: perhaps some of the dropped items are highly relevant to depression but the treatment was just bad? Or perhaps some of the patients did not improve on the ‘insomnia’ item because there was construction going on in their neighborhood?

2. The PROCEED study

Luckily, Astrid and the gang have previously published the PROCEED study which helps us in determining which outcomes are relevant to stakeholders; I have summarized PROCEED in a previous blog post.

In PROCEED, we asked what relevant stakeholders think we should measure in clinical trials for depression, and received 8183 open-ended answers from 1912 patients, 464 informal caregivers, and 627 health-care professionals from 52 countries. This differs from the typical approach: existing scales like the HRSD, BDI, and CES-D were put together by (typically old, white, male) researchers or clinicians without involving stakeholders. Might they have missed important aspects?

Here is a summary of the results:

Core Outcomes


The study identified 80 relevant domains, including clinical symptoms and functioning dimensions. Going back to the numerous depression measures used in the literature, the PROCEED outcomes raise the question of which of those measures capture these domains comprehensively. To address this gap, our new paper aims to identify all outcome measures used in recent RCTs for depression, and to evaluate the extent to which they cover the 80 PROCEED domains.

3. Study setup

The research was conducted in 4 steps.

First, we identified 450 RCTs for adult unipolar and bipolar depression registered between 2018 and 2022.

Second, we counted a total of 388 outcome measures across these trials. For depression specifically, we identified 296 different measures. Holy cow. We focused on efficacy-related, clinical, and functional outcomes. We did not consider biological outcomes (e.g., blood samples, because they are surrogates) or harm-related outcomes (e.g., adverse events, because they are already systematically collected).

Third, the outcome measures were categorized into two parts: objective measures (e.g., duration of hospitalization or death) on the one hand, and subjective measures (e.g., clinical symptoms or satisfaction) on the other. Subjective outcomes were further divided into

  • patient-reported outcome measures (PROMs)
  • clinician-reported outcome measures (ClinROMs)
  • observer-reported outcome measures (ObsROMs)
  • and performance outcome measures (PerfOMs)

Fourth, depression measures are often scales. For example, the MADRS I described above has 10 items, and they can differ a lot from each other. So to check whether all the identified measures overlap with the 80 outcome domains identified by the PROCEED study that matter most to patients, we investigated measures at the item level. Numerous raters (absolute heroes!) compared every single item of each scale with every outcome domain. For example, a neuropsychologist compared the PerfOMs, folks with lived experiences matched the PROMs, clinicians matched the ClinROMs, etc.
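To make the item-level matching concrete, here is a minimal sketch of how a measure's domain coverage could be derived from rater judgments. The item names and domain labels are invented for illustration, and this is not the paper's actual procedure or data: it only shows the idea that a measure covers a domain as soon as at least one of its items was matched to it.

```python
from typing import Dict, Set

# Hypothetical rater output: for each item of a scale, the set of
# domains the raters judged that item to capture.
# Item names and domain labels are made up for illustration.
ItemMatches = Dict[str, Set[str]]


def measure_coverage(item_matches: ItemMatches) -> Set[str]:
    """Return every domain captured by at least one item of the measure."""
    covered: Set[str] = set()
    for domains in item_matches.values():
        covered |= domains  # union, so repeated matches count once
    return covered


def coverage_share(item_matches: ItemMatches, n_domains: int) -> float:
    """Fraction of all domains that the measure covers."""
    return len(measure_coverage(item_matches)) / n_domains


example_scale: ItemMatches = {
    "item_1_sadness": {"low mood", "mental pain"},
    "item_2_sleep": {"insomnia"},
    "item_3_concentration": {"concentration", "memory loss"},
    "item_4_mood_again": {"low mood"},  # overlaps with item 1
}

covered = measure_coverage(example_scale)
print(sorted(covered))
print(f"{len(covered)} of 80 domains covered")
```

Note that one item can map to several domains, and several items can map to the same domain, which is why a 10-item scale can cover many more (or fewer) than 10 domains.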

4. Study results

4.1 Descriptives

A detailed description of the type of RCTs identified can be found in the paper. Of the 388 identified outcomes, there were

  • 259 (67%) PROMs
  • 63 (16%) PerfOMs
  • 57 (15%) ClinROMs
  • 1 (<1%) ObsROM
  • and 8 (2%) objective outcomes.

RCTs require one primary outcome, which is the one that is usually reported in the abstract and relied upon for interpretation. In the RCTs we analyzed, 176 (39%) of the primary outcomes were PROMs, and 276 (61%) ClinROMs. The most commonly used scales were as follows:

We also investigated results in only depression vs only bipolar trials, as well as drug vs psychotherapy trials, reported in more detail in the paper.

4.2 Outcome matching

We then took the most commonly used outcome measures we identified and mapped them onto the 80 PROCEED domains; I’ll show the results separately for PROMs and ClinROMs.

For patient-rated measures, the PROMs, results are as follows. Each circle (different colors) represents one specific outcome measure, and each line represents one of the 80 domains identified in PROCEED. That means that the light blue circle, BDI-II, covers ummmm-let-me-count I’m going to say a bit less than half of the 80 domains. The GAD-7 (red circle), on the other hand, only taps into 5 PROCEED domains.

In summary, the 10 most commonly used PROMs focused on depression, sleep quality, quality of life, and disability or impairment. Each of these PROMs overlapped with between five (6%) and 31 (39%) of the 80 domains. If you were to give a patient all of these scales at the same time, they would still fail to cover 24 (30%) of the 80 domains identified in PROCEED, close to one in three! Some of these relate to symptoms (e.g., helplessness, memory loss, or dissociation), others to functioning (e.g., capacity to get out of bed or communicating feelings).

For clinician-rated outcomes, the ClinROMs, six were multi-item measures assessing depression, suicidality, anxiety, and functioning, and three were Clinical Global Impression scales, assessing severity of depression, patient improvement, and bipolar severity. In total, 18 (23%) of PROCEED domains were not covered by these nine ClinROMs; as above, some were symptoms (e.g., emotional blunting, incurability, or feeling alone), others related to functioning (e.g., social isolation or ability to cope with a life event).

If we now rank all outcome measures, the list is as follows:

  • HAMD contains 47 (59%) of the 80 domains, 42 / 64 symptom domains, and 5 / 16 functioning domains
  • MADRS contains 42 (53%) of the 80 domains, 41 / 64 symptom domains, and 1 / 16 functioning domains
  • BDI-II contains 31 (39%) of the 80 domains, 28 / 64 symptom domains, and 3 / 16 functioning domains
  • PHQ-9 contains 22 (28%) of the 80 domains, 21 / 64 symptom domains, and 1 / 16 functioning domains.
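The percentages in this ranking follow directly from the symptom and functioning counts. As a minimal sketch (the helper name `pct_half_up` is my own, not something from the paper), the numbers above can be reproduced like this:

```python
def pct_half_up(count: int, total: int) -> int:
    """Percentage of count out of total, rounded half up (integer arithmetic)."""
    return (200 * count + total) // (2 * total)


TOTAL_DOMAINS = 80  # 64 symptom domains + 16 functioning domains

# (symptom domains covered, functioning domains covered), as listed above
coverage = {
    "HAMD": (42, 5),
    "MADRS": (41, 1),
    "BDI-II": (28, 3),
    "PHQ-9": (21, 1),
}

# Rank measures by the total number of domains they cover
ranked = sorted(coverage.items(), key=lambda kv: sum(kv[1]), reverse=True)

for name, (symptom, functioning) in ranked:
    total = symptom + functioning
    print(f"{name}: {total} ({pct_half_up(total, TOTAL_DOMAINS)}%) of {TOTAL_DOMAINS} domains")
```

The half-up helper is a deliberate choice: Python's built-in `round` rounds exact halves to the nearest even number, so `round(52.5)` gives 52, whereas the MADRS figure of 42 / 80 = 52.5% is reported as 53%.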

This is, of course, not the final answer about which depression measure is best. It’s probably not even a good answer, because it ignores numerous aspects of validity. It is also not a question we set out to answer when working on the paper.

But it is an answer, and probably an answer based on more evidence than most previous answers in the literature.

5. Conclusions

Reducing heterogeneity of outcome measures in the literature whilst focusing on measuring outcomes that matter to patients is not an easy task. We offer two ways forward.

First, developing a core outcome set (COS) for depression, a “minimum set of outcomes agreed on by all relevant stakeholders to be measured in all trials of a given condition”. PROCEED and this study here provide a great first step on the road toward a COS. We discuss problems of initiatives trying to side-step the development of a COS, such as recent efforts imposing very specific measures such as the PHQ-9 as a required standard.

Second, interim solutions include using measures together that maximize PROCEED domain coverage, such as the complementary use of the HAMD, BDI-II, and Functional Assessment Short Test that would result in 59 (74%) of the 80 patient-relevant domains being assessed.
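To illustrate what "complementary" means here, the sketch below picks measures one at a time, each time choosing the one that adds the most domains not yet covered. To be clear, this greedy heuristic is my own illustration and not the procedure used in the paper, it is not guaranteed to find the best possible combination, and the scale names and domain numbers are toy data.

```python
from typing import Dict, List, Set, Tuple


def greedy_complementary(
    measures: Dict[str, Set[int]], k: int
) -> Tuple[List[str], Set[int]]:
    """Pick up to k measures, each time adding the one that covers
    the most domains not yet covered by the measures already chosen."""
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(k):
        candidates = [m for m in measures if m not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda m: len(measures[m] - covered))
        if not measures[best] - covered:
            break  # no remaining measure adds anything new
        chosen.append(best)
        covered |= measures[best]
    return chosen, covered


# Toy data: 12 hypothetical domains, numbered 1 to 12
toy_measures = {
    "scale_A": {1, 2, 3, 4, 5},
    "scale_B": {4, 5, 6, 7, 10},
    "scale_C": {1, 2, 3},  # fully redundant with scale_A
    "scale_D": {8, 9},
}

chosen, covered = greedy_complementary(toy_measures, k=3)
print(chosen, f"{len(covered)} of 12 domains covered")
```

In the toy data, scale_C is never selected even though it covers more domains than scale_D, because everything it covers is already captured by scale_A. That is the sense in which a small but complementary measure can be more useful in a battery than a larger but redundant one.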

I’ll paste the full study abstract below. Huge thanks to Chris, Astrid, and the rest of the team for the massive amount of work. The full paper is online in Lancet Psychiatry.

“Research waste occurs when randomised controlled trial (RCT) outcomes are heterogeneous or overlook domains that matter to patients (eg, relating to symptoms or functions). In this systematic review, we reviewed the outcome measures used in 450 RCTs of adult unipolar and bipolar depression registered between 2018 and 2022 and identified 388 different measures. 40% of the RCTs used the same measure (Hamilton Depression Rating Scale [HAMD]). Patients and clinicians matched each item within the 25 most frequently used measures with 80 previously identified domains of depression that matter to patients. Seven (9%) domains were not covered by the 25 most frequently used outcome measures (eg, mental pain and irritability). The HAMD covered a maximum of 47 (59%) of the 80 domains that matter to patients. An interim solution to facilitate evidence synthesis before a core outcome set is developed would be to use the most common measures and choose complementary scales to optimise domain coverage.”

7 thoughts on “Which depression measure is best?”

  1. Paul Naarding

    Could you please explain how the (17-item) Hamilton can relate to more than 40 of the symptom domains of PROCEED?

    Furthermore, why should you include the dysfunctioning of an illness in the symptom list? Dysfunctioning is at the core of any relevant (psychiatric/medical) disorder. I will not argue that functioning should be the target of therapy, but most improvement in function is probably reached with effective treatment of the disorder itself.

    1. Eiko Post author

      Thanks Paul, great questions!

      1) Appendix 3 of the paper contains a very detailed list of the 80 domains (https://www.sciencedirect.com/science/article/pii/S2215036623004388?dgcid=author#sec1). These domain definitions make it possible for a single item to capture multiple domains, of course not in great depth or comprehensively, but with sufficient overlap that the raters decided the items capture the relevant domains! Hope this makes sense.

      2) In PROCEED, we asked people what they think we should measure to determine whether people get better. We did not ask specifically about “symptoms”. Dysfunction as an answer here makes a lot of sense, of course. Let me know if I misunderstood your question here.

      1. Paul Naarding

        Thanks for answering Eiko, I really like the articles. I would like to draw your attention to a more clinical article recently written by my old-age psychiatrist companion Richard Oude Voshaar: https://pure.rug.nl/ws/portalfiles/portal/870583749/afad239.pdf
        It is a bit more specific on the clinical problem of late-life-depression.

        On the first issue that I raised, I thank you for the clarification, but am somewhat disappointed that it didn’t lead to fewer but to more domains. Could it be that there is, to a greater or lesser extent, overlap between these domains, and that you could come to a certain reduction?

        On my second issue, my argument is that any disease or disorder (psychiatry) has the main feature that symptoms are so serious that they give rise to dysfunctioning in work, private, or leisure activities. So this is really a matter of definition, and thus I would not be interested in dysfunctioning as part of the disorder itself.
        As I stated in my former comment, I will take dysfunctioning very seriously, as a matter of fact it is usually the major outcome measure that is really important.

        1. Eiko Post author

          Thanks for clarifying, and for the paper recommendation; it goes onto my reading pile.

          (1) The PROCEED paper explains the qualitative coding in detail, and the 80 domains came out of a ton of very careful work. The result of the qualitative analysis is that it would be good to consider these 80 domains separately.

          (2) I agree with the definition, so it is great to see that lived experience experts, carers and experts also considered dysfunction a crucial aspect to measure in clinical trials. This ties right into the fantastic 2009 paper by McKnight & Kashdan that inspired one of my first PhD papers, arguing we ought to measure functioning/impairment instead of the sole focus on symptom measures in many clinical trials: https://pubmed.ncbi.nlm.nih.gov/19269076/

          So I think we all agree on this!

  2. Gerrit Burkhardt

    Dear Eiko,

    Thank you for this really relevant work! As always, a pleasure to read. To make matters worse for the MADRS: the 64 patients seem to have been enrolled in multiple “clinical trials” of mianserin, maprotiline, amitriptyline, and clomipramine. Unfortunately, the original publication does not reference these trials, and it’s not clear how many patients were treated with each drug (see https://www.researchgate.net/publication/224773098_A_New_Depression_Scale_Designed_to_be_Sensitive_to_Change).

    Best wishes,
    Gerrit Burkhardt (LMU Munich)

    1. Eiko Post author

      Well, at least it means that these 10 items improved best, on average, across a number of treatments. Thanks, I hadn’t been aware that the patients came from different trials using different medications.
