Let me summarize the findings first.
We examined 2 crucial psychometric assumptions that are part of nearly all contemporary depression research. We find strong evidence that these assumptions do not seem to hold in general, which impacts on the validity of depression research as a whole.
What are these psychometric assumptions? In depression research, various symptoms are routinely assessed via rating scales and added to construct sum-scores. These scores are used as a proxy for depression severity in cross-sectional research, and differences in sum-scores over time are taken to reflect changes in an underlying depression construct. For example, a sum-score of symptoms that supposedly reflects “depression severity” is often correlated to stress or gender or biomarkers to find out what the relationship between these variables and depression is; this is only valid if a sum-score of symptoms is actually a reasonable proxy for depression severity. In longitudinal research, if a sum-score decreases from 20 points to 15 points in a population, we conclude that depression improved somewhat. This is only valid if the 20 points and the 15 points reflect the same thing (if the 20 points would reflect intelligence and the 15 points neuroticism, the difference of 5 points over time would be meaningless).
To allow for such interpretations, rating scales must (a) measure a single construct, and (b) measure that construct in the same way across time. These requirements are referred to as unidimensionality and measurement invariance. In the study, we investigated these 2 requirements in 2 large prospective studies (combined n = 3,509) in which overall depression levels decrease, examining 4 common depression rating scales (1 self-report, 3 clinician-report) with different time intervals between assessments (between 6 weeks and 2 years).
A consistent pattern of results emerged. For all instruments, neither unidimensionality nor measurement invariance appeared remotely tenable. At least 3 factors were required to describe each scale (this means that the sum-score does not reflect 1 underlying construct, but at least 3 and sometimes up to 6), and the factor structure changed over time. Typically, the structure became less multifactorial as depression severity decreased (without however reaching unidimensionality). The decrease in the sum-scores was accompanied by an increase in the variances of the sum-scores, and increases in internal consistency.
You can see the results in the graph below. The four sections represent four different rating scales, the lines represent the first (red) and second (green) measurement point of the longitudinal datasets, PA (blue) means parallel analysis which tells us how many factors a scale has at a given timepoint, and the x-axis represents the number of factors that have to be extracted. If our red and green data lines are above the blue PA line, it means that we should extract a factor. You can read up on all the fancy ESEM modeling in the paper itself, but the gist is that in order to be unidimensional, a scale must only have 1 factor; and as you can see, all scales require the extraction of at least 3 factors. In order to be measurement invariance, the factor solution has to be stable across time; it’s highly evident from the graphs that this is not the case, because the lines for the 2 measurement points per scale should roughly overlap if this were the case (do me the favor and click on the image, I can’t embed vector graphics here). For a scale to be unidimensional and measurement invariant, the red and green line should be very similar, and they should be above the blue line for the first factor and then drop below the blue line for the second etc factors.
These findings challenge the common interpretation of sum-scores and their changes as reflecting 1 underlying construct. In other words, summing up symptoms to a total score, and correlating this total score statistically with other variables such as risk factors or biomarkers, is very questionable if the score itself is not unidimensional and does not reflect 1 underlying construct (depression). Obviously, if you have worked with depression symptoms before you know that they are very different from each other, and the idea that they are interchangeable indicators of 1 condition (depression) is very problematic. But now we have empirical evidence that this is the case — which is consistent with many other papers that have shown similar results. The special thing about this paper is that we examined these properties across a whole range of scales, and tested the robustness of the results by varying many dimensions (datasets, clinician- vs patient-rated, and timeframes).
The violations of common measurement requirements are sufficiently severe to suggest alternative interpretations of depression sum-scores as formative instead of reflective measures. A reflective sum-score is 1 that indicates an underlying disorder, the same way having a number of measles symptoms tells us that you have measles: the symptoms inform us about a problem because the problem caused the symptoms. A formative sum-score, on the other hand, is nothing but an index: a sum of problems. These problems are not meant to reflect or indicate an underlying problem. Still, we can learn something from such a sum-score: the more problems people have, the worse they are probably doing in their lives.
» Fried, E. I., van Borkulo, C. D., Epskamp, S., Schoevers, R. A., Tuerlinckx, F., & Borsboom, D. (2016). Measuring Depression over Time … or not? Lack of Unidimensionality and Longitudinal Measurement Invariance in Four Common Rating Scales of Depression. Psychological Assessment. Advance Online Publication. (PDF)