Measurement Schmeasurement

TL;DR This post summarizes our recent APS symposium entitled “Measurement Schmeasurement” (feat. Jessica Flake, Mijke Rhemtulla, Andre Wang, and Scott Lilienfeld); provides a brief history of how it came to be, and the important role social media plays in modern psychology; and serves as a shameless plug for an invited measurement workshop Jessica and I will be giving at SIPS 2018 [1]. Also: Measurement Matters.

A brief history of Measurement Schmeasurement

The idea came up on October 9th 2017, when Jennifer Tackett brought up dodgy measurement on Twitter:

Jessica Flake replied:

And soon thereafter, a new idea was born:

So Jessica and I [2] crowd-sourced a symposium on problematic measurement practices, with calls to participate on Twitter and PsychMap. The response was overwhelming, with many more contributions than we could fit into one symposium. Thanks again to everybody who wanted to contribute, and sorry that APS only has 4 spots for a symposium. After we decided to team up with Mijke Rhemtulla and Andre Wang for the submission, we slapped the title Measurement Schmeasurement on it, and submitted it to APS for the conference in May 2018. While waiting for APS to reject the symposium (measurement is an incredibly dry and boring topic, after all), we wrote a piece for the APS Observer entitled Measurement Matters, and put together a reading list of important measurement papers and book chapters.

And then — to my surprise — APS accepted our symposium.

Measurement Schmeasurement — it’s happening!

Cut to May 24th 2018, Hilton, San Francisco. I make my way to the room in which Measurement Schmeasurement is supposed to take place. My thoughts wander … I notice that I feel somewhat guilty for having abandoned the students back home whom I was supposed to teach about the importance of methodology in clinical psychology, but I know that they’re in good hands: Kim de Jong has taken over the teaching.

I shuffle onwards, to the room I had scouted the day before: As expected, it’s one of the smallest rooms in the building, at the very end of a dark corridor … surely no one will show up. I’m in my bathrobe, haven’t prepared a presentation. Why would I.

I’m shocked when I enter the room. There are people here. Plural. Several people.

In panic, I cause some technical issues by sabotaging the VGA cable, and try to convince people to leave. “Folks, this is a symposium on measurement. You’re probably in the wrong room”.

But it fills up. People stand outside the room now.

And others who couldn’t attend the conference tell us on Twitter that they would have liked to attend.

It dawns upon me: I will actually have to present something today. To my relief, Jessica starts with her presentation, and I furiously start preparing my slides … you have 10 minutes, Eiko. Make it count.

1. The Fundamental Role of Construct Validation in Original and Replicated Research

Jessica Flake starts with a summary of her 2017 paper, in which she analyzed measurement practices in 35 articles published in JPSP (the Journal of Personality and Social Psychology), with 700 reported measures (slides). Read the paper: a 10-minute presentation cannot do justice to all the analyses Flake et al. conducted.

In about half of the 700 measures, authors provided no validation information whatsoever (i.e. no references or information regarding reliability), and when authors presented evidence, in 61% of cases they only reported Cronbach’s Alpha. Jessica connects that to the replication crisis in psychology, making the argument we also made in the Observer: Current discussions about the replicability crisis and questionable research practices focus on statistics (e.g. p-hacking), but proper measurement is more foundational, and largely ignored.

The second paper Jessica talks about is work in progress: An analysis of measurement practices in the 2015 “Estimating the reproducibility of psychological science” paper. It looks like about 20% of the replications face measurement problems such as on-the-fly scales, scales with low reliability, lack of translated versions, lack of measurement invariance, or breaking scales apart differently than the original studies.

But what does it mean if a study does not replicate when the measurement is not the same? And conversely, what does it mean if a study does replicate, if scale reliability is low, or measurement invariance doesn’t hold? Jessica concludes that a focus on measurement practices should play a key role in our efforts to improve cumulative psychological science:

2. Common Measurement Problems in Psychology: The Example of Major Depression

I am up second, having salvaged some semi-coherent slides. I zoom in on one specific example construct with hugely problematic measurement properties: Major Depression (slides [3]).

I start out with urgency: Depression is among the most common and debilitating mental disorders in the world, and among the most measured constructs in all sciences: Three (!) rating scales of depression are among the top 100 papers across all sciences, ever. I follow this up with six points.

First, there is a fundamental lack of agreement on what depression is and how to best measure it. This is obvious when considering the fact that there are 280+ different depression scales, many of which are still in use.

Second, scales differ considerably in content, and are often at best moderately correlated (figure from this paper).

Third, given this, and given that researchers mainly use only one scale in scientific research on depression (i.e. given that scales are implicitly treated as interchangeable), this implies considerable issues for replicability and generalizability of depression research. That is, a relationship you find between the Beck Depression Inventory and an outcome might not replicate if you had used the Hamilton scale. Similar to Jessica, I point out that proper measurement is a prerequisite for drawing valid inferences from studies.

Fourth, I highlight work by Ken Kendler and Mark Zimmerman who showed that symptoms in depression scales and diagnostic criteria such as the DSM are largely based on history and path dependence, not psychometric evidence, which shouldn’t inspire confidence in depression measures. Beck and Hamilton and Zung and Radloff and many others just put symptoms they thought were important into their respective scales, and today’s DSM-5 symptoms are largely based on a list Cassidy et al. put together in 1957 in an empirical study of 100 manic-depressive patients.

Fifth, given how common scales were ‘developed’ in the 60s and 70s, it is not surprising that they do not meet basic psychometric criteria such as temporal measurement invariance or unidimensionality (see here for a comprehensive paper on these two topics).

I have to skip the last topic due to time constraints, but want to mention it here briefly: To meet the DSM criteria for depression, one needs to have at least 1 of the 2 core symptoms, and 5 out of 9 symptoms total. In large epidemiological data, this means that most people will have missing values on symptoms 3 to 9, because these symptoms are not queried when neither core symptom is endorsed. When analyzing such data, these missing values are usually replaced with zeros. This practice is common, and introduces considerable artificial dependencies in the data [4]. Measurement matters.
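To see how zero imputation can manufacture dependencies, here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk or from any real dataset: nine binary symptoms driven by one shared liability, the loading of 0.6, the threshold of 0.5, and a skip rule that codes symptoms 3 to 9 as zero whenever neither core symptom is endorsed.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_symptoms(n=20000, n_symptoms=9, seed=1):
    """Hypothetical data: binary symptoms that share one latent liability."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        liability = rng.gauss(0, 1)
        rows.append([
            1 if 0.6 * liability + 0.8 * rng.gauss(0, 1) > 0.5 else 0
            for _ in range(n_symptoms)
        ])
    return rows


def zero_impute(rows):
    """Mimic the skip rule: without a core symptom (columns 0 and 1),
    symptoms 3 to 9 are never asked and end up coded as 0."""
    return [
        row[:2] + [0] * (len(row) - 2) if row[0] == 0 and row[1] == 0 else list(row)
        for row in rows
    ]


def mean_correlation(rows, columns):
    """Average pairwise correlation among the given columns."""
    cols = [[row[c] for row in rows] for c in columns]
    pairs = [
        pearson(cols[i], cols[j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
    ]
    return sum(pairs) / len(pairs)


complete = simulate_symptoms()
imputed = zero_impute(complete)
non_core = range(2, 9)  # symptoms 3 to 9
r_complete = mean_correlation(complete, non_core)
r_imputed = mean_correlation(imputed, non_core)
print(f"complete data: {r_complete:.2f}, after zero imputation: {r_imputed:.2f}")
```

In this toy setup the average correlation among symptoms 3 to 9 should come out clearly higher after zero imputation than in the complete data, although nothing about the symptoms themselves has changed. How large the inflation is in real data depends on prevalence and on the actual skip structure.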

3. Unintended consequences of latent variable modeling

As third speaker, Mijke Rhemtulla provides a beautiful example where taking into account measurement error can move researchers further away from the true model (slides).

The idea of latent variables is often that they deal with measurement error:

This has become somewhat of a mantra, and Mijke shows that ‘dealing with measurement error’ can lead to worse recovery of a true model compared to using, for instance, just a mean of several items. It depends on what the true model is: a reflective factor model does not automatically solve all issues. On the contrary, sometimes it creates them.

To show this, Mijke simulated data from the true mediation model below, and fit different models to the data.

I’m not going to go into the details here (see her slides), but in this case, the model that recovers the X -> M and M -> Y paths with the smallest deviation from the true parameter is actually the simplest model that used the mean of the items x1 through x4, where coefficients a and b are estimated as a=.57 and b=.57 (the true coefficients are a=.6 and b=.6). In contrast, latent variable models deviate more strongly from the true model, and some of them have terrible fit:

Mijke concludes with highlighting the crucial part that theory plays in psychometrics:

I’m very much looking forward to reading her paper on this, which is currently in preparation. Mijke’s conclusion also reminded me of one of my favorite blog posts, written by Denny Borsboom, on what he calls “theoretical amnesia” in psychology.
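For readers who want to play with the general point, here is a small Python sketch. It is not Mijke's simulation: the data-generating model is an assumption (the mean of four items itself drives M, with a true standardized path of a = .6), only the X -> M path is modeled, and the classical correction for attenuation via Cronbach's alpha is used as a crude stand-in for "dealing with measurement error" instead of a fitted latent variable model.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(item_columns):
    """Cronbach's alpha from a list of item columns."""
    k = len(item_columns)
    totals = [sum(values) for values in zip(*item_columns)]
    item_var = sum(statistics.variance(col) for col in item_columns)
    return k / (k - 1) * (1 - item_var / statistics.variance(totals))


def simulate(n=5000, a=0.6, seed=7):
    """Assumed true model: the item mean (not a latent factor) causes M."""
    rng = random.Random(seed)
    items = [[] for _ in range(4)]
    m = []
    for _ in range(n):
        common = rng.gauss(0, 1)
        xs = [0.6 * common + 0.8 * rng.gauss(0, 1) for _ in range(4)]
        composite = sum(xs) / 4  # population variance: 0.36 + 0.64 / 4 = 0.52
        m.append(a * composite / math.sqrt(0.52)
                 + math.sqrt(1 - a ** 2) * rng.gauss(0, 1))
        for column, x in zip(items, xs):
            column.append(x)
    return items, m


items, m = simulate()
means = [sum(values) / 4 for values in zip(*items)]
alpha = cronbach_alpha(items)
a_mean = pearson(means, m)              # standardized slope using the item mean
a_corrected = a_mean / math.sqrt(alpha)  # same slope "corrected" for unreliability
print(f"alpha={alpha:.2f}  mean score: a={a_mean:.2f}  corrected: a={a_corrected:.2f}")
```

Under this assumed model the plain mean score should land close to the true value, while the corrected estimate overshoots, because the "error" being removed is part of what actually drives M. With a different true model the ranking can flip, which is exactly why theory has to come first.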

4. Connecting Unreliable Measurement to Statistical Power in Structural Equation Modeling (SEM)

Andre Wang gives the final talk in our symposium (slides), and starts out with the rationale that many measures in psychology have low reliability, which can cause problems for power. Common wisdom is that SEM can take out measurement error, and hence increase the reliability of measures. But what is the relation between measurement error and power in SEM? After all, it is well known that SEM is more accurate, but less precise:

Andre presents a small part of a simulation study he is currently writing up together with his collaborators for a paper. In the study, they simulated data from a model with 2 factors X and Y, each with several observed indicators, and there was a non-zero path from X to Y in the model as well. The question is now what the power is to retrieve this non-zero parameter, depending on sample size, the size of the parameter, the number of indicators, and the reliability of the indicators:

You can find simulation results in the slides; in summary, more indicators per factor lead to more power to recover the coefficient, as do larger samples. But even in these situations, small parameters are not adequately recovered if items are unreliable.
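The interplay of reliability, number of indicators, and power can be explored with a rough Python sketch. Note that this is not Andre's SEM simulation: it swaps SEM for mean scores and a Fisher z test on their correlation, and the function name `composite_power`, the equal loadings, the sample size of 100, and the path of .3 are all assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def composite_power(n, beta, n_items, loading, reps=200, seed=0):
    """Share of simulated samples in which the correlation between the
    mean scores of X and Y is significant (two-sided, alpha = .05)."""
    rng = random.Random(seed)
    noise_sd = math.sqrt(1 - loading ** 2)
    resid_sd = math.sqrt(1 - beta ** 2)

    def mean_score(factor):
        # each indicator = loading * factor + unique noise
        return sum(loading * factor + noise_sd * rng.gauss(0, 1)
                   for _ in range(n_items)) / n_items

    hits = 0
    for _ in range(reps):
        xs, ys = [], []
        for _ in range(n):
            fx = rng.gauss(0, 1)
            fy = beta * fx + resid_sd * rng.gauss(0, 1)  # non-zero X -> Y path
            xs.append(mean_score(fx))
            ys.append(mean_score(fy))
        z = math.atanh(pearson(xs, ys)) * math.sqrt(n - 3)  # Fisher z test
        if abs(z) > 1.96:
            hits += 1
    return hits / reps


power = {}
for loading in (0.4, 0.8):
    for n_items in (2, 10):
        power[(loading, n_items)] = composite_power(100, 0.3, n_items, loading)
        print(f"loading={loading}, items={n_items}: power={power[(loading, n_items)]:.2f}")
```

Even in this simplified setup, weak loadings should cost a lot of power and adding indicators should buy some of it back. It says nothing about the accuracy versus precision trade-off of SEM itself, for which you will need the slides and the forthcoming paper.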

Andre concludes:

Looking forward

I’m very happy with how things went, and want to thank all contributors for making this happen. I also want to thank Scott Lilienfeld, who was our fantastic discussant. Scott highlighted measurement issues that he has seen in his career, and focused on talking about several measurement problems that keep popping up in his role as editor in chief at Clinical Psychological Science.

Next up is the Society for the Improvement of Psychological Science (SIPS) in Grand Rapids in June 2018, where Jessica and I will give a workshop on “Questionable measurement practices threaten cumulative psychological science — here’s how to avoid them” [5].

Here is our abstract:

Researcher degrees of freedom and questionable research practices have contributed to a replicability crisis in psychology. New open science standards such as pre-registrations or registered reports improve cumulative psychological science by helping us to define our research questions and how we will answer them a priori. This workshop focuses on a topic that has received limited attention so far: questionable measurement practices (QMPs). First, we will discuss examples of QMPs in the psychological literature, and describe how they connect to larger issues of transparency, replicability, and reproducibility. Second, we will outline specific points in the research pipeline such as measurement selection and scoring where QMPs commonly arise. Third, we will show that planning and transparency can prevent QMPs and improve the validity of your work. Fourth, we will outline techniques for promoting measurement transparency that participants can apply to their research, including a step-by-step tutorial for pre-registering studies and writing registered reports. The workshop will be interactive in that participants will discuss, reflect, and get feedback about their own constructs and measures. Our goal is that participants leave the workshop with a clear plan for improving their own measurement practices.

Hope to see y’all there. And now excuse me, I need to continue writing my grant on depression measurement. This is the first time in years I feel such a grant might have a chance to succeed. HYPE!

  1. COI: Absolutely none, since the workshop is not only free of charge, but will also be video-recorded to be available to a broad audience!
  2. Or as we say: Jesseiko Frake
  3. Powerpoint, because otherwise the gifs don’t run
  4. See here for an example where zero imputation led from correlations of ~0.3 among depression symptoms to correlations of > 0.9
  5. Kind of sounds like a buzzfeed title: “18 reasons why cheese is good for you”

One thought on “Measurement Schmeasurement”

  1. MMorales

    Thank you. As a researcher from a developing country with interest in measurement and assessment, particularly in the clinical setting, this post, as well as your tweets, are very much appreciated. Congratulations on the successful symposium!

