A checklist to vet psychedelic science

As a result of a commentary I wrote together with Ioana Cristea and Florian Naudet, JAMA Psychiatry recently published a correction of a serious error in a 2023 paper on treating bipolar depression with psilocybin. The correction included changing the title of the paper itself.

In this blog post, I’ll very briefly discuss the core issue in this paper that led to the correction, but then, more importantly, use this paper as an opportunity to introduce the checklist we developed for a review on psychedelic science. We’ll go through the paper step by step and apply the checklist to see how well the paper performs.

1. Aaronson et al. 2023: psilocybin for bipolar disorder

The paper by Aaronson et al. 2023, entitled “Single-Dose Synthetic Psilocybin With Psychotherapy for Treatment-Resistant Bipolar Type II Major Depressive Episodes: A Nonrandomized Controlled Trial”, followed 15 patients for 12 weeks. All patients had treatment-resistant type 2 bipolar depression. Patients received a single dose of 25 mg psilocybin and had a total of 7 sessions with therapists: 3 before dosing, 3 after dosing, and the 8-hour dosing session itself. Participants showed a dramatic pre-post response with an effect size of Cohen’s d=4.08.
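To make the reported effect size concrete, here is a minimal sketch of one common convention for a pre-post Cohen’s d: the mean change divided by the standard deviation of the baseline scores. I am not claiming this is the variant the authors used; the function name `cohens_d_pre_post` and the toy scores are purely illustrative and are not data from the trial.

```python
from statistics import mean, stdev


def cohens_d_pre_post(pre, post):
    """Standardized mean change: (mean(pre) - mean(post)) / SD(pre).

    Assumption: the baseline SD is the standardizer. Other conventions
    (e.g. the SD of the change scores) give different numbers.
    """
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need paired scores from at least two participants")
    return (mean(pre) - mean(post)) / stdev(pre)


# Toy symptom scores for three hypothetical participants (NOT trial data).
toy_pre = [10, 12, 14]
toy_post = [8, 10, 12]
print(cohens_d_pre_post(toy_pre, toy_post))  # 1.0: scores dropped by one baseline SD
```

Nothing in this calculation knows why the scores changed. The same d results whether the drop comes from the drug, the psychotherapy, expectancy effects, or regression to the mean, which is why a large pre-post d on its own says little about efficacy.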

My bachelor students learn the important difference between an open label study (no control group) and a controlled trial (i.e., with a control group). Everybody in this research area is aware of this crucial difference: this necessarily includes all authors of this paper, the reviewers, and the handling editor.

None of these 15+ folks appear to have been concerned about the fact that the title of the paper included the term ‘controlled trial’ when the paper was a study without a control group. I am thankful that the journal published a correction that correctly describes this as an error.

But it is worth noting that this is yet another obviously problematic paper in a long list of such papers.

  • Just a few weeks ago, I had a paper corrected on using psychedelics to improve wellbeing in cancer patients: the first author did not report financial conflicts of interest, but I found out that he is the CEO of a company selling psychedelic therapy for cancer patients.
  • A few days back, we submitted a commentary to The BMJ about a new meta-analysis of psilocybin as a depression treatment, in which the authors clearly made statistical errors that inflated treatment estimates to truly ridiculous levels (e.g., effect sizes may have been inflated by up to 500%).
  • This list goes on and on; see this recent video on various serious integrity distortions.

What is happening in this literature? I am also a little flabbergasted by the authors’ response to our commentary pointing out this error. Instead of a clear “that is our fault and should really not have happened”, they write:

“This could be problematic, as there was no comparator group or other control for non-specific effects. However, the nature of the study design was clearly described in the article and the accompanying editorial, which stated, ‘Aaronson and colleagues present an important step forward in this single-arm open-label study of psilocybin in patients with bipolar II depression’.”

The fact that another publication adequately describes the design of your paper when you didn’t is not the defense you think it is. The authors write further:

“We elected to use the phrase ‘controlled trial’ to reflect the stringency of participant selection and the prohibitions on other psychotropic drugs before and after psilocybin dosing.”

That is not what ‘controlled trial’ means, and the authors must know this. The term ‘controlled trial’ means that one of the treatment arms (plural) serves as the control condition.

2. A checklist for vetting psychedelic studies

I had honestly planned to stop writing blogs about individual papers after our recent review paper and my new video. But this paper by Aaronson et al. 2023 seems like a great opportunity to showcase our new checklist for reviewers, policy makers, journalists, funders and others who consume papers in this area.

So let’s go through it, and rate each category using a traffic light system: 🟢 🟡 🔴 (and ⚪ means “does not apply”).

🟡 2.1 Valid inferences

“Is there sufficient transparency around data collection and statistical analyses, and are the conclusions supported by the evidence? Is there evidence that the treatment and not other factors (e.g. breaking the blind) explain the difference between the intervention and control group? Have independent reviewers with the relevant statistical expertise been assigned to review the manuscript? Are the reviews publicly available?”

  • This is an open label study, so there can’t really be strong evidence that the treatment itself (rather than e.g. placebo effects or expectancy effects) is the causal driver of improvement.
  • There is no information on independent reviewers with statistical expertise, and the reviews are not publicly available. (For the systematic review of psilocybin as a depression treatment I described above, which quickly received an expression of concern, The BMJ did publish the reviews. The statistical issues we see in that paper were not raised by the reviewers or the editorial board, raising questions about whether the paper was sufficiently reviewed.)
  • Unfortunately, statistical code to reproduce the analyses (which should always be shared) is not available, meaning we cannot verify exactly what analyses were carried out. Data are not available either, but that is common for clinical trials (with such a small sample size, patient anonymity is at stake when sharing data, so not sharing here is the right way to go).
  • To check for obvious problems, we briefly navigate to the clinical trial registry entry for this trial, which can be found in the paper itself: study NCT04433845. The study has different versions of registrations (‘Record History’) which we can easily compare. Unlike some other studies in this research area, there are no obvious red flags in terms of changing outcome measures or changing the duration of treatment after data collection. The changes across versions are marked in red/green, and you see no changes for the primary outcome measures, which is great. This primary outcome is also what the authors report in the publication (MADRS, 3 weeks). Check.
  • The authors did not register any secondary outcomes in the clinical trials registry, but they do report secondary outcomes in the paper (e.g. MADRS at week 12). This is unusual and quite problematic, especially given that the first version of this registration is already 4 years old. However, the authors do talk about exploratory goals (e.g. analyzing some extra questionnaires) in the ‘trial protocol’, which is (at present) inconsistent with the clinical trial registry.

Overall, this gets a yellow: not great, but also not absolutely horrible (see here for more serious cases).

🟡 2.2 Conflicts of interest

“Are potential COIs reported transparently in the paper? What is the nature of these COIs, and, in the presence of severe COIs, are there sufficient safeguards in place so that the findings can be considered trustworthy (e.g. preregistration of primary outcomes and statistical analyses)? Are all included measures fully disclosed and reported?”

The study has considerable conflicts of interest, which appear to be transparently reported. The authors appear to have followed their clinical trial registration properly when it comes to the primary outcome. It is hard to tell if the findings are trustworthy in the absence of data or analytic code. The secondary outcomes reported in the paper are not registered in the clinical trials registry.

Overall, yellow.

🔴 2.3 Safety & adverse events

“Is it easy to find all relevant information regarding adverse events in the study? Is there an independent arbiter to decide whether an adverse event is related to the treatment? Is the psychotherapy component of the study standardized and fully described? Were trained therapist used to carry out treatments?”

The authors write in their paper: “There were no significant adverse events related to the psilocybin dosing. The most common adverse event was headache in 4 of 15 patients on the day of dosing, with symptoms resolving within 24 hours”. I would like to see more information on this, given evidence that adverse events are systematically under-reported in this literature. But I could not find any, neither in the supplementary materials nor in the clinical trials registry. Note also that the registry, about half a year after publication, still does not contain information on adverse events. Nearly no information and a lack of transparency: red.

🟡 2.4 Control group

“Is a control group included to address common validity threats such as placebo effects, expectancy effects, and regression to the mean? If no control group is included, are interpretations sufficiently careful?”

There is no control group. Interpretations aren’t as careful as I would like them to be (because little can be learned from a study like this, in my view), but it’s not terrible either—the authors write in the abstract that “findings suggest efficacy and safety of psilocybin in bipolar II depression and support further study of psychedelics in this population”, which is fine overall.

Let’s say yellow (a green open label study would be one that carefully grapples with the many obvious problems of open label designs for learning something about efficacy of treatments, which this study does not).
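Regression to the mean, one of the validity threats named in the checklist item above, is easy to demonstrate with a small simulation. The sketch below is an illustration under assumptions I made up for this purpose (a stable underlying severity, two equally noisy measurements, enrollment only above a cutoff); none of the parameter values come from the trial, and `regression_to_mean_demo` is a hypothetical name.

```python
import random
from statistics import mean


def regression_to_mean_demo(n_screened=20000, cutoff=1.0, noise_sd=0.65, seed=1):
    """Simulate two noisy measurements of a stable trait, with NO treatment.

    Only people whose first measurement is at or above `cutoff` are "enrolled".
    Returns (mean baseline, mean follow-up) for the enrolled group.
    All parameter values are arbitrary illustrative assumptions.
    """
    rng = random.Random(seed)
    baseline, followup = [], []
    for _ in range(n_screened):
        severity = rng.gauss(0, 1)                   # stable underlying severity
        first = severity + rng.gauss(0, noise_sd)    # screening measurement
        second = severity + rng.gauss(0, noise_sd)   # later measurement, same severity
        if first >= cutoff:                          # enroll only high scorers
            baseline.append(first)
            followup.append(second)
    return mean(baseline), mean(followup)


base, follow = regression_to_mean_demo()
print(f"enrolled baseline mean: {base:.2f}, follow-up mean: {follow:.2f}")
```

The enrolled group scores lower at follow-up even though nobody was treated, simply because people were selected on a noisy high score. Without a control group that shares this selection, an open label study cannot separate this artifact from a treatment effect.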

🔴 2.5 Sample size

“Is a power or sensitivity analysis provided, and does it include a justification of the minimum effect size of interest? Is the study sufficiently powered to detect a difference between intervention and control group (rather than powered against no effect at all)?”

There is no power analysis and no control group. Generally, the sample size of 15 patients is too small for generalizability from the sample to the populations of interest; in the relevant part of my video on this topic, I compare clinical trials to polls (and you would not trust a poll about anything if it relies on 15 people) to explain what a sufficient sample size would look like, and why small samples are problematic for two reasons.
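The poll analogy can be put into numbers. The sketch below uses two textbook normal approximations: the margin of error of a poll for a given sample size, and the approximate per-group sample size needed to detect a standardized difference d between an intervention and a control group. The function names and the example value d=0.5 are my own illustrative assumptions, not figures from the paper, and the approximation is itself rough at very small n.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(n, p=0.5, confidence=0.95):
    """Half-width of a normal-approximation confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-group comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2.
    An exact t-test calculation gives slightly larger numbers.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)


for n in (15, 1000):
    print(f"n={n}: +/- {100 * margin_of_error(n):.1f} percentage points")
print("per-group n for d=0.5:", n_per_group(0.5))
```

Under these assumptions, a poll of 15 people has a margin of error of roughly 25 percentage points, compared with about 3 for a poll of 1000, and detecting a medium-sized difference (d=0.5) between two groups with 80% power needs on the order of 63 participants per group.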

🟡 2.6 Selection bias

“Does the studied sample differ from the population of interest? Is a statement about constraints on generalizability included? Is demographic (e.g. gender, age, socioeconomic background) and clinical (e.g. severity, comorbidities) information provided?”

Of 70 participants approached, only 19 met inclusion criteria, already showing how unrepresentative the selected sample is. Specifically, the authors excluded everybody with any comorbidities, introducing a selection problem. This is common and understandable, but it does mean that the results are not representative of folks with bipolar II depression.

Yellow because there are studies that exclude participants for many more reasons than just comorbidities; but in practice this is always red in a sample of 15 participants: mental health problems are so different across people that 15 people is not sufficient to learn about the population of interest (people with bipolar II depression).

🟡 2.7 Study duration

“Do scientists follow the patients for a sufficient time frame to justify the conclusion that successful treatment took place, that is, that people have returned to a normal level of symptom load, wellbeing, and functioning?”

A 3-week primary outcome is short but OK in this literature (8 weeks would be better), and the 12-week secondary outcome is not terrible. Yellow.

🔴 2.8 Breaking the blind

“Have efforts been made to minimize the risk of unblinding (e.g. by using active placebos)? Was masking efficacy (i.e. if blinding succeeded) assessed and reported?”

Always red in open label studies: people know their group assignment, which is a threat to internal validity. I explain this in detail in my video.

🔴 2.9 Placebo effects

“Have efforts been made to minimize the risk of unblinding (e.g. by using active placebos)? Was masking efficacy (i.e. if blinding succeeded) assessed and reported?”

Always red in open label studies: we cannot distinguish the response due to the treatment from the placebo response.

⚪ 2.10 Mechanisms of action

“Are inferences regarding potential mechanisms of action supported by evidence? Are the data and materials available in a repository for replication and secondary analyses?”

Data, code and materials are not available for secondary analyses. Again, it makes sense not to share data in cases of n=15 clients for reasons of anonymity, but code should always be made available to know exactly what analyses authors conducted. No information on mechanisms of action, which was also not the goal here, so we’ll record it as “does not apply”.

3. Conclusion

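That leaves us, for the Aaronson et al. 2023 paper, with no green lights, five yellow lights (valid inferences, conflicts of interest, control group, selection bias, study duration), four red lights (safety & adverse events, sample size, breaking the blind, placebo effects), and one category that does not apply (mechanisms of action).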

Note that vetting papers via a checklist should preferably not be done by one person, as I did in this blog post: I may well have missed an issue or misunderstood something, so take this with a grain of salt.

This issue notwithstanding, I do not feel that the paper necessarily needed to be published in a tier 1 journal, given the very limited inferences that can be drawn from the provided evidence. As we write in the commentary:

“We were surprised to see the journal publish a study with a design that does not allow the authors to answer their research questions about efficacy and safety. To understand whether psilocybin explains part of the change in Montgomery-Åsberg Depression Rating scale score observed from before to after treatment (Cohen d = 4.08), a control group is crucial. As is, the design can account neither for nonspecific (including placebo) effects nor regression to the mean. Lack of blinding increases expectancy effects in patients and can bias researchers, interviewers, and data analysts who know the desired results. Moreover, all participants received psychotherapy, and some additional pharmacotherapy, likely contributing to the changes observed from before to after treatment.”

One thought on “A checklist to vet psychedelic science”

  1. Renske Blom

    Dear Eiko,
    Also, we as clinicians/psychiatrists use different questionnaires to track the symptoms of a Bipolar II disorder.
    For example the life chart and other specific questionnaires https://www.kenniscentrumbipolairestoornissen.nl/voor-professionals/behandeling/meetinstrumenten/

    The MADRS only measures depressive symptoms. If a patient is hypomanic(..), this will not be tracked by the primary outcome of this study: the MADRS.

    All the best,
    Renske Blom (psychiatrist)

