A few days ago, Richard Morey started a discussion on Twitter arguing that small samples are not inherently problematic. In the interesting discussion that ensued, I kept thinking about clinical psychology and clinical trials, where I believe that small samples are problematic. To explain my position, let’s look at situations where small samples are fine, sampling variability, the accuracy of parameter estimates, and the way many clinical studies are interpreted. I hope this clarifies why — given the way science is currently interpreted — small samples can lead to a lot of trouble.
Small samples are fine because …
Let’s start by presenting the statement that small samples are not inherently problematic in the best possible light, in a way that Richard and others would hopefully agree with [1]. The main point is that sample size alone does not make a study good or bad, for several reasons.
First, the problem usually lies in the inferences drawn, and that is an issue of interpretation, not sample size. There is nothing wrong with n=20 if researchers report margins of error.
Second, if there is no publication bias, and all studies in a field are based on small samples, 10 studies with n=20 will, taken together, tell us as much as 1 study with n=200. This also highlights how important publication bias becomes when typical samples are small.
Third, sample size does not equal power: there are within-subject designs with many hundreds of repeated measures per participant (e.g., functional neuroimaging) that offer tremendous power. We should not automatically assume every n=20 study is “bad”; there are more important details to look at. I don’t conduct such experimental studies myself, but I would probably also be upset if reviewers shot down my paper “because you only have 20 participants”.
Fourth, the 20 people I draw from a population are part of that population; there is nothing wrong with them. They are just as much a part of the population as any other subset.
And finally, a point that wasn’t raised in the recent discussion but that regularly comes up in discussions with clinical researchers: it can be very difficult to collect data in clinical populations. n=20 might not be great, but it’s all we sometimes get.
Small n is problematic in case of …
Small n is fine if you don’t do anything with the results. And in the long run, aggregating enough studies, these n=20 studies will contribute to the literature. But there is sampling variability, and there is the accuracy of parameter estimates. I would argue that below a certain threshold, when uncertainty is too large for results to be meaningful, we should not fund the research in the first place. In that case, studies in small samples are problematic.
Sampling variability means that when we draw a sample from the population, there will be variability in the parameter estimates depending on the number of people we draw. The more people we draw, the more closely our sample estimates will tend to match the population values (given a number of assumptions). And the fact that we only study a sample drawn from the population, rather than the whole population, is the main reason we use statistics in the first place. If you observe the whole population and find that men are taller than women, it does not matter what the height difference is for it to be “significant”. This is why it is a bit funny to see papers that observe the whole population (e.g., 6 million Danes) and still report confidence intervals: we know the true parameter value, so there is no uncertainty.
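To make this concrete, here is a minimal Python sketch (not part of the original argument, with made-up population values) showing how sample estimates of a mean tend to land closer to the population value as n grows:

```python
import random

random.seed(1)

# Hypothetical population: heights with true mean ~175 cm, SD ~10 cm
population = [random.gauss(175, 10) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Larger samples tend to produce estimates closer to the population mean
for n in (20, 200, 2000):
    sample = random.sample(population, n)
    estimate = sum(sample) / n
    print(f"n={n:5d}  estimate={estimate:6.1f}  error={abs(estimate - true_mean):.2f}")
```

With n=20 the estimate can easily miss the true mean by several centimeters; with n=2000 it rarely does. That gap is sampling variability at work.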
As an example, let’s look at the current polls for the upcoming German elections, where things look pretty interesting: while CDU/CSU (Christian Democrats) is leading strongly in front of SPD (Social Democrats), the remaining 4 parties are very close together. And 1% here can matter a lot, because it may result in a coalition that reaches the magical 51% of votes required, and also because parties need to reach the 5% “hurdle” to get into parliament (4.9% means that a party does not even get a single seat).
Let’s start with the true model.
Now we draw n=50, and very much underestimate the votes for CDU/CSU compared to the true model. In fact, we could have a leftish government with SPD + Die Linke!
The second time we draw 50 people, our predictions would kick out Die Linke because it doesn’t reach the 5% hurdle.
And in the last poll, we would wrongly kick out both AfD and Die Linke.
Sampling variability matters, and it affects the precision of parameter estimates. If someone wanted money from me to run an n=50 poll for the German election, I would conclude that the money is wasted: there is no point, because the estimates are not sufficiently informative [3]. A small sample is inherently undesirable if we want to interpret the result of a given study, and not just file the results away so that they can be used in 10 years for a meta-analysis. Arguably, interpreting results is usually the point of conducting research.
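The polling example above can be simulated in a few lines of Python. Note that the “true” vote shares below are made-up placeholders for illustration, not the actual polling numbers from the post:

```python
import random

random.seed(7)

# Hypothetical "true" vote shares (placeholders, not real polling data)
true_shares = {
    "CDU/CSU": 0.37, "SPD": 0.23, "Die Linke": 0.09,
    "Grüne": 0.08, "FDP": 0.07, "AfD": 0.09, "Other": 0.07,
}

def poll(n):
    """Draw n voters from the true distribution and return estimated shares."""
    parties = list(true_shares)
    weights = list(true_shares.values())
    draws = random.choices(parties, weights=weights, k=n)
    return {p: draws.count(p) / n for p in parties}

# With n=50, parties near the 5% hurdle can easily land above or below it
for i in range(3):
    estimate = poll(50)
    below = sorted(p for p in estimate if estimate[p] < 0.05 and p != "Other")
    print(f"poll {i + 1}: below 5% hurdle: {below or 'none'}")
```

Running a few such polls shows how parties that truly sit at 7–9% can appear to miss the 5% hurdle purely by chance, which is exactly what happened in the simulated polls above.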
Constraints on Generality
After starting to write this blog post, I saw that Daniel Simons, Yuichi Shoda, and Stephen Lindsay published a paper in Perspectives on Psychological Science entitled “Constraints on Generality (COG): A Proposed Addition to All Empirical Papers”. The abstract states:
“Psychological scientists draw inferences about populations based on samples—of people, situations, and stimuli—from those populations. Yet, few papers identify their target populations, and even fewer justify how or why the tested samples are representative of broader populations. A cumulative science depends on accurately characterizing the generality of findings, but current publishing standards do not require authors to constrain their inferences, leaving readers to assume the broadest possible generalizations. We propose that the discussion section of all primary research articles specify Constraints on Generality (i.e., a “COG” statement) that identify and justify target populations for the reported findings. Explicitly defining the target populations will help other researchers to sample from the same populations when conducting a direct replication, and it could encourage follow-up studies that test the boundary conditions of the original finding. Universal adoption of COG statements would change publishing incentives to favor a more cumulative science.”
This seems very much related to the point we’re discussing here: a COG statement for a clinical trial with 20 patients would likely read very differently compared to a trial with 2000 patients. Sample size matters greatly for these topics (as do many other points, obviously).
Before we move on, let’s get some assumptions out of the way. In the discussion on Twitter, the example of a group difference of 10 standard deviations was mentioned. When the signal is that strong, small samples can yield incredibly strong effects and considerable certainty about parameter estimates (e.g., 10 men have XY, 10 women have XX). But this is not a plausible scenario for psychological data, and this post refers to the situations we usually face: small to moderate effect sizes with considerable uncertainty in parameter estimates at n=20.
A second point was that small and large studies differ in aspects other than sample size (i.e., what good is a large dataset if there are many other problems with it?). On the one hand, we should hold all other aspects equal when talking about sample size. On the other, this is an important point to discuss: if there has to be a tradeoff between quality and quantity of data collection, where should we strike the balance? I would definitely agree with Richard and others here that we shouldn’t sacrifice everything just to enrol yet another 20 participants.
A third point was that it might be cheaper to conduct 10 n=20 studies than 1 n=200 study. I don’t think this holds for the large majority of studies.
What is informative?
Let’s get back to the elephant in the room: how informative does a study have to be to be worth conducting and funding? I would argue that informativeness is a dimension between 0 and 1, and that the threshold for “minimal informativeness” will always remain somewhat arbitrary (and I can only lose by proposing a specific threshold here). In the long run, one could argue, all scientific studies add information in some way, if only to show that there is pronounced publication bias.
But studies come at a cost. In my field, there is considerable burden for participants of clinical trials (participation alone). There are potential side-effects of novel drugs, and participating in a clinical trial also means that you might not get the optimal, well-established therapy for your disorder (because you receive either the experimental drug or a placebo). Now, obviously, we need clinical trials. But I would argue that beneath a certain sample size, such studies are simply not informative enough to warrant the burden for patients.
Let’s look at three examples from my field, all published in top-tier, high-impact journals between 2015 and 2017.
- Authors enrolled 29 people split into 2 groups (depressed patients vs. controls) and tested whether warming them up (hyperthermia) was an efficacious depression treatment. The blinding failed (93.8% in the experimental group believed they were being warmed up, vs. 71.4% in the control group), and only one of four dependent variables showed a significant group difference. The authors report this one questionnaire in the manuscript and bury the three questionnaires that show no group differences in the supplementary materials. Conclusion: “Whole-body hyperthermia holds promise as a safe, rapid-acting, antidepressant modality with a prolonged therapeutic benefit” (details).
- A second study gave n=14 depressed patients ketamine; there was no placebo group. 7 of 14 patients had improved 240 minutes (!) after treatment, and 2 of 14 patients 3 months later. The paper title is “Rapid and Sustained Reductions in Current Suicidal Ideation”, and the abstract states: “repeated doses of open-label ketamine rapidly and robustly decreased suicidal ideation in pharmacologically treated outpatients with treatment-resistant depression with stable suicidal thoughts” (details).
- Third, a team enrolled 24 patients, treating 18 of them with a novel antidepressant and 6 with placebo. They used 4 outcome measures to assess treatment efficacy, and patients did better than placebo on 2 of these. The team also tested the biological pathway through which the drug was supposed to work (the whole reason the drug was developed, following up on animal trials), and found that this pathway remained unaffected by the drug. The authors conclude that the novel drug “shows potential as a treatment” (details).
Clearly, there are many problems here besides sample size [4]. But even if these studies had been carried out perfectly, an n=20 trial would remain uninformative, because we would have no idea to what degree these 20 patients are representative of the population of patients we are trying to make inferences about. And even if we found that a treatment outperforms placebo in n=20, we could not trust the result to replicate in another sample of 20 participants.
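A back-of-the-envelope calculation illustrates why (the response rates here are made up for illustration): the 95% confidence interval around a response rate observed in n=20 is so wide that nearly any replication outcome is compatible with it.

```python
import math

def wald_ci(successes, n, z=1.96):
    """Approximate 95% Wald confidence interval for a proportion."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)
    return max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical trial: 12 of 20 patients respond (60%)
lo, hi = wald_ci(12, 20)
print(f"n=20:  60% response, 95% CI [{lo:.2f}, {hi:.2f}]")   # roughly [0.39, 0.81]

# The same observed rate with n=200 gives a far narrower interval
lo, hi = wald_ci(120, 200)
print(f"n=200: 60% response, 95% CI [{lo:.2f}, {hi:.2f}]")   # roughly [0.53, 0.67]
```

With n=20, the data are consistent with anything from a modest to an overwhelming response rate; with n=200, the estimate starts to pin down something interpretable. (The Wald interval is a crude approximation for small n; it is used here only to make the width difference visible.)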
The above studies were carried out to answer specific scientific questions. The question was not “is drug 1 more helpful for Susan, Mark, and Andy than placebo is for Bob, Claire, and Florence”. The question was: “is drug 1 efficacious above placebo response in treating depression”. And this is a highly relevant question to answer: patients with treatment-resistant depression have very high suicide rates, and finding better treatments is not a merely academic exercise; it literally impacts the lives of many people suffering from severe mental health problems. For that reason, it would be interesting to have a longer discussion about the importance of single studies across different fields of psychology (e.g., social vs. clinical), and to find out what expectations we have, and should have, for single studies. Certainly, no single study can answer a deep scientific question (shoulders of giants, and all that), but there are good research practices, and these usually boil down to producing results that are more trustworthy and more likely to replicate. For any single study, sample size matters.
And maybe in other fields of psychology, debates are more … academic in nature? I honestly don’t know too much about social psychology, but where I come from, there was a recent large study arguing that there are significant group differences in brain morphology between depressed and normal people (based on very little evidence, as we show in this letter). This resulted in the development of a drug that aims to increase certain parts of the brain of depressed patients (I wish I was kidding), and this drug has now been given to depressed patients. Single studies in this field have a huge impact, because single studies are interpreted as strong evidence.
Given how clinical science works today, with its context, patient burden, and inferences, small samples in this field are problematic. And I believe this does not conflict with Richard’s position, which was that there is nothing inherently wrong with small samples per se. It really boils down to context, inference, and what purpose a given study has. Given that no studies are carried out without context, inferences, or purpose, discussions that do not take these factors into account may not be very relevant. Which leads back to the point that we ought to talk about how these factors play different roles in different fields of psychology.
- [1] Following Rapoport’s Rules that Rogier Kievit pointed out to me years ago.
- [2] Actually 9 times … questionable research practices and fishing! But it was to demonstrate a point, and you can find my reproducible code here.
- [3] A term we will return to later on.
- [4] I review more questionable clinical trial papers here.