What is a good theory, and what is a bad one? In this blog, I’ll introduce theories, models, phenomena, data, and how they relate to each other. I’ll explain what Paul Meehl, the hypothetico-deductive framework, and the open science reform have in common, and why proposed solutions to problems in our field have largely ignored the importance of constructing or revising theories. I’ll discuss issues with latent and weak theories, and conclude by describing two important features of formal theories, and how these can be constructed.
The blog is the result of working on these topics for four years, and covers a range of papers we’ve published (1, 2, 3, 4), but also a lot of other recent work. It covers the things I’ve learned in the last years from reading some foundational work, and features some new ideas I’ve had in the last months. I’ve put together an introductory reading list on theory formation, which contains my favourite papers, book chapters, and blogs on the topic, some of which I will introduce below. A special shoutout to the paper by Jonas Haslbeck and colleagues (2021), which was one of the most insightful pieces I have read in years.
1. Where we are coming from
Psychology is, and has been, a hyper-empirical discipline. Or, as Denny Borsboom put it: hyper-ultra-mega empirical. We are quite good at testing things, but we aren’t very good at theorizing1. As Daniel Nettle described this just a few days ago, there is no shame in that.
“You shouldn’t have to claim to have a fully identified theory (especially if you haven’t actually got one, which I think is true of much psychological research, including plenty that is well worth doing). Although the hypothetico-deductive model is an important part of science, there is a great deal of valuable science that precedes it. Biology had hundreds of years of taxonomy and natural history before it had formal phylogenetic models; geology had detailed maps of rocks before it accepted plate tectonics; people had been studying the motion of planets long before Newtonian mechanics, and so on.”
In our work, we’ve framed this “natural history” work as efforts to establish phenomena: robust, recurring features of the world that require explaining (i.e. explananda;2). Such explanations happen in the form of theories that explain them (i.e. explanantia3)—which is, in a nutshell, why we need theories.
We don’t have many great explanations in psychology. As Robert Cummins (2000) put it: “We are overwhelmed with things to explain, and somewhat underwhelmed by things to explain them with”. But psychologists want to be grown-up scientists, and somewhere, someone misunderstood grown-up science as theory-testing, i.e. confirmatory. And so we ended up in a situation where many researchers were conducting what really was exploratory work, based on intuition, and sometimes based on a bit of proto-theorizing, but they were writing up their papers as confirmatory.
One consequence of this pretending was the replication crisis. Psychologists really wanted to support their “theories” with data, but employed a bunch of questionable research practices to do so. A consequence of this replication crisis, in turn, was the open science reform movement. And in the last decade, we have become better at identifying and preventing questionable research practices such as p-hacking and HARKing (hypothesizing after results are known) through tools like preregistration, registered reports, and sharing of data and code. Victory!
But identified challenges and proposed solutions have focused on increasing the reliability and replicability of psychological findings by improving methodological and statistical practices. Camera zooms on Cummins’ face, who is shaking his head ever so slightly: That helps a lot with the explananda part of psychology, but not at all with the explanantia part. We still need to explain these findings.
Curiously, this is sort of a repetition of history. In Nettle’s quote above appears the hypothetico-deductive framework, i.e. the commonly embraced idea in psychology that we 1) derive hypotheses from theories, 2) deduce observational consequences from these hypotheses, and 3) test these consequences to verify the hypotheses (and thereby the theory) using the logic of confirmation. Psychologists have been great at 3), ok at 2), and bad at 1), given that proper theories are largely absent. Interestingly, our dedication to this testing-based framework has in part reinforced the lack of theory formation in psychology, because it provides little in the way of helping psychologists construct and update theories after they are exposed to empirical data. As Borsboom and colleagues put it in their new paper on theory construction: the framework “discourages the use of systematic methods in order to generate, develop, and fully appraise theories”. This focus on testing has contributed to a situation in which psychological theories “tend neither to be refuted nor corroborated, but instead merely fade away as people lose interest” (Meehl, 1978). This situation hasn’t really changed, nearly half a century after Meehl wrote this sentence.
In our recent paper led by Donald Robinaugh, we highlight the paradoxical role that Meehl played in the history of psychology. He was concerned about the role of null-hypothesis significance testing; given that variables in psychology tend to be correlated (what he called the crud factor), bigger datasets make it easier for you to reject the null-hypothesis and ‘confirm your theory’. Meehl wrote eloquently and viciously about weak theories in “soft psychology”, but, like the open science reform and the hypothetico-deductive framework, focused on testing of theories or effects. For instance, he championed what he called risky tests that put theories at grave risk of refutation. I tend to use the example of traffic to explain this.
“Suppose Theory 1 predicts traffic jams on a highway for each hour of the upcoming week, and Theory 2 predicts that there will be at least one traffic jam in the next week. Assuming both theories are false, Theory 1 is more likely to be rejected, and Theory 2 more likely to be corroborated, the more data we collect. Psychological theories are similar to Theory 2, predicting a significant correlation between two variables or a significant group difference, rather than predicting the magnitude of the effect.”
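Meehl’s point about weak predictions can be made concrete with a minimal simulation (the crud size of r = .10, the sample sizes, and all names below are my own illustrative choices, not Meehl’s): given nothing but a small crud correlation, the Theory-2-style prediction “there is a significant positive correlation” gets corroborated more and more often as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def corroboration_rate(n, crud=0.10, n_sims=1000):
    """How often does a study of size n 'corroborate' the weak prediction
    of a significant positive correlation, given only a crud correlation?"""
    hits = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n)
        y = crud * x + np.sqrt(1 - crud**2) * rng.standard_normal(n)
        r = np.corrcoef(x, y)[0, 1]
        # large-sample rule of thumb: r is significant at alpha = .05
        # (two-sided) when it exceeds 1.96 / sqrt(n)
        if r > 1.96 / np.sqrt(n):
            hits += 1
    return hits / n_sims

for n in (50, 500, 5000):
    print(n, corroboration_rate(n))
```

In runs of this sketch, the corroboration rate is low at n = 50 but approaches 1 at n = 5,000: the “test” is passed by crud alone, without the theory contributing anything.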
But Meehl, the open science reform, and the hypothetico-deductive framework failed to provide, and we continue to lack, a concrete and well-established set of tools for theory construction, as we discuss in detail in our Meehl paper. How to fix this?
2. Theories and models
I haven’t found a short overview piece on these topics, but for a bit of a longer intro, the book “Scientific Models in Philosophy of Science” by Daniela Bailer-Jones is a good starting point.
Theories are bodies of knowledge that are broad in scope and aim to explain phenomena. As I said before, phenomena are robust features of the world. Phenomena can be obtained by observation and common sense (e.g. humans have capacity for language), or by other means, such as fitting statistical models to data (e.g. there is a statistical relation between smoking at time 1 and lung cancer at time 2). Data may contain the phenomenon we are interested in, but also noise (measurement error, experimenter bias, transcription errors). That is why theories explain phenomena, not data; James Woodward has a great paper on this topic.
Good theories provide better abductive explanations about a phenomenon under investigation than competing theories. Abductive inference, or inference to the best explanation, can be understood as a process that aims to provide good (i.e., simple and plausible) explanations for phenomena, in the sense that if a theory were true, the phenomenon would look the way it looks in the world. We will return to this later.
Models, on the other hand, are instantiations of theories, narrower in scope and often more concrete, commonly applied to a particular aspect of a given theory, providing a more local description or understanding of a phenomenon. Evolution and gravitation are theories, but their application to mate selection and the motion of the planets is best done via models. From this perspective, models serve as intermediaries between theories and the real world.
In contrast to theories, models illustrate with precision the mechanisms that might govern the processes that lead to a phenomenon by decomposing processes into relevant parts, properties of these parts, relations between the parts, and temporal dynamics of their change. I really liked Paul Smaldino’s metaphor of geographical maps: models ignore much of reality to be useful, in the same way a map of Rome ignores much of reality to help us navigate Rome. The tricky part is to leave out the right kinds of things to enable models to help us with explanation or prediction.
Models come in different shapes. Psychologists immediately think of statistical models, but there are many other models, such as verbal or mathematical models (e.g., different descriptions of how neurons interact in the brain), diagrams (e.g., the hydrological cycle that moves water on Earth), or mechanical models (e.g., a physical model of the DNA double helix). I think of these kinds of models as phenomena models because they usually aim to represent robust phenomena in the world. The second type of model is the statistical model, which represents data, not phenomena; this is why other disciplines call such models data models. As Woodward taught us, this is an important distinction because data ≠ phenomenon.
Theories and models have in common that they help us understand, explain, predict, and control phenomena. There is a big debate about which of these is most important, which played out in the commentaries to my Psychological Inquiry paper (see here for a video lecture about the paper); I summarize these positions in some detail in my rejoinder to the commentaries (theories & models are for explaining vs they are for predicting vs they ought to be useful).
3. Relations between theory, model, data, and the world
What are theories and models about? They are about phenomena, which arise from target systems. These target systems consist of components and relations among these components, and give rise to the phenomena we observe4.
Robinaugh has been leading an interdisciplinary collaboration on developing a formal model for panic disorder, so let’s use this as an example. In this case, the phenomenon we may want to understand (explain, predict, control) is panic disorder. The target system therefore contains all the things in the world that are important for the emergence of panic disorder; given the complexity of mental health problems, we can think of this as a biopsychosocial target system with many components and relations among those. We could speculate that one of these components might be avoidance behavior, for instance: as we know from Clinical Psychology, avoiding fearful stimuli can lead to vicious cycles where the fear of the avoided stimuli grows, and avoidance grows, and fear grows, and so forth. On the side of the model that aims to represent the target system, we now include the variables we think are important for representing the target system well, and include all relations among components. This is why you can think of a model as an assumption about the world.
The figure below, adapted from Figure 6 of the paper by Haslbeck et al. (all credit to them for the figure), showcases this scenario:
The specific relations and components are not relevant here, and too complicated to explain in detail (see the preprint for details). Important for now is the idea that a model has a structure, and that this structure aims to represent the target system, which in turn consists of components and relations.
3.1 The power of representation
Often, building a mathematical model is really, really difficult. This is probably why there was no formal model of panic disorder before Robinaugh started this challenging, interdisciplinary initiative in 2016. In such cases, we can use other kinds of models to represent a target system.
In a recent talk, Smaldino used the example of the U.S. Army Corps of Engineers Bay Model: in 1950, they tried to understand (predict, control) how to best dam the San Francisco bay area. To do so, they built a model of the bay area. Like, an actual model, 1:1000 horizontal scale, 1:100 vertical scale, 1:100 scale in time (regarding the tides), with around 15 people to operate the model.
Here’s the beauty: as representations of the target system, theories and models allow us to engage in surrogative reasoning, using the theory to make predictions about the target system. This can also be called an inference ticket: something that lets you learn about your target system without having to intervene in the real world. We do this all the time in simulation studies, i.e. learning about situations that are not actually realized. For example, we know quite a bit about bridges and earthquakes, and can test what would happen to a specific bridge under a specific earthquake scenario in theory (e.g., via a computational model), allowing the construction of better bridges. This is because we understand theoretical concepts such as gravitation. You can immediately see the powerful implications this would have for e.g. treating panic disorder.
Borsboom has a blog post in which he speculates how psychologists would go about building such bridges, following standard methodological procedures. It would be a long and arduous exercise in fitting regressions, removing outliers, and predicting variance, without gaining much understanding of what is actually happening.
For surrogative reasoning to work properly, however, it really helps when there is a close mapping between the structure of the theory or model on the one hand, and the represented target system on the other. In the case of the U.S. Army Corps of Engineers Bay Model, this mapping was close enough: the model showed that damming the bay would have had disastrous consequences, and the plan was abandoned.
Because models represent, you can also think of experiments as models. Smaldino used the example of the famous Marshmallow experiment in a recent talk: we aren’t really interested in learning anything about marshmallows, or in how long children wait to eat them. We are interested in using surrogative reasoning to learn something about the underlying process (i.e. the target system) that we believe is represented well by the experiment.
3.2 Good theories and models need not be true
Here is a good moment to acknowledge that some theories and models have historically worked out really well although they turned out to be false. This is a bit of a thorny issue, and I’ll only tackle it briefly. I think of good models as useful but imperfect abstractions which are all false in the sense that they are incomplete. Think about the map of Rome again: different maps of Rome are different instantiations of the same underlying target system (Rome), which can be differentially useful in different contexts: e.g. finding the closest café for a good espresso vs organizing a 2-hour community walk with the elderly that minimizes topographical obstacles. Both are “false” in terms of adequately representing Rome, but they represent the required target system appropriately for the given purpose. In that sense, different models for the same purpose may differ from each other in the degree to which they help us explain, predict, and control phenomena. And if you enable all map overlays on Google Maps and try to navigate Rome for any purpose, you’ll likely be unable to do anything, because there is so much useless information on your screen.
Rutherford’s model of the atom — electrons orbit a very small and heavy nucleus — was false: simulating data from it shows that the known universe would collapse in a split second. But it got many things right, such as separating electrons from a dense core, and was instrumental in bringing about crucial changes to atomic models in particle physics with a higher degree of verisimilitude (i.e., truthlikeness). These newer theories and models do a better job at explanation, prediction, and control.
For that reason, some physicists go as far as to consider the entire discipline of physics a model. For a more detailed debate on the utility vs truth of models, see section “Theories Should be Useful” here. My summary:
“I believe that we can learn most about our target systems when our theories and models represent them well, that is, with some level of abstraction. When we understand models as intermediaries between theories and the real world, the idea that all models are wrong because they are incomplete is a feature, not a bug; a model’s inability to perfectly reflect reality does not stand in the way of providing actionable insights, as shown by the Rutherford model or Newton’s theory of gravitation.”
4. Bringing data to bear on theories
In psychology, researchers often try to answer questions that they have, usually based on intuitions or proto-theories. To do so, they link theories to data by using statistical models that impose assumptions on the data. This is really important to realize: choosing a specific model in R or SPSS will impose assumptions on your data. For example, if you choose linear regression, you impose the assumption that two variables are related in a linear fashion. More complicated models, such as the reflective latent variable model, impose many more assumptions on the data.
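As a toy illustration of what “imposing assumptions” means (a hypothetical example of mine, not from the references above): if the true relation between two variables is quadratic, a linear regression, which imposes a straight-line assumption, can report a slope near zero even though one variable almost fully determines the other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical non-linear target: y depends on x quadratically
x = rng.uniform(-3, 3, 1000)
y = x**2 + rng.standard_normal(1000)

# A linear regression imposes the assumption of a straight-line relation
slope, intercept = np.polyfit(x, y, 1)

# Under the (correct) quadratic assumption, x**2 predicts y very well
r2_quadratic = np.corrcoef(x**2, y)[0, 1] ** 2

print(round(slope, 2))         # near zero: the linear model misses the relation
print(round(r2_quadratic, 2))  # high: the relation is strong, just not linear
```

The data contain a strong, robust relation; it is the assumption built into the statistical model that decides whether we see it.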
Again, data is phenomenon plus noise, and a statistical model can serve to link the theory to the real world which is – so we hope – reflected in the data. We’re not in the (fully) exploratory world now; we are in the world where psychologists have ideas they want to test. And in this world, choosing an appropriate statistical model that imposes assumptions on the data consistent with the theory is instrumental in bringing data to bear on the theory. The only full sentence in italics in this blog post, and for a reason: I really didn’t grasp this properly until quite recently, although it seems so obvious in hindsight. If the mapping between theory and statistical model is unclear, it is in turn very difficult to know what the results of fitting the statistical model to data can teach us about the theory. I’ll call this the inference gap. And now you will hopefully forgive your stats teachers who really wanted you to understand the assumptions statistical models impose on data.
There are many different ways in which it can remain unclear how you can bring your data to bear on your theory. If I were teaching a class on this, I’d ask students to come up with lists at this point. There are many possibilities, e.g. you may not describe the statistical model you are using in sufficient detail (so we cannot understand what assumptions it imposes on the data). Or you may use a linear model when your theory is non-linear (Figure 4), making it hard for us to understand how we can bring your results to bear on your theory. A cheat some folks employ to avoid the inference gap is to pretend the statistical model is the theoretical model, but that doesn’t usually work out well.
4.1 Statistical equivalence
One of the most obvious issues here is when statistical models are interpreted as theoretical models. I list some examples in my Psychological Inquiry paper on theory building and testing, such as the inferential shortcut from “I estimate a latent variable model” to “I identified a psychological construct”. But a typical structural equation model, such as a factor model fit to cross-sectional data, can tell us little about the data-generating mechanism (i.e. the target system), no matter how good the fit; a well-fitting factor model cannot be taken as evidence, let alone proof, that a psychological construct exists. This issue quickly gets technical, so I will give you a simple example (there are many more) of why this is the case: statistical equivalence.
Suppose we are interested in the robust phenomenon that intelligence tests are positively correlated, and have cross-sectional data of test scores. We can now fit a factor model to such data (right side), and get excellent fit for a one-factor solution. This has led some researchers to conclude that the “g-factor” exists as a psychological construct. But I can also fit a network model to the data, with identical fit, that does not feature a latent variable at all (left side). Does that mean that we can firmly conclude that the correlation of intelligence tests does not come from one common cause, but from the fact that mathematical, verbal, geometrical and other forms of intelligence cause each other over time, as proposed by the mutualism theory? Absolutely not, due to (among many other issues) statistical equivalence. To get the graphs below, I simulated 10,000 observations from the causal system on the left side, and 10,000 observations from the factor model on the right side. For both datasets, fitting the alternative statistical model to the data leads to models with equivalent fit.
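A stripped-down version of this idea can be sketched in code (this is my own toy stand-in, not the simulation behind the graphs): data generated from a one-factor common-cause system and data generated from a mutualism-style dynamic system both produce a positive manifold of test correlations, so the correlational pattern alone cannot tell the two target systems apart. All parameter values below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2)
n_people, n_tests = 2000, 4

# (a) Common cause: a single latent factor g drives all test scores
g = rng.standard_normal(n_people)
loadings = np.array([0.8, 0.7, 0.6, 0.75])
factor_data = g[:, None] * loadings + 0.5 * rng.standard_normal((n_people, n_tests))

# (b) Mutualism: abilities start out independent, then boost each other over time
abilities = rng.standard_normal((n_people, n_tests))
coupling = 0.15  # each ability grows a bit with the mean of the others
for _ in range(10):
    others_mean = (abilities.sum(axis=1, keepdims=True) - abilities) / (n_tests - 1)
    abilities = abilities + coupling * others_mean

# Both mechanisms yield all-positive test correlations (a positive manifold)
for data in (factor_data, abilities):
    corr = np.corrcoef(data, rowvar=False)
    print((corr[~np.eye(n_tests, dtype=bool)] > 0).all())
```

Real analyses would fit the competing statistical models and compare their fit; the point here is only that two very different target systems can imply the same correlational signature.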
This is an obvious issue: mistaking the statistical model for the theoretical model. But there is a more general point: although theories often remain somewhat unclear, the results of statistical models are nonetheless taken to corroborate them.
4.2 The inference gap
The first problem-case I want to talk about is the most benign scenario, where the mapping between the (1) theory we try to test and the (2) statistical model remains unclear, i.e. there is an inference gap. This, in turn, leaves open the question of how results obtained in the data can inform us about our theory: they lend themselves to neither refuting nor corroborating the theory, because the mapping between statistical model and theory is unclear.
The second problem-case features what I call a latent theory, i.e. a theory that isn’t properly explicated, but is clearly there. You see this regularly in the literature, both in published papers and common practices. This is often followed by the conclusion that a “theory” is supported by the results, when it is unclear to the reader why this would be the case.
4.3 Weak theories
Finally, we have theories that are spelled out with so much imprecision that they lack any description of auxiliary assumptions, and can easily be adjusted post-hoc by changing assumptions that had never been spelled out.
Samuel Gershman’s piece “How to never be wrong” (2019) nicely describes this approach as a slippery slope toward unfalsifiability. For example, data produced by the famous Michelson–Morley experiment in 1887 was inconsistent with the prevailing theory that light is propagated through the ether, which led FitzGerald and Lorentz to adapt properties of the ether post-hoc in a way that exactly fit the new data. This is little different than claims in psychology that unspecified “hidden moderators” explain non-replications of original findings.
I call such theories weak theories, and they really shouldn’t be called theories in the first place; intuitions or proto-theories are likely better terms. In psychology, such theories are usually narrative and imprecise descriptions vulnerable to hidden assumptions and unknowns. They provide neither precise explanations nor predictions, and it can be difficult to find out whether data support weak theories or not. As Meehl showed us, weak theories plus null-hypothesis significance testing are a lovely combination for obtaining statistically significant results that reject a null hypothesis, but what the alternative hypothesis is – what is actually supported by the data – often remains unclear. It remains equally unclear whether the results actually corroborate the theory or not.
Meehl (1990) used a nice metaphor to describe the value of theories and models: they get “money in the bank […] by predicting facts that, absent the theory, would be antecedently improbable”. Moving away from prediction to also incorporate explanation, let’s update this and say that they get money by explaining facts that, absent theory, would be hard to understand. For instance, Mendeleev used the periodic table in 1871 to correctly predict the existence and specific properties of three unknown elements that were discovered within the next 15 years: gallium, scandium, and germanium. Weak theories will stay poor theories in both aspects.
5. Formal models
There isn’t always a clear separation between theories and models, and there is value in both formal theories and formal models. So from here on I’ll be talking about formal models, but feel free to extend this to formal theories.
Before we start: there is nothing inherently wrong with intuitions, proto-theories, and exploratory research, as Daniel Nettle described eloquently in his blog post. It’s just important that we don’t pretend we do anything else than proto-theorizing when we are proto-theorizing. Now that we got this out of the way, onwards to verbal ambiguity and mental simulations.
To escape vagaries of language and imprecise descriptions, we started in 2016 with the effort to construct a formal model of panic disorder, led by Robinaugh. This project helped me realize personally how problematic verbal ambiguities are: I had never thought there might be so many different ways two variables can relate to each other until we needed to write their relation out as an equation, and that small changes to parameterization can dramatically impact what data the theory predicts. I really like the idea of formalizing models (even if they are just based on intuitions or proto-theories), because it will force you to think about aspects of the model that you had not previously considered. Others have referred to this as introducing important choice points in model development, or as properly scrutinizing the systems we are talking about. A common counter-argument here is that ideas can be expressed precisely in a narrative way, and that the problem of verbal ambiguity can be prevented this way. While this isn’t usually done, I am happy to agree that this is, in principle, possible.
But that leaves us with a possibly even bigger issue, and it took me a while to get my head around this. Haslbeck and Robinaugh explain this well in their respective works: without a formal model, it is often unclear what data you would expect, given your theory. A formal or computational model allows us to simulate data under a theory, and enables the comparison of such theory-implied data with actual data; verbal theories lack this feature.
We recently demonstrated this using the example of a very simple system: a vicious cycle between two variables A & B, with an additional negative feedback loop on A to make sure A slowly decreases over time. Not only do small changes in the way the causal effects between A and B are implemented in the computational model (e.g., as linear or sigmoidal) dramatically change the theory-implied data — the data resulting from the model is unknowable without simulating data. And how can psychologists corroborate or update theories when they do not even know what sort of data their theory would produce? Formal models can be exceptionally helpful here and provide tools for thinking clearly, evaluating explanations, informing theory development, informing measurement, and facilitating collaboration and integration; see this paper for more information, from which the Figure is also taken:
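Here is a minimal sketch of such a two-variable system (the equations, parameter values, and function names are my own illustrative choices, not those of our paper): A and B amplify each other via a coupling function, negative feedback terms pull them back down, and merely swapping the coupling from linear to sigmoidal changes the theory-implied data qualitatively.

```python
import numpy as np

def simulate(coupling, steps=200, dt=0.1):
    """Vicious cycle: A and B amplify each other via `coupling`,
    while negative feedback terms pull each variable back down."""
    A, B = 0.1, 0.1
    traj = []
    for _ in range(steps):
        dA = coupling(B) - 0.5 * A  # negative feedback on A
        dB = coupling(A) - 0.2 * B  # weaker negative feedback on B
        A, B = A + dt * dA, B + dt * dB
        traj.append(A)
    return np.array(traj)

# Same verbal theory, two implementations of the causal effect:
linear_traj = simulate(lambda x: 0.4 * x)
sigmoid_traj = simulate(lambda x: 1 / (1 + np.exp(-4 * (x - 0.5))))

print(round(linear_traj[-1], 2), round(sigmoid_traj[-1], 2))
```

In this sketch, the linear coupling leaves A near its small starting value, while the sigmoidal coupling tips the system into a much higher state; without simulating, it would be hard to anticipate that this implementation detail decides between the two outcomes.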
And I want to highlight that verbal theories by Clark and others were necessary for us to formalize this; the formal model should not be seen as an alternative to, but as a continuation of, prior (non-formal) work.
6. Moving forward
As I wrote in my Psychological Inquiry piece:
“While the last decade was focused on improving our statistical practices, the next decade of psychological science should be one of improving our theoretical practices. I hope I have demonstrated that this will be necessary, but it is only fair to point out that this will be difficult as well: psychology is concerned with complex phenomena that are notoriously difficult to measure and understand.”
And while some areas in psychology are quite good at theory building and testing, such as cognitive and mathematical psychology, many others are very far away from that. The first time I encountered the term theoretical psychology was in 2014, in a talk that Borsboom gave at the International Convention of Psychological Science. He pointed out that, in contrast to theoretical biology, theoretical physics, or theoretical economics, there is no dedicated field in psychology concerned with theory formation. Looking back at my own studies, I learned how to estimate fancy models in several statistical programs, but the topics of theory construction, mapping between theoretical and statistical models, and challenges of inference (induction, deduction, or abduction) did not come up once in my curriculum. In fact, I had never heard the term “formal model” before my post-doctoral fellowship. This is, in part, why I put the reading list on theory formation together: maybe others are in a similar position where they want to learn more, but didn’t get any proper training.
A very concrete step is to read recent work on theory construction, e.g. 1, 2, 3, 4, and 5. Don and I taught a workshop on formalizing theories in 2019 at SIPS (all materials openly available), and are planning to teach an updated version of this workshop this year at the SPSP Summer Forum in Psychology (July 8–10, 2021).
Second, we should consider removing some educational content to make space for proper training in philosophy of science, theoretical psychology, and statistical modeling6. Third, we should offer psychologists the possibility to become experts specifically in theoretical psychology, along with relevant mathematical and philosophical training. Theoretical biologists, theoretical physicists, and theoretical economists are usually not also experts in measurement and experimental research. And finally, science is a team sport, and psychologists need not have all the expertise required to conduct strong interdisciplinary work. Work together; reach out to colleagues from other departments; go to interdisciplinary conferences. Speaking a common language across disciplines can help facilitate interdisciplinary research; formal models provide one such language.
Disclaimer: For some parts of this blog post, I adapted some sections from my recent papers referenced here. You can think of this as transparent self-plagiarizing: the adapted sections aren’t cited in “…” because I adapted them to the blog. The goal was to mix and intertwine appropriate sections of different papers to tell one coherent story, rather than regurgitating the papers separately. If you want to cite the content, please cite the relevant paper, not the blog post.
- Yes yes yes, there is proper theorizing in psychology, and there are toolkits. I list them further down the blog post. I’m talking about run-of-the-mill psychology, not cognitive or mathematical psychology that have done quite well on the theorizing front
- singular “explanandum”, in case others get confused about this as much as I do
- singular “explanans”
- This is quite closely related to the idea of homeostatic property clusters — the idea that certain properties tend to go together in nature due to causal processes. For example, beavers (or biological species generally) are homeostatic property clusters: clusters of properties (size, number of limbs, amount of hair, degree of cuteness, eats wood, builds dams) that go together due to really complicated causal processes. Depression is probably also such a property cluster, but that’s for another blog post. I have written about this in more detail here.
- And I actually bought a hard-copy, as a thank-you for the inspiration.
- “But Eiko, what should we stop teaching?” — We could start with stopping to teach psych myths that have long been debunked, but are still taught as part of common curricula.