
Dagatha Christie and the Conundrum of Causality - Redux (7)

We find Dagatha sitting back in her armchair, taking stock of the whirlwind tour of causality that we’ve been on over this series of blog posts…

First, we gently explored what exactly causality is and how to define it - that is, in terms of interventions or counterfactuals (e.g., “what would happen if we did A instead of B?”). We then introduced Directed Acyclic Graphs - DAGs - as a way of representing causality; these are a powerful tool for making our causal assumptions explicit and informing downstream analyses (e.g., identifying confounders, mediators or colliders, or encoding assumptions about missing data or measurement error).
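
(As a purely illustrative aside, not from the original posts: a DAG can be written down in a few lines of code, and simple structural checks used to flag potential confounders, mediators and colliders. The sketch below uses Python and networkx with a made-up DAG; dedicated tools such as dagitty do this job properly, via d-separation and adjustment-set algorithms.)

```python
import networkx as nx

# A toy DAG: C causes exposure X and outcome Y; X causes mediator M,
# which causes Y; X and Y also both cause S (e.g., selection into the study)
dag = nx.DiGraph([("C", "X"), ("C", "Y"),
                  ("X", "M"), ("M", "Y"),
                  ("X", "Y"),
                  ("X", "S"), ("Y", "S")])

exposure, outcome = "X", "Y"

# Very simple (direct-edge) rules for classifying the remaining variables
for v in sorted(set(dag.nodes) - {exposure, outcome}):
    if dag.has_edge(v, exposure) and dag.has_edge(v, outcome):
        print(f"{v}: causes both exposure and outcome -> potential confounder")
    elif dag.has_edge(exposure, v) and dag.has_edge(v, outcome):
        print(f"{v}: on the path from exposure to outcome -> potential mediator")
    elif dag.has_edge(exposure, v) and dag.has_edge(outcome, v):
        print(f"{v}: caused by both exposure and outcome -> potential collider")
```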

Second, we took a minor detour to explore how we can create worlds by simulating data. This is a super-useful tool for understanding the consequences of different causal assumptions. For instance, we can model different causal scenarios (based on DAGs) and explore how different analysis choices impact our conclusions (failing to adjust for a confounder, adjusting for a mediator, conditioning on a collider, the impact of missing data, different types of measurement error, etc.).
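
To make this concrete, here is a minimal hypothetical example of the basic recipe in Python (the variable names and effect sizes are invented): because we write the data-generating mechanism ourselves, the true causal effect is known exactly, and we can check whether a given analysis recovers it.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2023)
n = 10_000

# "Create a world": we choose the data-generating mechanism, so the
# true causal effect of X on Y (here 0.5) is known exactly
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Does a simple linear regression recover the effect we built in?
est = sm.OLS(y, sm.add_constant(x)).fit().params[1]
print(f"True effect = 0.5; estimated effect = {est:.2f}")
```

The sketches in the following sections all follow this same pattern: simulate from a known DAG, analyse the data in different ways, and compare the estimates against the truth we built in.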

Third, we discussed confounding. Confounding is a causal phenomenon, with a confounder defined as a variable that causes both the exposure and the outcome - these can easily be diagnosed via DAGs as variables with arrows pointing to both the exposure and the outcome. To remove bias due to confounding, we simply need to adjust for all potential confounders (although this is easier said than done, as sometimes it’s not clear whether variables are confounders or mediators, plus there is the ever-present spectre of unmeasured confounding).
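
A quick illustrative simulation of this, in the same style as above (made-up effect sizes), shows the unadjusted estimate being biased by the confounder and the adjusted estimate recovering the truth:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000

# Confounder C causes both exposure X and outcome Y; true X->Y effect is 0.3
c = rng.normal(size=n)
x = 0.8 * c + rng.normal(size=n)
y = 0.3 * x + 1.0 * c + rng.normal(size=n)

unadjusted = sm.OLS(y, sm.add_constant(x)).fit().params[1]
adjusted = sm.OLS(y, sm.add_constant(np.column_stack([x, c]))).fit().params[1]

# The unadjusted estimate is biased upwards via the C -> X and C -> Y paths;
# adjusting for C recovers (approximately) the true effect of 0.3
print(f"Unadjusted: {unadjusted:.2f} | adjusted for C: {adjusted:.2f} | truth: 0.30")
```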

Fourth, we took a dive into the world of mediators - variables that are caused by the exposure and that, in turn, cause the outcome. Unlike confounders, which bias the exposure-outcome association, mediators are part of the mechanism by which the exposure causes the outcome; we therefore generally would not want to control for mediators, as this blocks some (or all!) of the pathway by which the exposure affects the outcome, resulting in biased estimates of the total causal effect.
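
Again as an illustrative sketch with invented numbers: adjusting for the mediator strips out the indirect pathway and returns only the direct effect, rather than the total causal effect.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100_000

# X affects Y directly (0.2) and indirectly via mediator M (0.5 * 0.6 = 0.3);
# the total causal effect of X on Y is therefore 0.2 + 0.3 = 0.5
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.2 * x + 0.6 * m + rng.normal(size=n)

total = sm.OLS(y, sm.add_constant(x)).fit().params[1]
adjusted_for_m = sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params[1]

# Adjusting for the mediator blocks the indirect pathway, leaving only
# the direct effect (~0.2) rather than the total effect (~0.5)
print(f"Total effect: {total:.2f} | after adjusting for M: {adjusted_for_m:.2f}")
```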

Fifth, we explored the murky underworld of colliders and collider bias. Collider bias occurs when one conditions on a common consequence of (i.e., a variable caused by) the exposure and the outcome (or factors related to the exposure and outcome). If one conditions on a collider, this can cause a spurious association between the exposure and outcome, even if in reality there is no causal relationship - correlation without causation! As always, DAGs can be used to help identify and avoid colliders, although this can be difficult, especially as collider bias often occurs surreptitiously, due to missing data and/or non-random selection into a study.
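
Collider bias is perhaps the least intuitive of these phenomena, so a small illustrative simulation (hypothetical numbers again) may help: here X has no effect on Y at all, yet conditioning on their common consequence manufactures an association.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(99)
n = 100_000

# X and Y are causally unrelated, but both cause the collider S
# (e.g., selection into the study)
x = rng.normal(size=n)
y = rng.normal(size=n)                      # no effect of X on Y at all
s = 1.0 * x + 1.0 * y + rng.normal(size=n)

naive = sm.OLS(y, sm.add_constant(x)).fit().params[1]
conditioned = sm.OLS(y, sm.add_constant(np.column_stack([x, s]))).fit().params[1]

# Conditioning on the collider induces a spurious (here negative)
# association between X and Y: correlation without causation
print(f"No adjustment: {naive:.2f} | conditioning on collider S: {conditioned:.2f}")
```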

Sixth, and finally, we introduced measurement error, how it can be represented in DAG form, and all the havoc it can wreak on our inferences. Measurement error can be especially pernicious, as different forms of measurement error have different impacts on analyses - from relatively benign bias towards the null, to bias in any direction, depending on the structure of the error.
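
As a final illustrative sketch (assumed effect sizes): classical, non-differential error in the exposure attenuates the estimate towards the null; other, less benign, error structures can be explored by simulating them in exactly the same way.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 100_000

# True effect of X on Y is 0.5, but we only observe a noisy version of X
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)
x_obs = x + rng.normal(scale=1.0, size=n)   # classical, non-differential error

true_x = sm.OLS(y, sm.add_constant(x)).fit().params[1]
noisy_x = sm.OLS(y, sm.add_constant(x_obs)).fit().params[1]

# Classical error in the exposure attenuates the estimate towards the null
# (here roughly halved); other error structures can bias in either direction
print(f"Using true X: {true_x:.2f} | using mismeasured X: {noisy_x:.2f}")
```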

Life lessons

So what should we take from this whirlwind causal adventure? There are a few key areas where I have found this approach especially useful, which I hope others will too:

  • Taking causality seriously: While causal inference from observational (i.e., non-experimental) research is certainly tricky and rests on numerous, often untestable, assumptions, I have found that a causal perspective is indispensable when planning research questions and analyses. I have read - and myself been part of1 - numerous papers where research questions were not clearly defined and a ‘causal salad’ approach was taken to the analyses, with a range of covariates thrown into the model with scant regard for the causal structure of the data. Perhaps the inferences from these papers are sound and would not substantially alter if a more cautious causal approach were taken, but colour me sceptical2. Either way, we can do better, and taking causality seriously is a good start. This involves thinking through whether variables are confounders, mediators or colliders - ideally with some justification - plus consideration of other sources of bias (i.e., selection and measurement bias), and planning analyses accordingly.

  • Clearly stating the research question: These benefits apply not just to work specifically aiming to infer causality, but also to descriptive and predictive work. Clearly stating the research aim helps readers understand exactly what the paper is trying to show, and how the results should be interpreted. For instance, in predictive research the aim is to see which combination of variables best predicts the outcome; causality is irrelevant here, so the model estimates should not be given a causal interpretation. In descriptive work, meanwhile, the aim is simply to describe the data and explore whether certain variables are associated with one another; these associations should not be given a causal interpretation either, at least not without further work probing these relationships in more depth to try and establish causality. Finally, for causal work, this DAG-based approach requires a clearly defined estimand - the causal effect you are trying to measure - and avoids over-interpreting other coefficients in the model, which are highly unlikely to also be causal estimates (cf. the Table 2 fallacy). These causal considerations have definitely made me a more careful researcher; I believe they can do the same for others, too.

  • Use simulations as intuition pumps: Trying to understand all the arcane rules of causal inference - which covariates are needed to remove confounding bias, the impact of collider bias, when missing data or measurement error will or will not result in bias, etc. - can be difficult and rather overwhelming. This is where I have found simulating data to be especially useful: because I generate the data myself, I control all the parameters and can see how different assumptions impact inferences. This is impossible when working with observational data, as we have no idea what the true data-generating mechanism is. I find simple toy simulations extremely enlightening, and they really help in understanding how confounding, mediation, collider bias, missing data and measurement error can affect results. I hope others find these simple simulations helpful as well.

  • Make your assumptions clear: DAGs force you to lay all your cards on the table and show exactly the assumptions you are making when trying to estimate a causal effect. This is great, and means you have to think carefully about your assumptions. However, by being more explicit, you are opening yourself up to more disagreement (it’s easier to see the flaws in something if it’s explained clearly, compared to something vague and hidden). This is a good thing, though, as it means that others can question or vary these assumptions to see whether this alters conclusions. This is part of open science, and should be encouraged along with other open-science practices such as sharing data and code, and pre-registering analyses.

  • Acknowledging limitations: By making your causal assumptions clear, a DAG-based approach also helps when discussing the limitations of a study. For instance, perhaps unmeasured confounding, missing data/selection bias and/or measurement error could impact your study’s conclusions. We should be open about these threats, as it provides a more balanced and less-hyped picture of our research, and can help point to avenues for future research to do better. Simply avoiding all discussion of limitations and pretending all studies are perfect might help you get into a prestigious journal, but it doesn’t help science. So, please, be humble.3

Summary

I hope that this series of blog posts has been somewhat useful in introducing some core concepts in causality. This has very much been a (hopefully relatively gentle) introduction to this vast, complex and growing topic. While this ‘causal revolution’ predominantly started in epidemiology, it is great to see that recognition of the importance of a causal perspective is making its way into our fields, such as psychology, sociology and evolution/ecology.

There are, of course, many more advanced topics which have not been covered here, including approaches to assess unmeasured confounding, formal causal mediation analysis, a massive literature on selection bias and imputation methods (e.g., here and here), methods for correcting measurement bias, longitudinal data analysis (e.g., here and here), general modelling and statistical inference (e.g., here and here), plus so much more4.

Hopefully this has been a useful primer to start your journey in causality.

Welcome, and come on in, the water’s lovely.5


  1. More frequently in the past, I would hope!↩︎

  2. Is research which takes this explicit causal approach more likely to replicate? I don’t know of any studies testing this idea, but it would be an interesting research project to try and find out!↩︎

  3. I once heard advice for publishing along the lines of “Don’t mention limitations in your paper, as that’s for the reviewers to notice.” Needless to say, this is terrible, terrible advice. Don’t listen to it. Please.↩︎

  4. For a general introduction to many of these more advanced topics from an epidemiological perspective, Modern Epidemiology is a great resource.↩︎

  5. While this is the last blog in this ‘Dagatha Christie’ series introducing core concepts in causality, I am planning other blog posts on these and related topics - On missing data/multiple imputation/inverse probability weighting, structured life course models, and other exciting things. Stay tuned!↩︎