To correlate, or not to correlate, that is the question.

“Correlation doesn’t imply causation.” Anyone who has attended a statistics class, or even watched a YouTube video about statistics, has heard this mantra. Yet when we analyze data and see those beautiful lines growing together or those colorful dots clustering, we instantly get goosebumps and invent a reason for the correlation. When we do this, we are often (though not always) falling in love with a spurious correlation. But what does this mean in practice, and how can we avoid it?

Correlation and causation

When data are correlated, there is an association between variables: one is likely to change when the other changes. This might lead us to believe that changing one variable causes the other to change. The difference is subtle, but it completely changes the interpretation. We speak of causation only when a change in one variable actually brings about a change in the other. We can also say that causation generally implies correlation, but not the other way around.

As an example, now that we are heading towards the hot season, we will probably notice that sales of both ice cream and cold beverages are increasing. Does this mean that higher ice cream consumption causes cold beverage sales to increase? No! As you might have guessed, the real cause is a third variable: rising temperatures.

So, a spurious correlation is a relationship between two variables that appears to be causal but is not.
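
The ice cream example can be checked with a small simulation. The sketch below is only an illustration under invented assumptions: the sales figures, coefficients and noise levels are made up, and `pearson` and `partial_corr` are helper names introduced here. It generates two sales series that both depend on temperature and on nothing else, then compares their raw correlation with their partial correlation once temperature is held fixed.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs (first-order partial correlation)."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented data: temperature drives both sales series; they never influence each other.
rng = random.Random(42)
temperature = [rng.gauss(25, 5) for _ in range(500)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
beverages = [8 * t + rng.gauss(0, 20) for t in temperature]

raw = pearson(ice_cream, beverages)
controlled = partial_corr(ice_cream, beverages, temperature)

print(f"ice cream vs beverages: {raw:.2f}")
print(f"same, controlling for temperature: {controlled:.2f}")
```

If temperature really is the only link between the two series, the raw correlation should come out high while the controlled one stays close to zero.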

Then why do we sometimes assume, when we watch correlated data vary, that one variable is causing the other to change? Our brain is lazy and would rather trick us than do the hard work. When analyzing a large amount of data and information, it takes shortcuts that simplify the information and sometimes give us the wrong answer. This mental shortcut is called a cause-relation cognitive bias, and it leads us astray whenever we believe that correlation implies causation.

Tyler Vigen has put together some great and really funny examples of spurious correlation. You can find them here. For example, did you know that the decline in the number of pirates is causing global warming? (Just joking, of course; don’t take this seriously. Verify the sources first, as we should with any news we read.)

Source: BuzzFeedNews

So, how can we avoid this trap?

Let’s start with the most common errors (and charts) that we encounter in our daily work and see how to minimize the risk of spurious correlation:

1. Avoid comparing dissimilar variables

Two curves can look alike on the same chart even when their Y axes measure different things. This becomes dangerous when the values appear to be connected but aren’t. It is better to show them separately.

Source: HBR

2. Beware of skewed scales

Even when we are measuring data in the same category (and the Y axes measure the same thing), we can infer a correlation that isn’t there, because altered scales can change how the lines look.

Source: HBR
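
One way to guard against misleading scales is to rebase both series to a common starting value before comparing them, so that each line shows relative change rather than whatever its axis happens to emphasize. The snippet below is a minimal sketch of that idea; the figures are invented for illustration and `index_to_base` is a helper name introduced here.

```python
def index_to_base(series, base=100.0):
    """Rescale a series so that its first value equals `base`."""
    first = series[0]
    return [base * value / first for value in series]


# Invented figures: on two separately scaled axes these lines could look identical.
revenue_a = [100, 102, 105, 110]      # grows 10% overall
revenue_b = [1000, 1002, 1005, 1010]  # grows 1% overall

indexed_a = index_to_base(revenue_a)
indexed_b = index_to_base(revenue_b)

print("A:", [round(v, 1) for v in indexed_a])
print("B:", [round(v, 1) for v in indexed_b])
```

Plotted on one shared axis after rebasing, the two series no longer appear to move in lockstep, because one grows ten times faster than the other.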

3. Don’t create a narrative (if-then)

We are eager to find correlation or causation, so we sometimes compare unrelated data that seem to make sense together (one changes and the other varies too). We try to build a narrative around them, but the pattern could be a simple coincidence.

Source: HBR
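
Coincidence is more common than intuition suggests, especially with series that drift over time. As a rough sketch (the helper names `random_walk` and `differences` and all parameters are assumptions made for this illustration), the simulation below builds pairs of completely independent random walks and measures how strongly they correlate, first on their levels and then on their step-to-step changes.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, steps):
    """Cumulative sum of independent Gaussian steps."""
    position = 0.0
    walk = []
    for _ in range(steps):
        position += rng.gauss(0, 1)
        walk.append(position)
    return walk


def differences(series):
    """Step-to-step changes of a series."""
    return [b - a for a, b in zip(series, series[1:])]


rng = random.Random(0)
trials, steps = 200, 100
level_corrs, change_corrs = [], []
for _ in range(trials):
    a = random_walk(rng, steps)  # the two walks share nothing
    b = random_walk(rng, steps)
    level_corrs.append(abs(pearson(a, b)))
    change_corrs.append(abs(pearson(differences(a), differences(b))))

mean_level = sum(level_corrs) / trials
mean_change = sum(change_corrs) / trials
print(f"mean |correlation| of levels:  {mean_level:.2f}")
print(f"mean |correlation| of changes: {mean_change:.2f}")
```

The levels of unrelated drifting series tend to correlate far more strongly than their changes do, so comparing period-over-period changes is one simple sanity check before building a story around two lines that move together.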

So, in the end, I hope this article helps you re-evaluate the data and information around you. And the next time you fall in love with a spurious correlation, at least you will know how to break the spell.
