On the interpretation of a regression model

May 18, 2018

(This article was originally published at Statistics – Freakonometrics, and syndicated at StatsBlogs.)

Yesterday, NaytaData (aka @NaytaData) posted a nice graph on reddit showing daily bicycle traffic against mean air temperature in Helsinki, Finland.

I found that graph interesting, so I asked for the data (NaytaData kindly sent them to me tonight).

library(ggplot2)
# df: one row per day, with columns meanTemp and cyclists
ggplot(df, aes(meanTemp, cyclists)) +
  geom_point() +
  geom_smooth(span = 0.3)

But as someone mentioned on Twitter, the interpretation is somewhat trivial: people get out on their bikes when the weather is nice. The hotter it is, the more cyclists there are on the road. That reading is a causal one…

But actually, we can also visualize the data as follows, as suggested by Antoine Chambert-Loir:

# same data, with the two axes swapped
ggplot(df, aes(cyclists, meanTemp)) +
  geom_point() +
  geom_smooth(span = 0.3)

The interpretation would then be, somehow, that the more cyclists there are on the road, the hotter it is. Why not consider this causal interpretation? Perhaps cyclists go so fast, or sweat so much, that they raise the temperature…

Of course, this is the standard (recurrent) “correlation is not causation” discussion. But with regression models, we like to tell a story and pretend that it is a causal one, even though we do not prove it. Here, we know that the first story is more credible than the second, but how do we know that? To go further, how can we use machine learning techniques to prove causal relationships? How could a machine choose between the first and the second story?
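To make the symmetry concrete, here is a minimal sketch. It does not use the real Helsinki data: the values below are simulated, and the coefficients are arbitrary assumptions chosen only to give a plausible-looking cloud of points. It also swaps the loess smoother used in the plots for a plain linear fit, which is easier to reason about. The point is that the two regressions share the same goodness of fit, so the fit alone cannot tell the two stories apart.

```r
# Simulated stand-in for the data (hypothetical values, NOT the real series)
set.seed(1)
n <- 365
meanTemp <- runif(n, min = -15, max = 25)
cyclists <- pmax(0, 3000 + 400 * meanTemp + rnorm(n, sd = 2000))
df <- data.frame(meanTemp = meanTemp, cyclists = cyclists)

# First story: temperature "explains" the number of cyclists
fit_xy <- lm(cyclists ~ meanTemp, data = df)
# Second story: the number of cyclists "explains" temperature
fit_yx <- lm(meanTemp ~ cyclists, data = df)

r2_xy <- summary(fit_xy)$r.squared
r2_yx <- summary(fit_yx)$r.squared
r     <- cor(df$meanTemp, df$cyclists)

# In simple linear regression, both R^2 values equal the squared correlation
all.equal(r2_xy, r2_yx)
all.equal(r2_xy, r^2)

# ...and the product of the two slopes is that same squared correlation
slope_product <- coef(fit_xy)[["meanTemp"]] * coef(fit_yx)[["cyclists"]]
all.equal(slope_product, r^2)
```

Both directions fit equally well by this measure, which is why choosing between them has to rely on something outside the regression output, such as what we know about weather and cyclists.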


