Correlation vs causation

A few days ago, headlines such as “Liberals swear more on twitter than right-wingers, says study” swept the news. I was skeptical from the beginning, wondering whether the authors of the study had controlled for age. Turns out, they didn’t.

Sylwester, K., & Purver, M. (2015), “Twitter language use reflects psychological differences between Democrats and Republicans,” is a correlational study, not a causal one, despite what its title implies. The psychological differences it reports are likely to reflect a whole other set of variables.

The authors of the study do not use any control variables. They regress the usage of various words by Twitter account owners on a dummy variable indicating whether the person is a Democrat. They also simply count the number of times specific words were used by Republicans or Democrats. So it’s not just age that they do not control for; it’s geography, too.

I don’t think many people would dispute that younger people tend to be more liberal. Younger people could also be less reluctant to use swear words. Similarly, Republicans tend to be more religious, so it makes sense that they use words like “God” and “psalm” in their tweets more than Democrats do. However, the fact that they use more religious words probably stems more from their geography than from their political affiliation. It is geography that tends to shape political affiliation, more often than the other way around. A very simple example: living in the South is likely to mean you were brought up in a Republican family, but being a Republican doesn’t make you move south.
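To make the age argument concrete, here is a minimal sketch on simulated data. Nothing in it comes from the study: the sample, the age range, the probabilities, and the names (simulate_and_fit, swear_rate, and so on) are all assumptions I made up for illustration. In this toy world, swearing depends only on age and never on party, yet a regression on the Democrat dummy alone still picks up an apparent "Democrat effect".

```python
import numpy as np

def simulate_and_fit(n=20000, seed=0):
    """Fit swearing on a Democrat dummy, with and without an age control.

    All data are simulated under assumed relationships; this is not the
    study's data or the authors' method.
    """
    rng = np.random.default_rng(seed)

    # Assumed: user ages are uniform between 18 and 80.
    age = rng.uniform(18, 80, size=n)

    # Assumed: younger users are more likely to be Democrats
    # (probability falls linearly from 0.8 at age 18 to 0.2 at age 80).
    p_democrat = 0.8 - 0.6 * (age - 18) / 62
    democrat = (rng.random(n) < p_democrat).astype(float)

    # Assumed: swearing depends on age only, NOT on party.
    swear_rate = 5.0 - 0.05 * age + rng.normal(0, 1, size=n)

    ones = np.ones(n)

    # Regression with no controls: intercept + Democrat dummy.
    X_naive = np.column_stack([ones, democrat])
    naive_coef = np.linalg.lstsq(X_naive, swear_rate, rcond=None)[0][1]

    # Same regression with age added as a control variable.
    X_ctrl = np.column_stack([ones, democrat, age])
    ctrl_coef = np.linalg.lstsq(X_ctrl, swear_rate, rcond=None)[0][1]

    return naive_coef, ctrl_coef

naive, controlled = simulate_and_fit()
print(f"Democrat coefficient without age control: {naive:.3f}")
print(f"Democrat coefficient with age control:    {controlled:.3f}")
```

With these made-up numbers, the first coefficient comes out clearly positive and the second sits close to zero, even though party has no effect on swearing by construction. That doesn't prove the real result is spurious; it only shows that an uncontrolled regression can't tell the two stories apart.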

Now, I know that they probably couldn’t get data on age and geography (and on other characteristics of the user) because of privacy settings. Plus, I’d imagine that Twitter doesn’t ask you for your exact age when you sign up, and even if it did, there is nothing stopping you from lying.

Since this is the case, is there a point in publishing a study like this? I’d like to see whether the trends still hold when age, geography, level of education, and a whole set of other variables are controlled for.

