Book review: Everybody Lies by Seth Stephens-Davidowitz

May 15, 2017

(This article was originally published at Big Data, Plainly Spoken (aka Numbers Rule Your World), and syndicated at StatsBlogs.)

Seth Stephens-Davidowitz has written a fascinating book calling for social scientists to use data collected by Google or Facebook in their research. This is a controversial issue; if it weren’t, it wouldn’t warrant a full-length book. Google does not publicly release its search data, but provides some pre-processed and aggregated statistics through services such as Google Trends and AdWords. Researchers who use this data do not have control over its collection or processing. (Seth previously worked at Google, and has written some columns for the New York Times.) Facebook does not publish its data either, although social media users have a lower expectation of privacy.

Such big datasets come with a set of knotty problems, which I have previously summarized as OCCAM. Besides the researcher having little control over the data’s origin, the researcher’s purpose typically diverges from that of the data collector. The data is found or observed and not usually experimental. It is often treated as “complete” or essentially complete by the researcher, which is an assumption, not a fact. When merging in other data, the researcher inevitably introduces errors. Here is my previous post about OCCAM datasets (link). It is not surprising that classically trained scientists have reservations about such datasets, especially if they are interested in causal mechanisms. Nevertheless, I agree with Seth that we can make progress on solving these problems if we start taking them seriously.

In writing the book, Seth carried out a number of mini-studies using mostly Google Trends data. Here are 8 things I learned from reading Everybody Lies:

  1. Some people use search engines as confessionals. They type complete sentences like “I am sad.” or open-ended questions like “Is my daughter ugly?”
  2. People assume machines (like the Google search engine) will keep their secrets. For sensitive topics, Google may generate more honest data than surveys. Many of the questions people ask Google are ones I’m sure they would not pose to a librarian.
  3. Google searches for “Obama” are frequently paired with “kkk” and the “n” word. The prevalence of racist searches does not exhibit a North-South divide – it’s East-West.
  4. As President of Harvard, Larry Summers spent quite a bit of time brainstorming with Economics PhD students on how to beat the stock market using new data. (And they came up empty-handed, or so they say.)
  5. Anthony Weiner got rejected from Stuyvesant High School (a famous NYC public school), missing the cutoff score in the admissions test by one point.
  6. Some economists found that going to Stuyvesant conferred no meaningful benefit to one’s career – at least, this is the case for those who attain a score close to the cutoff in the admissions test.
  7. There are 6,000 searches on Google a year for “how to kill your girlfriend” while there are 400 murders of girlfriends.
  8. “Big data” does not provide any insights at the aggregate level that surveys can’t, so people slice and dice the data to examine “micro” segments, which means they are analyzing a huge collection of small data sets.

The research mentioned in Seth’s book comes primarily out of the Economics discipline, and can be considered in the tradition of Freakonomics (2005). There are examples of natural experiments, as popularized by Steven Levitt. Seth brings the coverage up to date, describing regression discontinuity, field experiments, and other techniques currently favored by econometricians. For those interested in what came after Steven Levitt, this is a good place to start.

