The backlash against Big Data has started

February 21, 2013

(This article was originally published at Numbers Rule Your World, and syndicated at StatsBlogs.)

It is inevitable that all the hype around "Big Data" leads to a backlash. As someone who has been working in "data science" since before the term existed, I am happy to see widespread validation of the field but also concerned about over-promising and under-delivering. Several recent articles went overboard in criticizing data science -- while their points are sometimes valid, the tone of these pieces misses the mark. I'll discuss one of these articles in this post, and some others in the next few days.


Andrew Gelman has a beef with David Brooks over his New York Times column called "What Data Can't Do". I will get to Brooks's critique soon--my overall feeling is that he created a bunch of sound bites, and could have benefited from interviewing people like Andrew and myself, who are skeptical of Big Data claims but not maniacally dismissive.

The biggest issue with Brooks's column is the incessant use of the flawed man versus machine dichotomy. He warns: "It's foolish to swap the amazing machine in your skull for the crude machine on your desk." The machine he has in mind is the science-fictional, self-sufficient, intelligent computer, as opposed to the algorithmic, dumb-and-dumber computer that exists today and has existed for decades. A more appropriate analogy for today's computer (and that of the foreseeable future) is a machine that the human brain creates to automate mechanical, repetitious tasks at scale. This machine cannot function without human piloting, so it's man versus man-plus-machine, not man versus machine.

I use such an analogy in Chapter 2 of Numbers Rule Your World, to compare and contrast the credit-scoring algorithmic paradigm with the manual underwriting paradigm of the past. The point is that there is more similarity than difference between the automated and the manual methods; the automated methods are faster, better able to handle multiple threads, and unfazed by individual bias.


A major blind spot in Brooks's column is that it ignores the work of Kahneman and Tversky, and other behavioral psychologists, who have shown convincingly that the human brain is subject to all kinds of biases, and uses heuristics that lead to incorrect judgements.

A large body of work, for instance, points to the "priming" effect. Someone may walk into the supermarket and buy detergents just because he or she heard a story about cheating on the radio. Of course, people would deny such influences but experiments prove the effects exist. There's also the experiment that shows that subjects who are asked to hold a pencil in their mouth to activate "grin" muscles feel happier than those made to activate "growl" muscles.

It is comedic when Brooks tells us that "people are really good at telling stories that weave together multiple causes and multiple contexts... data... cannot match the explanatory suppleness of even a mediocre novel". I mean, does he care if the "stories" and "novels" lead to correct decisions? Or is he just in it for entertainment?


While I agree with some of Brooks's diagnosis of the problems with data-driven analyses, it is often the case that the alternative of not using data, or using the brain as he calls it, does not create a demonstrably better outcome.

Under "Big Data has trouble with big problems," he complained: "We've had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides."

Macroeconomics has always been a field that suffers from a lack of data (in fact, a sample path of one). I don't know what "mountains of data" he's talking about, especially since things we're doing, like quantitative easing, have never been done before. Nor do I understand why the proof of utility is side-switching. He may be right that the economists have not switched sides, but this says more about the people--who, mind you, have embarrassing records when it comes to managing the economy--than about the data. If Brooks doesn't think the economic decisions were based on data, then what should we blame for the many economic failures? The human brain?

"Data creates bigger haystack... and the needle we are looking for is still buried deep inside." This is definitely true. But why is data analysis the problem here? What's the alternative he has in mind? Most data analysts sooner or later realize that "data science" is as much art as science. We don't have to pick one or the other; we can have the best of both worlds.


Brooks made a really great point at the end of the piece, which I will paraphrase: any useful data is cooked. "The end result looks disinterested, but, in reality, there are value choices all the way through, from construction to interpretation." Instead of thinking about this as cause for concern, we should celebrate these "value choices" because they make the data more useful.

This brings me back to Gelman's reaction, in which he differentiates between good analysis and bad analysis. Except for the simplest problems, any good analysis uses cooked data, but an analysis using cooked data could be good or bad.


