Communicating, coding and intuition for data scientists

August 2, 2012

(This article was originally published at Numbers Rule Your World, and syndicated at StatsBlogs.)

There is a stimulating conversation going on between Cathy O'Neil (mathbabe) and CMU Prof. Cosma Shalizi about whether "data science" is different from "statistics". Cathy started by posting some comments about "how to hire data scientists" (link). Cosma responded with white is the new black (link): a "modern" statistics undergraduate training would prepare one well for such jobs. Cathy disagreed on several fronts, favoring PhD training (to be able to cobble together methodology on the fly and defend it) and stressing the ability to deal with people.

Cosma has some more thoughts (link), agreeing with Cathy on most points but unconvinced by her repeated argument that one should just hire some "smart" people and they will figure it out. He pointed to a bunch of wrong results in network science coming from physicists who are generally considered smart people.

Cathy has another follow-up (link, cross-posted to Naked Capitalism, where I first picked up this thread). She doubled down on her position, arguing that statistics graduates do not have the necessary communication skills. She then railed against "poseurs" in the data science community, people who just know how to press a button and run some black-box algorithms.


Agreeing with Both

Since I have hired a few people in business statistics and seen how they fared, I have strong opinions on these topics. I think Cathy's post on what skills are the most necessary is a must-read, as is her point about asking the right questions. I agree with Cosma that a statistics degree should be very desirable to employers (who understand what they want from data science). To summarize Cathy's points, we look for creative problem-solvers.

People who follow my blogs know I have long stressed communication skills, which include PowerPoint-type presentations, in-person meetings, translations (connecting engineers and business people), and negotiations (balancing business and technical objectives). Emphatically, I did not mention dashboards, dynamic and interactive graphics, 3D charts, piles of spreadsheets, or volumes of statistical output. Some people may not want to interface with the business side; that's fine, but effective communication is still important when speaking to one's manager. It's a chance to demonstrate that one understands that the statistics serve business objectives.

Disagreeing with Both

Cathy and Cosma both feel that knowing specific programming languages is not essential. To quote Cathy, "you shouldn’t obsess over something small like whether they already know SQL." To put it politely, I reject this statement. To apply to a data science job without learning the five key SQL statements is a fool's errand. Simply put, I'd never hire such a person. To come to an interview and draw a blank trying to explain "left join" is a sign of (a) not being smart enough or (b) not wanting the job enough or (c) not having recently done any data processing, or some combination of the above. If the job candidate is a fresh college grad, I'd be sympathetic. If he/she has been in the industry, he/she won't be called back. (One undisclosed detail in the Cosma-Cathy dialogue is what level of hire they are talking about.)

Why do I insist that all (experienced) hires demonstrate a minimum competence in programming skills? It's not because I think smart people can't pick up SQL. The data science job is so much more than coding -- you need to learn the data structure, what the data mean, the business, the people, the processes, the systems, etc. You really don't want to spend your first few months sitting at your desk learning new programming languages.
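For readers who have not seen one, here is a minimal sketch of the "left join" mentioned above. It runs SQL through Python's built-in sqlite3 module; the tables (users, orders), their columns and the function name left_join_demo are hypothetical and made up purely for illustration.

```python
import sqlite3


def left_join_demo():
    """Return every user, with matching orders where they exist."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical toy tables: three users, only two of whom have orders.
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER, name TEXT);
        CREATE TABLE orders (user_id INTEGER, amount REAL);
        INSERT INTO users VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
        INSERT INTO orders VALUES (1, 10.0), (1, 5.0), (3, 7.5);
    """)
    # A LEFT JOIN keeps every row of the left table (users). Users with
    # no matching order still appear, with NULL in the order columns.
    rows = conn.execute("""
        SELECT u.user_id, u.name, o.amount
        FROM users AS u
        LEFT JOIN orders AS o ON o.user_id = u.user_id
        ORDER BY u.user_id, o.amount
    """).fetchall()
    conn.close()
    return rows


for row in left_join_demo():
    print(row)
```

Bob has no orders, yet he still shows up in the result with an amount of None (SQL NULL). An inner join would have dropped him silently, and that difference is what an interviewer is usually probing for.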

Both Cathy and Cosma also agree that basic statistical concepts are easily taught or acquired. Many studies have disproven this point, starting with the Kahneman-Tversky work. A recent example cited by Felix Salmon (and Andrew Gelman) showed that economists can't interpret a simple linear regression properly. Loads of shady research gets published in peer-reviewed journals in many fields, demonstrating little to no understanding of basic statistics. One of my favorite examples is a paper in transportation research, which I came across when writing my book, in which a t-test was used to show that an entire dataset is "statistically significant".

What really sets one apart in data science/statistics is intuition. Given large data sets with a gazillion dimensions, there are a gazillion ways to look at the data. How does the analyst figure out what to look at, and efficiently come to useful conclusions? When does the analyst discover that the data contain a chunk of user_ids equal to zero: at the start of the project, while he/she digs through the results of the first or second analyses, half-way through the project, or never?



In the textbook, this last question is a completely solved problem. Just follow the flowchart. Do your data cleanliness checks. In the real world, things are not that simple. It may take hours, even days, to conduct a thorough check. There are millions of variables and you can't check them all. You may have developed some familiarity with the data, which leads you to skip certain checkpoints. The zero user_ids may have only recently appeared due to a mistake by someone upstream. They may not have affected your output in a noticeable way if the methodology you used is robust to outliers. One of the cues is typically a much larger output data set than you expected when you join the user_ids to some other data -- but I'd be impressed if you always do a count of every intermediate dataset you have ever produced. If you sense something wrong with the analysis output, could you come up with hypotheses as to why it went wrong, and have one of those hypotheses check out, say by exposing the zero-id issue? Being able to "sense something wrong" is easier said than done when you are staring at pages of calculations.
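To make the "much larger output than expected" cue concrete, here is a small sketch, again using Python's built-in sqlite3 module. The tables (events, profiles), their columns and the helper join_fanout_check are hypothetical, invented for this illustration. It counts rows before and after a join on user_id and reports how many ids equal zero on each side.

```python
import sqlite3


def join_fanout_check(events, profiles):
    """Count rows before and after joining on user_id, plus zero ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
    conn.execute("CREATE TABLE profiles (user_id INTEGER, region TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", events)
    conn.executemany("INSERT INTO profiles VALUES (?, ?)", profiles)

    def scalar(sql):
        # Run a query that returns a single value and unwrap it.
        return conn.execute(sql).fetchone()[0]

    report = {
        "event_rows": scalar("SELECT COUNT(*) FROM events"),
        "zero_id_events": scalar(
            "SELECT COUNT(*) FROM events WHERE user_id = 0"),
        "zero_id_profiles": scalar(
            "SELECT COUNT(*) FROM profiles WHERE user_id = 0"),
        # If an id repeats on both sides, the join emits one row per
        # pairing, so the output can exceed the input row count.
        "joined_rows": scalar(
            "SELECT COUNT(*) FROM events AS e "
            "JOIN profiles AS p ON p.user_id = e.user_id"),
    }
    conn.close()
    return report


# Made-up data: three events and two profiles carry user_id 0.
events = [(1, "view"), (2, "click"), (0, "view"), (0, "click"), (0, "buy")]
profiles = [(1, "east"), (2, "west"), (0, "unknown"), (0, "unknown")]
report = join_fanout_check(events, profiles)
print(report)
```

In this toy data, five event rows turn into eight joined rows, because the three zero-id events each pair with both zero-id profiles. A count like this is cheap on one table; the hard part described above is knowing which of many datasets deserves it.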

A lot of the intuition comes with experience. But experience is not sufficient. That's why earlier I mentioned the importance of learning the data and understanding how it was collected. That isn't sufficient either. It's a whole lot of things, many of them intangible qualities, that together produce the intuition. This requirement is the toughest on fresh graduates, whether they are undergrads or PhDs.




