Specialization and Communication in Data Science

September 11, 2017

(This article was originally published at Simply Statistics, and syndicated at StatsBlogs.)

I have been interested for a while now in how data scientists can better communicate data analysis activities to each other and to people outside the field. I believe that our current methods are inadequate because they have mostly been borrowed from other areas (notably, computer science). Many of those tools are useful, but they were not developed to communicate data analysis concepts specifically and often fall short. I talked about this problem in my Dean’s Lecture earlier this year and how the field of data science could benefit from developing its own theories, to simplify communication as other fields have done.

One thing that I have noticed is that in other fields, the development of those fields can be viewed in part as a trend towards increasing specialization. With people in a field who increasingly specialize in a certain sub-specialty, there is a parallel need for the specialists to communicate and coordinate with each other in order to produce a whole product. Over time, the separation of a field into an collection of specialists drives the development of communication tools that can serve as clearinghouses of mutually agreed upon information. Without adequate tools, the communications overhead involved with adding more people to a project will become too great and the entire enterprise can collapse. This phenomenon is famously described in Fred Brooks’s The Mythical Man-Month as it relates to software engineering projects.

I thought it might be useful to talk about some of these other areas and how they have overcome increased specialization and the separattion of duties with communications tools. Tracing the history of other fields is instructive as it potentially provides a basis on which we can discuss data analysis. Listeners of my podcast with Hilary Parker know that we regularly have a segment that we refer to as “analogy corner” and this is the Simply Statistics version of that.

Specialization in Other Areas

The first example comes from filmmaking and the development of the screenplay. The Script Lab describes The History of the Screenplay and how filmmaking worked before the development of scripts:

When contemplating the history of screenwriting, one cannot divorce the theories of screenwriting from the evolution of film production. The earliest films were often solo projects, from conception to completion. Referred to as the “cameraman system” this was the most primitive of filmmaking. Soon, directors became central to the process, but most movies were filmed with only a vague idea of what the director wanted to shoot. Often crews were kept waiting while the director planned what to film next.

Films were one-person projects and were developed in a more or less linear fashion. It was an inefficient system—most films today are produced in a highly nonlinear fashion to accommodate actors’ schedules and various production processes.

Today, the screenplay serves as a critical communications hub around which many filmmaking departments (costume, makeup, hair, props, sets) can organize their activity. Imagine if representatives of each of these departments had to individually consult with the screenwriter or director about every detail of their work. It would be a nightmare of exponentially growing complexity. With a written document, such as as screenplay, that everyone can agree on as the authority of “what is happening in the film”, people can get their jobs done without the need of constant back and forth communication.

The second analogy comes from finance. In finance there was a similar development of specialization as it relates to limited liability. Here, the “specialization” refers to the separation of the owners of a company from its managers. As a result, there must be a way for managers of the company to communicate to the investors what exactly is going on with the operations of the company. Thus, the development of financial statements, accounting rules, and various publicly available documents that let investors analyze the health of a company. Graham and Dodd’s seminal Security Analysis is essentially a plea to investors to evaluate companies based on the publicly available data, rather than on common myths and legends about what makes for a good or safe investment. Today, with the separation of owners from managers and the creation of standardized communication formats between the two (e.g. S-1s, 10-Ks, 10-Qs, etc.), we have the basis of the global capital markets system.

The last analogy comes from western classical music, where there is often a division between the composer of the music and the performer. In more complex symphonic music, you might say there are three roles: the composer, the performer, and the interpreter/conductor. However, in early classical music, such a division didn’t exist and composers typically performed their own music, often by themselves. In this setup, there’s not much need to write things down, as the music can be stored in and performed from the composer’s head. This concept was well-captured in the movie Amadeus where Mozart describes his opera The Magic Flute as being “up here in my noodle” (the rest is just scribbling and bibbling).

Of course, opera might be the ultimate example from classical music where some sort of communication tool is needed to coordinate between the musicians, singers, and set designers. Thus, for most classical music, we have the score, which specifies what every instrument and signer is doing at any given time. There is a standardized notation that allows others unfamiliar with the composer to quickly grasp what is going on and to assemble the time and resources needed to perform the work.

What About Data Analysis?

In data science today, or really in science, much of what goes on follows the “vertically integrated” model, where the same person, asks the question, collects the data, and analyzes the data. The need for communication methods doesn’t really arise until that work needs to be disseminated to others (including yourself in the future). In large collaborations where communication over analyses needs to be done from the start, my experience has been that even in the best case scenarios, the methodology is ad hoc and difficult to re-create in another project involving different people.

Most would agree that the software code that actually does the analysis is an important component of communicate what is being done. However, not everyone needs or wants all of the details provided by the code. Perhaps one concept we could steal from music is the distinction between the score and the parts. In a symphony, the conductor needs the full score because they need to know what everyone is doing at all times. But the first violinist only reads the first violin part—they don’t need to read the entire score in order to play a vital part in creating the finished product.

Developing appropriate communications tools for data science is critical to scaling data analysis so that more people can be involved and to reproducibility/replicability so that more people can understand what is going on in an analysis. Until then, I think we will continue to jam tools from other fields into the data science process, and that is fine. These tools are useful, but I think ultimately are not a perfect fit.

Please comment on the article here: Simply Statistics