Beware of Excel Histograms

November 12, 2012

(This article was originally published at Learn and Teach Statistics and Operations Research, and syndicated at StatsBlogs.)

Excel histograms are a disgrace. Microsoft should be embarrassed to have them associated with their ubiquitous and generally wonderful spreadsheet, Excel. I have previously posted on how useful and versatile Excel is for enabling people to bypass the number crunching, and get to the ideas behind statistics. This is mostly true. But the histogram add-in should come with a health and safety warning.

To start with, the default look for the histogram is outrageously poor. I have some data on times a person takes to solve a Rogo puzzle. (Collected as part of our research on what factors affect solution time.) I put the data in the spreadsheet, and use the data analysis toolpak to create a histogram using the default settings. Voila!

Histogram produced using default settings in Excel

I’ll stretch it out a bit so you can see it in all its glory:

Histogram using Excel defaults, stretched out.

And here is what can be produced from this with a fair degree of manipulation:

Excel Histogram whipped into shape

This is not just a question of cosmetics. The way the horizontal axis is labelled makes it very difficult to read correctly an Excel-produced histogram unless the adaptation shown above is used. And sometimes, an Excel-produced histogram is just plain incorrect.

Bad histogram in NCEA sample exam

What prompted this tirade is a question from the sample external examination question in NCEA level 2 “Apply probability methods in solving problems”. This is an exam that over 15,000 students are likely to take. Fortunately this is only an examplar and not the real thing. It includes a badly labelled histogram, which I am almost certain was made in Excel.

Histogram taken from an exam exemplar

The introduction says: “Ali has a farm in Southland. She records the weights of 32 lambs born on her farm. The results are shown on the histogram above.”

The first question asks: “What proportion of the lambs weighs less than 1.25 kg?”

Go on – work it out.

I can’t answer this for certain. The labelling of the horizontal axis renders this question unanswerable. I suspect the desired answer is one out of 32. To get this answer I assume that the lambs are weighed to a precision of one decimal place, and the numbers under the graph are the inclusive upper bounds for the area above. I make this assumption because I know that is what Excel does. That is two too many assumptions for students in an exam. This is too many assumptions for any graph. Graphs exist in order to communicate, not confuse.

A histogram always has “bins” which cover a range of values. If you went to school last century and learnt to draw them by hand, you would put the boundary number between the bins on the tick mark on the graph that was the boundary between the bins. Intuitive!

A Google search on the word histogram shows most of the histograms with the tick marks at the boundaries, and quite a few using the Excel work around shown above. That is because the only way you can get the number and the tick mark to line up, is to move the tick mark to the centre. An Excel column chart is designed to be a value graph for nominal data, and it is being pressed into service in an unnatural way.

Another example

This is a simple mockup to illustrate

Example A

The question is, how many people scored 3 or less in the test?

It isn’t clear. Did one person score between 0 and 3, and then three between 3 and 6? The data is actually:  0, 0.5, 1, 2, 5, 5.5, 7, 8, 9,10 and the answer is that four people scored three or less. The following histogram shows this.

Example B (same data as Example A)

All it takes is some relabelling and the meaning is clear.

Teaching implications

We thought long and hard about the teaching of histograms within a Business Statistics course. We concluded that any student who is likely to need to produce a histogram in the future, is likely to (ought to?) have a better statistics package than Excel to use. Teaching them this bizarre work-around in Excel is a waste of student time (We decided this after we made students do this in a course.) It is more important for students to be able to interpret histograms correctly, and be aware of the pitfalls of badly labelled histograms. Consequently we taught students to interpret and then critique histograms, rather than construct them themselves, and assessed the same way.

If you ever use a histogram yourself, make sure you do not fall into the pit shown above!

And for those of you who persist in teaching histograms in Excel (or need help yourself in knowing how to do it – hence avoiding said pit), here is a pdf handout.


Good luck.

The best outcome would be that Microsoft get their very poor data analysis add-in fixed up, and the world would be a better place. Any chance of that?

Please comment on the article here: Learn and Teach Statistics and Operations Research

Tags: , , , , , ,