What makes a good data scientist?

December 31, 2012
Apparently, New Year's Eve is not a popular day to come to the office as it seems I'm the only one here. No matter, it just means I can blast Mahler 3 (Bernstein, NY Phil, 1980s recording) louder than I …

2012 in review

December 31, 2012
The WordPress.com stats helper monkeys prepared a 2012 annual report for this blog. Here's an excerpt: 19,000 people fit into the new Barclays Center to see Jay-Z perform. This blog was viewed about 150,000 times in 2012. If it were a concert at the Barclays Center, it would take about 8 sold-out performances for that

Top Posts of 2012

December 31, 2012
This has been a great year for my blog. I've seen tremendous growth in my subscribers. I look forward to engaging with and learning from my followers in 2013 and I plan to offer valuable content in return. If you're interested in following along, you can quickly subscribe via RSS or e-mail. I use Google

Four Values Can Still Be Worth A Chart

December 31, 2012
A while ago, Kaiser Fung criticized a chart for its uselessness because it only showed four numbers. The chart appeared on the smart web comic Abstruse Goose (which, as of this writing, is down for a site reorganization). First, I think Mr. Fung was trolled by his reader, and fell for it hook, line, and sinker. The point of this chart is not to communicate a lot of data or…

Tips for R Package Creation

December 31, 2012
I'm being tortured by the mistakes of my past self. I think I've made almost every mistake possible in creating a package and I want to go back in time and tell year-ago me all I know now. But …

Misusage of the new shiny package: A nerdy drink tracker for your next party

December 31, 2012
Currently a lot of people are talking about the new shiny package. So I got curious and built my own, more or less useful, app: a drink tracker. This app can be used to track how much someone drank and therefore it is very useful for every party, especial...

An established probability theory for hair comparison? “is not — and never was”

December 30, 2012
Hypothesis H: “person S is the source of this hair sample,” if indicated by a DNA match, has passed a more severe test than if it were indicated merely by a visual analysis under a microscope. There is a much smaller probability of an erroneous hair match using DNA testing than using the method of visual analysis [...]

December 30, 2012
An interesting new app called 100plus, which looks like it uses public data to help determine how little decisions (walking more, one more glass of wine, etc.) lead to more or less health. Here's a post describing it on the …

Searching for Structure underlying Customer Satisfaction Ratings: Item Response Theory through the Back Door

December 30, 2012
Variations on a Theme of Negative Skew and Positive Manifold. No one familiar with research on customer satisfaction expects to find uncorrelated ratings or symmetric distributions centered toward the middle of the rating scale. There are forces at...

Fixed effects, followed by Bayes shrinkage?

December 30, 2012
Stuart Buck writes: I have a question about fixed effects vs. random effects. Amongst economists who study teacher value-added, it has become common to see people saying that they estimated teacher fixed effects (via least squares dummy variables, so that there is a parameter for each teacher), but that they then applied empirical Bayes shrinkage [...]

Innovation and the New York City subway

December 30, 2012
Innovation and the NYC subway don't usually come together, but something changed in the past year or so. One of the greatest life-changers has been the installation of countdown clocks in many of the stations, telling riders how long till the next (several) trains arrive. Now there is a smartphone app for this. Readers of Chapter 1 of Numbers Rule Your World learn the concept behind these countdown clocks…

Update to Graphing Non-Proportional Hazards in R

December 30, 2012
Update 31 July 2013: I've moved all of the functionality described in this post into an R package called simPH. Have a look. It is much easier to use. This is a quick update for a previous post on Graphing Non-Proportional Hazards in R. In the previ...

An R wish list for 2013

December 30, 2012
First go and read An R wish list for 2012. None of the wishes came through in 2012. Fix the R website? No, it is the same this year. In fact, it is the same as in 2005. Easy to find help? Sorry, next year. Consistency and sane defaults? Coming soon to a theater near […]

Speed skating 10 km

December 29, 2012
It is winter, which makes it time for one of the Netherlands' beloved sports: speed skating. Speed skating is done over various distances, but for me, the most beautiful is the 10 km. The top men do this in about 13 minutes. In this post I try to u...

Big Data at Berkeley

December 29, 2012
Mike Jordan asked me to post the following information: The new Simons Institute for the Theory of Computing at Berkeley is an exciting new initiative which will begin organizing semester-long programs starting in 2013. One of the first programs, set for Fall 2013, will be on the “Theoretical Foundations of Big Data Analysis”. The organizers [...]

Sexism in science (as elsewhere)

December 29, 2012
Solomon Hsiang sends along this from Corinne Moss-Racusin, John Dovidio, Victoria Brescoll, Mark Graham, and Jo Handelsman: Despite efforts to recruit and retain more women, a stark gender disparity persists within academic science. . . . In a randomized double-blind study . . . science faculty from research-intensive universities rated the application materials of a [...]

Clustering with selected Principal Components

December 29, 2012
In the Visualizing Principal Components post, I looked at the Principal Components of the companies in the Dow Jones Industrial Average index over 2012. Today, I want to show how we can use Principal Components to create Clusters (i.e. form groups of similar companies based on their distance from each other). Let's start by loading

Europe has an Open Data Portal, too

December 28, 2012
The European Commission opened its Open Data Portal some days ago. Powered by CKAN. Most of the 5811 datasets (97%) are statistical ones.

New book by Stef van Buuren on missing-data imputation looks really good!

December 28, 2012
Ben points us to a new book, Flexible Imputation of Missing Data. It’s excellent and I highly recommend it. Definitely worth the $89.95. Van Buuren’s book is great even if you don’t end up using the algorithm described in the book (I actually like their approach but I do think there are some limitations with [...]

Most Findings Are False

December 27, 2012
Many of you may know this paper by John Ioannidis called “Why Most Published Research Findings Are False.” Some people seem to think that the paper proves that there is something wrong with significance testing. This is not the correct conclusion to draw, as I’ll explain. I will also mention a [...]

3 msc kvetches on the blog bagel circuit

December 27, 2012
In the past week, I’ve kvetched over at 3 of the blogs on my blog bagel: I. On an error in Mark Chang’s treatment of my Birnbaum disproof on Xi’an’s Og. II. On Normal Deviant’s post offering “New Names For Statistical Methods.” III. On a statistics chapter in Nate Silver’s book, discussed over at Gelman’s blog. [...]

The Möbius strip, or, marketing that is impervious to criticism

December 27, 2012
Johnny Carson had this great trick where, after a joke bombed, he'd do such a good double-take that he'd end up getting a huge laugh. This gimmick could never have worked as his sole shtick—at some point, Johnny had to tell some good jokes—but it was a reliable way to limit the downside. For the