In this third post on Measles data I want to have a look at some high incidence occasions. As described before, the data is from Project Tycho, which contains data from all weekly notifiable disease reports for the United States dating back to 18...

My article on whether we can trust airfare prediction models is published today at FiveThirtyEight, the new data journalism venture launched by Nate Silver after he moved to ESPN. This topic was originally conceived as a chapter of Numbersense (link) but I dropped it. As I have noted in my review of Nate Silver's book, he has a keen interest in evaluating predictions, and not surprisingly, he encouraged me to…

From 2006: Nassim Taleb's publisher sent me a copy of "Fooled by randomness: the hidden role of chance in life and the markets" to review. It's an important topic, and the book is written in a charming style—I'll try to respond in kind, with some miscellaneous comments. On the cover of the book is a […]

Four years ago, many of us were glued to the “spill cam” showing, in real time, oil gushing after the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11 and spewing oil until July 15 (see video clip that was added below). Remember junk shots, top kill, blowout preventers? [1] The EPA has […]

Here the monotonicity of the EM algorithm is established. $$ f_{o}(Y_{o}|\theta)=f_{o,m}(Y_{o},Y_{m}|\theta)/f_{m|o}(Y_{m}|Y_{o},\theta)$$ $$ \log L_{o}(\theta)=\log L_{o,m}(\theta)-\log f_{m|o}(Y_{m}|Y_{o},\theta) \label{eq:loglikelihood} $$ where \( L_{o}(\theta)\) is the likelihood under the observed data and \(L_{o,m}(\theta)\) is the likelihood under the complete data. Taking the expectation of the second line with respect to the conditional distribution of \(Y_{m}\) given \(Y_{o}\) and […]
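The monotonicity property can also be checked numerically. The sketch below is not taken from the post: it assumes a two-component Gaussian mixture with known unit variances, where the observed data \(Y_{o}\) are the points and the missing data \(Y_{m}\) are the unobserved component labels. All function names and the simulated data are illustrative assumptions. It records \(\log L_{o}(\theta)\) after every EM iteration so the trace can be inspected.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def observed_loglik(data, weight, mu1, mu2, sd=1.0):
    """Observed-data log-likelihood log L_o(theta) of the mixture."""
    return sum(
        math.log(weight * normal_pdf(x, mu1, sd)
                 + (1.0 - weight) * normal_pdf(x, mu2, sd))
        for x in data
    )


def em_step(data, weight, mu1, mu2, sd=1.0):
    """One EM iteration (variances are assumed known and fixed)."""
    # E-step: posterior probability that each point came from component 1.
    resp = []
    for x in data:
        a = weight * normal_pdf(x, mu1, sd)
        b = (1.0 - weight) * normal_pdf(x, mu2, sd)
        resp.append(a / (a + b))
    # M-step: maximise the expected complete-data log-likelihood.
    n1 = sum(resp)
    n2 = len(data) - n1
    new_weight = n1 / len(data)
    new_mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    new_mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
    return new_weight, new_mu1, new_mu2


def run_em(data, n_iter=25):
    """Run EM and return the observed log-likelihood after each iteration."""
    weight, mu1, mu2 = 0.5, min(data), max(data)  # crude starting values
    trace = [observed_loglik(data, weight, mu1, mu2)]
    for _ in range(n_iter):
        weight, mu1, mu2 = em_step(data, weight, mu1, mu2)
        trace.append(observed_loglik(data, weight, mu1, mu2))
    return trace


# Simulated data (an assumption for illustration only).
rng = random.Random(0)
data = ([rng.gauss(-2.0, 1.0) for _ in range(60)]
        + [rng.gauss(3.0, 1.0) for _ in range(40)])
trace = run_em(data)

# Each entry should be at least as large as the previous one,
# up to floating-point rounding.
is_monotone = all(b >= a - 1e-9 for a, b in zip(trace, trace[1:]))
```

In this toy setting `is_monotone` should evaluate to `True`; a decreasing trace would usually point to a bug in the E-step or M-step rather than to a failure of the theory.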

The “sampling from an infinite population” metaphor beloved by statisticians of all types is a disaster for reproducible science. To explain why, I’ll show what sampling from a finite population has going for it that’s not there ...

Someone who doesn't want his name shared (for the perhaps reasonable reason that he'll "one day not be confused, and would rather my confusion not live on online forever") writes: I'm exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally […]

The joint paper, written with Gery Geenens and Davy Paindaveine, entitled “Probit transformation for nonparametric kernel estimation of the copula density” is now online at http://arxiv.org/abs/1404.4414. “Copula modelling has become ubiquitous in modern statistics. Here, the problem of nonparametrically estimating a copula density is addressed. Arguably the most popular nonparametric density estimator, the kernel estimator is not suitable for the unit-square-supported copula densities, mainly because it is heavily affected by boundary bias issues. In addition, most…

We use R to take a very brief look at the distribution of e-book sales on Amazon.com. Recently Hugh Howey shared some eBook sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey's analysis tries to break sales down by declared category and […]

As I often manipulate time series from different sources, I rarely come across the same date format twice. Having to reformat the dates every time is a real waste of time because I never remember the syntax of the as.Date function. I put below a few examples that turn strings into standard R date format. […]

Someone writes: Suppose I have two groups of people, A and B, which differ on some characteristic of interest to me; and for each person I measure a single real-valued quantity X. I have a theory that group A has a higher mean value of X than group B. I test this theory by using […]

I will be giving a training session on Monday the 28th, from 14:00 to 16:00, in room N-6320 at UQAM, on the topic of an introduction to classification trees. The session is part of the seminars on quantitative and qualitative analysis methods, which have been held regularly for a little over a month and are run by the Collectif pour le développement et les applications en mesure et évaluation (Cdame). The slides are available as a PDF (there are a few animations,…

The New York Times recently published an article on education titled "Parental Involvement Is Overrated". Most research in this area supports the opposite view, but the authors claim that "evidence from our research suggests otherwise". Before you stop helping your children … Continue reading →

Nelson Villoria writes: I find the multilevel approach very useful for a problem I am dealing with, and I was wondering whether you could point me to some references about poolability tests for multilevel models. I am working with time series of cross sectional data and I want to test whether the data supports cross […]

How is it possible that it has taken a podcast called Data Stories 35 episodes to get to the topic of data storytelling? Alberto Cairo and I helped get the topic straightened out, and I think we even convinced Moritz that stories are not the enemy of exploration. It was a fun episode to record, and it touches on many interesting topics.

Ethan Siegel wrote a post entitled The Math of the Fastest Human Alive five years ago, using regressions. An alternative is to use extreme value models (I wrote a post a few years ago on the maximum length of a tennis match using extreme value theory). In 2009, John Einmahl and Sander Smeets wrote a great article entitled ultimate 100m world records through extreme-value theory. The article is…

Here I derive a simple formula for probability distributions general enough for Statistical Mechanics and Classical Statistics in which the roles, meanings, and interpretations between the Information Entropy and Boltzmann’s Entropy are as clear ...