Testing the assumption that the errors in a regression model are normally distributed is a standard pastime in econometrics. We use this assumption when we construct standard confidence intervals for, or test hypotheses about, t...

The Stan Model of the Week showcases research using Stan to push the limits of applied statistics. If you have a model that you would like to submit for a future post, send us an email. Our inaugural post comes from Nathan Sanders, a graduate student finishing up his thesis on astrophysics at Harvard.

Ooooooh, I never ever thought I'd have a legitimate excuse to tell this story, and now I do! The story took place many years ago, but first I have to tell you what made me think of it: Rasmus Bååth posted the following comment last month: On airplane tickets a Swedish "å" is written as

Kaggle competitions are potentially pretty cool. Kaggle supplies in-sample data ("training data"), and you build a model and forecast out-of-sample data that they withhold ("test data"). The winner gets a significant prize, often $100,000.00 or mo...

I have previously written about the scope of local and global variables in the SAS/IML language. You might wonder whether SAS/IML modules can also have local scope. The answer is no. All SAS/IML modules are known globally and can be called by any other module. Some object-oriented programming languages support […]

My article on whether we can trust airfare prediction models is published today at FiveThirtyEight, the new data journalism venture launched by Nate Silver after he moved to ESPN. This topic was originally conceived as a chapter of Numbersense but I dropped it. As I have noted in my review of Nate Silver's book, he has a keen interest in evaluating predictions, and not surprisingly, he encouraged me to…

From 2006: Nassim Taleb's publisher sent me a copy of "Fooled by randomness: the hidden role of chance in life and the markets" to review. It's an important topic, and the book is written in a charming style; I'll try to respond in kind, with some miscellaneous comments. On the cover of the book is a

Four years ago, many of us were glued to the “spill cam” showing, in real time, the oil gushing after the April 20, 2010 explosion that sank the Deepwater Horizon oil rig in the Gulf of Mexico, killing 11 and spewing oil until July 15 (see video clip that was added below). Remember junk shots, top kill, blowout preventers? [1] The EPA has […]

Here the monotonicity of the EM algorithm is established. $$ f_{o}(Y_{o}|\theta)=\frac{f_{o,m}(Y_{o},Y_{m}|\theta)}{f_{m|o}(Y_{m}|Y_{o},\theta)} $$ $$ \log L_{o}(\theta)=\log L_{o,m}(\theta)-\log f_{m|o}(Y_{m}|Y_{o},\theta) $$ where \(L_{o}(\theta)\) is the likelihood under the observed data and \(L_{o,m}(\theta)\) is the likelihood under the complete data. Taking the expectation of the second line with respect to the conditional distribution of \(Y_{m}\) given \(Y_{o}\) and

The “sampling from an infinite population” metaphor beloved by statisticians of all types is a disaster for reproducible science. To explain why, I’ll show what sampling from a finite population has going for it that’s not there ...

Someone who doesn't want his name shared (for the perhaps reasonable reason that he'll "one day not be confused, and would rather my confusion not live on online forever") writes: I'm exploring HLMs and stan, using your book with Jennifer Hill as my field guide to this new territory. I think I have a generally

The joint paper, written with Gery Geenens and Davy Paindaveine, entitled “Probit transformation for nonparametric kernel estimation of the copula density”, is now online at http://arxiv.org/abs/1404.4414: “Copula modelling has become...

We use R to take a very brief look at the distribution of e-book sales on Amazon.com. Recently Hugh Howey shared some e-book sales data spidered from Amazon.com: The 50k Report. The data is largely a single scrape of statistics about various anonymized books. Howey's analysis tries to break sales down by declared category and

As I often manipulate time series from different sources, I rarely come across the same date format twice. Having to reformat the dates every time is a real waste of time because I never remember the syntax of the as.Date function. I put below a few examples that turn strings into standard R date format. […]

Someone writes: Suppose I have two groups of people, A and B, which differ on some characteristic of interest to me; and for each person I measure a single real-valued quantity X. I have a theory that group A has a higher mean value of X than group B. I test this theory by using