## A blessing of dimensionality often observed in high-dimensional data sets

April 9, 2015

Tidy data sets have one observation per row and one variable per column.  Using this definition, big data sets can be either: Wide - a wide data set has a large number of measurements per observation, but fewer observations. This type of data set is typical in neuroimaging, genomics, and other biomedical applications. Tall - a

## What can be in an R data.frame column?

April 9, 2015

As an R programmer, have you ever wondered what can be in a data.frame column? The documentation is a bit vague, help(data.frame) returns some comforting text including: Value A data frame, a matrix-like structure whose columns may be of differing type...

April 9, 2015

This video on how to make it in academia was produced over 10 years ago by Steven Goodman for the ENAR Junior Researchers Workshop. Now the whole world can benefit from its wisdom. The movie features current and former JHU Biostatistics faculty, including Francesca Dominici, Giovanni Parmigiani, Scott Zeger, and Tom Louis. You don't want

## Why not statistics

April 9, 2015

Jordan Ellenberg’s parents were both statisticians. In his interview with Strongly Connected Components Jordan explains why he went into mathematics rather than statistics. I tried. I tried to learn some statistics actually when I was younger and it’s a beautiful subject. But at the time I think I found the shakiness of the philosophical underpinnings […]

## My favorite Neyman passage: on confidence intervals

April 9, 2015

I've been doing a lot of reading on confidence interval theory. Some of the reading is more interesting than others. There is one passage from Neyman's (1952) book "Lectures and Conferences on Mathematical Statistics and Probability" (available here) t...

## New research in tuberculosis mapping and control

April 9, 2015

Mapping and control. Or, as we would say, descriptive and causal inference. Jon Zelner informs us about two ongoing research projects: 1. TB Hotspot Mapping: Over the summer, I [Zelner] put together a really simple R package to do non-parametric disease mapping using the distance-based mapping approach developed by Caroline Jeffery and Al Ozonoff at […] The post New research in tuberculosis mapping and control appeared first on Statistical Modeling,…

## Health economic combat

April 9, 2015

A couple of weeks ago we decided to create a more formal website for our research group within the department of Statistical Science at UCL. The group includes the PhD students involved in health economic-related topics (basically all under my sup...

## Scala for Machine Learning [book review]

April 9, 2015

Nicolas, Patrick R. (2014) Scala for Machine Learning, Packt Publishing: Birmingham, UK. Full disclosure: I received a free electronic version of this book from the publisher for the purposes of review. There is clearly a market for a good book about using Scala for statistical computing, machine learning and data science. So when the publisher … Continue reading Scala for Machine Learning [book review]

## Classification with Categorical Variables (the fuzzy side)

April 9, 2015

$\frac{1}{n}\sum_{i=1}^n \widehat{Y}_i=\frac{1}{n}\sum_{i=1}^n Y_i$

The Gaussian and the (log) Poisson regressions share a very interesting property, i.e. the average predicted value is the empirical mean of our sample.

    > mean(predict(lm(dist~speed,data=cars)))
    [1] 42.98
    > mean(cars$dist)
    [1] 42.98

One can prove that it is also the prediction for the average individual in our sample

    > predict(lm(dist~speed,data=cars),
    +   newdata=data.frame(speed=mean(cars$speed)))
    42.98

The geometric interpretation is that the regression line passes through the centroid,

    > plot(cars)
    > abline(lm(dist~speed,data=cars),col="red")
    > abline(h=mean(cars$dist),col="blue")

…
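
For the Gaussian case, the identity in this excerpt can be sketched with the usual least-squares argument. The notation $x_i$, $\bar{x}$, $\bar{Y}$, $\widehat{\beta}_0$ and $\widehat{\beta}_1$ is introduced here for illustration and is not taken from the original post; the sketch assumes a simple linear model with an intercept.

$$\widehat{Y}_i=\widehat{\beta}_0+\widehat{\beta}_1 x_i,\qquad (\widehat{\beta}_0,\widehat{\beta}_1)=\arg\min_{\beta_0,\beta_1}\sum_{i=1}^n (Y_i-\beta_0-\beta_1 x_i)^2$$

Setting the derivative with respect to $\beta_0$ to zero gives

$$\sum_{i=1}^n (Y_i-\widehat{Y}_i)=0\quad\Longrightarrow\quad\frac{1}{n}\sum_{i=1}^n \widehat{Y}_i=\frac{1}{n}\sum_{i=1}^n Y_i$$

Averaging the fitted values then yields

$$\bar{Y}=\frac{1}{n}\sum_{i=1}^n\left(\widehat{\beta}_0+\widehat{\beta}_1 x_i\right)=\widehat{\beta}_0+\widehat{\beta}_1\bar{x}$$

so the prediction at $\bar{x}$ is $\bar{Y}$, which is the statement that the fitted line passes through the centroid $(\bar{x},\bar{Y})$. For a Poisson regression with log link and an intercept, the score equation for the intercept is also $\sum_{i=1}^n (Y_i-\widehat{Y}_i)=0$, which gives the first identity; the last step relies on the fitted values being linear in $x_i$, so it is specific to the Gaussian case.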

## Heads I win, tails you lose? Meehl and many Popperians get this wrong (about severe tests)!

April 9, 2015

[T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. If observation shows that the predicted effect is definitely absent, then the theory is simply refuted. The theory is incompatible with certain possible results of observation—in fact with results which everybody before […]

## Paperpile makes me more productive

April 9, 2015

One of the first things I tell my new research students is to use a reference management system to help them keep track of the papers they read, and to assist in creating bib files for their bibliography. Most of them use Mendeley, one or two use Zotero. Both do a good job and both are […]

## New video course: Campaign Response Testing

April 8, 2015

I am proud to announce a new Win-Vector LLC statistics video course: Campaign Response Testing John Mount, Win-Vector LLC This course works through the very specific statistics problem of trying to estimate the unknown true response rates one or more p...

## How can teachers of (large) online classes use text data from online learners?

April 8, 2015

Dustin Tingley sends along a recent paper (coauthored with Justin Reich, Jetson Leder-Luis, Margaret Roberts, and Brandon Stewart), which begins: Dealing with the vast quantities of text that students generate in a Massive Open Online Course (MOOC) is a daunting challenge. Computational tools are needed to help instructional teams uncover themes and patterns as MOOC […] The post How can teachers of (large) online classes use text data from online…

## Compute the rank of a matrix in SAS

April 8, 2015

A common question from statistical programmers is how to compute the rank of a matrix in SAS. Recall that the rank of a matrix is defined as the number of linearly independent columns in the matrix. (Equivalently, the number of linearly independent rows.) This article describes how to compute the […]
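
The definition above (the number of linearly independent columns, or equivalently rows) can be illustrated with a short sketch. This is not the SAS approach the article goes on to describe; it is a plain Gaussian elimination in Python using exact rational arithmetic, and the function name `matrix_rank` is hypothetical. For floating-point data, a tolerance-based method is usually preferred over exact elimination.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Return the rank of a matrix given as a list of equal-length rows.

    Illustrative sketch only: row-reduces an exact (Fraction) copy of the
    input and counts the pivots found.
    """
    # Work on an exact-arithmetic copy so the caller's data is not modified.
    m = [[Fraction(x) for x in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Look for a nonzero entry in this column at or below the pivot row.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot here: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, n_rows):
            if m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


if __name__ == "__main__":
    a = [[1, 2, 3],
         [2, 4, 6],   # twice the first row, so it adds nothing to the rank
         [1, 0, 1]]
    print(matrix_rank(a))
```

Applying the function to a matrix and to its transpose returns the same value, which matches the row/column equivalence stated in the excerpt.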

## an email exchange about integral representations

April 7, 2015

I had an interesting email exchange [or rather exchange of emails] with a (German) reader of Introducing Monte Carlo Methods with R in the past days, as he had difficulties with the validation of the accept-reject algorithm via the integral in that it took me several iterations [as shown in the above] to realise the […]

## Comparison of Bayesian predictive methods for model selection

April 7, 2015

This post is by Aki. We mention the problem of bias induced by model selection in A survey of Bayesian predictive methods for model assessment, selection and comparison, in Understanding predictive information criteria for Bayesian models, and in BDA3 Chapter 7, but we haven’t had a good answer how to avoid that problem (except by […] The post Comparison of Bayesian predictive methods for model selection appeared first on Statistical…

## Outside pissing in

April 7, 2015

Coral Davenport writes in the New York Times: Mr. Tribe, 73, has been retained to represent Peabody Energy, the nation’s largest coal company, in its legal quest to block an Environmental Protection Agency regulation that would cut carbon dioxide emissions from the nation’s coal-fired power plants . . . Mr. Tribe likened the climate change […] The post Outside pissing in appeared first on Statistical Modeling, Causal Inference, and Social…

April 7, 2015

Recently, I received an email from Ozan, who wrote: "I’ve a simple but not explicitly answered question within the text books on stationary series. I’m estimating a model with separate single equations (I don’t take into account the interactions a...

## The end of the oil glut

April 7, 2015

In my last post, I talked about how America had depressed oil prices by increasing its supply. Recall this graph which shows that the supply glut is primarily caused by increased American supply (the top pink line is America): Since low prices are mainly caused by American oversupply, a decrease in American supply will have […]

## And . . . our featured 2015 seminar speaker is . . . Thomas HOBBES!!!!!

April 7, 2015

Just in case you’ve forgotten where this all came from: This came in the departmental email awhile ago: CALL FOR APPLICATIONS: LATOUR SEMINAR — DUE DATE AUGUST 11 (extended) The Brown Institute for Media Innovation, Alliance (Columbia University, École Polytechnique, Sciences Po, and Panthéon-Sorbonne University), The Center for Science and Society, and The Faculty of […] The post And . . . our featured 2015 seminar speaker is . .…

## Planned redundancy

April 7, 2015

The following Wall Street Journal graphic caught my eye the other day: (Link to article) Looking closely, I realize that the four charts are identical, except for the call-outs. This is a kind of small-multiples in which the same data reside...

## Unsolicitors

April 7, 2015

This is probably just me being a bit grumpy, but I guess this happens to many people. I have just received an email (and it's not the first time) from a random scientific journal (this time it's a medical journal) inviting me to publish my research. Exc...