## Simulate lognormal data with specified mean and variance

June 4, 2014

In my book Simulating Data with SAS, I specify how to generate lognormal data with a shape and scale parameter. The method is simple: you use the RAND function to generate X ~ N(μ, σ), then compute Y = exp(X). The random variable Y is lognormally distributed with parameters μ […]
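The excerpt works with the parameters μ and σ of the underlying normal, while the title asks for a specified mean and variance of Y. A minimal Python sketch of that conversion follows. It relies on the standard lognormal moment identities E[Y] = exp(μ + σ²/2) and Var[Y] = (exp(σ²) − 1)·exp(2μ + σ²), and it is not the SAS code from the post. The function names `lognormal_params` and `simulate_lognormal` are hypothetical.

```python
import math
import random


def lognormal_params(mean, variance):
    """Return (mu, sigma) of the underlying normal X so that
    Y = exp(X) has the requested mean and variance."""
    # Invert the moment identities:
    #   sigma^2 = ln(1 + variance / mean^2)
    #   mu      = ln(mean) - sigma^2 / 2
    sigma2 = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def simulate_lognormal(n, mean, variance, seed=None):
    """Draw n lognormal values: generate X ~ N(mu, sigma), then Y = exp(X)."""
    rng = random.Random(seed)
    mu, sigma = lognormal_params(mean, variance)
    return [math.exp(rng.gauss(mu, sigma)) for _ in range(n)]


if __name__ == "__main__":
    sample = simulate_lognormal(10000, mean=80.0, variance=225.0, seed=1)
    sample_mean = sum(sample) / len(sample)
    print("sample mean:", sample_mean)
```

The sample mean should land close to the requested mean, with the usual Monte Carlo error that shrinks as n grows.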

## Machine Learning and Applied Statistics Lesson of the Day – How to Construct Receiver Operating Characteristic Curves


A receiver operating characteristic (ROC) curve is a 2-dimensional plot of the sensitivity (the true positive rate) versus 1 minus the specificity (1 minus the true negative rate) of a binary classifier while varying its discrimination threshold. In statistics and machine learning, a basic and popular tool for binary classification is logistic regression, and an ROC curve is a useful way to assess the predictive accuracy […]
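As a rough illustration of the construction described above, the sketch below sweeps the discrimination threshold over the observed scores and records one (false positive rate, true positive rate) point per distinct threshold. It is a plain-Python sketch and not the procedure from the post. The names `roc_points` and `auc` are hypothetical, and labels are assumed to be coded 1 (positive) and 0 (negative).

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) points from (0, 0) to (1, 1), one per distinct
    score threshold. A case is predicted positive when score >= threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # Lowering the threshold admits cases in order of descending score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    pts = roc_points(labels, scores)
    print(pts)
    print("AUC:", auc(pts))
```

In practice the scores would be fitted probabilities from a model such as logistic regression; here they are made-up numbers for illustration only.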

## Did you buy laundry detergent on their most recent trip to the store? Also comments on scientific publication and yet another suggestion to do a study that allows within-person comparisons

June 3, 2014

Please answer the above question before reading on . . . I’m curious after reading Leif Nelson’s report that, based on research with Minah Jung, approximately 42% of the people they surveyed said they bought laundry detergent on their most recent trip to the store. I’m stunned that the number is so high. 42%??? That’s […] The post Did you buy laundry detergent on their most recent trip to the…

## The pleasure of walking

June 3, 2014

The proverb goes: walk before you run. My latest contribution to Harvard Business Review (link) makes the point that many websites can improve their user experience by focusing on simple personalization measures, like showing me my shirt size. Recommendation engines based on machine-learning algorithms still have a long way to go. I ran across a number of obstacles in my recent travels, which again highlight the value of getting the basics down.…

## Post-Piketty Lessons

June 3, 2014

The latest crisis in data analysis comes to us (once again) from the field of economics. Thomas Piketty, a French economist, recently published a book titled Capital in the 21st Century that has been a best-seller. I have not read … Continue reading →

## Video Tutorial – Useful Relationships Between Any Pair of h(t), f(t) and S(t)


I first started my video tutorial series on survival analysis by defining the hazard function. I then explained how this definition leads to the elegant relationship of $h(t) = f(t)/S(t)$. In my new video, I derive 6 useful mathematical relationships that exist between any 2 of the 3 quantities in the above equation. Each relationship allows one quantity […]
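For reference, the standard pairwise relationships among the hazard function h(t), the density f(t), and the survival function S(t) can be written as below. This is a sketch that assumes a continuous, non-negative survival time with S(0) = 1; the six relationships derived in the video may be stated or ordered differently.

```latex
\begin{align*}
h(t) &= \frac{f(t)}{S(t)}, &
h(t) &= -\frac{d}{dt}\ln S(t), \\
S(t) &= \exp\!\left(-\int_0^t h(u)\,du\right), &
S(t) &= \int_t^\infty f(u)\,du, \\
f(t) &= -\frac{dS(t)}{dt}, &
f(t) &= h(t)\exp\!\left(-\int_0^t h(u)\,du\right).
\end{align*}
```

Each identity expresses one quantity in terms of exactly one of the other two, so knowing any one of h(t), f(t), or S(t) determines the remaining pair.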

## Skimming statistics papers for the ideas (instead of the complete procedures)

June 2, 2014

Been reading a lot of Gelman, Carlin, Stern, Dunson, Vehtari, and Rubin's “Bayesian Data Analysis” (3rd edition) lately. Overall, in the Bayesian framework some ideas (such as regularization and imputation) are way easier to justify (though calculating some seemingly basic quantities becomes tedious). A big advantage (and weakness) of this formulation is that statistics has a much […] Related posts: Checking claims in published statistics papers Data Science, Machine Learning, and Statistics:…

## How does Practical Data Science with R stand out?

June 2, 2014

There are a lot of good books on statistics, machine learning, analytics, and R. So it is valid to ask: how does Practical Data Science with R stand out? Why should a data scientist or an aspiring data scientist buy it? We admit, it isn’t the only book we own. Some relevant books from the […] Related posts: A bit of the agenda of Practical Data Science with R Data…

## Swallowing the Bitter Pill: England, the Premier League and the World Cup

June 2, 2014

Discussions abound about England’s chances at the 2014 edition of the World Cup. For a country which has produced elite football players such as Gary Neville, John Terry and Paul Scholes (and yes, David Beckham), there isn’t a lot of optimism ...

## Collaborative lesson development with GitHub

June 2, 2014

If you're doing any kind of scientific computing and not using version control, you're doing it wrong. The git version control system and GitHub, a web-based service for hosting and collaborating on git-controlled projects, have both become wildly popu...

## Why we hate stepwise regression

June 2, 2014

Haynes Goddard writes: I have been slowly working my way through the grad program in stats here, and the latest course was a biostats course on categorical and survival analysis. I noticed in the semi-parametric and parametric material (Wang and Lee is the text) that they use stepwise regression a lot. I learned in econometrics […] The post Why we hate stepwise regression appeared first on Statistical Modeling, Causal Inference,…

## On deck this week

June 2, 2014

Mon: Why we hate stepwise regression Tues: Did you buy laundry detergent on their most recent trip to the store? Also comments on scientific publication and yet another suggestion to do a study that allows within-person comparisons Wed: All the Assumptions That Are My Life Thurs: Identifying pathways for managing multiple disturbances to limit plant […] The post On deck this week appeared first on Statistical Modeling, Causal Inference, and…

## Missing data, mysterious order, reverse causation wipes out a simple theory

June 2, 2014

New York Times columnist Floyd Norris published a set of charts purporting to show that the housing market in the U.S. is on the mend. Not so quick, Floyd. His theory - originating from an economist at Hanley Wood, a...

## Specify formats when you write vectors to a data set

June 2, 2014

Sometimes you have data in SAS/IML vectors that you need to write to a SAS data set. By default, no formats are associated with the variables that you create from SAS/IML vectors. However, some variables (notably dates, times, and datetimes) should have formats associated with the data values. You can […]

## Aerial Views

June 2, 2014

Depicting reality with photographs has a long tradition: On Wikimedia Commons the Swiss National Library published a series of old and … Continue reading →

## Autocorrelation in project Tycho’s measles data

June 1, 2014

Project Tycho includes data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested. I have looked at Project Tycho's measles data before, general look, incidence,...

## Jessica Tracy and Alec Beall (authors of the fertile-women-wear-pink study) comment on our Garden of Forking Paths paper, and I comment on their comments

May 31, 2014

Jessica Tracy and Alec Beall, authors of that paper that claimed that women at peak fertility were more likely to wear red or pink shirts (see further discussion here and here), and then a later paper that claimed that this happens in some weather but not others, just informed me that they have posted a […] The post Jessica Tracy and Alec Beall (authors of the fertile-women-wear-pink study) comment on…

## Loading IP Test Data Into Postgres

May 31, 2014

Recently, I was trolling around the internet looking for some IP address data to play with. Fortunately, I stumbled across MaxMind's Geolite Database, which is available for free. All I have to do is include this notice: This product ...

## More on Piketty — Oh God No, Please, No…

May 31, 2014

Piketty, Piketty, Piketty! How did the Piketty phenomenon happen? Surely Piketty must be one of the all-time great economists. Maybe even as great as Marx. Yes, parts of the emerging backlash against Piketty's Capital resonate with me. Guido M...

## Trimming the Fat from glm() Models in R

May 30, 2014

One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to […] Related posts: Generalized linear models for predicting rates Bad Bayes: an example of why…

## I posted this as a comment on a sociology blog

May 30, 2014

I discussed two problems: 1. An artificial scarcity applied to journal publication, a scarcity which I believe is being enforced based on a monetary principle of not wanting to reduce the value of publication. The problem is that journals don’t just spread information and improve communication, they also represent chits for hiring and promotion. I’d […] The post I posted this as a comment on a sociology blog appeared first…

## Step-by-Step Guide to Setting Up an R-Hadoop System

May 30, 2014

by Yanchang Zhao, RDataMining.com. Following my first R-Hadoop system setup guide written in Sept 2013, I have further tested setting up a Hadoop system for running R code, as well as using HBase. I have tested it both on a … Continue reading →

## Mmm, statistical significance . . . Evilicious!

May 30, 2014

Just in case you didn’t check Retraction Watch yet today, Carolyn Johnson reports: The committee painstakingly reconstructed the process of data analysis and determined that Hauser had changed values, causing the result to be statistically significant, an important criterion showing that findings are probably not due to chance. As the man said: His resignation is […] The post Mmm, statistical significance . . . Evilicious! appeared first on Statistical Modeling,…