Posts Tagged ‘data’

A quick introduction to Apache Spark for statisticians

February 8, 2017

Introduction Apache Spark is a Scala library for analysing "big data". It can be used for analysing huge (internet-scale) datasets distributed across large clusters of machines. The analysis can be anything from the computation of simple descriptive statistics associated with the datasets, through to rather sophisticated machine learning pipelines involving data pre-processing, transformation, nonlinear model …


Deep thinking about your data

February 3, 2017

In the ongoing series of posts about the IMDB dataset from Kaggle, I have so far looked at several of the scraped variables, including the number of faces on movie posters (1, 2), plot keywords (3), and movie rating by title year (4). In this post, I tackle the variables resulting from a data merge between IMDB and Facebook. These columns have names like "Director Facebook Likes", "Actor 1 Facebook…


February talks, and exploratory data analysis using visuals

January 30, 2017

News: In February, I am bringing my dataviz lecture to various cities: Atlanta (Feb 7), Austin (Feb 15), and Copenhagen (Feb 28). Click on the links for free registration. I hope to meet some of you there. *** On the...


Pre-processing data is not just about correcting errors

January 30, 2017

Exploration of IMDB rating data, by Kaiser Fung, founder of Principal Analytics Prep


Apparently Hollywood does not recycle action-movie plots. The data said so, so it must be right

January 25, 2017

Today I continue to explore the movie dataset, found on Kaggle. To catch up with previous work, see the blog posts 1 and 2. One of the students came up with an interesting problem. Among the genre of action movies, are there particular plot elements that are correlated with box office? This problem is solvable because the dataset contains a variable called "plot keywords" lifted from IMDB. Plot keywords are…


Numbersense and government accountability in the new political reality

January 24, 2017

You've heard me say often, numbersense is the most important quality for good data analysts; little did I know that numbersense would become the new requirement for healthy American democracy. From the first day in office, the new President is at war with numbers (over attendance figures at his inauguration). But I believe that getting to the bottom of data-driven claims is a bi-partisan issue: while it is obvious that…


Good models + Bad data = Bad analysis

January 18, 2017

Example showing how to diagnose bad data in data science models


Chopped legs, and abridged analyses

December 27, 2016

Reader Glenn T. was not impressed by the graphical talent on display in the following column chart (and others) in a Monkey Cage post in the Washington Post: Not starting column charts at zero is like having one's legs chopped...


ASA President meets OCCAM data

December 27, 2016

Just leaving this quote from ASA President Jessica Utts here (Source: Amstat News Dec 2016): A few days ago, I was in Vietnam and took a four-hour bus ride from Ha Long Bay to Hanoi. When I arrived, my fitness tracker had given me credit for taking 9,124 steps and climbing 81 flights of stairs during those four hours, even though I only left my seat once during a short…


Books on Scala for statistical computing and data science

December 22, 2016

Introduction People regularly ask me about books and other resources for getting started with Scala for statistical computing and data science. This post will focus on books, but it’s worth briefly noting that there are a number of other resources available, on-line and otherwise, that are also worth considering. I particularly like the Coursera course …


