If you've ever started a book and not finished it, it may comfort you to know that you are not alone. It's hard to get accurate estimates of the percentage of books that are discontinued, but the rise of e-reading (and… Continue reading →

Kaiser Fung shares this graph from Ritchie King: Kaiser writes: What they did right: - Did not put the data on a map - Ordered the countries by the most recent data point rather than alphabetically - Scale labels are found only on outer edge of the chart area, rather than one set per panel […] The post Small multiples of lineplots > maps (ok, not always, but yes in…

Editor's note: This is a guest post by Alyssa Frazee, a graduate student in the Biostatistics department at Johns Hopkins and a participant in the recent rOpenSci hackathon. Last week, I took a break from my normal PhD student schedule … Continue reading →

For those who weren't able to attend my recent talks, a few have surfaced online. *** JMP put up the video of the webcast from last Friday with Alberto Cairo, a data visualization expert and author of The Functional Art. You can access it from here. This event is part of their Analytically Speaking series with recent guests such as David Hand and Michael Schrage. I also appear on this…

Yesterday I blogged about the Hilbert matrix. The (i,j)th element of the Hilbert matrix has the value 1 / (i+j-1), which is the reciprocal of an integer. However, the printed Hilbert matrix did not look exactly like the formula because the elements print as finite-precision decimals. For example, the last […]
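As a small illustration of the formula above, here is a sketch in Python (not the language of the original post, and the function name `hilbert` is only an illustrative choice). It builds the matrix with exact rational entries, so each element prints as the reciprocal of an integer, and then shows one entry converted to a finite-precision decimal.

```python
from fractions import Fraction


def hilbert(n):
    """Return the n x n Hilbert matrix with exact entries 1 / (i + j - 1).

    Indices i and j are 1-based, as in the formula in the text.
    """
    return [[Fraction(1, i + j - 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]


H = hilbert(3)

# Exact form: every entry is the reciprocal of an integer.
for row in H:
    print("  ".join(str(x) for x in row))

# The same entry as a finite-precision decimal.
print(float(H[1][2]))
```

Keeping the entries as `Fraction` objects avoids the rounding that a decimal printout introduces, at the cost of slower arithmetic.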

I recently introduced the use of linear basis function models for supervised learning problems that involve non-linear relationships between the predictors and the target. A common type of basis function for such models is the Gaussian basis function. This type of model uses the kernel of the normal (or Gaussian) probability density function (PDF) as […]
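A minimal sketch of a Gaussian basis function, assuming the common form exp(-(x - mu)^2 / (2 s^2)), which is the kernel of the normal PDF with its normalizing constant dropped. The names `gaussian_basis`, `design_row`, `centers`, and `s` are illustrative assumptions, as is the leading constant term in the design row; the excerpt does not show the post's own notation.

```python
import math


def gaussian_basis(x, mu, s):
    """Kernel of the normal PDF: centre mu, width s, no normalizing constant."""
    return math.exp(-((x - mu) ** 2) / (2.0 * s ** 2))


def design_row(x, centers, s):
    """One row of a design matrix: a constant term plus one basis value per centre.

    The constant 1.0 (an intercept) is an assumption of this sketch.
    """
    return [1.0] + [gaussian_basis(x, mu, s) for mu in centers]


# Example: three hypothetical centres on [0, 1] with a shared width.
row = design_row(0.3, centers=[0.0, 0.5, 1.0], s=0.2)
print(row)
```

A linear model fitted on such rows is still linear in its coefficients, even though each basis value is a non-linear function of the predictor.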

Consider again an experiment that seeks to determine the causal relationships between factors and the response, where . Ideally, the sample size is large enough for a full factorial design to be used. However, if the sample size is small and the number of possible treatments is large, then a fractional factorial design can be used instead. Such a […]
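To make the contrast between the two designs concrete, here is a hedged sketch for two-level factors coded as -1 and +1. The half-fraction construction below (setting the last factor to the product of the others) is one standard choice and is an assumption here; the excerpt does not say which fraction or generator the post uses, and the function names are illustrative.

```python
from itertools import product


def full_factorial(k):
    """All 2**k runs for k two-level factors coded as -1 / +1."""
    return list(product((-1, 1), repeat=k))


def half_fraction(k):
    """A 2**(k-1) half fraction: the last factor is the product of the first k-1."""
    runs = []
    for base in product((-1, 1), repeat=k - 1):
        last = 1
        for level in base:
            last *= level
        runs.append(base + (last,))
    return runs


full = full_factorial(3)
half = half_fraction(3)
print(len(full), "runs in the full design")
print(len(half), "runs in the half fraction")
for run in half:
    print(run)
```

The half fraction needs half as many runs, but some effects become aliased with one another, which is the price paid for the smaller sample size.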

Sometime today, I got the idea to try to do automatic speech recognition. Speech recognition, even though it is widely used (and is on our phones), still seems kind of sci-fi-ish to me. The thought of running it on your own computer is still pretty exciting. I looked for open source libraries, and was pleasantly surprised to find Sphinx, a CMU project. It has Python bindings, and even lets you…

Many view the propensity theory of probabilities as something incompatible with Bayesian probabilities. Nothing could be further from the truth; it represents an elementary special case of that definition. To see this I’ll apply those Bayesian pr...

IPython notebooks have become a de facto standard for presenting Python-based analyses and talks, as evidenced by recent PyCon and PyData events. As anyone who has used them knows, they are great for “reproducible research”, presentations, and sharing via the nbviewer. There are extensions connecting IPython to R, Octave, Matlab, Mathematica, and SQL, among others. However, the […]

There’s a lot of free advice out there. I offer some of it myself! As I’ve written before (see this post from 2008 reacting to this advice from Dan Goldstein for business school students, and this post from 2010 reacting to some general advice from Nassim Taleb), what we see is typically presented as advice […] The post Advice: positive-sum, zero-sum, or negative-sum appeared first on Statistical Modeling, Causal Inference,…

There is now some serious soul-searching in the mainstream media about their (previously) breath-taking coverage of the Big Data revolution. I am collecting some useful links here for those interested in learning more. Here's my Harvard Business Review article in which I discussed the Science paper disclosing that Google Flu Trends, that key exhibit of the Big Data lobby, has systematically over-estimated flu activity for 100 out of the last…

There’s a theorem in statistics that says E[X̄] = E[X]. You could read this aloud as “the mean of the mean is the mean.” More explicitly, it says that the expected value of the average of some number of samples from some distribution is equal to the expected value of the distribution itself. The shorter reading is confusing […]
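The longer reading can be checked exactly on a small case. The sketch below assumes a fair six-sided die as the distribution and a sample size of 3 (both are illustrative choices, not taken from the post). It enumerates every equally likely sample, averages the sample means, and compares the result with the mean of the distribution.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # assumed distribution: a fair die
n = 3                        # assumed sample size

# Expected value of the distribution itself.
mu = Fraction(sum(faces), len(faces))

# Every possible sample of size n is equally likely, so the expected value
# of the sample mean is the plain average of the mean over all samples.
samples = list(product(faces, repeat=n))
expected_sample_mean = sum(Fraction(sum(s), n) for s in samples) / len(samples)

print("mean of the distribution:   ", mu)
print("expected value of the mean: ", expected_sample_mean)
```

Exact fractions are used so that the equality holds exactly and is not blurred by floating-point rounding.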

The Hilbert matrix is the most famous ill-conditioned matrix in numerical linear algebra. It is often used in matrix computations to illustrate problems that arise when you compute with ill-conditioned matrices. The Hilbert matrix is symmetric and positive definite, properties that are often associated with "nice" and "tame" matrices. The […]
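One way to see the ill-conditioning is to solve a linear system whose true solution is known. The sketch below is written in Python with only the standard library (not the language of the original post, and all function names are illustrative). It solves H x = b, with b chosen so that x should be all ones, once in floating point and once in exact rational arithmetic.

```python
from fractions import Fraction


def hilbert(n, exact):
    """n x n Hilbert matrix, as Fractions if exact else as floats."""
    one = Fraction(1) if exact else 1.0
    return [[one / (i + j - 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]


def solve(A, b):
    """Gaussian elimination with partial pivoting (works for float or Fraction)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]   # augmented matrix
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0] * n
    for k in reversed(range(n)):                   # back substitution
        s = sum(M[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (M[k][n] - s) / M[k][k]
    return x


def max_error(n, exact):
    """Largest deviation of the computed solution from the true all-ones vector."""
    H = hilbert(n, exact)
    b = [sum(row) for row in H]                    # row sums, so x = (1, ..., 1)
    return max(abs(xi - 1) for xi in solve(H, b))


for n in (3, 6, 10):
    print(n, float(max_error(n, exact=False)), max_error(n, exact=True))
```

The exact solve recovers the all-ones vector at every size, while the floating-point error grows quickly with n even though the matrix is symmetric and positive definite.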

We were reading “Out, Damned Spot: Can the ‘Macbeth effect’ be replicated?” (Earp, B., Everett, J., Madva, E., and Hamlin, J. 2014, in Basic and Applied Social Psychology 36: 91-8) in an informal gathering of our 6334 seminar yesterday afternoon at Thebes. Some of the graduate students are interested in so-called “experimental” philosophy, and I asked for an example that used statistics […]

Yesterday, on Twitter, @JF_Godbout shared a nice chart about the Quebec elections, showing the number of votes obtained (here as a percentage of total votes) and the percentage of seats that this yields. It must be said that yesterday, it...

District Data Labs is a new endeavor by members of the local data community (myself included) to increase educational outreach to that community on data-related topics through workshops and other media. We want District Data Labs to be an efficient learning resource for people who want to enhance and expand their analytical and […]

Recently, I was asked: Why do you not recommend Access to use? Just curious. Read on page xi of your intro in Data Analysis Using SQL and Excel. Just beginning a class in SQL and bought your text. Thanks, Mort. This is a very fair question and o...