Category: Opinion

New Getting Started with vtreat Documentation

Win Vector LLC‘s Dr. Nina Zumel has just released some new vtreat documentation. vtreat is a an all-in one step data preparation system that helps defend your machine learning algorithms from: Missing values Large cardinality categorical variables Novel levels from categorical variables I hoped she could get the Python vtreat documentation up to parity with … Continue reading New Getting Started with vtreat Documentation

Why R?

I was working with our copy editor on Appendix A of Practical Data Science with R, 2nd Edition; Zumel, Mount; Manning 2019, and ran into this little point (unfortunately) buried in the back of the book. In our opinion the R ecosystem is the fastest path to substantial data science, statistical, and machine learning accomplishment. … Continue reading Why R?

It is Time for CRAN to Ban Package Ads

NPM (a popular Javascript package repository) just banned package advertisements. I feel the CRAN repository should do the same. Not all R-users are fully aware of package advertisements. But they clutter up work, interfere with reproducibility, and frankly are just wrong. For example, here is the advertisement code from ggplot2: .onAttach <- function(…) { withr::with_preserve_seed({ … Continue reading It is Time for CRAN to Ban Package Ads

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable). Introduction Parts of the … Continue reading Introducing data_algebra

vtreat up on PyPi

I am excited to announce vtreat is now available for Python on PyPi, in addition for R on CRAN. vtreat is: A data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. vtreat prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. … Continue reading vtreat up on PyPi

Florence Nightingale, Data Scientist

Florence Nightingale, Data Scientist.
In 1858 Florence Nightingale published her now famous “rose diagram” breaking down causes of mortality.

By w:Florence Nightingale (1820–1910). – http://www.royal.gov.uk/output/Page3943.asp [dea…

Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref). Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876 The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations. The tide calculating machine … Continue reading Lord Kelvin, Data Scientist

PyCharm Video Review

My basic video review of the PyCharm integrated development environment for Python with Anaconda and Jupyter/iPython integration. I like the IDE extensions enough to pay for them early in my evaluation. Highly recommended for data science projects, at…

A Comment on Data Science Integrated Development Environments

A point that differs from our experience struck us in the recent note: A development environment specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist. “What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA” Amit Ghosh Actually, Python has a large … Continue reading A Comment on Data Science Integrated Development Environments

R Books Discount!

We, the community of Manning R and data science authors, have talked Manning into offering a catalog-wide 40% discount on all books. Please take a look at some great deals on some great technical books here: http://mng.bz/adRj !

My Favorite data.table Feature

My favorite R data.table feature is the “by” grouping notation when combined with the := notation.
Let’s take a look at this powerful notation.

First, let’s build an example data.frame.

d <- wrapr::build_frame(
"gr…

Not Always C++’s Fault

From the recent developer.r-project.org “Staged Install” article: Incidentally, there were just two distinct (very long) lists of methods in the warnings across all installed packages in my run, but repeated for many packages. It turned out that they were lists of exported methods from dplyr and rlang packages. These two packages take very long to … Continue reading Not Always C++’s Fault