Category: Python

Predicted distribution of Mersenne primes

Mersenne primes are prime numbers of the form 2^p – 1. It turns out that if 2^p – 1 is a prime, so is p; the requirement that p is prime is a theorem, not part of the definition. So far 51 Mersenne primes have been discovered [1]. Maybe that’s all there are, but it is […]
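As a rough illustration of the statement above, here is a minimal Python sketch that finds small exponents p for which 2^p – 1 is prime. It uses the Lucas-Lehmer test, a standard method that the excerpt itself does not mention, and the function names are made up for this example.

```python
def is_prime(n):
    """Trial division; adequate for the small numbers used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime. Assumes p itself is prime."""
    if p == 2:
        return True  # 2**2 - 1 = 3; the recurrence below needs an odd p
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


def mersenne_exponents(limit):
    """Exponents p <= limit for which 2**p - 1 is a Mersenne prime."""
    return [p for p in range(2, limit + 1) if is_prime(p) and lucas_lehmer(p)]


if __name__ == "__main__":
    print(mersenne_exponents(130))
```

Only prime exponents are tested, which is justified by the theorem quoted above: a composite exponent can never give a Mersenne prime.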

String interpolation in Python and R

One of the things I liked about Perl was string interpolation. If you use a variable name in a string, the variable will expand to its value. For example, if you have a variable $x which equals 42, then the string “The answer is $x.” will expand to “The answer is 42.” Perl requires variables to […]
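The excerpt is cut off before it reaches Python, so the following is only a sketch of the standard Python counterparts to the Perl example, not necessarily what the post goes on to show. It uses f-strings, `str.format`, and `string.Template`, whose `$x` syntax is the closest to Perl.

```python
from string import Template

x = 42

# f-string: the expression inside the braces is evaluated in place.
with_fstring = f"The answer is {x}."

# str.format: placeholders are filled from the arguments.
with_format = "The answer is {}.".format(x)

# string.Template: uses $name placeholders, much like the Perl example.
with_template = Template("The answer is $x.").substitute(x=x)

if __name__ == "__main__":
    print(with_fstring)
    print(with_format)
    print(with_template)
```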

New Getting Started with vtreat Documentation

Win Vector LLC’s Dr. Nina Zumel has just released some new vtreat documentation. vtreat is an all-in-one step data preparation system that helps defend your machine learning algorithms from: missing values, large cardinality categorical variables, and novel levels from categorical variables. I hoped she could get the Python vtreat documentation up to parity with … Continue reading New Getting Started with vtreat Documentation

Regular expressions and special characters

Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular […]
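A small sketch of the three views described above, using `\b` as the example. The sample text and variable names are assumptions chosen for illustration. In an ordinary string literal Python turns `\b` into a backspace character before the regular expression engine ever sees it, while in a raw string the engine receives a backslash followed by `b` and reads it as a word boundary.

```python
import re

text = "cat concat cat."

plain = "\bcat\b"   # Python sees two backspace characters around "cat"
raw = r"\bcat\b"    # the regex engine sees two word-boundary assertions

plain_matches = re.findall(plain, text)
raw_matches = re.findall(raw, text)

# re.escape goes the other way: it makes characters that are special to the
# regex engine literal, so "." no longer matches an arbitrary character.
literal_pattern = re.escape("3.14")

if __name__ == "__main__":
    print(len(plain), len(raw))
    print(plain_matches, raw_matches)
    print(literal_pattern)
```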

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and rqdatatable). Introduction Parts of the … Continue reading Introducing data_algebra

What is vtreat?

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner. vtreat takes an input DataFrame that has a specified column called “the outcome variable” (or “y”) that is the quantity to be predicted (and must not have missing values). Other input columns are possible … Continue reading What is vtreat?

vtreat up on PyPi

I am excited to announce vtreat is now available for Python on PyPi, in addition to R on CRAN. vtreat is: A data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. vtreat prepares variables so that data has fewer exceptional cases, making it easier to safely use models in production. … Continue reading vtreat up on PyPi

Lord Kelvin, Data Scientist

In 1876 A. Légé & Co., 20 Cross Street, Hatton Gardens, London completed the first “tide calculating machine” for William Thomson (later Lord Kelvin) (ref). [Image: Thomson’s (Lord Kelvin) First Tide Predicting Machine, 1876] The results were plotted on the paper cylinders, and one literally “turned the crank” to perform the calculations. The tide calculating machine … Continue reading Lord Kelvin, Data Scientist

Angles in the spiral of Theodorus

The previous post looked at how to plot the spiral of Theodorus shown below. We stopped the construction where we did because the next triangle to be added would overlap the first triangle, which would clutter the image. But we could certainly have kept going. If we do keep going, then the set of hypotenuse […]
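To make the overlap remark concrete, here is a minimal sketch. It assumes the standard construction, in which the k-th triangle has legs √k and 1, so its angle at the center is arctan(1/√k). The function names are invented for this example, and the code is not taken from the post.

```python
import math
from itertools import accumulate


def hypotenuse_angles(n):
    """Cumulative angles (radians) of the hypotenuses of the first n triangles."""
    return list(accumulate(math.atan(1 / math.sqrt(k)) for k in range(1, n + 1)))


def first_overlapping_triangle():
    """Index of the first triangle whose hypotenuse passes a full turn."""
    total = 0.0
    k = 0
    while total <= 2 * math.pi:
        k += 1
        total += math.atan(1 / math.sqrt(k))
    return k


if __name__ == "__main__":
    print([round(math.degrees(a), 2) for a in hypotenuse_angles(17)])
    print(first_overlapping_triangle())
```

Under this assumption the cumulative angle stays below a full turn through the 16th triangle and passes it with the 17th.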

PyCharm Video Review

My basic video review of the PyCharm integrated development environment for Python with Anaconda and Jupyter/iPython integration. I like the IDE extensions enough to pay for them early in my evaluation. Highly recommended for data science projects, at…

A Comment on Data Science Integrated Development Environments

A point that differs from our experience struck us in the recent note: “A development environment specifically tailored to the data science sector on the level of RStudio, for example, does not (yet) exist.” (“What’s the Best Statistical Software? A Comparison of R, Python, SAS, SPSS and STATA,” Amit Ghosh.) Actually, Python has a large … Continue reading A Comment on Data Science Integrated Development Environments

Distribution of quadratic residues

Let p be an odd prime number. If the equation x² = n mod p has a solution then n is a square mod p, or in classical terminology, n is a quadratic residue mod p. Half of the numbers between 0 and p are quadratic residues and half are not. The residues are distributed […]
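A short sketch of the counting claim, using hypothetical function names. It lists the quadratic residues mod p by squaring every nonzero x, and cross-checks them against Euler's criterion, which says a nonzero n is a residue exactly when n^((p-1)/2) = 1 mod p. The criterion is a standard fact that the excerpt does not state.

```python
def quadratic_residues(p):
    """Nonzero squares mod p, for an odd prime p."""
    return sorted({(x * x) % p for x in range(1, p)})


def is_residue_euler(n, p):
    """Euler's criterion for a nonzero n mod an odd prime p."""
    return pow(n, (p - 1) // 2, p) == 1


if __name__ == "__main__":
    p = 11
    residues = quadratic_residues(p)
    print(residues)                         # the residues mod 11
    print(len(residues) == (p - 1) // 2)    # exactly half of 1..p-1
```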

Free Video Lecture: Vectors for Programmers and Data Scientists

We have just released two new free video lectures on vectors from a programmer’s point of view. I am experimenting with which ideas programmers find interesting about vectors, which concepts they consider safe starting points, and how to condense and present the material. Please check the lectures out. Vectors for Programmers and Data … Continue reading Free Video Lecture: Vectors for Programmers and Data Scientists