Special characters make text processing more complicated because you have to pay close attention to context. If you’re looking at Python code containing a regular expression, you have to think about what you see, what Python sees, and what the regular expression engine sees. A character may be special to Python but not to regular […]

# Author: John

## Munging CSV files with standard Unix tools

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools? Why CSV? In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard […]

## Three-digit zip codes and data privacy

Birth date, sex, and five-digit zip code are enough information to uniquely identify a large majority of Americans. See more on this here. So if you want to deidentify a data set, the HIPAA Safe Harbor provision says you should chop off the last to digits of a zip code. And even though three-digit zip […]

## Working with wide text files at the command line

Suppose you have a data file with obnoxiously long lines and you’d like to preview it from the command line. For example, the other day I downloaded some data from the American Community Survey and wanted to see what the files contained. I ran something like head data.csv to look at the first few lines […]

## Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by V(n) = K nβ where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 […]

## Estimating vocabulary size with Heaps’ law

Heaps’ law says that the number of unique words in a text of n words is approximated by V(n) = K nβ where K is a positive constant and β is between 0 and 1. According to the Wikipedia article on Heaps’ law, K is often between 10 and 100 and β is often between 0.4 […]

## Mickey Mouse, Batman, and conformal mapping

A conformal map between two regions in the plane preserves angles [1]. If two curves meet at a given angle in the domain, their images will meet at the same angle in the range. Two subsets of the plane are conformally equivalent if there is a conformal map between them. The Riemann mapping theorem says […]

## Star-crossed lovers

A story in The New Yorker quotes the following explanation from Arthur Eddington regarding the speed of light. Suppose that you are in love with a lady on Neptune and that she returns the sentiment. It will be some consolation for the melancholy separation if you can say to yourself at some—possibly prearranged—moment, “She is […]

## Contributing to open source projects

David Heinemeier Hansson presents a very gracious view of open source software in his keynote address at RailsConf 2019. And by gracious, I mean gracious in the theological sense. He says at one point “If I were a Christian …” implying that he is not, but his philosophy of software echos the Christian idea of […]

## Stone-Weierstrass on a disk

A couple weeks ago I wrote about a sort of paradox, that Weierstrass’ approximation theorem could seem to contradict Morera’s theorem. Weierstrass says that the uniform limit of polynomials can be an arbitrary continuous function, and so may have sharp creases. But Morera’s theorem says that the uniform limit of polynomials is analytic and thus […]

## Distribution of zip code population

There are three schools of thought regarding power laws: the naive, the enthusiasts, and the skeptics. Of course there are more than three schools of thought, but there are three I want to talk about. The naive haven’t heard of power laws or don’t know much about them. They probably tend to expect things to […]

## Landau kernel

The previous post was about the trick Lebesgue used to construct a sequence of polynomials converging to |x| on the interval [-1, 1]. This was the main step in his proof of the Weierstrass approximation theorem. Before that, I wrote a post on Bernstein’s proof that used his eponymous polynomials to prove Weierstrass’ theorem. This […]

## Lebesgue’s proof of Weierstrass’ theorem

A couple weeks ago I wrote about the Weierstrass approximation theorem, the theorem that says every continuous function on a closed finite interval can be approximated as closely as you like by a polynomial. The post mentioned above uses a proof by Bernstein. And in that post I used the absolute value function as an […]

## Proving that a choice was made in good faith

How can you prove that a choice was made in good faith? For example, if your company selects a cohort of people for random drug testing, how can you convince those who were chosen that they weren’t chosen deliberately? This is something I’ve helped companies with. It may be impossible to prove that a choice […]

## Detecting a short period in an RNG

The last couple posts have been looking at the Cliff random number generator. Yesterday I posted about testing the generator with the DIEHARDER test suite, the successor to George Marsaglia’s DIEHARD battery of RNG tests. This morning I discovered something about the Cliff RNG which led to discovering something about DIEHARDER.The latter is more important: […]

## Testing Cliff RNG with DIEHARDER

My previous post introduced the Cliff random number generator. The post showed how to find starting seeds where the generator will start out by producing approximately equal numbers. Despite this flaw, the generator works well by some criteria. I produced a file of 100 million 32-bit integers by multiplying the output values [1], which were […]

## Fixed points of the Cliff random number generator

I ran across the Cliff random number generator yesterday. Given a starting value x0 in the open interval (0, 1), the generator proceeds by xn+1 = | 100 log(xn) mod 1 | for n > 0. The article linked to above says that this generator passes a test of randomness based on generating points on […]

## Ease of learning vs relearning

Much more is written about how easy or hard some technology is to learn than about how hard it is to relearn. Maybe this is because people are more eager to write about something while the excitement or frustration of their first encounter is fresh. The ease of relearning a technology is under-rated. As you’re […]

## Uniform approximation paradox

What I’m going to present here is not exactly a paradox, but I couldn’t think of a better way to describe it in the space of a title. I’ll discuss two theorems about uniform convergence that seem to contradict each other, then show by an example why there’s no contradiction. Weierstrass approximation theorem One of […]

## Nearly parallel is nearly transitive

Let X, Y, and Z be three unit vectors. If X is nearly parallel to Y, and Y is nearly parallel to Z, then X is nearly parallel to Z. Here’s a proof. Think of X, Y, and Z as points on a unit sphere. Then saying that X and Y are nearly parallel means […]