Category: Computing

Why would anyone do that?

There are tools that I’ve used occasionally for many years that I’ve just started to appreciate lately. “Oh, that’s why they did that.” When you see something that looks poorly designed, don’t just exclaim “Why would anyone do that?!” but ask sincerely “Why would someone do that?” There’s probably a good reason, or at least […]

Progress on the Collatz conjecture

The Collatz conjecture is for computer science what until recently Fermat’s last theorem was for mathematics: a famous unsolved problem that is very simple to state. The Collatz conjecture, also known as the 3n+1 problem, asks whether the following function terminates for all positive integer arguments n. def collatz(n): if n == 1: return 1 […]

How UTF-8 works

UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes. The problem UTF-8 solves US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode […]

Excel, R, and Unicode

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.
Say the CSV file looked like this:

I read the file into R with…

Quiet mode

When you start a programming language like Python or R from the command line, you get a lot of initial text that you probably don’t read. For example, you might see something like this when you start Python. Python 2.7.6 (default, Nov 23 2017, 15:49:48) [GCC 4.8.4] on linux2 Type “help”, “copyright”, “credits” or “license” […]

More bc weirdness

As I mentioned in a footnote to my previous post, I just discovered that variable names in the bc programming language cannot contain capital letters. I think I understand why: Capital letters are reserved for hexadecimal constants, though in a weird sort of way. At first variable names in bc could only be one letter […]

Prefix code examples

In many offices, you can dial a three digit number to reach someone else in the office. In such offices, you usually have to dial 9 to to reach an outside number. There’s no ambiguity because no one can have an extension that begins with 9. After you’ve entered three digits, the phone system knows […]

Number of possible Unicode characters

How many? The previous post showed how the number of Unicode characters has grown over time. You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on. Unicode started out relative small then became much more ambitious. Are they going to run out of room? How many possible […]

Growth of Unicode over time

My previous post quoted Randall Munroe saying Unicode “started out just trying to unify a couple different character sets” and grew much more ambitious. The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number […]

Munging CSV files with standard Unix tools

This post briefly discusses working with CSV (comma separated value) files using command line tools that are usually available on any Unix-like system. This will raise two objections: why CSV and why dusty old tools? Why CSV? In theory, and occasionally in practice, CSV can be a mess. But CSV is the de facto standard […]

Working with wide text files at the command line

Suppose you have a data file with obnoxiously long lines and you’d like to preview it from the command line. For example, the other day I downloaded some data from the American Community Survey and wanted to see what the files contained. I ran something like head data.csv to look at the first few lines […]

Ease of learning vs relearning

Much more is written about how easy or hard some technology is to learn than about how hard it is to relearn. Maybe this is because people are more eager to write about something while the excitement or frustration of their first encounter is fresh. The ease of relearning a technology is under-rated. As you’re […]

Accelerating an alternating series

The most direct way of computing the sum of an alternating series, simply computing the partial sums in the terms get small enough, may not be the most efficient. Euler figured this out in the 18th century. For our demo we’ll evaluate the Struve function defined by the series Note that the the terms in […]

Journalistic stunt with Emacs

Emacs has been called a text editor with ambitions of being an operating system, and some people semi-seriously refer to it as their operating system. Emacs does not want to be an operating system per se, but it is certainly ambitious. It can be a shell, a web browser, an email client, a calculator, a […]

Notes on computing hash functions

A secure hash function maps a file to a string of bits in a way that is hard to reverse. Ideally such a function has three properties: pre-image resistance collision resistance second pre-image resistance Pre-image resistance means that starting from the hash value, it is very difficult to infer what led to that output; it […]

Software to factor integers

In my previous post, I showed how changing one bit of a semiprime (i.e. the product of two primes) creates an integer that can be factored much faster. I started writing that post using Python with SymPy, but moved to Mathematica because factoring took too long. SymPy vs Mathematica When I’m working in Python, SymPy […]

Why are regular expressions difficult?

Regular expressions are challenging, but not for the reasons commonly given. Non-reasons Here are some reasons given for the difficulty of regular expressions that I don’t agree with. Cryptic syntax I think complaints about cryptic syntax miss the mark. Some people say that Greek is hard to learn because it uses a different alphabet. If […]

Protecting privacy while keeping detailed date information

A common attempt to protect privacy is to truncate dates to just the year. For example, the Safe Harbor provision of the HIPAA Privacy Rule says to remove “all elements of dates (except year) for dates that are directly related to an individual …” This restriction exists because dates of service can be used to […]

SQRL: Secure Quick Reliable Login

Steve Gibson’s Security Now is one of the podcasts I regularly listen to, and so I’ve been hearing him talk about his SQRL for a while. This week he finally released SQRL: Secure Quick Reliable Login. You can read more about SQRL in the white paper posted on the GRC web site. Here’s a tease […]

Random sampling from a file

I recently learned about the Linux command line utility shuf from browsing The Art of Command Line. This could be useful for random sampling. Given just a file name, shuf randomly permutes the lines of the file. With the option -n you can specify how many lines to return. So it’s doing sampling without replacement. […]