Category: Unicode

How UTF-8 works

UTF-8 is a clever way of encoding Unicode text. I’ve mentioned it a couple times lately, but I haven’t blogged about UTF-8 per se. Here goes. The problem UTF-8 solves US keyboards can often produce 101 symbols, which suggests 101 symbols would be enough for most English text. Seven bits would be enough to encode […]

Excel, R, and Unicode

I received some data as an Excel file recently. I cleaned things up a bit, exported the data to a CSV file, and read it into R. Then something strange happened.
Say the CSV file looked like this:
foo,bar
1,2
3,4

I read the file into R with…

Number of possible Unicode characters

How many? The previous post showed how the number of Unicode characters has grown over time. You’ll notice there was a big jump between versions 3.0 and 3.1. That will be important later on. Unicode started out relative small then became much more ambitious. Are they going to run out of room? How many possible […]

Growth of Unicode over time

My previous post quoted Randall Munroe saying Unicode “started out just trying to unify a couple different character sets” and grew much more ambitious. The first version of Unicode, published in 1991, had 7,191 characters. Now the latest version has 137,994 characters and so is about 19 times bigger. Here’s a plot of the number […]

The hopeless task of the Unicode Consortium

Randall Monroe, author of xkcd, discussing Unicode on the Triangulation podcast: I am endlessly delighted by the hopeless task that the Unicode Consortium has created for themselves. … They started out just trying to unify a couple different character sets. And before they quite realized what was happening, they were grappling with decisions at the […]

Projecting Unicode to ASCII

Sometimes you need to downgrade Unicode text to more restricted ASCII text. For example, while working on my previous post, I was surprised that there didn’t appear to be an asteroid named after Poincaré. There is one, but it was listed as Poincare in my list of asteroid names. Python module I used the Python module unidecode […]

Trademark symbol, LaTeX, and Unicode

Earlier this year I was a coauthor on a paper about the Cap Score™ test for male fertility from Androvia Life Sciences [1]. I just noticed today that when I added the publication to my CV, it caused some garbled text to appear in the PDF. Here is the corresponding LaTeX source code. Fixing the […]