Why are regular expressions difficult?

Regular expressions are challenging, but not for the reasons commonly given.

Non-reasons

Here are some reasons given for the difficulty of regular expressions that I don’t agree with.

Cryptic syntax

I think complaints about cryptic syntax miss the mark. Some people say that Greek is hard to learn because it uses a different alphabet. If that were the only difficulty, you could easily learn Greek in a couple days. No, Greek is difficult for English speakers to learn because it is a very different language than English. The differences go much deeper than the alphabet, and in fact that alphabets are not entirely different.

The basic symbol choices for regular expressions — . to match any character, ? to denote that something is optional, etc. — were arbitrary, but any choice would be. As it is, the chosen symbols are sometimes mnemonic, or at least consistent with notation conventions in other areas.

Density

Regular expressions are dense. This makes them hard to read, but not in proportion to the information they carry. Certainly 100 characters of regular expression syntax is harder to read than 100 consecutive characters of ordinary prose or 100 characters of C code. But if a typical pattern described by 100 characters of regular expression were expanded into English prose or C code, the result would be hard to read as well, not because it is dense but because it is verbose.

Crafting expressions

The heart of using regular expressions is looking at a pattern and crafting a regular expression to match that pattern. I don’t think this is difficult, especially when you have some error tolerance. Very often in applications of regular expressions it’s OK to have a few false positives, as long as a human scans the output. For example, see this post on looking for ICD-10 codes.

I suspect that many people who think that writing regular expressions is difficult actually find some peripheral issue difficult, not the core activity of describing patterns per se.

Reasons

Now for what I believe are reasons why regular expressions are.

Overloaded syntax and context

Regular expressions use a small set of symbols, and so some of these symbols to double duty. For example, symbols take on different meanings inside and outside of character classes. (See point #4 here.) Extensions to the basic syntax are worse. People wanting to add new features to regular expressions ‐ look-behind, comments, named matches, etc. — had to come up with syntax that wouldn’t conflict with earlier usage, which meant strange combinations of symbols that would have previously been illegal.

Dialects

If you use regular expressions multiple programming languages, you’ll run into numerous slight variations. Can I write \d for a digit, or do I need to write [0-9]? If I want to group a subexpression, do I need to put a backslash in front of the parentheses? Can I write non-capturing groups?

These variations are difficult to remember, not because they’re completely different, but because they’re so similar. It reminds me of my French teacher saying “Does literature have a double t in English and one t in French, or the other way around? I can never remember.”

Use

It’s difficult to remember the variations on expression syntax in various programming languages, but I find it even more difficult to remember how to use the expressions. If you want to replace all instances of some regular expression with a string, the way to do that could be completely different in two languages, even if the languages use the exact same dialect of regular expressions.

Resources

Here are notes on regular expressions I’ve written over the years, largely for my own reference.