Unicode Regex Pitfalls

I sometimes receive the good advice to make my regular expressions more concise with the /i modifier or using backslash character classes like \d, \s or the like. But I avoid them on purpose. Why?

When it comes to programming languages, many developers are pretty polyglot nowadays and I am no exception. I switch between C, JavaScript, Go, Java, Perl, Python, Ruby, and others on a regular basis. And that is easy because all of these languages are actually pretty similar. The same largely holds for their regular expression implementations. All of them are based on or are at least heavily inspired by Perl regular expressions.

However, the way they handle non-US-ASCII characters differs significantly. This is owed not only to the different ways these languages handle Unicode but also to design decisions whose differences are not always obvious.

What is Case-Insensitive?

In the good old days of US-ASCII the term “case-insensitive” was immediately clear. The concept stood for environments where “case”, “CASE”, and even “cASe” meant the same thing. But once the limitations of US-ASCII were overcome, we were faced with comparing “café” to “CAFE” (or even “CAFÉ”), or Cyrillic strings like “ягода” and “ЯГОДА”.

You should also keep in mind that this has security implications, as the IDN homograph attack shows. In brief, the attack exploits the fact that many characters look the same in most font faces. For example, the Latin capital letter A is almost always indistinguishable from the Greek capital Alpha or the Cyrillic capital A, although their binary representations are completely different.
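To make the look-alike problem tangible, here is a small sketch that prints the code points of the three letters mentioned above. The helper name toCodePoint and the lookalikes list are made up for this illustration.

```javascript
// Three letters that render almost identically but are distinct code points.
const lookalikes = [
  { name: "Latin capital A", ch: "\u0041" },
  { name: "Greek capital Alpha", ch: "\u0391" },
  { name: "Cyrillic capital A", ch: "\u0410" },
];

// Hypothetical helper: format a character's code point as U+XXXX.
function toCodePoint(ch) {
  return "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
}

for (const { name, ch } of lookalikes) {
  console.log(`${ch}  ${toCodePoint(ch)}  ${name}`);
}

// However similar they look, a plain comparison treats them as different.
console.log(lookalikes[0].ch === lookalikes[1].ch); // false
```

A string comparison (or a regular expression) only ever sees the code points, never the glyphs.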

The /i Modifier

Compiling a regular expression with the modifier /i makes the regex engine ignore case while matching. How exactly to enable the modifier depends on the language:

# Perl: 
$string =~ /foobar/i;
// JavaScript:
string.match(/foobar/i);
// Java:
Pattern.compile("foobar", Pattern.CASE_INSENSITIVE);

It is obvious that “Q” and “q” are equivalent when compared case-insensitively. But what about “Ä” and “ä”?

For a regular expression engine that is not Unicode-aware, the UTF-8 encoded character “Ü” is equivalent to two(!) bytes with the values 0xc3 and 0x9c, which read as Windows-1252 give the sequence “Ãœ”. A UTF-8 encoded lowercase “ü” corresponds to the two bytes 0xc3 and 0xbc, or “Ã¼” in Windows-1252, and it is no wonder that even a case-insensitive comparison fails in such cases.
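The byte values are easy to check. The following sketch uses the standard TextEncoder (a global in browsers and in current Node.js versions), which always encodes to UTF-8; the helper name utf8Hex is made up for this example.

```javascript
// TextEncoder always produces UTF-8.
const encoder = new TextEncoder();

// Hypothetical helper: render the UTF-8 bytes of a string as hex.
function utf8Hex(str) {
  return Array.from(encoder.encode(str), (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex("\u00dc")); // c3 9c  (uppercase Ü)
console.log(utf8Hex("\u00fc")); // c3 bc  (lowercase ü)
```

The first byte is identical, the second is not, and neither second byte is a US-ASCII letter. An engine that only folds the bytes for A-Z and a-z therefore has no reason to treat the two sequences as equal.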

So, how do different regular expression engines treat this case? We will stick to Perl, JavaScript, and Java since they are pretty representative.

Perl

Most regular expression implementations are modeled on Perl's (hence the name “Perl Compatible Regular Expressions”), and so Perl deserves to be looked at first.

my $uuml = "Ü";
my $re = "ü";
if ($uuml =~ /^$re$/i) {
    print "match\n";
} else {
    print "no match\n";
}

This will print out “no match” since Perl will do a byte-wise match here because it considers both the string and the regular expression a sequence of bytes (as opposed to multi-byte sequences).

But you can mark the input string and the regular expression as UTF-8 character sequences. One way to do this:

use Encode;

my $uuml = "Ü";
my $re = "ü";
Encode::_utf8_on($uuml);
Encode::_utf8_on($re);
if ($uuml =~ /^$re$/i) {
    print "match\n";
} else {
    print "no match\n";
}

Perl will now report a match because it considers both the input string and the regular expression source as a sequence of (UTF-8 encoded) characters. Yes, sad but true, this is a pretty bogus concept.

JavaScript

Let’s look at JavaScript now.

var uuml = "Ü";
if (uuml.match(/^ü$/i)) {
    console.log("match");
} else {
    console.log("no match");
}

For JavaScript this is a match (because of the /i modifier). Apparently, the JavaScript regular expression engine is Unicode-aware, at least in this case.

Java

Java has the most developer-friendly implementation:

import java.util.regex.*;

public class PlayGround {
    public static void main(String[] args) {
        // CASE_INSENSITIVE alone only folds US-ASCII letters.
        int flags = Pattern.CASE_INSENSITIVE;
        boolean match = Pattern.compile("^ü$", flags)
                               .matcher("Ü")
                               .matches();
        System.out.println(match); // false

        // Adding UNICODE_CASE enables Unicode-aware case folding.
        flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        match = Pattern.compile("^ü$", flags)
                       .matcher("Ü")
                       .matches();
        System.out.println(match); // true
    }
}

This is how it should be. You can explicitly enable Unicode-aware case folding with the flag UNICODE_CASE. Why is this good (a synonym for Do The Right Thing™)? ASCII case folding is pretty cheap. Case folding for Unicode is complex and potentially expensive. Look at the Unicode Case Folding Mapping if you cannot believe it.
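A rough impression of that complexity can be had without a regular expression. The sketch below uses JavaScript's String.prototype.toUpperCase and toLowerCase as a stand-in; these perform case mapping, which is related to but not the same as the case folding a regex engine applies, so take it as an illustration only.

```javascript
// A simple one-to-one mapping: ü -> Ü.
const upperUuml = "\u00fc".toUpperCase();
console.log(upperUuml === "\u00dc"); // true

// A one-to-many mapping: the German sharp s becomes two letters.
const upperSharpS = "\u00df".toUpperCase();
console.log(upperSharpS);        // SS
console.log(upperSharpS.length); // 2

// The round trip does not get you back to where you started.
const roundTrip = upperSharpS.toLowerCase();
console.log(roundTrip === "\u00df"); // false, it is "ss"
```

An ASCII-only engine can fold case by looking at a single byte. A Unicode-aware one needs lookup tables and has to decide what to do with mappings like the second one.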

Borderline Cases for /i

What is the lowercase version of “K”? If you think that the answer is obvious, you have either looked at a hexdump of the source code of this page or you have nonchalantly ignored the fact that the Latin uppercase letter K, the Greek uppercase letter Kappa, the Cyrillic uppercase letter Ka, and(!) the Kelvin sign are indistinguishable in most font faces. Test: How many different characters can you see in “KΚКK”? There are four distinct characters!

That is more than an interesting side note because the Kelvin sign holds a little surprise for you. Look again at the Unicode Case Folding Properties. Go to this line:

212A; C; 006B; # KELVIN SIGN

Read this as: in the common (C) case folding mapping, the Unicode character “KELVIN SIGN” \u212a has the lowercase counterpart \u006b, which happens to be the US-ASCII Latin lowercase k. That means that the very common mapping between [A-Z] and [a-z] is not bi-unique: there is no one-to-one correspondence between uppercase and lowercase!
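The missing one-to-one correspondence shows up as soon as you try a round trip. This is a minimal sketch using JavaScript's built-in case mapping methods; the variable names are made up.

```javascript
const kelvin = "\u212a";            // KELVIN SIGN
const lower = kelvin.toLowerCase(); // U+006B, the plain ASCII k
const back = lower.toUpperCase();   // U+004B, the plain ASCII K

console.log(lower === "k");   // true
console.log(back === "K");    // true
console.log(back === kelvin); // false: the Kelvin sign is gone
```

Two different uppercase characters share one lowercase character, so going down and back up cannot restore the original.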

That has an important impact on parsing source code with regular expressions. An identifier is often matched by the pattern /^[_a-zA-Z][_a-zA-Z0-9]*$/ or by the supposedly equivalent /^[_a-z][_a-z0-9]*$/i. But are they really equivalent?

The Unicode case folding map linked above states in its introductory comment:

If all characters are mapped according to the full mapping below, then case differences (according to UnicodeData.txt and SpecialCasing.txt) are eliminated.

So, let’s check whether the Kelvin sign matches the regular expression /[a-z]/i. We begin once more with Perl.

print "match\n" if "\x{212a}" =~ /^[a-z]$/i;

The string “\x{212a}” is the Kelvin sign, and - wow! - it is obviously inside the character class /[a-z]/i. And this is the correct behavior!

Now JavaScript:

console.log("\u212a".match(/^[a-z]$/i));

In JavaScript it does not match. Is this a bug? Based on common sense it probably is a bug, especially if you look at this piece of code:

console.log("\u212a".toLowerCase());

This spits out a lowercase “k”, true to the Unicode standard.

At the end of the day, the JavaScript pattern /[a-z]/i does what most people expect. But in fact it compensates for a surprising feature of Unicode with a surprising behavior of its regular expression implementation.

But you should avoid all ambiguity and just write /[a-zA-Z]/ when that is what you mean (although this seems to be slightly slower in Firefox).

Note: in case of the Kelvin sign, the /u modifier introduced in ECMAScript 2015 changes the behavior. See https://mathiasbynens.be/notes/es6-unicode-regex for details.
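The difference the /u modifier makes can be sketched as follows. This assumes an engine that implements ES2015 regular expressions; the variable names are made up.

```javascript
const kelvin = "\u212a"; // KELVIN SIGN

// Without /u, case-insensitive matching does not map the Kelvin sign to k.
const withoutU = /^[a-z]$/i.test(kelvin);
console.log(withoutU); // false

// With /u, Unicode case folding is applied and the Kelvin sign matches.
const withU = /^[a-z]$/iu.test(kelvin);
console.log(withU); // true

// The explicit class behaves the same with and without /u.
const explicit = /^[a-zA-Z]$/.test(kelvin);
const explicitU = /^[a-zA-Z]$/u.test(kelvin);
console.log(explicit, explicitU); // false false
```

The spelled-out class is the only one of the three variants whose result does not depend on a modifier.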

And What Is a Digit?

Similar considerations apply to shortcuts for character classes like \d for digits and \s for whitespace, or similar constructs.

Once again, Perl is pretty strict:

print "match" if "\x{6f4}\x{6f2}" =~ /\d/;

“\x{6f4}\x{6f2}” is the same as “۴۲”, which is 42 written with (Extended) Arabic-Indic digits.

Likewise, \s matches ideographic space:

print "match" if  "\x{3000}" =~ /\s/;

JavaScript’s behavior is, umh, surprising. Arabic digits are not digits:

console.log("\u06f4\u06f2".match(/\d/));

This yields null, that is, no match. But \s matches ideographic space:

console.log("\u3000".match(/\s/));

Checking out the exact behavior of the often used \b and \w and all of the uppercase variants and how the behavior is changed by the /u modifier introduced in ES2015 is left as an exercise to the reader.
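As a starting point for that exercise, here is a sketch of how \d and \s behave in JavaScript and what the explicit alternatives look like. The property escape \p{Nd} is not mentioned above; it requires the /u modifier and an engine that supports ES2018, so treat that line as an assumption about your runtime.

```javascript
const fortyTwo = "\u06f4\u06f2"; // ۴۲, 42 in Extended Arabic-Indic digits

// \d stays ASCII-only, with or without /u.
const plainDigit = /^\d+$/.test(fortyTwo);
const unicodeFlagDigit = /^\d+$/u.test(fortyTwo);
console.log(plainDigit, unicodeFlagDigit); // false false

// Digits in all scripts have to be requested explicitly (ES2018).
const anyScriptDigit = /^\p{Nd}+$/u.test(fortyTwo);
console.log(anyScriptDigit); // true

// \s, on the other hand, goes well beyond ASCII whitespace.
const ideographicSpace = /^\s$/.test("\u3000");
const asciiSpaceOnly = /^[ \t\r\n]$/.test("\u3000");
console.log(ideographicSpace, asciiSpaceOnly); // true false
```

The two shortcuts sit side by side in every cheat sheet and yet follow opposite policies: one is restricted to US-ASCII, the other is not.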

It should also be mentioned that Ruby comes with yet another variant. For example, /\d/ is equivalent to /[0-9]/, but /[[:digit:]]/ matches digits in all scripts. The same applies to \s and [[:space:]] and so on.

Conclusion

Using /i, \d, \s, \b, or \w can be a subtle source of bugs and makes the exact behavior of your code hard to understand. Avoid them if possible! The rock stars in your team will understand either version. Those who have better ideas for their spare time than reading specs will thank you.
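What the spelled-out version can look like, as a minimal sketch: the constant and function names below are made up for this example, and the identifier rule is just the common "letter or underscore, then letters, digits, or underscores" convention, not the grammar of any particular language.

```javascript
// Explicit US-ASCII classes: no modifier, no shortcut, no surprises.
const ASCII_DIGITS = /^[0-9]+$/;
const ASCII_IDENTIFIER = /^[_a-zA-Z][_a-zA-Z0-9]*$/;

// Hypothetical helpers wrapping the two patterns.
function isAsciiNumber(s) {
  return ASCII_DIGITS.test(s);
}

function isAsciiIdentifier(s) {
  return ASCII_IDENTIFIER.test(s);
}

console.log(isAsciiNumber("42"));                // true
console.log(isAsciiNumber("\u06f4\u06f2"));      // false
console.log(isAsciiIdentifier("foo_bar1"));      // true
console.log(isAsciiIdentifier("\u212aelvin"));   // false: starts with the Kelvin sign
```

These patterns behave the same in every engine discussed here because nothing in them depends on how the engine interprets case or character class shortcuts.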

