The Gory Details of URL Validation

I currently work on a web application that allows users to supply a link to their homepage. Such links have to be validated client-side with JavaScript (actually TypeScript/Angular) and server-side with Perl. But what should be accepted as a valid homepage link? And what is the right approach to analyzing the provided URLs?

You will often see recommendations for regular expressions, for example this one from an answer at stackoverflow.com:

function validURL(str) {
  var pattern = new RegExp('^(https?:\\/\\/)?'+ // protocol
    '((([a-z\\d]([a-z\\d-]*[a-z\\d])*)\\.)+[a-z]{2,}|'+ // domain name
    '((\\d{1,3}\\.){3}\\d{1,3}))'+ // OR ip (v4) address
    '(\\:\\d+)?(\\/[-a-z\\d%_.~+]*)*'+ // port and path
    '(\\?[;&a-z\\d%_.~+=-]*)?'+ // query string
    '(\\#[-a-z\\d_]*)?$','i'); // fragment locator
  return !!pattern.test(str);
}

Looks great, but ... life ain't all beer and liverwurst. A closer look soon shows several issues. In fact, if you have to validate URLs, a regular expression is almost never sufficient to do the job correctly. You can use regular expressions for heuristically extracting URLs from longer strings, but not really for validating URLs.

Mathias Bynens has even compiled a list of URL validation regexes. But beware that the test list on the page has at least two false negatives.

There Is No General-Purpose URL Validator

Yes, there is a formal standard for URLs but that standard does not help you all that much. Most of the time you will have reasons to reject formally correct URLs (for example http://localhost) or to accept or reinterpret incorrect URLs (for example http://www.cantanea.com for www.cantanea.com).

But there are countless edge cases that have to be decided based on the actual requirements. Even if you decided to use a ready-made, necessarily opinionated solution, you should at least be aware of these borderline cases because they can introduce subtle bugs or even security issues.

Use a URL Parser

No matter what programming language you are using, somebody will already have written a URL parser for it. For JavaScript, such a parser ships with every modern runtime (browsers and Node.js) as the URL interface:

url = new URL(input)

Note that this throws an exception if you pass an invalid URL as input.
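In practice, you therefore wrap the constructor call in a try/catch. A minimal sketch (the component checks are only hinted at):

let url;
try {
    url = new URL(input);
} catch (e) {
    throw new Error('not a valid URL');
}
// Continue with component checks on url ...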

For Perl, use URI:

use URI;

$url = URI->new($input);

Using a well-established package for the initial parsing not only saves you from overlooking important edge cases but also lets you continue processing with a normalized, canonical form of the URL:

url = new URL(input)
// Validate ...
return url.href;

And in Perl:

$url = URI->new($input);
# Validate ...
return $url->canonical;
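To see the canonicalization at work, feed the JavaScript URL class a deliberately messy example:

const url = new URL('HTTP://My-Company.com:80/a/../b');
console.log(url.href); // "http://my-company.com/b"

The scheme and hostname are lowercased, the default port is dropped, and the dot segments in the path are resolved.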

Now you can start validating the individual components of the URL against your requirements. The individual components are:

https://Myself:s3cr3t@My-Company.com:8080/path/to/Search?Q=huh&L=en#Results
\___/   \____/ \____/ \____________/ \__/\_____________/ \________/ \_____/
  |       |      |           |        |         |            |        |
scheme  user  password     host      port     path         query   fragment

Both the JavaScript URL class and the Perl URI class create an object from the URL. In JavaScript you can read and assign the properties directly; in Perl you use getters/setters:

JavaScript URL   Perl URI       Value (JavaScript version)
--------------   ------------   ---------------------------
protocol         scheme()       https:
username         userinfo()     Myself
password         userinfo()     s3cr3t
hostname         host()         my-company.com
port             port()         8080
host             n/a            my-company.com:8080
origin           n/a            https://my-company.com:8080
pathname         path()         /path/to/Search
search           query()        ?Q=huh&L=en
searchParams     query_form()   { Q: "huh", L: "en" }
hash             fragment()     #Results

The Perl URI methods, in general, do not return the separators. For example, the scheme() method returns just "https", not "https:" as the JavaScript version does.

The Scheme

A lot of times, you will only be interested in https:// or http:// URLs. Even if you have a wider notion, you should almost always explicitly whitelist allowed schemes. Remember, you cannot click on git:// links in a document opened in your browser and you cannot send mail to an x-letter-to:user@something.
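With the URL object from above, such a whitelist check can be as simple as this (remember that the protocol property includes the trailing colon):

if (!['http:', 'https:'].includes(url.protocol)) {
    throw new Error('scheme not allowed');
}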

Path, Query, and Fragment

Pretty much anything is allowed here, and there is little reason to reject something.

Port

Ports are integers in the range of 1 to 65535.

The JavaScript URL parser throws an exception when a port greater than 65535 is specified. It does, however, accept a port number of zero. Port zero is a so-called wildcard port, meaning that the system should pick a suitable port number. In our context, that does not make sense, and such ports are rejected:

if (url.port === '0') throw new Error('Port 0 is not allowed.'); // url.port is a string!

The Perl URI package allows all non-negative integers. You have to implement that check yourself:

die "port out of range" if ($url->port < 1 || $url->port > 65535);

Contrary to the JavaScript URL interface, the Perl URI package does not strip off leading zeroes from port numbers (http://localhost:0001234/ is a funny way of writing http://localhost:1234). So strip the leading zeroes and normalize the port before doing the comparison:

my $port = $url->port;
$url->port($port) if $port =~ s/^0+//;
die "port out of range" if ($port < 1 || $port > 65535);

Username and Password

It is often overlooked that a URL can contain credentials (username and/or password). Whether such URLs should be rejected or not is a matter of policy. The check in JavaScript is done like this:

if (url.username !== '' || url.password !== '') {
    throw new Error('Credentials are not allowed.');
}

The Perl version:

die "userinfo\n" if $url->userinfo;

By the way, this is a feature often used in URL obfuscation. For example, the URL http://facebook.com@3232235521/ will not bring you to Facebook but rather to the web interface of your router at http://192.168.0.1/. The reason is that the number 3232235521 is one of many representations of the numerical IP address 192.168.0.1, and the string "facebook.com" is not the hostname part but the username part of the URL http://facebook.com@3232235521/. Modern browsers will therefore warn you about this before opening such URLs.
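You can watch the URL parser disentangle such an obfuscated URL yourself:

const url = new URL('http://facebook.com@3232235521/');
console.log(url.username); // "facebook.com"
console.log(url.hostname); // "192.168.0.1"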

Hostname

A good starting point for familiarizing yourself with hostname standards is the Top Level Domain Name Specification.

In a nutshell, a fully-qualified domain name (read "hostname") is made up of at least two "labels" separated by dots. For example, the hostname "www.example.com" is made up of the labels "www", "example", and "com". Each one must consist of alphabetic characters (a-z), ASCII digits (0-9) or the hyphen (- aka dash) only. The first character must be an alphabetic character.

Case does not matter.

The Root Label

The root domain of the internet has no name. Its corresponding label is the empty string. In practice, that means that if a hostname ends with a dot, it is already fully qualified, i.e. no search expansion applies.

If it does not end with a dot, a configurable list of search domains is appended to the name, and the resulting names are tried as well. This list is configured in the file /etc/resolv.conf:

$ cat /etc/resolv.conf 
domain cantanea.com
search cantanea.com
nameserver 127.0.0.1
nameserver 8.8.8.8
$ host smtp
smtp.cantanea.com has address 212.72.196.90
$ host smtp.cantanea.com.
smtp.cantanea.com has address 212.72.196.90
$ host smtp.
Host smtp. not found: 3(NXDOMAIN)

An empty label is only allowed for the root label. Consequently, a valid hostname may never contain two or more subsequent dots.

Violating Hostname Standards

If a hostname, for example web_server.company.com, violates hostname standards, that does not prevent it from working. You can put an entry for it in your /etc/hosts and your browser will try to connect to http://web_server.company.com without problems. On the other hand, your name server would probably reject such an entry in a zone file, and you will not be able to officially register domain names that violate the standard.

However, from a UX or security point of view, it does not matter all that much which hostnames are valid but rather which hostnames actually work. How strict your particular checks should be depends on your requirements.

Detailed Hostname Validation

Validating the hostname part of the URL is by far the most complicated part. It gets simpler by splitting the hostname into its individual labels first:

var labels = url.hostname.split('.');
if (labels[labels.length - 1] === '')
    labels.pop();

If the hostname ends with a dot, we strip it off. Whether that is the correct decision depends on your requirements. Actually, it would be more robust to enforce a trailing dot, but that would make URLs look pretty odd to the average user.

The Perl version is a little different because the behavior of split() is, to put it mildly, surprising for cases where the separator appears at the beginning or end of the string. You had better strip off an empty root label before you split the hostname into labels:

$host =~ s/\.$//; # Strip off an empty root label.
my @labels = split /\./, $host;

Empty Labels/Consecutive Dots

Are two consecutive dots like in www..example.com allowed in a hostname? No. They would constitute an empty label and that is only allowed for the root label.

So we have the next check:

if (labels.filter(label => label === '').length)
    throw new Error('consecutive dots are not allowed');

Note that if you don't strip off a possible empty root label above, you must allow an empty label if it occurs at the last position.

Because of the way split() works in Perl when the string starts or ends with the separator, it is simpler to check for empty labels with a pattern match (or, more efficiently, with index()) on the original hostname:

die "consecutive dots are not allowed\n" if $url->host =~ /\.\./;

Numerical IPv4 Addresses

Numerical IP addresses require different checks than symbolic hostnames. Therefore, the validation continues with checking whether the hostname portion of the URL is a numerical IPv4 or IPv6 address.

A naive regex for IPv4 addresses in quad-dotted notation is this:

var ipv4Re = /^(([0-9]{1,3})\.){3}([0-9]{1,3})$/;

Unfortunately, that also matches 256.257.258.999 which is not a valid IP address. Remember, only numbers in the range 0-255 are allowed.

Okay, I know, you have read Mastering Regular Expressions and know a fix. But there are other gotchas ...

The fact that the quad-dotted notation is often called dot-decimal notation should not fool you into thinking that numerical IP addresses have to be decimal. 127.0.0.1 can also be written as 0177.0.0.1, 0x7f.0.0.1, or even as 0000177.0x0.0000.0x1; they are all the same! Keep that in mind when you want to blacklist IPs.
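The JavaScript URL class normalizes all of these exotic notations into canonical dotted-decimal form, which you can use to your advantage before matching against a blacklist:

console.log(new URL('http://0x7f.0.0.1/').hostname);           // "127.0.0.1"
console.log(new URL('http://0000177.0x0.0000.0x1/').hostname); // "127.0.0.1"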

But the term "quad-dotted notation" is also misleading in another respect: A numerical IPv4 address does not necessarily have to be expressed as a group of four integers but can be anything from one to four integers. Take the IP address 120.144.171.205. In quad-dotted hex notation it is 0x78.0x90.0xab.0xcd. As it turns out, you can express the same address as 0x78.0x90.0xabcd, or 0x78.0x90abcd, and even as 0x7890abcd. In quad-dotted decimal notation, this would be 120.144.43981, 120.9481165, or just 2022747085. They are all the same IP!

The following table summarizes how the single, double, triple, and quadruple dot notation works with IPv4 addresses.

Groups   Pattern   Bits
4        a.b.c.d   8.8.8.8
3        a.b.c     8.8.16
2        a.b       8.24
1        a         32
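Again, the JavaScript URL class folds all of these variants into the same canonical form:

['0x78.0x90.0xabcd', '0x78.0x90abcd', '0x7890abcd', '2022747085']
    .forEach(host => console.log(new URL('http://' + host + '/').hostname));
// Prints "120.144.171.205" four times.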

An integer overflow in all of these notations is an error and results in an invalid IP address, except for one case: Mac OS X (and maybe other BSD Unix flavors) seems to accept arbitrarily large numbers for the 32 bit single notation:

$ ping -c 1 89754893758943759873429
PING 89754893758943759873429 (143.202.181.149): 56 data bytes
64 bytes from 143.202.181.149: icmp_seq=0 ttl=44 time=240.768 ms

--- 89754893758943759873429 ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 240.768/240.768/240.768/0.000 ms

On a GNU/Linux system (using glibc as the system library), the same IP address results in an error. Likewise, all JavaScript engines that I have tested throw an exception when an integer with more than 32 bits is passed to the constructor of URL.

On the other hand, 0xff.0x1000000 is nowhere a numerical IP address, because 0x1000000 is 25 bits wide, and the a.b variant of IP addresses only allows 24 bits for b. It is rather interpreted as an (invalid) symbolic hostname; invalid because labels must never begin with a digit. And the rationale behind that rule should be clear by now: If a label begins with a digit, it is part of a numerical IPv4 address and can easily be distinguished from a label of a symbolic hostname.

Unfortunately, Perl's URI module does not normalize numerical IPv4 addresses into their canonical, decimal, quadruple form. I personally consider that a bug that should be fixed in URI but for the time being, you have to do the normalization yourself. In order to reduce the amount of source code in this post, I omit the Perl version of the validation from now on. If you are interested, check a fully-commented implementation in Perl at https://github.com/gflohr/Lingua-Poly/blob/master/apis/users/lib/Lingua/Poly/API/Users/Validator/Homepage.pm.

In JavaScript, things are simpler because the URL class has already taken care of it and the property hostname is always in its canonical, decimal form. And because the full address is already split into labels, the actual check for a numerical IPv4 address is relatively simple:

var isIP = false;
if (labels.length === 4) {
    var octetRe = new RegExp('^(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$');
    if (labels.filter(label => !!label.match(octetRe)).length === 4) {
        isIP = true;
        // IPv4-specific checks follow ...
    }
}

The regular expression only matches numbers in the range 0-255. Alternatively, you could just check that the numbers are decimal (/^(0|[1-9][0-9]*)$/) and are all less than 256.
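A sketch of that alternative; isOctet is just an illustrative helper name:

const isOctet = label => /^(0|[1-9][0-9]*)$/.test(label)
    && parseInt(label, 10) < 256;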

The additional checks to perform on numerical IP addresses completely depend on your particular requirements or policy. You may want to reject them altogether. In my case, I want to only reject such addresses that cannot be used as a publicly available website address. That means that I want to block all private IP ranges (192.168.x.x, 172.16-31.x.x, 10.x.x.x) as well as link-local addresses (169.254.x.x) and the new carrier-grade NAT deployment addresses (100.64-127.x.x) and, of course, addresses bound to the loopback device (127.x.x.x).

So the complete check looks like this:

var isIP = false;
if (labels.length === 4) {
    var octetRe = new RegExp('^(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$');
    if (labels.filter(label => !!label.match(octetRe)).length === 4) {
        isIP = true;
        var octets = labels.map(octet => parseInt(octet, 10));

        // IPv4 addresses with special purpose?
        if (// Loopback.
            octets[0] === 127
            // Private IP ranges.
            || octets[0] === 10
            || (octets[0] === 172 && octets[1] >= 16 && octets[1] <= 31)
            || (octets[0] === 192 && octets[1] === 168)
            // Carrier-grade NAT deployment.
            || (octets[0] === 100 && octets[1] >= 64 && octets[1] <= 127)
            // Link-local addresses.
            || (octets[0] === 169 && octets[1] === 254)) {
            throw new Error('special purpose IPv4 addresses are not allowed');
        }
    }
}

But, hey, isn't 1.0.0.127.in-addr.arpa the same as 127.0.0.1? For DNS lookup yes, but these addresses cannot be used in networking because "hostnames" in the zone .in-addr.arpa resolve to names, not IP addresses. So there is nothing to worry about for now.

IPv6 Addresses

IPv6 addresses can also be used in URLs. But the address has to be wrapped into square brackets so that the colons do not conflict with the colons used in the scheme and to separate the port. The IPv6 equivalent of http://127.0.0.1 is hence http://[::1] for the IPv6 address ::1.
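Note that the JavaScript URL class keeps the square brackets in the hostname property, which the validation code below has to account for:

const url = new URL('http://[::1]:8080/');
console.log(url.hostname); // "[::1]"
console.log(url.port);     // "8080"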

IPv6 addresses are made up of eight groups of four hexadecimal digits, each group separated by a colon. Unnecessary leading zeroes can be omitted, and at most one sequence of subsequent zero groups (for example :0:0:0) can be compressed into two colons (::).

Like IPv4, the IPv6 address space contains large areas reserved for special purposes like private networks or link-local addresses. In order to detect them, it makes sense to first normalize IPv6 addresses into their longest, canonical form, and then match. By the way, you could benefit from the same idea for IPv4 addresses, if you normalize them into hexadecimal or octal form.

So-called IPv4 mapped addresses may cause additional problems. For example, the IPv4 address 172.16.17.18 can be expressed as 0000:0000:0000:0000:0000:FFFF:172.16.17.18, or ::FFFF:172.16.17.18 in its compressed form. If you want to support them, you have to do the same checks as described for IPv4 addresses above, because otherwise people could disguise unwanted IPv4 addresses as IPv6. We simplify things here by completely rejecting all IPv4 mapped addresses. That is simple for the decimal form because our regular expression does not even match in the first place. In their hexadecimal form, these addresses still have to be explicitly rejected.

For the same reason, IPv4 translated addresses and the IPv4/IPv6 translation address space (6to4) are also discarded.

The complete check looks like this:

if (!!url.hostname.match(/^\[([0-9a-fA-F:]+)\]$/)
        && !!url.hostname.match(/:/)) {
    // Uncompress the IPv6 address. Pad a leading or trailing '::' first,
    // so that split() does not yield two empty groups at either end.
    let ip = url.hostname.substr(1, url.hostname.length - 2);
    if (ip.substr(0, 2) === '::') ip = '0' + ip;
    if (ip.substr(ip.length - 2) === '::') ip = ip + '0';
    let groups = ip.split(':');
    if (groups.length < 8) {
        for (let i = 0; i < groups.length; ++i) {
            if (groups[i] === '') {
                groups[i] = '0';
                let missing = 7 - groups.length;
                for (let j = 0; j <= missing; ++j) {
                    groups.splice(i, 0, '0');
                }
                break;
            }
        }
    }

    // Check it.
    if (groups.filter(group => group.match(/^[0-9a-f]+$/)).length === 8) {
        const igroups = groups.map(group => parseInt(group, 16));
        const max = igroups.reduce((a, b) => Math.max(a, b));

        if (max <= 0xffff) {
            isIP = true;
            const norm = groups.map(group => group.padStart(4, '0')).join(':');
            if (max === 0 // the unspecified address
                // Loopback.
                || '0000:0000:0000:0000:0000:0000:0000:0001' === norm
                // IPv4 mapped addresses.
                || !!norm.match(/^0000:0000:0000:0000:0000:ffff/)
                // IPv4 translated addresses.
                || !!norm.match(/^0000:0000:0000:0000:ffff:0000/)
                // IPv4/IPv6 address translation.
                || !!norm.match(/^0064:ff9b:0000:0000:0000:0000/)
                // IPv4 compatible.
                || !!norm.match(/^0000:0000:0000:0000:0000:0000/)
                // Discard prefix.
                || !!norm.match(/^0100/)
                // Teredo tunneling, ORCHIDv2, documentation, 6to4.
                || !!norm.match(/^200[12]/)
                // Private networks.
                || !!norm.match(/^f[cd]/)
                // Link-local
                || !!norm.match(/^fe[89ab]/)
                // Multicast.
                || !!norm.match(/^ff/)
            ) {
                throw new Error('special purpose IPv6 address');
            }
        }
    }
}

if (isIP) return;

Important! Please take this code with a grain of salt. I'm not an IPv6 expert and haven't seriously tested the code. Please double-check it if proper IPv6 handling is mission-critical for you.

The return statement at the end is necessary because the rest of the checks are only valid for symbolic hostnames.

Fully-Qualified Domain Names

For my use-case, I want to allow only publicly available URLs. That implies that they must be fully qualified. In other words, they must have at least two labels.

if (labels.length < 2)
    throw new Error('only fully-qualified hostnames are allowed');

Sadly, this is not enough. Certain top-level domains are further divided into sub-namespaces. A well-known example is the domain .uk for the United Kingdom. It is, for instance, not possible to register the domains .co.uk, .ac.uk, .gov.uk, among others. And .uk is not the only top-level domain like this. There are also .in, .au, .br, and more that have similar policies.

For the time being, I do not perform such TLD-specific checks because the resulting code would be a maintenance nightmare, and failure to detect such invalid URLs does not have an impact on security. This is true for my use case, but your mileage may vary.

Top-Level Domain Checks

There are some further constraints for top-level domains.

RFC2606 declares the top-level domains .example, .test, .localhost, and .invalid, as well as the second-level domains .example.com, .example.net, and .example.org as reserved.

But reserved does not mean that they are invalid. For example, the website http://example.com exists, and the URL is, of course, fully valid. I still decided to reject all of these domains, because it does not make sense to specify them as a user's homepage in the context of my particular application. But, again, your mileage may vary.

RFC6762 and RFC7686 further declare .local and .onion as special purpose. While we are at it, we also reject .home and .corp because they are not IANA-registered but sometimes recommended for private use.

The domains .in-addr.arpa and .ip6.arpa, which are used for reverse DNS, have already been mentioned. In fact, the entire top-level domain .arpa is used for technical, obsolete, or esoteric stuff, and so we disallow it altogether. The same applies to .int.

Traditionally, top-level domains could only contain alphabetic characters (i.e. no hyphen, no decimal digits). This has now been relaxed by https://tools.ietf.org/id/draft-liman-tld-names-01.html to allow top-level domains with Unicode characters (see below). But the ASCII hyphen or ASCII decimal digits are only allowed if the top-level domain name starts with the IDN/Punycode marker "xn--". See the specs for details!

That leads to the following code in JavaScript:

var tld = labels[labels.length - 1];
if ('xn--' !== tld.substr(0, 4) && !!tld.match(/[-0-9]/)) {
    throw new Error('Only alphabetic characters allowed in TLD.');
}

if ([
    'example',
    'test',
    'localhost',
    'invalid',
    'local',
    'onion',
    'home',
    'corp',
    'arpa',
    'int'].includes(tld)
    || ('example' === labels[labels.length - 2]
        && ['com', 'net', 'org'].includes(tld))) {
    throw new Error('Reserved top-level domains are not allowed.');
}

It is sometimes said that top-level domains must be at least two characters long. I do not know about such a standard but feel free to implement such a restriction in your own code. At the time of this writing, all officially registered top-level domains are at least two characters long.

But do not limit the length of top-level domains to three characters. This rule has been obsolete for quite some time already! The top-level domains .info, .museum, and .travelersinsurance are all officially registered!

But there is one more restriction! Each label must start with an alphabetic character and may not end with a hyphen. Because of IDN issues (see below) we reformulate that as: A label must not start with a hyphen or digit and must not end with a hyphen.

for (let i = 0; i < labels.length; ++i) {
    var label = labels[i];
    if (!!label.match(/^[-0-9]/) || !!label.match(/-$/)) {
        throw new Error('malformed hostname');
    }
}

Writing that in a more elegant fashion with Array.filter() is left as an exercise to the reader.

Unicode

Unicode has been present in domain names for quite some time now. But Unicode support is actually a client-side feature, not a feature of the domain name system. If a label contains Unicode characters, it is converted by the client (for example the web browser) into a string that starts with "xn--" and is encoded into US-ASCII by the so-called Punycode algorithm. So, technically speaking, hostnames are still strictly US-ASCII, but in most cases, for client-side checks, you have to accept Unicode in hostnames.
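You can observe the conversion with the URL class; bΓΌcher.example is just an illustrative name:

console.log(new URL('http://bΓΌcher.example/').hostname);
// "xn--bcher-kva.example"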

In fact, hostnames cannot contain arbitrary Unicode characters. For example, the already mentioned document https://tools.ietf.org/id/draft-liman-tld-names-01.html states that top-level domain names can still only contain (Unicode!) letters.

And then there exist specific rules for all top-level domains that support Unicode. Most registrars only allow Unicode characters that are frequently used in the context of the domain.

You can easily see that enforcing all these restrictions will quickly result in an enormous amount of work and maintenance effort. Since it is not mission-critical in my case, I only check that the hostname does not contain forbidden US-ASCII characters, but allow all Unicode characters without any further checking. Decide for yourself whether you have other requirements.

if (!!url.hostname.match(/[\x00-\x2c\x2f\x3a-\x60\x7b-\x7f]/)) {
    throw new Error('forbidden character in hostname');
}

The pattern match checks that only alphabetic characters (a-z), decimal digits (0-9), the hyphen (-), and the dot (.) or characters outside the range of US-ASCII are contained in the hostname.

The dot (.) is also allowed as part of a hostname, although it cannot be part of a hostname label because it is the label separator. Allowing it here is fine, and it is more efficient to execute the regular expression just once over the whole hostname than once per label.

You may also note that uppercase letters A-Z are not included in the character range. But the hostname is already in canonical form and therefore completely lowercase. This is, by the way, also true for URLs with an IPv6 address as the hostname portion.

Arguably, you may also check that the hostname is valid UTF-8. Google Chrome converts invalid UTF-8 to an IDN and tries to open it; Firefox does not. So we had better leave complaints to the browser and not perform any check for invalid multi-byte sequences in the hostname. But, again, you may have different requirements.

Unicode Normalization

We have already learned that hostnames are case-insensitive and browsers embrace that by automatically converting the hostname portion of a URL to lowercase so that http://LOCALHOST/ is equivalent to http://localhost/.

Unfortunately, this is just the tip of the iceberg. The browser does a lot more normalization than just lowercasing. These hostnames are all equivalent as the hostname part of a URL:

  • LOCALHOST
  • 𝓡𝓸𝓬π“ͺ𝓡𝓱𝓸𝓼𝓽
  • ο½Œο½ο½ƒο½ο½Œο½ˆο½ο½“ο½”
  • β“›β“žβ“’β“β“›β“—β“žβ“’β“£
  • 𝕝𝕠𝕔𝕒𝕝𝕙𝕠𝕀π•₯

And so are these:

  • 127.0.0.1
  • ①⑑⑦.β“ͺ.β“ͺ.β‘ 
  • οΌ‘οΌ’οΌ—.0.0.οΌ‘
  • πŸ™πŸšπŸŸ.𝟘.𝟘.πŸ™

Try the Unicode Text Converter for more examples, although the fancier conversions will fortunately not work in browsers.

Technically, the browser is converting hostnames using a Unicode Normalization Form. In JavaScript, this is equivalent to the following:

normalized = '𝓡𝓸𝓬π“ͺ𝓡𝓱𝓸𝓼𝓽'.normalize('NFKC');

But if you follow the recommendation to use the URL interface for parsing URLs, this happens implicitly, and the hostname part has already gone through Unicode normalization.
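You can verify this in the browser console or in Node.js; engines with full IDNA support should print the plain ASCII name:

console.log(new URL('http://𝓡𝓸𝓬π“ͺ𝓡𝓱𝓸𝓼𝓽/').hostname); // "localhost"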

In Perl you would do the following:

use Unicode::Normalize;
$normalized = Unicode::Normalize::NFKC('𝓡𝓸𝓬π“ͺ𝓡𝓱𝓸𝓼𝓽');

I am not 100 % sure whether the browser really uses NFKC or just NFKD. If you know better, please leave a comment.

Unfortunately, the Perl URI package does not perform this type of normalization but interprets these non-canonical forms as International domain names (IDN). Try this example:

use v5.10;

use URI;
my $uri = URI->new('http://𝓡𝓸𝓬π“ͺ𝓡𝓱𝓸𝓼𝓽/');
say $uri->host;
say $uri->canonical;

The output is:

xn--taaaaaaaaa5gbbbbbbbb2vkb9h7ck7do0i3a51ldaddddddd
http://xn--taaaaaaaaa5gbbbbbbbb2vkb9h7ck7do0i3a51ldaddddddd/

It shows that the problem is already present in the parser. When you invoke the host() method, the conversion has already taken place. You can only work around it by checking if the hostname begins with xn--. If it does, you have to convert it back to its Unicode form, normalize it as described above, and then feed the result back into URI->new().

Maximum Length

According to RFC1123, host software MUST handle host names of up to 63 characters in length and SHOULD handle host names of up to 255 characters in length. That means that there is no real hard limit for the maximum length of a hostname.

At the time that RFC1123 was written, one character in a hostname was one byte. Today, in the presence of Unicode domain names, it is unclear whether the limit applies to the length in bytes, to the length in characters (for practical reasons we can assume UTF-8 multi-byte sequences), or to the length of the hostname encoded in Punycode.

Short of better findings, I decided not to impose any limit on the length of a valid hostname and to rely on the web server's configuration for the maximum request size to defend the application against DoS attacks.

Security and Privacy Considerations

If you expose user-supplied URLs, some security aspects need to be considered.

Schemes Should Be Whitelisted

The above code only allows http and https URLs. For good reason:

<a href="javascript:for (;;) alert('Ouch!!!')">Woodstock's homepage</a>

The notorious javascript: URL scheme is just one particularly annoying example. But since URL schemes may have arbitrary semantics, you should only allow well-known schemes that make sense for your application.

Reject Private IP Networks and Link-Local Addresses

Allowing http://127.0.0.1:8080 as a user homepage in a public forum is not just a little glitch but a security problem.

Some software (for example the popular VLC media player) can be controlled with an HTTP interface. It is therefore not a good idea to click on forum user Barney's homepage link, when it is http://127.0.0.1:8080/share?file=/etc/passwd&rec=me@my.com.

But that is just the pretty shiny tip of the iceberg. Today most households have many devices and gadgets that expose an HTTP interface, for example routers, TVs, set-top boxes, and gaming consoles; and in the Internet of Things there are even more attractive targets like door locks, telephones, or even cars. The HTTP APIs of these devices do not necessarily comply with state-of-the-art security standards because the developers may assume that these devices "only" operate in private networks that are never accessible from the outside.

If content from user-supplied URLs is loaded automatically by the browser, for example when used in the src attribute of an img tag, things get even worse, because other users do not even have to actively click on a link, let alone be aware that the browser is sending the request.

Reject Special Purpose Domain Names

It should be clear that allowing localhost is just as dangerous as allowing 127.0.0.1. In fact, if a network device is accessible with a special purpose domain name, it is almost guaranteed that your web application should not cause a browser to send unsolicited requests to such devices.

This is best understood with the .local domain (see RFC6762). Many network devices configure themselves with hostnames in the .local domain, and sadly these names are very easy to guess. I am writing this post on a MacBook Pro, and guess which device is accessible as MacBook-Pro-3.local? Try the command hostname on your computer, and you may be surprised. And if you are the proud owner of the Acme Corp. InkRocket 42 printer, you should not be surprised if its web interface is accessible as Acme-InkRocket-42.local.

Problems With URL Probing

It may be tempting to skip the entire parsing and validating business by simply sending a GET request to the URL supplied and see if it works.

Doing that in the client (web browser) is a little bit tricky but not such a bad idea after all. Because of the same-origin policy, you cannot use XMLHttpRequest. Instead, you can insert an img tag into the DOM and check whether the image loads. Still, doing this is a good protection against typos.
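A minimal sketch of such a probe; it can only confirm URLs that actually serve an image, and probeURL is just an illustrative name:

function probeURL(href, timeout) {
    return new Promise(resolve => {
        const img = new Image();
        const timer = setTimeout(() => resolve(false), timeout || 5000);
        img.onload = () => { clearTimeout(timer); resolve(true); };
        img.onerror = () => { clearTimeout(timer); resolve(false); };
        img.src = href;
    });
}

// Usage: probeURL('https://example.com/favicon.ico').then(ok => ...);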

But if the validation is business-critical, you must also perform it on the server-side. Sadly, probing URLs on the server-side is a recipe for serious trouble; more precisely, it is potentially vulnerable to resource enumeration. If an attacker sets the image URL of their avatar to http://192.168.0.1:8080/favicon.ico and the server accepts it after successfully probing the URL, the attacker knows that you run a web server at that address behind your firewall.

Rejecting private network IPs kind of remedies the problem. But if your web application has the external name www.funny-catvideos.com, the attacker can just try more ports on that hostname (which may be accessible from the server) or try other names inside that domain. If they are lucky, they may end up gaining remote access to the server management console that is only accessible from the server.

Source Code

You can find the TypeScript source code of the URL validator described above at this link. Porting it to conventional JavaScript should be trivial: just remove the class wrapper and replace all occurrences of let and const with var.

The (hopefully) equivalent server-side version in Perl can be found at this link.

Please, do not blindly copy the code into your project. It reflects the policy for my particular project. Other projects will probably need modifications.

Important: The Perl version of the validation does not (yet) perform Unicode normalization on hostnames!

Corrections or improvements are always welcome. Just file a pull request or leave a comment below.
