Simple Content Negotiation For Nginx

Content negotiation is a key concept for multilingual web sites. For nginx it is only available as a patch. But negotiating the language is a rather trivial task for most sites. Instead of patching the web server, a couple of lines of Perl code will do the job.

In “Multilingual Web Sites With Jekyll” I described how to set up a multilingual static site with Jekyll. This post describes the web server side of the configuration. The solution outlined here is, by the way, not specific to Jekyll. It works for every static site served by nginx.

How Does Content Negotiation Work?

If you know what content negotiation is and how it works, you can skip the following explanations and jump directly to the solution of the problem.

Sometimes a web server provides a resource in multiple variants. For example a text file could be available as HTML, PDF, or an OpenOffice document, an image could be available as a PNG and a GIF, and a large text file could be compressed with gzip or compress.

Almost all browsers send Accept headers with every request. These headers inform the server about the preferences of the user and the capabilities of the browser. The following headers are defined:

Accept
    content type
Accept-Charset
    character set
Accept-Encoding
    encoding (almost always synonymous with compression)
Accept-Language
    language

This post covers only language negotiation. The other headers from the Accept family are treated in a similar manner. You can read more about this at https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html.

The user’s language preferences are encoded with the following syntax in the header Accept-Language:

Accept-Language: de-DE, de;q=0.9, fr;q=0.7, en;q=0.3

The individual language identifiers are separated by commas. Each identifier consists of 1 to 8 US-ASCII characters for the primary language tag, optionally followed by an arbitrary number of sub tags, each of them consisting of a hyphen and 1 to 8 US-ASCII characters. The primary tag normally identifies the language, and the first sub tag identifies the country or region. Language identifiers with more than one sub tag are rarely used. Alternatively, a wildcard (*) can be specified with the meaning “any language”. Current browsers do not seem to use the wildcard feature.

The language identifier can optionally be followed by a quality value, separated from it by a semicolon and written as q=value. The quality value is a decimal number in the range of 0 to 1. The default quality value is 1; a quality value of 0 means “unacceptable”.

We can now “translate” the above example:

Accept-Language: de-DE, de;q=0.9, fr;q=0.7, en;q=0.3

The user prefers documents in German for Germany (quality value 1), followed by German for any country or region (quality value 0.9). If German is unavailable, French is desired (quality value 0.7), and otherwise English (quality value 0.3).
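The parsing rules described above can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the parser used later in this post: the function name parse_accept_language is chosen freely, and treating a malformed quality value as 0 is a decision of this sketch, not something the specification prescribes.

```python
def parse_accept_language(header):
    """Return (language_range, quality) pairs, most preferred first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        # Split "de;q=0.9" into the language range and its parameters.
        lang_range, _, params = item.partition(";")
        quality = 1.0  # default when no q parameter is given
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # assumption: malformed q means unacceptable
        prefs.append((lang_range.strip(), quality))
    # sorted() is stable, so equal qualities keep their order from the header.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


print(parse_accept_language("de-DE, de;q=0.9, fr;q=0.7, en;q=0.3"))
# [('de-DE', 1.0), ('de', 0.9), ('fr', 0.7), ('en', 0.3)]
```

Sorting by quality value rather than trusting the order in the header matters because a browser is free to list its languages in any order.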

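Sixty-four-dollar question: If the server has only documents in German for Austria (de-AT) and English (en) available, which one should it deliver? According to RFC 2616, the German one! A language range matches every tag that it equals, and also every tag that starts with the range followed by a hyphen. The range de therefore covers de-AT with a quality value of 0.9, which beats English with 0.3. The reverse does not hold: the range de-DE does not match a document that is tagged just de or de-AT.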

Structure Of a Multilingual Site

The de-facto reference implementation for content negotiation is the Apache module mod_negotiation. Let’s assume that the different language variants of a particular resource are located in one directory inside the web server document root. Then, the following naming convention is expected:

$ ls htdocs/path/to/directory
index.html.bg    index.html.de-CH index.html.en    index.html.it
index.html.de    index.html.de-DE index.html.es    index.html.ru
index.html.de-AT index.html.de-IT index.html.fr

Should a browser request an existing file, exactly that file gets delivered. Content negotiation only kicks in when the file does not exist. If the browser requests, for example, index.html.de-AT, that file gets delivered. If index.html gets requested, the server has to select a suitable resource because index.html does not exist.

In order to do so, the header Accept-Language gets analyzed as described above, and the server will select the file that comes closest to the user preferences. That technique can be combined with preferences for character set and encoding. Thus, the server may respond to a request for the resource index.html with the contents of the file index.utf-8.html.de-DE.gz.
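To make the selection step more tangible, here is a simplified Python sketch. It is not how mod_negotiation is implemented: it only looks at the language suffix, compares tags literally, and ignores character set and encoding. The function name pick_variant and the fallback parameter are assumptions made for this illustration.

```python
def pick_variant(basename, filenames, preferences, fallback="en"):
    """Pick the file for `basename` that best matches the preferences.

    `preferences` is a list of language tags, most preferred first.
    """
    # An existing file is delivered as is, without negotiation.
    if basename in filenames:
        return basename
    # Map language suffix to file name, e.g. "de-AT" -> "index.html.de-AT".
    prefix = basename + "."
    variants = {name[len(prefix):]: name
                for name in filenames if name.startswith(prefix)}
    for lang in list(preferences) + [fallback]:
        if lang in variants:
            return variants[lang]
    return None  # no suitable variant at all


files = ["index.html.bg", "index.html.de", "index.html.de-AT",
         "index.html.de-CH", "index.html.de-DE", "index.html.de-IT",
         "index.html.en", "index.html.es", "index.html.fr",
         "index.html.it", "index.html.ru"]
print(pick_variant("index.html", files, ["de-DE", "de", "fr", "en"]))
# index.html.de-DE
```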

The User’s View Of Content Negotiation

Visitors of a multilingual site expect that the server (initially) honors the user’s language preferences but will not arbitrarily change the language within a session.

If a user has set her preferred language in the browser to French, and points her browser to the documentation of mod_negotiation she will expect the French version. Should she change her mind, and decide to read this technical documentation in English, she will expect to be only presented the English versions of all documents, and not the French ones actually corresponding to her language preferences.

That is simple and does not require cookies. All you have to do is to always explicitly link to the resource variant in the current language. In other words, links should not point to index.html but to index.html.en.

Many sites do not use page-based content negotiation (index.html.de, index.html.fr, etc.) but organize resources in language-specific directories. For example, all German documents would be located under http://www.example.com/de/. That has a little disadvantage: The link http://www.example.com/news/000123/index.html will be content-negotiated by Apache. That means it will be distributed in a language-agnostic manner. The analogous link http://www.example.com/de/news/000123/index.html in a directory based structure will always point to the German version. Organizing resources in language-specific directories is hence less flexible because content negotiation no longer works for each landing page.

Speaking URLs

Many sites today prefer SEO-friendly speaking URLs that contain as few content-less URL components as possible. The address of this post for example is http://www.guido-flohr.net/simple-content-negotiation-for-nginx/ and not http://www.guido-flohr.net/blog/2016/02/28/. I personally think speaking URLs are overrated but they are state-of-the-art.

Using speaking URLs implies that only one single address has to support content negotiation, and that is the landing page, normally /. All other resources have to use distinct URLs on a static site. They have to be unique across languages. That simplifies our original problem. We have to teach nginx to honor the header Accept-Language only for the location /.

Nginx Configuration

The web server has to invoke the Perl handler:

    location = / {
        proxy_set_header X-Forwarded-Host $http_host;
        proxy_set_header X-Forwarded-Port 4001;

        proxy_pass http://localhost:4002;
    }

The root page of the site is the one and only landing page that has to support content negotiation because from there on the selected language gets preserved until explicitly changed by the language switch. Therefore, the handler gets invoked in line 1 only for the location /. Important! Forgetting the equals sign will activate the handler for all URLs starting with a slash, that is, for every resource on the server. You will end up in a redirect loop.
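Why the missing equals sign leads to a loop can be illustrated with a toy model in Python. It is a deliberate simplification that only distinguishes exact matching from prefix matching and ignores everything else nginx does when it selects a location. Both function names are made up for this illustration, and the redirect target /en/ stands for any language start page.

```python
def location_matches(location, uri, exact):
    """Toy model: exact ("location = /") versus prefix ("location /") match."""
    return uri == location if exact else uri.startswith(location)


def hops_until_served(start_uri, exact, max_hops=5):
    """Count redirects until a request no longer hits the handler."""
    uri = start_uri
    for hop in range(max_hops):
        if not location_matches("/", uri, exact):
            return hop   # the static file gets served, no further redirect
        uri = "/en/"     # the handler redirects to a language start page
    return None          # still redirecting after max_hops: a loop


print(hops_until_served("/", exact=True))   # 1
print(hops_until_served("/", exact=False))  # None
```

With the exact match, the redirect target /en/ no longer hits the handler and gets served as a static page. With the prefix match, /en/ starts with a slash as well, so the handler answers every request with yet another redirect.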

Lines 2 and 3 instruct nginx to pass the hostname and port of the actual server in custom request headers to the handler.

Line 5 contains the address of the handler proxy. Being a proxy means that we have to start another web server listening on port 4002 on the IPv4 loopback interface.

Perl Handler

Although the handler is written in Perl you can easily translate that into any other programming language. The important point is that the language of choice provides a simple and light-weight web server.

When speaking of nginx handlers, many people automatically think of WSGI for Python, PSGI/Plack for Perl or Rack for Ruby. However, installing PSGI/Plack with all of its dependencies usually leads to an installation orgy and that has to be repeated for every Perl update.

The original problem being ridiculously simple, I chose a solution that only has two dependencies, libintl-perl and HTTP-Server-Simple. The library “libintl-perl” is available for almost every platform as a precompiled package (for example as libintl-perl, p5-libintl-perl, perl-libintl-perl or the like). The module HTTP::Server::Simple often has to be installed manually:

$ sudo cpan HTTP::Server::Simple

The handler should be saved somewhere as lingua.pl. Do not forget to set the x-bit!

#! /usr/bin/env perl

use strict;

my %supported = ( 
    en => '/en/',
    de => '/de/',
);

my $server = ContentNegotiator->new(4002);
$server->host('127.0.0.1');
$server->run;

package ContentNegotiator;

use base qw(HTTP::Server::Simple::CGI);
use Locale::Util qw(parse_http_accept_language);

sub handle_request {
        my ($self, $cgi) = @_;

        my @linguas = parse_http_accept_language $ENV{HTTP_ACCEPT_LANGUAGE};
        my $lingua = 'en';
        my $server = "$ENV{HTTP_X_FORWARDED_HOST}:$ENV{HTTP_X_FORWARDED_PORT}";
        foreach my $l (@linguas) {
            if (exists $supported{$l}) {
                $lingua = $l;
                last;
            }
        }

print <<EOF;
HTTP/1.0 303 See Other
Location: http://$server$supported{$lingua}
Content-Length: 0

EOF
}

1;

Lines 5 to 8 contain the configuration. All available languages are mapped to URLs. I only support German and English with the start pages /de/ and /en/ respectively.

The server class gets instantiated in line 10. Line 11 instructs the proxy to listen on the IPv4 loopback address that we configured in the nginx configuration, and in line 12 the server gets started with the method run(). Alternatively, you can start the server as a background daemon with background(). But most init systems will get along better with scripts that run in the foreground.

The definition of the server class starts with line 14. It is a subclass of HTTP::Server::Simple::CGI (line 16). In line 17 we import the function parse_http_accept_language() from Locale::Util (contained in libintl-perl).

The user’s language preferences from the header Accept-Language are passed to the handler in the environment variable HTTP_ACCEPT_LANGUAGE, for example something like “de-DE, de;q=0.9, fr;q=0.7, en;q=0.3”. The function parse_http_accept_language() parses this string and in line 22 returns a list of languages sorted by language preference. That list gets stored in the variable @linguas.

In line 23 we initialize the selected language in the variable $lingua. Our fallback language is “en” for English. In line 25 we iterate over the list of preferred languages passed by the browser. In line 28 we stop at the first hit, that is, as soon as one of the preferred languages exists in %supported.

Note: This is an imperfect solution that has at least two bugs: If a language has been marked as unacceptable with a quality value of 0, it may still be selected by our code. Furthermore, wildcards are not supported. I think that the code will still work for all practical purposes because none of the major browsers supports setting quality values of 0 or wildcards in language tags.
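Should you ever need to close these two gaps, the selection loop could look roughly like the following Python sketch. It is not part of the handler presented here. The function name negotiate is hypothetical, the input is assumed to be a list of already parsed (range, quality) pairs sorted by preference, and tags are still compared literally apart from upper and lower case.

```python
def negotiate(preferences, supported, fallback="en"):
    """Pick a supported language tag.

    preferences: (language_range, quality) pairs, most preferred first.
    supported:   list of language tags that the site offers.
    """
    by_lower = {tag.lower(): tag for tag in supported}
    # Ranges with a quality value of 0 must never be selected.
    rejected = {rng.lower() for rng, quality in preferences if quality <= 0}
    for rng, quality in preferences:
        if quality <= 0:
            continue
        if rng == "*":
            # Wildcard: take the first supported language not ruled out.
            for tag in supported:
                if tag.lower() not in rejected:
                    return tag
        elif rng.lower() in by_lower:
            return by_lower[rng.lower()]
    return fallback  # nothing matched: the site still has to serve something


print(negotiate([("fr", 1.0), ("*", 0.5), ("en", 0.0)], ["en", "de"]))
# de
```

In the example the user asks for French, accepts anything else with a lower preference, but rules out English. The sketch therefore skips en when it expands the wildcard and ends up with de.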

Without any hit, $lingua still contains our fallback language “en” from line 23. Starting with line 32 we finally print a redirect to the start page for the negotiated language. We use HTTP protocol version 1.0 because we know that nginx will upgrade our reply to the correct version for the request sent by the browser. Line 35 is very important! It specifies a content length of 0 bytes. Without this header, nginx will keep on reading from the handler until it times out after a couple of seconds. That slows down things unnecessarily.

Let’s test the handler! Start a shell and type “perl lingua.pl”. In a second shell start a telnet session:

$ telnet 127.0.0.1 4002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
Host: localhost 

HTTP/1.0 303 See Other
Location: http://:/en/
Content-Length: 0

Connection closed by foreign host.

The header Location: http://:/en/ does not look good. The reason is that the custom headers X-Forwarded-Host and X-Forwarded-Port, which are set in the nginx configuration and which our handler relies on, are missing. But we can easily emulate them:

$ telnet 127.0.0.1 4002
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET / HTTP/1.1
Host: localhost
X-Forwarded-Host: www.example.com
X-Forwarded-Port: 4242

HTTP/1.0 303 See Other
Location: http://www.example.com:4242/en/
Content-Length: 0

Connection closed by foreign host.

HTTP::Server::Simple translates all HTTP request headers into environment variables following the scheme X-Forwarded-Host => HTTP_X_FORWARDED_HOST, a pretty standard technique.
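Expressed in Python, the naming scheme boils down to a one-liner. The function name cgi_env_name is made up for this illustration. The sketch covers only the general rule of the CGI convention, where Content-Type and Content-Length are exceptions that do not get the HTTP_ prefix.

```python
def cgi_env_name(header_name):
    """Map an HTTP request header name to its CGI-style variable name."""
    # Upper-case the name, turn hyphens into underscores, prepend HTTP_.
    return "HTTP_" + header_name.upper().replace("-", "_")


print(cgi_env_name("X-Forwarded-Host"))  # HTTP_X_FORWARDED_HOST
print(cgi_env_name("Accept-Language"))   # HTTP_ACCEPT_LANGUAGE
```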

The final test runs against the real web server:

$ telnet www.guido-flohr.net 80
Trying 62.75.204.82...
Connected to www.guido-flohr.net.
Escape character is '^]'.
GET / HTTP/1.1
Host: www.guido-flohr.net

HTTP/1.1 303 See Other
Server: nginx/1.6.3
Date: Sun, 28 Feb 2016 21:01:51 GMT
Content-Length: 0
Connection: keep-alive
Location: http://www.guido-flohr.net:80/en/

Hint: You can return to the prompt with CTRL-D, or with CTRL-] followed by “close” at the telnet prompt “telnet>”.

As you can see, nginx upgrades the protocol to HTTP/1.1 (line 8) and also adds the headers Server, Date, and Connection automatically.

Starting The Perl Handler On Boot

At this point nginx lets us pay the price for not using Apache. We somehow have to ensure that the Perl handler gets started automatically before nginx. I will describe three options:

Invocation Via Cron

A quick and dirty solution that does its job surprisingly well. We just edit the crontab for the web server user or root:

* * * * * nohup /path/to/lingua.pl >/dev/null 2>&1 &

There is no need to protect it against parallel execution with a PID file because our handler is always listening on the same port and the same interface. Should an instance already be running, the handler will terminate immediately with an error message.
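The mechanism behind this can be demonstrated with plain sockets. The following Python sketch shows the general operating system behavior and says nothing about the internals of HTTP::Server::Simple. The helper try_listen is hypothetical, and port 0 is only used so that the example picks a free port instead of the real port 4002.

```python
import socket


def try_listen(host, port):
    """Return a listening socket, or None if the address is already taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError:
        # Typically "Address already in use": another instance is running.
        sock.close()
        return None


first = try_listen("127.0.0.1", 0)      # port 0: let the OS pick a free port
port = first.getsockname()[1]
second = try_listen("127.0.0.1", port)  # the "second instance"
print("second instance started:", second is not None)
first.close()
```

The second attempt fails as long as the first socket is listening, which is exactly what makes the cron job harmless when it fires every minute.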

Invocation Via Init Script

That is usually very simple. You just copy a similar start script inside /etc/init.d and modify it to your needs. On BSD systems search in /etc/rc* for boilerplate code.

A start script /etc/init.d/nginx-lingua for Gentoo Linux should serve as an example:

#!/sbin/runscript

depend() {
    need net
    before nginx
}

start() {
    ebegin "Starting nginx language negotiation."
    start-stop-daemon --start \
        --exec /var/lib/nginx/lingua.pl \
        --user nginx:nginx \
        --background --make-pidfile --pidfile /var/run/nginx-lingua.pid
    eend $?
}

stop() {
    ebegin "Stopping nginx language negotiation."
    start-stop-daemon --stop \
        --exec /var/lib/nginx/lingua.pl \
        --pidfile /var/run/nginx-lingua.pid
    eend $?
}

Line 4 defines a dependency on the service net (the network). In line 5 we specify that the handler should be started before nginx, such that nginx can use the proxy immediately.

The function start() starting with line 8 contains nothing special. We start the service with start-stop-daemon, specifying the path to our script with the option --exec. With the option --user we specify an unprivileged user so that the handler does not run with root privileges.

The options in line 13 instruct start-stop-daemon to put our script into the background and store the PID in the file /var/run/nginx-lingua.pid.

Would it not be wiser to put the handler into the background inside the Perl code, and write the PID file ourselves? The answer is no because start-stop-daemon does that very reliably, contrary to most daemon implementations in Perl or other scripting languages.

The function stop() beginning with line 17 is now self-explanatory. It corresponds to the start function but a couple of start options can be omitted.

Finally, we have to start the service and add it to the default runlevel:

$ sudo /etc/init.d/nginx-lingua start
$ sudo rc-update add nginx-lingua default

Systemd

Systemd is a currently very popular alternative to classical init systems. Systemd replaced the flexible approach of init scripts with so-called “service files”. The directory /etc/systemd/system seems to be a good starting point on the search for the correct location of custom services.

Create a file /etc/systemd/system/nginx-lingua.service:

[Unit]
Description=HTTP Language Negotiation For Nginx

[Service]
Type=simple
ExecStart=/usr/share/nginx/lingua.pl
User=nginx

[Install]
WantedBy=multi-user.target

The handler should be enabled, so that it gets started automatically on boot, and then has to be started once by hand:

$ sudo systemctl enable nginx-lingua
$ sudo systemctl start nginx-lingua

Note: Systemd adds the extension “.service” automatically, as you may know it from MS-DOS based systems. You can therefore omit it in the above commands. The full file name nginx-lingua.service is accepted as well.

Conclusion

The solution presented here serves my particular requirements and should be easily adaptable to other preferences. Instead of relying on HTTP::Server::Simple you can easily rewrite the handler for Plack or Starman. And if you do not want to install libintl-perl, you can copy the function parse_http_accept_language() from the source code of Locale::Util. If you prefer to write in Ruby or Python or whatever, you also should not be confronted with any major problems.

There is another subtle difference to content negotiation with Apache’s mod_negotiation. While mod_negotiation delivers the requested resource itself, we just make a redirect. Both techniques have their advantages, and the handler presented here can be easily rewritten to deliver the resource itself.
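For the curious, here is a rough sketch of what a Python port of the handler might look like, using only the standard library. I have not run it in production, so take it as a starting point under stated assumptions: the names choose_language and make_server are my own, the header parsing is a simplified stand-in for parse_http_accept_language(), and the sketch shares the limitations of the Perl version regarding quality values of 0 and wildcards.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SUPPORTED = {"en": "/en/", "de": "/de/"}  # same mapping as in lingua.pl
FALLBACK = "en"


def choose_language(accept_language):
    """Return the first supported language of an Accept-Language value."""
    prefs = []
    for item in (accept_language or "").split(","):
        lang, _, param = item.strip().partition(";")
        name, _, value = param.strip().partition("=")
        try:
            quality = float(value) if name.strip().lower() == "q" else 1.0
        except ValueError:
            quality = 0.0  # assumption: malformed q counts as lowest preference
        prefs.append((lang.strip(), quality))
    prefs.sort(key=lambda pair: pair[1], reverse=True)  # stable sort
    for lang, _ in prefs:
        if lang in SUPPORTED:
            return lang
    return FALLBACK


class ContentNegotiator(BaseHTTPRequestHandler):
    def do_GET(self):
        lingua = choose_language(self.headers.get("Accept-Language"))
        # Host and port of the real server, as passed on by nginx.
        host = self.headers.get("X-Forwarded-Host", "")
        port = self.headers.get("X-Forwarded-Port", "")
        self.send_response(303)
        self.send_header("Location", f"http://{host}:{port}{SUPPORTED[lingua]}")
        self.send_header("Content-Length", "0")  # lets nginx stop reading
        self.end_headers()


def make_server(port=4002):
    """Create the server on the IPv4 loopback interface without starting it."""
    return HTTPServer(("127.0.0.1", port), ContentNegotiator)
```

Calling make_server().serve_forever() runs the handler in the foreground, just like run() does in the Perl version.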

