Template Toolkit and Unicode

The mere mention of Unicode in the context of Template Toolkit causes great wailing and gnashing of teeth among many Perl hackers. You will encounter doubly encoded UTF-8 or question marks that seem to be thrown in at random. How can this be avoided?

Perl's UTF8 Flag

The source of all evil is Perl's so-called utf8[sic!] flag. Starting with Perl 5.6, Perl scalars --- let's call them strings for our purposes --- have this secret property. Whether the flag is set or not is completely at the discretion of the author of the code that produced the strings.

This flag is mostly useless, but Perl will corrupt your data when you concatenate strings that have the flag set with strings that do not. Non-US-ASCII characters will either be replaced with a question mark or get doubly encoded. For example, Western accented characters may be replaced with two-character sequences that start with an uppercase A with a tilde (for example CafÃ© instead of Café), and Cyrillic strings will turn into lots of Ðs and Ñs, often followed by unprintable characters. The exact symptoms depend on the script of the input data.
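The double encoding can be reproduced with nothing but the core Encode module. The following is a minimal sketch, not taken from the original script: the variable names are made up for illustration, and the UTF-8 bytes are written as hex escapes so that the encoding of the source file does not matter.

```perl
use strict;
use warnings;
use Encode ();

# "Café" as raw UTF-8 bytes; the flag is off.
my $bytes = "Caf\xC3\xA9";

# The same word decoded into characters; Encode::decode turns the flag on.
my $chars = Encode::decode('UTF-8', "Caf\xC3\xA9");

# Concatenation: Perl treats every byte of $bytes as one 8 bit character.
my $mixed = $chars . ' / ' . $bytes;

# Encoding the result for output encodes the formerly unflagged part again.
my $output = Encode::encode('UTF-8', $mixed);

printf "flag on bytes: %s, chars: %s, mixed: %s\n",
    (utf8::is_utf8($bytes) ? 'on' : 'off'),
    (utf8::is_utf8($chars) ? 'on' : 'off'),
    (utf8::is_utf8($mixed) ? 'on' : 'off');

# Hex dump of the output: the second "é" now occupies four bytes (c3 83 c2 a9).
print unpack('H*', $output), "\n";
```

In a UTF-8 terminal the first half of the output reads Café and the second half reads CafÃ©.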

If the flag is set on data, Perl will also corrupt it when writing it to files unless you take extra measures.
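A small sketch of what happens on output, assuming an in-memory file as a stand-in for a real file or STDOUT (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode ();

# "Décembre" as characters with the flag on, as a decoding parser returns it.
my $month = Encode::decode('UTF-8', "D\xC3\xA9cembre");

# Write it to an in-memory file that has no encoding layer.
my $written = '';
open my $fh, '>', \$written or die "cannot open in-memory file: $!";
print {$fh} $month;
close $fh;

# Hex dump of what was written: the accented e is the single byte e9,
# which is not valid UTF-8 on its own.
print unpack('H*', $written), "\n";
```

A UTF-8 terminal cannot display that lone byte and typically shows a question mark or a replacement character instead.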

The problem is that Perl assumes that unflagged data is encoded in an 8 bit character set (in practice ISO-8859-1) and converts data back and forth between that character set and UTF-8. This is always the wrong decision for any sane application but cannot be prevented.

Template Toolkit and the UTF8 Flag

Like all template engines, Template Toolkit's main job is to concatenate strings from multiple sources into one output string. If all these strings are UTF-8 encoded, this should be a pretty easy task, but because of Perl's idiosyncrasies described above, there are numerous pitfalls and caveats. You must ensure that either all data involved has the UTF-8 flag switched off or all of it has the flag switched on. Mixing the two will result in errors.

The easiest way to achieve correct output is: Do nothing! If you can, just leave the flag alone, do not use the ENCODING parameter when instantiating the template processor, and do not use the so-called output disciplines when printing the rendered output.

But this is not always possible, especially when your application interacts with third party software. There are three main points of failure:

  1. Reading of Templates
  2. Variable Interpolation
  3. Output

We will illustrate that with a simple example. Create a file render.pl:

use strict;

use Template;
use YAML qw(Load);

my $yaml = <<EOF;
month: Décembre
year: 2018
prices:
  coffee: 3,50
  tea: 2,40
  beer: 2,80
EOF
my $vars = Load $yaml;
$vars->{currency} = '€';

my $template = <<EOF;
<h1>Café de la gare</h1>
<p>Menu pour [% month %] [% year %]</p>
[% INCLUDE menu.tt %]
EOF

Template->new->process(\$template, $vars);

The script reads the template variables $vars from YAML and adds a currency before it passes them to the template processor for rendering. Before you can run it, you must create the included template file menu.tt in the same directory:

<ul>
  <li>Café: [% prices.coffee %] [% currency %]</li>
  <li>Thé: [% prices.tea %] [% currency %]</li>
  <li>Bière: [% prices.beer %] [% currency %]</li>
</ul>

Make sure that Template and YAML are installed, and then run the script. It should print:

$ perl render.pl
<h1>Café de la gare</h1>
<p>Menu pour Décembre 2018</p>
<ul>
  <li>Café: 3,50 €</li>
  <li>Thé: 2,40 €</li>
  <li>Bière: 2,80 €</li>
</ul>

Everything is correct. But then you read in the manual page of YAML that you should switch to YAML::XS because it is faster and more robust. Okay, you install YAML::XS and change line 4 to read:

use YAML::XS qw(Load);

You run the script again, and:

$ perl render.pl
<h1>Café de la gare</h1>
<p>Menu pour D?cembre 2018</p>
<ul>
  <li>Café: 3,50 €</li>
  <li>Thé: 2,40 €</li>
  <li>Bière: 2,80 €</li>
</ul>

Ouch. The line with the month has suddenly changed. The accented e is messed up: D?cembre.

What has happened? The reason for the change in behavior is that YAML::XS sets the UTF8 flag recursively on the output data. And there is no way to prevent that.

All data is still in UTF-8, but parts of it have the flag set and others do not. The result is that Perl "repairs" the data by corrupting it.

Strategy 1: Get Rid Of the UTF-8 Flag

One way to fix this is to recursively clear the UTF-8 flag on the data returned by YAML::XS::Load:

...
my $vars = Load $yaml;

use Encode;
Encode::_utf8_off($vars->{month});

$vars->{currency} = '€';
...

Well, that is not really recursive, but month is the only value that is not US-ASCII, so it is enough to turn the flag off for that property. This looks like a hack, but when you read on, you will see that it is probably the simplest and therefore the best approach.
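To see what turning the flag off does to a single value, here is a minimal sketch using only the core Encode module (the variable names are illustrative). It compares Encode::_utf8_off with a regular Encode::encode call. For a flagged string both leave you with the same UTF-8 bytes.

```perl
use strict;
use warnings;
use Encode ();

# "Décembre" with the flag on: 8 characters.
my $month = Encode::decode('UTF-8', "D\xC3\xA9cembre");

# Regular route: encode the characters into a new byte string.
my $encoded = Encode::encode('UTF-8', $month);

# The route used in this article: clear the flag in place on a copy.
my $cleared = $month;
Encode::_utf8_off($cleared);

printf "flagged: %d characters, cleared: %d bytes\n",
    length $month, length $cleared;
printf "identical to encode(): %s\n", $cleared eq $encoded ? 'yes' : 'no';
```

The difference is that Encode::_utf8_off modifies its argument in place and does not copy the data.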

For a bullet-proof implementation you should really recursively switch the UTF-8 flag off. This is a function that does that for you:

use Data::Walk qw(walk);
use Encode;
use Scalar::Util qw(reftype);

sub clear_utf8_flag {
    my ($data) = @_;

    my $wanted = sub {
        if (ref $_) {
            my $obj = $_;
            if ('HASH' eq reftype $obj) {
                foreach my $key (keys %$obj) {
                    if (Encode::is_utf8($key)) {
                        my $value = delete $obj->{$key};
                        Encode::_utf8_off($key);
                        $obj->{$key} = $value;
                    }

                    my $value = $obj->{$key};
                    if (defined $value && !ref $value
                        && Encode::is_utf8($value)) {
                        Encode::_utf8_off($obj->{$key});
                    }
                }
            } elsif ('ARRAY' eq reftype $obj) {
                foreach my $item (@$obj) {
                    if (defined $item && !ref $item
                        && Encode::is_utf8($item)) {
                        Encode::_utf8_off($item);
                    }
                }
            }
        }
    };

    walk $wanted, $data;

    return $data;
}

Unfortunately, for complicated, deeply nested data structures and a lot of documents, this can become slow and turn into a bottleneck.

Strategy 2: Grokking the UTF-8 Flag

You can also go the painful way of using Perl's Unicode features.

Reading Templates: The ENCODING Parameter

The variables coming from YAML::XS all have the UTF-8 flag on and mixing that with other data causes the merged data to be corrupted. We now have to ensure that all data has the UTF-8 flag on.

We begin with the in-memory template and change the end of render.pl:

my $template = <<EOF;
<h1>Café de la gare</h1>
<p>Menu pour [% month %] [% year %]</p>
[% INCLUDE menu.tt %]
EOF
use Encode;
Encode::_utf8_on($template);

Template->new({ENCODING => 'UTF-8'})->process(\$template, $vars);

What has changed? The utf8 flag on the template string is turned on, and we additionally pass the option ENCODING => 'UTF-8' to the Template Toolkit. Run the script:

$ perl render.pl
<h1>Caf? de la gare</h1>
<p>Menu pour D?cembre 2018</p>
<ul>
  <li>Caf?: 3,50 €</li>
  <li>Th?: 2,40 €</li>
  <li>Bi?re: 2,80 €</li>
</ul>

Worse than before. Now almost everything is corrupted although (keep that in mind) everything is perfectly valid UTF-8. This time Perl corrupts the data while writing the output to STDOUT. You can avoid that by inserting one line anywhere before Template Toolkit gets invoked:

binmode STDOUT, ':utf8';
Template->new({ENCODING => 'UTF-8'})->process(\$template, $vars);

The call to binmode() does not do what the name of the function suggests. It tells Perl that strings written to standard output should be emitted as UTF-8 and that it should not try any weird conversions into an 8 bit character set.
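The effect of the layer can be checked in isolation. This is a minimal sketch that uses an in-memory file instead of STDOUT so that the written bytes can be inspected. The variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode ();

# A flagged string, like the values in $vars or the flagged template.
my $month = Encode::decode('UTF-8', "D\xC3\xA9cembre");

my $written = '';
open my $fh, '>', \$written or die "cannot open in-memory file: $!";

# The same call that the article applies to STDOUT.
binmode $fh, ':utf8';

print {$fh} $month;
close $fh;

# Hex dump: the accented e is written as the two bytes c3 a9, valid UTF-8.
print unpack('H*', $written), "\n";
```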

Let's look at the result:

$ perl render.pl
<h1>Café de la gare</h1>
<p>Menu pour Décembre 2018</p>
<ul>
  <li>Café: 3,50 â¬</li>
  <li>Thé: 2,40 â¬</li>
  <li>Bière: 2,80 â¬</li>
</ul>

Arrgh, now the Euro signs are messed up. The reason is that $vars->{currency} does not have the flag turned on. This time we use another way of setting the flag. Change the line where the currency is set like this:

$vars->{currency} = Encode::decode('UTF-8', '€');

For valid UTF-8 input, this has the same effect as:

$vars->{currency} = '€';
Encode::_utf8_on($vars->{currency});
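The following sketch compares the two variants side by side. It uses only the core Encode module, and the Euro sign is written as hex escapes so that the result does not depend on the encoding of the source file. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# The Euro sign as three UTF-8 bytes, flag off.
my $euro_bytes = "\xE2\x82\xAC";

# Variant 1: decode the bytes into a new, flagged string.
my $decoded = Encode::decode('UTF-8', $euro_bytes);

# Variant 2: copy the bytes and switch the flag on in place.
my $flagged = $euro_bytes;
Encode::_utf8_on($flagged);

printf "bytes: length %d, decoded: length %d, code point U+%04X\n",
    length $euro_bytes, length $decoded, ord $decoded;
printf "both variants equal: %s\n", $decoded eq $flagged ? 'yes' : 'no';
```

Note that Encode::_utf8_on performs no validation, so the two variants are only interchangeable when the bytes really are well-formed UTF-8.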

Run render.pl again, and, hooray, the output should now be correct.

Debugging Template Toolkit Unicode Problems

As mentioned above, there are two strategies for using Template Toolkit with UTF-8 data and templates:

  1. Do nothing! If you have charset issues, parts of your data have the utf8 flag turned on. Find the culprit, turn it off, and the problem is gone.

  2. Invoke Template->new with the option ENCODING => 'UTF-8'. Make sure that the utf8 flag is switched on for all of your data and templates. When rendering the template, make sure that you call binmode STDOUT, ':utf8' (replacing STDOUT with the file handle you are using for output).

In general, the second strategy causes a lot more work without any benefit.

When debugging charset issues, you should find the first place where a problem occurs in the output. Perl corrupts strings during concatenation or interpolation. So try to find out where the corrupted piece originated and fix the data according to your strategy by turning the flag off or on.
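To locate the culprit, it helps to list the flag state of every value before it reaches the template processor. The helper below is a hypothetical sketch and not part of Template Toolkit. The name report_utf8_flags and the sample data are assumptions, and only hashes, arrays and plain scalars are handled.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: collect one line per string value with its flag state.
sub report_utf8_flags {
    my ($data, $path, $report) = @_;
    $path   //= '$vars';
    $report //= [];

    if (ref $data eq 'HASH') {
        report_utf8_flags($data->{$_}, $path . "->{$_}", $report)
            for sort keys %$data;
    } elsif (ref $data eq 'ARRAY') {
        report_utf8_flags($data->[$_], $path . "->[$_]", $report)
            for 0 .. $#$data;
    } elsif (defined $data && !ref $data) {
        push @$report, sprintf '%-28s flag %s', $path,
            utf8::is_utf8($data) ? 'on' : 'off';
    }

    return $report;
}

# Sample data that mimics the mixed state described in this article.
my $vars = {
    month    => Encode::decode('UTF-8', "D\xC3\xA9cembre"),  # flag on
    currency => "\xE2\x82\xAC",                                # flag off
    prices   => { coffee => '3,50' },                          # flag off
};

print "$_\n" for @{ report_utf8_flags($vars) };
```

Any structure that reports a mix of "on" and "off" for non-US-ASCII values is a candidate for the corruption described above.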

If you go with the second strategy, check that the file handle that you print to is ready for UTF-8 by calling binmode HANDLE, ':utf8'.
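Whether a handle is already prepared can be checked with the built-in PerlIO::get_layers function. The helper name handle_is_utf8 below is made up for this sketch, and an in-memory file stands in for the real output handle.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the output side of the handle carries
# the "utf8" pseudo-layer.
sub handle_is_utf8 {
    my ($fh) = @_;
    return scalar grep { $_ eq 'utf8' } PerlIO::get_layers($fh, output => 1);
}

my $buffer = '';
open my $fh, '>', \$buffer or die "cannot open in-memory file: $!";

my $before = handle_is_utf8($fh);
binmode $fh, ':utf8';
my $after = handle_is_utf8($fh);

printf "before binmode: %s, after binmode: %s\n",
    ($before ? 'utf8' : 'no utf8 layer'),
    ($after  ? 'utf8' : 'no utf8 layer');
close $fh;
```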

It should also be noted that Template Toolkit checks input templates for BOMs (byte order marks) and may set the utf8 flag for template files that have a byte order mark. Since that behavior is undocumented and byte order marks are an obsolete concept anyway, you should probably remove the BOM from your template and continue debugging.
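Removing a UTF-8 byte order mark takes a single substitution. The following is a sketch under the assumption that the template source has been read as raw bytes. The helper name strip_utf8_bom and the sample template are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical helper: drop a leading UTF-8 byte order mark (EF BB BF).
sub strip_utf8_bom {
    my ($source) = @_;
    $source =~ s/\A\xEF\xBB\xBF//;
    return $source;
}

my $with_bom = "\xEF\xBB\xBF<h1>Menu</h1>\n";
my $clean    = strip_utf8_bom($with_bom);

printf "%d bytes before, %d bytes after\n", length $with_bom, length $clean;
```

Templates without a byte order mark pass through unchanged.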

