Extending Xgettext With Locale-XGettext

The program xgettext from GNU Gettext was originally written for extracting translatable messages from C source files into PO files. It has since been extended to support 26 more languages (at the time of this writing). But what do you do when this is not enough?

Motivation For Locale::XGettext

Adding a new language to xgettext is straightforward, albeit not trivial. Still, for various reasons it is very often not appropriate to extend xgettext permanently:

  • The support code for the new language has to be written in C, not necessarily the language of choice for writing a parser.
  • Somebody has to maintain the code.
  • The language support has to be very generic if it is included in GNU gettext, whereas a parser for a particular project or use case can be a lot more relaxed. It does not always have to be general-purpose to be functional.
  • The support for the new language is only available after the next release of GNU gettext. It takes even longer until vendors pick up the new gettext version.
  • You may not even extract messages from files at all but instead from a database, a web site, a content management system, etc.
  • Your use case may be too specific. For example, extracting translatable strings from HTML makes perfect sense. But is there a one-and-only way to do this?

Many, many string extractors for creating .po files (or rather .pot files) have been born in such situations. They very often spring into life as simple one-shot solutions with hard-coded defaults and assumptions. Then it turns out that you actually have to make things configurable. You add more and more bells and whistles, copy everything over into the next project, fix a bug, and forget to backport the fix. You’ve been through that, haven’t you?

A while ago, I was in such a situation. I wrote Template-Plugin-Gettext, a gettext plug-in for the Template Toolkit, a popular and powerful template system for Perl. Almost simultaneously, I had to develop a website translation framework for imperia CMS. I soon got tired of copying the boilerplate code back and forth between the two projects and decided to write a more generic solution. The result is Locale-XGettext.

Installation

Before you can write an extractor, you first have to install the library itself. See https://github.com/gflohr/Locale-XGettext#installation for various ways to do this.

Basic Usage

The following example is written in Perl because Locale::XGettext itself is written in Perl. But you can also write extractors in C, Java, Python or Ruby (and actually many more languages with a little bit of effort). See the instructions for writing extractors in other languages on GitHub for more information.

The minimal implementation for an xgettext-sample is as simple as this:

#! /usr/bin/env perl

use strict;

Locale::XGettext::Sample->newFromArgv(\@ARGV)->run->output;

package Locale::XGettext::Sample;

use strict;

use base 'Locale::XGettext';

Line 5 is actually the entire wrapper. It creates an instance of our class Locale::XGettext::Sample with the constructor newFromArgv(), and passes all command-line arguments to it. Then there is a chained method call to run() which runs the extractor and then output() which outputs the PO file. The stub implementation for Locale::XGettext::Sample follows starting at line 7. Actually, it implements nothing but just inherits everything from the base class.

Enough explanations. Make the file executable, and off we go:

$ chmod +x xgettext-sample
$ ./xgettext-sample
./xgettext-sample: no input file given
Try './xgettext-sample --help' for more information!
$ ./xgettext-sample --help
Usage: ./xgettext-sample [OPTION] [INPUTFILE]...

Extract translatable strings from given input files.

Mandatory arguments to long options are mandatory for short options too.
Similarly for optional arguments.

Input file location:
  INPUTFILE ...               input files
  -f, --files-from=FILE       get list of input files from FILE
  -D, --directory=DIRECTORY   add DIRECTORY to list for input files search
etc.

Does that look familiar to you? Compare it with the output from the original xgettext:

$ xgettext
xgettext: no input file given
Try 'xgettext --help' for more information.
$ xgettext --help
Usage: xgettext [OPTION] [INPUTFILE]...

Extract translatable strings from given input files.

Mandatory arguments to long options are mandatory for short options too.
Similarly for optional arguments.

Input file location:
  INPUTFILE ...               input files
  -f, --files-from=FILE       get list of input files from FILE
  -D, --directory=DIRECTORY   add DIRECTORY to list for input files search
...

Your ten-liner has almost the same impressive command-line interface that xgettext has. And best of all, the options actually work. Well, mostly:

$ ./xgettext-sample does-not-exist.txt
Error resolving file name 'does-not-exist.txt': No such file or directory!

Specifying a non-existing input file will trigger an error. For lack of another input file, we therefore specify the script name itself:

$ ./xgettext-sample xgettext-sample 
Can't locate object method "readFile" via package "Locale::XGettext::Sample" at /opt/local/lib/perl5/site_perl/5.24/Locale/XGettext.pm line 186.

Perl complains that the method readFile() was not implemented. So add this stub implementation to the end of xgettext-sample:

sub readFile {
    my ($self, $filename) = @_;

    # FIXME! Parse $filename and extract translatable strings!
}

This is how a method implementation looks in Perl. The method readFile() expects two arguments. The first is the instance, conventionally called $self (you can name it $this or whatever else you want if you prefer), the second one is the name of the file to parse. Locale::XGettext invokes this method for every input file specified.

Run the script again, and the error should vanish. On the other hand, nothing happens. Why? Because no strings have been found, and Locale::XGettext, just like xgettext, will not create an output file without any strings unless told to do so:

$ ./xgettext-sample --force-po xgettext-sample

Now a valid PO file messages.po has been created in the current directory but it only contains the PO header and no strings.

Let’s change that now. Instead of really parsing a file, our example will just add a few hard-coded strings. Change the method readFile() to read like the following:

sub readFile {
    my ($self, $filename) = @_;

    $self->addEntry('msgid', 'Hello, world!');
    $self->addEntry(msgid => 'Goodbye, solitude!');
    $self->addEntry({msgid => 'A hash reference'});
}

Lines 4-6 are new and show three different ways to add PO entries to the output. They are all equivalent. Pick the one that you feel most comfortable with.

Run the script again, this time without --force-po:

$ ./xgettext-sample xgettext-sample

Still nothing happens? No! No news is good news. A file messages.po was created:

$ tail messages.po

msgid "Hello, world!"
msgstr ""

msgid "Goodbye, solitude!"
msgstr ""

msgid "A hash reference"
msgstr ""

For many purposes you are already done and can begin implementing the real parser. The PO file created is not very fancy but it works.
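To give an idea of what a "real parser" could look like at its very simplest, here is a sketch that scans source text for strings wrapped in _("..."). The marker _() and the helper name find_marked_strings() are made up for this illustration. They are not part of Locale::XGettext, and a serious parser would have to deal with escape sequences, comments, and strings spanning several lines.

```perl
use strict;
use warnings;

# Find all occurrences of _("...") in a piece of source code.  The marker
# _() is a hypothetical convention chosen for this sketch.
sub find_marked_strings {
    my ($source) = @_;

    my @found;
    my $lineno = 0;
    foreach my $line (split /\n/, $source) {
        ++$lineno;
        # Double-quoted literals only; escape sequences are kept as they are.
        while ($line =~ /\b_\(\s*"((?:[^"\\]|\\.)*)"\s*\)/g) {
            push @found, { msgid => $1, lineno => $lineno };
        }
    }

    return @found;
}

# Inside the extractor class, readFile() could then feed the results to
# addEntry(), using the properties "msgid" and "reference":
#
#     sub readFile {
#         my ($self, $filename) = @_;
#
#         open my $fh, '<', $filename or die "$filename: $!\n";
#         my $source = do { local $/; <$fh> };
#         foreach my $string (find_marked_strings($source)) {
#             $self->addEntry(msgid => $string->{msgid},
#                             reference => "$filename:$string->{lineno}");
#         }
#     }
```

Keeping the scanning code in a plain function makes it easy to test without running the whole extractor.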

You can now play around with the command-line options. You can use -D or --directory for searching additional directories for source files. The option -f or --files-from can be used to read the list of input files from a plain text file (and contrary to GNU gettext, you can even specify this option multiple times). Study the output of --help for more ideas.

Adding References, Flags, Etc.

The PO entries in the output file are currently completely naked. Normally they contain references to the source files, and often flags like “c-format”, “no-c-format”, and so on. Let’s see how this is done:

sub readFile {
    my ($self, $filename) = @_;

    my $lineno = 1;
    $self->addEntry(msgid => 'Hello, world!',
                    reference => "$filename:$lineno",
                    flags => 'greet-format');
    ++$lineno;
    $self->addEntry(msgid => 'Goodbye, solitude!',
                    reference => "$filename:$lineno",
                    flags => 'no-greet-format,goodbye-format');
}

Now we set additional properties for the entries: reference for the location of the message in the source file, and flags for a comma-separated list of flags to apply.

Run and check again:

$ ./xgettext-sample xgettext-sample
$ tail messages.po
#: xgettext-sample:1
#, greet-format
msgid "Hello, world!"
msgstr ""

#: xgettext-sample:2
#, no-greet-format, goodbye-format
msgid "Goodbye, solitude!"
msgstr ""

That now looks more familiar, doesn’t it?

By the way, you should avoid setting flags directly from the parser but rather let Locale::XGettext add them automatically, when needed. See further below for more information!

Support For Keywords

For most languages, the keywords that mark translatable strings are configurable. For example, the C implementation has the default keywords “gettext, ngettext, …”. You can even specify which arguments contain the translatable string, which one contains a possible plural form, and which one a message context. Locale::XGettext supports that as well:

$ ./xgettext-sample --keyword=npgettext:1c,2,3 xgettext-sample

That translates to: In addition to the default keywords for the language in question, also accept the keyword “npgettext()”. The message context is extracted from the first argument (the trailing “c” stands for “context”), the singular form from argument 2, and the plural form from argument 3.
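If the notation looks cryptic, the following sketch spells it out in code. It is an illustration only and not the parser that Locale::XGettext uses internally. The function name parse_keyword_spec() is made up, and additional parts of a specification, such as quoted comments, are simply ignored here.

```perl
use strict;
use warnings;

# Split a keyword specification like "npgettext:1c,2,3" into its parts.
# A position of 0 means that the keyword has no such argument.
sub parse_keyword_spec {
    my ($spec) = @_;

    my ($name, $args) = split /:/, $spec, 2;
    my %keyword = (name => $name, singular => 0, plural => 0, context => 0);

    foreach my $arg (split /,/, $args // '') {
        if ($arg =~ /^([1-9][0-9]*)c$/) {
            # A trailing "c" marks the message context argument.
            $keyword{context} = $1;
        } elsif ($arg =~ /^[1-9][0-9]*$/) {
            # The first plain number is the singular, the second the plural.
            if (!$keyword{singular}) {
                $keyword{singular} = $arg;
            } elsif (!$keyword{plural}) {
                $keyword{plural} = $arg;
            }
        }
        # Everything else is ignored in this sketch.
    }

    # Without an explicit position, the first argument is the singular form.
    $keyword{singular} ||= 1;

    return \%keyword;
}

my $keyword = parse_keyword_spec('npgettext:1c,2,3');
printf "%s: context=%d, singular=%d, plural=%d\n",
       @{$keyword}{qw(name context singular plural)};
```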

Locale::XGettext understands these keyword definitions but your parser has to honor them, too. We would need a real parser in order to be able to show that. We therefore just show how you can retrieve all the necessary information:

sub readFile {
    my ($self, $filename) = @_;

    # Assume that we have found the invocation of a function
    # "npgettext()" in the code.
    my $func_name = 'npgettext';

    # Does not really help...
    my $option = $self->option('keyword');

    # But this does:
    my $keywords = $self->keywords;
    if (exists $keywords->{$func_name}) {
        my $singular = $keywords->{$func_name}->singular;
        my $plural = $keywords->{$func_name}->plural;
        my $context = $keywords->{$func_name}->context;
      
        # And now use this information to find the right arguments.
    }
}

In line 9 we get the value for the command-line option “keyword”. But that doesn’t really help. You would have to parse all the option strings yourself and merge them with the possible default keyword definitions for your language.

Instead you call the method keywords() (line 12) and get all keyword definitions known to Locale::XGettext as a hash. Locale::XGettext has also merged the keyword definitions specified on the command-line with the default definitions for your language (see below).

Line 13 shows how to test whether a certain keyword is defined. Otherwise, you would not bother looking at the arguments. If it is defined, you get the position of the singular, plural or context argument. The position of the singular argument is always an integer greater than zero. The position of the plural argument or the context argument can be 0 if the keyword in question does not have a plural or context argument.

Likewise, the method comment() returns a possible automatic comment for that specific keyword or undef (or NULL, nil, None …). But Locale::XGettext adds these automatic comments itself, and you normally do not have to bother about it.
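Once your parser has collected the arguments of a call, the positions can be applied along these lines. This is only a sketch: the helper entry_from_call() is hypothetical, and the property names msgid_plural and msgctxt are an assumption modelled on the PO format, so check the Locale::XGettext documentation for the properties that addEntry() really accepts.

```perl
use strict;
use warnings;

# Build the properties of a PO entry from the arguments of one call.
# $positions holds the 1-based positions for singular, plural and context,
# with 0 meaning "not used".
sub entry_from_call {
    my ($positions, @args) = @_;

    # Skip calls that have fewer arguments than the keyword requires.
    foreach my $role (qw(singular plural context)) {
        return if ($positions->{$role} || 0) > @args;
    }

    # Positions are 1-based, Perl arrays are 0-based.
    my %entry = (msgid => $args[$positions->{singular} - 1]);
    $entry{msgid_plural} = $args[$positions->{plural} - 1]
        if $positions->{plural};
    $entry{msgctxt} = $args[$positions->{context} - 1]
        if $positions->{context};

    return \%entry;
}

# Pretend that we found: npgettext("Menu", "One file", "%d files", $count)
my $entry = entry_from_call(
    { singular => 2, plural => 3, context => 1 },
    'Menu', 'One file', '%d files', '$count',
);
print "$_: $entry->{$_}\n" foreach sort keys %$entry;
```

In a real extractor the positions would come from the singular, plural and context methods shown in the example above instead of a literal hash.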

Default Keywords

If your language has default keywords that should always be used, you should implement the method defaultKeywords():

sub defaultKeywords {
  return [
    'gettext',
    'ngettext:1,2',
    'npgettext:1c,2,3',
    'dgettext:2',
    'dngettext:2,3',
    'dnpgettext:2c,3,4'
  ];
}

This should now be self-explanatory. Note (line 3) that you do not have to specify the argument position of the singular form if a function has just one single argument.

Automatic Comments

PO files can contain comments anywhere in the file. They use the common syntax “# Comment text …”. Automatic comments are automatically inserted by tools above certain entries. They always start with “#.” at the beginning of a line.

Translator Comments

In a localized Perl file you may find something like this:

if ($dayno == 6) {
    # TRANSLATORS: This is the abbreviation for "Sunday"!
    $day = gettext("Sun");
}

Obviously, the programmer followed a very common convention for localized sources: The string “Sun” without context looks like the name of the star currently shining on your head. A so-called translator comment prevents a misunderstanding here and explains that this is for the day of the week.

But Locale::XGettext has to know whether a comment preceded the occurrence of the string. This is how you do it:

sub readFile {
    my ($self, $filename) = @_;

    $self->addEntry(msgid => "Sun",
                    automatic => " TRANSLATORS: The week day!");
}

Pass whatever comment (or multi-line comment) you find immediately preceding the keyword but don’t bother parsing the comment yourself! Just strip off the comment marker for your language. Locale::XGettext will do the right thing for you.

The comment marker “TRANSLATORS:” is just a convention. You have to specify it explicitly on the command-line:

$ ./xgettext-sample --add-comments="TRANSLATORS:" xgettext-sample
$ cat messages.po
...
#. TRANSLATORS: The week day!
msgid "Sun"
msgstr ""

Again: Do not parse comments yourself! If you specify --add-comments='' (with an empty argument), all comments preceding a keyword should be added to the PO file. Additionally, some special comments that contain the string “xgettext:” are also treated in a specific way. You don’t want to implement all that logic yourself!
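What the parser does have to do is remember the comment lines directly in front of a keyword. A minimal sketch for sources with Perl-style "#" comments and a hard-coded keyword gettext() could look like this. The function name extract_with_comments() is made up for the example, and the regular expression is far too naive for real-world code.

```perl
use strict;
use warnings;

# Collect gettext("...") calls together with the comment block that
# immediately precedes them.  Only the comment marker "#" is stripped.
# The comment text itself is passed on untouched.
sub extract_with_comments {
    my ($source) = @_;

    my (@entries, @pending);
    foreach my $line (split /\n/, $source) {
        if ($line =~ /^\s*#(.*)$/) {
            push @pending, $1;
            next;
        }
        if ($line =~ /\bgettext\(\s*"((?:[^"\\]|\\.)*)"/) {
            my %entry = (msgid => $1);
            $entry{automatic} = join "\n", @pending if @pending;
            push @entries, \%entry;
        }
        # Any line that is not a comment ends the current comment block.
        @pending = ();
    }

    return @entries;
}
```

Each returned hash could be passed to addEntry() as it is. Whether the comment ends up in the PO file is then decided by Locale::XGettext, based on the command-line options.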

Keyword-Specific Comments

It is also possible to specify that translations marked with a certain keyword should receive an automatic comment in the PO file:

./xgettext-sample --keyword=greet:1,'"Hello, translator!"'

Note that the double-quotes have to be protected from shell expansion!

To make this magic happen Locale::XGettext has to know the keyword that a particular string was marked with. That is very easy:

sub readFile {
    my ($self, $filename) = @_;

    $self->addEntry(msgid => "Hello, world!",
                    keyword => "greet");
    $self->addEntry(msgid => "Goodbye, solitude!",
                    keyword => "wave_goodbye");
}

Run the extractor again with the keyword option from above. Now look into messages.po:

#. Hello, translator!
msgid "Hello, world!"
msgstr ""

msgid "Goodbye, solitude!"
msgstr ""

The first string was marked with the keyword “greet” (at least we pretend that) and gets the automatic comment specified on the command-line. The second string used a different keyword and the resulting PO entry does not have a comment.

Keyword-Specific Flags

Similar to the way that automatic comments can be added for certain keywords, the same can happen for flags:

$ ./xgettext-sample --keyword=greet:1,'"Hello, translator!"' \
    --flag=greet:1:no-perl-brace-format xgettext-sample
$ cat messages.po
...
#. Hello, translator!
#, no-perl-brace-format
msgid "Hello, world!"
msgstr ""

msgid "Goodbye, solitude!"
msgstr ""

The first string now additionally receives the flag “no-perl-brace-format”.

There is a difference to GNU gettext to note here. The default flags for C in GNU gettext contain the following definitions:

    ...
    "gettext:1:pass-c-format",
    ...
    "printf:1:c-format",
    ...

This is best understood with an example in C:

printf(gettext("Filename '%s', line number %d\n"), filename, lineno);

The first argument to printf() is always a C format string. This is expressed as “printf:1:c-format”. The argument to gettext() on the other hand is just an ordinary string. But if the return value of gettext() is used as the first argument to printf(), then that quality of the printf() argument is passed back to gettext().

The reason for that is that C format strings are potentially dangerous. A program can easily crash if the format string does not fit the arguments that follow. Imagine that the French translator had translated the above string to “ligne %d en ficher ‘%s’\n”. The first argument for printf() is still the file name and the second argument the line number, and that will almost inevitably result in a segmentation fault.

The above flag definitions from GNU gettext effectively prevent that. They enforce that the translated format string matches the invocation in the source code. That is good.
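To illustrate the kind of check that a flag like “c-format” makes possible, here is a rough sketch that compares the directives of an original string and its translation. It is not the algorithm used by GNU gettext. The function names are made up, and positional directives like %1$s, which allow reordering, are not handled at all.

```perl
use strict;
use warnings;

# The pieces of a printf directive, without positional arguments.
my $flags     = qr/[-+ #0]*/;
my $width     = qr/(?:[0-9]+|\*)?/;
my $precision = qr/(?:\.(?:[0-9]+|\*))?/;
my $length    = qr/(?:hh|h|ll|l|L|j|z|t)?/;
my $directive = qr/%$flags$width$precision$length([diouxXeEfFgGaAcspn%])/;

# Return the conversion characters of all directives in a format string,
# in order of appearance.  "%%" is a literal percent sign and is skipped.
sub format_directives {
    my ($format) = @_;

    my @conversions;
    while ($format =~ /$directive/g) {
        push @conversions, $1 unless $1 eq '%';
    }

    return @conversions;
}

# True if both strings expect the same conversions in the same order.
sub formats_match {
    my ($msgid, $msgstr) = @_;

    return join(',', format_directives($msgid))
        eq join(',', format_directives($msgstr));
}

my $msgid  = "Filename '%s', line number %d\n";
my $msgstr = "ligne %d en ficher '%s'\n";
print formats_match($msgid, $msgstr) ? "compatible\n" : "mismatch\n";
```

For the two strings from the example this reports a mismatch, because the translation swaps the order of %s and %d.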

On the other hand, that scenario is mostly specific to C and it considerably complicates the implementation of string extractors. Locale::XGettext does not directly support checking of nested function calls. For the same reason it simply ignores a prefix “pass-” specified for a flag. It would be more consistent to ignore such flag definitions altogether, but ignoring just the prefix works better for cargo-culted flag definitions.

Feel free to implement multi-level argument checking in your parser! But Locale::XGettext cannot assist you with that.

Extracting Strings From Other Data Sources

Another common scenario is that translatable strings do not come from source files but rather from a database or similar systems. Locale::XGettext supports that:

sub extractFromNonFiles {
    my ($self) = @_;

    my $dbname = $self->option('database_name');
    
    # Query the database and use addEntry() to feed strings into
    # Locale::XGettext.
    # ...
}

The method extractFromNonFiles() gets called after all input files have been processed. By default, it does nothing. But you can override it and add more strings from other sources.

Wait! What is the option “database_name” in line 4 of the code example above? There is no such option!

That depends on you:

Modelling the Command-Line Interface

The default command-line interface does not always fit for custom extractors. Locale::XGettext allows you to put a square peg into a round hole.

Language-Specific Options

You can extend the command-line interface by adding language-specific options:

sub languageSpecificOptions {
    return
        "database-name=s",
        "database_name",
        "    --database-name",
        "specify the name of the database",

        "database-user=s",
        "database_user",
        "    --database-user",
        "specify the user to connect with to the database",

        "database-pass=s",
        "database_pass",
        "    --database-pass",
        "specify the database password";
}

Check the usage information of your extractor:

$ ./xgettext-sample --help
...
Language specific options:
  -kWORD, --keyword=WORD      look for WORD as an additional keyword
  -k, --keyword               do not to use default keywords
      --flag=WORD:ARG:FLAG    additional flag for strings inside the argument
                              number ARG of keyword WORD
      --database-name         specify the name of the database
      --database-user         specify the user to connect with to the database
      --database-pass         specify the database password
...

Three new options have been added to the command-line interface. Each group consists of four strings:

  1. The option definition, for example "database-name=s" for an option "--database-name" that takes a mandatory string argument. See the documentation of Getopt::Long for details.
  2. The "name" of the option. This is the argument that you pass to the method "option()" if you want to retrieve the option value.
  3. The left part of the help output for that option.
  4. The right part of the help output for that option.
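The grouping of the flat list into chunks of four can be pictured with the following sketch. It only illustrates the layout of the list. How Locale::XGettext processes it internally is not shown here, and the function name and hash keys are invented for the example.

```perl
use strict;
use warnings;

# Turn the flat list returned by languageSpecificOptions() into one hash
# per option.  The keys are made-up names for the four parts listed above.
sub group_option_specs {
    my (@flat) = @_;

    die "the number of strings must be a multiple of four\n" if @flat % 4;

    my @groups;
    while (my ($spec, $name, $left, $right) = splice @flat, 0, 4) {
        push @groups, {
            spec       => $spec,    # definition for Getopt::Long
            name       => $name,    # argument for the method option()
            help_left  => $left,    # left part of the help output
            help_right => $right,   # right part of the help output
        };
    }

    return @groups;
}

my @options = group_option_specs(
    "database-name=s", "database_name",
    "    --database-name", "specify the name of the database",
    "database-user=s", "database_user",
    "    --database-user", "specify the user to connect with to the database",
);
print "$_->{name} is defined as '$_->{spec}'\n" foreach @options;
```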

If you are not happy with the automatic help output generated by Locale::XGettext for your new options, you can also override the method printLanguageSpecificOptions() and roll your own version:

sub printLanguageSpecificOptions {
    my ($self) = @_;

    print "  --foo=bar    you know what to do\n";
}

Describe Your Expected Input

Easy:

sub fileInformation {
    return "Input files must be valid foomatic files.";
}

Test it:

$ ./xgettext-sample --help
Usage: ./xgettext-sample [OPTION] [INPUTFILE]...

Extract translatable strings from given input files.

Input files must be valid foomatic files.
...

Describe the Extractor Capabilities

Not all command-line options make sense for all extractors. By overriding several methods you can switch certain features on and off:

sub needInputFiles {
    return;  # Nothing is false.

    # This is also false:
    #return 0;
    #return '';
    # Everything else is true:
    #return 42;
}

If the method needInputFiles() returns a false value in Perl, Locale::XGettext no longer expects input files. It changes the usage information accordingly and no longer invokes readFile() but only extractFromNonFiles().

The following “boolean” capability methods are currently supported:

  • needInputFiles(): See above! Default is true.
  • canKeywords(): Does the extractor support keywords? Should -k or --keyword work? Default is true.
  • canFlags(): Does the extractor support automatic flags? Should --flag work? Default is true.
  • canExtractAll(): Can the extractor extract all strings and not just those that were marked for translation? Switches on and off the option -a, --extract-all.

The option --extract-all does not work out of the box. If you want to support it, get the option value, and decide which strings to return for your input.

Plans For the Future

Interfacing Locale::XGettext from other languages, especially from C and Java could be simplified by allowing JSON strings instead of nested data structures. That will hopefully be implemented soon for one of the next versions.

Another problem stems from the fact that Locale::XGettext currently depends on the Perl library Locale::PO for reading and writing PO files. Locale::PO’s API has a number of quirks, and it does not allow changing the exact format of the PO files produced. Without this feature, Locale::XGettext cannot implement a number of output detail options supported by GNU gettext’s xgettext.

State Of the API

Locale::XGettext is relatively well tested and it has actually already been successfully used in a number of projects. And all currently planned future features can be implemented in a backwards-compatible manner. Still, keep in mind that the version number at the time of this writing is 0.1 and you know yourself what that means.

Summary

Locale::XGettext is a supplement to GNU xgettext. It allows writing custom string extractors compatible with GNU gettext with minimal effort in a variety of programming languages. Currently supported are Perl, C, Java, Python, and Ruby.

