On the state of i18n in Perl
The following text is an attempt to describe the situation I encountered when I came to the Perl world last December. I’ve done some translating for the Debian project and I was a bit shocked by the state of Perl’s i18n. I have to admit I’m still an inexperienced hacker, but I wanted to write this article to raise some awareness of the issues described if I’m right, and to learn something new if I’m wrong. Anyway, I’ve tried to keep this article constructive; it’s still just my opinion, so please comment accordingly.
Disclaimer: I’m essentially talking about l10n, but most people know it as i18n, so I’m keeping “i18n” throughout the text.
The i18n problem
When it comes to making your application translatable in Perl, there are actually two schools of doing it: via Maketext and via GNU gettext.
GNU gettext is the best-known software translation tool, used in most open-source projects, while Maketext is a child of the Perl world. And the bad thing is: Maketext is currently more popular, but if you are using Maketext to make your application translatable, you are doing it wrong!
Let’s look at how Maketext works according to its documentation, and contrast that with the gettext approach. The Maketext manual defines the process as follows (quoting freely); a minimal sketch of what these steps look like in code follows the list:
- Decide what system you’ll use for lexicon keys (i.e. base language)
- Create a class for your localization project
- Create a class for the language your internal keys are in
- Go and write your program
- Once the program is otherwise done, and once its localization for the first language works right (via the data and methods in Projname::L10N::en_us), you can get together the data for translation.
- Submit all messages/phrases/etc. to translators
- Translators may request clarification of the situation in which a particular phrase is found
- Each translator should make clear what dependencies the number causes in the sentence
- Remind the translators to consider the case where N is 0
- Remember to ask your translators about numeral formatting in their language
- The basic quant method that Locale::Maketext provides should be good for many languages. […] For the particularly problematic Slavic languages, what you may need is a method which you provide with the number, the citation form of the noun to quantify, and the case and gender that the sentence’s syntax projects onto that noun slot.
- Once you’ve localized your program/site/etc. for all desired languages, be sure to show the result (whether live, or via screenshots) to the translators.
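To make this more concrete, here is a minimal sketch of what the project class, the language classes and a phrase using quant might look like. The package names, lexicon entries and German strings are invented for illustration, not taken from the manual:

```perl
package MyApp::L10N;
use strict;
use warnings;
use parent 'Locale::Maketext';          # base class for the whole project

package MyApp::L10N::en;
use parent -norequire, 'MyApp::L10N';
# _AUTO lets unknown keys fall through unchanged, so the English
# "internal keys" double as the English translation.
our %Lexicon = ( _AUTO => 1 );

package MyApp::L10N::de;
use parent -norequire, 'MyApp::L10N';
our %Lexicon = (
    'Your search matched [quant,_1,document]!'
        => 'Ihre Suche ergab [quant,_1,Dokument,Dokumente]!',
);

package main;
# Pick a language handle and translate a phrase containing a number.
my $lh = MyApp::L10N->get_handle('de') or die "No language handle";
print $lh->maketext('Your search matched [quant,_1,document]!', 5), "\n";
```

Even in this toy example the translator (or, more realistically, the programmer on the translator’s behalf) has to know that quant needs exactly two noun forms for German; for languages with more than two plural forms it gets considerably messier.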
There is a lot of sense in this, and it was certainly valid back in 1999, but a lot of the work in this process is left unspecified. For example, the translation process itself is questionable:
- How do you “Submit all messages/phrases/etc. to translators”?
- How do you integrate translations back from translators?
- How do you resubmit translation strings if they change?
- How do you communicate “situation in which a particular phrase is found” (i.e. context)?
- What happens if one phrase has to be translated differently depending on context? How does one implement that in a module properly?
- How does the translator “make clear what dependencies the number causes”? To what extent does that happen? Will the developer even understand them at all?
- Does the programmer really have to understand all of the implications of each language implemented? Should every programmer on the team understand them?
- Who actually implements that “quant” method? How? What about languages with exceptions?
One basic but fatal mistake Maketext makes is off-loading a lot of linguistic work onto the programmer.
- One particularly important point is plural form support (‘1 apple’, ‘2 apples’), which matters for many languages outside the USA and Western Europe. Maketext requires you to write a quant function that gets a string and a number as parameters and does some voodoo to produce the right string. The voodoo is undefined. In gettext it is defined: a formula for producing plural forms selects one of the provided plural phrases.
- No translator in their right mind will ever write a Perl module for a language (they aren’t programmers, remember?), so the programmer will have to do it and will also have to understand the implications.
- The quant notation ("Your search matched [quant,_1,document]!", as in the sketch above) foolishly assumes word order is the same in all languages. Implementing a quant method properly would require passing the whole sentence into the function and doing a complete linguistic transformation, which is highly non-trivial and better done by a human.
- Most of those linguistic “conventions”, like number formatting or plural forms, do not change over time and can be compiled in one place. One such place is Unicode’s CLDR project, which includes plural form building and number/date formatting among other country- and language-dependent data.
- It can’t even be assumed that the translators actually know all of these conventions! They might assume they know them, but a translator is not necessarily doing translations for a living; they might be a volunteer, as in most open-source projects. Imagine what happens when an amateur translator explains the inner workings of their native language to a programmer.
Compared to this, gettext has a saner, more practical approach: it provides a standardized translation string format (see the sample catalog entry after the list below), handles updates of message catalogs cleanly, provides all necessary tools for message extraction, doesn’t require any additional modules, works mostly language-agnostically, provides contexts and translators’ comments, and even the plural form calculation formulae are explicitly noted in the manual. It also emphasizes asynchronous translation: translation strings can be extracted and imported at any time in the lifecycle of a project. A developer essentially has to do the following:
- Integrate gettext into the project (the details depend on the programming language used)
- Mark extractable strings
- Run extraction and merging scripts (mostly included with gettext)
- Submit translation files to translators
- Copy received translations back into the project
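For illustration, a single entry in such a translation file (a PO catalog) looks roughly like this; the source path, comment and strings here are made up, but the layout, including the context and plural handling, is standard gettext:

```
#. Developer comment for the translator: shown in the inbox header.
#: lib/MyApp/Mail.pm:42
#, perl-format
msgctxt "inbox"
msgid "You have %d new message."
msgid_plural "You have %d new messages."
msgstr[0] "Sie haben %d neue Nachricht."
msgstr[1] "Sie haben %d neue Nachrichten."
```

Which msgstr[n] gets picked at runtime is decided by the Plural-Forms formula in the catalog header (for German: nplurals=2; plural=(n != 1);), so the plural rules of the target language live with the translation, not in the program code.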
gettext, of course, is not perfect. It lacks several vastly important features, like proper gender support (e.g. “He was born” and “She was born” are different in Russian). But it generally follows the “it mostly works” principle, making the features needed 95% of the time available. Workflow tools make using gettext a snap. Compared to Maketext it is also easier for the programmer to support and easier for the translator to produce translations with. The dreaded quant function actually makes using Maketext properly for translations impossible.
Apart from those technical shortcomings, there is a bigger threat.
TPJ13 is an excellent summary of i18n problems, which every developer, even a non-Perl one, should read. Its solution part, however, is hopelessly out of date; don’t forget, TPJ13 is getting ten years old this year. Back in 1999 gettext didn’t have any plural form support and also lacked many other features, so the authors’ point was valid at the time. However, gettext implemented its plural support rather quickly, and at that point Maketext should have been retired immediately. Sadly, this has not happened.
That misunderstanding haunts us to this day. Every novice Perl hacker is introduced to TPJ13 and tends to believe Maketext is the way to go. Failing to see its shortcomings, however, yields well-meant but still failed creations like Locale::Maketext::Lexicon, which tries hard to bring the world of gettext to Maketext-infected minds. What we get is crazy stuff like this (verbatim from the POD):
```
#: Hello.pm:11
msgid "You have %quant(%1,piece) of mail."
msgstr "Sie haben %quant(%1,Poststueck,Poststuecken)."
```
instead of the proper form (German spelling corrected a bit):
```
#: Hello.pm:11
msgid "You have 1 piece of mail."
msgid_plural "You have %d pieces of mail."
msgstr[0] "Sie haben 1 Poststueck."
msgstr[1] "Sie haben %d Poststuecke."
```
The former has virtually no tool support (not even gettext’s extraction routine xgettext); extraction is only supported by the home-grown xgettext.pl (notice the .pl suffix). And there we have some fatal stuff going on:
- Locale::Maketext::Lexicon is considered the solution for using gettext in Perl
- Neither Locale::Maketext::Lexicon nor xgettext.pl have any notion of proper plural forms
- .po files created by xgettext.pl are not fully supported by translation tools like PoEdit, KBabel, Launchpad Rosetta, 99translations.com etc.
- Catalyst::Plugin::I18N, the only i18n plugin for the extremely popular Catalyst web framework, is based on Locale::Maketext::Lexicon
- Only xgettext.pl has support for Template-Toolkit templates, YAML, FormFu and Mason; the original xgettext does not
So there we have it: Perl hackers mostly use tools which are unsuitable and incompatible with the rest of the world, without even knowing it. The right tools actually can’t help them become “sane”, since xgettext can’t extract all those formats which xgettext.pl can, and I don’t think that will change any time soon.
Luckily, some hackers have produced the libintl-perl library, which basically re-implements GNU gettext in Perl. There is a pure-Perl implementation of message catalogs called Locale::gettext_pp, an XS version called Locale::gettext_xs (warning: this one has some problems with mod_perl2!), a Perl wrapper around that (Locale::Messages) and, building upon that, an excellent Perl-y implementation of the framework, Locale::TextDomain. These tools are worth your time.
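To give an idea of how this looks in practice, here is a minimal sketch using Locale::TextDomain; the text domain and the strings are made up, and without an installed .mo catalog the calls simply fall back to the English originals:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bind every translatable string in this file to a (made-up) text domain;
# compiled .mo catalogs are located via the usual gettext search path.
use Locale::TextDomain 'com.example.mailer';

my $name  = 'Ada';
my $count = 3;

# A plain translatable string.
print __("You have new mail.\n");

# Interpolation via named placeholders, so translators never see Perl syntax.
print __x("Hello, {name}!\n", name => $name);

# Plural forms: the catalog's Plural-Forms formula selects the right msgstr[n],
# so the plural rules of the target language stay with the translator.
print __nx("You have one piece of mail.\n",
           "You have {count} pieces of mail.\n",
           $count,
           count => $count);
```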
Even though we have Locale::TextDomain, what should be done to amend the whole Maketext situation? I’d propose several possible actions:
- Read the GNU gettext Manual to fully understand what these tools can do for you
- Educate your colleagues, tell them about this article and explain the differences
- If you can, port your current code to Locale::TextDomain
- Don’t use Maketext for any new code
- Update important code using Maketext, like the Catalyst plugin mentioned above, to support Locale::TextDomain as well
- Update TPJ13 to reflect the situation
- Port the extraction routines from xgettext.pl (Template-Toolkit, YAML, FormFu, Mason) to the standard xgettext toolchain
This and general awareness of the issue should bring Perl’s i18n back on track. Thank you for reading!