Wednesday, May 07, 2008

XML is a wonderful thing, and you can set up some intelligent translation workflows if you put your mind to it. But one aspect of XML which constantly presents minor challenges is encoding - the joys of UTF-8 versus UTF-16. And just what is Unicode? Here is a wonderful, short, pithy synopsis of Unicode and UTF coding issues. Well worth a read.

We had to do some reading up and research on UTF encoding in XML files when we translated some DITA files recently. The files were in UTF-8, which is fine. But some text editors didn't display accented characters properly. (If you're really interested, TextPad could not spot that the file was in UTF-8, and opened it incorrectly when left to "open automatically"; if we manually tell TextPad "Hey, open this file as UTF-8" then it behaves just fine. UltraEdit is smarter, and correctly opens the translated DITA files as UTF-8 all on its own, so the accents display correctly.)
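To see what "opened incorrectly" looks like, here is a minimal Python sketch of the same effect. It assumes the editor fell back to the Windows-1252 code page, which is only our guess at what "open automatically" did, and the sample word is ours rather than anything from the DITA files.

```python
# A sample word with one accented character (our own example).
text = "café"

# What is stored on disk when the file is saved as UTF-8.
raw = text.encode("utf-8")

# An editor that assumes a single-byte code page (cp1252 here, as an assumption)
# turns each byte of the two-byte "é" into a separate character.
misread = raw.decode("cp1252")

# An editor that is told, or works out, that the file is UTF-8 gets it right.
correct = raw.decode("utf-8")

print(misread)
print(correct)
```

The misread version shows two junk characters where the single accented letter should be, which is the classic symptom of UTF-8 opened as something else.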

This called for deeper reading. The real issue is that UTF-8 uses 2 to 4 bytes to represent characters above ASCII 127 (that means accented characters to you and me!). But nothing at the start of a plain UTF-8 file announces "I am UTF-8", so some text editors guess the wrong encoding, try to interpret each byte of a multi-byte character as a character in its own right, and get confused. This can be resolved to some extent by including a byte order mark (or BOM) at the start of the file, although this is a political hot potato in the geek world. Some people argue in favour, some claim it is the work of the devil himself.
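For the curious, the byte counts and the BOM itself are easy to inspect in Python. This is just a sketch: the sample characters are our own choice, and `has_utf8_bom` is a hypothetical helper name, not part of any tool mentioned here.

```python
import codecs

# Sample characters (our own choice): plain ASCII, an accented letter,
# the euro sign, and an emoji outside the Basic Multilingual Plane.
samples = ["A", "é", "€", "😀"]

# Number of bytes each character takes once encoded as UTF-8.
byte_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the three-byte UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


without_bom = "é".encode("utf-8")
with_bom = "é".encode("utf-8-sig")  # the "utf-8-sig" codec writes the BOM first

print(byte_lengths)
print(has_utf8_bom(without_bom), has_utf8_bom(with_bom))
```

An editor that finds those three bytes at the start of a file no longer has to guess, which is the argument in favour of the BOM.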

Like most translation companies, we use translation memory tools to ensure consistency when we translate user manuals and help systems, and to enable us to translate a range of file formats. One of the main TM tools (or CAT tools - Computer Aided Translation) is called Trados, and their TagEditor tool started off not handling byte order marks in XML files, then changed to give a (well concealed) option to force a BOM into the translated file (target language file). For anyone translating XML files, or translating DITA files, this is significant. Quite when this significant change was introduced is a bit of a mystery, but our translation memory detectives reckon it was somewhere between Trados version and version
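If you ever need to add or remove a BOM on a translated file outside of the translation tool, a few lines of Python will do it. This is a rough sketch of the general idea, not a description of how TagEditor does it; `ensure_utf8_bom` and `strip_utf8_bom` are hypothetical helper names, and the XML snippet is made up for illustration.

```python
import codecs


def ensure_utf8_bom(data: bytes) -> bytes:
    """Prepend the UTF-8 BOM unless the data already starts with one."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A made-up target-language file, held in memory as UTF-8 bytes.
xml = '<?xml version="1.0" encoding="UTF-8"?><p>café</p>'.encode("utf-8")

marked = ensure_utf8_bom(xml)      # BOM added
unmarked = strip_utf8_bom(marked)  # back to the original bytes

print(marked[:3].hex())
print(unmarked == xml)
```

Working on bytes rather than decoded text keeps the rest of the file untouched, and calling `ensure_utf8_bom` twice does not stack up a second BOM.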

To the "TM detectives" who worked with me on this, a very big thank you. This stuff would try the patience of a saint! Now, back to our translations...

And as a postscript, I also enjoyed Christian Flury's Unicode Primer for Linguists on this subject. I particularly like his description of Endianism (or Endianness, if you prefer), comparing it with the difference between counting in English and counting in German: In English, we start with the biggest number first, so "99" is "ninety nine". German starts at the other end, so "99" is "nine and ninety" (or neunundneunzig, in German). So English takes the big Endian approach, and German takes the little Endian approach!
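The same idea can be seen directly in bytes. Here is a small Python sketch (the number and the sample letter are our own picks) showing one value and one accented character written out in both byte orders, plus the two UTF-16 byte order marks that tell a reader which order was used.

```python
import codecs

# One 16-bit number, written with the big end first and the little end first.
value = 0x1234
big = value.to_bytes(2, "big")
little = value.to_bytes(2, "little")

# The same accented letter in the two flavours of UTF-16.
utf16_be = "é".encode("utf-16-be")
utf16_le = "é".encode("utf-16-le")

print(big.hex(), little.hex())        # 1234 3412
print(utf16_be.hex(), utf16_le.hex())  # 00e9 e900

# The BOM is the same character (U+FEFF) in each byte order,
# so its first two bytes reveal which order the file uses.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())  # feff fffe
```

This is where the "byte order" in byte order mark comes from: in UTF-16 it really does say which end comes first, whereas in UTF-8 it only acts as a signature.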

