Which "UTF" Do I use?

Unicode comes in a variety of flavors, depending on how many bytes are used for each character and in which byte order those bytes are stored.

For most online uses, UTF-8 is the safest choice, but here's a short summary of the other types of Unicode encoding out there.

UTF-16

Recall that Unicode numbers are normally written as hexadecimal numbers. The original specification called for 16-bit values, which allows 2^16 = 65,536 characters. The highest number would be #FFFF. Within this scheme, the four digits can be read as a block (the first two digits) followed by a codepoint within the block (the last two digits).

The capital L (hexadecimal #4C or #x4C) is in block 00 at codepoint 4C, or 004C.
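
As a quick check of the block/codepoint reading, here is a minimal Python sketch. The helper name split_bmp is made up for illustration, and it only handles values that fit in four hexadecimal digits.

```python
def split_bmp(ch):
    """Return (block, codepoint) as two-digit hex strings.

    Hypothetical helper: follows the informal "block + codepoint"
    reading used in this post, so it only accepts values up to #FFFF.
    """
    number = ord(ch)
    if number > 0xFFFF:
        raise ValueError("needs more than 4 hexadecimal digits")
    four_digits = f"{number:04X}"
    return four_digits[:2], four_digits[2:]


block, codepoint = split_bmp("L")
print(block, codepoint)  # 00 4C
```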

What happens if a character needs more than 4 hexadecimal digits (as in some ancient scripts)? You can still use UTF-16, but a transformation algorithm splits the larger Unicode code point into two 16-bit units of 4 hexadecimal digits each, called a surrogate pair. In short, UTF-16 is organized into units of 4 hexadecimal digits, with one or two units per character, but there's a catch:
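
The transformation can be sketched in a few lines of Python. The function name to_utf16_units is an assumption for illustration, and the sketch does not validate its input (for example, it does not reject the reserved range #D800 to #DFFF). The sample value #12000 is a cuneiform sign from Plane 1.

```python
def to_utf16_units(number):
    """Return the list of 16-bit units UTF-16 uses for one code point."""
    if number <= 0xFFFF:
        # Fits in four hex digits: stored as a single unit.
        return [number]
    # Larger values: subtract 0x10000, then split the remaining
    # 20 bits into a high half and a low half of 10 bits each.
    offset = number - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


units = to_utf16_units(0x12000)
print(" ".join(f"{u:04X}" for u in units))  # D808 DC00

# Cross-check against Python's built-in codec.
builtin = "\U00012000".encode("utf-16-be").hex().upper()
print(builtin)  # D808DC00
```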

UTF-16: Little Endian vs. Big Endian

Some systems, notably Intel-based systems, store each 16-bit unit as codepoint (little end) then block (big end). Others, notably many Unix systems running on big-endian processors, store it as block then codepoint.

Returning to the capital L (#4C), there are two UTF-16 ways to represent this:

  • Big Endian (UTF-16BE) : 00.4C = L
  • Little Endian (UTF-16LE) : 4C.00 = L
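
The two byte orders are easy to see with Python's built-in codecs. This is only a small illustration; the variable names are arbitrary.

```python
big = "L".encode("utf-16-be")
little = "L".encode("utf-16-le")
print(big.hex(), little.hex())  # 004c 4c00

# Reading Big Endian bytes as if they were Little Endian produces
# a completely different character (#4C00 instead of #004C).
wrong = big.decode("utf-16-le")
print(f"{ord(wrong):04X}")  # 4C00
```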

Software packages, particularly databases and text editors for programmers, can switch between the two, but it can be a hassle. UTF-8 has no byte-order variants, so it is more consistent between systems and a little more resilient.

Note: In theory, UTF-16 files begin with a special BOM (Byte Order Mark, code point #FEFF) that indicates Little Endian or Big Endian. The BOM appears as the bytes FE FF in a Big Endian file and FF FE in a Little Endian file.
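
A reader of a UTF-16 file can check for the BOM before decoding. The sketch below assumes the data is already known to be UTF-16 (a UTF-32 Little Endian file starts with the same two bytes), and the function name detect_utf16_order is made up for illustration.

```python
import codecs


def detect_utf16_order(data):
    """Return a codec name based on the BOM, or None if there is no BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return None


sample = codecs.BOM_UTF16_LE + "L".encode("utf-16-le")
order = detect_utf16_order(sample)
text = sample[2:].decode(order)  # skip the 2-byte BOM, then decode
print(order, text)  # utf-16-le L
```

When no BOM is present the function returns None, and the caller has to fall back on some other knowledge of where the file came from.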

UTF-32 (UCS-4)

At some point, the Unicode Consortium realized that even 65,000+ characters would not be enough, so provisions were made to add more places to the hexadecimal number. The extra places identify "planes" (vs. "blocks" and "codepoints"). The original 2^16 characters are now Plane 0, with additional planes being added for other scripts as needed. At this point, there are some Plane 1 scripts, but they are mostly ancient scripts. The 32-bit UCS-4 form originally allowed for 2^31 characters, although Unicode itself now stops at #10FFFF (17 planes).

In any case, to represent all the planes, blocks and codepoints, you need to add extra digits in the Unicode file. Thus capital L (#4C) is now 0000004C in UTF-32, which always uses 8 hexadecimal digits (4 bytes) per character. As you can see, unless you are dealing with ancient scripts or archaic Chinese texts, you are adding extra "plane" information you may not need and making your files larger.
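
The fixed width is simple to confirm in Python. This is a small illustration using the Big Endian form of UTF-32; the sample sentence is arbitrary.

```python
latin = "L".encode("utf-32-be")
ancient = "\U00012000".encode("utf-32-be")  # a Plane 1 cuneiform sign
print(latin.hex())    # 0000004c
print(ancient.hex())  # 00012000

# Every character costs 4 bytes, whether or not it needs them.
sentence = "Which UTF do I use?"
utf32_size = len(sentence.encode("utf-32-be"))
print(len(sentence), "characters,", utf32_size, "bytes")
```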

UTF-8 (Unicode Transformation Format)

The difference between UTF-8 and UTF-16/UTF-32 is that UTF-8 uses an algorithm to translate any Unicode character into a series of one to four "octets" (8-bit bytes). Character (004C) "L" can be stripped to a simple 4C, just like in ASCII. If you use primarily English or Western languages, file sizes will usually be smaller in UTF-8 than in UTF-16. ASCII text is already valid UTF-8, so existing ASCII material (and Latin 1 material, once its accented characters are converted) is usually easier to integrate into Unicode.
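
The variable length of UTF-8 shows up when a few characters from different ranges are encoded. The sample characters below are chosen only for illustration (L, e with acute accent, the euro sign, and a Plane 1 cuneiform sign).

```python
samples = ["L", "\u00e9", "\u20ac", "\U00012000"]
octet_counts = []
for ch in samples:
    encoded = ch.encode("utf-8")
    octet_counts.append(len(encoded))
    print(f"#{ord(ch):04X} -> {encoded.hex()} ({len(encoded)} octets)")

# For plain English text, UTF-8 is half the size of UTF-16.
english = "Which UTF do I use?"
utf8_size = len(english.encode("utf-8"))
utf16_size = len(english.encode("utf-16-be"))
print(utf8_size, utf16_size)
```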

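The other advantage of UTF-8 is that the algorithm makes the data less corruptible over the Internet: there is no byte order to get wrong, and each octet shows whether it starts a character or continues one, so a lost or damaged octet usually affects only one character. Thus UTF-8 is recommended for e-mail, Web pages and other online files. Some databases and programming languages may use UTF-16 instead.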
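
One plausible way to see the "less corruptible" point is to simulate a transfer that loses a single byte. This is only a sketch of one failure mode, not a general proof, and the sample word is arbitrary.

```python
text = "Unicode"

# Simulate losing the very first byte in transit.
utf8_damaged = text.encode("utf-8")[1:]
utf16_damaged = text.encode("utf-16-be")[1:]

# errors="replace" substitutes a marker for anything undecodable
# instead of raising an exception.
utf8_result = utf8_damaged.decode("utf-8", errors="replace")
utf16_result = utf16_damaged.decode("utf-16-be", errors="replace")

print(utf8_result)          # nicode
print(ascii(utf16_result))  # every unit is now misaligned
```

In the UTF-8 case only the first letter is lost. In the UTF-16 case every following 16-bit unit is shifted by one byte, so none of the original letters survive.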

About

This page contains a single entry from the blog posted on October 30, 2007 5:27 PM.
