ELIZABETH J PYATT: January 2011 Archives

Which "UTF" Do I Use? (Updated)

I have rewritten this entry to make some aspects of the theory a little more clear and to fix some errors.


Why flavors?

Although Unicode generally assigns one code point to one character, text data is not generally stored or transmitted in that manner. One difference between an earlier encoding scheme like Latin-1 and Unicode is the number of bytes potentially required. In hexadecimal notation, Latin-1 characters range from 00 to FF (i.e. two hexadecimal digits). In computer memory terms, two hexadecimal digits make up 1 byte (where 1 byte = 8 bits), so each character in Latin-1 requires one byte of memory.

In contrast, modern Unicode code points range from 0 to 10FFFF (i.e. up to six hexadecimal digits), which means that each character could require 3 bytes of memory (or actually 4 bytes, since memory is usually allocated in powers of two). Text file sizes could potentially quadruple... unless a more compact encoding scheme is applied. This is the origin of the different "flavors" of Unicode.

Unicode comes in a variety of flavors depending on how many bytes you are using for each character and the order in which those bytes are stored. For most online uses, UTF-8 is the safest, but here's a short summary of the other types of Unicode out there.
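To make the size difference concrete, here is a small Python sketch that counts the bytes needed for the same text in three flavors. The sample string and the `encoded_sizes` helper are my own illustrative choices, not part of any standard.

```python
# Count the bytes the same text needs in each Unicode "flavor".
# The sample text is an arbitrary choice for illustration:
# capital L, e-acute (U+00E9) and a CJK character (U+4E2D).
sample = "L\u00e9\u4e2d"

def encoded_sizes(text):
    """Return the byte length of text in each encoding (no BOM added)."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(text.encode(name)) for name in names}

sizes = encoded_sizes(sample)
print(sizes)
```

For this three-character sample, UTF-8 and UTF-16 both come to 6 bytes, while UTF-32 needs 12. Plain English text would come out smallest in UTF-8.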

UTF-16

The very earliest versions of Unicode specified a 16-bit/2-byte encoding (where 2^16 = 65,536 characters). The highest number would be #FFFF. Within Unicode, the four digits are organized into blocks (the first two digits), then codepoints in the block.

Thus capital L (hexadecimal #4C or #x4C) is in block 00 at codepoint 4C, or 004C.

This seems simple enough, but systems differed in whether the block or the code point was stored first:

UTF-16: Little Endian vs. Big Endian

Some systems, notably Intel-based systems, organize each Unicode number into codepoint (little end) then block (big end). Others, notably some Unix systems, organize Unicode into block then code point.

Returning to the capital L (#4C), there are two UTF-16 ways to represent this:

  • Big Endian (UTF-16BE) : 00.4C = L
  • Little Endian (UTF-16LE) : 4C.00 = L
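The two byte orders listed above can be checked with Python's built-in codecs. This is only a quick sketch; the variable names are mine.

```python
# Encode capital L (U+004C) in both UTF-16 byte orders.
big = "L".encode("utf-16-be")     # block byte first, then code point byte
little = "L".encode("utf-16-le")  # code point byte first, then block byte

print(big.hex())     # 004c
print(little.hex())  # 4c00

# Both byte sequences decode back to the same character,
# as long as the decoder is told which order was used.
same = big.decode("utf-16-be") == little.decode("utf-16-le") == "L"
print(same)  # True
```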

Software packages, particularly databases and text editors for programmers, can switch between the two, but it can be a hassle. UTF-8 is more consistent between systems, so it is a little more resilient.

Note: In theory, UTF-16 files begin with a special BOM (Byte Order Mark) which specifies Little Endian or Big Endian.
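As a rough illustration of how a BOM can be used, the Python sketch below checks the first two bytes of some data against the BOM constants in the standard `codecs` module. The function name `guess_utf16_byte_order` is hypothetical, and the check assumes the data is already known to be UTF-16 (a UTF-32LE file begins with the same two bytes as a UTF-16LE one).

```python
import codecs

def guess_utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from its BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return "unknown"  # no BOM: the byte order must be known some other way

with_be_bom = codecs.BOM_UTF16_BE + "L".encode("utf-16-be")
with_le_bom = codecs.BOM_UTF16_LE + "L".encode("utf-16-le")
no_bom = "L".encode("utf-16-be")

print(guess_utf16_byte_order(with_be_bom))  # big
print(guess_utf16_byte_order(with_le_bom))  # little
print(guess_utf16_byte_order(no_bom))       # unknown
```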

UTF-32 (UCS-4)

At some point, the Unicode Consortium realized that even 65,000+ characters would not be enough, so provisions were made to allow for another two digits in the encoding scheme. The next two places were called "planes" (vs. "blocks" and "codepoints"). The original 2^16 characters are now Plane 0, with additional planes being added for other scripts as needed. At this point, some characters are encoded beyond Plane 0, but they are mostly ancient scripts (Plane 1) and rarely used Chinese characters (Plane 2).

In any case, to represent all the planes, blocks and codepoints, you need to add extra digits in the Unicode file. Thus capital L (#4C) is now (00)00004C in UTF-32. As you can see, unless you are dealing with ancient scripts or archaic Chinese texts, you are adding extra "plane" information you may not need and making your files larger. For this reason UTF-32 is almost never used in practice. However, if it were, you could specify an LE vs. BE version of UTF-32.
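A quick Python check of the capital L example in UTF-32, in both byte orders (a sketch using the built-in codecs; the variable names are mine):

```python
# Encode capital L (U+004C) as UTF-32 in both byte orders.
utf32_be = "L".encode("utf-32-be")
utf32_le = "L".encode("utf-32-le")

print(utf32_be.hex())  # 0000004c
print(utf32_le.hex())  # 4c000000

# Every character takes four bytes, even plain ASCII.
print(len("Hello".encode("utf-32-be")))  # 20
```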

UTF-8 (Unicode Transformation Format)

The difference between UTF-8 and UTF-16/UTF-32 is that it uses an algorithm to translate any Unicode character into a series of one to four "octets" (8-bit bytes). Character (004C) "L" can be stripped to a simple 4C, just like in ASCII. If you use primarily English or Western languages, file sizes may be smaller in UTF-8 than UTF-16, and ASCII or Latin-1 code will usually be easier to integrate into Unicode.
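For readers curious about what the algorithm does, here is a minimal Python sketch of the UTF-8 bit-packing rules for a single code point. The function name `utf8_encode` is hypothetical, and the sketch skips some checks a real encoder makes (for example, it does not reject the surrogate range D800-DFFF). In practice you would simply call `str.encode("utf-8")`.

```python
def utf8_encode(code_point):
    """Encode one code point as UTF-8 octets (illustrative sketch only)."""
    if code_point < 0x80:
        # 1 octet: identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 2 octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

print(utf8_encode(0x4C).hex())    # 4c  (capital L, same as ASCII)
print(utf8_encode(0xE9).hex())    # c3a9  (e-acute)
print(utf8_encode(0x4E2D).hex())  # e4b8ad  (a CJK character)
```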

The other advantage of UTF-8 is that the algorithm allows the data to be less corruptible over the Internet. Thus UTF-8 is recommended for e-mail, Web pages and other online files. Some databases and programming languages may use UTF-16 instead.

Similar transforms are also applied in UTF-16 to convert codepoints U+10000 and higher to pairs of 16-bit (four-hex-digit) units, known as surrogate pairs. However, not every software application supports this, so some systems may have problems processing code points beyond U+FFFF.
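The UTF-16 transform for code points above U+FFFF can be sketched in a few lines of Python. The helper name `to_surrogate_pair` is hypothetical, and U+1D11E (the musical G clef symbol) is just a convenient Plane 1 example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into two 16-bit UTF-16 units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000 to U+10FFFF need a pair")
    offset = code_point - 0x10000     # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)    # top 10 bits go in the first unit
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go in the second unit
    return high, low

high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")  # D834 DD1E

# Python's own UTF-16 encoder produces the same two units.
print(chr(0x1D11E).encode("utf-16-be").hex())  # d834dd1e
```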

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.
