October 2007 Archives
Unicode comes in a variety of flavors depending on how many bytes you are using and in which byte order they are coming in.
For most online uses, UTF-8 is the safest, but here's a short summary of other types of Unicode out there.
UTF-16
Recall that Unicode numbers are normally written in hexadecimal. The original specification called for 16-bit numbers (2^16 = 65,536 characters). The highest number would be #FFFF. Within Unicode, the four digits are organized into blocks (the first two digits), then codepoints in the block.
Thus capital L (hexadecimal #4C or #x4C) is in block 00 at codepoint 4C, or 004C.
What happens if a character has more than 4 hexadecimal digits (as in some ancient scripts)? You can still use UTF-16, but the larger Unicode code point will be broken into two units of 4 digits each (a "surrogate pair") via a transformation algorithm. In short, UTF-16 is organized into units of 4 hexadecimal digits (2 bytes), but there's a catch:
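As a quick check of the arithmetic, here is a minimal Python sketch. The variable names are only for illustration, and "block" and "offset" follow this post's informal split of the four hex digits, not the official Unicode block definitions.

```python
ch = "L"
code_point = ord(ch)               # integer value of the character
four_digits = f"{code_point:04X}"  # zero-padded to four hex digits

# "Block" and "codepoint" in this post's sense: first and last two hex digits.
block, offset = four_digits[:2], four_digits[2:]

print(four_digits, block, offset)  # 004C 00 4C
print(2 ** 16)                     # 65536 possible 16-bit values
print(f"{2 ** 16 - 1:04X}")        # FFFF, the highest 16-bit number
```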
UTF-16: Little Endian vs. Big Endian
Some systems, notably Intel-based systems, organize each Unicode number into codepoint (little end) then block (big end). Others, notably some Unix systems, organize Unicode into block then codepoint.
Returning to the capital L (#4C), there are two UTF-16 ways to represent this:
- Big Endian (UTF-16BE) : 00.4C = L
- Little Endian (UTF-16LE) : 4C.00 = L
Software packages, particularly databases and text editors for programmers, can switch between the two, but it can be a hassle. UTF-8 has no byte-order variants, so it is more consistent between systems and a little more resilient.
Note: In theory, UTF-16 files begin with a special BOM (Byte Order Mark) which specifies Little Endian or Big Endian.
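The two byte orders and the BOM can be seen directly in Python. This is only a sketch using the standard codecs; the variable names are made up for the example.

```python
import codecs

text = "L"  # U+004C

big = text.encode("utf-16-be")     # block byte first
little = text.encode("utf-16-le")  # codepoint byte first
print(big.hex())     # 004c
print(little.hex())  # 4c00

# The generic "utf-16" codec writes a BOM first, in the machine's native order.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))  # True

# When decoding, the same codec reads the BOM to pick the byte order.
decoded = [
    (codecs.BOM_UTF16_BE + big).decode("utf-16"),
    (codecs.BOM_UTF16_LE + little).decode("utf-16"),
]
print(decoded)  # ['L', 'L']
```

Without a BOM, the reader has to be told (or has to guess) which order was used, which is where the hassle comes from.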
UTF-32 (UCS-4)
At some point, the Unicode Consortium realized that even 65,536 characters would not be enough, so provisions were made to add another two places in the hexadecimal system. The next two places were called "planes" (vs. "blocks" and "codepoints"). The original 2^16 characters are now Plane 0, with additional planes being added for other scripts as needed. At this point, there are some Plane 1 scripts, but they are mostly ancient scripts. The 32-bit UCS-4 scheme has room for 2^31 numbers, although Unicode itself only uses the range up to #10FFFF (17 planes).
In any case, to represent all the planes, blocks and codepoints, you need to add extra digits in the Unicode file. Thus capital L (#4C) is now 00004C with its plane, and UTF-32 stores it in four bytes as 0000004C. As you can see, unless you are dealing with ancient scripts or archaic Chinese texts, you are adding extra "plane" information you may not need and adding more memory to your files.
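To make the planes concrete, the sketch below compares capital L with a Plane 1 character (a Gothic letter at U+10330, chosen here only as an example of an ancient script). The helper name `describe` is hypothetical. Note that UTF-32 always spends four bytes (eight hex digits) per character, while UTF-16 needs a two-unit pair for anything beyond Plane 0.

```python
def describe(ch):
    """Summarize one character's code point, plane and encoded forms."""
    cp = ord(ch)
    return {
        "code_point": f"U+{cp:04X}",
        "plane": cp >> 16,  # everything above the low 16 bits
        "utf16_be": ch.encode("utf-16-be").hex(),
        "utf32_be": ch.encode("utf-32-be").hex(),
    }

latin_l = describe("L")
gothic = describe("\U00010330")  # a Plane 1 (Gothic) letter

print(latin_l)  # plane 0, utf16_be '004c', utf32_be '0000004c'
print(gothic)   # plane 1, utf16_be 'd800df30', utf32_be '00010330'
```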
UTF-8 (Unicode Transformation Format)
The difference between UTF-8 and UTF-16/UTF-32 is that it uses an algorithm to translate any Unicode character into a series of one to four "octets" (8-bit bytes). Character (004C) "L" can be stripped to a simple 4C, just like in ASCII. If you use primarily English or Western languages, file sizes may be smaller in UTF-8 than UTF-16, and ASCII or Latin 1 code will usually be easier to integrate into Unicode.
The other advantage of UTF-8 is that the algorithm allows the data to be less corruptible over the Internet. Thus UTF-8 is recommended for e-mail, Web pages and other online files. Some databases and programming languages may use UTF-16 instead.
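Here is a small sketch of how the octet count grows with the code point. The sample characters are my own picks, not from any particular text.

```python
samples = ["L", "\u00e9", "\u20ac", "\U00010330"]  # L, é, €, a Plane 1 letter

sizes = {}
for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    sizes[ch] = (len(utf8), len(utf16))
    print(f"U+{ord(ch):04X}  utf-8: {utf8.hex()}  utf-16: {utf16.hex()}")

# Plain ASCII text is byte-for-byte identical in UTF-8.
print("L".encode("ascii") == "L".encode("utf-8"))  # True
```

For this sample the UTF-8 lengths are 1, 2, 3 and 4 bytes, against 2, 2, 2 and 4 bytes in UTF-16, which is why mostly-English text tends to be smaller in UTF-8.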
Additional Links
I had a hard time deciding which sessions to attend at the last Unicode conference, but I did end up at "Unicode at the Front Lines", which was a series of mini-presentations from scholars working with lesser-known languages and scripts. This is a place where the Unicode rubber really hits the road, and I learned some interesting "life-lessons".
1. The problem with "reforming" a script is that new readers may not be able to read the older texts. This came up in the context of the Tai Viet script (apparently the reform was so unpopular, they ditched it), but it also occurs in Chinese (Traditional vs. Simplified), Korean (new texts use only Hangul, but older ones included Chinese) and even in cases where spelling reform is enacted (as in the Netherlands and Germany).
BTW - I'm not against spelling/script reform, but we do have to admit that there will be some "loss" (enough to keep a few scholars in archaic languages in business).
2. Try not to invent a new letter for new languages. In the earlier part of the 20th century, linguists were fond of inventing quirky new symbols for languages they were documenting. A classic case is Igbo, which has a lot of vowels with dots beneath them, as in Ị, ị, Ọ, ọ, Ụ, ụ. There is no objection to the dots per se, but they are unusual compared to what Western alphabets do. Because these characters are outside the norm, Igbo internationalization has to play continual catch-up, because even programs which can handle Western European languages may not know what to do with the dots.
If your lesser-known language already includes letters that are common to the major languages, implementation of utilities in your language is much easier. Of course, I think Unicode is better for including dotted letters.
For now though...if you have a choice between "v" or "vh" in your language, the latter is (unfortunately) a little more Unicode ready.
3. H ≠ Η ≠ Н - For the record the first is English H /h/, the second is Greek capital Eta /ē/ and the last is Cyrillic En /n/. I knew that many capital letters are triple encoded (e.g. A/alpha/Cyrillic Ah), but this is the first time I realized that the phonetic values can be so different. Normally this isn't an issue unless you have linguists from all over Europe trying to use their native script for phonetic spellings. When do you have the right H?
4. ŵ ≠ ŵ (it matters when you type the accent). Unicode supports "combining accents" (that is, an accent which can float over any letter), and in theory the combination of ̂ + w (to make ŵ) should be the same as w + ̂ (to make ŵ) ...but it's not. A linguistic archive database has these accented letters but can't "merge" the two string combinations as one letter.
Again, this wouldn't be too critical except that sometimes a linguist puts the accent before the w, and sometimes they put the w before the accent. Again these are the same world-wide linguists who gave us the problem of the three H's.
A member in the audience did suggest that it was a "training issue", but who are we kidding...these are FACULTY. Faculty are great scholars, but few are well-trained data entry operators.
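Lessons 3 and 4 can both be checked with Python's standard `unicodedata` module. This is only a sketch of the general idea, not what the archive database in question actually does; the variable names are made up. One detail worth knowing: Unicode expects a combining accent to come after its base letter, so normalization can merge w + accent with the single-character ŵ, but it cannot rescue an accent typed before the w.

```python
import unicodedata

# Lesson 3: three look-alike capitals, three different characters.
names = [unicodedata.name(ch) for ch in ("H", "\u0397", "\u041d")]
print(names)
# ['LATIN CAPITAL LETTER H', 'GREEK CAPITAL LETTER ETA',
#  'CYRILLIC CAPITAL LETTER EN']

# Lesson 4: the same visible letter stored in different ways.
single = "\u0175"        # ŵ as one precomposed character
base_first = "w\u0302"   # w followed by a combining circumflex
accent_first = "\u0302w" # accent typed before the w

print(single == base_first)  # False: the raw strings differ

# NFC normalization merges base + accent into the single character...
print(unicodedata.normalize("NFC", base_first) == single)    # True
# ...but the accent-first sequence stays a different string.
print(unicodedata.normalize("NFC", accent_first) == single)  # False
```

So normalizing on data entry handles one of the two cases; the accent-first strings would still need a separate cleanup pass.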
It's kind of buried, but the OS X release features list notes that the Microsoft fonts Arial Unicode, Tahoma, Microsoft Sans and others will now be shipping on the Mac.
http://www.apple.com/macosx/features/300.html
This is a good sign that Apple is moving towards full interoperability with Windows OTF fonts (at the recent IUC 31 conference, they said OTF support for Arabic was complete).
Also, it's good to have Arial Unicode as a test font since that's what's on the Windows platform. I have installed older versions of Arial Unicode on the Mac for testing purposes before, but it never did work quite right.
Mac is also promising other international enhancements, including Braille support, true Persian support, Tibetan and Kazakh, expanded character palettes, and expanded Chinese, Korean and Japanese support.
My university was kind enough to send me to IUC 31 (http://www.unicodeconference.org) this year, and I can honestly say that it was one of the best conferences I've been to.
For one thing, almost all the major players (Microsoft, Apple, Sun, IBM, Adobe, Google, Yahoo, W3C) sent representatives, so I got to hear a lot of great information straight from the source. I've been hacking away at this for seven years, but I learned quite a bit of new information, especially about some of the more technical aspects.
The Unicode conference is also very good at providing a wide range of how-tos, from absolute monolingual beginner to cutting-edge tools for the experienced Unicoder. Even the basics gave me some pointers that I had forgotten or hadn't considered. I obviously couldn't make all the sessions (no cloning yet), but the PDFs that attendees can access are fairly detailed and can help you track down what you missed.
I have to confess that my favorite track was probably "Unicode on the Front Lines", in which linguists described encoding issues for minority languages and scripts. From a language geek perspective, it's fascinating what new issues come up. More importantly, I saw that there was a lot of support for outreach in the Unicode community. I heard the members of the Unicode Org point some users to resources they hadn't known about before.
I myself gave a presentation about Unicode at Penn State, and I have to say most of the feedback was very positive, and I got a few tips myself.
So all in all, I have to say thanks to the organizers of the conference for putting on a great event.
I'm actually presenting at IUC 31 (http://www.unicodeconference.org) in San José about supporting international technology at Penn State. You can download the PowerPoint if you want to read more.
Richard Ishida has a Web based Unicode look up tool at
http://people.w3.org/rishida/scripts/uniview/uniview.php
This is a search form which allows you to view data by name, hex value, actual pasted character or range.
There's another conversion utility at
http://people.w3.org/rishida/scripts/uniview/conversion.php
which allows you to convert characters from hex values to different variants such as decimal values, percent escapes (Web address) and UTF-8 vs. UTF-16.
The character paste feature is especially valuable for random symbols such as ∞ (infinity) or ɛ (Open e, epsilon vowel). You can finally extract a code point from a weird symbol used in your Word doc.
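If you just want the same kind of answer from a script, a rough equivalent of the paste-and-convert step can be sketched in Python. This is not how the UniView tools work internally; the `lookup` helper is hypothetical and uses only the standard library.

```python
import unicodedata
from urllib.parse import quote

def lookup(ch):
    """Return several common representations of one pasted character."""
    cp = ord(ch)
    return {
        "hex": f"U+{cp:04X}",
        "decimal": cp,
        "name": unicodedata.name(ch, "<unnamed>"),
        "percent_escape": quote(ch),  # UTF-8 bytes as used in Web addresses
        "utf8": ch.encode("utf-8").hex(),
        "utf16": ch.encode("utf-16-be").hex(),
    }

infinity = lookup("\u221e")  # ∞
open_e = lookup("\u025b")    # ɛ

print(infinity)
print(open_e)
```

For ∞ this gives U+221E, decimal 8734 and the percent escape %E2%88%9E; for ɛ it gives U+025B, decimal 603 and %C9%9B.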