
Unicode 31: Lessons from the "Front Line"

I had a hard time deciding which sessions to attend at the last Unicode conference, but I did end up at "Unicode at the Front Lines", which was a series of mini-presentations from scholars working with lesser-known languages and scripts. This is a place where the Unicode rubber really hits the road, and I learned some interesting "life-lessons".

1. The problem with "reforming" a script is that new readers may not be able to read the older texts. This came up in the context of the Tai Viet script (apparently the reform was so unpopular, they ditched it), but it also occurs in Chinese (Traditional vs. Simplified), Korean (new texts use only Hangul, but older ones included Chinese characters) and even in cases where spelling reform is enacted (as in the Netherlands and Germany).

BTW - I'm not against spelling/script reform, but we do have to admit that there will be some "loss" (enough to keep a few scholars of archaic languages in business).

2. Try not to invent a new letter for new languages. In the earlier part of the 20th century, linguists were fond of inventing quirky new symbols for languages they were documenting. A classic case is Igbo, which has a lot of vowels with dots beneath them, as in Ị, ị, Ọ, ọ, Ụ, ụ. There is no objection to the dots per se, but they are unusual compared with what Western alphabets do. Because these characters are outside the norm, Igbo internationalization has to play continual catch-up, because even programs which can handle Western European languages may not know what to do with the dots.

If your lesser-known language already includes letters that are common to the major languages, implementation of utilities in your language is much easier. Of course, I think Unicode is better for including dotted letters.

For now though...if you have a choice between "v" or "vh" in your language, the latter is (unfortunately) a little more Unicode ready.
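To make the "catch-up" problem concrete, here is a minimal Python sketch. It assumes that "a program which can handle Western European languages" means one limited to the Latin-1 (ISO 8859-1) character set; that is my stand-in, not something stated at the session, and the helper name `fits_legacy_encoding` is made up for illustration.

```python
def fits_legacy_encoding(text: str, encoding: str = "latin-1") -> bool:
    """Return True if every character in text can be stored in the encoding.

    Latin-1 is an assumed stand-in for Western-European-only software.
    """
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# "vh" and e-acute are plain Western European fare; the last two samples
# are Igbo dotted vowels (U+1ECB and U+1ECC).
for sample in ["vh", "\u00e9", "\u1ecb", "\u1ecc"]:
    print(repr(sample), fits_legacy_encoding(sample))
```

The digraph and the French-style accent fit, while the dotted vowels do not, which is roughly why a "vh" spelling travels through older software more easily.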

3. H ≠ Η ≠ Н - For the record, the first is English (Latin) H /h/, the second is Greek capital Eta /ē/ and the last is Cyrillic En /n/. I knew that many capital letters are triple encoded (e.g. A/Alpha/Cyrillic A), but this is the first time I realized that the phonetic values can be so different. Normally this isn't an issue unless you have linguists from all over Europe trying to use their native script for phonetic spellings. How do you know when you have the right H?
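One way to find out which H you actually have is to ask for the character's code point and official Unicode name. The sketch below uses Python's standard `unicodedata` module; the `describe` helper is just an illustrative name.

```python
import unicodedata

# Three capital letters that usually look identical on screen.
LOOKALIKES = ["\u0048", "\u0397", "\u041d"]


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in LOOKALIKES:
    print(ch, describe(ch))
```

The names come back as LATIN CAPITAL LETTER H, GREEK CAPITAL LETTER ETA and CYRILLIC CAPITAL LETTER EN, so a search for one of them will not find the other two.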

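4. ŵ ≠ ŵ (it matters when you type the accent). Unicode supports "combining accents" (that is, an accent which can float over any letter) in addition to pre-composed letters, and in theory typing the combining circumflex (U+0302) before the w should give the same ŵ as typing it after the w...but it doesn't. Unicode expects a combining accent to follow its base letter, so only w + accent is equivalent to the pre-composed ŵ (U+0175), and even then only once the text has been normalized. A linguistic archive database has these pre-composed letters but can't "merge" the two string combinations as one letter.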

Again, this wouldn't be too critical except that sometimes a linguist puts the accent before the w, and sometimes they put the w before the accent. Again these are the same world-wide linguists who gave us the problem of the three H's.
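For the curious, here is a small Python sketch of the three ways that ŵ can end up in a database, using the standard `unicodedata.normalize` function. I don't know what software the archive actually runs, so treat this as an illustration of the general Unicode behavior and not of their system; the `same_letter` helper is a made-up name.

```python
import unicodedata

precomposed = "\u0175"       # w-circumflex stored as a single code point
base_then_mark = "w\u0302"   # w followed by COMBINING CIRCUMFLEX ACCENT
mark_then_base = "\u0302w"   # accent typed first: it has no base letter before it


def same_letter(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (composed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == base_then_mark)             # raw comparison of code points
print(same_letter(precomposed, base_then_mark))  # comparison after normalization
print(same_letter(precomposed, mark_then_base))  # accent-first order
```

The raw comparison fails, the normalized comparison of w + accent succeeds, and the accent-first string never matches, so normalization helps with one of the two typing habits but not the other.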

A member of the audience did suggest that it was a "training issue", but who are we kidding...these are FACULTY. Faculty are great scholars, but few are well-trained data entry operators.


This page contains a single entry from the blog posted on October 30, 2007 4:49 PM.
