October 2007 Archives

Which "UTF" Do I use? (Deprecated)


Unicode 31: Lessons from the "Front Line"


I had a hard time deciding which sessions to attend at the last Unicode conference, but I did end up at "Unicode at the Front Lines", which was a series of mini-presentations from scholars working with lesser-known languages and scripts. This is a place where the Unicode rubber really hits the road, and I learned some interesting "life-lessons".

1. The problem with "reforming" a script is that new readers may not be able to read the older texts. This came up in the context of the Tai Viet script (apparently the reform was so unpopular, they ditched it), but it also occurs in Chinese (Traditional vs. Simplified), Korean (new texts use only Hangul, but older ones included Chinese characters) and even in cases where spelling reform is enacted (as in the Netherlands and Germany).

BTW - I'm not against spelling/script reform, but we do have to admit that there will be some "loss" (enough to keep a few scholars in archaic languages in business).

2. Try not to invent a new letter for new languages. In the earlier part of the 20th century, linguists were fond of inventing quirky new symbols for languages they were documenting. A classic case is Igbo, which has a lot of vowels with dots beneath them, as in Ị,ị,Ọ,ọ,Ụ,ụ. There is no objection to the dots per se, but they are unusual compared to what Western alphabets do. Because these characters are outside the norm, Igbo internationalization has to play continual catch-up because even programs which can handle Western European languages may not know what to do with the dots.

If your lesser-known language already includes letters that are common to the major languages, implementation of utilities in your language is much easier. Of course, I think Unicode is better for including dotted letters.

For now though...if you have a choice between "v" or "vh" in your language, the latter is (unfortunately) a little more Unicode ready.

3. H ≠ Η ≠ Н - For the record, the first is English H /h/, the second is Greek capital Eta /ē/ and the last is Cyrillic En /n/. I knew that many capital letters are triple encoded (e.g. A/alpha/Cyrillic Ah), but this is the first time I realized that the phonetic values can be so different. Normally this isn't an issue unless you have linguists from all over Europe trying to use their native script for phonetic spellings. How do you know when you have the right H?

4. ŵ ≠ ŵ (it matters when you type the accent). Unicode supports "combining accents" (that is, an accent which can float over any letter), and you might expect the combination of circumflex + w (to make ŵ) to be the same as w + circumflex (to make ŵ) ...but it's not. Unicode expects a combining accent to come after its base letter, so an accent typed before the w attaches to whatever character precedes it. A linguistic archive database has these accented letters but can't "merge" the two string combinations as one letter.

Again, this wouldn't be too critical except that sometimes a linguist puts the accent before the w, and sometimes they put the w before the accent. These are the same world-wide linguists who gave us the problem of the three H's.

A member of the audience did suggest that it was a "training issue", but who are we kidding...these are FACULTY. Faculty are great scholars, but few are well-trained data entry operators.
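For the curious, lessons 3 and 4 can be checked with a few lines of Python. This is only a minimal sketch using the standard unicodedata module, not anything shown at the conference; the helper names describe and same_letter are made up for illustration. It lists the names of the three look-alike H's, then compares the ways of entering ŵ after canonical normalization (NFC).

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


def same_letter(a, b):
    """Compare two strings after canonical composition (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Lesson 3: three capitals that look alike but are different characters.
for code_point, name in describe("\u0048\u0397\u041D"):
    print(code_point, name)

# Lesson 4: three ways a w with circumflex might be entered.
precomposed = "\u0175"     # single code point for w with circumflex
decomposed = "w\u0302"     # w followed by COMBINING CIRCUMFLEX ACCENT
accent_first = "\u0302w"   # accent typed before the w

print(precomposed == decomposed)                 # raw comparison fails
print(same_letter(precomposed, decomposed))      # normalization merges these
print(same_letter(precomposed, accent_first))    # accent-first stays different
```

Normalization takes care of the precomposed vs. letter-plus-accent case, but it cannot rescue an accent typed before its letter, and it does nothing for the three H's, which are different characters by design.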


Arial Unicode Coming to OS X 10.5 (Leopard)


It's kind of buried, but the OS X release information says that the Microsoft fonts Arial Unicode, Tahoma, Microsoft Sans and others will now be shipping on the Mac.


This is a good sign that Apple is moving towards full interoperability with Windows OTF fonts (at the recent IUC 31 conference, they said OTF support for Arabic was complete).

It's also good to have Arial Unicode as a test font since that's what's on the Windows platform. I have installed older versions of Arial Unicode on the Mac for testing purposes before, but it never did work quite right.

Apple is also promising other international enhancements including Braille support, true Persian support, Tibetan and Kazakh, expanded character palettes, as well as expanded Chinese, Korean and Japanese support.


Is the IUC Conference Worth it? Absolutely!


My university was kind enough to send me to IUC 31 (http://www.unicodeconference.org) this year, and I can honestly say that it was one of the best conferences I've been to.

For one thing, almost all the major players (Microsoft, Apple, Sun, IBM, Adobe, Google, Yahoo, W3C) sent representatives, so I got to hear a lot of great information straight from the source. I've been hacking away at this for seven years, but I learned quite a bit of new information, especially about some of the more technical aspects.

The Unicode conference is also very good at providing a range of how-tos, from absolute monolingual beginner to cutting edge tools for the experienced Unicoder. Even the basics gave me some pointers that I had forgotten or hadn't considered. I obviously couldn't make all the sessions (no cloning yet), but the PDFs that attendees can access are fairly detailed and can help you track down what you missed.

I have to confess that my favorite track was probably "Unicode on the Front Lines" in which linguists described encoding issues for minority languages and scripts. From a language geek perspective, it's fascinating what new issues come up. More importantly, I saw that there was a lot of support for outreach in the Unicode community. I heard the members of the Unicode Org point some users to resources they hadn't known about before.

I myself gave a presentation about Unicode at Penn State, and I have to say most of the feedback was very positive, and I got a few tips myself.

So all in all, I have to say thanks to the organizers of the conference for putting on a great event.


Unicode 31 Presentation


I'm actually presenting at IUC 31 (http://www.unicodeconference.org) in San José about supporting international technology at Penn State. You can download the PowerPoint if you want to read more.

Download Powerpoint


UniView Unicode Character Lookup


Richard Ishida has a Web-based Unicode lookup tool at

This is a search form which allows you to view data by name, hex value, actual pasted character or range.

There's another conversion utility at
which allows you to convert characters from hex values to different variants such as decimal values, percent escapes (Web address) and UTF-8 vs. UTF-16.

The character paste feature is especially valuable for random symbols such as ∞ (infinity) or ɛ (Open e, epsilon vowel). You can finally extract a code point from a weird symbol used in your Word doc.
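If you'd rather do the same kind of lookup offline, here is a rough Python sketch of the idea. It is not how UniView or the conversion utility work internally; char_report is a made-up helper that uses only the standard library to show the hex and decimal values, the character name, the percent escape and the UTF-8 and UTF-16 bytes for a pasted character.

```python
import unicodedata
from urllib.parse import quote


def char_report(ch):
    """Return several common representations of a single character."""
    cp = ord(ch)
    return {
        "hex": f"U+{cp:04X}",
        "decimal": cp,
        "name": unicodedata.name(ch, "<unnamed>"),
        "percent_escape": quote(ch),             # percent-encoded UTF-8 bytes
        "utf8": ch.encode("utf-8").hex(),        # bytes as hex digits
        "utf16": ch.encode("utf-16-be").hex(),   # big-endian, no byte order mark
    }


# Paste in the "weird symbol" you want to identify.
for symbol in ("\u221E", "\u025B"):   # infinity sign, open e
    print(symbol, char_report(symbol))
```

The decimal value is what you would use in a numeric HTML entity, and the percent escape is what shows up when the character lands in a Web address.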


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
