June 2009 Archives

iPhone 3.0 Unicode Support (Finding the ŵ)


This week I upgraded my iPhone (actually iPod Touch) software to version 3.0, and although I noted the copy/paste and enhanced landscape display, of course I zeroed in on the note saying there was increased character support. Hmmm.

As a warning, I have to admit that I'm a little behind the times in mobile computing, so bear with me if I repeat something you already know. Still, I'm not seeing this information all in one place, so it may be a good overview (at least for me).

The good news is that there does appear to be more character support, but the feature is still too well-hidden (I really had to work hard to find Welsh support). The iPhone also fails my test for general Unicode readiness because I am not yet able to enter phonetic characters like /ŋ,ɛ,ʃ/ (if nothing else that would kill the iPhone as a remote data entry device). However, I suspect the iPhone is not alone in that area.

So if you are wondering what I am talking about, let me discuss in context:

Baseline Support

Unicode data and display for major languages are generally supported. If Safari can display your Unicode Webpage, it will appear correctly on your iPhone...assuming that the built-in fonts support the character. Further, if you have entered/purchased an exotic title in iTunes, it will appear correctly in your synched iTunes list on the iPhone.

Entering Accents

The next challenge is entering some exotic characters into e-mail or a notes application. If you are dealing with Roman characters, iPhone does have some support, but not as much as I would like. The easiest non-English characters to find are foreign currency symbols like £ (pound), ¥ (yen) and € (euro). You typically access these by clicking the symbol set (often right after the numerals).

While I was able to figure that out, I admit to being stumped as to how to enter accented letters such as Spanish ñ or French è. Fortunately a quick Google search turned up some help sites including this blog entry from Pixelcoma. As you can see, the trick is to hold down a base key such as N or E to see the options for accented characters.

The catch, though, is that you have to drag your finger across to the right character. You can't hold and double tap as I tried to do. Oops.

As stated earlier, there are more options in the palette than in earlier versions. For instance, the Pixelcoma A options show A,À,Á,Ä,Æ,Ã,Å,Ą which already covers lots of Western and Central European languages, but Version 3 does add Ā (macron) which is good for Japanese Romaji, Hawaiian, Maori and Latin with long marks (I know there are Latin users out there). I assume that there are other important additions at the other base letters.

However, there are still apparent gaps such as Welsh accented W and Icelandic þ,ð/Ð as well as Romanian Ă, Turkish Ğ,Ş and İ, Latvian Ņ and other really exotic accented letters. It turns out that many of these are actually available through the additional keyboard options installed on the iPhone. It still can feel like these languages are "second class" in comparison to Spanish, French and German (at least Polish, Czech and Hungarian have been "mainstreamed" which is a plus).
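For anyone curious about what sits behind these accented letters, here is a minimal Python sketch using the standard unicodedata module. It reports each character's code point, its Unicode name, and its canonical decomposition (base letter plus combining accent, where one exists). The describe function is just an illustrative name of mine, and nothing here reflects how the iPhone keyboard is actually implemented.

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, decomposed code points) for one character."""
    # NFD splits a precomposed letter into base letter + combining mark(s);
    # letters with no canonical decomposition (like thorn) come back unchanged.
    decomposed = unicodedata.normalize("NFD", ch)
    parts = [f"U+{ord(c):04X}" for c in decomposed]
    return f"U+{ord(ch):04X}", unicodedata.name(ch), parts

# A few of the letters mentioned above
for ch in "Āŵñþ":
    print(ch, *describe(ch))
```

Letters like ŵ break down into a plain base letter plus a combining accent, while þ is a letter in its own right, which may be one reason it needs a dedicated keyboard rather than a slot on an accent palette.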

Before I leave this section though, I do have a comment for future developers:

Future developers - if you want to wow your audience with global accent support, you may want to start here at the Wikipedia Latin palette.


That way we can avoid the agonizing incremental addition of accented letters as individual user communities step forward. Why not be comprehensive at the start - like the Apple U.S. Extended keyboard (which is one of the major reasons I still love Apple).

As much as I kvetch though, I don't think the iPhone is worse than any other U.S. mobile device. A forum post for Blackberry mentions holding down a vowel and moving a trackball. ¡Qué divertido!

Other Keyboards

As mentioned previously, if your character is not available in the accent palette, you may need to activate additional keyboards (just like on a laptop/desktop). On the iPhone, you access these by clicking the Settings app, then going to General Settings then International. A number of keyboards for languages like Chinese, Japanese, Russian, Hebrew, Arabic as well as Icelandic, Turkish, Latvian are available (still no Welsh, unless it's hiding under the U.K. keyboard (yes it is!)).

This adds a globe icon (like the one below) to the usual iPhone keyboard and allows you to switch between keyboard modes. I just switched to the U.K. keyboard and behold, I found the ŵ under the W key (but now the ¥ key is missing).

Icon for International Keyboards on iPhone

What I Really Want...

Actually it's not necessarily more accented letters as I hold down a key. My thumb is shuddering at the potential pain of dragging or trackballing to additional accents on top of the other precision maneuvers required for English texting. I actually want several things:

First, slightly better keyboard designs. The iPhone Google keyboard has the right idea when it makes the @ sign and .com extension basic keys. We already have options for switching on canned keyboards, but what if we had options for customizable keyboards? Maybe one with a "symbol" dock into which you drag the characters or phrases you need from a master slot (this way Americans learning Welsh CAN have their accented W's). Maybe you can reshuffle as well (like killing the \ key if you only synch with a Mac).

But I have to confess that I really want to be able to plug my iPhone into a keyboard. The touch interface is fine for short, small tasks on the run (like looking up movie times or weather by zip code), but still not so great for longer data entry or note-taking tasks. I know it's Palm Pilot, but I am at a stage where I would like to ditch the laptop for short meetings and only carry a mobile device and take notes. I note that there are hacks out there already...despite the useful shortcuts provided. That should be a sign for Apple and other makers of mobile devices that the need is out there (bummer dudes).

It goes without saying that if true Mac keyboard integration comes, it should come with support for the U.S. Extended and other keyboard variations Apple and the user community have concocted (Windows users can use the U.S. International keyboard for the Mac).

A final wish though is better documentation. The Unicode support for iPhone is decent, but it's quite a chore tracking it all down through numerous user blogs and guessing. I know Apple relies somewhat on its "intuitive" interface to help users through, but, for whatever reason, Unicode support is rarely intuitive. You just have to know where things are. I'm glad there's a user community out there but from the lack of documentation (especially in comparison to Microsoft) it seems like Apple doesn't care about these issues (when I think they really do).

Microsoft has various Globalization sites (in English), so why can't Apple (or at least one I can find)? Is it because we're in the U.S.? To me, it's a little condescending to assume that just because I live in the U.S. I will rarely need to enter non-English text. In fact, I type something "non-English" nearly every day.


Sensible Language Tagging Advice from Unicode


As I have written before, the language tagging architecture is a little confusing. First, there are successive standards including ISO 639, ISO-639-2, ISO-639-3 and others. In addition, there are multiple ways to tag languages, especially languages like "Chinese" and "Arabic" plus a legacy combination of 2-letter and 3-letter codes.

Spoken vs Written Language

The reason for much of this confusion is that language coding changes depending on whether you are focusing on written language (like Unicode and major vendors do) or spoken language (as linguists or film historians might), but few sources recognize it. However the CLDR does mention it. Specifically:

The Ethnologue [the online language encyclopedia (which maintains ISO-639-3)] focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and for fluent speakers (not necessarily native speakers).

In other words, there are lots of spoken forms in the world which are not used in written form. In the United States for instance, everyone is taught standard (or "proper") written English even if they actually speak AAVE (African American Vernacular English), Boston/New York English or Appalachian English at home. Similarly, no spell checkers recognize subtle pronunciation differences between the English of California, Minnesota or the two East/West halves of Pennsylvania.

As far as most of the world (including the Microsoft Office spell checker and Amazon.com) is concerned, there is only one U.S. English (en-US), and only one English for Britain as well (en-GB)...even though England, Scotland and Wales have even more variation in spoken forms - enough so that Ozzy Osbourne's local dialect is difficult for American ears to parse.

The more interesting cases are macrolanguages like Arabic or Chinese - which are languages with cultural unity but linguistic diversity. Here the CLDR recommends the macrolanguage code. Their advice again is to assume that the macrolanguage is THE language code:

For a number of reasons, Unicode language and locale identifiers always use the Macrolanguage for the predominent form. Thus the Macrolanguage code "zh" (Chinese) is used instead of "cmn" (Mandarin)...It would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language code for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation.

Let's examine both the Arabic and Chinese case and see how it works.


First, modern Arabic scholars distinguish written Modern Standard Arabic (MSA), which most educated speakers are familiar with, from the different forms of Colloquial Arabic, which is what is spoken at home. The Colloquial forms are different enough to be assigned different language codes in ISO-639-3, but in fact these are rarely written - only MSA is usually written (or used in formal speeches).

If you are working on or preparing an Arabic document, chances are that it will be in MSA with maybe a few national quirks (e.g. ar-EG may apply in some cases for an MSA document from Egypt).


Chinese, like Arabic, is really a macrolanguage with many spoken varieties which are not always understood across the country. However, recent governments, with their capitals in Beijing, have promoted a national variety based on Northern Chinese as the national language. Again, most documents from the PRC or Taiwan will be in Mandarin Chinese...so in effect Chinese (zh) = Mandarin (cmn) in most situations.
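The rule of thumb from the CLDR advice (use the macrolanguage code for the predominant written form) can be sketched as a small lookup in Python. The table and function names here are hypothetical, and the table is hand-written and deliberately tiny: it contains only the two pairs quoted from the CLDR text above, not the real data.

```python
# Illustrative, hand-written table (an assumption, not actual CLDR data);
# only the two pairs quoted in the CLDR text are included.
MACROLANGUAGE_FOR = {
    "cmn": "zh",  # Mandarin -> Chinese
    "arz": "ar",  # Egyptian Arabic -> Arabic
}

def written_language_code(code):
    """Return the macrolanguage code for a predominant form, else the code itself."""
    code = code.lower()
    return MACROLANGUAGE_FOR.get(code, code)

print(written_language_code("cmn"))  # zh
print(written_language_code("arz"))  # ar
print(written_language_code("cy"))   # cy (no mapping, so returned unchanged)
```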

Ironically though, Mandarin needs multiple codes because there are now multiple ways to write this language - the old Traditional Hanzi system (Taiwan), the Simplified characters (China), Pinyin romanization and the older Wade-Giles. Because language tagging is really focused on written language, there are multiple variant tags for Chinese in different scripts (e.g. zh-Hant = Traditional Chinese, zh-Hans = Simplified).
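To show how a tag like zh-Hant or ar-EG breaks apart, here is a simplified Python sketch. It assumes the usual subtag shapes (a language code first, a four-letter script code, a two-letter or three-digit region code) and silently ignores everything else, so it is nowhere near a full language tag parser. The parse_tag name is my own.

```python
def parse_tag(tag):
    """Split a language tag into language, script and region (simplified sketch)."""
    subtags = tag.replace("_", "-").split("-")
    result = {"language": subtags[0].lower(), "script": None, "region": None}
    for sub in subtags[1:]:
        if len(sub) == 4 and sub.isalpha() and result["script"] is None:
            # Script codes are four letters, conventionally title case (Hant, Cyrl)
            result["script"] = sub.title()
        elif result["region"] is None and (
            (len(sub) == 2 and sub.isalpha()) or (len(sub) == 3 and sub.isdigit())
        ):
            # Region codes are two letters (upper case by convention) or three digits
            result["region"] = sub.upper()
        # Any other subtag (variants, extensions and so on) is ignored here
    return result

for tag in ["zh-Hant", "zh-Hans", "ar-EG", "en-us"]:
    print(tag, parse_tag(tag))
```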

When to use "cmn" for Mandarin

Are there situations when "cmn" for Mandarin Chinese might be appropriate? I would say yes...if you are researching or documenting spoken forms in modern China. For instance, a linguist may be doing field work to document spoken forms from across China.

At the spoken level, even Mandarin (i.e. Northern forms) has dialectal features and it may also be important to compare historical developments between Mandarin and other forms such as Cantonese (yue), Wu (wuu) and Hakka (hak). In that case, I would recommend using the ISO-639-3 language codes to tag everything. That will ensure everything is in the same format and will probably facilitate searching down the line. Others might recommend using the macrolanguage code plus the ISO-639-3 language code (so that Mandarin is zh-cmn and Cantonese is zh-yue).

As you can see, the CLDR advice is a good primer on how to tag. Most documents can be tagged with a simple system defined in ISO-639-2, but documents being tagged by linguists may need the larger set of ISO-639-3 tags. It really clears up a lot of ambiguity about how to tag.

Tagging Language Variations

A final issue is how to tag language variations which can include changes in script, changes in spelling convention or spoken variation. Although many common variants are registered, there are always more to be added.

Following the advice in the CLDR, though, I would only pursue registration of tags for written variations. This recommendation will likely be controversial, but is actually consistent with common practice and most user needs. For instance, it does make sense for Microsoft to support spell checkers for en-US vs en-GB or other national varieties of English. Similarly everyone needs to support both Simplified and Traditional Chinese.

But will a spell checker or grammar checker ever be programmed for something like Appalachian English? Not anytime soon. For one thing, there probably is NO "standard Appalachian grammar" - just a series of field work studies and observations with LOTS of individual variation. In fact, one of the great challenges for establishing any written standard is getting agreement on how to handle variations across small distances.

Another concern of mine in registering spoken variants is that I am not seeing a systematic pattern of registration of spoken language variations. For instance, dialectologists for American English recognize different regions in the U.S. (e.g. Mid-Atlantic, Midwest, the South, California/West, New England, New York etc), which can be further subdivided into more distinct communities (e.g. Queens vs. Brooklyn vs. Long Island). This is actually ignoring the reality that a city can have speakers from unrelated dialects (e.g. AAVE, Spanish-influenced English and other world Englishes).

In theory a registration of dialects should be fairly systematic (e.g. en-US-NYC-longisland), but that is NOT what I am seeing. It's very difficult to know how to tag except on an ad hoc basis. And once a tag is registered, it remains there forever, even if a "deprecated" note is added. I'm not sure the current system is really beneficial, since it is just replicating an ad hoc approach that is not necessarily helpful for the field of dialectology.

On the plus side, I think the system works well for written variations - we even have standard tags for scripts to attach to a language tag. If Spanish is ever written in Cyrillic, I will know to tag it "es-Cyrl."


CLDR = Unicode Common Locale Data Repository


An aspect of internationalization (i18n) I often skip over is localization (l10n), or customizing text for a locale (e.g. spelling, transliteration conventions, prices in the correct currency, date stamping in the correct format etc). If localization is something you need to have on a Web site, you may want to start at the Unicode CLDR page (http://cldr.unicode.org/) which compiles a variety of charts and some guides.

There are other resources available including several from IBM listed below:

These resources are generally aimed at a programmer audience, but there are interesting nuggets for the non-programmer as well.
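To give a feel for what date stamping in the correct format involves, here is a minimal Python sketch. The pattern table is a hand-written approximation of common short date conventions, not data taken from the CLDR, and the names SHORT_DATE_PATTERNS and format_short_date are hypothetical. A real site would pull its patterns from CLDR-backed tools rather than a table like this.

```python
from datetime import date

# Hand-written approximations of short date conventions (an assumption,
# NOT actual CLDR data); keys are language-region tags.
SHORT_DATE_PATTERNS = {
    "en-US": "%m/%d/%Y",  # month/day/year
    "en-GB": "%d/%m/%Y",  # day/month/year
    "de-DE": "%d.%m.%Y",  # day.month.year
}

def format_short_date(day, locale_tag):
    """Format a date for the given locale tag, falling back to year-month-day."""
    pattern = SHORT_DATE_PATTERNS.get(locale_tag, "%Y-%m-%d")
    return day.strftime(pattern)

stamp = date(2009, 6, 30)
for tag in ["en-US", "en-GB", "de-DE", "cy-GB"]:
    print(tag, format_short_date(stamp, tag))
```

The same day comes out as 06/30/2009 for en-US and 30/06/2009 for en-GB, which is exactly the kind of ambiguity localization data is meant to resolve.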


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
