How Unicode Mattered in Iran

As protesters expressed their anger with the Iranian presidential electoral process, the world marvelled at how Twitter, Facebook and other Internet outlets were being used by Iranians to communicate with each other even as the government was sending out forces to suppress the riots. The U.S. State Department even requested that Twitter reschedule a fix so as not to interfere with the daylight hours of Iran.

Heady stuff for technologies we normally associate with the most insipid of Internet messaging ("OMG - The Orioles lost again?!?"). I'm glad Facebook and Twitter were there, but I suspect that some of the most important messages were in Persian (Farsi) and were made possible by another less glamorous technology - Unicode. Both Facebook and Twitter have had underlying Unicode support from the beginning, so assuming your system had the right fonts, you could communicate in any language from Persian to Igbo and then some.

Although I am normally a symbol geek in my love for Unicode (goes well with my lifelong obsession with fonts, foreign language and exotic characters), at times like these I realize that Unicode is an important tool for the dream of the Internet enabling anyone, anywhere to speak out and be heard. If you are not a symbol geek, but wonder why Unicode is important...I think bloggers and Tweeters in Iran, China and everywhere can show you the answer. Unicode makes it possible for everyone to be heard...even if you haven't had the chance to learn English.

Postscript: English Digital Divide

In some countries there is a real digital divide based on language - that is, those who have learned a major language such as English, French, Spanish, Chinese or Arabic are able to use the Internet, while others who only know a relatively under-supported language have little to no access.

For instance, I asked a scholar at a Sri Lankan university how they computed in the Sinhala script, and his answer was that all computing was assumed to be in English (partly because Sri Lanka used to be the British colony Ceylon). I was a little startled, but it makes sense. Until recently, I suspect that only a few people or institutions in the upper economic tiers could have afforded computers and they were likely already educated in English. Since English support is built in, it might seem a waste to work in support for a "local" script. Still, I think a lot of people and organizations understand the importance of Unicode in increasing access (and preserving local languages) and are working to provide low-cost utilities for these communities.

iPhone 3.0 Unicode Support (Finding the ŵ)

This week I upgraded my iPhone (actually iPod Touch) software to version 3.0, and although I noted the copy/paste and enhanced landscape display, of course I zeroed in on the note saying there was increased character support. Hmmm.

As a warning, I have to admit that I'm a little behind the times in mobile computing, so bear with me if I repeat something you already know. Still, I'm not seeing this information all in one place, so it may be a good overview (at least for me).

The good news is that there does appear to be more character support, but the feature is still too well-hidden (I really had to work hard to find Welsh support). The iPhone also fails my test for general Unicode readiness because I am not yet able to enter phonetic characters like /ŋ,ɛ,ʃ/ (if nothing else that would kill the iPhone as a remote data entry device). However, I doubt the iPhone is alone in that area.

So if you are wondering what I am talking about, let me discuss in context:

Baseline Support

Unicode data and display for major languages is generally supported. If Safari can display your Unicode Webpage, it will appear correctly on your iPhone...assuming that the built in fonts support the character. Further, if you have entered/purchased an exotic title in iTunes, it will appear correctly in your synched iTunes list on the iPhone.

Entering Accents

The next challenge is entering some exotic characters into e-mail or a notes application. If you are dealing with Roman characters, iPhone does have some support, but not as much as I would like. The easiest non-English characters to find are foreign currency symbols like £ (pound), ¥ (yen) and € (euro). You typically access these by clicking the symbol set (often right after the numerals).

While I was able to figure that out, I admit to being stumped as to how to enter accented letters such as Spanish ñ or French è. Fortunately a quick Google search turned up some help sites including this blog entry from Pixelcoma. As you can see, the trick is to hold down a base key such as N or E to see the options for accented characters.

The trick though is that you have to drag your finger across to the right character. You can't hold and double tap as I tried to do. Oops.

As stated earlier, there are more options in the palette than in earlier versions. For instance, the Pixelcoma A options show A,À,Á,Ä,Æ,Ã,Å,Ą which already covers lots of Western and Central European languages, but Version 3 does add Ā (macron) which is good for Japanese Romaji, Hawaiian, Maori and Latin with long marks (I know there are Latin users out there). I assume that there are other important additions at the other base letters.

However, there are still apparent gaps such as Welsh accented W and Icelandic þ,ð/Ð as well as Romanian Ă, Turkish Ğ,Ş and İ, Latvian Ņ and other really exotic accented letters. It turns out that many are actually in keyboard options installed on the iPhone with additional characters. It still can feel like these languages are "second" class in comparison to Spanish, French and German (at least Polish, Czech and Hungarian have been "mainstreamed" which is a plus).
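To see why a "hold the base key" palette can reach some of these letters but not others, here is a minimal sketch (not anything Apple ships) using Python's standard `unicodedata` module. It assumes the letters are typed in their precomposed form; the helper name `describe` is made up for illustration. Letters like ŵ split into a base letter plus a combining accent, while þ and ð are standalone letters with no base key to hang them on.

```python
import unicodedata

def describe(ch):
    """List the code points that make up ch after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in decomposed]

# Letters mentioned above as gaps in the accent palette.
letters = ["\u0175", "\u0102", "\u011e", "\u015e", "\u0130", "\u0145", "\u00fe", "\u00f0"]

for letter in letters:
    # Accented letters yield two entries (base + combining mark);
    # thorn and eth yield a single entry because they do not decompose.
    print(letter, "->", "; ".join(describe(letter)))
```

That may be one reason the Icelandic letters live on their own keyboard rather than under a base key, though that is my guess rather than anything documented.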

Before I leave this section though, I do have a comment for future developers:

Future developers - if you want to wow your audience with global accent support, you may want to start here at the Wikipedia Latin palette.

WikipediaLatinPal.png

That way we can avoid the agonizing incremental addition of accented letters as individual user communities step forward. Why not be comprehensive at the start - like the Apple U.S. Extended keyboard (which is a major reason I still love Apple)?

As much as I kvetch though, I don't think the iPhone is worse than any other U.S. mobile device. A forum post for Blackberry mentions holding down a vowel and moving a trackball. ¡Qué divertido!

Other Keyboards

As mentioned previously, if your character is not available in the accent palette, you may need to activate the keyboards (just like on a laptop/desktop). On the iPhone, you access these by clicking the Settings app, then going to General Settings then International. A number of keyboards for languages like Chinese, Japanese, Russian, Hebrew, Arabic as well as Icelandic, Turkish, Latvian are available (still no Welsh, unless it's hiding under the U.K. keyboard (yes it is!)).

This adds a globe icon (like the one below) to the usual iPhone keyboard and allows you to switch between keyboard modes. I just switched to the U.K. keyboard and behold, I found the ŵ under the W key (but now the ¥ key is missing).

GlobeIcon.png
Icon for International Keyboards on iPhone

What I Really Want...

Actually it's not necessarily more accented letters as I hold down a key. My thumb is shuddering at the potential pain of dragging or trackballing additional accents on top of the other precision maneuvers required for English texting. I actually want several things.

First, slightly better keyboard designs. The iPhone Google keyboard has the right idea when it makes the @ sign and .com extension basic keys. We already have options for switching on canned keyboards, but what if we had options for customizable keyboards? Maybe one with a "symbol" dock into which you drag the characters or phrases you need from a master slot (this way Americans learning Welsh CAN have their accented W's). Maybe you can reshuffle as well (like killing the \ key if you only synch with a Mac).

But I have to confess that I really want to be able to plug my iPhone into a keyboard. The touch interface is fine for short, small tasks on the run (like looking up movie times or weather by zip code), but still not so great for longer data entry or note taking tasks. I know it's Palm Pilot, but I am at a stage where I would like to ditch the laptop for short meetings and only carry a mobile device and take notes. I note that there are hacks out there already...despite the useful shortcuts provided. That should be a sign for Apple and other makers of mobile devices that the need is out there (bummer dudes).

It goes without saying that if true Mac keyboard integration comes, it should come with support for the U.S. Extended and other keyboard variations Apple and the user community have concocted (Windows users can use the U.S. International keyboard for the Mac).

A final wish though is better documentation. The Unicode support for iPhone is decent, but it's quite a chore tracking it all down through numerous user blogs and guessing. I know Apple relies somewhat on its "intuitive" interface to help users through, but, for whatever reason, Unicode support is rarely intuitive. You just have to know where things are. I'm glad there's a user community out there but from the lack of documentation (especially in comparison to Microsoft) it seems like Apple doesn't care about these issues (when I think they really do).

Microsoft has various Globalization sites (in English), so why can't Apple (or at least one I can find)? Is it because we're in the U.S.? To me, it's a little condescending to assume that just because I live in the U.S. I will rarely need to enter non-English text. In fact, I type something "non-English" nearly every day.

Sensible Language Tagging Advice from Unicode

As I have written before, the language tagging architecture is a little confusing. First, there are successive standards including ISO 639, ISO-639-2, ISO-639-3 and others. In addition, there are multiple ways to tag languages, especially languages like "Chinese" and "Arabic" plus a legacy combination of 2-letter and 3-letter codes.

Spoken vs Written Language

The reason for much of this confusion is that language coding changes depending on whether you are focusing on written language (like Unicode and major vendors do) or spoken language (as linguists or film historians might), but few sources recognize it. However the CLDR does mention it. Specifically:

The Ethnologue [the online language encyclopedia (which maintains ISO-639-3)] focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and for fluent speakers (not necessarily native speakers).

In other words, there are lots of spoken forms in the world which are not used in written form. In the United States for instance, everyone is taught standard (or "proper") written English even if they actually speak AAVE (African American Vernacular English), Boston/New York English or Appalachian English at home. Similarly, no spell checkers recognize subtle pronunciation differences between the English of California, Minnesota or the two East/West halves of Pennsylvania.

As far as most of the world (including the Microsoft Office spell checker and Amazon.com) is concerned, there is only one U.S. English (en-US), and only one English for Britain as well (en-GB)...even though England, Scotland and Wales have even more variation in spoken forms - enough so that Ozzy Osbourne's local dialect is difficult for American ears to parse.

The more interesting cases are macrolanguages like Arabic or Chinese - which are languages with cultural unity but linguistic diversity. Here the CLDR recommends the macrolanguage code. Their advice again is to assume that the macrolanguage code is THE language code:

For a number of reasons, Unicode language and locale identifiers always use the Macrolanguage for the predominant form. Thus the Macrolanguage code "zh" (Chinese) is used instead of "cmn" (Mandarin)...It would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language code for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation.

Let's examine both the Arabic and Chinese case and see how it works.

Arabic

First, modern Arabic scholars distinguish written Modern Standard Arabic (MSA), which most educated speakers are familiar with, from different forms of Colloquial Arabic, which is what is spoken at home. The Colloquial forms are different enough to be assigned different language codes in ISO-639-3, but in fact these are rarely written - only MSA is usually written (or used in formal speeches).

If you are working on or preparing an Arabic document, chances are that it will be in MSA with maybe a few national quirks (e.g. ar-EG may apply in some cases for an MSA document from Egypt).

Chinese

Chinese, like Arabic, is really a macrolanguage with many spoken varieties which are not always understood across the country. However, recent governments, with their capitals in Beijing, have promoted a national variety based on Northern Chinese as the national language. Again, most documents from the PRC or Taiwan will be in Mandarin Chinese...so in effect Chinese (zh) = Mandarin (cmn) in most situations.

Ironically though, Mandarin needs multiple codes because there are now multiple ways to write this language - the old Traditional Hanzi system (Taiwan), the Simplified characters (China), Pinyin romanization and the older Wade-Giles. Because language tagging is really focused on written language, there are multiple variant tags for Chinese in different scripts (e.g. zh-Hant = Traditional Chinese, zh-Hans = Simplified).
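To make the structure of tags like zh-Hant concrete, here is a minimal sketch of splitting a tag into language, script and region. It deliberately handles only that simple shape; real language tags allow more subtags (variants, extensions and so on) than this pattern accepts, and the names `TAG_PATTERN` and `parse_tag` are my own for illustration.

```python
import re

# Simplified shape: language, optional 4-letter script, optional region.
# This is an assumption-laden subset, not a full language tag grammar.
TAG_PATTERN = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

def parse_tag(tag):
    """Split a simple language tag into its parts, normalizing letter case."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unsupported tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return {
        "language": match.group("language").lower(),   # e.g. "zh"
        "script": script.title() if script else None,  # e.g. "Hant"
        "region": region.upper() if region else None,  # e.g. "TW"
    }

for tag in ["zh-Hant", "zh-Hans", "zh-Hans-CN", "en-GB"]:
    print(tag, parse_tag(tag))
```

The point of the sketch is that the script subtag sits between language and region, which is what lets one language code carry several writing systems.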

When to use "cmn" for Mandarin

Are there situations when "cmn" for Mandarin Chinese might be appropriate? I would say yes...if you are researching or documenting spoken forms in modern China. For instance, a linguist may be doing field work to document spoken forms from across China.

At the spoken level, even Mandarin (i.e. Northern forms) has dialectal features and it may also be important to compare historical developments between Mandarin and other forms such as Cantonese (yue), Wu (wuu) and Hakka (hak). In that case, I would recommend using the ISO-639-3 language codes to tag everything. That will ensure everything is the same format and will probably facilitate searching down the line. Others might recommend using the macrolanguage code plus the ISO-639-3 language code (so that Mandarin is zh-cmn and Cantonese is zh-yue).
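The two tagging strategies can be sketched side by side. The table below is a hand-picked subset covering only the codes mentioned in this entry, and the function names `document_tag` and `fieldwork_tag` are hypothetical; a real project would load the full ISO-639-3 macrolanguage mappings instead.

```python
# Hand-picked subset: individual language code -> macrolanguage code.
MACROLANGUAGE = {
    "cmn": "zh",  # Mandarin
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu
    "hak": "zh",  # Hakka
    "arz": "ar",  # Egyptian Arabic
}

def document_tag(code):
    """Tag for written documents and UI: collapse to the macrolanguage."""
    return MACROLANGUAGE.get(code, code)

def fieldwork_tag(code, with_macro=False):
    """Tag for spoken-language research: keep the specific code,
    optionally prefixed with the macrolanguage (e.g. zh-yue)."""
    if with_macro and code in MACROLANGUAGE:
        return f"{MACROLANGUAGE[code]}-{code}"
    return code

for code in ["cmn", "yue", "arz"]:
    print(code, document_tag(code), fieldwork_tag(code, with_macro=True))
```

Whichever style you pick for field data, using it consistently across the whole collection is what makes later searching workable.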

As you can see, the CLDR advice is a good primer on how to tag. Most documents can be tagged with a simple system defined in ISO-639-2, but documents being tagged by linguists may need the larger set of ISO-639-3 tags. It really clarifies a lot of ambiguity with how to tag.

Tagging Language Variations

A final issue is how to tag language variations which can include changes in script, changes in spelling convention or spoken variation. Although many common variants are registered, there are always more to be added.

Following the advice in the CLDR though I would only pursue registration of tags for written variations. This recommendation will likely be controversial, but is actually consistent with common practice and most user needs. For instance, it does make sense for Microsoft to support spell checkers for en-US vs en-GB or other national varieties of English. Similarly everyone needs to support both Simplified and Traditional Chinese.

But will a spell checker or grammar checker ever be programmed for something like Appalachian English? Not anytime soon. For one thing, there probably is NO "standard Appalachian grammar" - just a series of field work studies and observations with LOTS of individual variation. In fact, one of the great challenges for establishing any written standard is getting agreement on how to handle variations across small distances.

Another concern of mine in registering spoken variants is that I am not seeing a systematic pattern of registration of spoken language variations. For instance, dialectologists for American English recognize different regions in the U.S. (e.g. Mid-Atlantic, Midwest, the South, California/West, New England, New York etc), which can be further subdivided into more distinct communities (e.g. Queens vs. Brooklyn vs. Long Island). This is actually ignoring the reality that a city can have speakers from unrelated dialects (e.g. AAVE, Spanish-influenced English and other world Englishes).

In theory a registration of dialects should be fairly systematic (e.g. en-US-NYC-longisland), but that is NOT what I am seeing. It's very difficult to know how to tag except on an ad hoc basis. And once a tag is registered, it remains there forever, even if a "deprecated" note is added. I'm not sure the current system is really beneficial, since it is just replicating an ad hoc approach that is not necessarily helpful for the field of dialectology.

On the plus side, I think the system works well for written variations - we even have standard tags for scripts to attach to a language tag. If Spanish is ever written in Cyrillic, I will know to tag it "es-Cyrl."

CLDR = Unicode Common Locale Data Repository

An aspect of internationalization (i18n) I often skip over is localization (l10n), or customizing text (e.g. spelling, transliteration conventions, prices in the correct currency, date stamping in the correct format etc). If localization is something you need to have on a Web site, you may want to start at the Unicode CLDR page (http://cldr.unicode.org/) which compiles a variety of charts and some guides.

There are other resources available including several from IBM listed below:

These resources are generally aimed at the programmer audience, but there are interesting nuggets for the non-programmer as well.
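As a rough illustration of what "date stamping in the correct format" means in practice, here is a small sketch. The patterns are typed in by hand from common convention and are not read from CLDR; a real site would pull them from the CLDR data itself. The names `DATE_PATTERNS` and `format_date` are made up for this example.

```python
from datetime import date

# Illustrative patterns only, entered by hand (NOT taken from CLDR data).
DATE_PATTERNS = {
    "en-US": "%m/%d/%Y",  # month first
    "en-GB": "%d/%m/%Y",  # day first
    "de-DE": "%d.%m.%Y",  # day first, dot separators
}

def format_date(value, locale_tag):
    """Format a date for a locale, falling back to ISO 8601 if the locale is unknown."""
    pattern = DATE_PATTERNS.get(locale_tag, "%Y-%m-%d")
    return value.strftime(pattern)

stamp = date(2009, 4, 16)
for tag in ["en-US", "en-GB", "de-DE", "xx-XX"]:
    print(tag, format_date(stamp, tag))
```

Even this toy version shows why the same numeric date can be ambiguous across locales, which is the kind of detail the CLDR charts spell out.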

Glyph Du Jour: Arabic Bismallah/Basmala

A while back I wrote a blog entry asking how to define the boundary between glyph variant and calligraphic art. Today I ran into a case that I really think highlights how complex it is.

Bismallah character

The Bismallah character is in the Arabic presentation block and it visually jumped out at me because it was so complex in comparison to every other symbol. You can see the character below at 288 point, 36 point, and 14 point. My initial reaction? Wow, what a beauty.

Bismallah sign from the PakType font, shown at 288 point (intertwined Arabic letters forming a semi-circle), 36 point and 14 point

Meaning

The full name of the sign in the Unicode spec is "ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM" and it is assigned to Unicode code point U+FDFD. According to my research, this phrase translates to "In the name of God (Allah), Most Gracious, Most Merciful/Compassionate" (translations vary). It begins every chapter of the Koran (Qur'an) except one, is used in prayers, and apparently appears in other contexts including the preambles of several constitutions in the Islamic world. Wikipedia has a good overview of the Bismallah/Basmala.

It has a deep spiritual meaning and this phrase has become the basis of many pieces of Arabic calligraphy. Since the phrase is so common in Islamic religion, it makes sense that a special sign may be needed.
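If you want to confirm the name and code point yourself, Python's standard `unicodedata` module can look them up. This is just a quick check, and it assumes your Python build ships a Unicode database recent enough to include the character.

```python
import unicodedata

bismillah = "\uFDFD"  # the single code point discussed above

# Official character name from the Unicode Character Database.
print(f"U+{ord(bismillah):04X}", unicodedata.name(bismillah))

# The whole phrase is one code point, which takes three bytes in UTF-8.
print(len(bismillah), "code point,", len(bismillah.encode("utf-8")), "UTF-8 bytes")
```

So as far as the text stream is concerned, the entire phrase is a single character, and all of the visual complexity is left to the font.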

Technical Challenges

Having said that, there are several technical challenges that can be considered. One is the complexity of the sign itself. As you can see in the images above, at 14 points, it looks almost like a piece of lace with none of the characters distinguishable. The structure is not really visible until the point count is in the 30s (headline size), and even then the size should be larger to gain full appreciation of its design. It is clearly not meant to be a simple logogram incorporated into a text.

More interestingly, there are many variations in what a calligraphic Bismallah looks like. You can see examples from Flickr User Said Bak, Islam 101 and elsewhere. Some look like circles, others like birds or fruits, and many are artistic lines. Based on the examples I've seen, the creation of new forms of Bismallah is a vibrant art form.

So the question is...with so many variations, which variation do you select for your font? It does look like there are standard forms (past masterpieces I am assuming). The font I used is PakType Naskh (from Pakistan) and the designers selected a semi-circular form.

At one level the technical challenge has been overcome, but it still does not address the question of information versus art. The symbol in the PakType font is beautiful, but will future generations think that the Bismallah should have a set form or will the calligraphic tradition survive? And the tricky question - will there be variants encoded for archival purposes or will there be one Unicode point with an infinite number of variations? I know I do not have the answer to that one.

Refreshable Braille Display Video

Unicode does have a Braille block, but other than creating ordinary text documents with Braille, I am not entirely sure how the main Braille audience accesses the script (other than it's probably not a printout from a laser printer).

However, one of the accessibility lists I subscribe to mentioned this video of a refreshable Braille display. Basically a Braille user has a device which has 32 blocks (or cells) of pins. When connected to a computer, it reads 32 characters at a time and raises the appropriate combination of pins for each character. When the reader has processed each line he or she can press a button to continue. As the demo shows, an experienced Braille reader can read quite quickly.

They didn't happen to mention Unicode, but I did notice that his display has 8 pins per cell, not just the 6 needed for English only. In theory, the display can handle characters from the entire Unicode Braille block. However, it would be interesting to know how the conversion happens and how Braille from beyond English is handled...but that may be a future blog post.
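The 8-pin observation lines up neatly with how the Unicode Braille block is laid out: each of the 8 dots is one bit, added to the block's first code point U+2800. Here is a minimal sketch of that arithmetic; the function name `braille_cell` is my own, and this says nothing about how any actual display driver does its conversion.

```python
import unicodedata

BRAILLE_BASE = 0x2800  # BRAILLE PATTERN BLANK, first code point of the block

def braille_cell(*dots):
    """Return the Braille pattern character with the given dots (1-8) raised."""
    mask = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot numbers run from 1 to 8, got {dot}")
        mask |= 1 << (dot - 1)  # dot n corresponds to bit n-1
    return chr(BRAILLE_BASE + mask)

cell = braille_cell(1, 2)
print(cell, unicodedata.name(cell))

# Six dots give 2**6 patterns; eight dots give 2**8, the size of the whole block.
print(2 ** 6, "six-dot patterns,", 2 ** 8, "eight-dot patterns")
```

In other words, a 6-pin cell can only show the first quarter of the block, while an 8-pin cell can show all of it.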

WAVE AIM Ate my Latin ō

I was wondering what to write next when an accessibility test presented a perfect example of how you can be fluent in one Web standard, but goof up on another standard (Oy!).

I wanted to test Movable Type in the nifty WebAIM WAVE accessibility checker. One feature of this tool is that it will show you the location of header tags (e.g. H1,H2,H3), which can be handy to know if you are testing a Web page for markup and don't feel like plowing through a sea of HTML tags.

By chance I chose an entry about YouTube videos in Latin which talked about Latin versions of Star Wars (Bella Stellārum) which include the scene in Empire Strikes Back (Imperium Contra Offendit) where Luke learns that Darth Vader may be his father and screams "Nōōōō...n" in utter horror.

Original Blog Entry (Screen Capture)

Blog entry with stellarum is now ste and noooon highlighted

Tragically though, when WAVE rendered this page for me, I got the less dramatic "NMMMM...n". Apparently WAVE doesn't understand Unicode too well.

 

As Seen on WAVE

Stella:rum is stellMrum and nooooon is nMMMMn
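One guess at what happened (and it is only a guess, not a confirmed diagnosis of WAVE): ō is U+014D, and if some step in the pipeline keeps only the low 8 bits of each code point, 0x4D is the capital letter M. The sketch below simulates that hypothetical bug; the function name `keep_low_byte` is mine.

```python
def keep_low_byte(text):
    """Simulate a hypothetical bug that keeps only the low 8 bits of each code point."""
    return "".join(chr(ord(ch) & 0xFF) for ch in text)

scream = "N\u014d\u014d\u014d\u014d...n"  # the long-o scream from the blog entry

print(f"U+{ord(scream[1]):04X} -> U+{ord(scream[1]) & 0xFF:04X}")
print(scream, "->", keep_low_byte(scream))
```

Plain ASCII letters sit below U+0100, so they pass through untouched, which would explain why only the accented vowels were mangled.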

It looks like accessibility and Unicode together present another trap for the unwary Web worker, but then again you can always show your superior knowledge in one standard or the other - depending on your audience. In the war of the standards, it can be very comforting.

Unifaces and Other Unusual Unicode Applications

A while ago, I pointed out that vision charts have expanded beyond the Western scripts, and now so have emoticons. Check out http://twitter.com/unifaces for ways to use the wide range of Unicode symbols to express different facial expressions. Thanks to the Twitter feed authors for sending this to me.

And while I was at it I checked out her del.icio.us site and discovered that:

  1. Mojibake is the Japanese term for the Unicode question mark of death when a symbol cannot be displayed. I am glad to have a technical term, but since it's not translated, I do wonder what the literal meaning is. Hopefully it means "ghost character" or "character changing". It appears that the verb bakeru means "change spookily" or "appear in disguise". Ah the mysteries of Unicode.

  2. If you need a new hobby, you can try faking Cyrillic text with Latin characters (e.g. PyccKNN instead of Русский). Detailed instructions are on the Wikipedia Volapuk encoding page. Actually there was a scare a few years back where some Russian spammers were using Cyrillic characters to fake Western URLs (e.g. РЕИИ SТАТЕ ... or if you like Greek - ΡΕΝΝ SΤΑΤΕ). Only the "S" is Western Latin. It turns out to be tricky in both directions because it's the capitals that match the best. But I guess it's the global version of Leet (L33t/1337).
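For the curious, a crude way to spot this kind of look-alike trick is to check whether a string mixes scripts. The sketch below leans on a shortcut that is an assumption on my part: that the first word of a letter's Unicode name identifies its script (true for Latin, Cyrillic and Greek letters, but not a general rule). The function names are made up, and real spoof detection is considerably more involved.

```python
import unicodedata

def scripts_used(text):
    """Collect the first word of each letter's Unicode name as a rough script label."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_spoofed(text):
    """Flag text whose letters come from more than one script."""
    return len(scripts_used(text)) > 1

# Cyrillic letters with a single Latin "S", written with escapes so the
# code points are unambiguous.
fake = "\u0420\u0415\u0418\u0418 S\u0422\u0410\u0422\u0415"
real = "PENN STATE"

for sample in (real, fake):
    print(sample, sorted(scripts_used(sample)), looks_spoofed(sample))
```

A check like this would also flag perfectly innocent bilingual text, so treat it as a curiosity rather than a filter.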

I'd be tempted to tell everyone to get back to work, but then I would have to get back to my work, and that's not always Unicode related.

Some Recommended Books

Although the vast majority of my Unicode knowledge has come courtesy of the Internet, there are some print resources that I am beginning to find very useful, so I thought I would add some quick notes. I would add that the audience for the Unicode books is generally the programmer audience needing to implement Unicode support. These books really don't tell you how to type an accented letter.

Unicode Demystified

If you're a programmer who's been handed a foreign language project and really aren't sure where to go next, I think this is a good place to start. This book by Richard Gillam dates from 2002, but is still a valuable resource because it explains the basic concepts behind Unicode in fairly straightforward language.

This book covers the major world scripts including Latin, Cyrillic, Greek, Arabic, Hebrew, East Asian scripts, major South Asian scripts, Cherokee, Canadian Aboriginal and so forth. These scripts generally cover most of the major typographical and sorting issues you are likely to encounter, so it remains very handy for the newcomer. However, if your script is a little more exotic (or newer to Unicode), you will probably need to find alternate resources.

Information

Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard (Paperback)

Author: Richard Gillam
Year: 2002
ISBN (10/13): 0201700522 / 978-0201700527

Unicode Explained

This is from 2006, so it's more recent, and it's by Jukka Korpela who is good at explaining concepts behind encoding (as well as accessibility). Unfortunately, I haven't had a chance to acquire it yet. I will be looking forward to taking a look at this.

Information

Unicode Explained (Paperback)

Author: Jukka Korpela
Year: 2006
ISBN (10/13): 059610121X / 978-0596101213

The Unicode Standard

For each of the major Unicode Standards (e.g. 4.0, 5.0), the Unicode Consortium releases a hard-bound reference of the actual standard. If you're semi-serious about Unicode programming, I would recommend picking up at least one version of the standard and then updating over time. It does gather everything in one place...at least for the moment.

The first part explains the standard including issues of direction (LTR/RTL), casing, ligature, different flavors and so forth. There is also an explanation for each script. The last section prints the character list block by block, including the East Asian CJK characters which are normally referenced with just a database online.

I think the reference aspect is the most important benefit of this book. Although there are sections for each script, this work tends to assume that you are fairly familiar with whatever script you are working with and so devotes most of the text to technical explanations. Fortunately, I think the technical explanations and examples are core examples that a programmer would need.

Although most of the content is replicated in PDF on the Web site, it can be handy to have the actual book as a baseline reference. For one thing, the charts are of high quality print, allowing you to see minute typographic details. For another thing, you never know where you will need to work on a project without the Internet....

Information

Unicode Standard, Version 5.0, The (5th Edition) (Hardcover)

Author: Unicode Consortium
Year: 2006
ISBN (10/13): 0321480910 / 978-0321480910

Arabic Math Symbols

I was comparing notes for the Arabic block and noticed some new additions for which I was getting the Unicode box of death (i.e. none of my fonts has those symbols).

Some of them are actually Arabic math symbols which were recently added. You can read about them in the proposal at http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3086-1.pdf. But of course I MUST find fonts to cover these extra symbols. Some of this can be handled by using different symbols when working with Arabic math text, but it's good to have a reference glyph.

It looks like the latest Unicode Symbols font has the Outlined White Star (5 points, rounded corners = U+269D).

An interesting conundrum is the arrows which are designated "LEFTWARDS" or "RIGHTWARDS". If I understand the proposal correctly, it appears that the conventions for which arrow is forwards or backwards would be reversed in Arabic, so mirroring conventions are needed when using mathematical arrows in an RTL language.
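Part of why the arrows are a conundrum is that Unicode already has a "mirrored" property for characters like parentheses and less-than signs, but the basic arrows do not carry it. You can check this with Python's standard `unicodedata` module; the sketch below only reports the property and does not attempt any actual right-to-left layout.

```python
import unicodedata

samples = {
    "(": "left parenthesis",
    "<": "less-than sign",
    "\u2192": "rightwards arrow",
    "\u2190": "leftwards arrow",
}

for ch, label in samples.items():
    # unicodedata.mirrored returns 1 if the character is flagged as mirrored
    # in bidirectional text, 0 otherwise.
    flag = unicodedata.mirrored(ch)
    print(f"U+{ord(ch):04X} {label}: mirrored={flag}")
```

So a renderer will flip a parenthesis automatically in RTL text, but an arrow stays pointing the way its name says, and any reversal has to come from the math typesetting conventions instead.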

Postscript - April 16, 2009

Still hunting down Arabic fonts for some of the Unicode 5 characters, but I did find a W3C page describing Arabic mathematical typesetting.

http://www.w3.org/TR/arabic-math/. Note that some of the code is still theoretical.

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.
