ELIZABETH J PYATT: August 2009 Archives

Announced i18n Enhancements for Mac Snow Leopard (10.6)


New operating systems often mean new i18n toys to play with, and even though the upgrade from Apple 10.5 (Leopard) to 10.6 (Snow Leopard) is not supposed to be full of new features, there are, in fact, new features scheduled for the upgrade.

According to the Apple Snow Leopard Enhancement page, 10.6 will include:

  • Redesign of Pinyin Chinese input with faster speed and enhanced dictionary
  • Improvements to handwritten Chinese input
  • New Asian fonts - Heiti SC, Heiti TC, Hiragino Sans GB.
  • New generic monospace font Menlo to be used in applications such as Terminal
  • Enhanced RTL support including split cursor option to show text direction in documents with bidirectional text
  • General Text substitution (e.g. (c) to ©) across applications. Could be handy for a lot of situations when you need to enter an unusual symbol. This already exists in Microsoft Office (Mac/PC).

But I almost missed the big one - the International pane in the System Preferences has been redesigned and will now be the Language and Text pane, presumably with more features. There may also be other enhancements in the works that are too minor to be announced (or at least too minor for most people), so there may be more things to find out.

How will they work? Alas, no details from Apple yet. I guess we won't know until we know....


Enter Plane 1 (Phoenician/Linear B...) on Mac Unicode Hex Keyboard


A useful utility on the Mac is the Unicode Hex keyboard, which allows you to hold down Option and type any four-digit Unicode code to get that character.

For instance, if you need to enter the rarely seen archaic Roman numeral symbol for 5,000 (ↁ), you could look up its Unicode character number (U+2181), activate this keyboard, then type Option+2181 to generate the character (assuming the correct font is loaded).
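If you want to double-check a code point before typing it, a few lines of Python will do. This is only a rough sketch using the standard unicodedata module; the helper name describe is my own invention and not part of any Apple utility.

```python
import unicodedata


def describe(hex_code: str) -> str:
    """Return the code point, the character and its Unicode name.

    `describe` is a hypothetical helper name used for illustration.
    """
    ch = chr(int(hex_code, 16))
    # unicodedata.name() raises for unnamed characters unless a default is given.
    return f"U+{ord(ch):04X} {ch} {unicodedata.name(ch, '<unnamed>')}"


print(describe("2181"))  # U+2181 ↁ ROMAN NUMERAL FIVE THOUSAND
```

Whether the character actually displays still depends on having a font that covers it.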

But a lot of ancient scripts are in Plane 1, meaning they have Unicode values with five digits (i.e. U+10000 or higher). In the Unicode world, adding the fifth digit means that some processes go slightly awry, and the Unicode Hex keyboard is one of them. Suppose I want to input the Phoenician character Alf (Aleph) (𐤀 or an A on its side), which is U+10900. If I enter Option+10900 on the Unicode Hex keyboard, I will not get Alf, but ႐ instead.

Note: Characters U+0000 to U+FFFF are in Plane 0 or the BMP (Basic Multilingual Plane). A lot of systems are set up to deal with the BMP only and need special support for codes beyond U+FFFF. The four-digit restriction corresponds to 16 bits, which was a constraint in older systems. If you're not a programmer, let's just say it's a long story and leave it at that.
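For the programmers, here is a small sketch of the plane arithmetic. The function names plane and utf16_code_units are hypothetical; the only assumption is that a plane is a block of 65,536 code points, so the plane number is simply whatever is left after dropping the last four hex digits.

```python
def plane(code_point: int) -> int:
    """Return the Unicode plane (0 = BMP, 1 = first supplementary plane, ...)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Each plane holds 0x10000 code points, i.e. four hex digits' worth.
    return code_point >> 16


def utf16_code_units(code_point: int) -> int:
    """Count the 16-bit units needed to store the character in UTF-16."""
    return len(chr(code_point).encode("utf-16-be")) // 2


for cp in (0x2181, 0x10900):
    print(f"U+{cp:04X}: plane {plane(cp)}, {utf16_code_units(cp)} UTF-16 code unit(s)")
# U+2181: plane 0, 1 UTF-16 code unit(s)
# U+10900: plane 1, 2 UTF-16 code unit(s)
```

Anything that needs two 16-bit units is exactly the kind of character a BMP-only system tends to trip over.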

It turns out that the Unicode Hex keyboard has a four-digit limit. To get around it, you can break U+10900 into two 16-bit (i.e. 4-digit) sequences, also known as a UTF-16 surrogate pair. For U+10900, the surrogate pair is D802+DD00. So in the Unicode Hex utility, you can now do this.

  1. Hold down the Option key.
  2. Type D802+DD00, where the + means type the Plus sign.
  3. Release the Option key.

I bet you're asking - how did she get from U+10900 to D802+DD00? There is an algorithm, but in this case I got it by opening the Character Palette, finding the character I wanted and mousing over it. When you do that, the Unicode code point appears along with its surrogate pair in parentheses.
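For the curious, the algorithm is short enough to sketch in Python. This follows the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each); the function names are my own and this is not how the Character Palette itself is implemented, as far as I know.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000 to U+10FFFF have surrogate pairs")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Reverse the split: combine two surrogates back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10900)
print(f"{high:04X}+{low:04X}")  # D802+DD00
```

Running the pair back through from_surrogate_pair is a handy sanity check before you start typing hex digits.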

Of course, you could also directly Insert the character with the palette, but actually there are times when the Insert doesn't quite work (at some points in the careers of my laptops, I have corrupted my Character Palette so badly, it refused to play with me anymore).

Although this utility seems a little limited at the moment, if there's one thing I have learned, it is that no Unicode trick has ever gone to waste.


Korean Script Heads to Indonesia (Archived)


The biggest sensation in Unicode land these days is that the Korean script Hangul (or Hangeul/Han'gŭl depending on your transliteration preferences) has been adopted by the speakers of Cia-Cia in the nation of Indonesia. This will be the first time any language other than Korean has adopted Hangul as its writing system, so it is a cultural triumph for them.

What's interesting is how this decision happened. The standard press releases are not giving much information and even the linguistic community is a little perplexed. It's actually more interesting if the Wikipedia report that Cia-Cia was formerly written in the Arabic script (specifically the Jawi variant in Indonesia) is accurate. According to Ethnologue, the population is still mostly Islamic, so there shouldn't be a religious reason to switch.

So what about it? First, let's discuss the switch from Arabic. Actually a lot of Muslim communities, including speakers of Hausa, Swahili, Malay and Turkish, have switched from the Arabic script to the Latin alphabet. Malaysia and Indonesia are two countries following this trend, although the Jawi/Arabic script is still used in some religious and cultural contexts. There may be a variety of reasons for this, including European colonial policy or the perception that the Latin alphabet is easier to learn and enhances literacy (Turkish). A move to the Latin alphabet may also represent a move towards a secular government (as in the case of Turkey).

It should also be mentioned that the Arabic script must be modified heavily when it is used for non-Semitic languages if all the sounds are to be represented. If you look at the Omniglot Jawi chart for example, you will see that many consonants have the same shape but with different patterns of dots to indicate the differences. This also happens in the Latin alphabet (e.g. n vs. ñ in Spanish), but if Jawi also includes the different letter shapes depending on word position, as in Arabic, then the script becomes more complex.

Cia-Cia is unique, though, in switching to something other than the Latin alphabet. One reader commented that this may be due to the fact that in South and Southeast Asia, a language gains social status by having its own script. In Indonesia, Balinese, Javanese and Sundanese have their own historic scripts. Although these scripts may not be used on an everyday basis, they do show that there is a cultural tradition having nothing to do with the West.

In theory, Cia-Cia could adopt one of these scripts or one from India (e.g. Devanagari), which would probably be a good fit, but none would probably be perceived as being unique in Indonesia. On the other hand...no one else in Indonesia is using Hangul. It is unique. Fortunately, Hangul is probably a good fit. Although the forms are somewhat angular like Chinese writing, the underlying principles are actually very similar to those used in India and Southeast Asia (with some differences of course).

There's another benefit to Hangul over scripts like Javanese and Balinese, and that's enhanced Unicode support. Korea has been fortunate enough to have the economic and political influence for developers to develop functional encoding schemes, fonts and input utilities for Hangul. Many Southeast Asian scripts are still catching up Unicode-wise.

Whether this is the reason Cia-Cia switched to Hangul or not, I wish them the best of luck. I think there are lots of people now invested in the success of this project.


The story is not accurate. Although the Cia-Cia community has been taught some Hangul, there was no official decision to adopt Hangul as the writing system.


Accessibility and Unicode


Here at Penn State my duties include being an accessibility guru as well as being a Unicode guru, and not too surprisingly, Unicode can enhance accessibility in some situations. And not just in the abstract "standards enhance accessibility" but more concretely as in:

It's An Encoded Character, Not a Font Trick

We all know that relying on fonts to display characters (e.g. the use of the Symbol font for Greek characters) is a Bad, Bad Idea, but it's even worse for a screen reader. Consider the expression θ = 2π. In the old Symbol font days, this might have been coded as:

<p> <font face="Symbol">q = 2p</font></p>

And guess what the screen reader would read - Q equals 2 P. Since the screen reader is essentially "font blind", the underlying text is what is read. Hence either of the Unicode-correct versions below is preferred:

<p> θ = 2π</p>

<p> &theta; = 2&pi;</p>

If you think about it, the screen reader is a good tool for conceptualizing how characters (and their variants) may function semantically in different contexts.
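To see the "font blind" view for yourself, here is a rough Python sketch that strips the markup and keeps only the underlying characters. It uses the standard html.parser module; the class name TextOnly and the function visible_text are hypothetical, and a real screen reader does far more than this (the sketch only shows which characters it has to work with).

```python
import unicodedata
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and ignore every tag and attribute."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &theta; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup: str) -> str:
    """Return the text left once all markup (including font tags) is removed."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks).strip()


print(visible_text('<p> <font face="Symbol">q = 2p</font></p>'))  # q = 2p
print(visible_text("<p> &theta; = 2&pi;</p>"))                    # θ = 2π
print([unicodedata.name(c) for c in "θπ"])
# ['GREEK SMALL LETTER THETA', 'GREEK SMALL LETTER PI']
```

The Symbol font version carries only the Latin letters q and p, while the entity version carries the actual Greek letters, names and all.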

I should mention that screen readers can get confused by a Unicode character if they can't recognize it, but that's more of a dictionary problem than a Unicode problem. For JAWS, it is possible to install .sbl pronunciation files to increase the character repertoire, especially for math and science.

It's Text, Not An Image

Perhaps the biggest advantage of Unicode, though, is that it allows characters that used to be embedded in images to be just plain text. For instance, the following equation for the volume of a sphere can be written as text or embedded as an image:

V = 4/3πr³

[Image: AreaSphere.png, alt text "V = four thirds pi R cubed"]

Consider what happens, though, if a low-vision reader (or a middle-aged reader with decrepit eyesight) needs to zoom in on the text. As you will see in the screen capture below, the image will pixelate while the text remains crisp.

[Screen capture: Zoomed Text vs Zoomed Image. Enlarged formula; the text is crisper than the image.]

When you combine Unicode with creative CSS, you can see the possibilities for replacing images, including buttons, with text. Not only is this more accessible, but it also results in smaller file sizes and is easier to edit.

Hearing Impaired Users

Unicode is actually important for these users because they need to read text captions or transcripts for video and audio. Once you get beyond basic English (e.g. Spanish subtitles)...well you know Unicode will be important.

Motion Impaired Users

For these users, the issue probably isn't so much reading text as being able to input it - which is the job of developers of operating systems and software. For motion impaired users, a good generalization is that keyboard access is better than using the mouse which requires a little more hand control. In the past I've commented on usability of various inputting devices, but since most do rely on key strokes, there are really no major complaints here.

One audience I didn't touch on was users with color-deficient vision, but except possibly for the Aztec script (which isn't even in Unicode yet)...it's not too much of an issue.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


