ELIZABETH J PYATT: April 2009 Archives

Unifaces and Other Unusual Unicode Applications


A while ago, I pointed out that vision charts have expanded beyond the Western scripts, and now so have emoticons. Check out http://twitter.com/unifaces for ways to use the wide range of Unicode symbols to express different facial expressions. Thanks to the Twitter feed authors for sending this to me.

And while I was at it I checked out her del.icio.us site and discovered that:

  1. Mojibake is the Japanese term for the Unicode question mark of death that appears when a symbol cannot be displayed. I am glad to have a technical term, but since it's not translated, I do wonder what the literal meaning is. Hopefully it means "ghost character" or "character changing". It appears that the verb bakeru means "change spookily" or "appear in disguise". Ah, the mysteries of Unicode.

  2. If you need a new hobby, you can try faking Cyrillic text with Latin characters (e.g. PyccKNN instead of Русский). Detailed instructions are on the Wikipedia Volapuk encoding page. Actually, there was a scare a few years back when some Russian spammers were using Cyrillic characters to fake Western URLs (e.g. РЕИИ SТАТЕ ... or if you like Greek - ΡΕΝΝ SΤΑΤΕ). Only the "S" is Western Latin. It turns out to be tricky in both directions because it's the capitals that match the best. But I guess it's the global version of Leet (L33t/1337).
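
For anyone who wants to check a suspicious string like the fake PENN STATE above, here is a minimal Python sketch. It assumes that the first word of a character's Unicode name (e.g. "CYRILLIC" or "LATIN") is a good enough stand-in for its script, which is only a rough heuristic and not a real script lookup. The function names scripts_in and looks_spoofed are my own inventions for illustration.

```python
import unicodedata

def scripts_in(text):
    """Return a set of rough script labels for the letters in text.

    Heuristic only: uses the first word of each character's Unicode name.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC CAPITAL LETTER ER" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_spoofed(text):
    """Flag text whose letters come from more than one script."""
    return len(scripts_in(text)) > 1

# Cyrillic look-alikes with a Latin "S", written with escapes to be explicit
fake = "\u0420\u0415\u0418\u0418 S\u0422\u0410\u0422\u0415"
print(scripts_in("PENN STATE"), looks_spoofed("PENN STATE"))
print(scripts_in(fake), looks_spoofed(fake))
```

A mixed result is only a hint, since plenty of legitimate text mixes scripts, but it does catch the single Latin "S" hiding among the Cyrillic capitals.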

I'd be tempted to tell everyone to get back to work, but then I would have to get back to my work, and that's not always Unicode related.


Some Recommended Books


Although the vast majority of my Unicode knowledge has come courtesy of the Internet, there are some print resources that I am beginning to find very useful, so I thought I would add some quick notes. I would add that the audience for the Unicode books is generally programmers who need to implement Unicode support. These books really don't tell you how to type an accented letter.

Unicode Demystified

If you're a programmer who's been handed a foreign language project and really aren't sure where to go next, I think this is a good place to start. This book by Richard Gillam dates from 2002, but is still a valuable resource because it explains the basic concepts behind Unicode in fairly straightforward language.

This book covers the major world scripts including Latin, Cyrillic, Greek, Arabic, Hebrew, East Asian scripts, major South Asian scripts, Cherokee, Canadian Aboriginal and so forth. These scripts generally cover most of the major typographical and sorting issues you are likely to encounter, so it remains very handy for the newcomer. However, if your script is a little more exotic (or newer to Unicode), you will probably need to find alternate resources.


Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard (Paperback)

Author: Richard Gillam
Year: 2002
ISBN (10/13): 0201700522 / 978-0201700527

Unicode Explained

This is from 2006, so it's more recent, and it's by Jukka Korpela, who is good at explaining the concepts behind encoding (as well as accessibility). Unfortunately, I haven't had a chance to acquire it yet, but I am looking forward to taking a look at it.


Unicode Explained (Paperback)

Author: Jukka Korpela
Year: 2006
ISBN (10/13): 059610121X / 978-0596101213

The Unicode Standard

For each major version of the Unicode Standard (e.g. 4.0, 5.0), the Unicode Consortium releases a hard-bound reference of the actual standard. If you're semi-serious about Unicode programming, I would recommend picking up at least one version of the standard and then updating over time. It does gather everything in one place...at least for the moment.

The first part explains the standard, including issues of direction (LTR/RTL), casing, ligatures, different flavors and so forth. There is also an explanation for each script. The last section prints the character list block by block, including the East Asian CJK characters, which are normally referenced with just a database online.

I think the reference aspect is the most important benefit of this book. Although there are sections for each script, this work tends to assume that you are fairly familiar with whatever script you are working with and so devotes most of the text to technical explanations. Fortunately, I think the technical explanations and examples are the core ones that a programmer would need.

Although most of the content is replicated in PDF on the Web site, it can be handy to have the actual book as a baseline reference. For one thing, the charts are of high quality print, allowing you to see minute typographic details. For another thing, you never know where you will need to work on a project without the Internet....


Unicode Standard, Version 5.0, The (5th Edition) (Hardcover)

Author: Unicode Consortium
Year: 2006
ISBN (10/13): 0321480910 / 978-0321480910


Arabic Math Symbols


I was comparing notes for the Arabic block and noticed some new additions for which I was getting the Unicode box of death (i.e. none of my fonts have those symbols).

Some of them are actually Arabic math symbols which were recently added. You can read about them in the proposal at http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3086-1.pdf. But of course I MUST find fonts to cover these extra symbols. Some of this can be handled by using different symbols when working with Arabic math text, but it's good to have a reference glyph.

It looks like the latest Unicode Symbols font has the Outlined White Star (5 points, rounded corners = U+269D).

An interesting conundrum is the set of arrows which are designated "LEFTWARDS" or "RIGHTWARDS". If I understand the proposal correctly, it appears that the conventions for which arrow is forwards or backwards would be reversed in Arabic, so mirroring conventions are needed when using mathematical arrows in a RTL language.
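
To see which characters Unicode itself marks as mirrored in RTL text, you can query the character database. This is a minimal sketch using Python's standard unicodedata module; the helper name is_bidi_mirrored is made up for illustration, and the results depend on the Unicode version bundled with your Python.

```python
import unicodedata

def is_bidi_mirrored(ch):
    """True if the character has the Unicode "mirrored" property."""
    return unicodedata.mirrored(ch) == 1

# Compare a parenthesis and a less-than sign with the plain arrows
for ch in "(<\u2190\u2192":
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: mirrored={is_bidi_mirrored(ch)}")
```

As far as I can tell, the parenthesis and less-than sign are flagged as mirrored while the LEFTWARDS and RIGHTWARDS arrows are not, which would explain why arrows need a separate convention in Arabic math text.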

Postscript - April 16, 2009

Still hunting down Arabic fonts for some of the Unicode 5 characters, but I did find a W3C page describing Arabic mathematical typesetting.

http://www.w3.org/TR/arabic-math/. Note that some of the code is still theoretical.


Unicode Tool in Mac Calculator


Every so often you notice a brand new feature in an old tool and wonder where it's been all these years. One of these was the Unicode character decoder in the built-in Mac calculator. (I'm not the first to notice, but the last reference I found was from 2004, when the latest OS X was 10.3 (Panther).)

If you open the Calculator app (it's in your Applications folder), you can switch over to the Programmer view (Command-3 or View » Programmer), which is where you can convert between decimal and hexadecimal numbers using the tabs at the right.

On the left, though, there is an ASCII/Unicode tab. This shows the character corresponding to whatever number you've entered. For instance hex x562 (or 1378 in decimal) is an Armenian character բ (Armenian Small Letter Ben).
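
If you don't have a Mac handy, the same lookup can be sketched in a few lines of Python. This is only an illustration of the idea, not how Calculator does it, and describe_code_point is a name I made up.

```python
import unicodedata

def describe_code_point(value):
    """Return the character for a code point and its Unicode name."""
    ch = chr(value)
    # Fall back to a placeholder for code points without a name
    return ch, unicodedata.name(ch, "<unnamed>")

print(int("562", 16))              # the hex value as a decimal number
print(describe_code_point(0x562))  # the character and its name
```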

[Screenshot: Calculator with the Unicode tab on, showing the Armenian character, and the Hex tab on for x562]

I have to confess that I wish it could do the reverse - you paste a character and it shows the Unicode code point, but this is good for testing odd character references you may find in a source file.
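
For the reverse direction, a short Python sketch will do the job: paste in some text and get back the code points. The function name code_points is my own, and the U+XXXX formatting is just the usual convention.

```python
def code_points(text):
    """List the code point of each character in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Armenian Small Letter Ben, written as an escape to be explicit
print(code_points("\u0562"))
```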

However, a student in a recent seminar pointed out a site which does convert a character to a decimal code reference at http://www-atm.physics.ox.ac.uk/user/iwi/charmap.html (from Alan Iwi at the Rutherford Lab at Oxford). Just enter or paste the character and click the Make HTML button to see a decimal entity code.

Although I don't generally recommend entity codes, there are times when your environment is so antique you need to convert some text into entity codes.
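
A similar conversion can be sketched in Python using the standard "xmlcharrefreplace" error handler, which replaces anything outside ASCII with a decimal numeric reference. This is my own illustration, not the code behind the site above, and to_decimal_entities is a made-up name.

```python
def to_decimal_entities(text):
    """Replace every non-ASCII character with a decimal entity code."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Plain ASCII passes through; other characters become &#NNNN; codes
print(to_decimal_entities("caf\u00e9 \u0562"))
```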


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
