Encoding Theory: January 2008 Archives

Can Unicode Handle Calligraphy?


I'm a little behind on this blog, but at a talk I attended recently (mid-October 2007), the keynote speaker mentioned some interesting challenges in encoding scripts with a strong manuscript (and calligraphic) tradition.

Most scripts in use today were originally designed to be handwritten in ink over a relatively smooth surface such as paper, papyrus, parchment or palm leaves. The benefit of handwriting is that you don't need a lot of expensive equipment (such as a printing press) to produce a document, but the writer must make each letter form one by one.  Writer's cramp can be a serious consideration for workers in the manuscript industry.

To reduce both time and strain on the wrist and hands, most scripts written on paper-type media develop cursive forms and special abbreviation symbols (e.g. "&" for 'and' and "@" for 'at'). For instance, Arabic letters vary in shape depending on whether the letter is at the beginning, middle or end of the word, largely because Arabic is essentially a continuous cursive script.

The abbreviation symbols are easily encoded and many are already in the standard, but the alternate letter forms are trickier. On U.S. computers, if you type the "S" key, the screen usually displays an "S" almost instantaneously. With other scripts like Arabic or Devanagari, the text editor has to know the position of the character within the word before it can display something. In some cases, the text editor has to wait for the NEXT character before it can give you a display. Issues like these are a major reason why support for Arabic and South Asian scripts continues to lag behind other scripts.
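As a rough illustration of the positional forms described above, here is a small Python sketch using only the standard unicodedata module. It looks up the legacy "presentation form" code points that Unicode keeps for compatibility, one per position, for the Arabic letter BEH (U+0628). The helper name presentation_forms is my own invention, and real text editors do not work this way: they store the single base letter and let the font and shaping engine pick the shape. The sketch only shows how many shapes hide behind one letter.

```python
import unicodedata

def presentation_forms(base_char):
    """Return {position: char} for the compatibility presentation
    forms of a single Arabic base letter (hypothetical helper)."""
    target = f"{ord(base_char):04X}"
    forms = {}
    # Scan the Arabic Presentation Forms-B block (U+FE70..U+FEFF).
    for cp in range(0xFE70, 0xFF00):
        # Decomposition looks like "<initial> 0628"; it is empty
        # for unassigned code points.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0].startswith("<") and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms

beh = "\u0628"  # ARABIC LETTER BEH
for position, ch in presentation_forms(beh).items():
    print(position, f"U+{ord(ch):04X}", unicodedata.name(ch))
```

For BEH this lists four shapes (isolated, final, initial and medial), all of which a user produces with the same key.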

But the story doesn't end there. Because manuscripts are always handmade, lots of local variations have developed (lots and lots). The preferred Arabic script style of Saudi Arabia (Naskh) is quite a bit different from the preferred style for Urdu (Nastaliq). Even though an Urdu writer is using the same script as someone in Saudi Arabia, he or she may not be able to use the same fonts. Similar variations can be seen in Chinese vs. Japanese writing. Even in Europe, German Fraktur (Blackletter) is quite a bit different from manuscript Gaelic, and both differ from modern typography.

And just when you thought you had it all figured out, someone will discover a new manuscript needing a new symbol to encode. Yikes!

Our speaker was documenting some of the more interesting variations you can find in pieces of calligraphy when I hit a conceptual wall. I agree that encoding most of this (probably 90% of this) is historically and culturally important. But...at some point calligraphy is no longer really a document, but an art form. Where do you stop?

After all, the point of many calligraphic traditions isn't really to send a new message, but to find new meaning in old words. Many calligraphic works are actually older texts rewritten to visually represent different nuances in meaning. And many practitioners become celebrated for their abilities to develop a new style of writing.

Graphic programs have protocols for selecting color, shapes, line weight, orientation and so forth, but there is a point where the specifications end and the art begins. Maybe a few of our archival questions can be solved if we remember that some manuscripts are art as well as textual documents.



The Cost of a Unicode Code Point


The Unicode list was discussing whether a recently discovered phonetic character should be encoded in the future or not, and some interesting issues of cost/benefit ratios came up.

The symbol is something like a combination of "Gj" (capital G and lowercase j) and was used in a few foreign language dictionaries from Germany and elsewhere to represent the /ʒ/ sound (spelled "j" in French and sometimes spelled "zh" in English).

The main benefit of encoding would be for archival purposes. Almost all modern linguists use /ʒ/ (or sometimes /ž/). If you were analyzing linguistic data, you probably would change the "Gj" to a modern symbol. On the other hand, scholars who found a previously unknown document with "Gj" would want to be able to know what it was and may need to represent that glyph in particular. So there is some reason to encode it.
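To make the trade-off concrete, here is a minimal Python sketch. The two modern symbols already have code points, so the standard unicodedata module can name them. The old dictionary symbol does not, so the sketch uses the plain two-letter sequence "Gj" as a stand-in for it. That stand-in, the function name modernize and the sample string are all my own assumptions for illustration, not anything proposed on the Unicode list.

```python
import unicodedata

# The two modern transcription symbols are ordinary encoded characters.
for ch in ("\u0292", "\u017E"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

def modernize(text, legacy="Gj", modern="\u0292"):
    """Replace a stand-in for the old symbol with the modern one.

    'Gj' here is only a placeholder sequence (an assumption), since
    the historical symbol itself has no code point to search for.
    """
    return text.replace(legacy, modern)

# Hypothetical transcription fragment, not taken from a real dictionary.
print(modernize("ruGj"))
```

This is fine for data analysis, but it also shows the archival problem: once the text is modernized, nothing records that the original source used a different glyph.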

BUT someone pointed out that each new code point does come with a cost. Specifically:

1) New versions of "Extended Latin" fonts would need to include "Gj", taking up designer time (usually for multiple fonts).

2) The Unicode data table itself has to be updated, and when that happens, developers have to incorporate the new characters into whatever systems they are using. That typically includes utilities for sorting characters into alphabetical order, the default character insertion utilities of Microsoft and Apple, and the basic Unicode-friendly text editors.
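The second cost is easy to see from a programmer's side. Every language runtime ships its own copy of the Unicode data table, and a character only "exists" for that software once the copy is updated. Here is a short Python sketch, using the standard unicodedata module, that reports which version of the table is bundled and whether a given code point is assigned in it. The helper name is_assigned is my own.

```python
import unicodedata

def is_assigned(code_point):
    """True if this runtime's Unicode table assigns the code point.

    Category 'Cn' covers unassigned code points and noncharacters.
    (Hypothetical helper name, for illustration only.)
    """
    return unicodedata.category(chr(code_point)) != "Cn"

# The answer depends on the table bundled with this Python build.
print("Unicode data version:", unicodedata.unidata_version)
print("U+0292 assigned:", is_assigned(0x0292))
print("U+10FFFF assigned:", is_assigned(0x10FFFF))
```

A newly encoded letter would report as unassigned on every system still carrying an older table, which is why each addition ripples out to so many developers.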

Will "Gj" get encoded? I actually think it will, but not right away. Believe it or not, the community keeps finding new symbols/letters invented for different languages, and sooner or later most make it in. Unicode 5.0 included a "Latin Extended D" and "Latin Extended Additional" block to handle these recent discoveries, so I would not be surprised to see a "Latin Extended E" in the future.

But I do understand why the Unicode committee gets a little cranky sometimes.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
