January 2008 Archives

Can Unicode Handle Calligraphy?


I'm a little behind in this blog...but at a talk I attended recently (mid-October 2007), the keynote speaker mentioned some interesting challenges in encoding scripts with a strong manuscript (and calligraphic) tradition.

Most scripts in use today were originally designed to be handwritten in ink over a relatively smooth surface such as paper, papyrus, parchment or palm leaves. The benefit of handwriting is that you don't need a lot of expensive equipment (such as a printing press) to produce a document, but the writer must make each letter form one by one.  Writer's cramp can be a serious consideration for workers in the manuscript industry.

To reduce both time and strain on the wrist and hands, most scripts written on paper-type media develop cursive forms and special abbreviation symbols (e.g. "&" for 'and' and "@" for 'at'). For instance, Arabic letters vary in shape depending on whether the letter is at the beginning, middle or end of a word, largely because Arabic is essentially a continuous cursive script.

The abbreviation symbols are easily encoded, and many are already in the standard, but the alternate letter forms are trickier. On U.S. computers, if you type the "S" key, the screen usually displays an "S" almost instantaneously. With other scripts like Arabic or Devanagari, the text editor has to know the position of the character within the word before it can display anything. In some cases, the text editor has to wait for the NEXT character before it can give you a display. Issues like these are a major reason why support for Arabic and South Asian scripts continues to lag behind other scripts.
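
To see what that means at the storage level, here is a minimal sketch in Python 3 (the word and code points below are just my own illustration): the characters you store never change; the shaping engine decides which positional glyph to draw only at display time.

    # Minimal sketch (Python 3): Arabic text is stored as abstract,
    # position-independent code points; the font/shaping engine picks the
    # initial, medial, final or isolated glyph only when it draws the text.
    import unicodedata

    word = "\u0628\u064A\u062A"  # BEH + YEH + TEH, i.e. "bayt" (house)
    for ch in word:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # The positional shapes do exist as separate legacy "presentation form"
    # code points, e.g. the final form of BEH, but well-formed modern text
    # normally does not store them:
    print(unicodedata.name("\uFE90"))  # ARABIC LETTER BEH FINAL FORM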

But the story doesn't end there. Because manuscripts are always handmade, lots of local variations have developed (lots and lots). The preferred Arabic script style of Saudi Arabia (Naskh) is quite a bit different from the preferred style for Urdu (Nastaliq). Even though an Urdu writer is using the same script as someone in Saudi Arabia, he or she may not be able to use the same font base. Similar variations can be seen in Chinese vs. Japanese writing. Even in Europe, German Fraktur (Blackletter) is quite a bit different from manuscript Gaelic, both of which differ from modern typography.

And just when you thought you had it all figured out, someone will discover a new manuscript needing a new symbol to encode. Yikes!

Our speaker was documenting some of the more interesting variations you can find in pieces of calligraphy when I hit a conceptual wall. I agree that encoding most of this (probably 90% of this) is historically and culturally important. But...at some point calligraphy is no longer really a document, but an art form. Where do you stop?

After all, the point of many calligraphic traditions isn't really to send a new message, but to find new meaning in old words. Many calligraphic works are actually older texts rewritten to visually represent different nuances in meaning. And many practitioners become celebrated for their abilities to develop a new style of writing.

Graphic programs have protocols for selecting color, shapes, line weight, orientation and so forth, but there is a point where the specifications end and the art begins. Maybe a few of our archival questions can be solved if we remember that some manuscripts are art as well as textual documents.


Does English Need Unicode?


Traditional wisdom holds that ASCII or maybe ANSI (ISO-8859-1) is sufficient for English and that it's not a language that needs any Unicode support. But is this actually true?

It's certainly not true in any higher education environment, where we work not only with foreign languages but also with mathematical symbols, including the obscure ones. Any time an institution needs to build an archive for an ancient language or for math/science, the problem of encoding will rear its ugly little head. Ironically, it may be the classicists, medievalists and comparative literature specialists (fields not traditionally seen as high tech) who have had the most experience working with Unicode issues.

Is it just some scholars in exotic languages or physics then? Alas not. Many of the carefully crafted punctuation symbols that are appreciated by copywriters and desktop publishers everywhere are ALSO in Unicode. These include the em-dash (—), the en-dash (–), the Euro sign (€) and Smart Quotes “ and ”. There are some kluges in "ISO-8859-1" for some of these symbols...but not all of them. If you want these to work reliably, it's best to select Unicode (UTF-8) and say you're using Unicode!
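
To make that concrete, here is a minimal sketch in Python 3 (my own illustration): each of those publisher's symbols has a Unicode code point, but none of them survive a round trip through genuine ISO-8859-1, which is exactly why pages that claim Latin-1 yet contain em-dashes are leaning on a kluge.

    # Minimal sketch (Python 3): the copy editor's punctuation lives above the
    # ISO-8859-1 range, so only a Unicode encoding such as UTF-8 covers it all.
    symbols = {
        "em dash": "\u2014",
        "en dash": "\u2013",
        "euro sign": "\u20AC",
        "left quote": "\u201C",
        "right quote": "\u201D",
    }
    for label, ch in symbols.items():
        try:
            ch.encode("iso-8859-1")
            latin1 = "ok"
        except UnicodeEncodeError:
            latin1 = "not representable"
        print(f"{label:11}  U+{ord(ch):04X}  ISO-8859-1: {latin1}  "
              f"UTF-8 bytes: {ch.encode('utf-8').hex()}")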

Even the "foreign" accents work their way into our prose. Once it was just fiancé and José, but now it's even baseball players like Magglio Ordóñez (the "Big Tilde"). If you check out Ordóñez's uniform, you'll see that even his uniform has a tilde on his name. As we gradually learn to embrace some non-Anglo culture and wish to "get it right", the need for spelling with appropriate accents will continue to rise.

In fact, in every office I've ever been to, someone has asked me how to insert some "exotic" symbol into a document. So yes...even English needs Unicode support to express the full range of textual possibilities.


The Cost of a Unicode Code Point


The Unicode list was discussing whether a recently discovered phonetic character should be encoded in the future or not, and some interesting issues of cost/benefit ratios came up.

The symbol is something like a combination of "Gj" (capital G and lowercase j) and was used in a few foreign language dictionaries from Germany and elsewhere to represent the /ʒ/ sound (spelled "j" in French and sometimes spelled "zh" in English).

The main benefit of encoding would be for archival purposes. Almost all modern linguists use /ʒ/ (or sometimes /ž/). If you were analyzing linguistic data, you would probably change the "Gj" to a modern symbol. On the other hand, scholars who found a previously unknown document with "Gj" would want to know what it was and might need to represent that particular glyph. So there is some reason to encode it.
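
For reference, the modern symbols mentioned above are already in the standard, which is part of why "Gj" is mainly an archival question. A minimal sketch in Python 3 just to show their code points (the "Gj" character itself has no code point to print, so it isn't in the example):

    # Minimal sketch (Python 3): the modern phonetic symbols are long encoded.
    import unicodedata

    for ch in ("\u0292", "\u017E"):   # ʒ and ž
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # U+0292  LATIN SMALL LETTER EZH
    # U+017E  LATIN SMALL LETTER Z WITH CARON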

BUT someone pointed out that each new code point does come with a cost. Specifically:

1) New versions of "Extended Latin" fonts would need to include "Gj", taking up designer time (usually for multiple fonts).

2) The Unicode data tables themselves have to be updated, and when that happens, developers have to incorporate the new characters into whatever systems they are using. That typically includes utilities for sorting characters into alphabetical order, the default character insertion utilities from Microsoft and Apple, and the basic Unicode-friendly text editors (see the short sketch below).
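
Here is a minimal sketch in Python 3 of what "the data tables have to be updated" means in practice, using an arbitrary currently unassigned code point as a stand-in for a freshly encoded character: until a tool ships with the new tables, the character has no name or properties, and everything falls back to raw code point order.

    # Minimal sketch (Python 3): software only knows what its bundled copy of the
    # Unicode data tables knows. A code point the tables have never heard of has
    # no name and gets the generic category 'Cn' ("other, not assigned").
    import unicodedata

    known   = "\u0292"   # LATIN SMALL LETTER EZH, encoded long ago
    unknown = "\u0378"   # currently unassigned; stands in for a brand-new character

    print(unicodedata.category(known), unicodedata.name(known))
    print(unicodedata.category(unknown))                             # 'Cn'
    print(unicodedata.name(unknown, "<not in this library's tables>"))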

Will "Gj" get encoded? I actually think it will, but not right away. Believe it or not, the community keeps finding new symbols/letters invented for different languages and sooner or later, most make it in. Unicode 5.0 included a "Latin Extended D" and "Latin Extended Additiona" block to handle these recent discoveries, so I am sure there may be a "Latin Extended E" in the future.

But I do understand why the Unicode committee gets a little cranky sometimes.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).

