Recently in Encoding Theory Category

Explaining and Inventing Your Own Unicode Jargon - Part 1

I love the i18n/UTF-8 process as much as anyone, but you have to admit that all those flying letter and number combinations can be a little overwhelming to the newcomer. So I think a primer is needed.

There are some real glossaries out there, such as the Unicode Glossary, the Penn State i18n glossary and the IBM Glossary of Unicode Terms... but you really do learn more when you create your own material. So with that in mind, I present:

Encoding in the World of Star Trek

I would like to believe that someday we will contact other civilizations (with some sort of encoded communication), and at that point we will need to expand and create new encodings (and of course new jargon), such as the ones below.

Jargon of Process

Three current terms for the field of wrangling non-English text are i18n for "internationalization" and g11n for "globalization" (both refer to making content/systems usable to people using any script), plus the related l10n for "localization" (adapting information from one region for a second region, e.g. a Japanese product sold in the United States).

These terms all have the same structure: start with the first letter, end with the last letter and insert the number of letters in between. Thus internationalization (20 letters total, 18 between "i" and "n") becomes i18n.
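The abbreviation rule is mechanical enough to sketch in a few lines of Python. This is only an illustration of the rule as described here; the function name `numeronym` is my own invention, not an established API.

```python
def numeronym(term: str) -> str:
    """Shorten a term to first letter + number of letters in between + last letter."""
    letters = [ch for ch in term if ch.isalpha()]
    if len(letters) < 3:
        return term  # nothing in between to count
    return f"{letters[0].lower()}{len(letters) - 2}{letters[-1].lower()}"


for term in ("internationalization", "globalization", "localization"):
    print(f"{term} -> {numeronym(term)}")
```

Feeding it any other long word is a quick way to check a newly invented abbreviation before publishing it.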

You can apply this to any term such as "Romanization" and "transliteration" (see answers below for new terms), and in the future we will need alternate terms to include the fact that we are working with planets, not just nations. So maybe we will have

  • galaxification (g12n) - even greater than g11n
  • interplanetarization (i18n) - also greater than today's i18n, although with 20 letters it shortens to the very same abbreviation
  • astrointernationalization (a23n) - the biggest of them all
  • Romanization (r10n) - I made this up
  • transliteration (t13n) - this does exist, but is not frequently seen

FYI - Both r10n and t13n refer to the process of writing any language in the Roman (Western/Latin) alphabet. Japanese Rōmaji is an example of this process.

Local Government Standards

Before the days of Unicode, each region had established its own encoding standard for its own language(s). The most famous may be ASCII (American Standard Code for Information Interchange) from which we also got VISCII (Vietnamese), ISCII (India) and ArmSCII (Armenian).

Another pattern is to name the encoding standard after the governmental standards body and the number of the encoding scheme (usually a sequential number). This is how we arrive at TIS-620 (Thailand, Thai Industrial Standard #620), GB2312 (China) and ELOT 928 (Greece/Ellas). A governmental agency also gave names to Shift-JIS (Japan, combination of JIS X 0201 and JIS X 0208) and ANSI (U.S., American National Standards Institute).

Finally, if for some reason the local government doesn't move as rapidly as needed, then a corporation will invent its own standard on the fly. In the U.S. we got both Windows-1252 (Win-1252) and MacRoman encodings this way. In Taiwan, they got Big5 (a Traditional Chinese encoding standard agreed upon by five corporations).
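To see why all these local standards became a headache, here is a small sketch that decodes the very same byte under several of the encodings named in this section. The codec names are Python's; using "iso8859_7" for ELOT 928 is my assumption (ISO 8859-7 is the international standard based on it), and the helper name `decode_byte` is made up for the example.

```python
import unicodedata

# Python codec names standing in for the encodings mentioned above
# (assumption: "iso8859_7" is used here for ELOT 928).
LEGACY_CODECS = ("cp1252", "mac_roman", "iso8859_7", "tis_620")


def decode_byte(value: int, codec: str) -> str:
    """Decode a single byte under the given legacy encoding."""
    return bytes([value]).decode(codec)


for codec in LEGACY_CODECS:
    ch = decode_byte(0xE9, codec)
    # Print the code point and name rather than the glyph itself.
    print(f"{codec:10} 0xE9 -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The ASCII range (such as 0x4C for "L") comes out the same everywhere; it is the upper half of the byte range where each standard went its own way.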

Future Local Planetary Encoding Standards

In the future, I will assume that each Star Trek planet has its own version of Unicode, but of course each will have its own encoding designation. Can you Star Trek fans guess where these are from?

  • KLISCII or TLHLSCII (depending on linguistic accuracy)
  • RIS-105
  • VSAUS-210A (because this planet uses hex numbers)
  • FMSS-13B1 (in duodecimal numbers because you can quickly divide by 3)
  • TUTF-32 (future name for an existing standard)

Since I will be talking about cross-planetary standardization next time, I will add these potential encodings:

  • ACS34 - Andorian Communication Standard #34
  • TelSCII - Tellarite Standard Code for Information Interchange
  • OTLC-10 - Orion Technology Limited Code #10
  • SuperSix - As agreed upon by six major Orion Trading Houses
  • BNTCXS - Betazed Non-Telepathic Communication Exchange Standard

And to finalize the list

  • KLISCII - Klingon Language Institute Standard Code for Information Interchange or
    TLHLSCII - tlhIngan Hol Language Institute Standard Code for Information Interchange
  • RIS-105 - Romulan Imperial Standard #105
  • VSAUS-210A - Vulcan Science Academy Unified Standard #210A
  • FMSS-13B1 - Ferengi Mercantile Society Standard #13B1
  • TUTF-32 - Terran Unicode (32 bit)

Final challenge - what encoding would you invent for the Cardassians?

What's New in Unicode 5.1?

Unicode version 5.1 was recently released, and it includes some new code blocks as well as new specifications. As with all new versions of Unicode, there will be a time lag until the new items can be incorporated into fonts and utilities, but here is a partial list of new items.

If you're interested in the new characters, the best place to view them is at http://www.unicode.org/charts/

New Plane 0 Scripts

  • Cham (Cambodia/Vietnam)
  • Kayah Li (Thailand/Myanmar)
  • Lepcha (India)
  • Ol Chiki/Santali (India)
  • Rejang (Indonesia)
  • Saurashtra (India)
  • Sundanese (Indonesia)
  • Vai (Liberia)

Script Extensions

These blocks add characters to previously encoded scripts.

  • Cyrillic Extended-A
  • Cyrillic Extended-B
  • Arabic - characters for math, 4 Qur'anic characters and multiple characters for different languages
  • Indic - Malayalam, Tamil character sequences, Devanagari chandra a,
    Sanskrit sounds in Gurmukhi, Oriya, Telugu
  • Latin - characters for minority languages and capital German sharp S (rare)
  • Math Symbols
  • Medievalist Punctuation - for research
  • Myanmar Additions

New Plane 1 Ancient Scripts and Miscellaneous Symbols

  • Carian (Anatolia/Turkey)
  • Lycian (Anatolia/Turkey)
  • Lydian (Anatolia/Turkey)
  • Phaistos Disc (Crete)
  • Domino Tile Symbols
  • Mahjong Tile Symbols

Can Unicode Handle Calligraphy?

I'm a little behind in this blog... but at a talk I attended recently (mid-October 2007), the keynote speaker mentioned some interesting challenges for encoding scripts with a strong manuscript (and calligraphic) tradition.

Most scripts in use today were originally designed to be handwritten in ink over a relatively smooth surface such as paper, papyrus, parchment or palm leaves. The benefit of handwriting is that you don't need a lot of expensive equipment (such as a printing press) to produce a document, but the writer must make each letter form one by one. Writer's cramp can be a serious consideration for workers in the manuscript industry.

To reduce both time and strain to the wrist and hands, most scripts using paper-type media develop cursive forms and special abbreviation symbols (e.g. "&" for 'and' and "@" for 'at'). For instance, Arabic letters vary in shape depending on whether the letter is at the beginning, end or the middle of the word, and it's generally due to the fact that Arabic is essentially a continuous cursive script.

The abbreviation symbols are easily encoded and many are already in the standard, but the alternate letter forms are trickier. On U.S. computers, if you type the "S" key, the screen usually displays an "S" almost instantaneously. With other scripts like Arabic or Devanagari, the text editor has to know the position of the character within the word before it can display something. In some cases, the text editor has to wait for the NEXT character before it can give you a display. Issues like these are a major reason why support for Arabic and South Asian scripts continues to lag behind other scripts.
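To make the positional forms concrete, here is a rough sketch using Python's standard `unicodedata` module. Unicode happens to include compatibility "presentation form" characters for the four shapes of the Arabic letter beh; normally you would only type the one abstract letter (U+0628) and let the rendering engine pick the shape, so treat this purely as a way to peek at the four variants. The function name `positional_forms` is made up for this example.

```python
import unicodedata

# The four presentation forms of ARABIC LETTER BEH (U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def positional_forms(chars):
    """Return (code point, position tag, underlying letter) for each form."""
    rows = []
    for ch in chars:
        # The decomposition looks like "<initial> 0628": a tag plus the base letter.
        tag = unicodedata.decomposition(ch).split()[0].strip("<>")
        base = unicodedata.normalize("NFKC", ch)
        rows.append((f"U+{ord(ch):04X}", tag, base))
    return rows


for code, tag, base in positional_forms(BEH_FORMS):
    print(f"{code} is the {tag} form of U+{ord(base):04X}")
```

All four map back to the same letter, which is exactly why the text editor needs to know the neighbors before it can decide which shape to draw.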

But the story doesn't end there. Because manuscripts are always handmade, lots of local variations have developed (lots and lots). The preferred Arabic script style of Saudi Arabia (Naskh) is quite a bit different from the preferred style for Urdu (Nastaliq). Even though an Urdu writer is using the same script as someone in Saudi Arabia, he or she may not be able to use the same font base. Similar variations can be seen in Chinese vs. Japanese writing. Even in Europe, German Fraktur (Blackletter) is quite a bit different from manuscript Gaelic, both of which differ quite a bit from modern typography.

And just when you thought you had it all figured out, someone will discover a new manuscript needing a new symbol to encode. Yikes!

Our speaker was documenting some of the more interesting variations you can find in pieces of calligraphy when I hit a conceptual wall. I agree that encoding most of this (probably 90% of this) is historically and culturally important. But...at some point calligraphy is no longer really a document, but an art form. Where do you stop?

After all, the point of many calligraphic traditions isn't really to send a new message, but to find new meaning in old words. Many calligraphic works are actually older texts rewritten to visually represent different nuances in meaning. And many practitioners become celebrated for their abilities to develop a new style of writing.

Graphic programs have protocols for selecting color, shapes, line weight, orientation and so forth, but there is a point where the specifications end and the art begins. Maybe a few of our archival questions can be solved if we remember that some manuscripts are art as well as textual documents.


The Cost of a Unicode Code Point

The Unicode list was discussing whether a recently discovered phonetic character should be encoded in the future or not, and some interesting issues of cost/benefit ratios came up.

The symbol is something like a combination of "Gj" (capital G and lowercase j) and was used in a few foreign language dictionaries from Germany and elsewhere to represent the /ʒ/ sound (spelled "j" in French and sometimes spelled "zh" in English).

The main benefit of encoding would be for archival purposes. Almost all modern linguists use /ʒ/ (or sometimes /ž/). If you were analyzing linguistic data, you probably would change the "Gj" to a modern symbol. On the other hand, scholars who found a previously unknown document with "Gj" would want to be able to know what it was and may need to represent that glyph in particular. So there is some reason to encode it.

BUT someone pointed out that each new codepoint does come with a cost. Specifically:

1) New versions of "Extended Latin Fonts" would need to include "Gj" taking up designer time (for multiple fonts usually).

2) The Unicode data table itself has to be updated, and when that happens, developers have to incorporate the new characters into whatever systems they are using. That typically includes utilities for sorting characters into alphabetical order, the default character insertion utilities of Microsoft and Apple, and the basic Unicode-friendly text editors.

Will "Gj" get encoded? I actually think it will, but not right away. Believe it or not, the community keeps finding new symbols/letters invented for different languages and sooner or later, most make it in. Unicode 5.0 included a "Latin Extended-D" and "Latin Extended Additional" block to handle these recent discoveries, so there may well be a "Latin Extended-E" in the future.

But I do understand why the Unicode committee gets a little cranky sometimes.

Promoting UTF-8 over ASCII

At the last Unicode Conference in October, Computer Science professor Jiangping Wang gave a good talk about how to train new programmers (especially those in the U.S.) to write software which can easily use Unicode.

One issue Dr. Wang mentioned is that when encoding is taught in traditional computer science programs, the coverage is very brief and sticks to ASCII only. This is obviously problematic since encoding has extended beyond ASCII since the 1980s. Another problem is that ASCII encoding isn't as complex as Unicode encoding, so it doesn't prepare students for the issues Unicode raises.

Unicode isn't just about expanding the character set, but also about understanding additional typographic issues. For instance, Unicode contains characters which control text direction (left-to-right or right-to-left), which are not found in ASCII. In addition, Unicode can be presented in "several flavors" such as UTF-8, UTF-16 and so forth. ASCII also had a few national variants, but it was never dependent on byte order the way UTF-16 and UTF-32 are.
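For the curious, here is a small sketch of the text direction point using Python's standard `unicodedata` module. Every character carries a bidirectional class, and a few invisible characters exist only to steer direction. The sample list and the helper names `bidi_class` and `fits_in_ascii` are my own choices for illustration.

```python
import unicodedata

# A Latin letter, a Hebrew letter, an Arabic letter and three invisible
# direction controls (left-to-right mark, right-to-left mark, RTL override).
SAMPLES = ["A", "\u05D0", "\u0628", "\u200E", "\u200F", "\u202E"]


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a character (L, R, AL, ...)."""
    return unicodedata.bidirectional(ch)


def fits_in_ascii(ch: str) -> bool:
    """True if the character can be encoded in plain ASCII."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


for ch in SAMPLES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch):28} "
          f"bidi={bidi_class(ch):3} ascii={fits_in_ascii(ch)}")
```

Only the first sample survives a trip through ASCII, which is a fair summary of why an ASCII-only curriculum leaves so much out.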

Of course Dr. Wang was "preaching to the choir" at Unicode 31 - WE all know how important proper Unicode support is. The real challenge is convincing others that Unicode is really the true wave of the future.

Will this ever happen? Actually, one thing that will probably accelerate adoption of Unicode is the development of online Web 2.0 technologies. Those companies who want their tools to reach a global audience (e.g. Google, Yahoo, del.icio.us, Twitter) are building in Unicode support from the start. That way, anyone from Japan to Russia can tag their custom maps with their native characters.

I don't know about you, but nothing makes me feel more connected to the world wide web than seeing a Twitter posting in Cyrillic.

Which "UTF" Do I use?

Unicode comes in a variety of flavors depending on how many bytes you are using and in which byte order they are coming in.

For most online uses, UTF-8 is the safest, but here's a short summary of other types of Unicode out there.

UTF-16

Recall that Unicode numbers are really hexadecimal numbers. The original specification called for 16-bit (2^16 = 65,536) characters, so the highest number would be #FFFF. Within Unicode, the four digits are organized into blocks (the first two digits), then codepoints within the block (the last two digits).

Thus capital L (hexadecimal #4C or #x4C) is in block 00 at codepoint 4C, or 004C.
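Here is a quick sketch of that split in Python. It follows this post's simplified "block, then codepoint" picture (the first two and last two hexadecimal digits), and the function name `split_code_point` is made up for the example.

```python
def split_code_point(ch: str) -> tuple[str, str]:
    """Split a 16-bit Unicode number into (block, codepoint) hex pairs."""
    number = ord(ch)
    if number > 0xFFFF:
        raise ValueError("this sketch only covers four-digit Unicode numbers")
    block, point = divmod(number, 0x100)
    return f"{block:02X}", f"{point:02X}"


print(split_code_point("L"))       # capital L
print(split_code_point("\u0416"))  # Cyrillic capital zhe
```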

What happens if a character has more than 4 hexadecimal digits (as in some ancient scripts)? You can still use UTF-16, but the larger Unicode code point will be broken into two units of 4 hexadecimal digits each (a "surrogate pair") via a transformation algorithm. In short, UTF-16 is organized into units of 4 hexadecimal digits (16 bits), one or two per character, but there's a catch:
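The transformation itself is short enough to sketch. This is a minimal version of the standard surrogate pair arithmetic, with a made-up function name (`utf16_units`) and no handling of invalid input such as lone surrogates.

```python
def utf16_units(ch: str) -> list[int]:
    """Return the one or two 16-bit units UTF-16 uses for a character."""
    number = ord(ch)
    if number < 0x10000:
        return [number]  # fits in a single 16-bit unit
    offset = number - 0x10000          # 20 bits left to distribute
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return [high, low]


for ch in ("L", "\U0001F030"):  # capital L and a domino tile from Plane 1
    units = " ".join(f"{u:04X}" for u in utf16_units(ch))
    print(f"U+{ord(ch):04X} -> {units}")
```

A four-digit character passes through untouched, while the Plane 1 domino tile comes out as two units.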

UTF-16: Little Endian vs. Big Endian

Some systems, notably Intel-based systems, organize each Unicode number into codepoint (little end) then block (big end). Others, traditionally including many Unix workstations, organize Unicode into block then codepoint.

Returning to the capital L (#4C), there are two UTF-16 ways to represent this:

  • Big Endian (UTF-16BE) : 00.4C = L
  • Little Endian (UTF-16LE) : 4C.00 = L

Software packages, particularly databases and text editors for programmers can switch between the two, but it can be a hassle. UTF-8 is more consistent between systems, so is a little more resilient.

Note: In theory, UTF-16 files begin with a special BOM (Byte Order Mark) which specifies Little Endian or Big Endian.
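Here is a small Python sketch of both points: the two byte orders for capital L, and a naive check for the BOM at the start of some data. The codec names and BOM constants come from Python's standard library; the function `sniff_utf16` is a hypothetical helper, and it deliberately ignores complications such as UTF-32 files, whose BOM begins with the same two bytes as the UTF-16 little endian one.

```python
import codecs

print("L".encode("utf-16-be").hex())  # big endian: block first
print("L".encode("utf-16-le").hex())  # little endian: codepoint first


def sniff_utf16(data: bytes):
    """Guess the UTF-16 byte order from a leading BOM, or None if absent."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


with_bom = codecs.BOM_UTF16_LE + "L".encode("utf-16-le")
print(sniff_utf16(with_bom))
```

Python's generic "utf-16" codec performs the same check itself when decoding, which is why a file with a BOM can usually be opened without naming the byte order.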

UTF-32 (UCS-4)

At some point, the Unicode Consortium realized that even 65,000+ characters would not be enough, so provisions were made to add another two places in the hexadecimal system. The next two places were called "planes" (vs. "blocks" and "codepoints"). The original 2^16 characters are now Plane 0, with additional planes being added for other scripts as needed. At this point, there are some Plane 1 scripts, but they are mostly ancient scripts. A 32-bit UCS-4 value has room for 2^31 characters, although Unicode itself stops at 17 planes (code points up to #10FFFF).

In any case, to represent all the planes, blocks and codepoints, you need to add extra digits in the Unicode file. Thus capital L (#4C) is plane 00, block 00, codepoint 4C, and is stored as the 32-bit value 0000004C in UTF-32. As you can see, unless you are dealing with ancient scripts or archaic Chinese texts, you are adding extra "plane" information you may not need and making your files larger.
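To put rough numbers on that overhead, here is a sketch that encodes the same text three ways and counts the bytes. UTF-32 stores every character as a 32-bit value, i.e. four bytes. The function name `encoded_sizes` is made up, and the "-be" codec variants are used so that no BOM is added to the count.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text takes in each encoding."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}


print("L".encode("utf-32-be").hex())  # four bytes for one letter
print(encoded_sizes("Unicode"))
```

For plain English text, UTF-32 is four times the size of UTF-8 and twice the size of UTF-16.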

UTF-8 (Unicode Transformation Format)

The difference between UTF-8 and UTF-16/UTF-32 is that UTF-8 uses an algorithm to translate any Unicode character into a series of one to four "octets" (8-bit bytes). Character (004C) "L" can be stripped to a simple 4C, just like in ASCII. If you use primarily English or Western languages, file sizes may be smaller in UTF-8 than UTF-16, and ASCII or Latin 1 code will usually be easier to integrate into Unicode.
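For those who want to see the algorithm, here is a minimal hand-rolled sketch of it in Python, checked against the built-in encoder. The function name `utf8_encode` is my own, and the sketch skips validation (it does not reject surrogates or numbers above #10FFFF), so real code should simply call `str.encode("utf-8")`.

```python
def utf8_encode(number: int) -> bytes:
    """Encode one Unicode number as 1 to 4 octets (no validation)."""
    if number < 0x80:        # ASCII range: one octet, unchanged
        return bytes([number])
    if number < 0x800:       # two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (number >> 6),
                      0x80 | (number & 0x3F)])
    if number < 0x10000:     # three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (number >> 12),
                      0x80 | ((number >> 6) & 0x3F),
                      0x80 | (number & 0x3F)])
    # four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (number >> 18),
                  0x80 | ((number >> 12) & 0x3F),
                  0x80 | ((number >> 6) & 0x3F),
                  0x80 | (number & 0x3F)])


for number in (0x4C, 0x0416, 0x20AC, 0x1F030):
    octets = " ".join(f"{b:02X}" for b in utf8_encode(number))
    print(f"U+{number:04X} -> {octets}")
```

Capital L stays a single 4C octet, while the other samples grow to two, three and four octets as the numbers get bigger.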

The other advantage of UTF-8 is that the algorithm makes the data less corruptible over the Internet: there is no byte order to get wrong, and a damaged byte generally affects only the character it belongs to. Thus UTF-8 is recommended for e-mail, Web pages and other online files. Some databases and programming languages may use UTF-16 instead.
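One way to read "less corruptible" is that UTF-8 recovers quickly when a byte goes missing, whereas UTF-16 loses its alignment. The sketch below is only my illustration of that reading, not a formal test of transmission safety; the helper names `drop_byte` and `damaged_roundtrip` and the sample text are invented for it.

```python
TEXT = "h\u00e9llo w\u00f6rld"  # "hello world" with two accented letters


def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate transmission damage by deleting one byte."""
    return data[:index] + data[index + 1:]


def damaged_roundtrip(text: str, encoding: str, index: int = 1) -> str:
    """Encode, lose one byte, then decode as well as possible."""
    damaged = drop_byte(text.encode(encoding), index)
    return damaged.decode(encoding, errors="replace")


for encoding in ("utf-8", "utf-16-le"):
    # ascii() keeps the output printable on any terminal.
    print(f"{encoding:10} {ascii(damaged_roundtrip(TEXT, encoding))}")
```

In the UTF-8 case only the damaged character turns into a replacement mark, while in the UTF-16 case every later character is misread because the byte pairs no longer line up.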
