ELIZABETH J PYATT: March 2007 Archives

ConScript Unicode Registry for Klingon, Tolkien


"ConScripts" are scripts invented for constructed languages, that is, languages created for a science fiction or fantasy story. Famous examples of ConScripts include Tengwar and Cirth (both for languages invented by Tolkien) and Klingon.

In Unicode, these are handled in the Private Use Area, but the ConScript Registry includes semi-formal protocols for assigning these codes.

That way the user base can build fonts which are compatible with each other. Additional information can be found at http://www.evertype.com/standards/csur/
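
As a rough illustration of how software might recognize these characters, here is a minimal Python sketch. The `CSUR_RANGES` table and the `csur_script` helper are hypothetical names made up for this example, and the ranges are the allocations as I understand them from the registry; please verify them against the ConScript Registry itself before relying on them. The only thing Unicode proper guarantees is the Private Use category ("Co").

```python
import unicodedata

# Assumed ConScript Registry allocations (illustrative only; check the
# registry before relying on these ranges).
CSUR_RANGES = {
    "Tengwar": (0xE000, 0xE07F),
    "Cirth": (0xE080, 0xE0FF),
    "Klingon": (0xF8D0, 0xF8FF),
}


def is_private_use(ch: str) -> bool:
    """True if Unicode classifies the character as Private Use (category Co)."""
    return unicodedata.category(ch) == "Co"


def csur_script(ch: str):
    """Return the assumed ConScript script name for a character, or None."""
    cp = ord(ch)
    for name, (low, high) in CSUR_RANGES.items():
        if low <= cp <= high:
            return name
    return None


if __name__ == "__main__":
    for cp in (0x0041, 0xE000, 0xF8D0):
        ch = chr(cp)
        print(f"U+{cp:04X}", is_private_use(ch), csur_script(ch))
```

Note that Unicode itself attaches no meaning to these code points; a font and a reader both have to agree to follow the registry for the text to display as intended.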

Glyph du Jour: Interrobang


A combination of question mark and exclamation point, described in Wikipedia as a "rarely used, nonstandard English-language punctuation mark" (21 Mar 2007).

Here is a sample of it in different fonts:
[Image: Interrobang.png]

So what's it doing in the Unicode specification? Here are some likely reasons:

  1. There may be historical documents using it - This was invented by Martin K. Speckter in 1962 to make advertisements using "?!" look "cleaner" (compare WTH?! with WTH‽). Ads from that era may include the "interrobang".
    Note: Historic usage is also why Unicode includes provisions for Tolkien scripts.
  2. Because it's there - The interrobang may arise again some day if fonts include it

Interrobang Now

A few fonts designed for Unicode include the interrobang character. The interrobang is also in the new Microsoft ClearType fonts, but it is not widespread. On the other hand, most graphic designers could probably use super-compressed character spacing to create one on the fly if it's really needed.

But...if you really want one on your Web site, just use the code &#8253; and it should appear in most modern browsers. In other documents, you can use the Windows Character Map or the Mac Character Palette.
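
If you would rather look the code up than memorize it, a few lines of Python will do it. This is only a sketch; `numeric_entity` is a made-up helper name, while the `html` and `unicodedata` modules are part of the standard library.

```python
import html
import unicodedata

INTERROBANG = "\u203d"  # the interrobang character itself


def numeric_entity(ch: str) -> str:
    """Build the decimal HTML character reference for a single character."""
    return f"&#{ord(ch)};"


if __name__ == "__main__":
    print(unicodedata.name(INTERROBANG))   # official Unicode character name
    print(f"U+{ord(INTERROBANG):04X}")     # code point in hexadecimal
    print(numeric_entity(INTERROBANG))     # code to paste into a Web page
    # Going the other way: decode a reference back to the character.
    print(html.unescape("&#x203D;") == INTERROBANG)
```

The decimal and hexadecimal forms of the code are interchangeable in HTML, so use whichever your reference chart happens to list.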

You never know when you might say "Unicode has a separate interrobang character‽"

Vietnamese Support Article


Vietnamese is different from other East Asian languages because it is currently written in the Latin alphabet (same as English). On the other hand it has so many tone marks that it is treated differently from "typical" Western European languages like Spanish and French.

The following site, http://vietunicode.sourceforge.net/, is an excellent source of Vietnamese information.

FYI - In the past, Vietnamese was written with Chinese-derived characters, but this system is rarely used in modern Viet Nam.
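
To see why the tone marks make Vietnamese a special case, it helps to look at how a single letter can be stored. The Python sketch below is illustrative only (`describe` is a made-up helper): it takes the word "Việt" and shows that the letter "ệ" can be either one precomposed code point or a base letter followed by two combining marks. Software that compares or searches Vietnamese text generally needs to normalize to one form first.

```python
import unicodedata


def describe(text: str):
    """List (code point, name) pairs for each character in the text."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]


word = "Vi\u1ec7t"  # "Việt" using one precomposed letter for ệ
decomposed = unicodedata.normalize("NFD", word)       # base letter + marks
recomposed = unicodedata.normalize("NFC", decomposed)  # back to precomposed

if __name__ == "__main__":
    print(len(word), len(decomposed))   # the two forms differ in length
    for entry in describe(decomposed):
        print(entry)
    print(recomposed == word)           # normalizing makes them comparable
```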

Persian Support Article


I just found this 2004 presentation on Persian language support from Behdad Esfahbod at
http://behdad.org/download/Publications/persiancomputing/a007.pdf

Interestingly, they seemed to have surrendered to the generic Tahoma font (although maybe things have improved since then).

Arabic: Nastaliq or Naskh?


Arabic is enough of a challenge to work with on a computing level because it's written right to left and has special ligature forms for when certain letters come together (plus consonant forms change depending on whether the letter is at the beginning, middle or end of a word).
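
Both properties can be glimpsed in the Unicode character data. The sketch below is a small illustration, not a rendering engine; `positional_form` is a made-up helper. It uses the letter beh, which has four legacy "presentation form" code points (one per position) that all map back to the single letter used in normal text. In practice you store the plain letter and the font and shaping software choose the form.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the form stored in ordinary text
# Legacy presentation forms of beh, one per position in the word.
BEH_FORMS = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]


def positional_form(ch: str):
    """Return the tag (e.g. 'initial') from a compatibility decomposition."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp[1:decomp.index(">")]
    return None


if __name__ == "__main__":
    # Bidirectional class of the letter: "AL" means Arabic letter (right to left).
    print(unicodedata.bidirectional(BEH))
    for form in BEH_FORMS:
        plain = unicodedata.normalize("NFKC", form)
        print(f"U+{ord(form):04X}", positional_form(form), plain == BEH)
```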

But wait until you hit Urdu and Persian! Now you have to work with letters not found in Arabic and a different form of calligraphy. Although modern Arabic text is based on Naskh writing, Persian and Urdu prefer Nastaliq writing.

There's actually a nice picture from the Wikipedia at
http://en.wikipedia.org/wiki/Naskh_%28script%29

The lesson for me was that every language seems to need its own special support even if its script is already "covered."

Armenian Font Woes


* Note: This entry was published elsewhere in 2006.

There's still one gap in Unicode implementation. Quality fonts featuring characters with exotic accents or lesser-used scripts are not widely available yet, leaving us with more generic fonts which vary in quality depending on the expertise of the font designer.

A few weeks ago I was gently chided by a very polite Armenian speaker who informed me that the Penn State Armenian Unicode chart had incorrect character forms for the punctuation. Sure enough, when I looked at the Unicode PDF chart for Armenian and compared it with some common fonts, I found out there were significant differences.
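
If you want to run the same comparison yourself, the following Python sketch lists the punctuation marks in the Armenian block so you can paste them into a test page and check each font against the Unicode PDF chart. It is a minimal sketch; `armenian_punctuation` and `ARMENIAN_BLOCK` are names made up for this example.

```python
import unicodedata

ARMENIAN_BLOCK = range(0x0530, 0x0590)  # U+0530 through U+058F


def armenian_punctuation():
    """Return (code point, character, name) for punctuation in the block."""
    result = []
    for cp in ARMENIAN_BLOCK:
        ch = chr(cp)
        # General categories starting with "P" are punctuation.
        if unicodedata.category(ch).startswith("P"):
            result.append((f"U+{cp:04X}", ch, unicodedata.name(ch)))
    return result


if __name__ == "__main__":
    for code, ch, name in armenian_punctuation():
        print(code, ch, name)
```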

The problem was I couldn't fix it. The most common fonts for Armenian have the incorrect forms. Although I could specify a correct Armenian font in my CSS, there is no guarantee that the user will have the correct font. Even worse, some browsers like Safari won't allow you to select alternate fonts for "exotic scripts" even if you have them. In the case of Safari, you may be stuck with the Apple Lucida Unicode font which is complete, but definitely has a utilitarian look.

If I wanted to fix the characters, I would have to use images or PDF files - thus defeating the goal of Unicode, which is to send data as simple text. In this case, I kept the chart as is, but pointed users to the Unicode PDF chart.

Incidentally if you are a font designer or need to make sure your font is correct, then you can use the Unicode PDF charts as a reference. The PDF fixes the character shapes so what they make is what you will see.

By the way, if you do want specialized Armenian fonts, then see the Gallery of Unicode Fonts (http://www.wazu.jp/gallery/Fonts_Armenian.html) or ArmenianUnicode Org. Thanks to Sarkis Baltayian for this information.

My Favorite Unicode Fonts


These are the fonts my inner linguist can't live without. Many of these are from academic consortiums (consortia) who offer them for free. This avoids the trouble of waiting for the corporate vendors to get around to us linguists (we really aren't that big a customer base darn it!)

TITUS Cyberbit - Freeware from the University of Frankfurt. This includes characters from many scripts such as Armenian, Cyrillic, Greek, Coptic and more. I find that the characters for each script have been designed for good readability based on traditional forms.

SIL Fonts - Your choices are Charis SIL (a new font designed partially for print), Doulos SIL and Gentium. These include all phonetic symbols and Latin alphabet symbols as well as Greek and Cyrillic. Additional phonetic symbols are included in the Private Use Area.

Cardo - This one is tied to the Thesaurus Linguae Graecae and includes Coptic, unusual variants and rare ancient Greek letters/symbols in the Private Use Area, as well as Latin and phonetic letters. I rarely need a digamma, but I'm very happy to have it available.

Aboriginal Sans Serif - This one includes phonetic stuff, Cherokee and Canadian Aboriginal Syllabics (another script used by several Native American languages). Oh, and it gets you a sans-serif phonetics font.

Chrysanthi (Chryſanþi) - Don't let the New Age symbols fool you. The Chrysanthi font is actually a nice little addition for your font library, containing symbol Unicode blocks which are otherwise hard to find. (FYI - þ = "th" and ſ = "s").

Junicode - Includes characters for medieval languages, Runes and more unusual combined characters and medieval symbols in the Private Use Area.

Still ASCII in SSI and CSS Files


* Note: This entry was published elsewhere in 2006.

The Penn State server delivers UTF-8 Unicode pages. Dreamweaver creates Unicode pages. They appear fine in all my browsers without the entity code translation. So I should be able to include Unicode characters in server side includes - right? Not exactly. Hidden UTF-8 characters seem to interfere.

Any .inc file must be encoded as ASCII and only include ASCII characters. Otherwise you will get an error that the file "cannot be processed". I suspect the culprits are some hidden Unicode control characters that the server doesn't recognize. If you want to include a Unicode character (like the £ symbol), you have to use an entity code like &pound; (all characters in the entity code are ASCII). If you enter raw Unicode, then users will see a question mark, even if the character is actually available in that font.
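
If you have a lot of include files to clean up, the conversion can be automated. The Python sketch below is one possible approach, not something specific to the Penn State server; `to_ascii_entities` is a made-up helper name. It relies on Python's built-in "xmlcharrefreplace" error handler, which swaps every non-ASCII character for a numeric entity code and leaves ASCII text alone.

```python
def to_ascii_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric entity code."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    snippet = "<p>Price: \u00a35</p>"   # contains a raw pound sign
    safe = to_ascii_entities(snippet)
    print(safe)                          # pure ASCII, safe for an .inc file
    print(safe.isascii())
```

The result uses numeric codes rather than named ones like &pound;, but browsers treat the two the same way.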

As for CSS stylesheets, there are no issues technically prohibiting .css files from being UTF-8, but I found out a few years ago that if I placed CSS in UTF-8 files, then attributes would mysteriously fail to apply even though the syntax was correct. Again it was probably a hidden UTF-8 character that was interfering. It's little glitches like these that make Unicode development still an entertaining adventure even in 2007.

What are "hidden" UTF-8 control characters? These are code points which don't represent a character but signify text formatting elements like right to left text vs. left to right text or which kind of line break you are using. ASCII has control characters just in positions #0-31 (and most software programs recognize them), but Unicode includes additional control characters that older programs don't recognize. The problem is that these new control characters can end up in a file without your seeing them.
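
One way to hunt for these characters is to scan the text and report anything in the invisible categories. The Python sketch below is a rough illustration under my own assumptions: `find_hidden` and `strip_hidden` are made-up names, and the choice of categories (control, format, line separator, paragraph separator) is a guess at what "hidden" should cover. A byte order mark at the start of a file is one candidate it would catch, although I can't say for certain which character caused the problems described here.

```python
import unicodedata

ALLOWED_CONTROLS = {"\t", "\n", "\r"}          # ordinary whitespace controls
HIDDEN_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}   # assumed set of "hidden" types


def is_hidden(ch: str) -> bool:
    """True for invisible control/format characters other than tab/newline."""
    return ch not in ALLOWED_CONTROLS and unicodedata.category(ch) in HIDDEN_CATEGORIES


def find_hidden(text: str):
    """Return (position, code point, name) for each hidden character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed control>"))
        for i, ch in enumerate(text)
        if is_hidden(ch)
    ]


def strip_hidden(text: str) -> str:
    """Return the text with hidden characters removed."""
    return "".join(ch for ch in text if not is_hidden(ch))


if __name__ == "__main__":
    css = "\ufeffbody { color: red; }\n"   # stylesheet with a leading BOM
    print(find_hidden(css))
    print(repr(strip_hidden(css)))
```

Be careful with `strip_hidden` on text in scripts like Arabic or Persian, where some of these invisible characters are there on purpose.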

By the way, if you cut and paste from a UTF-8 file and see strange behavior in a software package, sometimes backspacing through a "space" will eliminate an unrecognized control character and fix the problem.

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
