Got Unicode? Blog
Archives: View 2006 Articles By Date
Unicode is the new encoding scheme which will allow all computers to understand all foreign languages...one of these days. This site lists my resources, war stories, tips and comments about the whole Unicode enterprise.
2007 Blog Posts
I am finally transferring to a real blog application!
All new entries will be posted at http://www.personal.psu.edu/ejp10/blogs/gotunicode, and the 2006 entries will eventually be transferred as well (but with 2007 dates). The archives will stay in place for the next few months though.
2006 Blog Posts
18-Dec: Some Stat Symbols on the Web
I thought I would close the year by reporting on how to do some common statistical notations. Unlike other mathematical symbols, stat symbols like "x-bar" and "p-hat" are actually made with a letter followed by a "combining diacritic."
Note on Safari 2: These codes use "combining diacritics" which are not supported in Safari as of version 2. Mac users can use Firefox or Opera instead.
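As a quick sketch of how those two stat symbols are built, here is a minimal Python example (Python is used here just for illustration); the combining code points U+0304 (macron) and U+0302 (circumflex) come straight from the Unicode charts:

```python
# "x-bar" and "p-hat" are a base letter plus a combining diacritic
x_bar = "x" + "\u0304"   # x + COMBINING MACRON
p_hat = "p" + "\u0302"   # p + COMBINING CIRCUMFLEX ACCENT

print(x_bar, p_hat)      # renders as x-bar and p-hat in a capable font

# each is two code points even though it displays as one character
print(len(x_bar), len(p_hat))   # 2 2
```

For a Web page, the same characters can be written with the ASCII entity codes x&#772; and p&#770; (772 and 770 are the decimal values of the two combining marks).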
Edward Lucas, one of the Eastern European journalists at the Economist, has a fun article (The language of Šekspīrs) pointing out that even people comfortable with umlauts, tildes and cedillas may tremble at the sight of the Czech háček and Polish ogonek.
If you want Euro quotation authenticity in Word and on the Web, try these techniques out. You'll be writing «¿Qué pasa aquí?» and „Sprechen Sie Deutsch?“ in no time.
You'll be able to use “Smart Quotes” – and even long dashes!!
Japanese is complex just like Chinese, yet so very different. Here are some notes on multiple scripts in Japan along with reasons why Unicode has had its critics in Japan.
First the story, then the definition....
Back in my youth I studied two languages of South Asia (India and neighboring countries), namely Sanskrit and Sinhala. Until recently, these were the only non-Western scripts I could read at all. They're gorgeous and compact but, for various reasons, very tricky to encode.
South Asian scripts, unlike Western scripts, mark vowels not with their own letters, but with little tails and swirls (svara or ligatures) attached to different consonants (hence the compactness). In scripts towards the South (e.g. Tamil, Sinhala, Oriya), some "swirls" come after the consonant, but some come before the consonant. This makes font design tricky.
To handle vowel placement instructions, Adobe and Microsoft developed the Open Type font format. But for some reason, Apple did not participate - instead they developed the ATSUI standard for their own fonts. If you have the right Apple fonts (and only some scripts are provided) and you use Text Edit (and sometimes NeoOfficeJ), all is well... Diverge from this path and vowel marks dance all over the place.
While we're all used to Windows-Mac compatibility issues, this one is pretty insidious for Mac users. Lots of Open Type fonts are available, but very few ATSUI fonts (in fact only Apple makes them as far as I can tell). Relatively few people know there's an ATSUI standard. In fact, unless you are working with a South Asian script, you may never notice that OTF fonts don't exactly work the same on a Mac. All development efforts are focused on Open Type fonts.
The ultimate horror for this Mac fanatic is that Windows is BETTER. Even worse, Microsoft publishes all sorts of information on Open Type font design specs for South Asian scripts, while Apple has virtually no documentation on creating or converting South Asian scripts to the ATSUI format. (FYI to ATSUI font designers, it's the Linguistic Rearrangement Feature).
However this lovely Wikipedia article on Apple fonts actually explains why the Indic fonts can't be designed in ATSUI.
The situation may change with the new Intel Macs (they'll run Windows anyway), but for now...ugh!
Even linguists get confused by language names outside their main region. So while I was checking out some African language data, I finally got a chance to straighten out some terms. So for the record:
- Ethiopic is a script developed in Ethiopia. It looks like this (ግዕዝ)...assuming you have the right font.
- Ethiopic is used to write Ge'ez, the classical language of Ethiopia and still used in the Ethiopian Orthodox Church (and Ethiopian Jewish services). It's South Semitic and is distantly related to Hebrew and Arabic. Ge'ez evolved into Amharic, the modern official language of Ethiopia.
- Other languages are also written in the Ethiopic script including Tigrinya and Oromo.
- An alternate name for Ethiopia is Abyssinia (from the original Arabic name for Ethiopia). However, some people may use Abyssinia for areas in east Africa which are not exactly the same as modern Ethiopia.
- Ethiopic Unicode fonts and utilities are available...if you know where to look.
In any case, Ethiopia has had a long and influential political history dating from the Sabaean era.
Yorùbá is your typical African Niger-Kordofanian language with a simple seven-vowel system and three tones, much like most other languages in Sub-Saharan Africa. The vowels are basically a,e,i,o,u plus dotted ẹ,ọ (for /ɛ,ɔ/). Dotted vowels are not common, but can be dealt with in Unicode.
What makes Yoruba interesting is that, unlike other neighboring African languages, it also writes out tones, using the acute accent for high tones and the grave accent for low tones. This means that the dotted vowels ẹ,ọ can also get marked for tone with an accent mark above.
Now we have a Unicode problem because there are two diacritics on one vowel. Vietnamese has the same type of double-marked vowels, but in Unicode, each version of the Vietnamese vowel with its diacritics gets its own numeric slot (code point) and font glyph, so it's not too much of an issue.
Alas, this is not what was recommended for Yoruba. Instead, Unicode (at least as of 4.0) wants us to use "combining" accents to make our own characters up. This makes theoretical sense, but can actually be harder on font designers and people trying to alphabetically sort Yoruba. Also, support for "combining" accents is pretty weak in most U.S. systems, which would much rather you work with "precombined" characters. Since I'm a linguist though, I had to take up the challenge...because I may have to write a Yoruba word for class one day.
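To see the difference between the precombined and combining approaches concretely, here is a minimal Python sketch (the standard-library `unicodedata` module does the work). Yoruba ọ has a precomposed slot for the dotted o (U+1ECD), but there is no single code point for the dotted o with acute, so even after normalization one combining accent remains:

```python
import unicodedata

# Fully decomposed: plain o + combining dot below + combining acute accent
decomposed = "o\u0323\u0301"

# NFC normalization precomposes as far as Unicode allows: o + dot below
# collapse into U+1ECD, but the acute has no precomposed partner and
# stays behind as a separate combining mark
nfc = unicodedata.normalize("NFC", decomposed)
print([hex(ord(c)) for c in nfc])   # ['0x1ecd', '0x301']
```

This is exactly why font and sorting support is harder for Yoruba than for Vietnamese, where every accented vowel has its own precomposed code point.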
Mac Yoruba Challenge
With accent marks, I normally try the Mac U.S. Extended keyboard. I can get the dotted vowel or an accented vowel...but not both. When I tried the codes, the second diacritic got separated thus (´ọ). No dice. The same thing pretty much happens with the Character Palette as well.
The easiest solution for me is to download and install the SIL Phonetic keyboard, since this one supports combining accents. So for dotted o with acute accent (ọ́), I now type Shift+2 then Option+8 then O. The only problem is that most phonetics fonts which support combining accents don't support capital versions very well. The accent tends to run into the vowel (Ọ́) ... if it's critical, I would do PDF instead.
Problems on Safari and Internet Explorer - Combined accents don't play well on Safari. Even with a CSS file set to switch fonts to an appropriate phonetics font, you WILL get kicked into a font which does not like combining accents. You will see floating accents or squares instead. With Internet Explorer, you MUST use CSS to change the font to something like Arial Unicode MS.
Windows Yoruba Challenge
I tried a phonetics keyboard download for Windows, but I couldn't find the combining dot below character. It's not their fault...the fact that Windows keyboards really only support Shift and sometimes Alt reduces the number of possible combinations compared to the Mac's Shift, Option and Shift+Option.
Fortunately, I had access to an IPA keyboard from the commercial language software Global Writer. If I were stuck with a PC for real, I would be forced to buy my own copy of this for emergencies like these.
A Final Note on Unicode Social Justice
Unicode developers are as sensitive to linguistic minorities as anyone, but it's still amazing that economic and political prestige plays such an important role in determining language support. If Yoruba had the pull of the Vietnamese technology/political community, I'm pretty sure there would be precomposed dotted accented letters.
"Hard" languages like Chinese are well supported partly because the market is desirable to Western corporations and partly because there is such a strong government making sure of it. There is also a relatively strong pool of technically educated Chinese speakers able to develop appropriate technological utilities. Even European minority languages are further ahead because there are technologically skilled bilingual speakers able to work on their native languages.
"Easier" languages like Yoruba get left behind (for now) because there may not be as strong a technology/political community. Even for a normally non-political person such as myself, it's an eye opener. I don't think it's an issue of blame, but it is still a sad fact that Africans may feel compelled to simplify Yoruba spelling or switch to French or English.
After being diverted by other work issues, I hope to return to regular postings. The first one has to do with the Aries (♈) symbol, which is also used in astronomy to represent the vernal point (the point where the sun crosses the equator at the spring equinox).
For various reasons, I needed to generate the vernal point symbol in Photoshop (on the Mac), so I thought...why not Unicode? I dutifully opened the Apple Character Palette tool and found the astronomical symbols under the "Miscellaneous Symbols" listing. The result was the question mark of death (no Aries symbol). The weird thing was that the Aries sign was visible in the text layer name. That was a "sign" that Photoshop does support Unicode, but that there was a font issue.
The trick was that I had to figure out which font actually has the Aries symbol, because Times New Roman is not one of them. Fortunately I was able to remember which font had it, so I got the symbol visible, but I had to think on it a bit.
For the record, here are some fonts with the Aries symbol (and other astro symbols). For astronomy, you are usually looking for fonts which support the Miscellaneous Symbols and Mathematical Operators Unicode block.
- Arial Unicode MS (free Microsoft)
- Apple Symbol (free from Apple)
- CERG Chinese Font
- Alan Wood's List
Step-by-step tutorial from the W3C explaining how to create an Arabic or Hebrew XHTML document supporting Unicode and RTL direction. Includes screen capture images and references to RTL Unicode characters and tags.
There's still one gap in Unicode implementation. Quality fonts featuring characters with exotic accents or lesser used scripts are not widely available yet, leaving us with more generic fonts which vary in quality depending on the expertise of the font designer.
A few weeks ago I was gently chided by a very polite Armenian speaker who informed me that the Penn State Armenian Unicode chart had incorrect character forms for the punctuation. Sure enough, when I looked at the Unicode PDF chart for Armenian and compared it with some common fonts, I found out there were significant differences.
The problem was I couldn't fix it. The most common fonts for Armenian have the incorrect forms. Although I could specify a correct Armenian font in my CSS, there is no guarantee that the user will have the correct font. Even worse, some browsers like Safari won't allow you to select alternate fonts for "exotic scripts" even if you have them. In the case of Safari, you may be stuck with the Apple Lucida Unicode font which is complete, but definitely has a utilitarian look.
If I wanted to fix the character, I would have to use images or PDF files - thus defeating the goal of Unicode to send data as simple text. In this case, I kept the chart as is, but pointed users to the Unicode PDF chart.
Incidentally, if you are a font designer or need to make sure your font is correct, then you can use the Unicode PDF charts as a reference. The PDF embeds the reference character shapes, so what the chart shows is exactly what you will see.
You can download the FPMF Armenian package free. The package contains five fonts in a mixture of faces ranging from traditional to modern. See the Gallery of Unicode Fonts (http://www.travelphrases.info/gallery/Fonts_Armenian.html) and ArmenianUnicode Org for additional fonts. Thanks to Sarkis Baltayian for this information.
I'm working on a math quiz in Flash for set theory where questions are pulled in from a text file. The instructor wants to include the union (∪) and intersection (∩) symbols in his problems, so what to do?
The good news is that if you can create a UTF-8 text file and insert the symbols, it will import into Flash (at least in Flash 8). For math, your best bet is usually to use the Windows Character Map utility and insert the symbols into a Notepad text file, or use the Macintosh Character Palette with a Text Edit text file. Unfortunately, the process is still a little clunky on both platforms, but it's better than in 2005.
You have to open both Notepad (Start » Accessories) and Character Map (Start » Accessories » System Tools)
For the Windows Character Map, it's a semi-clunky process. You have to switch the font to "Arial Unicode MS" (because it has all the math symbols), then scroll down the window until you see the math section. Then you have to select, copy and paste each symbol into Notepad.
In Notepad, when you save the file, you have to make sure the encoding menu under the file name is changed from "ANSI" to "UTF-8". Fortunately, it will warn you.
In Text Edit for the Mac, you go to Edit » Special Characters to bring up the Character Palette. Click the Math option and hunt for the symbol. Highlight and click Insert to place it in Text Edit.
Once you insert the symbols, you have to make sure your encoding is set to UTF-8 during the save process. Go to the Format menu and select "Make Plain Text." Then, when you save the file you have to make sure the encoding menu under the file name is changed from "MacRoman" to "UTF-8".
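If you would rather script the file creation than fight the GUI tools, a few lines of Python will write a UTF-8 text file with the set symbols directly (just a sketch; "quiz.txt" is a made-up filename):

```python
# Union (U+222A) and intersection (U+2229) written straight into a UTF-8 file
question = "Find A \u222a B and A \u2229 C"

with open("quiz.txt", "w", encoding="utf-8") as f:
    f.write(question)

# round-trip check: the symbols survive the save
with open("quiz.txt", encoding="utf-8") as f:
    print(f.read())
```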
Reopening UTF-8 Files in Mac Text Edit
In Text Edit, if you reopen a UTF-8 file it may be magically transformed to MacRoman (you'll see things like Á& instead of your intended character). Very annoying (grr!!) To prevent this, you must go into the Text Edit Preferences, then click the Open and Save panel. Make sure that the Plain Text Encoding options for opening and saving are set for "UTF-8." Or you can spring for a license for BBEdit or Mellel which are better about warning you.
As for Flash - fonts are still a little tricky within Flash, but at least it's playing well with text files.
28-April: InterSol Inc. Newsletter
I ran into this newsletter from InterSol, Inc., a company specializing in translation and internationalization. The articles are a combination of translation quirks and encoding issues. Short, but informative.
Some Unicode articles
Programmers - Are you seeing Á& in your text file instead of a mathematical symbol like ∴ (the therefore triangle)? There may be several things wrong.
The biggest problem may be that you opened it in an old-school program like EditPlus (PC) or an old version of Word. These do not actually support UTF-8 files, so they will attempt to convert your Unicode character to ANSI (Latin 1), and you will get the weird double characters.
If you are on Windows, you should use Notepad, or you can download the freeware Unipad. You can experiment with Word, but its Unicode support is a little strange sometimes, even though it's not supposed to be.
If you are on a Mac, try TextEdit. Other programs like BBEdit also work with UTF-8 code. The advantage of newer text editors is that they can usually detect a UTF-8 file (some programs may ask for an encoding) and open it correctly. See Alan Wood's Unicode Resources for a list of Unicode-friendly applications.
Finally - make sure your files are actually saved with the UTF-8 encoding (check out the Save As options). If you are still having problems, you may have to try other flavors of Unicode (e.g. Big Endian, UTF-16, etc) until you find one that works for your particular application.
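The "weird double characters" have a mechanical explanation: the editor decoded UTF-8 bytes as if they were Latin 1. Here is a small Python sketch of the failure, and of the fix when the damage is only a mistaken decode:

```python
# The therefore sign (U+2234) is three bytes in UTF-8
therefore = "\u2234"
utf8_bytes = therefore.encode("utf-8")    # b'\xe2\x88\xb4'

# An old-school editor decodes those bytes as Latin 1: three junk characters
garbled = utf8_bytes.decode("latin-1")

# Reversing the mistaken decode recovers the original character
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed == therefore)   # True
```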
Based on a true story, although the names have been changed.
10-April: Details on Chinese and Unicode
I'm not an expert, but I put together a few notes on Simplified Chinese, Traditional Chinese and Unicode...in case anyone asks.
Although Wikipedia has had its share of public relations snafus, on the whole I think it is a good tool. One little side benefit of Wikipedia is that a native speaker will often take the time to add the actual foreign language version for a term. For instance, the Wikipedia haiku entry shows the Japanese term in Kanji.
The Wikipedia modern Hebrew entry not only shows the name in Hebrew characters, but also the transliterated name ('ivrit or Ivrit) and its phonetic transcription.
But wait, how do I know it's right? First, I can check the "Ivrit" part out by entering it into Google and seeing if I get results about Modern Hebrew (I do). Second, who the heck is going to take the time to do a fake phonetic transcription? It may be wrong, but it's probably not fake.
16-Mar: Which "UTF" Do I Use?
Unicode comes in a variety of flavors depending on how many bytes you are using and in which byte order they are coming in.
For most online uses, UTF-8 is the safest, but here's a short summary of other types of Unicode out there.
Recall that Unicode numbers are really hexadecimal numbers. The original specification called for a 16-bit character set (2^16 = 65,000+ characters). The highest number would be #FFFF. Within Unicode, the four digits are organized into blocks (the first two digits), then codepoints within the block.
This capital L (Hexadecimal #4C or #x4C) is in block 00 and codepoint 4C or 004C. In UTF-16, all four places are represented, but there's a catch.
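A quick Python check of that anatomy (the block/codepoint split is just the first two and last two hex digits):

```python
# Capital L is hexadecimal 4C; padded to four digits it is 004C
code = f"{ord('L'):04X}"
block, codepoint = code[:2], code[2:]
print(code, block, codepoint)   # 004C 00 4C
```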
UTF-16: Little Endian vs. Big Endian
Some systems, notably Intel-based systems, organize each Unicode number as codepoint (little end) then block (big end). Others, notably Unix systems, organize Unicode as block then codepoint.
Returning to the capital L (#4C), there are two UTF-16 ways to represent this:
- Big Endian (UTF-16BE) : 00.4C = L
- Little Endian (UTF-16LE) : 4C.00 = L
Software packages, particularly databases and text editors for programmers can switch between the two, but it can be a hassle. UTF-8 is more consistent between systems, so is a little more resilient.
Note: In theory, UTF-16 files begin with a special BOM (Byte Order Mark) which specifies Little Endian or Big Endian.
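Python's codecs make the two byte orders (and the BOM) easy to inspect, if you want to verify this yourself:

```python
# Big Endian puts the block byte first; Little Endian puts the codepoint first
assert "L".encode("utf-16-be") == b"\x00L"   # 00.4C
assert "L".encode("utf-16-le") == b"L\x00"   # 4C.00

# Plain "utf-16" prepends a BOM so readers can tell which order follows
with_bom = "L".encode("utf-16")
print(with_bom[:2])   # b'\xff\xfe' on a little-endian machine
```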
UTF-32
At some point, the Unicode Consortium realized that even 65,000+ characters would not be enough, so provisions were made to add more places in the hexadecimal system. The next two places were called "planes" (vs. "blocks" and "codepoints"). The original 2^16 characters are now Plane 0, with additional planes being added for other scripts as needed. At this point, there are some Plane 1 scripts, but they are mostly ancient scripts. In theory, there is now room for 2^31 characters.
In any case, to represent all the planes, blocks and codepoints, you need extra digits in the Unicode file. Thus capital L (#4C) becomes 0000004C in UTF-32 (a full four bytes). As you can see, unless you are dealing with ancient scripts, you are carrying extra "plane" information you may not need and padding out your files.
UTF-8 (Unicode Transformation Format)
The difference between UTF-8 and UTF-16/UTF-32 is that UTF-8 uses an algorithm to translate any Unicode character into a variable-length series of "octets" (bytes). Character #004C ("L") can be stripped down to a simple 4C, just like in ASCII. If you use primarily English or Western languages, file sizes may be smaller in UTF-8 than UTF-16, and ASCII or Latin 1 content is usually easier to integrate into Unicode.
The other advantage of UTF-8 is that the encoding is less prone to corruption over the Internet (there is no byte-order ambiguity). Thus UTF-8 is recommended for e-mail, Web pages and other online files. Some databases and programming languages may use UTF-16 instead.
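A short Python comparison shows the size difference on Western text (a sketch, not a benchmark):

```python
# ASCII characters cost one byte in UTF-8, identical to plain ASCII
assert "L".encode("utf-8") == b"L"

# Western text doubles in size under UTF-16
assert len("Hello".encode("utf-8")) == 5
assert len("Hello".encode("utf-16-be")) == 10

# Non-ASCII characters expand to multi-byte sequences in UTF-8
assert "\u0101".encode("utf-8") == b"\xc4\x81"   # a-macron becomes two bytes
```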
- What Every Developer Should Know About Unicode
- UTF-8 and Unicode FAQ
- Unicode Org UTF-8, UTF-16, UTF-32 & BOM
- UTF-8 on Wikipedia
8-Mar: New Phonetics Keyboard for Windows
University College London has a new freeware Unicode Phonetics keyboard for Windows based on the Charis SIL and Doulos SIL fonts. You can download and activate it in the Control Panel (they have instructions).
Mac users - get the IPA-SIL freeware Mac keyboard. It says alpha, but I've been using it for about 3 years. It's been truly a godsend for me.
I'm sure you've been wondering whether the iTunes music application supports Unicode, and the answer is ... YES! (at least in the Mac version).
I found this out when I imported two Japanese instrumental tracks from the Kill Bill (Vol 1) soundtrack and discovered that the Japanese artists' names were written in Japanese script. Then I added some Unicode phonetic transcription using the IPA-SIL keyboard as a test.
So...if you have iTunes playlists for non-English musicians, you will be able to represent titles and names in their native format.
Once a year I test the current browsers (Mac and Windows) on different scripts, so I get to see a wide array of quirks. No single browser is perfect, but a few are great for everyday use. My recommended browsers for intense Unicode browsing are, in order:
1 - Opera
Opera is an excellent Unicode browser all around. I especially like it because it allows you to change fonts via the style sheet and it has the best Indic font support on the Mac (although not perfect - this is probably an Apple thing).
Its only drawback is that it doesn't support Plane 1 fonts (e.g. Gothic, Old Italic). But how many people need Plane 1 font support now?
2 - Firefox/Mozilla
Mozilla/Firefox are also excellent Unicode browsers (and I use them every day). They let you set the font for most scripts and they support Plane 1 fonts (unlike Opera). And then there are all the cool plugins.
The only reason it's not first is that the Mac version doesn't have perfect Indic support. Vowel marks are placed a little funny.
3 - Safari (Mac)
Free from Apple. This is fast and works great so long as you don't care about font display. As long as you have a font with the right characters loaded, Safari will work. But I wish it didn't always default to Lucida Unicode for phonetics. I have other fonts installed - I want to pick them.
4 - Internet Explorer (Win)
Free from Microsoft. This is almost tied with Safari. Its advantage is that it supports East Asian vertical text when no other major browser does.
But I find that it's a little quirkier than I like. For phonetic scripts, my CSS must manually specify the correct font, or else the default Times New Roman will generate the question mark of death. And it's missing Plane 1 font support.
5 - Netscape 4.7 (Win)
Netscape 4.7 for Windows had many problems with CSS, but its Unicode support is pretty decent. Still, I would suggest moving on.
The one bad quirk is that Netscape 4.7 does not recognize new entity codes like
5 - Internet Explorer (Mac)
Older Mac browsers like IE 5.5 coded display script by script and they only got around to the major scripts, meaning only Russian, Chinese, Japanese, Korean, Central European, Baltic and Turkish are properly supported. Everyone else is out of luck.
You can have a valid font for another script, but IE 5 will still give you the question mark of death. Time to make the plunge to Firefox, Safari or Opera.
6 - Netscape 4.7 (Mac)
This has all the problems of Internet Explorer, but it has the added quirk of using characters from Japanese fonts for Russian and Greek. The spacing is truly hideous.
Note for System 9 users - You may need to stay with System 9 for economic reasons, but your Unicode experience will never be as good as it needs to be. And many vendors have switched to OS X, so I strongly recommend an upgrade to OS X (and not only for Unicode).
When Microsoft announced in December 2005 that they would no longer support Internet Explorer 5.5 for the Mac, I breathed a sigh of relief. It was fine in its heyday (2000-2001), but its Unicode support never caught up with OS X.
Even though Mac OS X is Unicode based and supplies lots of fonts with lots of characters, IE Mac was UNABLE to display them. Older Mac browsers had to code proper display script by script. If you weren't reading Russian, Central European, Chinese, Japanese, Korean or Modern Greek ... sorry folks, you will get the Unicode question mark of death ("glyph not available") - even if you actually had the font available.
Yet...usage of Internet Explorer 5.5 still persists out there in Mac land, so people sometimes still ask why my page isn't working (it's not my code, it's your browser, I'm afraid).
My other complaint about Internet Explorer is that the Mac version is radically different from the Windows version (I know because I had to document them both). Unicode tags/formatting which work in the Windows version may not work on the Mac version. I even once heard someone complain "It doesn't work on a Mac" when the situation was that the page did not display on IE Mac (it's not the Mac, it's your browser I'm afraid).
So...IE Mac, I salute you as the first major software application to incorporate OS X-style gel graphics, but now I wish to bid you a fond farewell as you join the ether of defunct browsers (along with Netscape 4.7).
24-Feb: Runes (ᚠᚢᚦᚨᚱᚲ) on the Web
Here are some tips for inputting runes into a Unicode Web page. As far as ancient scripts go, runes are actually one of the easier scripts to work with.
16-Feb: When CSS Files Don't Work
In Dreamweaver, I set my default encoding to UTF-8 (Unicode) so that all my HTML files will be encoded as Unicode by default. This has improved my life quite a bit, but CSS did take a hit.
I generally code my CSS by hand, but occasionally I would add a style attribute that would not appear in either the Dreamweaver WYSIWYG view or in the browser. I would literally copy and paste correct code from another document into my CSS file and NOTHING would happen. Text remained plain and black, even if it was supposed to be red and italic.
The culprit was that the .css file was also in UTF-8 format, which can add a few invisible bytes at the start of the file. Occasionally, the browser would encounter one of these and not be able to parse the attribute. Very weird.
So... I now have to create CSS files in BBEdit, and make sure that the encoding is set to plain vanilla MacRoman (or ANSI in the Windows world). Now if I don't see styles, it's usually because of a syntax error.
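The "few invisible bytes" in a UTF-8 file are typically the byte order mark (BOM). A Python sketch makes them visible (the CSS content here is just a placeholder):

```python
css = "p { color: red; }"

# "utf-8-sig" writes the three-byte BOM; plain "utf-8" does not
with_bom = css.encode("utf-8-sig")
without = css.encode("utf-8")

print(with_bom[:3])              # b'\xef\xbb\xbf'
assert with_bom[3:] == without   # the rest of the file is identical
```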
So the future is coming, but remember that alternate transportation still may need to be used to get there.
Due to a recent computer catastrophe, I had to re-install my browsers and reconfigure them. Most modern browsers are pre-set to match fonts and scripts for major languages, but that doesn't mean I don't like to tweak. I'm not a big fan of default Times New Roman, so I usually choose something else.
For this round, I chose the Medieval Unicode font Junicode (pronounced /yunikod/ I believe), and it looks quite handsome on the Mac (haven't checked it on the PC yet). It has nice tall serifs, round o's and readable font weight - and it will include exotic characters not included in other fonts. I've been very happy with it so far.
I wouldn't want to live electronically without these fonts. Many of these are from academic consortiums (consortia) who offer them for free. This avoids the trouble of waiting for the corporate vendors to get around to us linguists (we really aren't that big a customer base darn it!)
TITUS Cyberbit - Freeware from the University of Frankfurt. This includes characters from many scripts such as Armenian, Cyrillic, Greek, Coptic and more. I find that the characters for each script have been designed for good readability based on traditional forms.
SIL Fonts - Your choices are Charis SIL (a new font designed partially for print), Doulos SIL and Gentium. These include all phonetic symbols and Latin alphabet symbols as well as Greek and Cyrillic. Additional phonetic symbols are included in the Private Use Area.
Cardo - This one is tied to the Thesaurus Linguae Graecae and includes Coptic and unusual variants and rare ancient Greek letters/symbols in the Private Use Area as well as Latin and phonetic letters. I rarely need a digamma, but I'm very happy to have it available.
Aboriginal Sans Serif - This one includes phonetic stuff plus Cherokee and Canadian Aboriginal Syllabics (another script used by several Native American languages). Oh, and it gets you a sans-serif phonetics font.
Chrysanthi (Chrʃsanþi) - Don't let the New Age symbols fool you. The Chrysanthi font is actually a nice little addition for your font library containing symbol Unicode blocks which are otherwise hard to find. (FYI - þ = "th" and ſ = "s").
- Junicode - Includes characters for medieval languages, Runes and more unusual combined characters and medieval symbols in the Private Use Area.
Last entry, I used a long a with a macron (ā) in the title to see how good Unicode support was. RSS is an XML format designed for news readers. Like other XML schemas, Unicode is a supported encoding.
For once, this is actually true in implementation, and even more unusually you MUST use native Unicode (RSS and other XML files do not support HTML named entity codes). So... I took a deep breath, activated my Macintosh Extended keyboard, typed the macron into the XML and posted it.
The long a (ā) appeared in NetNewsWire (Mac), Safari (Mac) and Firefox (with CSS). It did not appear in FeedDemon (Windows)... still not bad though. There is a lot of work to be done with Unicode, but the future is coming.
26-Jan: How to Get ā (Long A/Macron/Line Thingie) and Other Weird Accents
One question that comes up a lot is how to generate a long mark or macron over vowels. It's "exotic" enough to be outside the normal Western European character set for French, Spanish, German and so forth, so the usual rules do not apply. Read the macron page for how-to details.
The Penn State server delivers UTF-8 Unicode pages. Dreamweaver creates Unicode pages. They appear fine in all my browsers without the entity code translation. So I should be able to include Unicode characters in server side includes - right?
Not exactly. Any .inc file must be encoded as ASCII and only include ASCII characters. If you want to include a Unicode character (like the £ symbol), you have to use an entity code like &pound; (all characters in the entity code are ASCII). If a non-ASCII character is included, then users will see a question mark, indicating that the character is "not available in the font" (yeah right).
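Generating those ASCII-safe entity codes can be automated. Here is a hypothetical Python helper (a sketch only, not part of any real include pipeline) that swaps every non-ASCII character for its decimal numeric entity:

```python
# Replace every non-ASCII character with its decimal numeric entity code,
# leaving an ASCII-only string that is safe for an .inc file
def to_entities(text):
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

print(to_entities("Price: \u00a35"))   # Price: &#163;5
```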
Similarly, CSS files had better be encoded as something simpler than Unicode or styles will mysteriously fail to apply. Unicode can add hidden bytes which not all systems recognize.
It's little glitches like these that make Unicode development still an entertaining adventure even in 2006.
18-Jan: My favorite Unicode Jargon
Tech professionals love jargon, and Unicode experts are no different. Here's a jargon list, just so you know.
17-Jan: What is Unicode and how do I use it?
Funny you should ask. I wrote a tutorial for the Penn State Computing with Accents page which summarizes it as best as possible. Unicode is a way of classifying all scripts of the world so all computers can theoretically parse any language, even Russian and Chinese. In practice, a lot of glitches need to be worked out.
17-Jan: What fonts do I need?
Depending on the language, you may have the right Unicode font already. But see the Unicode Links for a list of font resources.