February 2009 Archives

Windows U.S. International Keyboard...on a Mac

The Windows International keyboard is a Windows utility from Microsoft which allows users to enter a variety of accent codes with combinations of keys like '+e (for é) instead of memorizing a list of numeric ALT codes. If you are typing a lot of accented characters on a Windows machine, it's a godsend.

The interesting thing is that you can now download a Mac version of the Windows International keyboard. As a longtime Mac addict, I find it amusing because I am so used to the Apple Option keys. To me it's an interesting redundancy.

But I can imagine that if you are a long-time Windows user, you may not want to learn a new set of Option codes. I can relate, because I've been struggling with my new phonetics keyboard, which is very different from my old one. There's some serious retraining needed before I can use it.

What's really important is that there are utilities out there which allow users to customize their keyboards just the way they want them. Vive la différance.

When Apache and UTF-8 Fight

When you create a Web page with Unicode characters, it is recommended that you include the following character meta tag:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
</head>

And if it's XHTML, you need to close the tag with a final "/" just before the ">".

The idea behind this tag is to force the browser into the correct encoding and prevent the display of Roman character gibberish. Sometimes though, you can post a properly formatted UTF-8 Web page (meta tag and all) and still see gibberish.

In this case the problem is not you, but the Web server, which is typically running Apache. If it's an American server, Apache is probably set up to ONLY deliver ISO-8859-1 encoding and, even though your file has the UTF-8 data in it, the server is trying to deliver it as Latin 1 (hence the Latin 1 gibberish).
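The mismatch described above can be reproduced in a few lines of Python. This is only a sketch of the general mechanism (UTF-8 bytes read back as Latin 1), not a reproduction of any particular server; the function names are made up for illustration.

```python
def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode the same bytes as ISO-8859-1.

    This mimics a page saved as UTF-8 but labeled as Latin 1.
    """
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the misreading by reversing the two steps."""
    return garbled.encode("iso-8859-1").decode("utf-8")


if __name__ == "__main__":
    print(misread_as_latin1("é"))                # prints Ã©
    print(repair(misread_as_latin1("Русский")))  # prints Русский
```

The data itself is intact in this scenario; only the label is wrong, which is why fixing the server's declared encoding is enough to clear up the gibberish.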

There are three possible solutions available when this happens:

Talk to Your Server Admin

And when you do, you can politely suggest changing the httpd.conf file as documented on Seapine Software. You can also comment that most modern Web apps are set to serve UTF-8 data including CMS programs such as Plone, Movable Type and Drupal. Others such as Facebook and Twitter support UTF-8 natively.

I believe this is what a Web service having this issue did recently.
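Before contacting an admin, it can help to confirm which encoding the server actually declares in its Content-Type header. The sketch below only parses a header value that you have already copied from your browser's developer tools or a similar tool; the function name is hypothetical, and the sample header strings are illustrations rather than output from a real server.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset named in a Content-Type header value, if any.

    The result is lowercased; None means the header names no charset.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


if __name__ == "__main__":
    # Illustrative header values, not captured from a real server.
    print(declared_charset("text/html; charset=ISO-8859-1"))  # prints iso-8859-1
    print(declared_charset("text/html; charset=utf-8"))       # prints utf-8
    print(declared_charset("text/html"))                      # prints None
```

If the header says iso-8859-1 while the page's meta tag says utf-8, that is a sign the server setting is overriding the page.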

Use an .htaccess file to just configure specific directories and pages

If you're comfortable enough to mess around with changing your directory preferences, you can try this suggestion from Tex Texin about using AddType statements.

The main proviso here is that an .htaccess file can do some serious damage unless you are careful. It's possible that you may not be able to upload one into your directory because of this, but it could be a good solution to suggest to a server admin if only your directory is affected and the rest of the site has to be encoded differently.

Unicode Escape Codes

If neither of the above solutions is available, then you can deliver the content within any encoding...if you encode the "exotic" characters as Unicode numeric escape codes.

For example if your site is Latin 1, but you need to present Russian content you can change your code from

Русский

to

&#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;

As you can imagine, this IS an absolute last resort solution. If you ever need to transfer content between systems, you will have many more problems with escape codes (none of which are supported in true XML or Microsoft Word). Not to mention the difficulty of replacing each character with its Unicode numeric equivalent. Escape codes were really only meant for short passages of text.

But...if this is where you are, then you can try either the old Mozilla Composer, which converted anything you typed into escape codes, or maybe another utility. Truthfully, it is extremely difficult to find a tool that converts raw UTF-8 text to HTML entity codes these days.
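For readers who have Python available, the conversion can be sketched in a few lines. This is a minimal illustration, not a full utility: the function name is hypothetical, it leaves ASCII untouched (so characters like & and < are not escaped), and it assumes the text is already a correctly decoded string.

```python
import html


def to_hex_references(text: str) -> str:
    """Replace every non-ASCII character with a hexadecimal escape code.

    ASCII characters pass through unchanged, including & and <.
    """
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):04X};"
        for ch in text
    )


if __name__ == "__main__":
    escaped = to_hex_references("Русский")
    print(escaped)
    # prints &#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;
    print(html.unescape(escaped))  # prints Русский
```

The standard library's html.unescape reverses the process, which is handy if the content later has to move to a system that expects raw UTF-8.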

So I emphasize that this is a rare problem and should be easily corrected by your server admin...and if it's a personal Web site, you may want to think about alternative providers.

Or you could try the ultimate last resort - attack of the angry Unicode expert.

Post Script (Apr 3, 2009)

A student in a recent seminar pointed out a site which does convert a character to a decimal code reference at http://www-atm.physics.ox.ac.uk/user/iwi/charmap.html (from Alan Iwi at the Rutherford Lab at Oxford). Just enter or paste the character and click the Make HTML button to see a decimal entity code. You can enter an entire string of characters.
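A similar decimal conversion is also possible offline with Python's built-in "xmlcharrefreplace" error handler. This is a sketch of that standard-library feature, not a description of how the site above works; the function name is hypothetical.

```python
def to_decimal_references(text: str) -> str:
    """Replace every non-ASCII character with a decimal escape code."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_decimal_references("Русский"))
    # prints &#1056;&#1091;&#1089;&#1089;&#1082;&#1080;&#1081;
```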

Where Have All the Escape Codes Gone?

I'm currently preparing a seminar on Unicode and I was struck by how far Unicode implementation, especially in terms of raw Unicode text, has come in the past 4 years. Some of the warnings I used to present in 2000 or even in 2004 seem almost quaint now.

For instance when Mac OS X first came out, the older applications were not set up to take advantage of the Mac Unicode utilities, such as the U.S. Extended keyboards. I used to have to specify which applications could work with Unicode and which couldn't do it. But yesterday I realized that I couldn't find any old applications on my machine that didn't work correctly. What a difference that makes.

The same is true on the Windows side. If you get the latest version of most applications, the chances are that Unicode support is there - even for raw text editors.

Similarly, I recall when many HTML editors converted any non-English character to a numeric HTML entity, but now most applications are set to work with real UTF-8 text embedded in HTML tags. This is much easier to edit and crucial for being able to transfer data between the Web and other XML resources.

Russian, Chinese and Greek data are being treated as just "text" and not as a special case that programmers need to agonize over. There are still plenty of issues to be worked out, but it's good to appreciate progress when it's made.

Some Arabic Script Fonts

One set of fonts I didn't have a chance to include in last week's font wrap is Arabic script fonts. When discussing Arabic script fonts, it's important to note which language or region a font has been designed for, because the requirements for writing Arabic versus Persian versus Urdu and so forth can vary (just as they do for English vs. German).

I should also comment that I am not an Arabic script typography expert, but these fonts are recommended by reliable sources. So with that warning:

Arabic

Persian (Iran)

Urdu & Sindhi (Pakistan/India)

Pashto (Afghanistan)

Uighur and Central Asian Turkic

I also generally recommend Gallery of Unicode fonts which is a Unicode font directory. The Web master very helpfully splits many of the fonts into separate language pages.

SALRC - South Asia Language Resource Center

A great resource from my library is the South Asia Language Resource Center out of the University of Chicago. They include information about the major scripts of India and neighboring countries including font information (with samples).

Address is http://salrc.uchicago.edu/.

Some New LGC Fonts

I was checking the font repositories and found some new fonts that might be of interest to the linguistics/medieval/math crowd. But before that, I would like to define a new term: LGC = Latin/Greek/Cyrillic font, which refers to any font that includes the Latin, Latin-A, Cyrillic and Greek blocks and a few math symbols. So many fonts include all three scripts that it's a handy acronym for me.

One caveat is that Basic LGC fonts don't necessarily include ALL LGC characters. For instance a font like Verdana may be missing IPA extensions, Cyrillic extensions and Greek extensions. The good news is that more fonts including the special characters are becoming available, and we're getting freeware large fonts to fill in typographical needs like small caps and narrow characters.
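The difference between basic LGC coverage and the extension blocks can be made concrete with a short Python sketch. The block ranges below come from the Unicode Standard, but which blocks count as "basic LGC" is my own reading of the definition above, and the function names are hypothetical. The sketch says nothing about what a particular font actually contains.

```python
from typing import List, Optional

# Assumed "basic LGC" blocks (an interpretation, not an official list).
BASIC_BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Greek and Coptic", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x04FF),
]

# Extension blocks of the kind a basic LGC font may leave out.
EXTENSION_BLOCKS = [
    ("IPA Extensions", 0x0250, 0x02AF),
    ("Cyrillic Supplement", 0x0500, 0x052F),
    ("Greek Extended", 0x1F00, 0x1FFF),
]


def block_of(ch: str) -> Optional[str]:
    """Return the name of the listed block containing ch, or None."""
    for name, low, high in BASIC_BLOCKS + EXTENSION_BLOCKS:
        if low <= ord(ch) <= high:
            return name
    return None


def outside_basic_lgc(text: str) -> List[str]:
    """List the characters in text that fall outside the basic blocks."""
    return [
        ch for ch in text
        if not any(low <= ord(ch) <= high for _, low, high in BASIC_BLOCKS)
    ]


if __name__ == "__main__":
    print(block_of("ɛ"))                 # prints IPA Extensions
    print(outside_basic_lgc("fonɛtik"))  # prints ['ɛ']
```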

  • Arev Sans - A sans serif font with excellent LGC coverage including Latin/Greek/Cyrillic extensions, a good inventory of math symbols and other symbols/punctuation.
  • Linux Libertine - A family of OTF fonts with separate fonts for bold, italics, small caps. Good LGC coverage. It's also good to have a small caps font for Greek and Cyrillic, but it seems to be missing some of the phonetic characters.
  • Marin Font - This font is notable for being a little narrower than others, which is a nice change, and it has glyphs for the Cherokee block and the Canadian Aboriginal Syllabics. It also includes a separate Small Caps font.
  • Roman Cyrillic Std, BukyVede, KlimentStd from Kodeks German Medieval Slavicists Server - BukyVede in particular includes a lot of historical Cyrillic characters and includes the Glagolitic characters. Kliment and Roman Cyrillic are LGC fonts which include other variations of the Glagolitic block. Latin and Greek are also included.
  • Quivira - I discussed this a few entries ago, but to repeat: Big font. Lots of scripts including LGC, Coptic, Armenian, Hebrew, Georgian, Thai, Baybayin, Runic, Braille, some Indic...
  • Sophia Nubian - a new Coptic and Nubian script font from SIL with Keyman keyboard utility (Windows). A Mac Coptic Unicode Keyboard is also available.

I should mention that SIL is an excellent source of freeware fonts for undersupported scripts. Here's a list of the SIL fonts.

There are always more fonts out there, so I recommend a periodic check of Gallery of Unicode Fonts and Alan Wood's Font list. You never know what you might find.

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
