Software and Unicode: February 2009 Archives

When Apache and UTF-8 Fight


When you create a Web page with Unicode characters, it is recommended that you include the following character meta tag:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If the page is XHTML, the tag also needs a closing "/" before the final ">".

The idea behind this tag is to force the browser into the correct encoding and prevent the display of Roman character gibberish. Sometimes, though, you can post a properly formatted UTF-8 Web page (meta tag and all) and still see gibberish.

In this case the problem is not you, but the Web server, typically running Apache. If it's an American server, Apache is probably set up to ONLY deliver ISO-8859-1 encoding and, even though your file has the UTF-8 data in it, the server is announcing it as Latin 1 (hence the Latin 1 gibberish). The charset in the server's HTTP Content-Type header takes precedence over the meta tag in your page, which is why the tag alone does not help.

There are three possible solutions available when this happens:
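To make the gibberish concrete, here is a minimal Python sketch of what happens when UTF-8 bytes are read as Latin 1. The sample word and the function name `misread_as_latin1` are illustrative assumptions, not anything from a particular server or browser.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were Latin 1."""
    utf8_bytes = text.encode("utf-8")   # the bytes actually stored in the file
    return utf8_bytes.decode("latin-1")  # what a browser told "ISO-8859-1" sees

sample = "Привет"                        # illustrative Russian sample text
garbled = misread_as_latin1(sample)

# Each Cyrillic letter is two bytes in UTF-8, so six letters turn into
# twelve unrelated Latin 1 characters.
print(len(sample), len(garbled))

# The bytes themselves are undamaged: decoding them as UTF-8 restores the text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == sample)
```

The point of the round trip at the end is that nothing is wrong with the file itself; only the label the server attaches to it is wrong.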

Talk to Your Server Admin

And when you do, you can politely suggest changing the httpd.conf file as documented on Seapine Software (Apache's AddDefaultCharset directive is the usual place to look). You can also mention that most modern Web apps are set to serve UTF-8 data, including CMS programs such as Plone, Movable Type and Drupal. Others such as Facebook and Twitter support UTF-8 natively.

I believe this is what a Web service having this issue did recently.
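Before you talk to the admin, it can help to confirm what the server is actually declaring. The following is a rough Python sketch that reads the charset from a Content-Type header. The function names and the URL in the usage note are hypothetical.

```python
from email.message import Message
from urllib.request import urlopen

def charset_from_content_type(header_value):
    """Return the lower-cased charset named in a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

def server_charset(url):
    """Fetch a URL and return the charset the server declares for it, if any."""
    with urlopen(url) as response:
        # response.headers is an email-style message, so it can parse itself.
        return response.headers.get_content_charset()

# Parsing header values directly (no network access needed):
print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

Calling `server_charset("http://www.example.com/page.html")` on your own page (the URL here is a placeholder) would show whether the server is sending ISO-8859-1 for a file you saved as UTF-8.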

Use an .htaccess file to just configure specific directories and pages

If you're comfortable enough to mess around with changing your directory preferences, you can try this suggestion from Tex Texin about using AddType statements (for example, AddType 'text/html; charset=UTF-8' html).

The main proviso here is that an .htaccess file can do some serious damage unless you are careful. It's possible that you may not be able to upload one into your directory because of this, but it could be a good solution to suggest to a server admin if only your directory is affected and the rest of the site has to be encoded differently.

Unicode Escape Codes

If neither of the above solutions is available, then you can deliver the content within any encoding...if you encode the "exotic" characters as Unicode numeric escape codes.

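For example, if your site is Latin 1 but you need to present Russian content, you can replace each Cyrillic character in your code with its numeric escape code, so that the letter Я (U+042F) is written as &#1071;.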
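A minimal Python sketch of that conversion, assuming the page encoding is Latin 1. The function name `to_numeric_references` is made up for illustration. It relies on Python's built-in `xmlcharrefreplace` error handler, which only escapes characters the target encoding cannot represent.

```python
def to_numeric_references(text, page_encoding="latin-1"):
    """Replace characters the page encoding cannot hold with &#NNNN; codes."""
    encoded = text.encode(page_encoding, errors="xmlcharrefreplace")
    return encoded.decode(page_encoding)

# Cyrillic is outside Latin 1, so every letter becomes a decimal escape code.
print(to_numeric_references("\u042f"))        # the letter Я -> &#1071;
print(to_numeric_references("\u0414\u0430"))  # the word Да -> &#1044;&#1072;
```

Characters that Latin 1 already covers, such as é, pass through unchanged, so only the "exotic" characters are escaped.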




As you can imagine, this IS an absolute last-resort solution. If you ever need to transfer content between systems, you will have many more problems with escape codes: HTML named entities such as &eacute; are not defined in plain XML, and Microsoft Word does not interpret escape codes at all. Not to mention the difficulty of replacing each character with its Unicode numeric equivalent. Escape codes were really only meant for short passages of text.

But...if this is where you are, then you can try either the old Mozilla Composer, which converted anything you typed into escape codes, or maybe you can try another utility. Truthfully, it is extremely difficult to find a tool that converts raw UTF-8 text to HTML entity codes these days.

So I emphasize that this is a rare problem and should be easily corrected by your server admin...and if it's a personal Web site, you may want to think about alternative providers.

Or you could try the ultimate last resort - attack of the angry Unicode expert.

Post Script (Apr 3, 2009)

A student in a recent seminar pointed out a site (from Alan Iwi at the Rutherford Lab at Oxford) that converts a character to a decimal code reference. Just enter or paste the character and click the Make HTML button to see a decimal entity code. You can also enter an entire string of characters.


Where Have All the Escape Codes Gone?


I'm currently preparing a seminar on Unicode and I was struck by how far Unicode implementation, especially in terms of raw Unicode text, has come in the past 4 years. Some of the warnings I used to present in 2000 or even in 2004 seem almost quaint now.

For instance when Mac OS X first came out, the older applications were not set up to take advantage of the Mac Unicode utilities, such as the U.S. Extended keyboards. I used to have to specify which applications could work with Unicode and which couldn't do it. But yesterday I realized that I couldn't find any old applications on my machine that didn't work correctly. What a difference that makes.

The same is true on the Windows side. If you get the latest version of most applications, the chances are that Unicode support is there - even for raw text editors.

Similarly, I recall when many HTML editors converted any non-English character to a numeric HTML entity, but now most applications are set to work with real UTF-8 text embedded in HTML tags. This is much easier to edit and crucial for being able to transfer data between the Web and other XML resources.
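If you inherit an old page full of numeric entities, going back to real text is straightforward. This is a rough Python sketch with a hypothetical function name. It deliberately leaves ASCII references such as &#60; alone so that the markup itself is not disturbed.

```python
import re

# Matches decimal (&#1071;) and hexadecimal (&#x42F;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def numeric_references_to_text(markup):
    """Turn non-ASCII numeric character references back into real characters."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep ASCII references (e.g. &#60; for "<"), surrogates, and
        # out-of-range values exactly as they were written.
        if (code_point < 128 or code_point > 0x10FFFF
                or 0xD800 <= code_point <= 0xDFFF):
            return match.group(0)
        return chr(code_point)
    return _NUMERIC_REF.sub(replace, markup)

legacy = "<p>&#1044;&#1072; &amp; &#60;</p>"
modern = numeric_references_to_text(legacy)
print(modern == "<p>\u0414\u0430 &amp; &#60;</p>")
```

The result still has to be saved as UTF-8 (and served as UTF-8) for the real characters to display correctly.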

Russian, Chinese and Greek data are being treated as just "text" and not as a special case that programmers need to agonize over. There are still plenty of issues to be worked out, but it's good to appreciate progress when it's made.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line.
