August 2011 Archives

Got Double Hyphens from Word?


Unicode hasn't been part of my life enough recently, but it did emerge in a very unexpected way this week during a calendar upgrade.

One of the conversion tasks was for us to add group e-mail addresses so we could share calendars with each other efficiently. But when I tried to copy and paste one, I got a "not found" error. Here is one of these addresses (altered for security reasons):

Can you spot the problem? (HINT: Try cutting and pasting into a text file.)

Given up? The problem is the hyphen. In the right font, you will see that it's not a hyphen (U+002D or ASCII #45), but the more elegant and slightly longer en dash, which is U+2013 (not in ASCII). As many of you know, many databases still match text character by character, so a hyphen is just not the same as an en dash. This means searching is a FAIL.
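To make the mismatch concrete, here is a minimal Python sketch. The address is hypothetical (the real one is not reproduced here), and `describe` is just an illustrative helper name; the point is only that the two strings look alike but compare as different.

```python
import unicodedata

# Hypothetical addresses for illustration; only the dash character differs.
typed_address = "team-calendar@example.org"        # U+002D HYPHEN-MINUS
pasted_address = "team\u2013calendar@example.org"  # U+2013 EN DASH


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# An exact-match lookup treats these as two different strings.
print(typed_address == pasted_address)

# Show what each dash really is.
print(describe("-"))
print(describe("\u2013"))
```

Any system that does an exact string lookup would behave like the `==` comparison here, which would explain a "not found" result.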

How did the en dash get in there if it's outside of ASCII? My guess is that it's a result of an auto-correct feature in Word, which makes some formatting tweaks to enhance visual appeal. One is to change plain hyphens into the slightly longer en dash (more favored by typographers).

Another common change is to convert plain straight quotes (" at U+0022 or ASCII #34) to "Smart Quotes" like (“ at U+201C) and (” at U+201D). Copying HTML code attributes from Word can be similarly dangerous since HTML recognizes plain quotes, but NOT fancy double quotes. Most of the time, the change does nothing, but when it comes to interacting with some systems, the reformatting makes a difference in a very annoying way.
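If you have to paste text from Word into a system that expects plain ASCII punctuation, one possible cleanup step is a simple character translation. This is only a sketch: the function name `to_plain_punctuation` is made up for illustration, and the single curly quotes (U+2018, U+2019) are my own addition to the characters discussed above.

```python
# Map typographic characters back to their plain ASCII counterparts.
# The single curly quotes are an assumption beyond what the post discusses.
TYPOGRAPHIC_TO_ASCII = str.maketrans({
    "\u2013": "-",   # en dash -> hyphen
    "\u201C": '"',   # left double smart quote -> straight quote
    "\u201D": '"',   # right double smart quote -> straight quote
    "\u2018": "'",   # left single smart quote -> apostrophe
    "\u2019": "'",   # right single smart quote -> apostrophe
})


def to_plain_punctuation(text: str) -> str:
    """Replace Word-style dashes and quotes with plain ASCII punctuation."""
    return text.translate(TYPOGRAPHIC_TO_ASCII)


# HTML attribute copied out of Word with smart quotes around the value.
broken_html = "<a href=\u201Cpage.html\u201D>link</a>"
print(to_plain_punctuation(broken_html))
```

This is deliberately blunt: it is fine for e-mail addresses and code, but you would not want to run it over prose where the en dashes and smart quotes are intentional.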

How to catch it? In some cases, you can change the font, but many fonts make the hyphen and en dash appear identical (Arggh!). Which leaves the old standby (test, test, test) plus some Unicode awareness (which is increasing among programmers).
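When the font won't help, a short script can do the looking for you. The sketch below flags every character outside ASCII in a string, along with its position and Unicode name. The function name `find_non_ascii` and the sample string are hypothetical.

```python
import unicodedata


def find_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """List (position, code point, Unicode name) for each non-ASCII character."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if ord(ch) > 127  # ASCII covers code points 0 through 127
    ]


# Hypothetical pasted value containing an en dash that looks like a hyphen.
for position, code_point, name in find_non_ascii("team\u2013calendar"):
    print(position, code_point, name)
```

An empty result means the string is pure ASCII, so whatever is causing the problem is something else.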


"Coming Soon to Unicode" Pipeline Table


The Unicode Consortium announced that they had created a Unicode "Pipeline Table" page of characters scheduled for future versions of Unicode.

The table is organized by projected UCS code point number, but the characters are in various stages of the proposal process. Although dates of acceptance to a particular stage are posted, the target future version is not listed. Although many specifications look complete, the Unicode Consortium does warn that they are subject to change.

If you are interested in entire script blocks (particularly Ancient and lesser-known Indian scripts) coming to Unicode, you can go to the Proposed New Script page. The caveat that "things are subject to change" also applies here.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (
