Recently in Encoding Theory Category

A Unicode TED Talk


Johannes Bergerhausen recently gave a TED Talk in Vienna on Unicode. It's a good summary of key issues for those who don't know about bits or bytes.

Looking forward to the day when you can "send text messages in Cuneiform".


Explaining UTF-8


The UTF-8 encoding is not a straight encoding of Unicode code points, but rather a "compromise character encoding" which allows files containing only ASCII characters to stay the same size as ASCII, while still being able to include any Unicode code point, no matter how many bytes it requires.

If this is sounding a bit confusing, you may want to try this Game Dev article on UTF-8. It's still under review, but it does step through some parts of the conversion from a Unicode code point to a UTF-8 representation.
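If you'd rather see the conversion in action, here is a quick Python sketch of my own (not from the article above) that packs a code point into UTF-8 octets by hand and checks the result against Python's built-in encoder:

```python
def utf8_encode(cp):
    """Pack a single Unicode code point into UTF-8 bytes by hand."""
    if cp < 0x80:                      # 1 byte:  0xxxxxxx (plain ASCII)
        return bytes([cp])
    elif cp < 0x800:                   # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    elif cp < 0x10000:                 # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    else:                              # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

# ASCII stays one byte; everything else grows only as needed
print(utf8_encode(0x4C))                          # b'L'
print(utf8_encode(0x03B8).hex())                  # theta: two bytes, 'ceb8'
assert utf8_encode(0x4C) == "L".encode("utf-8")
assert utf8_encode(0x03B8) == "θ".encode("utf-8")
```

Note how the ASCII branch is just the code point itself, which is exactly why pure-ASCII files don't grow at all under UTF-8.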


Non-ASCII URLs to Test


Recently, there has been a lot of progress in implementing Web addresses with non-English characters including different scripts. This ability is courtesy of the IDN (Internationalized Domain Name) technology.

Some test links are given below, but they may not work in older browsers. Note also that many have ASCII-only equivalents or redirect to a site with an older ASCII-only URL.

Wikipedia is also a good source of non-ASCII URLs. Translated articles on Elizabeth I of England can be found at these locations.

For the latest news on IDN implementation, see the IDN page on the ICANN Web site.


Which "UTF" Do I Use? (Updated)


I have rewritten this entry to make some aspects of the theory a little more clear and to fix some errors.

Why flavors?

Although Unicode generally assigns one code point to one character, text data is not generally stored or transmitted in that raw form. One difference between an earlier encoding like Latin-1 and Unicode is the number of bytes potentially required. In hexadecimal notation, Latin-1 characters range from 00-FF (i.e. two hexadecimal digits). In computer memory terms, two hexadecimal digits fit in one byte (where 1 byte = 8 bits), so each character in Latin-1 requires one byte of memory.

In contrast, modern Unicode code points range from 0-10FFFF (i.e. up to six hexadecimal digits), which means that each character could require 3 bytes of memory (or in practice 4 bytes, since memory is typically allocated in powers of two). Text file sizes could potentially quadruple...unless a more compact encoding scheme is used. This is the origin of the different "flavors" of Unicode.
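To make the size difference concrete, here's a quick check of my own in Python, counting how many bytes the same five-character ASCII string occupies under each encoding:

```python
# Byte counts for one short ASCII string under several encodings
text = "Hello"
for enc in ("latin-1", "utf-8", "utf-16-be", "utf-32-be"):
    print(enc, len(text.encode(enc)))
# latin-1 and utf-8: 5 bytes; utf-16: 10 bytes; utf-32: 20 bytes
```

The fixed-width flavors pay the full 2 or 4 bytes for every character, even plain ASCII, which is exactly the quadrupling problem described above.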

Unicode comes in a variety of flavors depending on how many bytes are used for each character and in which order those bytes arrive. For most online uses, UTF-8 is the safest, but here's a short summary of the other types of Unicode out there.


UTF-16 (UCS-2)

The very earliest versions of Unicode specified a 16-bit/2-byte encoding (2^16 = 65,536 possible characters). The highest number would be #FFFF. Within Unicode, the four hexadecimal digits are organized into a block (the first two digits), then a codepoint within that block.

This capital L (Hexadecimal #4C or #x4C) is in block 00 and codepoint 4C or 004C.
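The block/codepoint split is just the high byte and low byte of the number; a two-line Python illustration (mine, not the blog's):

```python
# Split a 16-bit Unicode number into its block and codepoint bytes
cp = 0x004C                  # capital L
block = cp >> 8              # first two hex digits: 0x00
point = cp & 0xFF            # last two hex digits:  0x4C
print(f"block {block:02X}, codepoint {point:02X}")  # block 00, codepoint 4C
```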

This seems simple enough, but systems differed over whether the block byte or the codepoint byte should come first:

UTF-16: Little Endian vs. Big Endian

Some systems, notably Intel-based systems, order each Unicode number as codepoint (little end) first, then block (big end). Others, notably Unix systems, order it as block first, then codepoint.

Returning to the capital L (#4C), there are two UTF-16 ways to represent this:

  • Big Endian (UTF-16BE) : 00.4C = L
  • Little Endian (UTF-16LE) : 4C.00 = L

Software packages, particularly databases and text editors for programmers, can switch between the two, but it can be a hassle. UTF-8 is more consistent across systems, so it is a little more resilient.

Note: In theory, UTF-16 files begin with a special BOM (Byte Order Mark) which specifies Little Endian or Big Endian.
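You can watch both byte orders, and the BOM, in a few lines of Python (my own illustration; the exact BOM you see from the generic "utf-16" codec depends on your machine's native byte order):

```python
# One character, two UTF-16 byte orders
print("L".encode("utf-16-be").hex())   # '004c' -- block byte first
print("L".encode("utf-16-le").hex())   # '4c00' -- codepoint byte first

# The generic "utf-16" codec prepends a BOM so a reader can tell which is which
bom = "L".encode("utf-16")[:2]
print(bom.hex())   # 'fffe' on little-endian machines, 'feff' on big-endian
```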

UTF-32 (UCS-4)

At some point, the Unicode Consortium realized that even 65,000+ characters would not be enough, so provisions were made to allow for extra digits in the encoding scheme. The new digits designate "planes" (alongside "blocks" and "codepoints"). The original 2^16 characters are now Plane 0, with additional planes being added for other scripts as needed. At this point, the supplementary planes hold mostly ancient scripts (Plane 1) and rarely used Chinese characters (Plane 2).

In any case, to represent all the planes, blocks and codepoints, you need to add extra digits to the Unicode file. Thus capital L (#4C) becomes 0000004C in UTF-32. As you can see, unless you are dealing with ancient scripts or archaic Chinese texts, you are adding extra "plane" information you may not need and adding more memory to your files. For this reason, UTF-32 is almost never used in practice. However, if it were, you could specify an LE vs. BE version of UTF-32.
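A quick Python check of my own shows the four-byte padding (the first Plane 1 code point, U+10000, is used just as a convenient example):

```python
# UTF-32 spends four bytes on every character, plane digits included
print("L".encode("utf-32-be").hex())           # '0000004c'
print(chr(0x10000).encode("utf-32-be").hex())  # '00010000' -- a Plane 1 character
```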

UTF-8 (Unicode Transformation Format)

The difference between UTF-8 and UTF-16/UTF-32 is that UTF-8 uses an algorithm to translate any Unicode code point into a series of "octets" (bytes). Character #004C ("L") can be stripped down to a single byte, 4C, just as in ASCII. If you use primarily English or Western languages, file sizes may be smaller in UTF-8 than in UTF-16, and ASCII or Latin-1 data will usually be easier to integrate into Unicode.

The other advantage of UTF-8 is that the algorithm allows the data to be less corruptible over the Internet. Thus UTF-8 is recommended for e-mail, Web pages and other online files. Some databases and programming languages may use UTF-16 instead.

Similar transformations are also applied within UTF-16 to convert codepoints U+10000 and higher into "surrogate pairs" of two 16-bit units. However, not every software application supports this, so some systems may have problems processing code points beyond U+FFFF.
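For the curious, here is my own sketch of the surrogate-pair arithmetic in Python, checked against the built-in UTF-16 encoder:

```python
# Surrogate-pair math for a code point beyond Plane 0
cp = 0x10000                 # first code point past U+FFFF
v = cp - 0x10000             # offset into the supplementary planes
high = 0xD800 | (v >> 10)    # high (lead) surrogate: top 10 bits
low = 0xDC00 | (v & 0x3FF)   # low (trail) surrogate: bottom 10 bits
print(hex(high), hex(low))   # 0xd800 0xdc00

# Python's UTF-16 encoder produces exactly this pair
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```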

Additional Links


Language Tagging and JAWS: How to return to English?



I am not seeing other reports of the JAWS quirk reported in this entry. It is based on hearsay from a JAWS user, although one who is fairly tech literate. Hopefully, the point is moot, but since information is so spotty, I am leaving this entry up for now.

Original Article

Unicode and accessibility should be natural partners, but sometimes the tools get a little confused. Take language tagging for instance....

Language tagging identifies the language of a text to search engines, databases and significantly, screen reader tools used by those with severe visual impairments. The newer screen readers can switch pronunciation dictionaries if they encounter a language tag. Language tagging syntax, as recommended by the W3C for HTML 4 works as follows:

  1. Include the primary language tag for the document in the initial HTML tag. For example, an English document would be tagged as <html lang="en">
  2. Tag any passages in a second language individually. For instance, a paragraph in French would be <p lang="fr"> while a word or phrase would be <span lang="fr">.

The idea though is that once you exit the passage tagged with the second language code, you should assume that the language is back to the primary language. Unfortunately, a comment I heard from a JAWS user was something like "The lang tag works, but developers forget to switch back to English." When I asked him for details, he indicated that an English text with a Spanish word makes the switch in pronunciation engines, but then remains in Spanish mode for the rest of the passage.

What I interpret from this is that the JAWS developers are assuming that there should be a SECOND LANG tag to return the document back to the primary language. So we have two syntax schemes:

What W3C Expects

Text: The French name for "The United States" is Les États Unis, not Le United States.

Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis.</i> not <i>Le United States.</i></p>

Note that the only LANG tag is the one for French Les États Unis with the assumption that the document contains a <html lang="en"> specification which applies to the entire document.

What JAWS Wants

As I indicated earlier, it appears that if this code is parsed by the JAWS screen reader, it would remain in French mode even after Les États Unis was read. I am not sure what the syntax would be, but I'm guessing something like this:

Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis.</i> <span lang="en">not <i>Le United States.</i></span></p>

Now there is a second English LANG tag whose domain is the rest of the sentence. I am assuming that JAWS would remain set as English thereafter. In this scenario, I am also guessing that what the JAWS programmers did was to set the switch in pronunciation engines to be triggered ONLY by a language tag - which would explain why it didn't switch back to English in the previous code.

What the W3C is expecting though is that tools should be sensitive to domains of language tags and know to switch back to English when the appropriate end tag is encountered. It's more difficult to program, but it CAN be done.

The Coding Dilemma

So here's the coding dilemma developers face: Do they code to the declared and accepted W3C standard or do they code for JAWS? Of course, the JAWS community would like developers to code for JAWS (after all the person I was speaking with was convinced the problem was developer cluelessness, not bad JAWS standards implementation).

The problem is that this approach perpetuates the bloated code that standards were supposed to streamline. Essentially, you are coding for a specific screen reader, just like those developers who code only for Internet Explorer. It's an appealing short-term solution, but in the long run counter-productive. This is why even WebAIM (the Web accessibility group at Utah State University) recommends NOT coding for the quirks of JAWS or other user agents.

Besides, we can always hope this quirk will be fixed in a future release of JAWS.

Did I Mention Unicode Above 255?

I've also heard rumors that JAWS may read some Unicode characters above 255 as just the Unicode code point. Thus ∀ ("for all", the upside-down A symbol) might be read as "2200" or "U+2200". There are special .sbl symbol files you can install in JAWS, but it would be nice if the process were a little more transparent. I feel it's the equivalent of Apple or Microsoft not providing any default fonts for non-Western European languages...


Accessibility and Unicode


Here at Penn State my duties include being an accessibility guru as well as being a Unicode guru, and not too surprisingly, Unicode can enhance accessibility in some situations. And not just in the abstract "standards enhance accessibility" but more concretely as in:

It's An Encoded Character, Not a Font Trick

We all know that relying on fonts to display characters (e.g. the use of the Symbol font for Greek characters) is a Bad, Bad Idea, but it's even worse for a screen reader. Consider the expression θ = 2π. In the old Symbol font days, this might have been coded as:

<p> <font face="Symbol">q = 2p</font></p>

And guess what the screen reader would read - Q equals 2 P. Since the screen reader is essentially "font blind", the underlying text is what is read. Hence the Unicode correct code below is preferred:

<p> θ = 2π</p>

or

<p> &theta; = 2&pi;</p>
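Both forms decode to the same encoded characters, which is what the screen reader actually sees. A quick check of my own using Python's standard library:

```python
import html
import unicodedata

# Named entities resolve to the same real Unicode characters
s = html.unescape("&theta; = 2&pi;")
print(s)                         # θ = 2π
print(unicodedata.name(s[0]))    # GREEK SMALL LETTER THETA
```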

If you think about it, the screen reader is a good tool for conceptualizing how characters (and their variants) may function semantically in different contexts.

I should mention that screen readers can get confused with a Unicode character if it can't recognize it, but that's more of a dictionary problem than a Unicode problem. For Jaws, it is possible to install .sbl pronunciation files to increase the character repertoire, especially for math and science.

It's Text, Not An Image

Perhaps the biggest advantage for Unicode though is that it allows characters that used to be embedded in images to be just plain text. For instance you could embed the following equation for the volume of a sphere:


V = 4/3πr³


[Image: AreaSphere.png, alt text "V = four thirds pi R cubed"]

Consider what happens though if a low-vision reader (or a middle-aged reader with decrepit eyesight) needs to zoom in on the text. As you will see in the screen capture below, the image will pixelate while the text remains crisp.

Zoomed Text vs. Zoomed Image

[Screen capture: enlarged formula - the text remains crisper than the image]

When you combine Unicode with creative CSS, you can see the possibilities for replacing images, including buttons with text. Not only is this more accessible, but it also results in smaller file sizes and is easier to edit.

Hearing Impaired Users

Unicode is actually important for these users because they need to read text captions or transcripts for video and audio. Once you get beyond basic English (e.g. Spanish subtitles)...well you know Unicode will be important.

Motion Impaired Users

For these users, the issue probably isn't so much reading text as being able to input it - which is the job of developers of operating systems and software. For motion impaired users, a good generalization is that keyboard access is better than using the mouse which requires a little more hand control. In the past I've commented on usability of various inputting devices, but since most do rely on key strokes, there are really no major complaints here.

One audience I didn't touch on was color-deficient vision, but except possibly for the Aztec script (which isn't even in Unicode yet), it's not too much of an issue.


CLDR = Unicode Common Locale Data Repository


One aspect of internationalization (i18n) I often skip over is localization (l10n), or customizing text (e.g. spelling, transliteration conventions, prices in the correct currency, date stamps in the correct format, etc.). If localization is something you need to have on a Web site, you may want to start at the Unicode CLDR page, which compiles a variety of charts and some guides.

There are other resources available including several from IBM listed below:

These resources are generally aimed at the programmer audience, but there are interesting nuggets for the non-programmer as well.


Some Recommended Books


Although the vast majority of my Unicode knowledge has come courtesy of the Internet, there are some print resources that I am beginning to find very useful, so I thought I would add some quick notes. I would add that the audience for the Unicode books is generally programmers needing to implement Unicode support. These books really don't tell you how to type an accented letter.

Unicode Demystified

If you're a programmer who's been handed a foreign language project and really aren't sure where to go next, I think this is a good place to start. This book by Richard Gillam dates from 2002, but is still a valuable resource because it explains the basic concepts behind Unicode in fairly straightforward language.

This book covers the major world scripts including Latin, Cyrillic, Greek, Arabic, Hebrew, East Asian scripts, major South Asian scripts, Cherokee, Canadian Aboriginal and so forth. These scripts generally cover most of the major typographical and sorting issues you are likely to encounter, so it remains very handy for the newcomer. However, if your script is a little more exotic (or newer to Unicode), you will probably need to find alternate resources.


Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard (Paperback)

Author: Richard Gillam
Year: 2002
ISBN (10/13): 0201700522 / 978-0201700527

Unicode Explained

This is from 2006, so it's more recent, and it's by Jukka Korpela who is good at explaining concepts behind encoding (as well as accessibility). Unfortunately, I haven't had a chance to acquire it yet. I will be looking forward to taking a look at this.


Unicode Explained (Paperback)

Author: Jukka Korpela
Year: 2006
ISBN (10/13): 059610121X / 978-0596101213

The Unicode Standard

For each of the major Unicode Standards (e.g. 4.0, 5.0), the Unicode Consortium releases a hard-bound reference of the actual standard. If you're semi-serious about Unicode programming, I would recommend picking up at least one version of the standard and then updating over time. It does gather everything in one place...at least for the moment.

The first part explains the standard including issues of direction (LTR/RTL), casing, ligature, different flavors and so forth. There is also an explanation for each script. The last section prints the character list block by block, including the East Asian CJK characters which are normally referenced with just a database online.

I think the reference aspect is the most important benefit of this book. Although there are sections for each script, this work tends to assume that you are fairly familiar with whatever script you are working with, and so devotes most of the text to technical explanations. Fortunately, I think the technical explanations and examples are the core examples that a programmer would need.

Although most of the content is replicated in PDF on the Web site, it can be handy to have the actual book as a baseline reference. For one thing, the charts are of high quality print, allowing you to see minute typographic details. For another thing, you never know where you will need to work on a project without the Internet....


Unicode Standard, Version 5.0, The (5th Edition) (Hardcover)

Author: Unicode Consortium
Year: 2006
ISBN (10/13): 0321480910 / 978-0321480910


Tutorial on RTL/LTR & BIDI in Arabic/Hebrew


The tutorial is technically titled Creating SVG Tiny Pages in Arabic, Hebrew and other Right-to-Left Scripts, but it actually provides an excellent explanation of how Unicode specifies text direction and how you need to mark both RTL (right-to-left) and LTR (left-to-right) runs in a BIDI (bidirectional) Middle Eastern text that includes European words.


7 Things You Should Know About Unicode


If you know about the Educause 7 Things You Should Know About... series, then you know the premise: identify the seven most important things about any technology.

So here is my spin on what "you should know" (or what someone not familiar with Unicode might need to know).

1. What is it?

Unicode is an encoding scheme. Each character in each script is assigned a number (because computers track everything by number), allowing literally any character from any script to be represented. Unicode does this by assigning each script its own block of numbers.

Unicode began in 1991 and encoded the most commonly used scripts first, such as the Latin alphabet, Cyrillic, Chinese, Japanese, Arabic, Greek, Hebrew, Devanagari and others. All major world scripts are now covered, as well as many minority and ancient scripts.
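Python will show you these numeric assignments directly; here's a small illustration of mine with one character each from four scripts:

```python
# Every character carries a Unicode number, whatever the script
for ch in "Aθ中م":
    print(f"U+{ord(ch):04X} {ch}")
# U+0041 A  (Latin)
# U+03B8 θ  (Greek)
# U+4E2D 中 (Chinese)
# U+0645 م  (Arabic)
```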

2. Who's doing it?

Unicode encoding has been incorporated into Windows (since Windows NT), Macintosh (since OS X) and new versions of Linux/Unix. Applications supporting Unicode include newer versions of Adobe applications, Microsoft Office, the Apple iLife/iWork series, FileMaker, EndNote, Google, GoogleDocs, Twitter, Zotero, blogs, Facebook and many others.

3. How does it work?

To read Unicode text, a user needs to have the correct Unicode font installed. Both Apple and Microsoft provide well-stocked fonts for free, but not every character is covered. Fortunately many freeware fonts are available.

To enter Unicode text, users must activate keyboard utilities or use special escape codes to enter characters for the appropriate script. Again Microsoft and Apple provide a lot of built-in utilities, but additional ones are also available online, many as freeware.

4. Why is it significant?

Consistent encoding allows users to exchange text consistently and for font developers to develop new fonts with a wide range of characters in a consistent manner. When properly implemented, a Mac user can read a Greek text file created on a Windows machine with minimal adjustment.

5. What are the downsides?

One is that older programs developed before Unicode may need to be retrofitted if they are meant to be used by a global audience. Programmers need to learn new techniques in order to take advantage of Unicode encoding.

The other remaining problem is that Unicode implementation on the user end is still confusing. Users working with languages other than English need to either activate/install special utilities or memorize a series of special codes. Methods of inputting text also vary from software to software. A lot of tech-savviness is required to take full advantage of Unicode.

6. Where is it going?

The goal is for every script, even those for ancient languages, to be encoded within Unicode. This will not only enable new technologies to be used in any language, but will allow texts from around the world to be digitized in a common format. Unicode support for major languages has arrived, but support for many lesser-known scripts and quirky cases in major scripts still needs to be implemented.

7. What are the implications for teaching and learning?

Unicode will

  • Simplify the display of non-English texts in foreign language courses and courses taught in non-English speaking areas
  • Standardize the display of mathematical and technical symbols
  • Allow non-English speaking communities to write in their native scripts instead of transliterating text in the Roman alphabet
  • Expand the typographical repertoire of font designers
  • And...if you're a pioneer...Unicode will introduce you to the joys of converting between decimal and hexadecimal values


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line.
