Recently in Encoding Theory Category
Here at Penn State my duties include being an accessibility guru as well as being a Unicode guru, and not too surprisingly, Unicode can enhance accessibility in some situations. And not just in the abstract "standards enhance accessibility" sense, but more concretely, as in:
It's An Encoded Character, Not a Font Trick
We all know that relying on fonts to display characters (e.g. the use of the Symbol font for Greek characters) is a Bad, Bad Idea, but it's even worse for a screen reader. Consider the expression θ = 2π. In the old Symbol font days, this might have been coded as:
<p> <font face="Symbol">q = 2p</font></p>
And guess what the screen reader would read - Q equals 2 P. Since the screen reader is essentially "font blind", the underlying text is what is read. Hence the Unicode correct code below is preferred:
<p> θ = 2π</p>
-OR-
<p> &theta; = 2&pi;</p>
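To see what a "font blind" tool actually receives, it can help to list the characters stored in each version of the expression. The sketch below uses Python's standard `unicodedata` module; the variable and function names are my own, chosen just for illustration.

```python
import unicodedata

symbol_font_text = "q = 2p"          # what is stored when the Symbol font does the work
unicode_text = "\u03b8 = 2\u03c0"    # what is stored when real Greek characters are used


def describe(text):
    """Return (code point, official Unicode name) for each non-space character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if not ch.isspace()
    ]


for label, text in (("Symbol font", symbol_font_text), ("Unicode", unicode_text)):
    print(label)
    for code_point, name in describe(text):
        print("  ", code_point, name)
```

The Symbol font version reports a Latin Q and a Latin P, which is exactly what the screen reader announces; only the Unicode version carries theta and pi in the text itself.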
If you think about it, the screen reader is a good tool for conceptualizing how characters (and their variants) may function semantically in different contexts.
I should mention that a screen reader can get confused by a Unicode character it doesn't recognize, but that's more of a dictionary problem than a Unicode problem. For JAWS, it is possible to install .sbl pronunciation files to increase the character repertoire, especially for math and science.
It's Text, Not An Image
Perhaps the biggest advantage for Unicode though is that it allows characters that used to be embedded in images to be just plain text. For instance you could embed the following equation for the volume of a sphere:
Text
V = 4/3πr³
Image
[Image: the same equation rendered as a graphic]
Consider what happens, though, if a low-vision reader (or a middle-aged reader with decrepit eyesight) needs to zoom in on the text. As you will see in the screen capture below, the image will pixelate while the text remains crisp.
Zoomed Text vs Zoomed Image
When you combine Unicode with creative CSS, you can see the possibilities for replacing images, including buttons, with text. Not only is this more accessible, but it also results in smaller file sizes and is easier to edit.
Hearing Impaired Users
Unicode is actually important for these users because they need to read text captions or transcripts for video and audio. Once you get beyond basic English (e.g. Spanish subtitles)...well you know Unicode will be important.
Motion Impaired Users
For these users, the issue probably isn't so much reading text as being able to input it, which is the job of developers of operating systems and software. For motion impaired users, a good generalization is that keyboard access is better than using the mouse, which requires a little more hand control. In the past I've commented on the usability of various input devices, but since most do rely on keystrokes, there are really no major complaints here.
One audience I didn't touch was color deficient vision, but except possibly for the Aztec script (which isn't even in Unicode yet)...it's not too much of an issue.
An aspect of internationalization (i18n) I often skip over is localization (l10n), or customizing text for a region (e.g. spelling, transliteration conventions, prices in the correct currency, date stamping in the correct format, etc.). If localization is something you need to have on a Web site, you may want to start at the Unicode CLDR page (http://cldr.unicode.org/), which compiles a variety of charts and some guides.
There are other resources available including several from IBM listed below:
These resources are generally aimed for the programmer audience, but there are interesting nuggets for the non-programmer as well.
Although the vast majority of my Unicode knowledge has come courtesy of the Internet, there are some print resources that I am beginning to find very useful, so I thought I would add some quick notes. I would add that the audience for the Unicode books is generally programmers needing to implement Unicode support. These books really don't tell you how to type an accented letter.
Unicode Demystified
If you're a programmer who's been handed a foreign language project and really aren't sure where to go next, I think this is a good place to start. This book by Richard Gillam dates from 2002, but is still a valuable resource because it explains the basic concepts behind Unicode in fairly straightforward language.
This book covers the major world scripts including Latin, Cyrillic, Greek, Arabic, Hebrew, East Asian scripts, major South Asian scripts, Cherokee, Canadian Aboriginal and so forth. These scripts generally cover most of the major typographical and sorting issues you are likely to encounter, so it remains very handy for the newcomer. However, if your script is a little more exotic (or newer to Unicode), you will probably need to find alternate resources.
Information
Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard (Paperback)
Author: Richard Gillam
Year: 2002
ISBN (10/13): 0201700522 / 978-0201700527
Unicode Explained
This is from 2006, so it's more recent, and it's by Jukka Korpela who is good at explaining concepts behind encoding (as well as accessibility). Unfortunately, I haven't had a chance to acquire it yet. I will be looking forward to taking a look at this.
Information
Author: Jukka Korpela
Year: 2006
ISBN (10/13): 059610121X / 978-0596101213
The Unicode Standard
For each of the major Unicode Standards (e.g. 4.0, 5.0), the Unicode Consortium releases a hard-bound reference of the actual standard. If you're semi-serious about Unicode programming, I would recommend picking up at least one version of the standard and then updating over time. It does gather everything in one place...at least for the moment.
The first part explains the standard including issues of direction (LTR/RTL), casing, ligatures, different flavors and so forth. There is also an explanation for each script. The last section prints the character list block by block, including the East Asian CJK characters which are normally referenced with just a database online.
I think the reference aspect is the most important benefit of this book. Although there are sections for each script, this work tends to assume that you are fairly familiar with whatever script you are working with and so devotes most of the text to technical explanations. Fortunately, I think the technical explanations and examples are the core ones that a programmer would need.
Although most of the content is replicated in PDF on the Web site, it can be handy to have the actual book as a baseline reference. For one thing, the charts are of high quality print, allowing you to see minute typographic details. For another thing, you never know where you will need to work on a project without the Internet....
Information
Unicode Standard, Version 5.0, The (5th Edition) (Hardcover)
Author: Unicode Consortium
Year: 2006
ISBN (10/13): 0321480910 / 978-0321480910
The tutorial is technically titled Creating SVG Tiny Pages in Arabic, Hebrew and other Right-to-Left Scripts, but it actually provides an excellent explanation of how Unicode specifies text direction and how you need to encode both RTL (right to left) and LTR (left to right) runs as BIDI (bidirectional) text when a Middle Eastern text includes European words.
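As a small illustration of how Unicode specifies direction, every character carries a bidirectional class that layout software consults. The sketch below reads those classes with Python's standard `unicodedata.bidirectional`; the sample characters and the helper name are my own choices, not something taken from the tutorial.

```python
import unicodedata

# Latin a, Hebrew alef, Arabic beh and a European digit
sample = "a" + "\u05D0" + "\u0628" + "1"


def bidi_classes(text):
    """Pair each character's code point with its Unicode bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```

Here "L" means left to right, "R" and "AL" mark right-to-left Hebrew and Arabic letters, and "EN" marks a European number, which keeps its own left-to-right order inside RTL text.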
If you know about the Educause 7 Things You Should Know About... Series, then you should know that it is important to be able to identify seven important elements about any technology.
So here is my spin on the "you should know" list for Unicode (or what someone not familiar with Unicode might need to know).
1. What is it?
Unicode is an encoding standard in which each character in each script is assigned a number (because computers track everything by number). It has room for over a million characters, allowing virtually any character from any script to be assigned a number. Unicode does this by assigning a block of numbers to each script (http://www.unicode.org/charts).
Unicode began in 1991 and focused on the most commonly used scripts first, such as the Latin alphabet, Cyrillic, Chinese, Japanese, Arabic, Greek, Hebrew, Devanagari and others. All major world scripts are now covered, as well as many minority and ancient scripts.
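To make "each character has a number" concrete, here is a minimal sketch using Python's built-in `ord` and the standard `unicodedata` module. The sample characters are arbitrary picks from a few different script blocks, and the helper name is mine.

```python
import unicodedata

# One character each from the Latin, Latin-1, Cyrillic, Greek, Hebrew and CJK blocks
samples = ["A", "\u00E9", "\u0416", "\u03B8", "\u05D0", "\u4E2D"]


def code_point_label(ch):
    """Format a character's number in the usual U+XXXX hexadecimal notation."""
    return f"U+{ord(ch):04X}"


for ch in samples:
    print(ch, ord(ch), code_point_label(ch), unicodedata.name(ch))
```

Characters from the same script land in the same numeric neighborhood, which is the "block of numbers" idea shown in the charts.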
2. Who's doing it?
Unicode encoding has been incorporated into Windows (since Windows NT), Macintosh (since OS X) and newer versions of Linux/Unix. Applications supporting Unicode include newer versions of Adobe applications, Microsoft Office, the Apple iLife/iWork series, FileMaker, EndNote, Google, Google Docs, Twitter, Zotero, blogs, Facebook and many others.
3. How does it work?
To read Unicode text, a user needs to have the correct Unicode font installed. Both Apple and Microsoft provide well-stocked fonts for free, but not every character is covered. Fortunately many freeware fonts are available.
To enter Unicode text, users must activate keyboard utilities or use special escape codes to enter characters for the appropriate script. Again Microsoft and Apple provide a lot of built-in utilities, but additional ones are also available online, many as freeware.
4. Why is it significant?
Consistent encoding allows users to exchange text consistently and for font developers to develop new fonts with a wide range of characters in a consistent manner. When properly implemented, a Mac user can read a Greek text file created on a Windows machine with minimal adjustment.
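As a rough sketch of that exchange, the snippet below turns a Greek word into UTF-8 bytes (what would be saved in the file on one machine) and decodes the same bytes back (what another machine would do on opening it). The sample word and variable names are only illustrative.

```python
greek = "\u0395\u03BB\u03BB\u03B7\u03BD\u03B9\u03BA\u03AC"  # the word for "Greek" in Greek

# Writing side: characters become bytes according to the UTF-8 rules
encoded = greek.encode("utf-8")

# Reading side: any system that applies the same rules recovers the same characters
decoded = encoded.decode("utf-8")

print(len(greek), "characters,", len(encoded), "bytes")
print(decoded == greek)
```

Each Greek letter in this word takes two bytes in UTF-8, and because both sides agree on the encoding, nothing is lost in transit.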
5. What are the downsides?
One is that older programs developed before Unicode may need to be retrofitted if they are meant to be used by a global audience. Programmers need to learn new techniques in order to take advantage of Unicode encoding.
The other remaining problem is that Unicode implementation on the user end is still confusing. Users working with languages other than English need to either activate/install special utilities or memorize a series of special codes. Methods to input text also vary from software to software. A lot of tech savviness is required in order to maximize Unicode compatibility.
6. Where is it going?
The goal is for every script, even those for ancient languages, to be encoded within Unicode. This will not only enable new technologies to be used in any language, but will allow texts from around the world to be digitized in a common format. Unicode support for major languages has arrived, but support for many lesser-known scripts and quirky cases in major scripts still needs to be implemented.
7. What are the implications for teaching and learning?
Unicode will
- Simplify the display of non-English texts in foreign language courses and courses taught in non-English speaking areas
- Standardize the display of mathematical and technical symbols
- Allow non-English speaking communities to write in their native scripts instead of transliterating text in the Roman alphabet
- Expand the typographical repertoire of font designers
- And...if you're a pioneer...Unicode will introduce you to the joys of converting between decimal and hexadecimal values
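For the pioneers, the decimal/hexadecimal conversion can be sketched in a few lines of Python. The two helper names are hypothetical; the conversions themselves rely only on built-in formatting and `int`.

```python
def to_hex_label(decimal_value):
    """Convert a decimal code point (as used in &#960;) to U+XXXX notation."""
    return f"U+{decimal_value:04X}"


def from_hex_label(label):
    """Convert U+XXXX notation back to the decimal code point."""
    hex_digits = label[2:] if label.upper().startswith("U+") else label
    return int(hex_digits, 16)


print(to_hex_label(960))         # pi as a hex code point
print(from_hex_label("U+03B8"))  # theta as a decimal value
print(chr(960), chr(from_hex_label("U+03B8")))
```

This is the same arithmetic behind the two kinds of HTML numeric entities: `&#960;` uses the decimal value, while `&#x3C0;` uses the hexadecimal one.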
The latest reports from the W3C Japanese Layout Task Force are posted at http://www.w3.org/2007/02/japanese-layout/. The working language is Japanese, but key documents are translated into English.
The page also includes a basic layout primer which discusses issues for vertical layout in Japanese, Ruby Annotation (not Ruby on Rails), switching to the Roman alphabet, Japanese punctuation and more.
Two entries ago, I extrapolated what would happen to encoding jargon in the Star Trek universe, mostly an exercise to explain how internationalization (i18n) is structured. In this installment, I hope to demonstrate how things only get more complicated when local encodings meet each other.
Starting "Local" Standards
In the new frontier of "interplanetarization (i19n)", we'll already be starting with a buffet of alphanumeric terms, namely the encoding standard(s) of each planetary system. I'll repeat some below. Notice that the Orions still have two competing standards.
- TUTF-32 - Terran Unicode (32 bit)
- TLHLSCII - tlhIngan Hol (Klingon) Language Institute Standard Code for Information Interchange
- RIS-105 - Romulan Imperial Standard #105
- VSAUS-210A - Vulcan Science Academy Unified Standard #210A
- ACS34 - Andorian Communication Standard #34
- TelSCII - Tellarite Standard Code for Information Interchange
- OTLC-10 - Orion Technology Limited Code #10
- SuperSix - As agreed upon by six major Orion Trading Houses
Before They Create Fedcode
I would assume that the Federation will eventually develop a really large unified standard similar to Unicode. I will call this Fedcode. However...the development of Fedcode will take a while and may even present new challenges in how many bytes are needed for each character.
In the meantime, the local computing systems will need a way to exchange information quickly, so I extrapolate that a lot of ad hoc encodings will appear first. Such as:
What the Terrans may Invent
Like the Vulcans, I think the Unicode Consortium will try to incorporate the new scripts into Unicode. At version 9.2, Unicode had 17 planes, which was enough to accommodate the new Terran scripts, but finding new historical scripts will really add to the complexity.
Unicode 10 might have to add another layer (a "dimension"?). In this scheme, Dimension 0 will be the Unicode we now have, and then we would add more:
- Unicode 10, Dimension 0 (= today's Unicode)
- Unicode 10, Dimension 1 (= VSAUS-210A )
- Unicode 10, Dimension 2 (= TLHLSCII)
- Unicode 10, Dimension 3 (= OTLC-10, not SuperSix) ...
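Back on present-day Earth, the plane structure that these "dimensions" would extend is simple arithmetic: the code space runs from U+0000 to U+10FFFF, and each plane holds 65,536 code points. A minimal sketch (the function and constant names are mine):

```python
MAX_CODE_POINT = 0x10FFFF   # highest code point in today's Unicode
PLANE_SIZE = 0x10000        # 65,536 code points per plane


def plane_of(code_point):
    """Return the plane number (0 = Basic Multilingual Plane) of a code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("outside the Unicode code space")
    return code_point // PLANE_SIZE


print(plane_of(ord("A")))                # a Latin letter in Plane 0
print(plane_of(0x101D0))                 # first Phaistos Disc sign, Plane 1
print(plane_of(MAX_CODE_POINT) + 1)      # total number of planes
```

A hypothetical "dimension" would simply be one more level of the same division, sitting above the plane number.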
What the Vulcans Might Invent
- VSAUS-210A-1 (All Vulcan scripts)
- VSAUS-210A-2 (Basic Vulcan plus Andorian scripts, based on ACS34)
- VSAUS-210A-3 (Basic Vulcan plus Tellarite scripts, based on TelSCII)
- VSAUS-210A-4 (Basic Vulcan plus Klingon scripts, based on TLHLSCII)
- VSAUS-210A-5 (Basic Vulcan plus Orion scripts, based on SuperSix, not OTLC-10)
- VSAUS-210A-6 (Basic Vulcan plus Terran scripts, based on Unicode 9.2)
Again, the 1 through 6 are referring to blocks/planes/dimensions in VSAUS-210A; it's just that the Vulcan encoding allows users to specify location in the scheme to facilitate their processing.
What the Orions Might Invent
Let's skip the Klingons and the Andorians and jump to the worst case scenario: the Orions, whose two encodings are developed by competing corporate technology interests. Each vendor/trading house will expand its encoding, but in different directions.
Thus we will have:
- OTLC-10 (Orion/all Orion measurements) - 16 bit for rapid processing
- OTLC-11 (Vulcan)
- OTLC-12 (Terran Unicode Plane 0)
As well as
- SuperSix (Orion) - 64bit for "exact recording"
- SuperSixV - Orion plus Vulcan
- SuperSixT - Orion plus Unicode Plane 0
- SuperSixPlus - Combines all scripts
By Fedcode
As you can see, by the time the Federation i19n experts meet for the first time to standardize Fedcode, there will not only be local planetary standards to work with but also competing "combined" standards such as Unicode 10.5, SuperSixPlus and VSAUS-210A.
Which will become the basis of Fedcode? How will they plan for expansion for new scripts encountered?
And most of all - how will future computers handle the transformation between Fedcode and KDS (Cardassian Processing Standard)?
I love the i18n/UTF-8 process as much as anyone, but you have to admit that all those flying letter and number combinations can be a little overwhelming to the newcomer. So I think a primer is needed.
There are some real glossaries out there such as the Unicode Glossary and the Penn State i18n glossary, and the IBM Glossary of Unicode Terms...but you really do learn more when you create your own material. So with that in mind, I present
Encoding in the World of Star Trek
I would like to believe that someday we will contact other civilizations (with some sort of encoded communication), and at that point we will need to expand and create new encodings (and of course new jargon) such as
Jargon of Process
Three current terms for the field of wrangling non-English text are i18n for "internationalization" and g11n for "globalization" (both refer to making content and systems usable by people using any script), plus the related l10n for "localization" (adapting information from one region for a second region, e.g. a Japanese product sold in the United States).
These terms share the same structure: start with the first letter, end with the last letter and insert the number of letters in between. Thus internationalization (20 letters total, 18 between "i" and "n") becomes i18n.
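The abbreviation rule is mechanical enough to write as a tiny function. This is just a sketch of the rule as described here; the function name `numeronym` is my own label for it.

```python
def numeronym(term):
    """First letter + count of letters in between + last letter."""
    term = term.lower()
    if len(term) < 3:
        return term  # nothing to abbreviate
    return f"{term[0]}{len(term) - 2}{term[-1]}"


for word in ("internationalization", "globalization", "localization"):
    print(word, "->", numeronym(word))
```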
You can apply this to any term such as "Romanization" and "transliteration" (see answers below for new terms), and in the future we will need alternate terms to include the fact that we are working with planets, not just nations. So maybe we will have
- galaxification (g12n) - even greater than g11n
- interplanetarization (i19n) - also greater than i18n
- astrointernationalization (a23n) - the biggest of them all
- Romanization (r10n) - I made this up
- transliteration (t13n) - this does exist, but is not frequently seen
FYI - Both r10n and t13n refer to the process of writing any language in the Roman (Western/Latin) alphabet. Japanese Rōmaji is an example of this process.
Local Government Standards
Before the days of Unicode, each region had established its own encoding standard for its own language(s). The most famous may be ASCII (American Standard Code for Information Interchange) from which we also got VISCII (Vietnamese), ISCII (India) and ArmSCII (Armenian).
Another pattern is to name the encoding standard after the governmental standards body and the number of the encoding scheme (usually a sequential number). This is how we arrive at TIS-620 (Thailand, Thai Industrial Standard #620), GB 2312 (China) and ELOT 928 (Greece/Ellas). A governmental agency also gave names to Shift-JIS (Japan, combination of JIS X 0201 and JIS X 0208) and ANSI (U.S., American National Standards Institute).
Finally, if for some reason the local government doesn't move as rapidly as needed, then a corporation will invent its own standard on the fly. In the U.S. we got both Windows-1252 (Win-1252) and MacRoman encodings this way. In Taiwan, they got Big5 (a Traditional Chinese encoding standard agreed upon by five corporations).
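The practical cost of all these local standards is that the same byte means different things depending on which table you read it with. The sketch below decodes one byte with three codecs that ship with Python: `cp1252` (Windows-1252), `mac_roman` (MacRoman) and `iso8859_7` (ISO 8859-7, which I am using here as a stand-in for the Greek ELOT 928 table).

```python
raw = b"\xe9"  # a single byte with the value 233

codecs_to_try = ("cp1252", "mac_roman", "iso8859_7")

# Decode the identical byte under each legacy encoding
decodings = {name: raw.decode(name) for name in codecs_to_try}

for name, character in decodings.items():
    print(name, f"U+{ord(character):04X}", character)
```

One byte, three different letters: an e with acute accent on Windows, an E with grave accent on the classic Mac, and a Greek iota under the Greek table. This is the confusion a single shared standard is meant to end.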
Future Local Planetary Encoding Standards
In the future, I will assume that each Star Trek planet has its own version of Unicode, but of course each will have its own encoding designation. Can you Star Trek fans guess where these are from?
- KLISCII or TLHLSCII (depending on linguistic accuracy)
- RIS-105
- VSAUS-210A (because this planet uses hex numbers)
- FMSS-13B1 (in duodecimal numbers because you can quickly divide by 3)
- TUTF-32 (future name for an existing standard)
Since I will be talking about cross-planetary standardization next time, I will add these potential encodings.
- ACS34 - Andorian Communication Standard #34
- TelSCII - Tellarite Standard Code for Information Interchange
- OTLC-10 - Orion Technology Limited Code #10
- SuperSix - As agreed upon by six major Orion Trading Houses
- BNTCXS - Betazed Non-Telepathic Communication Exchange Standard
And to finalize the list
- KLISCII - Klingon Language Institute Standard Code for Information Interchange or
TLHLSCII - tlhIngan Hol Language Institute Standard Code for Information Interchange
- RIS-105 - Romulan Imperial Standard #105
- VSAUS-210A - Vulcan Science Academy Unified Standard #210A
- FMSS-13B1 - Ferengi Mercantile Society Standard #13B1
- TUTF-32 - Terran Unicode (32 bit)
Final challenge - what encoding would you invent for the Cardassians?
Unicode version 5.1 was recently released, and includes some new code blocks as well as new specifications. As with all new versions of Unicode, there will be a time lag until the new items can be incorporated into fonts and utilities, but here is a partial list of new items.
If you're interested in the new characters, the best place to view them is at http://www.unicode.org/charts/
New Plane 0 Scripts
- Cham (Cambodia/Vietnam)
- Kayah Li (Thailand/Myanmar)
- Lepcha (India)
- Ol Chiki/Santali (India)
- Rejang (Indonesia)
- Saurashtra (India)
- Sundanese (Indonesia)
- Vai (Liberia)
Script Extensions
These blocks add characters to previously encoded scripts.
- Cyrillic Extended-A
- Cyrillic Extended-B
- Arabic - characters for math, 4 Qur'anic characters and multiple characters for different languages
- Indic - Malayalam, Tamil character sequences, Devanagari chandra a, Sanskrit sounds in Gurmukhi, Oriya, Telugu
- Latin - characters for minority languages and capital German sharp S (rare)
- Math Symbols
- Medievalist Punctuation - for research
- Myanmar Additions
New Plane 1 Ancient Scripts and Miscellaneous Symbols
- Carian (Anatolia/Turkey)
- Lycian (Anatolia/Turkey)
- Lydian (Anatolia/Turkey)
- Phaistos Disk (Crete)
- Domino Tile Symbols
- Mahjong Tile Symbols
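If you want to check whether your own software already knows one of the 5.1 additions, the capital German sharp S from the Latin extensions is an easy test case. The sketch below uses Python's standard `unicodedata` module, whose character database is newer than 5.1 in any current Python 3 release.

```python
import unicodedata

capital_sharp_s = "\u1E9E"   # added in Unicode 5.1
small_sharp_s = "\u00DF"     # the long-standing lowercase letter

print(unicodedata.unidata_version)         # Unicode version bundled with this Python
print(unicodedata.name(capital_sharp_s))   # official character name
print(capital_sharp_s.lower() == small_sharp_s)
print(small_sharp_s.upper())               # default uppercasing is still "SS"
```

Note the asymmetry: lowercasing the new capital gives the familiar small sharp S, but uppercasing the small letter still yields "SS", which fits the "rare" label for the capital form.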
I'm a little behind in this blog, ... but at a talk I attended recently (mid October 2007), the keynote speaker mentioned some interesting challenges for encoding scripts with a strong manuscript (and calligraphic) tradition.
Most scripts in use today were originally designed to be handwritten in ink over a relatively smooth surface such as paper, papyrus, parchment or palm leaves. The benefit of handwriting is that you don't need a lot of expensive equipment (such as a printing press) to produce a document, but the writer must make each letter form one by one. Writer's cramp can be a serious consideration for workers in the manuscript industry.
To reduce both time and strain to the wrist and hands, most scripts using paper-type media develop cursive forms and special abbreviation symbols (e.g. "&" for 'and' and "@" for 'at'). For instance, Arabic letters vary in shape depending on whether the letter is at the beginning, end or the middle of the word, and it's generally due to the fact that Arabic is essentially a continuous cursive script.
The abbreviation symbols are easily encoded and many are already in the standard, but the alternate letter forms are trickier. On U.S. computers, if you type the "S" key, the screen usually displays an "S" almost instantaneously. With other scripts like Arabic or Devanagari, the text editor has to know the position of the character within the word before it can display something. In some cases, the text editor has to wait for the NEXT character before it can give you a display. Issues like these are a major reason why support for Arabic and South Asian scripts continues to lag behind other scripts.
But the story doesn't end there. Because manuscripts are always handmade, lots of local variations have developed (lots and lots). The preferred Arabic script style of Saudi Arabia (Naskh) is quite a bit different from the preferred style for Urdu (Nastaliq). Even though an Urdu writer is using the same script as someone in Saudi Arabia, he or she may not be able to use the same font base. Similar variations can be seen in Chinese vs Japanese writing. Even in Europe, German Fraktur (Blackletter) is quite a bit different from manuscript Gaelic, both of which differ from modern typography.
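Unicode itself hints at this positional behavior. The Arabic letter beh is encoded once (U+0628), but the standard also carries four legacy "presentation form" code points for its isolated, final, initial and medial shapes. A short sketch with Python's standard `unicodedata` module shows that all four normalize back to the one underlying letter; the variable names are mine.

```python
import unicodedata

BEH = "\u0628"  # the single encoded letter a user actually types

# Legacy code points for the four positional shapes of the same letter
presentation_forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

for form in presentation_forms:
    folded = unicodedata.normalize("NFKC", form)
    print(f"U+{ord(form):04X}", unicodedata.name(form), folded == BEH)
```

In normal text only U+0628 is stored, and the rendering software picks the right shape from the neighboring letters, which is why it sometimes has to wait for the next keystroke.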
And just when you thought you had it all figured out, someone will discover a new manuscript needing a new symbol to encode. Yikes!
Our speaker was documenting some of the more interesting variations you can find in pieces of calligraphy when I hit a conceptual wall. I agree that encoding most of this (probably 90% of this) is historically and culturally important. But...at some point calligraphy is no longer really a document, but an art form. Where do you stop?
After all, the point of many calligraphic traditions isn't really to send a new message, but to find new meaning in old words. Many calligraphic works are actually older texts rewritten to visually represent different nuances in meaning. And many practitioners become celebrated for their abilities to develop a new style of writing.
Graphic programs have protocols for selecting color, shapes, line weight, orientation and so forth, but there is a point where the specifications end and the art begins. Maybe a few of our archival questions can be solved if we remember that some manuscripts are art as well as textual documents.