An entry from the Google Blog has a graph showing that Unicode is rapidly gaining dominance as the de facto encoding standard on the Web.
The good news is that Unicode is now the number 1 encoding standard used, but it's not quite 50% yet (more like around 48%). Only 6 years ago (ca. 2004), the percentage was about 5-10%, so acceptance of Unicode and ability to implement it has increased geometrically.
As of 2010 though, about 40% of the Web was still in either Latin-1/Win-1252 (ca. 19%) or US-ASCII (ca. 20%). Fortunately, that percentage is also dropping geometrically (from 55%+ ASCII in 2001). Some of us may be lagging behind, but it looks like we're all going to catch up sooner or later.
Web Reference.com is offering a series of tutorials on how to implement localization, globalization and internationalization features in PHP. Topics include working with Unicode; setting date, currency and number-display formats; and PHP environmental variables.
It's nice to see these issues getting the spotlight in such a major tech forum.
However I do have one very important recommendation - make sure you visit the site when the ad for SitePal is NOT running or at least disable the volume. Having the virtual lady ask me to enter a message into the text box over and over and over and over and over is very aggravating. Especially if it's after work hours and you are trying to fit in an episode on Hulu.com.
Disclaimer
I am not seeing other reports of the JAWS quirk reported in this entry. It is based on hearsay from a JAWS user, although one who is fairly tech literate. Hopefully, the point is moot, but since information is so spotty, I am leaving this entry up for now.
Original Article
Unicode and accessibility should be natural partners, but sometimes the tools get a little confused. Take language tagging for instance....
Language tagging identifies the language of a text to search engines, databases and, significantly, screen reader tools used by those with severe visual impairments. The newer screen readers can switch pronunciation dictionaries if they encounter a language tag. Language tagging syntax, as recommended by the W3C for HTML 4, works as follows:
- Include the primary language tag for the document in the initial HTML tag. For example, an English document would be tagged as <html lang="en">.
- Tag any passages in a second language individually. For instance, a paragraph in French would be <p lang="fr"> while a word or phrase would be <span lang="fr">.
The idea though is that once you exit the passage tagged with the second language code, you should assume that the language is back to the primary language. Unfortunately, a comment I heard from a JAWS user was something like "The lang tag works, but developers forget to switch back to English." When I asked him for details, he indicated that an English text with a Spanish word makes the switch in pronunciation engines, but then remains in Spanish mode for the rest of the passage.
What I interpret from this is that the JAWS developers are assuming that there should be a SECOND LANG tag to return the document back to the primary language. So we have two syntax schemes:
What W3C Expects
Text: The French name for "The United States" is Les États Unis, not Le United States.
Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis</i>, not <i>Le United States</i>.</p>
Note that the only LANG tag is the one for French Les États Unis with the assumption that the document contains a <html lang="en"> specification which applies to the entire document.
What JAWS Wants
As I indicated earlier, it appears that if this code is parsed by the JAWS screen reader, it would remain in French mode even after Les États Unis was read. I am not sure what the syntax would be, but I'm guessing something like this:
Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis</i>, <span lang="en">not <i>Le United States</i>.</span></p>
Now there is a second English LANG tag whose domain is the rest of the sentence. I am assuming that JAWS would remain set as English thereafter. In this scenario, I am also guessing that what the JAWS programmers did was to set the switch in pronunciation engines to be triggered ONLY by a language tag - which would explain why it didn't switch back to English in the previous code.
What the W3C is expecting though is that tools should be sensitive to domains of language tags and know to switch back to English when the appropriate end tag is encountered. It's more difficult to program, but it CAN be done.
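To show that tracking the domain of a language tag is not hard to program, here is a minimal Python sketch using the standard library's html.parser. It is only an illustration of the scoping idea, not how JAWS or any other screen reader actually works; the names LangTracker and language_runs are made up for this example, and the short list of void elements is an assumption that covers only the most common cases.

```python
from html.parser import HTMLParser

# Elements with no end tag; they must not open a new language scope.
# (Partial list, assumed sufficient for this sketch.)
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class LangTracker(HTMLParser):
    """Collect (language, text) runs, restoring the outer language at each end tag."""

    def __init__(self, default_lang="en"):
        super().__init__()
        self.stack = [default_lang]  # innermost language is last
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Inherit the enclosing language unless this element declares its own.
        lang = dict(attrs).get("lang") or self.stack[-1]
        self.stack.append(lang)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Leaving the element ends the domain of its lang attribute.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1], text))

def language_runs(markup, default_lang="en"):
    """Return the text of `markup` as a list of (language, text) pairs."""
    tracker = LangTracker(default_lang)
    tracker.feed(markup)
    tracker.close()
    return tracker.runs

sample = ('<p>The French name for "The United States" is '
          '<i lang="fr">Les États Unis.</i> not <i>Le United States.</i></p>')
for lang, text in language_runs(sample):
    print(lang, "|", text)
```

The point of the stack is that the end tag, not a second LANG tag, is what returns the reader to English.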
The Coding Dilemma
So here's the coding dilemma developers face: Do they code to the declared and accepted W3C standard or do they code for JAWS? Of course, the JAWS community would like developers to code for JAWS (after all, the person I was speaking with was convinced the problem was developer cluelessness, not bad JAWS standards implementation).
The problem is that this approach perpetuates the bloated code that standards were supposed to streamline. Essentially, you are coding for a specific user agent, just like those developers who only code for Internet Explorer. It's an appealing short-term solution, but counter-productive in the long run. This is why even WebAIM (Web accessibility group from Utah State) recommends NOT coding for the quirks in JAWS or other user agents.
Besides, we can always hope this quirk will be fixed in a future release of JAWS.
Did I Mention Unicode Above 255?
I've also heard rumors that JAWS may read some Unicode characters above 255 as just the Unicode code point. Thus ∀ ("for all" or the upside-down A symbol) might be read as "2200" or "U+2200". There are special .sbl symbol files you can install in JAWS, but it would be nice if the process were a little more transparent. I feel it's the equivalent of Apple or Microsoft not providing any default fonts for non-Western European languages...
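As a rough illustration of the alternative, the Unicode character database already carries a name for most characters, so a tool could fall back to the bare code point only when no name exists. This Python sketch uses the standard unicodedata module; spoken_label is a hypothetical helper name, and nothing here reflects how JAWS or its .sbl files actually work.

```python
import unicodedata

def spoken_label(ch):
    """Return a readable label for one character, or U+XXXX if it has no name."""
    try:
        return unicodedata.name(ch).lower()
    except ValueError:
        # unicodedata.name raises ValueError for characters without a name
        # (for example, private-use code points).
        return f"U+{ord(ch):04X}"

print(spoken_label("∀"))       # the U+2200 character from the example above
print(spoken_label("\ue000"))  # a private-use code point with no name
```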
Had an interesting meeting with an instructor of world media who pointed out that the popular game show Who Wants to be a Millionaire has been exported around the world, often with the same set design, background music, text fonts, graphics and lifelines. You can check YouTube to see for yourself.
So the challenge would be...what differences are there left? Well in the case of Arabic, the right-aligned (RTL) text is one. Not only are the answers in the distinctive WWTBAM angular slots right aligned, but the choices are laid out with the #1 choice set in the upper right box, not the upper left as in the U.S.
Interestingly, even the prize level numbering is reversed with the values (apparently in Saudi ri(y)als) on the left and the 15 prize levels on the right. Compare with the LTR Italian version with the prize levels on the left and the monetary values (in euros €).
Kind of a more interesting RTL example. Hope you weren't expecting much more in depth so close to Winter Break....
I'm updating my FileMaker Unicode database to reflect the changes in the recent versions of Unicode. As part of the database, I like to have the decimal version of the code point handy as well as the actual hexadecimal version (it's good for debugging purposes).
Now the default version does not appear to have hex-to-decimal conversion built in (not even in FileMaker 10), so here's my (updated) solution.
- In the main table corresponding to the list of code points, I created a field for the Hexadecimal Unicode code point value. I'll call this UnicodeHex for now. It must be a Text field. You can create a Decimal field (Calculated), but you won't be able to fill in the formula yet.
- Then I created a second table to store the correspondence between a hex digit (0-F) and its decimal value (0-15). The HexValue field is Text, but the DecValue field is a Number. See the sample table below (some values skipped).

| HexValue (Text) | DecValue (Number) |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4...9 (1 row each) | 4...9 |
| A | 10 |
| B | 11 |
| C...E (1 row each) | 12...14 |
| F | 15 |

To do all the conversions, you need to extract the text value of each position in the code point. So, I created fields corresponding to the value for each place in the hex code point as shown in the list below. I'll explain the formulas below.
Note: In case you're wondering, the name of the places are semi-inspired by Roman numerals and algebra.
- Rightmost digit, units (n): nhex = Right(UnicodeHex;1)
- Penultimate digit (t): thex = Left(Right(UnicodeHex;2);1)
- Antepenultimate digit (c): chex = Left(Right(UnicodeHex;3);1)
- 4th from right (m): mhex = Left(Right(UnicodeHex;4);1)
- 5th from right (d): dhex = If(Length(UnicodeHex)>4;Left(Right(UnicodeHex;5);1);"0")
- 6th from right (x): xhex = If(Length(UnicodeHex)>5;Left(Right(UnicodeHex;6);1);"0")
The challenge for modern Unicode is that code points now come in variable lengths (4-6 digits), so if you count from the left you can't always know you are at the appropriate digit. That means you have to count from the right, but there's no simple formula for picking the 2nd digit from the right. My solution is to take a rightmost chunk, then take the leftmost digit of that chunk. So to get the 3rd hex digit from the right, I take the rightmost 3 digits, then find the leftmost digit in that chunk (hence the embedded Left(Right()) formulas).
I also have to check to see if the length is greater than 4. When the length is only 4, the 5th and 6th digits are filled in with the value 0; otherwise you do a string extraction. Hence the formulas for dhex and xhex use conditional logic. Hopefully though, if Unicode adds more digits, these formulas will continue to work (unlike my original attempt, which assumed only 4 digits in the code point).
- To convert each extracted digit to its decimal version, I need to set up some Relationships between tables so that each extracted digit can look up the decimal equivalent. For each of the intermediate digit fields above, I created a link to an instance of the Hexadecimal Lookup table (there are 6 instances total, one per digit). It's important to make sure each instance has a name you can remember later; mine mention which digit I am working on. See the Relationships diagram below.
- Now we can finally get that decimal value! If you haven't already, create a DecimalValue field and make it Calculated.
- Here's my calculation. I'll explain what the parts mean below
HexLookup N::DecValue + 16*HexLookup T::DecValue + 16^2*HexLookup C::DecValue + 16^3*HexLookup M::DecValue + 16^4*HexLookup D::DecValue + 16^5*HexLookup X::DecValue
- "HexLookup N::DecValue" means give me the value in the DecValue column that matches the hex digit in the "HexLookup N" (units digit) table instance.
- "HexLookup T::DecValue" does a lookup for the "tens" digit (really the 16s place). I multiply the value by 16 and add it to the units value. Remember the hex #FF (F=15) means 15*16+15.
- I look up the "hundreds" place decimal value and multiply it by 16^2 (256), then the "thousands" place decimal value and multiply it by 16^3 (4096). The 5th and 6th digits work the same way with 16^4 (65536) and 16^5 (1048576).
- I add up the results of each converted decimal digit times its appropriate power of 16. The calculation is complete.
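For anyone who wants to sanity-check the FileMaker result outside the database, the same logic can be sketched in a few lines of Python. This is only a mirror of the steps above, not FileMaker code: HEX_LOOKUP stands in for the lookup table, digit_from_right imitates the Left(Right()) trick, and all three names are invented for this sketch.

```python
# Stand-in for the lookup table: hex digit (text) -> decimal value (number).
HEX_LOOKUP = {digit: value for value, digit in enumerate("0123456789ABCDEF")}

def digit_from_right(unicode_hex, position):
    """Imitate Left(Right(text; position); 1), using "0" if the text is too short."""
    if len(unicode_hex) < position:
        return "0"
    return unicode_hex[-position:][0]

def hex_to_decimal(unicode_hex):
    """Sum each of the six digit values times its power of 16."""
    unicode_hex = unicode_hex.upper()
    total = 0
    for position in range(1, 7):  # n, t, c, m, d, x
        digit = digit_from_right(unicode_hex, position)
        total += HEX_LOOKUP[digit] * 16 ** (position - 1)
    return total

for code_point in ("00AA", "2200", "13000", "10FFFF"):
    # Python's built-in int(text, 16) gives an independent check.
    print(code_point, hex_to_decimal(code_point), int(code_point, 16))
```

In everyday Python you would just call int(text, 16); the longer version is here only because it follows the same digit-by-digit route as the calculated field.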
Prelude about the Problem
In terms of handling non-English characters, apps come in two types (at least on the Mac). There are apps which switch fonts behind the scenes without telling you, and those which don't...but then you have to guess which font to use.
To take a concrete example, if I switch from the English keyboard to Japanese input in FileMaker, the font will automatically switch to one of the Japanese fonts. In theory, once I switch back to English, I should return to the original font (except when I don't...we'll get to that). The same principle applies in most apps including TextEdit, FileMaker and so forth. In contrast, if I switch to Japanese input in Adobe Photoshop, I also have to change fonts.
In theory, the automatic font switching sounds nice except when 1) the font doesn't change back after typing the exotic character (this happens a lot in phonetic transcription and elsewhere) or 2) you're trying to figure out if font X actually has that glyph (or whether it's the illusion of font switching in action). With the Adobe products, the manual font switching means you know exactly which font you are using at all times, which is important in desktop publishing.
FileMaker
For instance...I uploaded a version of the UCD Unicode files into FileMaker so I would have a searchable reference locally. An additional function is that I can display glyphs in different fonts for comparison. I have most of the mega fonts selected, but few fonts have everything, so I know there are gaps.
However, because FileMaker switches fonts behind the scenes, I can't always be sure if font X actually has that glyph. If I see a bunch of boxes with identical glyphs, I can suspect an unannounced font switch...but to what?
Solution
The best solution now is to copy and paste the text into TextEdit then open up the font formatting palette (Command+T), and see what it says. Kind of dorky, but still more information than I had.
For the record, I understand why FileMaker is set up this way. For most purposes, you don't want your data entry operators to fidget with fonts. However, you can get inconsistent results if you are not careful. For instance, once I do switch to Japanese, I get the Japanese font, but if I return to English...I still get the Japanese font. I know Japanese fonts contain Latin characters, but the formatting is almost always NOT the one I intended.
It would be nice if FileMaker and the other apps (including Microsoft Office) could return you to your original English font formatting after your exotic sidetrip to the higher code points of Unicode.
What these are
The superscript a/o (sometimes underlined) are abbreviations for ordinal numbers used in Spanish, Italian and Portuguese similar to English -th (as in "4th, 5th, 6th.."). The use of "o" vs "a" depends on the gender of the noun. For instance, the "1st American woman" would be 1ª americana in Spanish and the "1st American man" would be 1º americano. The 5th American woman and man would be 5ª americana/5º americano.
The Codes
I got a request for putting codes for these on the Penn State Web Computing with Accents Web site in various locations, so I thought I would summarize the codes here.
| | Feminine Ordinal (ª) | Masculine Ordinal (º) |
|---|---|---|
| Unicode Code Point | U+00AA (170) | U+00BA (186) |
| Windows Alt Code | ALT+0170 | ALT+0186 |
| Mac Option Code | Option+9 | Option+0 |
| HTML Entity Code | &amp;ordf; | &amp;ordm; |
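If you want to double-check the numbers in the table, Python's standard html module knows the HTML named entities for the two ordinals (ordf and ordm). The short sketch below just looks them up; ordinal_info is a name invented for this example.

```python
import html
from html.entities import name2codepoint

def ordinal_info(entity_name):
    """Return (U+XXXX, decimal, character) for an HTML entity name such as 'ordf'."""
    code_point = name2codepoint[entity_name]
    return (f"U+{code_point:04X}", code_point, chr(code_point))

for name in ("ordf", "ordm"):
    print(f"&{name};", *ordinal_info(name))

# Decoding the entity inside a phrase gives back the actual character.
print(html.unescape("1&ordf; americana / 1&ordm; americano"))
```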
But Wait There's More
But in the land of Unicode, there's always more to know...such as that in Spanish primero '1st.masc' or '1º' may be shortened to primer which can be abbreviated as '1er'...or that you may write octavo 'eighth.masc' as 8º or 8.º or possibly 8vo...although Google tends to have more instances of 8º.
What's important though is that only º and ª have their own code points in Unicode. For English -th, -nd, -rd or Spanish -vo,-er you have to rely on the old fashioned SUP (superscript) tag or its equivalent in CSS.
The latest Unicode Standard, Version 5.2, was released at the beginning of October, 2009. A lot is added in each version, but I confess that the most noteworthy for me was that an Egyptian Hieroglyphs block (U+13000 to U+1342E) was added. It was certainly the largest block added at 1071 code points.
Additional code points added included blocks for Avestan, Old South Arabian, Samaritan, Imperial Aramaic, Inscriptional Parthian and Old Turkic. In addition, supporting characters were added for the Coptic, Devanagari (esp. Vedic support), Hangul (Old Korean), Phoenician and other ancient script blocks.
In South and Southeast Asia, support was added for Javanese, Tai Tham, Lisu, Kaithi, Meitei Mayek, Myanmar (new points), New Tai Lue (new points) and others. In other regions, a new Canadian Aboriginal Syllabics Extended block was created with 80 additional code points. Some African scripts were also encoded including the Bamum script and Rumi numerals. Additions were also made to various math and symbol blocks.
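The 1071 figure is easy to verify from the two end points, since a range counts both ends. Here is a tiny Python check; block_size is a made-up helper name, and the last line assumes your Python ships a Unicode database of version 5.2 or later (any Python 3 should).

```python
import unicodedata

def block_size(first, last):
    """Number of code points from first to last, counting both ends."""
    return last - first + 1

print(block_size(0x13000, 0x1342E))
print(unicodedata.name(chr(0x13000)))  # name of the first code point in the range
```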
For a complete list of changes, see the information on the DerivedAge.txt file (scroll to end) and Revised Unicode 5.2 charts. In terms of support, there may be freeware (or commercial) fonts available, but time will be needed to develop the input utilities and then for these glyphs to be incorporated into major operating systems.
Until then...there's always Unicode 6.0.
Defining Emoji
There were lots of interesting sessions at last week's Unicode conference, but the one that I think non-experts can relate to the most was the one about Emoji or those little tiny icons popular in Japanese e-mail messages.
A rough translation of emoji might be emoticon, but the range of images goes way beyond smiley faces to include weather symbols, hearts, beer steins, sports icons, high heels, fast food, astrological signs, warnings, hand gestures and bikinis.
Why Unicode?
It's good to catalog and standardize any symbol set, but in this case economic necessity is driving this campaign. Specifically, Google and Apple (and its iPhone) want to expand further into the Japanese market.
According to our presenters, the three major Japanese cell phone carriers all support emoji, and these images are popular with most adults (even the ones over 30). It's an important enough feature that iPhone (and iChat), Gmail and even Twitter support emoji.
But really it would be good to support one encoded set of emoji, not a hack of three emoji encodings from the Japanese cell phone carriers...hence the need for a unified encoding which combines those items already encoded (e.g. zodiac symbols) with symbols not currently in Unicode.
Remaining Issues
Because no Unicode script block is free of quirks, I document the issues overheard at the conference and on the Web. Namely:
Color - Real emoji have colors (really bright ones), but the spec is in black and white. This makes sense because the rest of Unicode is also in black and white. Plus you will have more options to add the colors you want!
5-Digit Code Points - Or more technically, the new glyphs will be assigned a number above U+FFFF (i.e. not in the BMP or Plane 0). Not surprisingly, many mobile devices are limited to U+FFFF and below. The committee's comment was that they expected that mobile developers would learn to overcome this restriction...because they really are running out of room in the U+0000-FFFF range. That may be good news for anyone wanting to transmit the ancient scripts over cell phones. You never know when you need to access a Mycenaean Greek text away from the office or when the next Linear B revival may happen.
There's a JailbreakApp for that - When researching this article I encountered articles about tricks for enabling emoji on non-Japanese iPhones, not all of which were legit. For a while, Apple was discouraging use of emoji outside of Japan so it was hiding the emoji. Fortunately, there is a legal way to enable emoji now (both a trick and an app).
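On the 5-digit code point issue above: one common reason software stalls at U+FFFF is that it stores text as 16-bit units, and in UTF-16 anything above U+FFFF takes two units (a surrogate pair) instead of one. The Python sketch below works out the pair by hand and compares it with Python's own UTF-16 encoder. The function name surrogate_pair is invented for this example, and the two sample code points (the start of the Linear B Syllabary block and the first Egyptian hieroglyph) were picked only because those scripts come up in these posts.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

for cp in (0x10000, 0x13000):
    high, low = surrogate_pair(cp)
    encoded = chr(cp).encode("utf-16-be")  # Python's own encoder, for comparison
    print(f"U+{cp:04X} -> {high:04X} {low:04X} (encoder: {encoded.hex().upper()})")
```

Code that counts characters by 16-bit units will see each of these as two "characters", which is the kind of restriction the committee expects mobile developers to get past.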
So there you have it - thanks to the great folks at Google and Apple, we will all be able to standardize the addition of cute icons in our online communication...or at least we will have a documented explanation of what they were for future generations. Trust me, in about 500 years, we will need it.
- MacKeyboardTutorialPyatt.2.pdf (Wednesday)
- KeyboardLayoutFiles2.zip (Wednesday)
- Unicode33PyattLogic.ppt (Friday)