Recently in Software and Unicode Category

Formatting Ordered Lists


A topic receiving some attention in the CSS specs is how to format ordered lists across different numbering systems. Not all are supported in every browser, but a wide range are, so I thought I would present some test data.

If your browser does not support a particular list type, you will see something like "1,2,3" as bullets for the list items. If your browser supports a list type but is missing a font, you may see some Unicode question marks of death indicating that you need to find a font for that glyph.

Note: Test data is not complete, so an untested type may be supported in some browsers.
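For anyone trying this at home, the numbering system is requested with the CSS list-style-type property on an ordered list. Here is a minimal sketch (the markup is my own, not from the original test page):

<ol style="list-style-type: lower-greek;">
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
</ol>

The same value can, of course, go in a regular stylesheet rule instead of an inline style.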

Supported in all browsers

  • Numeric (list-style-type: decimal)
  • Capital Alphabetical (list-style-type: upper-alpha)
  • Lower Alphabetical (list-style-type: lower-alpha)
  • Capital Roman (list-style-type: upper-roman)
  • Lower Roman (list-style-type: lower-roman)

[Rendered three-item sample lists for each style appeared here and after each group below.]

CSS 2: Firefox, Safari, Opera, Internet Explorer 8

  • These are supported in Firefox/Safari.
  • They are also supported in Internet Explorer 8, but a DOCTYPE statement must be included (see the example below).
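For example, placing a standard doctype on the very first line of the page is enough to put Internet Explorer 8 into standards mode. Any valid DOCTYPE should do; this XHTML 1.0 Transitional one is just a common choice:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">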
  • Leading Zero (list-style-type: decimal-leading-zero)
  • Armenian (list-style-type: armenian)
  • Georgian (list-style-type: georgian)
  • Lower Greek (list-style-type: lower-greek)

Only in Firefox, Safari

These are other list styles being proposed as well. The following styles are found in Dreamweaver CS5 and are supported in recent versions of Firefox and Safari.

  • Hebrew (list-style-type: hebrew)
  • Katakana (list-style-type: katakana)
  • Hiragana (list-style-type: hiragana)
  • Hiragana-Iroha (list-style-type: hiragana-iroha)
  • CJK Numbers (list-style-type: cjk-ideographic)


Only in Safari

CSS 3 includes many more specifications, particularly for Asian languages. Some are supported in recent versions of Safari, like the ones below. For a complete list of proposed specifications see the W3C Specification for CSS 3 Lists.

  • Arabic-Indic (list-style-type: arabic-indic)
  • Devanagari (list-style-type: devanagari)
  • Thai (list-style-type: thai)
  • Bengali (list-style-type: bengali)
  • Gujarati (list-style-type: gujarati)

  • Gurmukhi (list-style-type: gurmukhi)
  • Kannada (list-style-type: kannada)
  • Lao (list-style-type: lao)
  • Malayalam (list-style-type: malayalam)
  • Mongolian (list-style-type: mongolian)

  • Myanmar (list-style-type: myanmar)
  • Persian (list-style-type: persian)
  • Telugu (list-style-type: telugu)
  • Tibetan (list-style-type: tibetan)


Turnitin, Plagiarism and spotting Cyrillic Е for E


An interesting Unicode tidbit came up when I was reviewing some literature from Turnitin.com. One article from Turnitin discusses attempted tricks to circumvent Turnitin and their countermeasures. It won't surprise Unicode experts that one is to replace the Latin alphabet E (i.e. our E) with a Cyrillic Е or a Greek Ε. The technical term for this is visual spoofing.

In theory, the instructor will see "thе" as "the", but the word will actually be different enough that it is NOT flagged. Apparently Turnitin has seen this and offers a countermeasure.

One trick is to replace a common character like "e" throughout the text of their paper with a foreign language character that looks like an "e" but is actually different (for example, a Cyrillic "e"). This method does not work because our algorithms replace such characters with the corresponding standard English character. The special character will still appear in the Originality Report; however, the word it is in will have been matched against words containing every character that looks like that character. This allows us to show you matches to words with both the special character and the standard character.

Checking Outside of Turnitin

I'm relieved that Turnitin is on top of this, but there are some tricks for instructors who aren't using the service to spot Cyrillic/Greek letters masquerading as the English alphabet. Namely:

  1. Use spell checking to find errors in words that look correctly spelled.
  2. Switch to a decorative font which does NOT contain Greek and Cyrillic letters.

One trick is to use or turn on the visual spell check feature in Word and other tools. This is the one that puts red wavy lines under a misspelling. For instance, the image below shows three versions of "Elizabeth" with Latin, Cyrillic and Greek E's. They look alike, but only one is free from the red squiggly underline - this is the one with the English E. The others are hiding Greek and Cyrillic E's and thus triggering notifications from the Microsoft Word spell check.

3 versions of Elizabeth, only center one is NOT underlined

Another trick is to switch to an unusual font. Common fonts like Times New Roman, Arial and Verdana contain Greek and Cyrillic characters, but a lot of decorative fonts are missing them. In some cases the Greek/Cyrillic E's visible in one font will be converted to a box/question mark or other weird symbol indicating that the system isn't processing the character like an English letter.

Elizabeth with Russian E with E replaced by box

In some cases though, the non-English E may be rendered in a similar font which contains that character. That's what's happening in this Comic-Sans example. If you look closely though, you will see that only the center capital E is rounded like the other letters. The non-English E's in the other two are straight up and down because they are in another font.

Three versions of Elizabeth in Comic Sans; only the center capital E matches the surrounding font

There are other similar tests you can perform but the upshot is that most technologies are not really set up to integrate Cyrillic/Greek text with English text, and for the savvy instructor looking for spoofing, this is a good thing.


Language Tagging and JAWS: How to return to English?


Disclaimer

I am not seeing other reports of the JAWS quirk reported in this entry. It is based on hearsay from a JAWS user, although one who is fairly tech literate. Hopefully, the point is moot, but since information is so spotty, I am leaving this entry up for now.

Original Article

Unicode and accessibility should be natural partners, but sometimes the tools get a little confused. Take language tagging for instance....

Language tagging identifies the language of a text to search engines, databases and, significantly, screen reader tools used by those with severe visual impairments. The newer screen readers can switch pronunciation dictionaries if they encounter a language tag. Language tagging syntax, as recommended by the W3C for HTML 4, works as follows:

  1. Include the primary language tag for the document in the initial HTML tag. For example, an English document would be tagged as <html lang="en">
  2. Tag any passages in a second language individually. For instance, a paragraph in French would be <p lang="fr"> while a word or phrase would be <span lang="fr">.
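Putting the two rules together, a minimal tagged page might look like the sketch below (the sample sentence is my own, not from any particular site):

<html lang="en">
<head>
<title>Language Tagging Example</title>
</head>
<body>
<p>The French word for "cat" is <span lang="fr">chat</span>.</p>
</body>
</html>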

The idea though is that once you exit the passage tagged with the second language code, you should assume that the language is back to the primary language. Unfortunately, a comment I heard from a JAWS user was something like "The lang tag works, but developers forget to switch back to English." When I asked him for details, he indicated that an English text with a Spanish word makes the switch in pronunciation engines, but then remains in Spanish mode for the rest of the passage.

What I interpret from this is that the JAWS developers are assuming that there should be a SECOND LANG tag to return the document back to the primary language. So we have two syntax schemes:

What W3C Expects

Text: The French name for "The United States" is Les États Unis, not Le United States.

Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis.</i> not <i>Le United States.</i></p>

Note that the only LANG tag is the one for French Les États Unis with the assumption that the document contains a <html lang="en"> specification which applies to the entire document.

What JAWS Wants

As I indicated earlier, it appears that if this code is parsed by the JAWS screen reader, it would remain in French mode even after Les États Unis was read. I am not sure what the syntax would be, but I'm guessing something like this:

Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis.</i> <span lang="en">not <i>Le United States.</i></span></p>

Now there is a second English LANG tag whose domain is the rest of the sentence. I am assuming that JAWS would remain set as English thereafter. In this scenario, I am also guessing that what the JAWS programmers did was to set the switch in pronunciation engines to be triggered ONLY by a language tag - which would explain why it didn't switch back to English in the previous code.

What the W3C is expecting though is that tools should be sensitive to domains of language tags and know to switch back to English when the appropriate end tag is encountered. It's more difficult to program, but it CAN be done.

The Coding Dilemma

So here's the coding dilemma developers face: Do they code to the declared and accepted W3C standard or do they code for JAWS? Of course, the JAWS community would like developers to code for JAWS (after all the person I was speaking with was convinced the problem was developer cluelessness, not bad JAWS standards implementation).

The problem is that this approach perpetuates the more bloated code that standards were supposed to streamline. Essentially, you are coding for a specific Web browser just like those developers who only code for Internet Explorer. It's an appealing short-term solution, but in the long run counter-productive. This is why even WebAIM (the Web accessibility group from Utah State) recommends NOT coding for the quirks in JAWS or user agents.

Besides, we can always hope this quirk will be fixed in a future release of JAWS.

Did I Mention Unicode Above 255?

I've also heard rumors that JAWS may read some Unicode characters above 255 as just the Unicode code point. Thus ∀ ("for all" or the upside-down A symbol) might be read as "2200" or "U+2200". There are special .sbl symbol files you can install in JAWS, but it would be nice if the process were a little more transparent. I feel it's the equivalent of Apple or Microsoft not providing any default fonts for non-Western European languages...


Hexadecimal to Decimal in FileMaker 7+ (Revised)


I'm updating my FileMaker Unicode database to reflect the changes in the recent versions of Unicode. As part of the database, I like to have the decimal version of the code point handy as well as the actual hexadecimal version (it's good for debugging purposes).

Now the default version does not appear to have a hex to decimal conversion built in (not even in FileMaker 10), so here's my (updated) solution.

  1. In the main table corresponding to the list of code points, I created a field for the Hexadecimal Unicode code point value. I'll call this HexValue for now. It must be a Text field. You can create a Decimal field (Calculated), but you won't be able to fill in the formula yet.
  2. Then I created a second table to store the correspondence between a hex digit (0-F) and its decimal value (0-15). The HexValue field is Text, but the DecValue field is a Number. See the sample table below (some values skipped).
    HexValue (Text)      DecValue (Number)
    0                    0
    1                    1
    2                    2
    3                    3
    4...9 (1 row each)   4...9
    A                    10
    B                    11
    C...E (1 row each)   12...14
    F                    15
  3. To do all the conversions, you need to extract the text value of each position in the code point. So, I created fields corresponding to the value for each place in the hex code point as shown in the list below. I'll explain the formulas below.

    Note: In case you're wondering, the names of the places are semi-inspired by Roman numerals and algebra.

    • Rightmost digit / units (n): nhex = Right(HexValue;1)
    • Penultimate digit (t): thex = Left(Right(HexValue;2);1)
    • Antepenultimate digit (c): chex = Left(Right(HexValue;3);1)
    • 4th from right (m): mhex = Left(Right(HexValue;4);1)
    • 5th from right (d): dhex = If(Length(HexValue)>4; Left(Right(HexValue;5);1); "0")
    • 6th from right (x): xhex = If(Length(HexValue)>5; Left(Right(HexValue;6);1); "0")

    The challenge for modern Unicode is that code points now come in variable lengths (4-6 digits), so if you count from the left you can't always know you are at the appropriate digit. That means you have to count from the right, but there's no simple formula for picking the 2nd digit from the right. My solution is to take a rightmost chunk, then count in from the left. So to get the 3rd hex digit from the right, I take the rightmost 3 digits, then find the leftmost digit in that chunk (hence the embedded Left(Right()) formulas).

    I also have to check whether the length is greater than 4. When it is, you do a string extraction; otherwise those digits are filled in with the value 0. Hence the formulas for dhex and xhex use conditional logic. Hopefully, if Unicode adds more digits, these formulas will continue to work (unlike my original attempt, which assumed only 4 digits in the code point).

  4. To convert each extracted digit to its decimal version, I need to set up some Relationships between tables so that each extracted digit can look up the decimal equivalent. For each of the intermediate digit fields above, I created a link to an instance of the Hexadecimal Lookup table (one instance per digit field). It's important to make sure each instance has a name you can remember later; mine mention which digit I am working on. See the Relationships diagram below.
    [HexRelationships.png: Relationships diagram linking each digit field to its own instance of the Hexadecimal Lookup table]
  5. Now we can finally get that decimal value! If you haven't already, create a DecimalValue field and make it Calculated.
  6. Here's my calculation. I'll explain what the parts mean below.
    HexLookup N::DecValue + 16*HexLookup T::DecValue + 16^2*HexLookup C::DecValue + 16^3*HexLookup M::DecValue + 16^4*HexLookup D::DecValue + 16^5*HexLookup X::DecValue
    • "HexLookup N::DecValue" means give me the equivalent decimal value column based on the hex value in the "HexLookup N" (units digit) table instance.
    • "HexLookup T::DecValue" does a lookup for the second digit from the right (the "sixteens" place). I multiply that value by 16 and add it to the units value. Remember that hex #FF (F=15) means 15*16+15.
    • I look up the decimal value for the third place and multiply it by 16^2 (256), then the fourth place and multiply it by 16^3 (4096), and so on for the fifth and sixth places.
    • I add up the results of each converted decimal digit times its appropriate power of 16. The calculation is complete.
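As a quick sanity check (my own example, not part of the original database), take the code point 1D11E, the musical G clef symbol. The extraction fields give nhex = "E", thex = "1", chex = "1", mhex = "D", dhex = "1" and xhex = "0", so the calculation works out to 14 + 16*1 + 256*1 + 4096*13 + 65536*1 + 1048576*0 = 119070, which is indeed the decimal value of U+1D11E.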


WAVE AIM Ate my Latin ō


I was wondering what to write next when an accessibility test presented a perfect example of how you can be fluent in one Web standard, but goof up on another standard (Oy!).

I wanted to test Movable Type in the nifty Web AIM Wave accessibility checker. One feature of this tool is that it will show you the location of header tags (e.g. H1,H2,H3), which can be handy to know if you are testing a Web page for markup and don't feel like plowing through a sea of HTML tags.

By chance I chose an entry about YouTube videos in Latin, which talked about Latin versions of Star Wars (Bella Stellārum), including the scene in Empire Strikes Back (Imperium Contra Offendit) where Luke learns that Darth Vader may be his father and screams "Nōōōō...n" in utter horror.

Original Blog Entry (Screen Capture)

Blog entry with stellārum and Nōōōōn highlighted

Tragically though, when WAVE rendered this page for me, I got the less dramatic "NMMMM...n". Apparently WAVE doesn't understand Unicode too well.


As Seen on WAVE

In WAVE, stellārum is rendered as stellMrum and Nōōōōn as nMMMMn

It looks like accessibility and Unicode together present another trap for the unwary Web worker, but then again you can always show your superior knowledge in one standard or the other - depending on your audience. In the war of the standards, it can be very comforting.


When Apache and UTF-8 Fight


When you create a Web page with Unicode characters, it is recommended that you include the following character meta tag:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
...
</head>

And if it's XHTML, you need to include a final "/" at the end.
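In other words, the XHTML version of the same tag would be:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />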

The idea behind this tag is to force the browser into the correct view and prevent the display of Roman character gibberish. Sometimes though, you can post a properly formatted UTF-8 Web page (meta tag and all) and still see gibberish.

In this case the problem is not you, but the Web server, typically configured with Apache. If it's an American server, Apache is probably set up to ONLY deliver ISO-8859-1 encoding and, even though your file has the UTF-8 data in it, the server is trying to deliver it as Latin 1 (hence the Latin 1 gibberish).

There are three possible solutions available when this happens:

Talk to Your Server Admin

And when you do, you can politely suggest changing the httpd.conf file as documented on Seapine Software. You can also comment that most modern Web apps are set to serve UTF-8 data including CMS programs such as Plone, Movable Type and Drupal. Others such as Facebook and Twitter support UTF-8 natively.

I believe this is what a Web service having this issue did recently.

Use an .htaccess file to just configure specific directories and pages

If you're comfortable enough to mess around with changing your directory preferences, you can try this suggestion from Tex Texin about using AddType statements.

The main proviso here is that an .htaccess file can do some serious damage unless you are careful. It's possible that you may not be able to upload one into your directory because of this, but it could be a good solution to suggest to a server admin if only your directory is affected and the rest of the site has to be encoded differently.
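As a rough sketch (directives and permissions vary by host, so treat this as a starting point rather than a recipe), an .htaccess file along these lines tells Apache to serve .html files as UTF-8:

AddDefaultCharset UTF-8
AddCharset UTF-8 .html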

Unicode Escape Codes

If neither of the above solutions is available, then you can deliver the content within any encoding...if you encode the "exotic" characters as Unicode numeric escape codes.

For example, if your site is Latin 1 but you need to present Russian content, you can change your code from

Русский

to

&#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;
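Inside the page, the escaped string just sits in ordinary markup, for example (the lang attribute is optional, but helpful to screen readers and search engines, as discussed in the language tagging entry above):

<p lang="ru">&#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;</p>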

As you can imagine, this IS an absolute last resort solution. If you ever need to transfer content between systems, you will have many more problems with escape codes (none of which are supported in true XML or Microsoft Word). Not to mention the difficulty of replacing each character with its Unicode numeric equivalent. Escape codes were really only meant for short passages of text.

But...if this is where you are, then you can try the old Mozilla Composer, which converted anything you typed into escape codes, or maybe another utility. Truthfully, it is an extremely difficult problem these days to convert raw UTF-8 text to HTML entity codes.

So I emphasize that this is a rare problem and should be easily corrected by your server admin...and if it's a personal Web site, you may want to think about alternative providers.

Or you could try the ultimate last resort - attack of the angry Unicode expert.

Post Script (Apr 3, 2009)

A student in a recent seminar pointed out a site which does convert a character to a decimal code reference at http://www-atm.physics.ox.ac.uk/user/iwi/charmap.html (from Alan Iwi at the Rutherford Lab at Oxford). Just enter or paste the character and click the Make HTML button to see a decimal entity code. You can enter an entire string of characters.


Where Have All the Escape Codes Gone?


I'm currently preparing a seminar on Unicode and I was struck by how far Unicode implementation, especially in terms of raw Unicode text, has come in the past 4 years. Some of the warnings I used to present in 2000 or even in 2004 seem almost quaint now.

For instance when Mac OS X first came out, the older applications were not set up to take advantage of the Mac Unicode utilities, such as the U.S. Extended keyboards. I used to have to specify which applications could work with Unicode and which couldn't do it. But yesterday I realized that I couldn't find any old applications on my machine that didn't work correctly. What a difference that makes.

The same is true on the Windows side. If you get the latest version of most applications, the chances are that Unicode support is there - even for raw text editors.

Similarly, I recall when many HTML editors converted any non-English character to a numeric HTML entity, but now most applications are set to work with real UTF-8 text embedded in HTML tags. This is much easier to edit and crucial for being able to transfer data between the Web and other XML resources.

Russian, Chinese and Greek data are being treated as just "text" and not as a special case that programmers need to agonize over. There are still plenty of issues to be worked out, but it's good to appreciate progress when it's made.


UniView Unicode Character Lookup


Richard Ishida has a Web-based Unicode lookup tool at
http://people.w3.org/rishida/scripts/uniview/uniview.php

This is a search form which allows you to view data by name, hex value, actual pasted character or range.

There's another conversion utility at
http://people.w3.org/rishida/scripts/uniview/conversion.php
which allows you to convert characters from hex values to different variants such as decimal values, percent escapes (Web address) and UTF-8 vs. UTF-16.

The character paste feature is especially valuable for random symbols such as ∞ (infinity) or ɛ (open e, epsilon vowel). You can finally extract a code point from a weird symbol used in your Word doc.


The IPA Unicode Friendliness Test


When I'm doing an initial test to see if a product is Unicode friendly or not, I typically switch to my IPA keyboard and see if it will accept and display phonetic character input. Why this test?

The first reason is that I actually know my phonetic symbols and can type something pretty quickly. They're also a fairly straightforward Western type alphabet so there are minimal font display issues.

The second is that while developers may program specific support for East Asian, Cyrillic or Middle Eastern languages, they rarely build in IPA phonetic symbol support (unless the product is targeted towards linguists). So, if the product can handle phonetics, it's a very good sign that generalized Unicode support has been implemented.

Does it mean every script is equally supported? Probably not. The gotchas are usually RTL languages like Arabic and Hebrew and the dead scripts like Gothic and Linear B. But if you have IPA support, you probably also have basic support for Czech, Welsh, Chinese, Japanese, Korean, Russian and maybe Armenian and Georgian. That does cover a lot of territory believe it or not.


Using UTC vs. Local Server Time


The concept of time zones isn't exactly a Unicode issue, but it does relate to issues of globalization.

This blog entry from 4 Guys from Rolla explains the advantages of storing times/dates in UTC format vs. local time. The first one mentioned is that if your servers switch time zones, your data will still be the same.

Quick UTC Primer

UTC time zones are defined in terms of Greenwich Mean Time (GMT), from the 0° longitude line established at Great Britain's Royal Observatory, Greenwich.

If you live in London, then you live in the UTC or GMT zone. If you live in Paris, which is one time zone to the east, then you live in UTC +1 (or one hour later than London). If you live in Philadelphia (Eastern Time Zone), then you live in UTC -5 (i.e. five hours behind London).

The idea of using UTC is to flatten time zones and place everyone in the GMT (London) time zone, but then add information about how many hours to add or subtract in order to convert to local time. If you have operations in multiple time zones, looking at the UTC time can help you determine the sequence of events much better than local time alone.

About Daylight Savings

Interestingly, even though many countries (though not Japan) implement daylight saving time in the summer, UTC does not. Right now (Aug 2007) London is UTC +1 (1 hour ahead), but in the fall it will return to UTC 0.

For the Eastern Time zone inhabitants, the summer time zone is UTC -4, and will return to UTC -5 in November.

That means right now, my EDT time of 2:35 PM converts to 6:35 PM UTC (or 18:35 UTC in military time).

References

NASA - http://science.nasa.gov/Realtime/Rocket_Sci/clocks/time-gmt.html
Federation of the Swiss Watch Industry - http://www.fhs.ch/en/worldclock.php


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).

