Recently in (X)HTML Markup Category

Converting Numeric Entity Codes Back to Text


I recently got a technical question that I thought I would share.

Not so long ago in the history of Web Development, the safest way to display non-Western text was the use of numeric entity codes. For instance, one course management system would convert Cyrillic text like Україна (Ukraine) to a series of numeric codes like:

&#x423;&#x43A;&#x440;&#x430;&#x457;&#x43D;&#x430;

This is fine for single words and small phrases, but it's bad for an entire page...especially if you want to edit it.

Fortunately, there is a quasi-fix for this if you need to replace numeric codes with real text. That is:

  1. Open your page in a browser that renders the entity codes as the correct text.
  2. Copy the displayed text and paste it into another file. It will be pasted as real text rather than as codes.
  3. Put the text back into your HTML source.

It's a little tedious, but since I couldn't quickly find a better tool for this, it is a decent stopgap. At least you won't have to re-type everything...
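If a little scripting is an option, the same conversion can be sketched in Python. This is only a minimal sketch, assuming Python 3; the function name decode_numeric_entities is made up for illustration, and the sample string is simply the numeric codes for Україна. It leaves codes for markup characters such as < and & alone, so the HTML source stays valid.

```python
import re

# Matches hexadecimal (&#x423;) and decimal (&#1059;) numeric entity codes.
NUMERIC_REFERENCE = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_entities(source):
    """Replace numeric entity codes with the characters they stand for."""

    def replace(match):
        hex_digits, decimal_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(decimal_digits)
        # Leave out-of-range values untouched rather than failing.
        if code_point > 0x10FFFF:
            return match.group(0)
        character = chr(code_point)
        # Keep markup-significant characters escaped so the HTML stays valid.
        if character in "<>&\"'":
            return match.group(0)
        return character

    return NUMERIC_REFERENCE.sub(replace, source)


sample = "<p>&#x423;&#x43A;&#x440;&#x430;&#x457;&#x43D;&#x430; &amp; &#60;b&#62;</p>"
print(decode_numeric_entities(sample))
```

Named entities such as &amp; are deliberately not touched here; only the numeric codes are converted.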


Unicode and WCAG 2.0 (Accessibility)


Unicode is incorporated into multiple standards, including RSS (newsfeeds) and MathML. It is also incorporated into the newest WCAG 2.0 (Web Content Accessibility Guidelines) standard in some interesting ways.

Text not Image

One guideline of particular interest is Guideline 1.4.5:

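WCAG Guideline 1.4.5 Images of Text: If the technologies being used can achieve the visual presentation, text is used to convey information rather than images of text except for the following: (Level AA)

  • Customizable: The image of text can be visually customized to the user's requirements.
  • Essential: A particular presentation of text is essential to the information being conveyed.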

In other words, it is generally better to use CSS plus actual text to present textual information, even when it is stylized. Unicode is especially important for doing this with characters beyond ASCII or Latin-1.

There are two reasons for this guideline. First, if a screen reader has text available, the developer does not need to include any additional information such as an image ALT attribute. Second, text tends to be more flexible across devices. In particular, it can be zoomed without becoming jagged at large sizes, as a rasterized image does, and it can have its format changed without information loss (say flipping from black text on white to white text on black - a format preferred by some users).

Right to Left Marker

A second relevant guideline is:

WCAG Guideline 1.3.2: When the sequence in which content is presented affects its meaning, a correct reading sequence can be programmatically determined. (Level A)

An important concept for RTL (right-to-left) languages is ensuring that text remains in logical order, so that characters are stored in their correct linear order even if they are presented "backwards" from the more common LTR order. WCAG therefore also recommends logical order and mentions the Unicode RLM (right-to-left mark) and LRM (left-to-right mark) characters.
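To make "logical order" and the two marks a little more concrete, here is a small sketch, assuming Python 3 and its standard unicodedata module. The Hebrew sample word is my own illustration, not something from the WCAG text.

```python
import unicodedata

RLM = "\u200f"  # right-to-left mark
LRM = "\u200e"  # left-to-right mark

# Both marks are invisible; they only carry a direction for the bidi algorithm.
for mark in (RLM, LRM):
    print(f"U+{ord(mark):04X}", unicodedata.name(mark), unicodedata.bidirectional(mark))

# A Hebrew word stored in logical order: the first letter read aloud
# (shin) is the first code point in the string, even though a browser
# draws it at the right-hand edge of the word.
shalom = "\u05e9\u05dc\u05d5\u05dd"
for letter in shalom:
    print(f"U+{ord(letter):04X}", unicodedata.name(letter))
```

A screen reader walks the string in this stored order, which is why keeping the logical order intact matters more than how the text looks on screen.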

Language Tags

A final i18n technology mandate of the WCAG 2.0 is the use of language tags.

WCAG Guideline 3.1.1: Language of Page: The default human language of each Web page can be programmatically determined.

In other words, use language tags to identify page language. This is especially important for screen readers which need to switch pronunciation engines between languages.
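As a rough illustration of what "programmatically determined" means, the sketch below reads the lang attributes out of a page the way a tool might. It assumes Python 3 and the standard html.parser module; the LanguageFinder class and the sample page are hypothetical, made up for this example.

```python
from html.parser import HTMLParser


class LanguageFinder(HTMLParser):
    """Collect the page language and any inline language switches."""

    def __init__(self):
        super().__init__()
        self.page_language = None
        self.inline_languages = []

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if lang is None:
            return
        if tag == "html":
            # The lang attribute on <html> sets the default page language.
            self.page_language = lang
        else:
            # Any other element with lang marks a passage in another language.
            self.inline_languages.append((tag, lang))


sample_page = (
    '<html lang="en"><body>'
    '<p>The Greek word <span lang="el">\u03bb\u03cc\u03b3\u03bf\u03c2</span> means "word".</p>'
    "</body></html>"
)

finder = LanguageFinder()
finder.feed(sample_page)
print(finder.page_language)
print(finder.inline_languages)
```

Without the lang attributes, the same page would give a tool nothing to go on, and a screen reader would have to guess which pronunciation engine to use.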

There have been several debates about the utility of WCAG 2.0, but I can rest assured that at least the needs of multilingual users have been considered.


CSS3 Greek Font Embed with Font Squirrel


I've heard some buzz about newer methods of font embedding, but hadn't had a chance to test them until now. The good news is that you CAN embed fonts across multiple browsers (including Internet Explorer, Safari, Firefox and Google Chrome). The silly news is that it looks like each browser wants a different font format (or pretty darned close). But it's surprisingly robust for all that.

I'll describe the process, but I strongly recommend getting help from a Web font repository like Font Squirrel, Webfonts.info or Kernest which will generate some code for you. I will be documenting with Font Squirrel so I won't have to rely on remote hosting.

@font-face Theory

The magic of modern font embedding happens via a @font-face CSS style declaration. This declaration names the font, then provides the URL so it can be embedded. Because browsers differ in which formats they support, you actually need links to four different uploaded versions of the font.

The font versions in play are the following:

  • Embedded Open Type/EOT (.eot) from Microsoft for Internet Explorer - this has actually been around since the late 90s but is only now living up to its potential.
  • TTF and OTF - These are the usual TrueType and OpenType font formats. Embedding these is supported in Firefox (3.5+), Safari (3.1+) and Opera (10+).
  • SVG - The Scalable Vector Graphics format. This is the format that Google Chrome and many mobile phones support, including iPhone 3.1 (although Droid apparently supports TTF).
  • WOFF - This is a new format that is supported in Firefox 3.6+, Internet Explorer 9+ and Chrome 5+.

I'll talk about how to get the different versions of the fonts in a future blog post, but it looks like at some point the key format will be WOFF, which is a compressed version of a TTF/OTF font. Since embedding requires the viewer to download a font, smaller font files are better.

Simple Download with Character Range Tip

Another piece of helpful news is that some common open source fonts have been converted for you including Galatia SIL (Greek and Latin) and Gentium (Phonetics, Extended Latin plus Greek) (thanks Font Squirrel!)

Warning - there is a download catch. Font Squirrel assumes that you are writing in English only, so the default download gives you ONLY ENGLISH LANGUAGE characters in order to make the file size smaller.

Since you're at a Unicode blog, I will assume you want these fonts for their non-English characters. So when you download Galatia SIL and Gentium, make sure you do the following:

  1. At Font Squirrel, select the font you wish to embed.
  2. Change the Choose a Subset menu from English to Don't Subset.
  3. Click Download @font-face Kit.

Planning Supported Ranges in Fonts

Speaking of character ranges, you should plan your embedded font selections carefully so that viewers download a font with only the characters needed to view the Web page. That is, you probably want to avoid full versions of the mega fonts and use specialized fonts or slimmed-down versions of a mega font. Indeed, if your script is well supported (e.g. Chinese, Japanese), you can probably skip font embedding except for some extremely rare characters.
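One way to plan a range is to list which non-ASCII characters a page actually uses before picking a font or subset. The following is only a sketch, assuming Python 3 and its standard unicodedata module; the function name characters_needed and the sample text are hypothetical.

```python
import unicodedata


def characters_needed(page_text):
    """Return the distinct non-ASCII, non-space characters used in the text."""
    return sorted({ch for ch in page_text if ord(ch) > 127 and not ch.isspace()})


# Sample text: the Greek letter mu plus the word "alpha" with a
# polytonic first letter (U+1F04), written with escapes for clarity.
sample = "Greek \u03bc (mu) and \u1f04\u03bb\u03c6\u03b1 (alpha)"

for ch in characters_needed(sample):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
```

If the list stays inside one script, a specialized font or a slimmed-down subset is probably enough; a stray character from the Greek Extended range, like the one in the sample, is the kind of thing that pushes you toward a font with wider coverage.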

For instance, on the Penn State Computing with Accents language pages, I will be including custom @font-face declarations for the specific scripts used on a page. One of these is the Greek Unicode page, in which Galatia SIL is embedded. (FYI - I embedded Galatia SIL because it includes some of the rarer Greek characters and is a serif font, which I do like for reference.)

Some Embedding Code

Let's talk about embedding Galatia SIL on a Greek page. The @font-face file that is downloaded from Font Squirrel contains the different versions of each font as well as sample code and CSS declarations to copy and paste.

Once the file is downloaded, you can test locally, then upload the fonts and your new pages to your Web site. I put any fonts I will embed into a fonts directory (along with licenses in case anyone pokes around). I also put each font into its own folder.

The next step is to add a @font-face declaration in CSS. Here is mine, based on the stylesheet.css file from Font Squirrel:

<style type="text/css">
<!--

@import url("../int.css");

/*** @font-face code adapted from stylesheet.css file from Font Squirrel. Thanks again! ***/
@font-face {
  font-family: 'GalatiaSILBold'; /*** Name of Font ***/
  src: url('/fonts/Galatia/GalSILB-webfont.eot'); /*** Link to IE EOT file first ***/
  src: url('/fonts/Galatia/GalSILB-webfont.eot?iefix') format('eot'), /*** EOT again; the ?iefix query stops older IE versions from reading the rest of the list ***/
       url('/fonts/Galatia/GalSILB-webfont.woff') format('woff'),
       url('/fonts/Galatia/GalSILB-webfont.ttf') format('truetype'),
       url('/fonts/Galatia/GalSILB-webfont.svg#webfontJEXBBlW4') format('svg');
  font-weight: normal;
  font-style: normal;
}

-->
</style>

This embeds the font on a single page, but if you need to embed a font on multiple pages, add the @font-face declaration to the site-wide .css file.

At that point, the font named in the @font-face declaration can be used as part of the font-family or font properties in later declarations. Here is my .bigbluegreek class, followed by the reference to the class in the HTML:

.bigbluegreek {font-family: 'GalatiaSILBold', 'Arial Unicode MS', sans-serif;
font-size: 24px; color: #006; text-align: center;}

<!-- Table Cell -->

<td class="bigbluegreek">μ</td>

Note that the font-family declaration still includes alternate fonts...just in case the font-embedding doesn't work on a particular browser.

Font Copyright

I'm going to end the entry here and talk about font conversion another time, but if you do want to embed a font, make sure the license lets you do it. Many open-source fonts include the option to modify the font, so creating alternate versions is OK. Commercial foundries are also offering @font-face kits for their fonts... for a fee.


Formatting Ordered Lists


A topic receiving some attention in the CSS specs is how to format ordered lists across different numbering systems. Not all are supported in every browser, but a wide range are, so I thought I would present some test data.

If your browser does not support a particular list type, you will see something like "1,2,3" as bullets for the list items. If your browser supports a list, but is missing a font, you may see some Unicode question marks of death telling you to go find a font for that glyph.

Note: Test data is not complete, so an untested type may be supported in some browsers.

Supported in all browsers

  • Numeric - list-style-type: decimal
  • Capital Alphabetical - list-style-type: upper-alpha
  • Lower Alphabetical - list-style-type: lower-alpha
  • Capital Roman - list-style-type: upper-roman
  • Lower Roman - list-style-type: lower-roman

[Sample three-item lists (Item 1, Item 2, Item 3) rendered in each style]

CSS 2: Firefox, Safari, Opera, Internet Explorer 8

  • These are supported in Firefox/Safari.
  • They are also supported in Internet Explorer 8, but a DOCTYPE statement must be included.

  • Leading Zero - list-style-type: decimal-leading-zero
  • Armenian - list-style-type: armenian
  • Georgian - list-style-type: georgian
  • Lower Greek - list-style-type: lower-greek

[Sample three-item lists (Item 1, Item 2, Item 3) rendered in each style]

Only in Firefox, Safari

There are other list styles being proposed as well. The following styles are found in Dreamweaver CS5 and are supported in recent versions of Firefox and Safari.

  • Hebrew - list-style-type: hebrew
  • Katakana - list-style-type: katakana
  • Hiragana - list-style-type: hiragana
  • Hiragana-Iroha - list-style-type: hiragana-iroha
  • CJK Numbers - list-style-type: cjk-ideographic

[Sample three-item lists (Item 1, Item 2, Item 3) rendered in each style]

Only in Safari

CSS 3 includes many more specifications, particularly for Asian languages. Some, like the ones below, are supported in recent versions of Safari. For a complete list of proposed specifications, see the W3C Specification for CSS 3 Lists.

  • Arabic-Indic - list-style-type: arabic-indic
  • Devanagari - list-style-type: devanagari
  • Thai - list-style-type: thai
  • Bengali - list-style-type: bengali
  • Gujarati - list-style-type: gujarati
  • Gurmukhi - list-style-type: gurmukhi
  • Kannada - list-style-type: kannada
  • Lao - list-style-type: lao
  • Malayalam - list-style-type: malayalam
  • Mongolian - list-style-type: mongolian
  • Myanmar - list-style-type: myanmar
  • Persian - list-style-type: persian
  • Telugu - list-style-type: telugu
  • Tibetan - list-style-type: tibetan

[Sample three-item lists (Item 1, Item 2, Item 3) rendered in each style]


Entity Code or Raw Text in HTML?


If you need to enter a non-English character like /ə/ (schwa) on a Web page, you actually have two choices - you can use an entity code like &#601; or you can use an input utility to enter raw data (i.e. ə). You can see how the code looks below.

Bahama = /bəhamə/

Bahama = <b>/b&#601;ham&#601;/</b>

Bahama = <b>/bəhamə/</b>

Which is Better?

The entity code was a solution from before browsers could reliably recognize true UTF-8 text or before a server or Web tool could "serve" it up properly. In the year 2000, it was the safest solution by far.

These days though the tide is shifting so that I would recommend avoiding entity codes unless your tech is not quite up to speed. For instance, this blog (served by Movable Type) hardly ever uses an entity code. I type or copy Unicode text and out it goes. The same is true for Facebook, Twitter and most Web 1.0 sites hosted at Penn State.

The advantage of NOT using entity codes is that it is easier to port content between file formats including RSS and other XML formats. RSS, unlike HTML, does not recognize the named HTML entity codes (XML itself defines only five, such as &amp;), and numeric codes often fare no better: if a tool escapes the ampersand on the way into the feed, an entity code such as &#601; will be displayed as... &#601; (not schwa). Only ə is reliably displayed as ə. The same is true if you want to include Unicode on your Facebook profile page.

The other advantage is debugging and proof. Which do you want to spell check? Русский or &#x0420;&#x0443;&#x0441;&#x0441;&#x043A;&#x0438;&#x0439;?

However, there are cases where you need to use escape codes just to be safe. Often the problem is that you are using a server which can't deliver UTF-8 encoded text for whatever reason. One of these, unfortunately, has been our course management system - fortunately its WYSIWYG editor converts non-English text to escape codes for you.
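If a tool does not do that conversion for you, it is easy to sketch. The example below assumes Python 3; the function name to_numeric_entities is made up for illustration. It relies on the standard "xmlcharrefreplace" error handler, which swaps every non-ASCII character for a decimal numeric code and leaves ASCII alone.

```python
import html


def to_numeric_entities(text):
    """Replace every non-ASCII character with a decimal numeric entity code."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


escaped = to_numeric_entities("\u0420\u0443\u0441\u0441\u043a\u0438\u0439")  # Русский
print(escaped)

# Decoding the codes again gives back the original text.
print(html.unescape(escaped))
```

The output is pure ASCII, so it survives a server or editor that cannot handle UTF-8, at the cost of being unreadable in the source.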

If you are working with a static page, there really should be no roadblock at this point...so long as your page has the correct UTF-8 meta tag. Cases where this isn't working are likely due to an under-configured Apache setup.

Ironically though, I seem to see more under-configured Apache issues than I used to...One step forward, one step back?


Disabling Auto Link Generator with Entity Code


Problem: The content management system I'm using takes any URL it recognizes and changes it into a link. That's normally good EXCEPT if you want to create a fake URL as an example.

Solution: Replace the slash with its numeric entity code (&#47;). Voilà - the system can't find the slashes anymore, so leaves the URL alone.

By the way, the entity code hack looks like this:
http:&#47;&#47;www....
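If there are many example URLs to defuse, the replacement can be scripted. This is a minimal sketch assuming Python 3; the function name defuse_url and the address www.example.com are placeholders of my own, not anything from the content management system.

```python
import html


def defuse_url(url):
    """Swap each slash for its numeric entity code so auto-linkers skip the URL."""
    return url.replace("/", "&#47;")


example = defuse_url("http://www.example.com/page")
print(example)

# A browser decodes the codes, so readers still see ordinary slashes.
print(html.unescape(example))
```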

Even ASCII characters sometimes need an entity code.


Still ASCII in SSI and CSS Files


* Note: This entry was published elsewhere in 2006.

The Penn State server delivers UTF-8 Unicode pages. Dreamweaver creates Unicode pages. They appear fine in all my browsers without the entity code translation. So I should be able to include Unicode characters in server side includes - right? Not exactly. Hidden UTF-8 characters seem to get in the way.

Any .inc file must be encoded as ASCII and only include ASCII characters. Otherwise you will get an error that the file "cannot be processed". I suspect the culprits are some hidden Unicode control characters that the server doesn't recognize. If you want to include a Unicode character (like the £ symbol), you have to use an entity code like &pound; (all characters in the entity code are ASCII). If you enter raw Unicode, then users will see a question mark, even if the character is actually available in that font.

As for CSS stylesheets, there are no issues technically prohibiting .css files from being UTF-8, but I found out a few years ago that if I placed CSS in UTF-8 files, then properties would mysteriously fail to apply even though the syntax was correct. Again it was probably a hidden UTF-8 character that was interfering. It's little glitches like these that make Unicode development still an entertaining adventure even in 2007.

What are "hidden" UTF-8 control characters? These are code points which don't represent a character but signify text formatting elements like right to left text vs. left to right text or which kind of line break you are using. ASCII has control characters only in positions 0-31 and 127 (and most software programs recognize them), but Unicode includes additional control characters that older programs don't recognize. The problem is that these newer control characters end up in the file without being visible.

By the way, if you cut and paste from a UTF-8 file and see strange behavior in a software package, sometimes backspacing through a "space" will eliminate an unrecognized control character and fix the problem.
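Rather than backspacing blindly, you can also go looking for the invisible characters. The sketch below assumes Python 3 and its standard unicodedata module; the function name find_hidden_characters and the sample string are hypothetical. It reports anything in the Unicode "control" (Cc) or "format" (Cf) categories other than ordinary tabs and line breaks. The byte order mark (U+FEFF) used in the sample is one common candidate, but that is my assumption, since I never pinned down which character was to blame.

```python
import unicodedata


def find_hidden_characters(text):
    """List (position, code point, name) for invisible control/format characters."""
    hits = []
    for index, ch in enumerate(text):
        if ch in "\n\r\t":
            continue  # ordinary whitespace controls are expected
        if unicodedata.category(ch) in ("Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed control>")
            hits.append((index, f"U+{ord(ch):04X}", name))
    return hits


# Sample: a CSS rule with a byte order mark in front and a stray
# left-to-right mark at the end (both invisible in most editors).
sample = "\ufeffbody { color: #006; }\u200e\n"

for hit in find_hidden_characters(sample):
    print(hit)
```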


Superscripts - TAGS vs Unicode Glyphs


Superscripts in HTML

Both HTML and XHTML include the SUP tag for superscripts and the SUB tag for subscripts. Yet the Unicode specification also includes specific slots for individual superscript/subscript characters. For example the phrase “two to the fourth power” could be encoded as
  • 2<sup>4</sup> (SUP tag) = 2⁴
  • 2&#8308; (numeric entity code) = 2⁴
  • 2⁴ (raw Unicode data) = 2⁴


What’s the difference and which should you use? If you’re displaying static Web pages, there’s probably very minimal difference. Although the entity code &#8308; takes up less file space than the SUP tag does, the SUP tag works across most browsers/fonts and can be styled.

The raw data method is the most correct, but also the most prone to cross-platform difficulties. For one thing, you MUST have the UTF-8 encoding header meta tag included or the display will be broken. Another issue is that some browsers (e.g. Mac/Firefox) include extra space around superscript entities or shrink the characters to unreadable sizes. If you’re working with XML though, then you may need to enter superscript/subscripts as raw data.

XML and Flash

On one project we had to feed data for College Algebra exercises into a Flash quiz application. The XML spec didn’t recognize numeric entity codes or the SUP/SUB tag, so we had to enter the superscripts as Unicode characters.
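For a batch of exercises, typing each superscript by hand gets old quickly, so here is a sketch of generating them instead. It assumes Python 3; the helper name power is made up, and this is not the tool we used on the project. Note that the superscripts for 1, 2 and 3 live in Latin-1 (U+00B9, U+00B2, U+00B3) while the rest sit at U+2070 and U+2074-U+2079.

```python
import unicodedata

# Map each ASCII digit to its Unicode superscript counterpart.
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)


def power(base, exponent):
    """Return the base followed by the exponent written in superscript digits."""
    return f"{base}{str(exponent).translate(SUPERSCRIPT_DIGITS)}"


expression = power("2", 4)
print(expression)

# Compatibility normalization flattens the superscript back to a plain
# digit, which shows what is lost if a tool "cleans up" the text.
plain = unicodedata.normalize("NFKC", expression)
print(plain)
```

The second print is the cautionary part: once the superscript is normalized away, "two to the fourth power" reads as twenty-four.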

The good news is that if you can create a UTF-8 text file and insert the symbols, it will import into Flash (at least in Flash 8). For math, your best bet is usually to use the Windows Character Map utility and insert the symbols into a Notepad text file, or use the Macintosh Character Palette with a TextEdit text file. The Penn State Unicode and XML page explains how to create UTF-8 encoded XML files.

Reason for Unicode Character Points

Ultimately, the reason why Unicode has positions for these characters isn’t to help Flash developers, but because the superscripts/subscripts do add content to a text string.

If you’re exchanging raw data files, you may need to know whether a character is superscript or subscript, so it has to be encoded within Unicode. Hence, we have superscript/subscript characters.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
