February 2007 Archives

Greek Unicode Sites


For those of you wanting to understand "polytonic" vs. "monotonic" sites, these pages may be helpful.

I sense I will be updating some pages at some point....


The Schwa (Upside Down E)


If there's one phonetic symbol Americans are most likely to know, it's the "schwa" /ə/ or "upside down e" for the "uh" sound. I personally remember it from elementary school. Here it is in multiple fonts.

Schwa in Multiple Fonts


In phonetics this is the sound similar to "uh" in American English. In many dialects of English, vowels of unstressed syllables are commonly pronounced as schwa, which is one reason for spelling difficulties (e.g. is it -ible or -able, both of which are really [əbəl]?). It's a common "neutral" or "resting" vowel found in many languages including French, Welsh, Irish and others.

Origin of Glyph

Schwa is close to Spanish "e" (and closer to French "e" of le), so that's why the Letter E got flipped in this case.

Origin of Name

The word "schwa" is from the Hebrew word שְׁוָא (šěwā’, /ʃəˈwa/), meaning "nought"—it originally referred to one of the niqqud vowel points used with the Hebrew alphabet, which looks like a vertical pair of dots under a letter. This sign has two uses: one to indicate the schwa vowel-sound
http://en.wikipedia.org/wiki/Schwa#The_term (19 Feb 2007)


Smart Quotes Entity Codes


I'm on the warpath (again) about improperly encoded Smart Quotes which appear as mystery boxes on my Mac/Safari. See example below.

Screen capture of page with smart quotes replaced with boxes

Web Site with Improperly Encoded Quotes as Seen on Safari
For the record, the HTML Entity Codes are as follows.
Smart Quotes and Hyphens
“ &ldquo; (left curly quote)
‘ &lsquo; (left single curly quote)
” &rdquo; (right curly quote)
’ &rsquo; (right single curly quote)
– &ndash; (en dash)
— &mdash; (em dash)

And for those of you who may encounter non-English text, these quote marks are also in use.

European Quote Marks
« &laquo; (left angle)
» &raquo; (right angle)
‹ &lsaquo; (left single angle)
› &rsaquo; (right single angle)
„ &bdquo; (bottom quote)
‚ &sbquo; (single bottom quote)
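
If you would rather look the codes up than memorize them, here is a minimal Python sketch that prints both the named and the decimal numeric entity for each character in the two lists above. It relies on the standard library's html.entities table; the helper name entity_codes is my own invention for illustration.

```python
from html.entities import codepoint2name

# The twelve characters from the two lists, written as escapes so the
# source file itself cannot suffer from an encoding mix-up.
QUOTE_CHARS = (
    "\u201c\u2018\u201d\u2019\u2013\u2014"   # curly quotes, en dash, em dash
    "\u00ab\u00bb\u2039\u203a\u201e\u201a"   # angle quotes, bottom quotes
)

def entity_codes(ch):
    """Return (named entity, decimal numeric entity) for one character."""
    cp = ord(ch)
    named = f"&{codepoint2name[cp]};"
    numeric = f"&#{cp};"
    return named, numeric

for ch in QUOTE_CHARS:
    named, numeric = entity_codes(ch)
    print(f"{ch}  {named:<9} {numeric}")
```

Either form should display correctly regardless of the page's declared encoding, because the entity itself is written in plain ASCII.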


Superscripts - TAGS vs Unicode Glyphs


Superscripts in HTML

Both HTML and XHTML include the SUP tag for superscripts and the SUB tag for subscripts. Yet the Unicode specification also includes specific slots for individual superscript/subscript characters. For example the phrase “two to the fourth power” could be encoded as
  • 2<sup>4</sup> (SUP tag) = 2⁴
  • 2&#8308; (numeric entity code) = 2⁴
  • 2⁴ (raw Unicode data) = 2⁴

What’s the difference and which should you use? If you’re displaying static Web pages, there’s probably very little difference. Although the entity code &#8308; takes up less file space than the SUP tag does, the SUP tag works across most browsers/fonts and can be styled.
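
To put rough numbers on the file-space point, here is a small Python sketch that counts the bytes each of the three forms would occupy in a UTF-8 file. The helper name utf8_size is mine, and the comparison assumes the page is saved as UTF-8.

```python
# Three ways to write "two to the fourth power".
variants = {
    "SUP tag": "2<sup>4</sup>",
    "numeric entity": "2&#8308;",
    "raw Unicode": "2\u2074",
}

def utf8_size(text):
    """Number of bytes the text occupies in a UTF-8 encoded file."""
    return len(text.encode("utf-8"))

for label, markup in variants.items():
    print(f"{label:<15} {utf8_size(markup):>2} bytes")
```

The three forms come out at 13, 8 and 4 bytes respectively, so the savings are real but tiny unless a page is packed with exponents.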

The raw data method is the most correct, but also the most prone to cross-platform difficulties. For one thing, you MUST include the UTF-8 encoding meta tag in the page header or the display will be broken. Another issue is that some browsers (e.g. Mac/Firefox) include extra space around superscript entities or shrink the characters to unreadable sizes. If you’re working with XML though, then you may need to enter superscripts/subscripts as raw data.
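
Here is a rough Python illustration of why a missing encoding declaration breaks the display. It assumes, purely for the sake of the example, that the reader falls back to ISO-8859-1 (Latin-1); actual browser fallbacks vary.

```python
raw = "2\u2074"                   # "2" followed by SUPERSCRIPT FOUR
data = raw.encode("utf-8")        # the bytes as saved in a UTF-8 file

correct = data.decode("utf-8")    # reader was told the right encoding
wrong = data.decode("latin-1")    # reader assumes ISO-8859-1 instead

print(len(correct), repr(correct))
print(len(wrong), repr(wrong))
```

The superscript takes three bytes in UTF-8, so the mistaken reader sees three unrelated characters where one was intended.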

XML and Flash

On one project we had to feed data for College Algebra exercises into a Flash quiz application. The XML spec didn’t recognize numeric entity codes or the SUP/SUB tag, so we had to enter the superscripts as Unicode characters.

The good news is that if you can create a UTF-8 text file and insert the symbols, it will import into Flash (at least in Flash 8). For math, your best bet is usually to use the Windows Character Map utility and insert the symbols into a Notepad text file, or use the Macintosh Character Palette with a TextEdit text file. The Penn State Unicode and XML page explains how to create UTF-8 encoded XML files.
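
If you would rather generate such a file with a script than by hand, the following Python sketch shows one way it might be done. The element names exercise and expression are hypothetical (I am not reproducing the actual quiz format), and the xml_declaration argument needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

# <exercise> and <expression> are made-up names for illustration only.
root = ET.Element("exercise")
expr = ET.SubElement(root, "expression")
expr.text = "x\u2074 + y\u2082"   # superscript 4 and subscript 2 as raw characters

def to_utf8_xml(element):
    """Serialize with a UTF-8 declaration and unescaped non-ASCII text."""
    return ET.tostring(element, encoding="utf-8", xml_declaration=True)

xml_bytes = to_utf8_xml(root)
print(xml_bytes.decode("utf-8"))
```

Writing xml_bytes to disk with a file opened in "wb" mode should give a UTF-8 file with the superscripts stored as real characters rather than entity codes.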

Reason for Unicode Character Points

Ultimately, the reason why Unicode has positions for these characters isn’t to help Flash developers, but because the superscripts/subscripts do add content to a text string.

If you’re exchanging raw data files, you may need to know whether a character is superscript or subscript, so it has to be encoded within Unicode. Hence, we have superscript/subscript characters.
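
You can see this distinction recorded in the Unicode character database itself. The sketch below uses Python's standard unicodedata module; the helper name describe is my own.

```python
import unicodedata

def describe(ch):
    """Return the Unicode name and compatibility decomposition of ch."""
    return unicodedata.name(ch), unicodedata.decomposition(ch)

print(describe("\u2074"))   # the superscript four
print(describe("\u2082"))   # the subscript two

# Compatibility normalization (NFKC) folds them back to plain digits,
# which throws the superscript/subscript information away.
print(unicodedata.normalize("NFKC", "2\u2074"))
```

The decomposition field tags each character as a super or sub variant of an ordinary digit, which is exactly the information a plain "24" would lose.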


Vista: New International Utilities


Vista is scheduled to include a new set of fonts, keyboards and fixes. Many are for Indic languages, but others cover Georgian, Ethiopic, Cherokee and East Asian languages.

See http://www.microsoft.com/globaldev/vista/Whats_New_Vista.mspx for details.


RUBY Vertical Text for Japanese? (2007 Update)


Did you want robust vertical text or furigana support on the Web? Well maybe you'll get it some day, but not in early 2007 (unless you go the PDF route).

But check in with the W3C RUBY Annotation Specification page for more details and tests. Currently, CSS3 is scheduled to include RUBY formatting properties.

CSS3 is also scheduled to include a "writing-mode" property for other types of vertical writing, but these must first be incorporated into the various browsers and text devices.

FYI - There is a vertical text CSS spec out there but it ONLY works in Internet Explorer 6/7 for Windows, so I don't recommend it. It's documented at the Penn State TLT International Vertical Text page.

But I'm staying positive.... Some year, the "someday" of vertical text support may be today!


Unicode Angst in Japan and East Asia


The site Unicode in Japan tracks the history of encoding in Japan and explains the technical and not-so-technical issues for Unicode detractors. An even harsher criticism was written by Norman Goundry (dated 2001).

One problem for the East Asian languages is that different countries (China, Taiwan, Japan) may use different shapes to draw the "same" character. But since Chinese writing is made up of thousands of characters, the question then becomes how many variations are needed.

The Unicode Consortium proposed Han Character Unification to avoid designating too many characters, but this has its quirks. One potential problem is that the same "character" could look very different if you are using a Japanese font vs. a Chinese font. Thus you are back to specifying fonts again.

Issues like this are one reason national character sets like Shift-JIS for Japanese persist. For instance, the Mojikyo Character set has been developed apart from Unicode specifically to support archaic Japanese characters and other variants.
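
As a rough illustration of the trade-off, the Python sketch below takes one shared character and shows that Unicode gives it a single code point while three national encodings each store it as different bytes. I picked 直 (U+76F4) because it is commonly cited as a character drawn differently in Japanese and Chinese fonts; whether your own fonts show a difference is an assumption on my part. The codec names are the ones Python ships with.

```python
char = "\u76f4"   # a Han character used in both Chinese and Japanese

# Python's codec names for three national encodings.
legacy = {
    "Japanese": "shift_jis",
    "Simplified Chinese": "gb2312",
    "Traditional Chinese": "big5",
}

def encoded_forms(ch):
    """Map each label to the bytes ch has in that legacy encoding."""
    return {label: ch.encode(codec) for label, codec in legacy.items()}

print(f"Unicode: U+{ord(char):04X}")
for label, data in encoded_forms(char).items():
    print(f"{label:<20} {data.hex()}")
```

With a national encoding the byte values hint at which regional form was meant, whereas the unified code point leaves that decision to the font or to a separate language tag.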

Is it hopeless? Probably not. For one thing Unicode has been rapidly evolving so that 2006 Unicode is quite different from 2001 Unicode. Every version from Unicode 3.1 through Unicode 5.0 has added characters and specifications to resolve older issues with Asian encoding.

Another plus is that the Unicode Consortium seems to be changing its policy on unifying every script...all sorts of historical variations are popping up in even the Western European Latin blocks. My favorite has been the encoding of German Fraktur letters and Gaelic alphabetic variants.


Bill Poser's Notes on Chinese Character Simplification


This article by Bill Poser of the Language Log explains some of the mechanics of Simplified vs. Traditional Chinese characters and the rationale for some of the objections raised. He also confirms that Simplified characters may be more phonetically based on Mandarin forms, and could be harder for non-Mandarin speakers to memorize.


Central Europe: Fear of Polish & Czech Accent Marks


Edward Lucas, one of the Eastern European journalists at The Economist, has a fun article (The language of Šekspīrs) pointing out that even people comfortable with umlauts, tildes and cedillas may tremble at the sight of the Czech háček and Polish ogonek.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
