Recently in Accents & Punctuation Category

A New German Unicode Letter - Capital S Sharp

A relatively "hot" new addition to Unicode 5.1 is LATIN CAPITAL LETTER SHARP S (the capital form of Sharp S or ß) for German. I thought I'd write about this because it covers both policy and an important Unicode concept: casing.

About Sharp S (ß)

Many of you may already know about lowercase Sharp S (ß), which is used in German spelling as a replacement for "ss". For instance, the German word gross 'large' could also be spelled as groß, and Strasse 'street' can be spelled as Straße. The form itself is an old manuscript convention that was incorporated into modern typography.

So far so good, but what it means from a computing perspective is that any program working with German text has to know that gross and groß are essentially the same word, just with slightly different spellings. If you're looking in a library database, for instance, you would want to see both sets of results. On an interesting side note, I entered groß and pulled up the English Wikipedia page on the "gross" unit of measure as the first result - correct, but weird.
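
As a rough illustration of that matching, Python's built-in str.casefold() applies Unicode full case folding, which maps ß to "ss". This is only a sketch: the helper name same_word is made up for this example, and a real library catalog would more likely rely on a proper collation or search library.

```python
def same_word(a: str, b: str) -> bool:
    """Compare two words using Unicode full case folding (ß folds to 'ss')."""
    return a.casefold() == b.casefold()


# Which candidate spellings match a search for "groß"?
candidates = ["gross", "groß", "GROSS", "Gras"]
matches = [w for w in candidates if same_word(w, "groß")]
print(matches)
```

Case folding is meant for comparison only; the folded string should not be shown to the user as if it were the correct spelling.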

But not capital ß

In official German spelling convention, there is NO CAPITAL SHARP S. First, no German word starts with "SS", so no word could ever begin with ß anyway. But even if a word is in all-caps or small caps, the convention should be to convert all ß to SS - thus groß should be GROSS in all caps.

Makes sense...except that people in Germany DO use capital Sharp S in some signs, gravestones and business names (similar to "Nite-Quil" instead of "Night-Quill"). The 2004 Proposal on Encoding Capital S Sharp (PDF) contains a variety of photographs of Capital S Sharp in use. You can see one of these on Wikipedia (Capital ß page). In other words, Unicode ultimately has to bow to social usage.

So finally we have the official Unicode announcement...

Official Unicode 5.1 Announcement

U+1E9E LATIN CAPITAL LETTER SHARP S
In particular, capital sharp s is intended for typographical representations of signage and uppercase titles, and other environments where users require the sharp s to be preserved in uppercase. Overall, such usage is rare. In contrast, standard German orthography uses the string "SS" as uppercase mapping for small sharp s. Thus, with the default Unicode casing operations, capital sharp s will lowercase to small sharp s, but not the reverse: small sharp s uppercases to "SS". In those instances where the reverse casing operation is needed, a tailored operation would be required.

http://unicode.org/versions/Unicode5.1.0/

There's a very nice write up of the issue at http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2888.pdf (PDF)
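
The default casing behavior described in the announcement can be checked in Python 3, whose string methods follow the Unicode default case mappings. The function upper_keep_sharp_s below is a hypothetical example of a "tailored operation" written for this post; it is not part of any standard library.

```python
SMALL_SHARP_S = "\u00df"    # ß
CAPITAL_SHARP_S = "\u1e9e"  # capital sharp s, new in Unicode 5.1


def upper_keep_sharp_s(text: str) -> str:
    """Tailored uppercasing (hypothetical helper): map ß to U+1E9E, not 'SS'."""
    return text.replace(SMALL_SHARP_S, CAPITAL_SHARP_S).upper()


# Default casing: small sharp s uppercases to "SS" ...
print("groß".upper())
# ... while capital sharp s lowercases to small sharp s.
print(CAPITAL_SHARP_S.lower() == SMALL_SHARP_S)
# Tailored casing keeps the sharp s visible in an all-caps name.
print(upper_keep_sharp_s("Hans Straßer"))
```

Whether the last line displays properly depends on whether your font includes a glyph for U+1E9E.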

Now What?

First, the fonts will have to be developed to include a capital ß variant. This may or may not be in your system yet. Here's a quick test. It wasn't looking good for me, even though I am on a Leopard Mac.

Character Name                  Unicode Number   Character
LATIN SMALL LETTER SHARP S      U+00DF           ß
LATIN CAPITAL LETTER SHARP S    U+1E9E           ẞ

Next comes the "casing" question. Casing is the set of equivalences which match capital and lowercase letters as "the same" even though they are really two Unicode code points. For instance, capital A is encoded as U+0041 (ASCII 65) while lowercase a is U+0061 (ASCII 97). When you search Google and most databases, both A and a are treated the same (yet are kept distinct enough so that you can switch between A and a in your word processor). Note that many English-language searches also conflate Á, Å, À, Ä with plain A, although strictly speaking that is accent folding rather than casing.
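
Here is a small Python sketch of these equivalences. The strip_accents helper is an assumption of mine (one common way to fold accents by decomposing and dropping combining marks); search engines may well do this differently.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two distinct code points for capital and lowercase.
print(f"A = U+{ord('A'):04X} ({ord('A')}), a = U+{ord('a'):04X} ({ord('a')})")
# Casing treats them as "the same" letter.
print("A".lower() == "a")
# Accent folding is a separate step from casing.
print(strip_accents("ÁÅÀÄ"))
```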

As stated before, official German spelling does not recognize capital ß, but not surprisingly, there was a discussion in the Unicode list just this week on whether this too will change over time. I'll be staying tuned.

A Linguistic Closing Thought

Normally linguists talk about seeing a sound change or a grammar change in progress, but this appears to be a spelling change in progress. The Wikipedia Capital ß page claims that legal documents often use capital sharp S in all-caps names in order to avoid ambiguity (e.g. the defendant Hans Straßer becomes HANS STRAßER). And apparently the most notorious use of capital ß is the title page of Der Große Duden (The Great Duden dictionary), which was rendered as DER GROßE DUDEN. Clearly the capital sharp S was destined for permanent encoding.

Working with Doublestruck P & Q (ℙ & ℚ)

As I've been reporting in recent entries, I've been working with a symbolic logic course which has been using various exotic symbols including double struck P (ℙ). Since every Unicode point seems to have its own story, I thought I would report some of the interesting challenges for this character.

Finding It

When you are discussing a topic with lots of different symbols, you soon realize that in terms of Unicode, they will come from multiple blocks. For instance double struck P is from the Letter Like Symbols block (starts at U+2100), while other math symbols may be in Arrows block, the Number Forms block, the Mathematical Operators Block or possibly the Dingbats Block. You can see from the Unicode Org Symbols and Punctuation Chart just how many blocks are involved.
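
To get a feel for how symbols scatter across blocks, here is a small Python sketch. Python's standard library has no block lookup, so the BLOCKS table below is a hand-copied subset of the ranges in the Unicode charts, and block_of is a made-up helper for this example.

```python
import unicodedata

# Hand-copied subset of Unicode block ranges (not a complete table).
BLOCKS = [
    (0x2100, 0x214F, "Letterlike Symbols"),
    (0x2150, 0x218F, "Number Forms"),
    (0x2190, 0x21FF, "Arrows"),
    (0x2200, 0x22FF, "Mathematical Operators"),
    (0x2700, 0x27BF, "Dingbats"),
]


def block_of(ch: str) -> str:
    """Return the block name for one character, if it is in the small table."""
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "(not in this small table)"


# Double struck P, logical OR, and a right arrow live in three different blocks.
for ch in "\u2119\u2228\u2192":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "|", block_of(ch))
```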

Although a user doesn't normally have to know the Unicode point value, many insertion tools such as the Windows Character Map, Mac Character Palette or others are organized primarily by block, so you do have to have some idea of how blocks work.

Rarity

Fonts with a robust set of math symbols are still pretty rare, and sometimes the letter like symbols are even rarer. At one point I had ℙ (P) pulling from one font and ℚ (Q) from another...interesting. Below are some fonts I know have doublestruck letters like ℙ,ℚ.

Formatting Issues

Normally I try to avoid font and size specifications, but double struck P is an interesting counterexample. One challenge is that because the legs are hollowed out, it has a much lighter visual appearance than say normal P. My base text is 12 px on the Web, but for the double struck P, I decided to bump up the size to about 16 px (in a standards-compliant way of 1.3 em).

The other issue was selecting font faces. I wanted one with thick double legs - if you look at the font chart below from my Mac, you'll see that some fonts had some very skinny legs.

Double Struck P in multiple fonts as seen on Mac Character Palette

I also prefer the serif fonts in this case since I personally believe serifs help inexperienced users in reading unfamiliar scripts (in this case undergraduate college students). For this course, I'll probably point students to some freeware fonts I like.

Glyph Du Jour: Doublestruck P (ℙ)

Math symbols can stretch the boundaries of Unicode display technology, but not as much as some other related blocks like Letterlike Symbols, the home of such symbols as ℙ (double struck P, see image below), ℚ (double struck Q), and even the pharmacy prescription symbol (℞).

GlyphSampleDSP.png

Double struck letters in particular are used in different branches of mathematics to represent, for instance, the set of all real numbers (double struck R) or in symbolic logic to symbolize any atomic proposition. See the table below for different double struck letters and their Unicode values. See the Penn State Math Symbol chart for other common letter like symbols of math.

Character Name                         Character Entity   Num Entity   Hex Entity   Character
DOUBLE-STRUCK REAL NUMBER (Double R)   --                 &#8477;      &#x211D;     ℝ
COMPLEX NUMBERS (Double C)             --                 &#8450;      &#x2102;     ℂ
NATURAL NUMBERS (Double N)             --                 &#8469;      &#x2115;     ℕ
PRIME NUMBERS (Double P)               --                 &#8473;      &#x2119;     ℙ
RATIONAL NUMBERS (Double Q)            --                 &#8474;      &#x211A;     ℚ
INTEGERS (Double Z)                    --                 &#8484;      &#x2124;     ℤ
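
If you need the numeric and hex entity codes for other symbols, they can be computed from the character itself. A minimal Python sketch (the function name numeric_entities is just for illustration):

```python
def numeric_entities(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal HTML entities for one character."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


# Print the entity codes for the double struck letters in the table.
for ch in "\u211d\u2102\u2115\u2119\u211a\u2124":
    decimal, hexadecimal = numeric_entities(ch)
    print(ch, decimal, hexadecimal)
```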

Arial Unicode on OS X (Leopard)

I was able to upgrade to Leopard recently on my Mac which means I'm able to manipulate a working version of Arial Unicode MS for the Mac...yeah.

Web Display

My blog actually switched to Arial Unicode because of the way I had coded the CSS. It was very legible, but the x-height seemed smaller in comparison to the Apple Lucida Grande - so I reordered the priority. I will have to see if I can download Lucida Grande onto Windows via the Windows Safari download.

Back to the Logic Symbols in Word

Most of my recent Unicode adventures have been about inserting logic symbols like (∨,∧,⊃) into Word (and later Excel). My main struggle has been that if I insert them from the Character Palette, the font switches to Symbol... which is OK until I start typing English. At that point I will stop outputting the English alphabet and σταρτ ουτπυτιν τηε γρεεκ αλπηαβετ. Greek is great...unless you're typing English text. I was using the left arrow key quite a bit.

Now that Microsoft has developed a working version of Arial Unicode MS, I can input the symbols without switching over to Greek. The only gotcha is that I have to shift old logic symbols out of their pre Arial Unicode fonts (thank goodness for keyboard shortcuts). What I'm hoping is that I can bypass the big font switch in Windows Word too.

So I'm happy to say that we're adding another small step towards Unicode compatibility. Finally I can have logic symbols in a non-Greek, non-Japanese, non-Chinese font!

What's New in Unicode 5.1?

Unicode version 5.1 was recently released, and includes some new code blocks as well as new specifications. As with all new versions of Unicode, there will be a time lag until the new items can be incorporated into fonts and utilities, but here is a partial list of new items.

If you're interested in the new characters, the best place to view them is at http://www.unicode.org/charts/

New Plane 0 Scripts

  • Cham (Cambodia/Vietnam)
  • Kayah Li (Thailand/Myanmar)
  • Lepcha (India)
  • Ol Chiki/Santali (India)
  • Rejang (Indonesia)
  • Saurashtra (India)
  • Sundanese (Indonesia)
  • Vai (Liberia)

Script Extensions

These blocks add characters to previously encoded scripts.

  • Cyrillic Extended-A
  • Cyrillic Extended-B
  • Arabic - characters for math, 4 Qur'anic characters and multiple characters for different languages
  • Indic - Malayalam, Tamil character sequences, Devanagari chandra a,
    Sanskrit sounds in Gurmukhi, Oriya, Telugu
  • Latin - characters for minority languages and capital German sharp S (rare)
  • Math Symbols
  • Medievalist Punctuation - for research
  • Myanmar Additions

New Plane 1 Ancient Scripts and Miscellaneous Symbols

  • Carian (Anatolia/Turkey)
  • Lycian (Anatolia/Turkey)
  • Lydian (Anatolia/Turkey)
  • Phaistos Disk (Crete)
  • Domino Tile Symbols
  • Mahjong Tile Symbols

Igbo in Facebook - It Can Be Done (But Numeric Code Breaks)

How does Facebook handle accents? Pretty well actually - but you can't use the numeric code. Instead you have to directly insert the character either by typing it in an Igbo Keyboard or via the Windows Character Map or Mac Character Palette.

For Web 1.0, the safest way to display accented letters was with numeric entity codes. For instance, if I wanted to display Ụwa, I might write &#7908;wa within the HTML document. The codes were safer because they would work even if a developer forgot to include the UTF-8 meta tag.

In a Web based form, the rules may differ depending on how the developer configured the service. In some forms, you MUST enter the numeric code (often because the UTF-8 tag is missing). In other cases you CANNOT use the numeric code - this is true when you are entering data into a text field which will not go through any HTML formatting schemes. As long as the output has the UTF-8 meta tag (and Facebook does), you can avoid a numeric code (i.e. enter a "raw" accented letter) and still be OK.
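
The difference between the two situations can be sketched with Python's standard html module. The last step is an assumption about what a plain text field is likely doing behind the scenes (escaping the ampersand before display); I have not seen how Facebook itself is implemented.

```python
import html

word = "\u1ee4wa"  # Ụwa, with U+1EE4 LATIN CAPITAL LETTER U WITH DOT BELOW

# Web 1.0 style: replace every non-ASCII character with a numeric entity.
as_entity = word.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(as_entity)

# A browser (or html.unescape) turns the entity back into the character.
print(html.unescape(as_entity) == word)

# If a form escapes its input as plain text, the code shows up literally.
print(html.escape(as_entity))
```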

How can you tell? Unfortunately, you have to test each application one by one. As I've commented before, applications which truly expect to support a global audience are generally UTF-8 ready and you can skip the numeric code. This includes Facebook, MovableType, iTunes, GoogleMaps, Twitter and so forth.

Being able to skip the numeric code is a positive sign (why memorize numbers when you can type?), but as with all change, there will be some old habits to break.

Please install the Character Map

A few weeks ago I commented/complained that most people in the U.S. technology field consider foreign language support a peripheral issue even though English does need "foreign language" support for special punctuation and foreign words. An unfortunate corollary is that the U.S. tech industry also assumes that people will not need to type beyond ASCII either.

As a result, some of the base-line tools that Apple and Microsoft provide may not necessarily get installed. For instance, I recommend the Windows Character Map as a last resort for a lot of Windows users. But in the past few months, I've gotten questions (mostly outside Penn State) saying the user can't find it and where the heck is it.

The truth is...it may not have gotten installed. I've noticed that in order to save space, some "exotic" utilities may be skipped. Hmmmm!

Truthfully, I can understand skipping the East Asian utilities because they do take up a lot of disk space (one East Asian font can be about 8-20 MB vs. 200-500 K for Western-only fonts)...but I do worry that even the basic tools for handling the € sign are also not included.

It's difficult enough for the busy administrative assistant, instructor or Spanish I student (in the lab) to figure out how to insert the exotic symbols. Imagine trying to convince an even busier tech support specialist that they need to install some new utilities from the Windows CD-ROM (or the Mac disk) and it's not a very happy scenario.

FYI - The situation at most of Penn State is not like this - I think the Character Map is universally installed. Also, the CLC Student Computing Labs in particular have worked hard to ensure that the best Unicode toolset is available, even East Asian languages. Having said this though, I do hear about the occasional tale of a missing Unicode utility somewhere out there in PSU computer land.

Does English Need Unicode?

Traditional wisdom holds that ASCII or maybe ANSI (ISO-8859-1) is sufficient for English and that it's not a language that needs any Unicode support. But is this actually true?

It's certainly not true in any higher education environment where not only do we work with foreign languages, but also mathematical symbols, including the obscure ones. Any time an institution needs to build an archive for an ancient language or math/science, the problem of encoding will rear its ugly little head. Ironically, it may be the classicists, medievalists and comparative literature specialists (fields which are traditionally not seen as high tech) who have had the most experience with working on the Unicode issue.

Is it just some scholars in exotic languages or physics then? Alas not. Many of the carefully crafted punctuation symbols that are appreciated by copy writers and desktop publishers everywhere are ALSO in Unicode. These include the em-dash (—), the en-dash (–), the Euro sign (€) and Smart Quotes “ and ”. There are some kluges in "ISO-8859-1" for some of these symbols...but not all of them. If you want these to work reliably, it's best to select Unicode (UTF-8) and say you're using Unicode!

Even the "foreign" accents work their way into our prose. Once it was just fiancé and José, but now it's even baseball players like Magglio Ordóñez (the "Big Tilde"). If you check out Ordóñez's uniform, you'll see that even his uniform has a tilde on his name. As we gradually learn to embrace some non-Anglo culture and wish to "get it right", the need for spelling with appropriate accents will continue to rise.

In fact, it's amazing that in every office I've ever been to, someone has asked me how to insert some "exotic" symbol into some document. So yes...even English needs Unicode support to express the full range of textual possibilities.

I worked on a set theory quiz for a Math course where the questions are pulled into Flash from a text file. The instructor wants to include the union (∪) and intersection symbol (∩) in his problems, so what to do?

The good news is that if you can create a UTF-8 text file and insert the symbols, it will import into Flash (at least in Flash 8.) For math, your best bet is usually to use the Windows Character Map utility and insert the symbols into a Notepad text file or use the Macintosh Character Palette with a Text Edit or BBEdit text file. Unfortunately, the process is still a little clunky in both platforms, but it's better than in 2005.

Windows

You have to open both Notepad (Start » Accessories) and Character Map (Start » Accessories » System Tools)

For the Windows Character Map, it's a semi-clunky process. You have to switch the font to Arial Unicode MS (because it has all the math symbols), then scroll down the window until you see the math section. Then you have to select, copy and paste each symbol into Notepad.

In Notepad, when you save the file, you have to make sure the encoding menu under the file name is changed from "ANSI" to "UTF-8". Fortunately, it will warn you.

Macintosh

In Text Edit for the Mac, you go to Edit » Special Characters to bring up the Character Palette. Click the Math option and hunt for the symbol. Highlight and click Insert to place it in Text Edit.

Once you insert the symbols, you have to make sure your encoding is set to UTF-8 during the save process. Go to the Format menu and select "Make Plain Text." Then, when you save the file you have to make sure the encoding menu under the file name is changed from "MacRoman" to "UTF-8".
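
If a scripting language is at hand, the same kind of UTF-8 text file can also be produced without the copy-and-paste step. The sketch below uses Python; the file name quiz.txt and the question text are made up, and I have not tested this particular file with Flash.

```python
import os
import tempfile

# \u222a is the union symbol, \u2229 is the intersection symbol.
question = "What is A \u222a B? What is A \u2229 B?"

# Write the file with an explicit UTF-8 encoding (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "quiz.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(question)

# Read it back, again naming the encoding explicitly.
with open(path, encoding="utf-8") as f:
    read_back = f.read()

print(read_back == question)
```

Naming the encoding on both the write and the read is the scripted equivalent of checking the encoding menu in the save dialog.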

Reopening UTF-8 Files in Mac Text Edit

In Text Edit, if you reopen a UTF-8 file it may be magically transformed to MacRoman (you'll see things like Á& instead of your intended character). Very annoying (grr!!) To prevent this, you must go into the Text Edit Preferences, then click the Open and Save panel. Make sure that the Plain Text Encoding options for opening and saving are set for "UTF-8." Or you can spring for a license for BBEdit or Mellel which are better about warning you.
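
What happens in that "magical transformation" is that the UTF-8 bytes get read as if they were MacRoman. A short Python sketch reproduces the effect and shows that, as long as the garbled text has not been edited and resaved, it can be reversed:

```python
original = "A \u222a B"  # A, union symbol, B

# The union symbol is stored as three bytes in UTF-8.
raw = original.encode("utf-8")

# Reading those bytes as MacRoman yields three unrelated characters.
garbled = raw.decode("mac_roman")
print(garbled)

# Undo the mistake: back to bytes as MacRoman, then decode as UTF-8.
repaired = garbled.encode("mac_roman").decode("utf-8")
print(repaired == original)
```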

As for Flash - fonts are still a little tricky within Flash, but at least it's playing well with text files.

Superscript and Subscript

I also used Flash for a College Algebra quiz where I discovered that the XML format does not support HTML tags like <sup> and <sub>. Instead, you may need to use the Unicode characters for superscripts and subscripts.
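
One way to prepare such text is to translate ordinary digits into the Unicode superscript and subscript digits before saving the file. This is a Python sketch with made-up helper names (sup and sub); it only covers digits, and your font still needs to include these characters.

```python
# Superscript digits 0-9: note that 1, 2 and 3 come from the Latin-1 block.
SUPERSCRIPT = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)
# Subscript digits 0-9 are contiguous, U+2080 to U+2089.
SUBSCRIPT = str.maketrans(
    "0123456789",
    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089",
)


def sup(digits: str) -> str:
    """Replace ASCII digits with superscript digits."""
    return digits.translate(SUPERSCRIPT)


def sub(digits: str) -> str:
    """Replace ASCII digits with subscript digits."""
    return digits.translate(SUBSCRIPT)


print("x" + sup("2") + " + y" + sup("10"))
print("a" + sub("1") + " + a" + sub("2"))
```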

Glyph Du Jour: Reversed Open E

This is a phonetic symbol I have seen before, but did not understand...until this semester. This is the vowel found in the standard RP British pronunciation of bird /bɜd/ (U.K.) or /br̩d/ in U.S. English.
[Image: backwards open e, which resembles a 3 or a backwards epsilon]
Many accents in English, including UK English, have generally lost the /r/ after a vowel while standard American has maintained this /r/. What's interesting is that even though the /r/ is theoretically gone, there are still subtle changes in pronunciation where the /r/ used to be...which is why English speakers can still "hear" an /r/ even though it may not really be there any more.
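
For anyone hunting for these phonetic characters, Python's unicodedata module can list the code points in each transcription. I am assuming here that the syllabic r is typed as a plain r followed by U+0329 (the combining vertical line below).

```python
import unicodedata

bird_uk = "b\u025cd"   # RP transcription with reversed open e
bird_us = "br\u0329d"  # U.S. transcription with syllabic r (r + combining mark)

# List each code point and its official Unicode name.
for label, word in [("UK", bird_uk), ("US", bird_us)]:
    for ch in word:
        print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))
```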

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.
