
Blackletter Gone Bild Wild


The "Tweed" column from the Chronicle of Higher Education had an amusing story of a Blackletter glyph-variant glitch on the new University of Idaho diplomas (specifically, "Congrabulations on Your Grabuation!").

As with many U.S. diplomas, the university name was rendered in a Blackletter (aka "Old English" or "Gothic") calligraphic font. This font, though, had a particularly high flourish on the lowercase "v", high enough that recipients wondered if the university had printed a "b" instead of a "v" (and who wants a diploma from the Unibersity of Idaho?).

According to the Chronicle, the administration reassured them that it was an archaic "v", but this case does highlight the legibility issues of some older manuscript fonts and the need to balance historical font authenticity with modern readability.


Got Double Hypens from Word?


Unicode hasn't been part of my life much recently, but it did emerge in a very unexpected way this week during a calendar upgrade.

One of the conversion tasks was to add group e-mail addresses so we could share calendars with each other efficiently. But when I tried to copy and paste one, I got a "not found" error. Here is one of these addresses (altered for security reasons):

umg-sc.foo.staff@fuyu.ucal.psu.edu

Can you spot the problem? (HINT: Try cutting and pasting it into a text file.)

Given up? The problem is the hyphen. In the right font, you will see that it's not a plain hyphen (U+002D, ASCII #45) but the more elegant and slightly longer en dash, which is U+2013 (not in ASCII). As many of you know, many databases are still sensitive to this difference, so a hyphen is just not the same as an en dash. This means searching is a FAIL.

How did the en dash get in there if it's outside of ASCII? My guess is that it's the result of an auto-correct feature in Word, which makes some formatting tweaks to enhance visual appeal. One is to change plain hyphens into the slightly longer en dash (more favored by typographers).

Another common change is to convert plain straight quotes (" at U+0022, ASCII #34) to "Smart Quotes" (“ at U+201C and ” at U+201D). Copying HTML code attributes from Word can be similarly dangerous, since HTML recognizes plain quotes but NOT fancy curly quotes. Most of the time the change does nothing, but when it comes to interacting with some systems, the reformatting makes a difference in a very annoying way.

How to catch it? In some cases you can change the font, but many fonts make the hyphen and en dash appear identical (Arggh!). Which leaves the old standby (test, test, test) plus some Unicode awareness (which is increasing among programmers).
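If testing by eye fails, a few lines of script can do the squinting for you. Here is a minimal Python sketch (the address is the altered example from above) that flags any non-ASCII characters and swaps common Word auto-corrections back to their plain forms:

```python
# Sketch: detect and normalize Word-style "smart" punctuation before
# handing a string to an ASCII-only system.
SMART_TO_PLAIN = {
    "\u2013": "-",   # en dash -> hyphen-minus
    "\u2014": "-",   # em dash -> hyphen-minus
    "\u201c": '"',   # left double smart quote
    "\u201d": '"',   # right double smart quote
    "\u2018": "'",   # left single smart quote
    "\u2019": "'",   # right single smart quote
}

def flag_non_ascii(text):
    """Return (position, character, code point) for every non-ASCII character."""
    return [(i, ch, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ord(ch) > 127]

def normalize(text):
    """Replace common Word auto-corrections with plain ASCII equivalents."""
    return text.translate(str.maketrans(SMART_TO_PLAIN))

address = "umg\u2013sc.foo.staff@fuyu.ucal.psu.edu"  # en dash hiding in plain sight
print(flag_non_ascii(address))   # [(3, '–', 'U+2013')]
print(normalize(address))        # umg-sc.foo.staff@fuyu.ucal.psu.edu
```

A check like this could run over any address list before it goes into a database; the mapping table only covers the substitutions mentioned in this post, so a real deployment would want a longer list.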


The Hot New .CO Domain Is....


It's the Go Daddy.co Super Bowl secret that was just TOO CALIENTE for TV. What is it?

Last night during the Super Bowl, Go Daddy.com revealed new Go Daddy girl Joan Rivers and a "hot new .co domain" you can register. It's the ".co" domain you see in the UK (http://www.bbc.co.uk/) and Japan (http://www.kikkoman.co.jp/)...except that it really isn't.

Now that you're on the Internet, the secret of .co can be revealed as...the country code of Colombia. So now we can welcome another nation into the fold of those using their country code to generate revenue from domain name registration. Other famous domain nations include Tuvalu (.tv), Montenegro (.me) and Libya (.ly).

The .co domain is especially well-positioned though because not only is it very similar to the prized .com suffix, but other countries have established .co as a near synonym for .com. I wish Colombia the best in this enterprise.


And now a Time.ly Po.st from...


Ever since I learned that .tv sites are actually from domains registered in the Pacific island nation of Tuvalu, I have kept an eye out for unusual domain suffixes. One of my former favorites was del.icio.us (using the rare .us suffix for the United States). I'm sorry it's now officially delicious.com.

My new favorite may be the bit.ly addresses used for short URL aliases (similar to tinyurl.com aliases). But at some point, I finally had to ask...where is .ly? Answer: It's Libya. You can look it up at http://users.telenet.be/worldstandards/internet%20domain%20suffixes.htm (out of Belgium).

Of course, there are many more opportunities out there to explore - like .al (Albania), .an (Netherlands Antilles), .er (Eritrea), .es (Spain), .it (Italy), .in (India) and even .um (US Minor Outlying Islands). Spanish Web services may find .ar (Argentina), .er (Eritrea) and .ir (Iran) interesting, since these are all verb infinitive endings. You can see even more options at this globalbydesign.com blog post. As you can see, the only barrier is our imagination and a nation's willingness to participate in these pun schemes.

This is nothing new, but it's always fun to observe and ponder...who are these people who provide our popular online services? I was interested to note that bit.ly has apparently branched out to j.mp, where .mp is the Northern Mariana Islands.

P.S. The .st suffix is São Tomé and Príncipe.


The RTL Millionaire


Had an interesting meeting with an instructor of world media who pointed out that the popular game show Who Wants to be a Millionaire has been exported around the world, often with the same set design, background music, text fonts, graphics and lifelines. You can check YouTube to see for yourself.

So the challenge would be...what differences are left? Well, in the case of Arabic, the right-aligned (RTL) text is one. Not only are the answers in the distinctive WWTBAM angular slots right aligned, but the choices are laid out with the #1 choice set in the upper right box, not the upper left as in the U.S.

Contestant with 4 right aligned answer choices

Interestingly, even the prize-level numbering is reversed, with the values (apparently in Saudi ri(y)als) on the left and the 15 prize levels on the right. Compare with the LTR Italian version, with the prize levels on the left and the monetary values (in euros €) on the right.

Prize list, levels 1-15 on right and values up to 2,000,000 riyals on left

Italian prize list with levels 1-15 on left and values up to 1,000,000 euros on right

Kind of an interesting RTL example. Hope you weren't expecting anything more in depth so close to Winter Break....


Unifaces and Other Unusual Unicode Applications


A while ago, I pointed out that vision charts have expanded beyond the Western scripts, and now so have emoticons. Check out http://twitter.com/unifaces for ways to use the wide range of Unicode symbols to express different facial expressions. Thanks to the Twitter feed authors for sending this to me.

And while I was at it I checked out her del.icio.us site and discovered that:

  1. Mojibake is the Japanese term for the Unicode question mark of death displayed when a symbol cannot be rendered. I am glad to have a technical term, but since it's not translated, I do wonder what the literal meaning is. Hopefully it means "ghost character" or "character changing". It appears that the verb bakeru means "change spookily" or "appear in disguise". Ah, the mysteries of Unicode.

  2. If you need a new hobby, you can try faking Cyrillic text with Latin characters (e.g. PyccKNN instead of Русский). Detailed instructions are on the Wikipedia Volapuk encoding page. Actually, there was a scare a few years back where some Russian spammers were using Cyrillic characters to fake Western URLs (e.g. РЕИИ SТАТЕ...or, if you like Greek, ΡΕΝΝ SΤАТЕ). Only the "S" is Western Latin. It turns out to be tricky in both directions, because it's the capitals that match the best. But I guess it's the global version of Leet (L33t/1337).
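For the curious, Python's standard unicodedata module makes the look-alike trick easy to unmask (a small sketch, not the spammers' actual tooling, obviously):

```python
import unicodedata

# Two strings that can look identical in many fonts: the first is ordinary
# Latin, the second is built from Cyrillic look-alikes (Volapuk-style faking).
latin = "PENN"
cyrillic = "\u0420\u0415\u041d\u041d"  # Р Е Н Н

# Ask each character for its official Unicode name.
for ch in cyrillic:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0420  CYRILLIC CAPITAL LETTER ER
# U+0415  CYRILLIC CAPITAL LETTER IE
# ...

print(latin == cyrillic)  # False - identical to the eye, different to the machine
```

The same check is the basis of real homoglyph detectors: compare the script named in each character's Unicode name against what the string claims to be.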

I'd be tempted to tell everyone to get back to work, but then I would have to get back to my work, and that's not always Unicode related.


A Unicode Eye Chart


If your eyes are glazing over trying to determine whether that glyph is = or ≡ or ≅ or something else remarkably similar...then you may want to check your vision with this helpful Unicode Eye Chart.

It comes with a useful key at the bottom. Isn't it amazing what you can find on the Web?
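If you would rather ask the machine than squint, a throwaway Python snippet can name the look-alikes from the paragraph above:

```python
import unicodedata

# Print the code point and official name of some easily confused glyphs.
for ch in "=\u2261\u2245\u2248":
    print(f"{ch}  U+{ord(ch):04X}  {unicodedata.name(ch)}")
# =  U+003D  EQUALS SIGN
# ≡  U+2261  IDENTICAL TO
# ≅  U+2245  APPROXIMATELY EQUAL TO
# ≈  U+2248  ALMOST EQUAL TO
```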


Funky Fraction Glitch


It's been a long week and I was catching up on my celebrity news when I saw the following headline in my RSS reader.

O.J. Simpson Sentenced to 171-2 Years

I'm not a big O.J. Simpson fan, but a 171-2 year sentence seemed a little excessive for robbery. Actually, it was a Unicode glitch: the headline was supposed to read 17½, but that part of the reader was having problems.

Screenshot of the headline as it should appear: 17½, not 171

The lesson learned - always leave a space between the whole number and its fractional component. TGIF!!
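A plausible culprit (my guess, not confirmed by the feed vendor): ½ is U+00BD, whose compatibility decomposition is the three characters 1 ⁄ 2 - and if the FRACTION SLASH (U+2044) then gets dropped or garbled, "17½" comes out reading like "171-2". Python can show the decomposition:

```python
import unicodedata

headline = "17\u00bd"  # 17½, using VULGAR FRACTION ONE HALF (U+00BD)

# NFKC normalization applies the compatibility decomposition.
decomposed = unicodedata.normalize("NFKC", headline)

print([f"U+{ord(c):04X}" for c in decomposed])
# ['U+0031', 'U+0037', 'U+0031', 'U+2044', 'U+0032']  -> "171⁄2"
# Lose or mangle U+2044 (FRACTION SLASH) and the sentence reads like "171-2".
```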


7 Things You Should Know About Unicode


If you know about the Educause 7 Things You Should Know About... series, then you know that it is important to be able to identify seven important elements of any technology.

So here is my spin on what "you should know" (or what someone not familiar with Unicode might need to know).

1. What is it?

Unicode is an encoding standard: each character in each script is assigned a unique number (because computers track everything by number), allowing literally any character from any script to be represented. Unicode does this by assigning a block of numbers to each script (http://www.unicode.org/charts).

Unicode began in 1991 and focused on the most commonly used scripts first, such as the Latin alphabet, Cyrillic, Chinese, Japanese, Arabic, Greek, Hebrew, Devanagari and others. All major world scripts are now covered, as well as many minority and ancient scripts.
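In Python, for instance, you can ask any character for its number (its "code point") directly - a quick illustration of the one-character-one-number idea:

```python
# Every character, in every script, has exactly one Unicode number (code point).
for ch in "A\u044f\u05d0\u4e2d":  # Latin A, Cyrillic я, Hebrew א, Chinese 中
    print(f"{ch}  decimal {ord(ch)}  hex U+{ord(ch):04X}")
# A  decimal 65     hex U+0041
# я  decimal 1103   hex U+044F
# א  decimal 1488   hex U+05D0
# 中  decimal 20013  hex U+4E2D
```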

2. Who's doing it?

Unicode encoding has been incorporated into Windows (since Windows NT), Macintosh (since OS X) and new versions of Linux/Unix. Applications supporting Unicode include newer versions of Adobe applications, Microsoft Office, the Apple iLife/iWork series, FileMaker, EndNote, Google, GoogleDocs, Twitter, Zotero, blogs, Facebook and many others.

3. How does it work?

To read Unicode text, a user needs to have the correct Unicode font installed. Both Apple and Microsoft provide well-stocked fonts for free, but not every character is covered. Fortunately many freeware fonts are available.

To enter Unicode text, users must activate keyboard utilities or use special escape codes to enter characters for the appropriate script. Again Microsoft and Apple provide a lot of built-in utilities, but additional ones are also available online, many as freeware.
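The "special escape codes" vary by environment; as one example (my choice of illustration, not the only method), Python source code lets you type any character by its code point with a \u escape or build it with chr():

```python
# Entering characters by code point instead of by keyboard layout.
e_acute = "\u00e9"     # é via a source-code escape
delta = chr(0x0394)    # Δ via chr() and a hex code point
print(e_acute, delta)

# The equivalent HTML escapes would be &#x00E9; and &#x0394;
```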

4. Why is it significant?

Consistent encoding allows users to exchange text consistently and for font developers to develop new fonts with a wide range of characters in a consistent manner. When properly implemented, a Mac user can read a Greek text file created on a Windows machine with minimal adjustment.

5. What are the downsides?

One is that older programs developed before Unicode may need to be retrofitted if they are meant to be used by a global audience. Programmers need to learn new techniques in order to take advantage of Unicode encoding.

The other remaining problem is that Unicode implementation on the user end is still confusing. Users working with languages other than English need to either activate/install special utilities or memorize a series of special codes. Methods of inputting text also vary from software to software. A lot of tech-savviness is required in order to maximize Unicode compatibility.

6. Where is it going?

The goal is for every script, even those for ancient languages, to be encoded within Unicode. This will not only enable new technologies to be used in any language, but will allow texts from around the world to be digitized in a common format. Unicode support for major languages has arrived, but support for many lesser-known scripts and quirky cases in major scripts still needs to be implemented.

7. What are the implications for teaching and learning?

Unicode will

  • Simplify the display of non-English texts in foreign language courses and courses taught in non-English speaking areas
  • Standardize the display of mathematical and technical symbols
  • Allow non-English speaking communities to write in their native scripts instead of transliterating text in the Roman alphabet
  • Expand the typographical repertoire of font designers
  • And...if you're a pioneer...Unicode will introduce you to the joys of converting between decimal and hexadecimal values
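And for those pioneers, the decimal/hexadecimal conversion in that last bullet is less scary than it sounds - a Python sketch:

```python
# Unicode charts list code points in hex (e.g. U+2013 for the en dash),
# while HTML numeric entities often use decimal (&#8211;).
print(int("2013", 16))        # 8211   (hex -> decimal)
print(format(8211, "04X"))    # 2013   (decimal -> hex)
print("\u2013" == chr(8211))  # True - the same en dash either way
```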


Explaining and Inventing Your Own Unicode Jargon - Part 2


Two entries ago, I extrapolated what would happen to encoding jargon in the Star Trek universe, mostly an exercise to explain how internationalization (i18n) is structured. In this installment, I hope to demonstrate how things only get more complicated when local encodings meet each other.

Starting "Local" Standards

In the new frontier of "interplanetarization (i19n)", we'll already be starting with a buffet of alphanumeric terms - namely the encoding standard(s) of each planetary system. I'll repeat some below. Notice that the Orions still have two competing standards.

  • TUTF-32 - Terran Unicode (32 bit)
  • TLHLSCII - tlhIngan Hol (Klingon) Language Institute Standard Code for Information Exchange
  • RIS-105 - Romulan Imperial Standard #105
  • VSAUS-210A - Vulcan Science Academy Unified Standard #210A
  • ACS34 - Andorian Communication Standard #34
  • TelSCII - Tellarite Standard Code for Information Interchange
  • OTLC-10 - Orion Technology Limited Code #10
  • SuperSix - As agreed upon by six major Orion Trading Houses

Before They Create Fedcode

I would assume that the Federation will eventually develop a really large unified standard similar to Unicode. I will call this Fedcode. However...the development of Fedcode will take a while and may even present new challenges in how many bytes are needed for each character.

In the meantime, the local computing systems will need a way to exchange information quickly, so I extrapolate that a lot of ad hoc encodings will come first. Such as:

What the Terrans may Invent

Similar to the Vulcans, I think the Unicode Consortium will try to incorporate the new scripts into Unicode. At version 9.2, Unicode had 17 planes, which was enough to accommodate the new Terran scripts, but finding new historical scripts will really add to the complexity.

Unicode 10 might have to add another layer (a "dimension"?). In this scheme, Dimension 0 will be the Unicode we have now, and then we would add

  • Unicode 10, Dimension 0 (= today's Unicode)
  • Unicode 10, Dimension 1 (= VSAUS-210A )
  • Unicode 10, Dimension 2 (= TLHLSCII)
  • Unicode 10, Dimension 3 (= OTLC10, not SuperSix)
  • ...

What the Vulcans Might Invent

  • VSAUS-210A -1 (All Vulcan scripts)
  • VSAUS-210A -2 (Basic Vulcan plus Andorian scripts, based on ACS34)
  • VSAUS-210A -3 (Basic Vulcan plus Tellarite scripts, based on TelSCII)
  • VSAUS-210A -4 (Basic Vulcan plus Klingon scripts, based on TLHLSCII )
  • VSAUS-210A -5 (Basic Vulcan plus Orion scripts, based on SuperSix, not OTLC-10)
  • VSAUS-210A -6 (Basic Vulcan plus Terran scripts, based on Unicode 9.2)

Again, the 1 through 6 are referring to blocks/planes/dimensions in VSAUS-210A; it's just that the Vulcan encoding allows users to specify location in the scheme to facilitate their processing.

What the Orions Might Invent

Let's skip the Klingons and the Andorians and jump to the worst-case scenario - the Orions, whose two encodings are developed by competing corporate technology interests. Each vendor/trading house will expand its encodings, but in different directions.

Thus we will have:

  • OTLC-10 (Orion/all Orion measurements) - 16 bit for rapid processing
  • OTLC-11 (Vulcan)
  • OTLC-12 (Terran Unicode Plane 0)

As well as

  • SuperSix (Orion) - 64bit for "exact recording"
  • SuperSixV - Orion plus Vulcan
  • SuperSixT - Orion plus Unicode Plane 0
  • SuperSixPlus - Combines all scripts

By Fedcode

As you can see, by the time the Federation i19n experts meet for the first time to standardize Fedcode, there will be not only local planetary standards to work with but also competing "combined" standards such as Unicode 10.5, SuperSixPlus and VSAUS-210A.

Which will become the basis of Fedcode? How will they plan for expansion for new scripts encountered?

And most of all - how will future computers handle the transformation between Fedcode and KDS (Cardassian Processing Standard)?


About The Blog

I am a Penn State technology specialist with a degree in linguistics, and I have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at ejp10@psu.edu.

