Recently in Humor Category

Unifaces and Other Unusual Unicode Applications

A while ago, I pointed out that vision charts have expanded beyond the Western scripts, and now so have emoticons. Check out http://twitter.com/unifaces for ways to use the wide range of Unicode symbols to express different facial expressions. Thanks to the Twitter feed authors for sending this to me.

And while I was at it I checked out her del.icio.us site and discovered that:

  1. Mojibake is the Japanese term for the garbled characters (including the Unicode question mark of death) that appear when a symbol cannot be displayed. I am glad to have a technical term, but since it's not translated, I do wonder what the literal meaning is. Hopefully it means "ghost character" or "character changing". It appears that the verb bakeru means "change spookily" or "appear in disguise". Ah, the mysteries of Unicode.

  2. If you need a new hobby, you can try faking Cyrillic text with Latin characters (e.g. PyccKNN instead of Русский). Detailed instructions are on the Wikipedia Volapuk encoding page. Actually there was a scare a few years back when some Russian spammers were using Cyrillic characters to fake Western URLs (e.g. РЕИИ SТАТЕ ... or if you like Greek - ΡΕΝΝ SΤΑΤΕ). Only the "S" is Western Latin. It turns out to be tricky in both directions because it's the capitals that match the best. But I guess it's the global version of Leet (L33t/1337).
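If you want to see the disguise for yourself, here is a minimal Python sketch that unmasks the Greek "ΡΕΝΝ SΤΑΤΕ" example. The helper names (describe, scripts_used) are my own inventions, and using the first word of the official character name as a stand-in for "script" is only a rough heuristic, not a real script detector.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each non-space character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
            for ch in text if not ch.isspace()]

def scripts_used(text):
    """Rough heuristic: treat the first word of each character name as its script."""
    return {unicodedata.name(ch).split()[0] for ch in text if not ch.isspace()}

latin = "PENN"
greek = "\u03a1\u0395\u039d\u039d"  # Greek capitals rho, epsilon, nu, nu ("ΡΕΝΝ")
fake = "\u03a1\u0395\u039d\u039d S\u03a4\u0391\u03a4\u0395"  # only the S is Latin

for row in describe(greek):
    print(row)

print(latin == greek)       # the two strings look alike but do not compare equal
print(scripts_used(fake))   # more than one script in a single "word" is a warning sign
```

A name that mixes scripts is not always a fake, so treat the result as a hint rather than a verdict.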

I'd be tempted to tell everyone to get back to work, but then I would have to get back to my work, and that's not always Unicode related.

A Unicode Eye Chart

If your eyes are becoming glazed trying to determine if that glyph is = or ≡ or ≅ or something else remarkably similar...then you may want to check your vision with this helpful Unicode Eye Chart.

Comes with a useful key at the bottom. Isn't it amazing what you can find on the Web?
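If the eye chart doesn't help, you can always ask the computer which glyph you are looking at. A small Python sketch (the label function is just my own helper) that prints the code point and official name of the three lookalikes above:

```python
import unicodedata

def label(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

lookalikes = "=\u2261\u2245"  # "=", "≡" and "≅"

for ch in lookalikes:
    print(ch, label(ch))
```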

Funky Fraction Glitch

It's been a long week and I was catching up on my celebrity news when I saw the following in my RSS headline reader.

O.J. Simpson Sentenced to 171-2 Years

I'm not a big O.J. Simpson fan, but a 171-2 year sentence seemed a little excessive for robbery. But actually it was a Unicode glitch. It was supposed to be 17½, but that part of the reader was having a problem.

17.5Not171.gif

The lesson learned - always leave a space between the whole number and its fractional component. TGIF!!
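I can't say exactly what the RSS reader did, but one way to reproduce a very similar glitch is Unicode compatibility normalization, which breaks ½ into the pieces 1, a fraction slash and 2. The Python sketch below shows that, along with a hypothetical helper (to_number is my own name) that reads the value safely instead.

```python
import unicodedata

headline = "17\u00bd"  # "17½"

# Compatibility normalization replaces ½ with "1", FRACTION SLASH (U+2044), "2",
# so the whole number and the fraction run together.
flattened = unicodedata.normalize("NFKC", headline)
print(flattened)

def to_number(text):
    """Add the plain decimal digits to the value of any fraction characters."""
    digits = ""
    fraction = 0.0
    for ch in text:
        if ch.isdecimal():
            digits += ch
        else:
            fraction += unicodedata.numeric(ch)  # ValueError if ch has no numeric value
    return int(digits or "0") + fraction

print(to_number(headline))
```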

7 Things You Should Know About Unicode

If you know about the Educause 7 Things You Should Know About... Series, then you should know that it is important to be able to identify seven important elements about any technology.

So here is my spin on the "you should know" list (or what someone not familiar with Unicode might need to know).

1. What is it?

Unicode is an encoding standard. Each character in each script has a number (because computers track everything by number), and Unicode has room for over a million of them (1,114,112 code points), allowing virtually any character from any script to be assigned a number. Unicode does this by assigning a block of numbers to each script (http://www.unicode.org/charts).

The first version of Unicode was published in 1991 and focused on the most commonly used scripts first, such as the Latin alphabet, Cyrillic, Chinese, Japanese, Arabic, Greek, Hebrew, Devanagari and others. All major world scripts are now covered, as well as many minority and ancient scripts.
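To make the "every character has a number" idea concrete, here is a small Python sketch. The BLOCKS table is a hand-typed excerpt of four blocks from the code charts (the real list is far longer), and block_of is my own helper name, not anything official.

```python
import sys
import unicodedata

# A tiny, hand-typed excerpt of the block list (assumption: four blocks only).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0900, 0x097F, "Devanagari"),
]

def block_of(ch):
    """Return the block name for ch, or None if it is outside the excerpt."""
    number = ord(ch)
    for first, last, name in BLOCKS:
        if first <= number <= last:
            return name
    return None

for ch in ["A", "\u03a9", "\u042f", "\u0905"]:  # A, Ω, Я, अ
    print(ch, ord(ch), f"U+{ord(ch):04X}", unicodedata.name(ch), block_of(ch))

# The highest code point is U+10FFFF, so the total count is one more than that.
print(sys.maxunicode + 1)
```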

2. Who's doing it?

Unicode encoding has been incorporated into Windows (since Windows NT), Macintosh (since OS X) and new versions of Linux/Unix. Applications supporting Unicode include newer versions of Adobe applications, Microsoft Office, the Apple iLife/iWork series, FileMaker, EndNote, Google, GoogleDocs, Twitter, Zotero, blogs, Facebook and many others.

3. How does it work?

To read Unicode text, a user needs to have the correct Unicode font installed. Both Apple and Microsoft provide well-stocked fonts for free, but not every character is covered. Fortunately many freeware fonts are available.

To enter Unicode text, users must activate keyboard utilities or use special escape codes to enter characters for the appropriate script. Again Microsoft and Apple provide a lot of built-in utilities, but additional ones are also available online, many as freeware.
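The escape codes differ from program to program, so take this only as an illustration of the idea. In Python, for instance, the same é can be typed by hexadecimal code, by official name or by decimal number:

```python
import unicodedata

by_hex = "\u00e9"                                   # hexadecimal escape
by_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"     # escape by official name
by_number = chr(233)                                # decimal code point
by_lookup = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")

# All four spellings produce the same one-character string.
print(by_hex, by_name, by_number, by_lookup)
print(by_hex == by_name == by_number == by_lookup)
```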

4. Why is it significant?

Consistent encoding allows users to exchange text reliably and lets font developers build new fonts covering a wide range of characters in a predictable manner. When properly implemented, a Mac user can read a Greek text file created on a Windows machine with minimal adjustment.
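Here is a rough Python sketch of why the Greek file example matters. It assumes the pre-Unicode Windows file was saved as cp1253 and the old Mac expected MacGreek (Python's "mac_greek" codec); those pairings are my illustration, not a claim about any particular program.

```python
text = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"  # "Ελληνικά"

# Before Unicode: each platform stored the same word as different bytes.
windows_bytes = text.encode("cp1253")
mac_bytes = text.encode("mac_greek")
print(windows_bytes != mac_bytes)

# Reading the Windows bytes with the Mac table gives the wrong letters.
misread = windows_bytes.decode("mac_greek", errors="replace")
print(misread)

# With a shared encoding such as UTF-8, both sides agree on the bytes.
utf8_round_trip = text.encode("utf-8").decode("utf-8")
print(utf8_round_trip == text)
```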

5. What are the downsides?

One is that older programs developed before Unicode may need to be retrofitted if they are meant to be used by a global audience. Programmers need to learn new techniques in order to take advantage of Unicode encoding.
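One example of a "new technique" (my pick, not an exhaustive list) is that a character, a byte and a code point are no longer the same thing. A short Python sketch:

```python
import unicodedata

word = "na\u00efve"  # "naïve" with a single precomposed ï

print(len(word))                   # characters (code points)
print(len(word.encode("utf-8")))   # bytes on disk in UTF-8

# The same word can also be stored as "i" plus a combining diaeresis.
decomposed = unicodedata.normalize("NFD", word)
print(len(decomposed))
print(decomposed == word)

# Normalizing both sides to one form makes the comparison work again.
print(unicodedata.normalize("NFC", decomposed) == word)
```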

The other remaining problem is that Unicode implementation on the user end is still confusing. Users working with languages other than English need to either activate/install special utilities or memorize a series of special codes. Methods to input text also vary from software to software. A lot of tech-savviness is required in order to maximize Unicode compatibility.

6. Where is it going?

The goal is for every script, even those for ancient languages, to be encoded within Unicode. This will not only enable new technologies to be used in any language, but will allow texts from around the world to be digitized in a common format. Unicode support for major languages has arrived, but support for many lesser-known scripts and quirky cases in major scripts still needs to be implemented.

7. What are the implications for teaching and learning?

Unicode will

  • Simplify the display of non-English texts in foreign language courses and courses taught in non-English speaking areas
  • Standardize the display of mathematical and technical symbols
  • Allow non-English speaking communities to write in their native scripts instead of transliterating text in the Roman alphabet
  • Expand the typographical repertoire of font designers
  • And...if you're a pioneer...Unicode will introduce you to the joys of converting between decimal and hexadecimal values
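For the pioneers, here is a small Python sketch of the decimal/hexadecimal joy. The helper names to_hex and to_decimal are mine; the HTML numeric entities are included because that is where I most often see both number styles.

```python
import html

def to_hex(code_point):
    """Format a decimal code point in the usual U+XXXX style."""
    return f"U+{code_point:04X}"

def to_decimal(hex_label):
    """Turn 'U+00E9' (or just '00E9') back into a decimal number."""
    return int(hex_label.upper().replace("U+", ""), 16)

e_acute = "\u00e9"  # "é"
print(ord(e_acute), to_hex(ord(e_acute)))
print(to_decimal("U+00E9"))

# The same character as a decimal and a hexadecimal HTML entity.
print(html.unescape("&#233;") == html.unescape("&#xE9;") == e_acute)
```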

Explaining and Inventing Your Own Unicode Jargon - Part 2

Two entries ago, I extrapolated what would happen to encoding jargon in the Star Trek universe, mostly an exercise to explain how internationalization (i18n) is structured. In this installment, I hope to demonstrate how things only get more complicated when local encodings meet each other.

Starting "Local" Standards

In the new frontier of "interplanetarization (i19n)", we'll already be starting with a buffet of alphanumeric terms - namely the encoding standard(s) of each planetary system. I'll repeat some below. Notice that the Orions still have two competing standards.

  • TUTF-32 - Terran Unicode (32 bit)
  • TLHLSCII - tlhIngan Hol (Klingon) Language Institute Standard Code for Information Exchange
  • RIS-105 - Romulan Imperial Standard #105
  • VSAUS-210A - Vulcan Science Academy Unified Standard #210A
  • ACS34 - Andorian Communication Standard #34
  • TelSCII - Tellarite Standard Code for Information Interchange
  • OTLC-10 - Orion Technology Limited Code #10
  • SuperSix - As agreed upon by six major Orion Trading Houses

Before They Create Fedcode

I would assume that the Federation will eventually develop a really large unified standard similar to Unicode. I will call this Fedcode. However...the development of Fedcode will take a while and may even present new challenges in how many bytes are needed for each character.

In the meantime, the local computing systems will need a way to exchange information quickly, so I extrapolate that a lot of ad hoc encodings will appear first. Such as:
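Terran Unicode already has its own version of the bytes-per-character question. As a small present-day sketch (the sizes helper is my own name), here is how many bytes one character takes in three Unicode encoding forms:

```python
def sizes(ch):
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    return {name: len(ch.encode(name))
            for name in ("utf-8", "utf-16-be", "utf-32-be")}

for ch in ["A", "\u4e2d", "\U0001F600"]:  # Latin A, Chinese 中, an emoji
    print(f"U+{ord(ch):04X}", sizes(ch))
```

UTF-32 always spends four bytes, while UTF-8 and UTF-16 vary by character, which is the kind of trade-off Fedcode would presumably have to settle too.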

What the Terrans may Invent

Similar to the Vulcans, I think the Terrans will try to incorporate the new scripts into Unicode. At version 9.2, Unicode had 17 planes (Plane 0 plus 16 supplementary planes), which was enough to accommodate the new Terran scripts, but finding new historical scripts will really add to the complexity.

Unicode 10 might have to add another layer (a "dimension"?). In this scheme, Dimension 0 will be the Unicode we now have, giving us:
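For the curious, a plane in today's Unicode is just a run of 65,536 code points, so the plane number falls out of simple arithmetic. A minimal Python sketch (plane_of and PLANE_COUNT are my own names); counting Plane 0 along with the 16 supplementary planes gives 17 in all:

```python
def plane_of(ch):
    """Each plane holds 0x10000 code points, so shift off the low 16 bits."""
    return ord(ch) >> 16

# The last code point is U+10FFFF, which sits in plane 16.
PLANE_COUNT = (0x10FFFF >> 16) + 1

for ch in ["A", "\u4e2d", "\U0001F600"]:  # Latin A, Chinese 中, an emoji
    print(f"U+{ord(ch):04X} is in plane {plane_of(ch)}")

print(PLANE_COUNT)
```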

  • Unicode 10, Dimension 0 (= today's Unicode)
  • Unicode 10, Dimension 1 (= VSAUS-210A )
  • Unicode 10, Dimension 2 (= TLHLSCII)
  • Unicode 10, Dimension 3 (= OTLC-10, not SuperSix)
  • ...

What the Vulcans Might Invent

  • VSAUS-210A -1 (All Vulcan scripts)
  • VSAUS-210A -2 (Basic Vulcan plus Andorian scripts, based on ACS34)
  • VSAUS-210A -3 (Basic Vulcan plus Tellarite scripts, based on TelSCII)
  • VSAUS-210A -4 (Basic Vulcan plus Klingon scripts, based on TLHLSCII )
  • VSAUS-210A -5 (Basic Vulcan plus Orion scripts, based on SuperSix, not OTLC-10)
  • VSAUS-210A -6 (Basic Vulcan plus Terran scripts, based on Unicode 9.2)

Again, the 1 through 6 are referring to blocks/planes/dimensions in VSAUS-210A; it's just that the Vulcan encoding allows users to specify location in the scheme to facilitate their processing.

What the Orions Might Invent

Let's skip the Klingons and the Andorians and jump to the worst case scenario - the Orions, whose two encodings are developed by competing corporate technology interests. Each vendor/trading house will expand its encoding, but in different directions.

Thus we will have:

  • OTLC-10 (Orion/all Orion measurements) - 16 bit for rapid processing
  • OTLC-11 (Vulcan)
  • OTLC-12 (Terran Unicode Plane 0)

As well as

  • SuperSix (Orion) - 64bit for "exact recording"
  • SuperSixV - Orion plus Vulcan
  • SuperSixT - Orion plus Unicode Plane 0
  • SuperSixPlus - Combines all scripts

By Fedcode

As you can see, by the time the Federation i19n experts meet for the first time to standardize Fedcode, there will be not only local planetary standards to work with but also competing "combined" standards such as Unicode 10.5, SuperSixPlus and VSAUS-210A.

Which will become the basis of Fedcode? How will they plan for expansion for new scripts encountered?

And most of all - how will future computers handle the transformation between Fedcode and KDS (Cardassian Processing Standard)?

Explaining and Inventing Your Own Unicode Jargon - Part 1

I love the i18n/UTF-8 process as much as anyone, but you have to admit that all those flying letter and number combinations can be a little overwhelming to the newcomer. So I think a primer is needed.

There are some real glossaries out there such as the Unicode Glossary and the Penn State i18n glossary, and the IBM Glossary of Unicode Terms...but you really do learn more when you create your own material. So with that in mind, I present

Encoding in the World of Star Trek

I would like to believe that someday we will contact other civilizations (with some sort of encoded communication) and at that point we will need to expand and create new encodings (and of course new jargon) such as

Jargon of Process

Three current terms for the field of wrangling non-English text are i18n for "internationalization", g11n for "globalization" (both refer to making content/systems usable to people using any script) and the related l10n for "localization" (adapting information from one region for a second region, e.g. a Japanese product sold in the United States).

These terms have the same structure: start with the first letter, end with the last letter and insert the number of letters in between. Thus internationalization (20 letters total, 18 between "i" and "n") becomes i18n.
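The rule is simple enough to hand to a computer. A minimal Python sketch (numeronym is my own function name, and lowercasing the result is my choice):

```python
def numeronym(word):
    """First letter + count of the letters in between + last letter."""
    if len(word) < 3:
        return word.lower()  # nothing in between to count
    return f"{word[0]}{len(word) - 2}{word[-1]}".lower()

for term in ["internationalization", "globalization", "localization"]:
    print(term, "->", numeronym(term))
```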

You can apply this to any term such as "Romanization" and "transliteration" (see answers below for new terms), and in the future we will need alternate terms to include the fact that we are working with planets, not just nations. So maybe we will have

  • galaxification (g12n) - even greater than g11n
  • interplanetarization (i19n) - also greater than i18n
  • astrointernationalization (a23n) - the biggest of them all
  • Romanization (r10n) - I made this up
  • transliteration (t13n) - this does exist, but is not frequently seen

FYI - Both r10n and t13n refer to the process of writing any language in the Roman (Western/Latin) alphabet. Japanese Rōmaji is an example of this process.

Local Government Standards

Before the days of Unicode, each region had established its own encoding standard for its own language(s). The most famous may be ASCII (American Standard Code for Information Interchange) from which we also got VISCII (Vietnamese), ISCII (India) and ArmSCII (Armenian).

Another pattern is to name the encoding standard after the governmental standards body and the number of the encoding scheme (usually a sequential number). This is how we arrive at TIS-620 (Thailand, Thai Industrial Standard #620), GB 2312 (China) and ELOT 928 (Greece/Ellas). A governmental agency also gave names to Shift-JIS (Japan, combination of JIS X 0201 and JIS X 0208) and ANSI (U.S., American National Standards Institute).

Finally, if for some reason the local government doesn't move as rapidly as needed, then a corporation will invent its own standard on the fly. In the U.S. we got both Windows-1252 (Win-1252) and MacRoman encodings this way. In Taiwan, they got Big5 (a Traditional Chinese encoding standard agreed upon by five corporations).
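To see why all these local standards caused trouble, here is a small Python sketch that reads the very same byte under four of them. The codec names are Python's own; I am using "iso8859_7" as a stand-in for ELOT 928, on the understanding that the two are closely related rather than identical.

```python
LEGACY = ["cp1252", "mac_roman", "iso8859_7", "tis_620"]

raw = bytes([0xE1])  # one byte, four possible meanings

# Decode the same byte with each regional or corporate table.
readings = {name: raw.decode(name) for name in LEGACY}

for name, ch in readings.items():
    print(f"{name}: {ch} (U+{ord(ch):04X})")
```

Without a label saying which table was used, the receiving computer can only guess, and a wrong guess is how garbled text happens.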

Future Local Planetary Encoding Standards

In the future, I will assume that each Star Trek planet has its own version of Unicode, but of course each will have its own encoding designation. Can you Star Trek fans guess where these are from?

  • KLISCII or TLHLSCII (depending on linguistic accuracy)
  • RIS-105
  • VSAUS-210A (because this planet uses hex numbers)
  • FMSS-13B1 (in duodecimal numbers because you can quickly divide by 3)
  • TUTF-32 (future name for an existing standard)

Since I will be talking cross-planetary standardization next time, I will add these potential encodings

  • ACS34 - Andorian Communication Standard #34
  • TelSCII - Tellarite Standard Code for Information Interchange
  • OTLC-10 - Orion Technology Limited Code #10
  • SuperSix - As agreed upon by six major Orion Trading Houses
  • BNTCXS - Betazed Non-Telepathic Communication Exchange Standard

And to finalize the list

  • KLISCII - Klingon Language Institute Standard Code for Information Exchange or
    TLHLSCII - tlhIngan Hol Language Institute Standard Code for Information Exchange
  • RIS-105 - Romulan Imperial Standard #105
  • VSAUS-210A - Vulcan Science Academy Unified Standard #210A
  • FMSS-13B1 - Ferengi Mercantile Society Standard #13B1
  • TUTF-32 - Terran Unicode (32 bit)

Final challenge - what encoding would you invent for the Cardassians?

Quirkiest i18n Linux Logos

Linux is that great open source OS which has been adopted around the world. And sometimes, the Linux penguin gets to wear new outfits (or hang out with new friends). Some of my favorite i18n penguin costume changes are....

  • Cymrux Welsh Linux - The penguin has a red dragon pal...and both are the same size!
  • Linux4Arab - Stylish eye wear and head gear for our Antarctic avian.
  • Linux Malta (It's back!) (dead link) - The penguin hangs out on the beach with a trendy Maltese cross tattoo.
  • Russian Linux (dead link) - imagine if you can, a brown furry penguin with teddy bear ears. This one was a Photoshop composite.

P.S. If you're wondering where the heck this post came from - I'm testing the blog tool again.

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
