Encoding Theory: September 2008 Archives

Explaining and Inventing Your Own Unicode Jargon - Part 2


Two entries ago, I extrapolated what would happen to encoding jargon in the Star Trek universe, mostly an exercise to explain how internationalization (i18n) is structured. In this installment, I hope to demonstrate how things only get more complicated when local encodings meet each other.

Starting "Local" Standards

In the new frontier of "interplanetarization (i19n)", we'll already be starting with a buffet of alphanumeric terms - namely the encoding standard(s) of each planetary system. I'll repeat some below. Notice that the Orions still have two competing standards.

  • TUTF-32 - Terran Unicode (32 bit)
  • TLHLSCII - tlhIngan Hol (Klingon) Language Institute Standard Code for Information Exchange
  • RIS-105 - Romulan Imperial Standard #105
  • VSAUS-210A - Vulcan Science Academy Unified Standard #210A
  • ACS34 - Andorian Communication Standard #34
  • TelSCII - Tellarite Standard Code for Information Interchange
  • OTLC-10 - Orion Technology Limited Code #10
  • SuperSix - As agreed upon by six major Orion Trading Houses

Before They Create Fedcode

I would assume that the Federation will eventually develop a really large unified standard similar to Unicode. I will call this Fedcode. However, the development of Fedcode will take a while and may even present new challenges in how many bytes are needed for each character.

In the meantime, the local computing systems will need a way to exchange information quickly, so I extrapolate that a lot of ad hoc encodings will appear first, such as:

What the Terrans may Invent

Similar to the Vulcans, I think the Terrans will try to incorporate the new scripts into Unicode. At version 9.2, Unicode had 17 planes (Plane 0 through Plane 16), which was enough to accommodate the new Terran scripts, but finding new historical scripts will really add to the complexity.

Unicode 10 might have to add another layer (a "dimension"?). In this scheme, Dimension 0 will be the Unicode we now have, and then we would add further dimensions for the other standards:

  • Unicode 10, Dimension 0 (= today's Unicode)
  • Unicode 10, Dimension 1 (= VSAUS-210A)
  • Unicode 10, Dimension 2 (= TLHLSCII)
  • Unicode 10, Dimension 3 (= OTLC-10, not SuperSix)
  • ...
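The "dimension" idea can be sketched as plain arithmetic. This is only an illustration under assumptions of my own: each dimension is taken to be exactly as large as today's Unicode code space (U+0000 to U+10FFFF, i.e. 17 planes of 65,536 code points), and the names `to_scalar` and `from_scalar` are hypothetical helpers, not part of any real standard.

```python
from typing import Tuple

# Assumed layout: every dimension is the size of today's Unicode code space.
PLANE_SIZE = 0x10000           # 65,536 code points per plane
PLANES_PER_DIMENSION = 17      # planes 0-16, as in today's Unicode
DIMENSION_SIZE = PLANE_SIZE * PLANES_PER_DIMENSION  # 0x110000 = 1,114,112


def to_scalar(dimension: int, code_point: int) -> int:
    """Flatten (dimension, code point) into a single integer."""
    if dimension < 0:
        raise ValueError("dimension must be non-negative")
    if not 0 <= code_point < DIMENSION_SIZE:
        raise ValueError("code point does not fit in one dimension")
    return dimension * DIMENSION_SIZE + code_point


def from_scalar(scalar: int) -> Tuple[int, int, int]:
    """Split a flattened value into (dimension, plane, offset in plane)."""
    if scalar < 0:
        raise ValueError("scalar must be non-negative")
    dimension, code_point = divmod(scalar, DIMENSION_SIZE)
    plane, offset = divmod(code_point, PLANE_SIZE)
    return dimension, plane, offset


# Dimension 0 leaves today's code points unchanged.
print(to_scalar(0, ord("A")))                 # same value as ord("A")
# A code point in Dimension 2 keeps its plane and offset.
print(from_scalar(to_scalar(2, 0x1F600)))
# Bits needed to address four full dimensions.
print((4 * DIMENSION_SIZE - 1).bit_length())
```

Under these assumptions, four dimensions still fit comfortably in a 32-bit unit, so the extra layer costs an encoding like TUTF-32 nothing in storage.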

What the Vulcans Might Invent

  • VSAUS-210A -1 (All Vulcan scripts)
  • VSAUS-210A -2 (Basic Vulcan plus Andorian scripts, based on ACS34)
  • VSAUS-210A -3 (Basic Vulcan plus Tellarite scripts, based on TelSCII)
  • VSAUS-210A -4 (Basic Vulcan plus Klingon scripts, based on TLHLSCII)
  • VSAUS-210A -5 (Basic Vulcan plus Orion scripts, based on SuperSix, not OTLC-10)
  • VSAUS-210A -6 (Basic Vulcan plus Terran scripts, based on Unicode 9.2)

Here, the suffixes 1 through 6 refer to blocks/planes/dimensions within VSAUS-210A; the Vulcan encoding simply lets users specify the location in the scheme to facilitate their processing.

What the Orions Might Invent

Let's skip the Klingons and the Andorians and jump to the worst-case scenario - the Orions, whose two encodings are developed by competing corporate technology interests. Each vendor/trading house will expand its encoding, but in a different direction.

Thus we will have:

  • OTLC-10 (Orion/all Orion measurements) - 16 bit for rapid processing
  • OTLC-11 (Vulcan)
  • OTLC-12 (Terran Unicode Plane 0)

As well as

  • SuperSix (Orion) - 64 bit for "exact recording"
  • SuperSixV - Orion plus Vulcan
  • SuperSixT - Orion plus Unicode Plane 0
  • SuperSixPlus - Combines all scripts
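The 16-bit versus 64-bit split has a concrete cost, which a small sketch makes visible. Everything here is an assumption for illustration: the function name `encode_fixed_width` is made up, the code units are taken to be fixed-width and big-endian, and Latin letters stand in for Orion characters.

```python
def encode_fixed_width(code_points, bits):
    """Pack each code point into one big-endian unit of `bits` bits."""
    width = bits // 8
    out = bytearray()
    for cp in code_points:
        # int.to_bytes raises OverflowError if cp does not fit in the unit.
        out += cp.to_bytes(width, "big")
    return bytes(out)


# Stand-in text: Latin letters, since no Orion repertoire exists.
sample = [ord(ch) for ch in "Orion"]

print(len(encode_fixed_width(sample, 16)))  # 2 bytes per character
print(len(encode_fixed_width(sample, 64)))  # 8 bytes per character
print(2 ** 16)                              # values a 16-bit unit can hold
```

A 16-bit unit is a quarter of the size but tops out at 65,536 distinct values, which is why an OTLC-style encoding would need separate numbered variants while a 64-bit SuperSix could afford a single combined one.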

By Fedcode

As you can see, by the time the Federation i19n experts meet for the first time to standardize Fedcode, there will be not only local planetary standards to work with but also competing "combined" standards such as Unicode 10.5, SuperSixPlus and VSAUS-210A.

Which will become the basis of Fedcode? How will they plan for expansion for new scripts encountered?

And most of all - how will future computers handle the transformation between Fedcode and KDS (Cardassian Processing Standard)?
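Fedcode and KDS are fictional, so here is a minimal sketch of the same problem with real encodings. It assumes the usual approach of decoding to abstract characters and re-encoding, with Python's built-in `iso8859_7`, `utf-8` and `tis_620` codecs as stand-ins; the helper name `transcode` is my own.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings, using Unicode text as the pivot."""
    text = data.decode(source)   # source bytes -> abstract characters
    return text.encode(target)   # abstract characters -> target bytes


greek_bytes = "αβγ".encode("iso8859_7")

# Greek legacy bytes to UTF-8 round-trips cleanly.
utf8_bytes = transcode(greek_bytes, "iso8859_7", "utf-8")
print(utf8_bytes.decode("utf-8"))

# A target without the needed characters fails loudly.
try:
    transcode(greek_bytes, "iso8859_7", "tis_620")
except UnicodeEncodeError as err:
    print("cannot represent:", err.object[err.start:err.end])
```

The second case is the interesting one for standards bodies: a transformation is only lossless when the target repertoire covers the source, which is the argument for a unified superset in the first place.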


Explaining and Inventing Your Own Unicode Jargon - Part 1


I love the i18n/UTF-8 process as much as anyone, but you have to admit that all those flying letter and number combinations can be a little overwhelming to the newcomer. So I think a primer is needed.

There are some real glossaries out there, such as the Unicode Glossary, the Penn State i18n glossary and the IBM Glossary of Unicode Terms, but you really do learn more when you create your own material. So with that in mind, I present:

Encoding in the World of Star Trek

I would like to believe that someday we will contact other civilizations (with some sort of encoded communication), and at that point we will need to expand and create new encodings (and of course new jargon), such as the following.

Jargon of Process

Three current terms for the field of wrangling non-English text are i18n for "internationalization" and g11n for "globalization" (both refer to making content/systems usable to people using any script), plus the related l10n for "localization" (adapting information from one region to a second region, e.g. a Japanese product sold in the United States).

These terms have the same structure: start with the first letter, end with the last letter and insert the number of letters in between. Thus internationalization (20 letters total, 18 between "i" and "n") becomes i18n.

You can apply this to any term such as "Romanization" and "transliteration" (see answers below for new terms), and in the future we will need alternate terms to include the fact that we are working with planets, not just nations. So maybe we will have

  • galaxification (g12n) - even greater than g11n
  • interplanetarization (i19n) - also greater than i18n (strictly a 20-letter word, so the rule would give i18n again; the 19 keeps the two apart)
  • astrointernationalization (a23n) - the biggest of them all
  • Romanization (r10n) - I made this up
  • transliteration (t13n) - this does exist, but is not frequently seen

FYI - Both r10n and t13n refer to the process of writing any language in the Roman (Western/Latin) alphabet. Japanese Rōmaji is an example of this process.
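The first-letter, count, last-letter rule is easy to express in code. A minimal sketch, assuming plain single-word terms and lowercase output; the function name `numeronym` is my own label for the pattern.

```python
def numeronym(term: str) -> str:
    """Abbreviate a word as first letter + count of inner letters + last letter."""
    if len(term) < 3:
        # Nothing lies between the first and last letter, so leave it alone.
        return term.lower()
    return f"{term[0]}{len(term) - 2}{term[-1]}".lower()


for word in ("internationalization", "globalization", "localization",
             "galaxification", "astrointernationalization",
             "Romanization", "transliteration"):
    print(f"{word} -> {numeronym(word)}")
```

Running it reproduces i18n, g11n and l10n, along with g12n, a23n, r10n and t13n from the list above.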

Local Government Standards

Before the days of Unicode, each region had established its own encoding standard for its own language(s). The most famous may be ASCII (American Standard Code for Information Interchange) from which we also got VISCII (Vietnamese), ISCII (India) and ArmSCII (Armenian).

Another pattern is to name the encoding standard after the governmental standards body and the number of the encoding scheme (usually a sequential number). This is how we arrive at TIS-620 (Thailand, Thai Industrial Standard #620), GB2312 (China) and ELOT 928 (Greece/Ellas). A governmental agency also gave names to Shift-JIS (Japan, combination of JIS X 0201 and JIS X 0208) and ANSI (U.S., American National Standards Institute).

Finally, if for some reason the local government doesn't move as rapidly as needed, then a corporation will invent its own standard on the fly. In the U.S. we got both Windows-1252 (Win-1252) and MacRoman encodings this way. In Taiwan, they got Big5 (a Traditional Chinese encoding standard agreed upon by five corporations).
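To see why a pile of local standards is a problem, here is a small sketch that reads one and the same byte under several of the encodings named above. It uses Python's built-in codec names (`cp1252`, `mac_roman`, `tis_620`), with `iso8859_7` standing in for the closely related ELOT 928; that substitution and the helper name `decode_all` are my assumptions.

```python
def decode_all(raw: bytes, codecs):
    """Decode the same bytes under each codec and collect the results."""
    return {name: raw.decode(name) for name in codecs}


LEGACY_CODECS = ("cp1252", "mac_roman", "iso8859_7", "tis_620")

results = decode_all(bytes([0xE9]), LEGACY_CODECS)
for name, char in results.items():
    # Show the codec, the resulting code point and the character itself.
    print(f"{name:10} U+{ord(char):04X} {char}")
```

The single byte 0xE9 is é under Windows-1252 and ι under the Greek encoding, and it comes out as something else again under MacRoman and TIS-620, so bytes without a declared encoding are a guessing game.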

Future Local Planetary Encoding Standards

In the future, I will assume that each Star Trek planet has its own version of Unicode, but of course each will have its own encoding designation. Can you Star Trek fans guess where these are from?

  • KLISCII or TLHLSCII (depending on linguistic accuracy)
  • RIS-105
  • VSAUS-210A (because this planet uses hex numbers)
  • FMSS-13B1 (in duodecimal numbers because you can quickly divide by 3)
  • TUTF-32 (future name for an existing standard)

Since I will be talking about cross-planetary standardization next time, I will add these potential encodings:

  • ACS34 - Andorian Communication Standard #34
  • TelSCII - Tellarite Standard Code for Information Interchange
  • OTLC-10 - Orion Technology Limited Code #10
  • SuperSix - As agreed upon by six major Orion Trading Houses
  • BNTCXS - Betazed Non-Telepathic Communication Exchange Standard

And to finalize the list

  • KLISCII - Klingon Language Institute Standard Code for Information Exchange or
    TLHLSCII - tlhIngan Hol Language Institute Standard Code for Information Exchange
  • RIS-105 - Romulan Imperial Standard #105
  • VSAUS-210A - Vulcan Science Academy Unified Standard #210A
  • FMSS-13B1 - Ferengi Mercantile Society Standard #13B1
  • TUTF-32 - Terran Unicode (32 bit)

Final challenge - what encoding would you invent for the Cardassians?


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.


The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
