Got Unicode?
Jargon List
Tech professionals love jargon, and Unicode experts are no different. Here's a list, just so you know.
Process
- i18n - "internationalization" (because there are 18 letters between the "i" and the "n")
- l10n - "localization". Localization is the process of adapting technology for a particular country or region (e.g. Japan).
- question mark of death - my term for the symbol you see when a product cannot interpret an exotic Unicode character. It's often a question mark in a diamond, but other variants exist.
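As a rough illustration of where the "question mark of death" comes from, the Python sketch below reads bytes with the wrong encoding. The sample word and the variable names are assumptions made up for this example. U+FFFD is the official Unicode replacement character, which many programs draw as a question mark in a diamond.

```python
# "café" saved in the Windows-1252 encoding; there, the byte 0xE9 means "é".
raw = "caf\u00e9".encode("cp1252")

# A program that assumes UTF-8 cannot interpret a lone 0xE9 byte.
# errors="replace" substitutes U+FFFD instead of raising an error.
shown = raw.decode("utf-8", errors="replace")

# Going the other way, a character with no slot in the target encoding
# is replaced by a plain question mark.
plain = "caf\u00e9".encode("ascii", errors="replace")

print(repr(shown))   # the last character is U+FFFD
print(plain)
```

Without `errors="replace"`, both calls would raise an exception instead, which is usually the safer default while debugging.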
Script Types
- RTL - Right to left, as in Arabic and Hebrew
- CJK - Chinese-Japanese-Korean. All these scripts may reference Chinese characters from time to time.
- Indic - Scripts of India which use consonants with vowel marks. These scripts are literally in a class of their own in terms of i18n.
- Roman/Latin - Our alphabet, derived from the Roman alphabet used to write Latin.
- Cyrillic - the name of the Russian alphabet
- Traditional Chinese - the older form of the Chinese script used in Taiwan and many older Chinese communities in North America and Europe
- Simplified Chinese - newer simplified form of the script developed by the People's Republic of China.
- Hiragana/Katakana - two phonetic scripts used in Japan in addition to Chinese characters (Kanji) and Roman letters (Rōmaji)
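One way to see which script a character belongs to is to ask Python's standard `unicodedata` module for its official name and text direction. This is only a sketch: the sample characters and the `samples` dictionary are assumptions chosen for illustration, not part of the list above.

```python
import unicodedata

# One sample character per script type (written as escapes to stay ASCII-safe).
samples = {
    "Latin": "L",
    "Cyrillic": "\u042f",    # Russian capital letter Ya
    "Hebrew": "\u05d0",      # Alef
    "Arabic": "\u0627",      # Alef
    "Devanagari": "\u0915",  # Ka, an Indic consonant
    "Hiragana": "\u3042",
    "Katakana": "\u30a2",
    "Kanji": "\u6f22",
}

# The official character name usually starts with the script name.
names = {label: unicodedata.name(ch) for label, ch in samples.items()}

# The bidirectional class shows direction: "L" is left to right,
# "R" and "AL" are the right-to-left classes (Hebrew and Arabic letters).
directions = {label: unicodedata.bidirectional(ch) for label, ch in samples.items()}

for label in samples:
    print(f"{label:<11} {directions[label]:<3} {names[label]}")
```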
Encodings
Before Unicode, text was encoded in a number of ways. Ones you may encounter are listed below. Once you read this list, you will see why Unicode was developed.
- UTF-8 - A near synonym for Unicode, especially for Web pages. UTF-16 and UTF-32 are also versions of Unicode.
- ASCII - Original encoding from the 1960s. It only includes 128 characters.
- 8-bit encodings - encodings expanded to 256 (2^8) characters
- MacRoman - Developed by Apple to include characters for French, Spanish, German, Danish and so forth.
- ISO-8859-1/Latin-1/Western European - Similar (yet different) system used on the Internet.
- CP1252/Win-1252 - The Microsoft encoding. It's the same as ISO-8859-1 but with additional characters (which are recognized on the Internet)
- ISO-8859-2/Latin-2/Central European - Latin encoding with letters for Polish, Hungarian, Czech and other Central European languages.
- CP1250/Win-1250 - the Central European encoding used by Microsoft and some Web sites.
- CP1251/Win-1251 - Cyrillic encoding from Microsoft and some Web sites. The encoding is actually a combination of ASCII plus extra Cyrillic letters.
- KOI-8 - Cyrillic encoding for Unix and some Web sites.
- 16-Bit - Extra large sets used for CJK (East Asian) scripts. These can contain up to 2^16 characters, or over 65,000. Asian encodings ALSO include ASCII, Latin 1, Greek and Cyrillic letters.
- EUC_JP/Shift_JIS - Different Japanese encodings.
- Big5/EUC-TW - Different encodings for Traditional Chinese. See the Chinese page for more details.
- GB2312, GBK, GB18030 - Different Simplified Chinese encodings. See the Chinese page for more details.
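To make the problem with these older encodings concrete, the Python sketch below encodes the same character under several of them and compares the bytes. The helper `encode_hex` and the sample characters are hypothetical names and choices made for this example. The codec names are the ones Python's standard library uses for the encodings listed above.

```python
def encode_hex(text, codecs):
    """Return {codec: bytes as hex, or None if the codec cannot encode text}."""
    result = {}
    for codec in codecs:
        try:
            result[codec] = text.encode(codec).hex()
        except UnicodeEncodeError:
            result[codec] = None
    return result

# The same letter (e with acute accent) under different Western encodings.
western = encode_hex("\u00e9", ["ascii", "mac_roman", "latin-1", "cp1252", "utf-8"])

# The euro sign is one of the extra characters CP1252 adds to ISO-8859-1.
euro = encode_hex("\u20ac", ["latin-1", "cp1252"])

# Cyrillic text saved as CP1251 but read as KOI8-R turns into a different letter.
cyrillic_bytes = "\u042f".encode("cp1251")
misread = cyrillic_bytes.decode("koi8_r")

print(western)
print(euro)
print(repr(misread))
```

A byte value only has a meaning once you know which encoding it was written in, which is the gap Unicode was designed to close.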
Unicode Structure
Flavors of Unicode
- UTF-8 - the most commonly used form of Unicode, used for Web pages and e-mail. Characters are translated into 8-bit chunks. ASCII characters are encoded identically in ASCII and UTF-8.
- UTF-16 - Unicode represented as a set of four hexadecimal digits. ASCII characters like capital L (#4C) are represented with leading zeros (#004C). Each pair of hexadecimal digits is one byte. Characters beyond #FFFF take two such sets (a surrogate pair).
- Block - the first byte.
- Code Point - the second byte. In L (#004C), 00 is the block and 4C is the code point.
- Byte Order
- UTF-16 BE (Big Endian) - The UTF-16 data is fed block first (e.g. 00.4C for L)
- UTF-16 LE (Little Endian) - The UTF-16 data is fed code point first (e.g. 4C.00 for L)
- UTF-32 - Unicode represented as a set of eight hexadecimal digits (four bytes). ASCII characters like L (#4C) are represented with leading zeros (#0000004C). This takes up more memory and can be avoided for most applications except those using ancient languages.
- Plane - If a Unicode character has more than four hexadecimal digits, the digits before the last four identify the "plane". Capital L is in plane 0 (#00.00.4C).
- ISO-10646 - The international standard (ISO/IEC 10646) that defines the same character set as Unicode. Rarely used in programming specifications as such.
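The flavors and byte orders above can be checked with a few lines of Python. This is a minimal sketch, not a full treatment: the helper `plane` and the choice of a cuneiform sign (U+12000) as the "ancient language" sample are assumptions made for this example.

```python
def plane(ch):
    """Plane number of a character: everything above the low 16 bits."""
    return ord(ch) >> 16

letter = "L"  # U+004C

utf8 = letter.encode("utf-8").hex()         # one byte, same as ASCII
utf16_be = letter.encode("utf-16-be").hex() # leading zero byte first
utf16_le = letter.encode("utf-16-le").hex() # same two bytes, swapped
utf32_be = letter.encode("utf-32-be").hex() # four bytes, mostly zeros

# A character outside plane 0: CUNEIFORM SIGN A, U+12000.
ancient = "\U00012000"
ancient_utf16 = ancient.encode("utf-16-be").hex()  # needs two 16-bit units
ancient_utf8 = ancient.encode("utf-8").hex()       # needs four bytes

print(utf8, utf16_be, utf16_le, utf32_be)
print(plane(letter), plane(ancient), ancient_utf16, ancient_utf8)
```

Using the explicit `-be` and `-le` codec names keeps Python from adding a byte order mark, so the output shows only the bytes of the character itself.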