The great thing about Unicode research is that I get to learn an amazing amount of trivia I never knew existed about different languages and scripts. Chinese is interesting because of its complexity and the chance for naive Westerners to step on major political and cultural landmines.
I am definitely not an expert, but here are some things I have found out over the years.
If you just want to set up Chinese on your Windows or Mac computer, see the Penn State Chinese Set Up Page.
Simplified vs. Traditional Chinese
The first thing to know for Chinese internationalization is that the modern script comes in two flavors – Simplified (Mainland China) and Traditional (Taiwan, Hong Kong). The choice depends on country of origin, and for a Westerner, comes with some political strings attached. Fortunately, most U.S. systems support both.
Recognizing the complex nature of the Chinese script (with literally tens of thousands of characters), the government of Mainland China implemented a simplified script with less complex shapes and fewer characters with the goal of increasing literacy. Simplified characters often have fewer strokes than the older versions and may be easier to read at smaller font sizes.
The use of Simplified Chinese was also adopted by Singapore.
Traditional Chinese is the older form of the script. It is used in Taiwan, Hong Kong, Macao and many older expatriate Chinese communities.
Different Chinese-speaking regions (Mainland China, Taiwan, Hong Kong, Singapore, etc.) have complicated diplomatic relations, to say the least. When you choose a script you may be implying support for one government's position or another's. Thus when the U.N. announced it would be using only Simplified Chinese for documents, there was much discussion in the blogs.
Similarly, there is a lot of controversy in San Francisco over which script should be taught in Chinese courses in the U.S. (San Francisco Chronicle, May 8 2006).
This is a major landmine to be wary of. I now ask people in internationalization class where they come from and cross my fingers.
It should be noted that in some cases Simplified Chinese merges multiple forms from Traditional Chinese, so that one Simplified character can represent more than one older character. Therefore, Traditional Chinese is still in use in Mainland China for older texts, traditional calligraphy or ceremonial occasions.
Similarly, the economic realities are such that many people in Taiwan, Macao and Hong Kong are also familiar with Simplified Chinese because they may be doing business with Mainland China or Singapore. Apparently people at the borders of China may also be receiving TV shows with Traditional Chinese subtitles, so they are learning both.
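The many-to-one merging mentioned above means that converting Simplified text back to Traditional cannot be done character by character without context. Here is a minimal sketch using a tiny hand-picked table. The names `MERGED` and `traditional_candidates` are hypothetical and made up for this illustration; they are not part of any library.

```python
# Hand-picked examples of Simplified characters that each stand in for
# more than one Traditional character. A real converter would need a full
# table plus word-level context; this table is illustrative only.
MERGED = {
    "发": ["發", "髮"],  # "to send out" vs. "hair"
    "面": ["面", "麵"],  # "face" vs. "noodles"
}


def traditional_candidates(simplified_char):
    """Return the Traditional characters a Simplified character may represent."""
    # Assumption for this sketch: anything not in the table maps to itself.
    return MERGED.get(simplified_char, [simplified_char])


for ch in "发面人":
    print(ch, "->", traditional_candidates(ch))
```

When the function returns more than one candidate, a converter has to look at the surrounding word to pick the right one.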
Hong Kong Supplementary Characters
To make things a little more interesting, writers in Hong Kong may use special characters not used elsewhere to represent certain Cantonese words. Not all of these Hong Kong Supplementary Characters are available in Simplified Chinese encodings.
Japanese and Korean
Both languages can be written in scripts which combine Chinese characters with phonetically based characters. The Chinese characters were originally taken from the Traditional Chinese script centuries ago, but over time each has evolved along its own path.
What about Unicode?
It's complicated, but it boils down to this: Unicode only assigns a new number to a Simplified character or Japanese Kanji character if it is significantly different from the Traditional character. As might be expected, people may argue about where the cutoff line is. This process has been called "Han Character unification" and it can be controversial.
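To make the idea concrete, here is a small sketch that prints the Unicode code points of a few characters. The sample characters are my own picks for illustration. The Traditional and Simplified forms of "horse" differ enough in shape to get separate numbers, while "person" has the same shape in both scripts and gets a single number.

```python
# Each tuple: (description, character). The code points are printed so you
# can see which forms share a number and which do not.
samples = [
    ("Traditional 'horse'", "馬"),
    ("Simplified 'horse'", "马"),
    ("'person' (same shape in both scripts)", "人"),
]

for description, ch in samples:
    print(f"{description}: {ch} = U+{ord(ch):04X}")

# The two forms of 'horse' are encoded as different characters.
horse_forms_differ = ord("馬") != ord("马")
print("Separate code points for 'horse':", horse_forms_differ)
```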
- W3C Micro Tutorial
- Wikipedia Simplified Chinese
- San Francisco Chronicle - Politics Fills the Characters
- Notes on Chinese Character Simplification (Bill Poser)
The Western keyboard is clearly not designed for scripts like Chinese, so alternative utilities are included. For Chinese, native speakers often type keys for different stroke components. Once the first one is chosen, users see a list of possible complete characters then choose the one they want. That is, the script is organized by shape, not necessarily by sound.
There are systems in which you type a Roman pronunciation equivalent (e.g. "ma") then choose the character. However, most native speakers do not use this system.
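The "type something, then pick from a list" flow can be sketched in a few lines. This is a toy model of the pronunciation-based approach only. The `CANDIDATES` table and the `lookup` and `choose` functions are hypothetical names invented for this example; real input methods use large dictionaries and rank candidates by frequency.

```python
# A toy candidate table keyed by romanized input without tone marks.
CANDIDATES = {
    "ma": ["妈", "麻", "马", "骂", "吗"],
    "ren": ["人", "认", "任"],
}


def lookup(typed):
    """Return the candidate characters for a romanized syllable."""
    return CANDIDATES.get(typed, [])


def choose(typed, index):
    """Simulate the user picking the candidate at position `index` (0-based)."""
    return lookup(typed)[index]


print(lookup("ma"))     # the list the user would see
print(choose("ma", 2))  # the character the user picks
```

A shape-based system would work the same way, except that the keys of the table would be stroke components rather than sounds.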
The List of Encodings
Along with the script variants come a list of alternate encodings for Chinese. Normally I'm an advocate of Unicode, but given the complexities of Chinese, I would check to see what encoding you need to support.
The Yale Chinese Computing Center has an excellent overview which I will summarize below.
- Big5 - Developed by a group of five corporations in Taiwan
- EUC-TW - Similar to Big5, but based on CNS 11643 developed by the government of Taiwan
- HKSCS - The Hong Kong Supplementary Character Set
- GB2312 (EUC-CN) - Developed in 1980 by the government of the People's Republic of China. GB is Guojia Biaozhun or "national standard."
- GBK - An extension of GB2312 with additional characters. Note that the Windows version of "GB2312" includes GBK characters.
- GB18030 - The most recent encoding issued by the Chinese government. This is mandated for all computers sold in China. It is a variable-width encoding that uses one, two or four bytes per character and references Unicode. Both I.B.M. and Microsoft have information about GB18030.
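If you need to check which of these encodings can hold a given piece of text, a quick test along these lines may help. It is a sketch that relies on the codecs shipped with Python's standard library (`big5`, `big5hkscs`, `gb2312`, `gbk`, `gb18030`). The helper name `try_encode` and the sample characters are my own choices for illustration.

```python
def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


CODECS = ["big5", "big5hkscs", "gb2312", "gbk", "gb18030", "utf-8"]

samples = {
    "Simplified horse": "马",
    "Traditional horse": "馬",
}

for label, text in samples.items():
    for codec in CODECS:
        encoded = try_encode(text, codec)
        result = encoded.hex() if encoded is not None else "not representable"
        print(f"{label} in {codec}: {result}")
```

The Simplified form cannot be encoded in Big5 and the Traditional form cannot be encoded in GB2312, while GBK, GB18030 and UTF-8 handle both. That is the practical reason to find out which encoding your audience actually uses.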