Recently in CJK Category

Korean Script Heads to Indonesia

The biggest sensation in Unicode land these days is that the Korean script Hangul (or Hangeul/Han'gŭl depending on your transliteration preferences) has been adopted by the speakers of Cia-Cia in the nation of Indonesia. This will be the first time any language other than Korean has adopted Hangul as its writing system, so it is a cultural triumph for them.

What's interesting is how this decision happened. The standard press releases are not giving much information and even the linguistic community is a little perplexed. It's actually more interesting if the Wikipedia report that Cia-Cia was formerly written in the Arabic script (specifically the Jawi variant in Indonesia) is accurate. According to Ethnologue, the population is still mostly Islamic, so there shouldn't be a religious reason to switch.

So what about it? First, let's discuss the switch from Arabic. Actually, a lot of Muslim communities, including speakers of Hausa, Swahili, Malay and Turkish, have switched from the Arabic script to the Latin alphabet. Malaysia and Indonesia are two countries following this trend, although the Jawi/Arabic script is still used in some religious and cultural contexts. There may be a variety of reasons for this, including European colonial policy or the perception that the Latin alphabet is easier to learn and enhances literacy (Turkish). A move to the Latin alphabet may also represent a move towards a secular government (as in the case of Turkey).

It should also be mentioned that the Arabic script must be modified heavily when it is used for non-Semitic languages if all the sounds are to be represented. If you look at the Omniglot Jawi chart, for example, you will see that many consonants have the same shape but with different patterns of dots to indicate the differences. This also happens in the Latin alphabet (e.g. n vs. ñ in Spanish), but if Jawi also includes different letter shapes depending on word position, as Arabic does, then the script becomes more complex.

Cia-Cia is unique, though, in switching to something other than the Latin alphabet. One reader commented that this may be due to the fact that in South and Southeast Asia, a language gains social status by having its own script. In Indonesia, Balinese, Javanese and Sundanese have their own historic scripts. Although these scripts may not be used on an everyday basis, they do show that there is a cultural tradition having nothing to do with the West.

In theory, Cia-Cia could adopt one of these scripts or one from India (e.g. Devanagari), which would probably be a good fit, but none would probably be perceived as being unique in Indonesia. On the other hand...no one else in Indonesia is using Hangul. It is unique. Fortunately, Hangul is probably a good fit. Although the forms are somewhat angular like Chinese writing, the underlying principles are actually very similar to those used in India and Southeast Asia (with some differences of course).
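To make the structural point a little more concrete, here is a minimal Python sketch (my own illustration, not anything from the Cia-Cia project) that uses the standard unicodedata module to split one Hangul syllable block into its component letters (jamo). The function name jamo_names is made up for this example.

```python
import unicodedata

def jamo_names(syllable):
    """Return the Unicode names of the letters (jamo) inside a Hangul syllable block."""
    # NFD normalization decomposes a precomposed syllable into its jamo.
    decomposed = unicodedata.normalize("NFD", syllable)
    return [unicodedata.name(ch) for ch in decomposed]

# U+D55C is the syllable "han": an initial consonant, a vowel and a final consonant.
for name in jamo_names("\ud55c"):
    print(name)
```

Each syllable block is a bundle of consonant and vowel letters, which is the part that resembles how consonants and vowels are grouped in Indic scripts.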

There's another benefit to Hangul over scripts like Javanese and Balinese, and that's enhanced Unicode support. Korea has been fortunate enough to have the economic and political influence for developers to develop functional encoding schemes, fonts and input utilities for Hangul. Many Southeast Asian scripts are still catching up Unicode-wise.

Whether this is the reason Cia-Cia switched to Hangul or not, I wish them the best of luck. I think there are lots of people now invested in the success of this project.

Pinyin Joe's Chinese Computing Help Desk for Vista/XP

Webmaster "Pinyin Joe" has a site that might clear up some of the mysteries of Chinese support in Windows (including Vista):

http://www.pinyinjoe.com/vista/vista_new.htm

He goes through set up, the possible input utilities you can activate and even some font samples for the Microsoft Chinese fonts. There's good coverage of Windows XP as well.

FYI - Mac users should check Yale's Chinese Mac site.

HKSCS (Hong Kong Supplementary Character Set) Links

A while ago, I wrote about the complexity of specifying a language code for Cantonese, the form of Chinese spoken in Hong Kong. As many East Asian specialists know, Cantonese is so distinct from standard Mandarin Chinese (Beijing) that Western universities offer separate Cantonese language classes.

To further complicate the situation, I recently learned that there is also HKSCS, or the "Hong Kong Supplementary Character Set", which is a block of Chinese hanzi characters used just in Hong Kong. I decided to gather a few links for myself, in case the topic ever comes up. Here is what I found.

Some Basic Notes

1. Microsoft does incorporate HKSCS support into Windows in principle, but you may need to download the appropriate plugins, especially for XP and earlier versions of Windows. See the first few links above for details. Full support may also depend on implementation in other software packages.

2. Recent versions of Mac OS X include Cangjie and Jianyi options in the Traditional Chinese input utilities. See the Yale Chinese Mac page above for details. Full support may also depend on implementation in other software packages.

3. HKSCS comes in a 2001 and a 2004 version. It is also tied to both Unicode (UCS) and Big5 encoding (Traditional Chinese, Taiwan) even though the rest of China mostly uses Simplified Chinese.

4. Some recent discussions on the Unicode list (ca. Nov 2008) seemed to indicate that HKSCS was not as widespread as it could be, but it does appear that the major vendors are making initial steps.
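If you want to poke at the Big5 connection yourself, here is a small Python sketch. Python ships both a "big5" codec and a "big5hkscs" codec, so you can ask each one whether it can encode a given character. The helper name can_encode and the sample character are my own choices; I am assuming (not guaranteeing) that this particular character is one of the Hong Kong additions, which is why the script prints the result instead of taking it for granted.

```python
def can_encode(text, codec):
    """Return True if every character in text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# U+5572, a character common in written Cantonese (assumed here to be an HKSCS addition).
sample = "\u5572"
for codec in ("big5", "big5hkscs", "utf-8"):
    print(codec, can_encode(sample, codec))
```

Whatever the legacy codecs report, the UTF-8 line should always come back True, which is a good argument for storing the text as Unicode where possible.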

While I am not an expert on the technical aspects of HKSCS, I do think it's interesting that there continues to be a "Hong Kong" issue even though it's been a part of China for over 10 years. More than 150 years of a separate colonial heritage have allowed a Cantonese written standard to emerge more fully than might otherwise have happened.

W3C Japanese Layout Task Force

The latest reports from the W3C Japanese Layout Task Force are posted at
http://www.w3.org/2007/02/japanese-layout/. The working language is Japanese, but key documents are translated into English.

The page also includes a basic layout primer which discusses issues for vertical layout in Japanese, Ruby Annotation (not Ruby on Rails), switching to the Roman alphabet, Japanese punctuation and more.

English in Chinese Script?

This is an article which attempts to explain how English would be written if the Chinese hanzi system were adapted for it as the system exists in the modern era:

http://zompist.com/yingzi/yingzi.htm

Interestingly, it's not all pictograms, and some syllables may rhyme with Proto-West Germanic (yikes).

Yale Chinese Support Site

Despite some of my previous entries, it's a fact that I really know very little about Chinese writing (I think I can recognize the characters 1,2,3). But if I really had to figure out what was going on, the first place I would probably go is Yale Chinese Mac, which started back in the Mac Classic days.

Ironically though, the site is no longer just Chinese on a Mac, but includes information on Chinese on Windows, Chinese on Palm Pilot, encodings, free fonts and more. Many mysteries can be resolved here. If only I could find one of these for every script!

URL: http://www.yale.edu/chinesemac/

Notes on Japanese Scripts

I'm not a Japanese expert by any means, but here are some of my notes on what I've discovered about Japanese scripts.

Japanese is an East Asian script, but differs significantly from the Chinese script because it uses three phonetic scripts in addition to the Chinese kanji characters.

Multiple Scripts

The Japanese script is considered one of the most complex because it combines four writing systems in one. Fortunately, three of them are phonetic, but you cannot be considered educated until you can also read Chinese Kanji. The scripts are:

  • Katakana - Based on Chinese, but each symbol is a syllable. Used for foreign words or technical vocabulary.
  • Hiragana - Also based on Chinese, but rounder. Each symbol is also a syllable. Often used for grammatical endings.
  • Rōmaji - Roman (English) alphabet, often mixed in with other scripts in modern Japan
  • Kanji - the set of Chinese characters used in Japanese. However, not all Japanese characters are the same as the characters used for Chinese (hanzi) (Japan Reference)

Phonetic scripts developed in Japan partly as a way to write Japanese grammatical endings (okurigana) not found in Chinese.
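Since all four systems can show up in a single sentence, a quick way to see the mix is to ask Unicode what each character is. The following is only a rough Python sketch: the function name script_of and the labels are my own, and it relies on the prefixes of the official Unicode character names, so it will miss things like fullwidth Latin letters or halfwidth Katakana.

```python
import unicodedata

# Unicode name prefixes paired with the labels used in the list above.
PREFIXES = (
    ("HIRAGANA", "hiragana"),
    ("KATAKANA", "katakana"),
    ("CJK UNIFIED IDEOGRAPH", "kanji"),
    ("LATIN", "romaji"),
)

def script_of(ch):
    """Guess which of the four Japanese writing systems a character belongs to."""
    name = unicodedata.name(ch, "")
    for prefix, label in PREFIXES:
        if name.startswith(prefix):
            return label
    return "other"

# Mixed sample: Latin letters, kanji, hiragana and katakana.
for ch in "JR\u6771\u4eac\u306e\u30bf\u30ef\u30fc":
    print(ch, script_of(ch))
```

Punctuation such as the ideographic full stop falls into "other" here, which is fine for a first look at how the scripts are interleaved.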

Still more

In addition to the forms found on the Web, there are a few more variants

  • Furigana - Kanji Characters with miniature Katakana or Hiragana above or below to show the phonetic pronunciation. Technically
  • Hentaigana - an archaic syllabary found in soba noodle shops, diplomas, invitations and other times when a formal script might be used. Can also refer to a style of Japanese calligraphy.
  • Manyogana - Another syllabary with Chinese Kanji used only for their phonetic value (not their meaning). These were used in ancient poetry.

Information about these additional scripts can be found at these sites:

As of September 2006, neither Hentaigana nor Manyogana blocks had been developed in Unicode, but there may be non-Unicode fonts that could be used.

Computing Set up

If you just want to set up Japanese on your Windows PC or Mac, see the Penn State Japanese Set Up Page.

Picking the Right Cantonese Language Tag

Language codes are important, but in my humble opinion, kind of confusingly implemented. A classic example is Cantonese, the language of Hong Kong, which has three competing language codes.

The codes are the result of the fact that there isn’t a good consensus on whether Cantonese is a language or a dialect. Which one is best? It depends on what you’re doing...

  • zh-HK (ISO-639) - the oldest and safest code to use because software knows what it is
  • zh-yue (IANA) - to tag the script/language as Chinese, but add dialect/language information.
  • yue (ISO-639-3) - to tag content as separate languages (with local dialects). You may need to convert to zh-HK though.

Read below for the gritty details.

Cantonese Language or Dialect?

As most Chinese specialists know, the language to buy fruit in Hong Kong is quite different from the language to buy fruit in Shanghai or Beijing. When my aunt traveled to Beijing, she learned some basic shopping terms, but by the time she got to Shanghai, the tour guide told her to not bother.

Linguists tend to call these separate linguistic forms languages because the ability to understand speech from different regions is low to non-existent. In fact the names are Mandarin (Beijing), Cantonese (Hong Kong) or Wu (Shanghai). If you learn Chinese in the U.S., you are probably learning Mandarin, which is the national standard (even in Taiwan). If you want to do business in Hong Kong though, you need to take a separate Cantonese class.

Speakers from China, on the other hand, call them dialects. They understand that they are very different, but think they are forms of the same master language because they are written in the same script (and they all do descend from a mother Proto-Chinese language spoken centuries ago). As far as the Chinese are concerned, we really have to worry about just one language.

The interesting dilemma is that because Hong Kong was a British colony for so long, Cantonese gained some prominence as the business language of Hong Kong. And apparently there are local quirks to the Hong Kong writing system. So the tech community decided long ago that a separate code was needed. But...what should it be?

zh-HK

The first pass was zh-HK, or Chinese as spoken in the colony of Hong Kong, which was created under the original ISO-639 language code scheme by pairing the language code zh with a country code. At the time of ISO-639, only national dialectal differences were allowed to be recognized. Hong Kong was a British colony, so it had its own country code (HK).

This is the code used by the Microsoft Spell checker for instance; none of the other codes are recognized by Microsoft (even though they are better in some senses). This code will probably exist as long as Unicode does...

The problem is that there was no way to encode the other languages/dialects of China because the regions did not have their own country codes...and sometimes this was necessary.

zh-yue

At some point the language technology groups realized that dialects weren’t restricted to countries, so alternate dialect tags were created, including this one. By the way, yue is the (Mandarin) Chinese form for Cantonese.

All the Chinese forms got dialect tags (even Shanghai or zh-wuu), so it is an improvement. On the other hand it’s still not linguistically accurate (they’re really not dialects). Even worse, few major vendors have implemented these tags. So you can tag your content with a better tag, but the applications may get confused ...

yue

This tag says Cantonese is its own language. And so is Wu of Shanghai (wuu). Awesome! This code is from the latest language tag scheme (ISO-639-3) which was developed more by linguists to reflect linguistic reality.

It’s good for noting script differences (yue-Hant, yue-Latn) or regional Cantonese dialects.

But as with zh-yue, Microsoft and other vendors do not recognize it yet and for all I know, may never recognize it. There’s a good chance your browser may get a little confused if it sees yue instead of zh-HK.

Does that mean the linguists are wasting their time? Probably not. For linguistic database/archive applications, you probably would want to use the more accurate yue tag, especially in keyword metadata.

The trick would be that during PUBLICATION, you might need a utility that also marks your yue content as zh-HK or whatever.
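A minimal Python sketch of what such a utility might look like is below. To be clear, the mapping table and the function name publication_tag are hypothetical and invented for this example; they are not part of any standard or vendor tool.

```python
# Hypothetical fallback table: tags that older software may not know,
# mapped to the legacy tag it is more likely to recognize.
LEGACY_FALLBACK = {
    "yue": "zh-HK",
    "zh-yue": "zh-HK",
}

def publication_tag(tag):
    """Return a legacy-friendly tag for publication, or the tag unchanged."""
    subtags = tag.lower().split("-")
    # Try the longest prefix first so that "zh-yue" is matched before "zh".
    for end in range(len(subtags), 0, -1):
        prefix = "-".join(subtags[:end])
        if prefix in LEGACY_FALLBACK:
            return LEGACY_FALLBACK[prefix]
    return tag

for tag in ("yue", "zh-yue", "yue-Latn", "zh-HK", "fr-CA"):
    print(tag, "->", publication_tag(tag))
```

Note that this throws away any script information (yue-Latn also becomes zh-HK), so the archive copy should keep the original yue tag and only the published copy should get the fallback.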

Stupid? Probably, but it wouldn’t be the first time a Unicode specialist had to account for backwards compatibility.

Other Chinese Codes

Documented at http://tlt.its.psu.edu/suggestions/international/bylanguage/chinese.html#dialect...with much more neutral language.

Vietnamese Support Article

Vietnamese is different from other East Asian languages because it is currently written in the Latin alphabet (same as English). On the other hand it has so many tone marks that it is treated differently from "typical" Western European languages like Spanish and French.
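To see what "so many tone marks" means in practice, here is a short Python sketch using the standard unicodedata module. It breaks a single Vietnamese letter into its base letter plus combining marks and compares it with Spanish ñ. The function name marks_in is my own.

```python
import unicodedata

def marks_in(ch):
    """List the Unicode names of the base letter and combining marks in one character."""
    return [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)]

# U+1EC7: a Vietnamese e carrying both a vowel-quality mark and a tone mark.
print(marks_in("\u1ec7"))
# U+00F1: Spanish n with tilde, a single mark.
print(marks_in("\u00f1"))
```

The Vietnamese letter stacks two marks on one base letter, while the Spanish one carries only one, which is part of why fonts and input utilities need extra care for Vietnamese.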

The following site - http://vietunicode.sourceforge.net/ is an excellent source of Vietnamese information.

FYI - In the past, Vietnamese was written with Chinese-based characters, but this system is rarely used in modern Viet Nam.

RUBY Vertical Text for Japanese? (2007 Update)

Did you want robust vertical text or furigana support on the Web? Well maybe you'll get it some day, but not in early 2007 (unless you go the PDF route).

But check in with the W3C RUBY Annotation Specification page for more details and tests. Currently, CSS3 is scheduled to include RUBY formatting attributes.

CSS3 is also scheduled to include a "writing-mode" attribute for other types of vertical writing, but these must be incorporated into the various browsers and text devices.

FYI - There is a vertical text CSS spec out there but it ONLY works in Internet Explorer 6/7 for Windows, so I don't recommend it. It's documented at the Penn State TLT International Vertical Text page.

But I'm optimistic...some year, the "someday" of vertical text support may be today!

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
