Recently in Language Codes Category

Language Codes: Dialect vs. Macrolanguage

A while ago, I wrote about the difficulty of choosing a language tag for a variety like Cantonese: even though it's called a dialect, it's really a separate language.

The SIL group is using a new term I think should become more common - the macrolanguage. A macrolanguage is basically a set of related languages that share a common "identity" even though speakers of one normally can't understand speakers of another.

Macrolanguages happen when a language spreads to different regions and changes, but the cultural or political unity remains. Besides Chinese, recognized macrolanguages include Arabic, Cree, Hmong, Quechua (as spoken in the Incan Empire), and Norwegian. I suspect that you could throw in some other candidates like German and Italian (and we'd have more if the Roman Empire had made it to the 21st century).

In any case, the ISO-639-3 language tag standard has a set of macrolanguage mappings which show how related languages map to a shared code, so that either Mandarin Chinese (cmn) or Cantonese (yue) can also be called Chinese (zh or zho).
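As a rough illustration of what such a mapping looks like in practice, here is a minimal Python sketch. The table holds only the two codes mentioned above, and the names `MACROLANGUAGE_OF`, `macrolanguage_for`, and `same_macrolanguage` are made up for this example; the real ISO-639-3 macrolanguage table is much larger.

```python
# Illustrative subset only: individual language code -> macrolanguage code.
# The full ISO-639-3 macrolanguage table contains many more entries.
MACROLANGUAGE_OF = {
    "cmn": "zho",  # Mandarin Chinese -> Chinese
    "yue": "zho",  # Cantonese -> Chinese
}


def macrolanguage_for(code):
    """Return the macrolanguage code, or the code itself if it is not in the table."""
    return MACROLANGUAGE_OF.get(code, code)


def same_macrolanguage(code_a, code_b):
    """True if both codes resolve to the same macrolanguage (or are the same code)."""
    return macrolanguage_for(code_a) == macrolanguage_for(code_b)
```

With this table, `macrolanguage_for("yue")` gives `"zho"`, and `same_macrolanguage("cmn", "yue")` is true even though the two are separate languages.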

I really hope this term takes hold, because I think it will simplify other discussions about language tags. After all, it was just this year that a language technology guru claimed that English had no "true dialects." I think he meant to say that English hasn't reached macrolanguage status yet.

ISO-639-3 Language Code Changes

In a post about Cantonese Language tags, I mentioned ISO-639-3 language codes. This is a new series of codes developed by the linguistic organization SIL which attempts to cover a broader spectrum of languages than had been named in previous registries.

Although I recommend these codes for anyone working with linguistic information, it should be noted that they are being revised. The latest set of changes is announced on the ISO-639-3 home page. You should check that page when determining which codes to use.

Picking the Right Cantonese Language Tag

Language codes are important, but in my humble opinion, kind of confusingly implemented. A classic example is Cantonese, the language of Hong Kong, which has three competing language codes.

The competing codes exist because there isn’t a good consensus on whether Cantonese is a language or a dialect. Which one is best? It depends on what you’re doing...

  • zh-HK (ISO-639) - the oldest and safest code to use because software knows what it is
  • zh-yue (IANA) - to tag the script/language as Chinese, but add dialect/language information.
  • yue (ISO-639-3) - to tag Cantonese as a separate language (with its own local dialects). You may need to convert to zh-HK though.

Read below for the gritty details.

Cantonese Language or Dialect?

As most Chinese specialists know, the language you need to buy fruit in Hong Kong is quite different from the language you need to buy fruit in Shanghai or Beijing. When my aunt traveled to Beijing, she learned some basic shopping terms, but by the time her tour got to Shanghai, the tour guide told her not to bother.

Linguists tend to call these separate linguistic forms languages because the ability to understand speech from different regions is low to non-existent. In fact, they have their own names: Mandarin (Beijing), Cantonese (Hong Kong), and Wu (Shanghai). If you learn Chinese in the U.S., you are probably learning Mandarin, which is the national standard (even in Taiwan). If you want to do business in Hong Kong though, you need to take a separate Cantonese class.

Speakers from China, on the other hand, call them dialects. They understand that they are very different, but think of them as forms of the same master language because they are written in the same script (and they all do descend from a mother Proto-Chinese language spoken centuries ago). As far as the Chinese are concerned, we really have to worry about just one language.

The interesting dilemma is that because Hong Kong was a British colony for so long, Cantonese gained some prominence as the business language of Hong Kong. And apparently there are local quirks to the Hong Kong writing system. So the tech community decided long ago that a separate code was needed. But...what should it be?

zh-HK

The first pass was zh-HK, or Chinese as spoken in the colony of Hong Kong, a tag created under the original ISO-639 language code scheme. At the time of ISO-639, only national dialectal differences could be recognized. Hong Kong was a British colony, so it had its own country code.

This is the code used by the Microsoft Spell checker for instance; none of the other codes are recognized by Microsoft (even though they are better in some senses). This code will probably exist as long as Unicode does...

The problem is that there was no way to encode the other languages/dialects of China because the regions did not have their own country codes...and sometimes this was necessary.

zh-yue

At some point the language technology groups realized that dialects weren’t restricted to countries, so alternate dialect tags were created, including this one. By the way, yue comes from the (Mandarin) Chinese name for Cantonese.

All the Chinese forms got dialect tags (even Wu of Shanghai, or zh-wuu), so it is an improvement. On the other hand, it’s still not linguistically accurate (they’re really not dialects). Even worse, few major vendors have implemented these tags. So you can tag your content with a better tag, but the applications may get confused ...

yue

This tag says Cantonese is its own language. And so is Wu of Shanghai (wuu). Awesome! This code is from the latest language tag scheme (ISO-639-3) which was developed more by linguists to reflect linguistic reality.

It’s good for noting script differences (yue-Hant for traditional characters, yue-Latn for romanized text) or regional Cantonese dialects.
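To make the pieces of a tag like yue-Latn concrete, here is a simplified Python sketch that splits a tag into its parts. It assumes the usual convention that a three-letter subtag after the first is an extended language (as in zh-yue), a four-letter subtag is a script, and a two-letter subtag is a region. The function name `split_tag` is made up for this example, and the sketch ignores variants, extensions, and private-use subtags, so it is not a full parser.

```python
def split_tag(tag):
    """Split a simple language tag into language, extlang, script, and region.

    Simplified sketch: subtags are classified by length only; variants,
    extensions, and private-use subtags are ignored.
    """
    parts = tag.split("-")
    result = {
        "language": parts[0].lower(),
        "extlang": None,
        "script": None,
        "region": None,
    }
    for part in parts[1:]:
        if len(part) == 3 and part.isalpha() and result["extlang"] is None:
            result["extlang"] = part.lower()   # e.g. the "yue" in zh-yue
        elif len(part) == 4 and part.isalpha():
            result["script"] = part.title()    # e.g. "Latn"
        elif len(part) == 2 and part.isalpha():
            result["region"] = part.upper()    # e.g. "HK"
    return result
```

For example, `split_tag("yue-Latn")` reports the language as `yue` and the script as `Latn`, while `split_tag("zh-HK")` reports the language as `zh` and the region as `HK`.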

But as with zh-yue, Microsoft and other vendors do not recognize it yet and for all I know, may never recognize it. There’s a good chance your browser may get a little confused if it sees yue instead of zh-HK.

Does that mean the linguists are wasting their time? Probably not. For linguistic database/archive applications, you probably would want to use the more accurate yue tag, especially in keyword metadata.

The trick would be that during PUBLICATION, you might need a utility that also marks your yue content as zh-HK or whatever.
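A publication step like that could be as small as a lookup table. The Python sketch below is only an illustration: the names `LEGACY_FALLBACK` and `publication_tag` are invented here, and mapping every yue tag to zh-HK assumes the content really is Hong Kong Cantonese, which you would need to check for your own archive.

```python
# Hypothetical fallback table: precise tag (lowercased) -> older tag
# that software is more likely to recognize.
LEGACY_FALLBACK = {
    "yue": "zh-HK",
    "zh-yue": "zh-HK",
}


def publication_tag(archive_tag):
    """Return the tag to publish with; tags without a listed fallback pass through unchanged."""
    return LEGACY_FALLBACK.get(archive_tag.lower(), archive_tag)


def prepare_for_publication(record):
    """Keep the precise archive tag and add a separate tag for publication."""
    published = dict(record)
    published["publish_lang"] = publication_tag(record["lang"])
    return published
```

The archive record keeps its accurate `lang` value (yue), and only the published copy carries the older zh-HK tag, so no linguistic information is lost in the database itself.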

Stupid? Probably, but it wouldn’t be the first time a Unicode specialist had to account for backwards compatibility.

Other Chinese Codes

Documented at http://tlt.its.psu.edu/suggestions/international/bylanguage/chinese.html#dialect...with much more neutral language.