Sensible Language Tagging Advice from Unicode

|

As I have written before, the language tagging architecture is a little confusing. First, there are successive standards including ISO 639, ISO-639-2, ISO-639-3 and others. In addition, there are multiple ways to tag languages, especially languages like "Chinese" and "Arabic" plus a legacy combination of 2-letter and 3-letter codes.

Spoken vs Written Language

The reason for much of this confusion is that language coding changes depending on whether you are focusing on written language (like Unicode and major vendors do) or spoken language (as linguists or film historians might), but few sources recognize it. However the CLDR does mention it. Specifically:

The Ethnologue [the online language enyclopedia (which maintains ISO-639-3)] focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and for fluent speakers (not necessarily native speakers).

In other words, there are lots of spoken forms in the world which are not used in written form. In the United States for instance, everyone is taught standard (or "proper") written English even if they actually speak AAVE (African American Vernacular English), Boston/New York English or Appalachian English at home. Similarly, no spell checkers recognize subtle pronunciation differences between the English of California, Minnesota or the two East/West halves of Pennsylvania.

As far as most of the world (including the Microsoft Office spell checker and Amazon.com) there is only one U.S. English (en-us), and only one English for Britain as well (en-GB)...even though England, Scotland and Wales have even more variation in spoken forms - enough so that Ozzy Osbourne's local dialect is difficult for American ears to parse.

The more inreresting case are macrolanguages like Arabic or Chinese - which are languages with cultural unity but linguistic diversity. However the CLDR recommends the macro language code. Their advice again is to assume that the macro language is THE language code:

For a number of reasons, Unicode language and locale identifiers always use the Macrolanguage for the predominent form. Thus the Macrolanguage code "zh" (Chinese) is used instead of "cmn" (Mandarin)...It would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language code for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation.

Let's examine both the Arabic and Chinese case and see how it works.

Arabic

First modern Arabic scholars distinguish written Modern Standard Arabic (MSA) which most educated speakers are familiar with from different forms of Colloqiual Arabic which what is spoken at home. The Colloquial forms are different enough to be assigned different language codes in ISO-639-3, but in fact these are rarely written - only MSA is usually written (or used in formal speeches).

If you are working or preparing an Arabic document, chances are that it will be in MSA with maybe a few national quirks (i.e. ar-EG may apply in some cases for an MSA document from Egypt).

Chinese

Chinese, like Arabic is really a macrolanguage with many spoken varieties which are not always understood across the country. However recent governments, with their capitals in Beijing, have promoted a national variety based on Northern Chinese as the national language. Again, most documents from the PRC or Taiwan will be in Mandarin Chinese...so in effect Chinese (zh) = Mandarin (cmn) in most situations.

Ironically though, Mandarin needs multiple codes because there are now multple ways to write this language - the old Traditional Hanzi system (Taiwan), the Simplified characters (China), Pinyin romanization and the older Wade-Giles. Because language tagging is really focused on written language, there are multiple variant tags for Chinese in different scripts (e.g. zh-Hant = Tradtitional Chinese, zh-Hans = Simplified).

When to use "cmn" for Mandarin

Are there situations when "cmn" for Mandarin Chinese might be appropriate? I would say yes...if you are researching or documenting spoken forms in modern China. For instance, a linguist may be doing field work to document spoken forms from across China.

At the spoken level, even Mandarin (i.e. Northern forms) has dialectal features and it may also be important to compare historical developments between Mandarin and other forms such as Cantonese (yue), Wu (wuu) and Hakka (hak). In that case, I would recommend using the ISO-639-3 language codes to tag everything. That will ensure everything is the same format and will probably facilitate searching down the line. Others might recommend using the macrolanguage code plus the ISO-639-3 language code (so that Mandarin is zh-cmn and Cantonese is zh-yue).

As you can see the CLDR advice is a good primer on how to tag. Most documents can be tagged with a simple system defined in ISO-639-2, but documents being tagged by linguists may need the larger set of ISO-639-3 tags. It really clarifies a lot of ambiguity with how to tag

Tagging Language Variations

A final issue is how to tag language variations which can include changes in script, changes in spelling convention or spoken variation. Although many common variants are registered, there are always more to be added.

Following the advice in the CLDR though I would only pursue registration of tags for written variations. This recommendation will likely be controversial, but is actually consistent with common practice and most user needs. For instance, it does make sense for Microsoft to support spell checkers for en-US vs en-GB or other national varieties of English. Similarly everyone needs to support both Simplified and Traditional Chinese.

But will a spell checker or grammar checker ever be programmed for something like Appalachian English? Not anytime soon. For one thing, there probably is NO "standard Appalachian grammar" - just a series of field work studies and observations with LOTS of individual variation. In fact, one of the great challenges for establishing any written standard is getting agreement on how to handle variations across small distances.

Another concern of mine in registering spoken variants is that I am not seeing a systematic pattern of registration of spoken language variations. For instance, dialectologists for American English recognize different regions in the U.S. (e.g. Mid-Atlantic, Mid West, the South California/West, New England, New York etc), which can be further subdivided into more distinct communities (e.g. Queens vs. Brooklyn vs Long Island). This is actually ignoring the reality that a city can have speakers from unrelated dialects (e.g. AAVE, Spanish-influenced English and other world Englishes).

In theory a registration of dialects should be fairly systematic (e.g. en-US-NYC-longisland), but that is NOT what I am seeing. It's very difficult to know how to tag except on an ad hoc basis. And once a tag is registered, it remains there forever, even if a "deprecated" note is added. I'm not sure the current system is really beneficial, since it is just replicating an ad hoc approach that is not necessarily helpful for the field of dialectology.

On the plus side, I think the system works well for written variations - we even have standards tags for scripts to attach to a language tag. If Spanish is ever written in Cyrillic, I will know to tag it "es-Cyrl."

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).

Powered by Movable Type Pro

Recent Comments