Recently in Language Codes Category

The Language Codes of the Former Yugoslavia

If you want to develop a language code headache, head straight to the former Yugoslavia, once a country whose national language was "Serbo-Croatian" (ISO-639 language code: sh, now deprecated).

In the 1990s, of course, Yugoslavia violently broke up into its constituent ethnic groups, all of whom agreed that Serbo-Croatian had been an artificially imposed literary language.

Today, most agree that "Serbo-Croatian" is really a macrolanguage covering the national forms Croatian (ISO-639 language code hr), Serbian (ISO-639 language code sr), Bosnian (ISO-639 language code bs) and Montenegrin (too new to have a code at the time of writing).

Indeed, if you look at the pages for Croatian Wikipedia (http://hr.wikipedia.org), Serbian Wikipedia (http://sr.wikipedia.org) and Bosnian Wikipedia (http://bs.wikipedia.org), you will see that although words are similar, there are distinct vocabulary differences. You will also see that Serbian Wikipedia is in Cyrillic, unlike Croatian and Bosnian.

Another Wikipedia

Yet there's another Wikipedia - the Serbo-Croatian Wikipedia (http://sh.wikipedia.org/), dually scripted in both Cyrillic and the Latin alphabet. This Wikipedia is yet another related form, similar to Croatian, Serbian and Bosnian yet eerily different. A ghost of a language that has been officially declared dead, but still breathes through living speakers.

All linguists know the standard line "A language is a dialect with an army", but the creation of the Serbo-Croatian Wikipedia shows that governmental language planning is not so easy. Several generations of speakers grew up in "Yugoslavia" and were educated in "Serbo-Croatian". I even learned about Serbo-Croatian syntax from a linguistics professor from Yugoslavia...who said absolutely nothing about regional differences or Serbo-Croatian being an artificial language. As far as I knew in 1990, she was a native speaker of Serbo-Croatian.

I don't know what the fate of these linguistic forms will be. I would doubt that Yugoslavia would reunite anytime soon, and the longer the countries remain separate, the more that speakers will feel they are speaking separate languages. Yet, somewhere out there is a community who still speaks Serbo-Croatian and probably mourns the passing of a nation and its language. How long it can last is an interesting question.

Language Tagging and JAWS: How to return to English?

Disclaimer

I am not seeing other reports of the JAWS quirk reported in this entry. It is based on hearsay from a JAWS user, although one who is fairly tech literate. Hopefully, the point is moot, but since information is so spotty, I am leaving this entry up for now.

Original Article

Unicode and accessibility should be natural partners, but sometimes the tools get a little confused. Take language tagging for instance....

Language tagging identifies the language of a text to search engines, databases and, significantly, screen reader tools used by those with severe visual impairments. The newer screen readers can switch pronunciation dictionaries if they encounter a language tag. Language tagging syntax, as recommended by the W3C for HTML 4, works as follows:

  1. Include the primary language tag for the document in the initial HTML tag. For example, an English document would be tagged as <html lang="en">
  2. Tag any passages in a second language individually. For instance, a paragraph in French would be <p lang="fr"> while a word or phrase would be <span lang="fr">.

The idea though is that once you exit the passage tagged with the second language code, you should assume that the language is back to the primary language. Unfortunately, a comment I heard from a JAWS user was something like "The lang tag works, but developers forget to switch back to English." When I asked him for details, he indicated that an English text with a Spanish word makes the switch in pronunciation engines, but then remains in Spanish mode for the rest of the passage.

What I interpret from this is that the JAWS developers are assuming that there should be a SECOND LANG tag to return the document back to the primary language. So we have two syntax schemes:

What W3C Expects

Text: The French name for "The United States" is Les États Unis, not Le United States.

Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis</i>, not <i>Le United States</i>.</p>

Note that the only LANG tag is the one for French Les États Unis with the assumption that the document contains a <html lang="en"> specification which applies to the entire document.

What JAWS Wants

As I indicated earlier, it appears that if this code is parsed by the JAWS screen reader, it would remain in French mode even after Les États Unis was read. I am not sure what the syntax would be, but I'm guessing something like this:

Code: <p>The French name for "The United States" is <i lang="fr">Les États Unis</i>, <span lang="en">not <i>Le United States</i>.</span></p>

Now there is a second English LANG tag whose domain is the rest of the sentence. I am assuming that JAWS would remain set as English thereafter. In this scenario, I am also guessing that what the JAWS programmers did was to set the switch in pronunciation engines to be triggered ONLY by a language tag - which would explain why it didn't switch back to English in the previous code.

What the W3C is expecting though is that tools should be sensitive to domains of language tags and know to switch back to English when the appropriate end tag is encountered. It's more difficult to program, but it CAN be done.
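To show that scoping really CAN be programmed, here is a minimal Python sketch using the standard library's html.parser. The class name LangScopeParser, the helper language_runs and the "und" (undetermined) default are my own choices for illustration; this is not how JAWS or any other screen reader is actually implemented.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class LangScopeParser(HTMLParser):
    """Collect (language, text) runs, honoring the scope of each lang attribute."""

    def __init__(self, default_lang="und"):
        super().__init__()
        self.stack = [(None, default_lang)]  # (tag name, language in effect)
        self.runs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        # An element inherits the current language unless it declares its own.
        lang = dict(attrs).get("lang") or self.stack[-1][1]
        self.stack.append((tag, lang))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; the parent's language applies again.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1][1], data.strip()))


def language_runs(html):
    parser = LangScopeParser()
    parser.feed(html)
    parser.close()
    return parser.runs


sample = ('<html lang="en"><p>The French name is '
          '<i lang="fr">Les États Unis</i>, not <i>Le United States</i>.</p></html>')
for lang, text in language_runs(sample):
    print(lang, repr(text))
```

The only run reported as "fr" is Les États Unis; everything after the closing </i> falls back to "en" without needing a second LANG tag.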

The Coding Dilemma

So here's the coding dilemma developers face: Do they code to the declared and accepted W3C standard or do they code for JAWS? Of course, the JAWS community would like developers to code for JAWS (after all the person I was speaking with was convinced the problem was developer cluelessness, not bad JAWS standards implementation).

The problem is that this approach perpetuates the bloated code that standards were supposed to streamline. Essentially, you are coding for a specific Web browser, just like those developers who only code for Internet Explorer. It's an appealing short-term solution, but counter-productive in the long run. This is why even WebAIM (the Web accessibility group at Utah State) recommends NOT coding for the quirks in JAWS or other user agents.

Besides, we can always hope this quirk will be fixed in a future release of JAWS.

Did I Mention Unicode Above 255?

I've also heard rumors that JAWS may read some Unicode characters above 255 as just the Unicode code point. Thus ∀ ("for all" or the upside-down A symbol) might be read as "2200" or "U+2200". There are special .sbl symbol files you can install in JAWS, but it would be nice if the process were a little more transparent. I feel it's the equivalent of Apple or Microsoft not providing any default fonts for non-Western European languages...
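For comparison, the Unicode character database already carries a readable name for each of these symbols. This short Python sketch (the function name describe is just for illustration) prints both the bare code point and the official character name; I am not suggesting this is what JAWS or its .sbl files do internally.

```python
import unicodedata


def describe(ch):
    """Return the code point and the Unicode character name for one character."""
    code_point = f"U+{ord(ch):04X}"
    # Without a default, unicodedata.name raises ValueError for unnamed characters.
    name = unicodedata.name(ch, "(no name)")
    return code_point, name


print(describe("∀"))  # ('U+2200', 'FOR ALL')
```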

Sensible Language Tagging Advice from Unicode

As I have written before, the language tagging architecture is a little confusing. First, there are successive standards including ISO 639, ISO-639-2, ISO-639-3 and others. In addition, there are multiple ways to tag languages, especially languages like "Chinese" and "Arabic" plus a legacy combination of 2-letter and 3-letter codes.

Spoken vs Written Language

The reason for much of this confusion is that language coding changes depending on whether you are focusing on written language (like Unicode and major vendors do) or spoken language (as linguists or film historians might), but few sources recognize it. However the CLDR does mention it. Specifically:

The Ethnologue [the online language encyclopedia (which maintains ISO-639-3)] focuses on native, spoken languages, whereas CLDR and many other systems are focused on written language, for computer UI and document translation, and for fluent speakers (not necessarily native speakers).

In other words, there are lots of spoken forms in the world which are not used in written form. In the United States for instance, everyone is taught standard (or "proper") written English even if they actually speak AAVE (African American Vernacular English), Boston/New York English or Appalachian English at home. Similarly, no spell checkers recognize subtle pronunciation differences between the English of California, Minnesota or the two East/West halves of Pennsylvania.

As far as most of the world is concerned (including the Microsoft Office spell checker and Amazon.com), there is only one U.S. English (en-US) and only one English for Britain (en-GB)...even though England, Scotland and Wales have even more variation in spoken forms - enough so that Ozzy Osbourne's local dialect is difficult for American ears to parse.

The more interesting cases are macrolanguages like Arabic or Chinese - languages with cultural unity but linguistic diversity. Here the CLDR recommends the macrolanguage code. Their advice, again, is to assume that the macrolanguage is THE language code:

For a number of reasons, Unicode language and locale identifiers always use the Macrolanguage for the predominant form. Thus the Macrolanguage code "zh" (Chinese) is used instead of "cmn" (Mandarin)...It would be a mistake to look at http://www.ethnologue.com/show_country.asp?name=EG and conclude that the right language code for the Arabic used in Egypt was "arz", which has the largest population. Instead, the right code is "ar", Standard Arabic, which would be the one used for document and UI translation.

Let's examine both the Arabic and Chinese case and see how it works.

Arabic

First, modern Arabic scholars distinguish written Modern Standard Arabic (MSA), which most educated speakers are familiar with, from the different forms of Colloquial Arabic, which are what is spoken at home. The Colloquial forms are different enough to be assigned different language codes in ISO-639-3, but in fact these are rarely written - usually only MSA is written (or used in formal speeches).

If you are working on or preparing an Arabic document, chances are that it will be in MSA with maybe a few national quirks (e.g. ar-EG may apply in some cases for an MSA document from Egypt).

Chinese

Chinese, like Arabic, is really a macrolanguage with many spoken varieties which are not always understood across the country. However, recent governments, with their capitals in Beijing, have promoted a national variety based on Northern Chinese as the national language. Again, most documents from the PRC or Taiwan will be in Mandarin Chinese...so in effect Chinese (zh) = Mandarin (cmn) in most situations.

Ironically though, Mandarin needs multiple codes because there are now multiple ways to write this language - the old Traditional Hanzi system (Taiwan), the Simplified characters (China), Pinyin romanization and the older Wade-Giles. Because language tagging is really focused on written language, there are multiple variant tags for Chinese in different scripts (e.g. zh-Hant = Traditional Chinese, zh-Hans = Simplified).

When to use "cmn" for Mandarin

Are there situations when "cmn" for Mandarin Chinese might be appropriate? I would say yes...if you are researching or documenting spoken forms in modern China. For instance, a linguist may be doing field work to document spoken forms from across China.

At the spoken level, even Mandarin (i.e. Northern forms) has dialectal features and it may also be important to compare historical developments between Mandarin and other forms such as Cantonese (yue), Wu (wuu) and Hakka (hak). In that case, I would recommend using the ISO-639-3 language codes to tag everything. That will ensure everything is the same format and will probably facilitate searching down the line. Others might recommend using the macrolanguage code plus the ISO-639-3 language code (so that Mandarin is zh-cmn and Cantonese is zh-yue).

As you can see, the CLDR advice is a good primer on how to tag. Most documents can be tagged with a simple system defined in ISO-639-2, but documents being tagged by linguists may need the larger set of ISO-639-3 tags. It really clarifies a lot of ambiguity about how to tag.

Tagging Language Variations

A final issue is how to tag language variations which can include changes in script, changes in spelling convention or spoken variation. Although many common variants are registered, there are always more to be added.

Following the advice in the CLDR though I would only pursue registration of tags for written variations. This recommendation will likely be controversial, but is actually consistent with common practice and most user needs. For instance, it does make sense for Microsoft to support spell checkers for en-US vs en-GB or other national varieties of English. Similarly everyone needs to support both Simplified and Traditional Chinese.

But will a spell checker or grammar checker ever be programmed for something like Appalachian English? Not anytime soon. For one thing, there probably is NO "standard Appalachian grammar" - just a series of field work studies and observations with LOTS of individual variation. In fact, one of the great challenges for establishing any written standard is getting agreement on how to handle variations across small distances.

Another concern of mine in registering spoken variants is that I am not seeing a systematic pattern of registration of spoken language variations. For instance, dialectologists for American English recognize different regions in the U.S. (e.g. Mid-Atlantic, Midwest, the South, California/West, New England, New York etc.), which can be further subdivided into more distinct communities (e.g. Queens vs. Brooklyn vs. Long Island). And this ignores the reality that a city can have speakers of unrelated dialects (e.g. AAVE, Spanish-influenced English and other world Englishes).

In theory a registration of dialects should be fairly systematic (e.g. en-US-NYC-longisland), but that is NOT what I am seeing. It's very difficult to know how to tag except on an ad hoc basis. And once a tag is registered, it remains there forever, even if a "deprecated" note is added. I'm not sure the current system is really beneficial, since it is just replicating an ad hoc approach that is not necessarily helpful for the field of dialectology.

On the plus side, I think the system works well for written variations - we even have standard tags for scripts to attach to a language tag. If Spanish is ever written in Cyrillic, I will know to tag it "es-Cyrl".

Language Tag "mo" for Moldovan Deprecated

As of November 3, 2008, both the ISO-639 language code mo (Moldovan) and the ISO-639-2 code mol (Moldovan) were deprecated in favor of Romanian.

In other words, the standards authorities have embodied the notion that Moldovan, as spoken in the Republic of Moldova, is so closely related to Romanian that the two are dialects of the same language. This has been the stance of the linguistic community and of many in both the Romanian and Moldovan communities.

From now on, the code ro (Romanian) will refer to the language forms used in both Romania and Moldova. The tags to distinguish the linguistic forms of Romania from those of Moldova will be ro-RO (Romanian of Romania) and ro-MD (Romanian of Moldova).
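If you have older content tagged with mo, the swap is easy to script. Here is a minimal Python sketch; the REPLACEMENTS table and the modernize function are hypothetical names, and the table lists only the Moldovan case from this entry (a real one would be built from the IANA Language Subtag Registry).

```python
# Deprecated primary language subtags and their replacements.
# Only the case discussed in this entry is listed.
REPLACEMENTS = {"mo": "ro"}


def modernize(tag):
    """Swap a deprecated primary language subtag for its replacement."""
    language, sep, rest = tag.partition("-")
    language = REPLACEMENTS.get(language.lower(), language)
    return language + sep + rest


print(modernize("mo"))     # ro
print(modernize("mo-MD"))  # ro-MD
print(modernize("ro-RO"))  # ro-RO
```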

This may seem to be a trivial change, but it's heartening from my point of view. In recent years, there had been a trend in language code assignments to favor political expedience over linguistic reality.

The most similar case was the elimination of sh for Serbo-Croatian, as spoken in the former Yugoslavia, in favor of three "separate" language codes for Serbian (sr), Croatian (hr) and Bosnian (bs). Although there are genuine regional differences between the forms (especially for Croatian), linguists still debate whether these forms are separate languages or dialects.

Although I do not expect the three codes for Serbian, Croatian and Bosnian to be eliminated anytime soon, I do think it's a good sign that speakers in Moldova and Romania were willing to re-evaluate their linguistic identity.

Some Recent Language Tagging News (incl Pinyin/Wade-Giles)

Codes for language varieties are constantly being updated, but here is a list of some important changes that have happened in recent months.

The most up-to-date list is available at:
http://www.iana.org/assignments/language-subtag-registry

Chinese Romanizations

  • zh-Latn-pinyin for Pinyin Latin romanization (Mandarin)
  • zh-Latn-wadegile for Wade-Giles romanization (Mandarin)

Note that here the assumption is that zh is Mandarin Chinese. From the discussion it appears that more precise codes for Mandarin could not be used because they had not been fully-approved (sigh). If you are working with a "dialect", you may need to include an appropriate dialect/language extension.

Cornish Spelling

It's hard to believe that a language just being revived already has multiple competing spelling systems, but that's how it goes sometimes. The codes are:

  • kw for Cornish
  • kw-kkcor for Cornish, Common Cornish orthography
  • kw-uccor for Cornish, Unified Cornish orthography
  • kw-ucrcor for Cornish, Unified Cornish Revised orthography

Valencian

Valencian (Spain) is considered to be a regional dialect of Catalan, with the code ca-valencia.

Belarusian, 1959 spelling

The code be-1959acad is for the "Academic" ("governmental") variant of Belarusian as codified in 1959.
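All of the tags in this entry follow the same shape: a language subtag, then optionally an extended language, a script, a region and one or more variants. As a rough illustration, here is a Python sketch that splits a tag into those pieces by looking only at the length and form of each subtag. The function parse_tag is a hypothetical helper, it does not check anything against the registry, and it ignores extensions and private-use subtags.

```python
import re


def parse_tag(tag):
    """Split a language tag into language, extlang, script, region and variants."""
    subtags = tag.split("-")
    parsed = {"language": subtags[0].lower(), "extlang": None,
              "script": None, "region": None, "variants": []}
    for sub in subtags[1:]:
        later = parsed["script"] or parsed["region"] or parsed["variants"]
        if re.fullmatch(r"[A-Za-z]{3}", sub) and not parsed["extlang"] and not later:
            parsed["extlang"] = sub.lower()      # e.g. yue in zh-yue
        elif (re.fullmatch(r"[A-Za-z]{4}", sub) and not parsed["script"]
              and not parsed["region"] and not parsed["variants"]):
            parsed["script"] = sub.title()       # e.g. Latn, Hant
        elif (re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub)
              and not parsed["region"] and not parsed["variants"]):
            parsed["region"] = sub.upper()       # e.g. HK, US
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", sub):
            parsed["variants"].append(sub.lower())  # e.g. pinyin, 1959acad
        else:
            raise ValueError(f"unexpected subtag {sub!r} in {tag!r}")
    return parsed


for tag in ("zh-Latn-pinyin", "kw-ucrcor", "ca-valencia", "be-1959acad"):
    print(tag, parse_tag(tag))
```

Because the check is purely about shape, a well-formed but unregistered subtag will pass, so this is a first filter rather than a validator.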

Language Codes: Dialect vs. Macrolanguage

A while ago, I was writing about the difficulty of defining some language tags like Cantonese because even though it's called a dialect, it's really a separate language.

The SIL group is using a new term I think should become more common - the macrolanguage. A macrolanguage is basically a set of related languages that share a common "identity" even though speakers can't normally understand each other.

Macrolanguages happen when a language spreads to different regions and changes, but the cultural or political unity remains. Other macrolanguages include Arabic, Cree, Hmong, Quechua (as spoken in the Incan Empire), and Norwegian. I suspect that you could throw in some other candidates like German and Italian (we'd have more if the Roman Empire had made it to the 21st century).

In any case, the ISO-639-3 language tag standard has a set of macrolanguage mappings which show how different related languages can map to each other, so that either Mandarin Chinese (cmn) or Cantonese (yue) can also be called Chinese (zh or zho).
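In code, a macrolanguage mapping is just a lookup table. The Python sketch below covers only the handful of codes mentioned on this blog; the names MACROLANGUAGE and to_macrolanguage are mine, and the full set of mappings would have to come from the ISO-639-3 tables themselves.

```python
# Individual ISO-639-3 codes mapped to the two-letter code of their macrolanguage.
# This is a partial table for illustration only.
MACROLANGUAGE = {
    "cmn": "zh",  # Mandarin
    "yue": "zh",  # Cantonese
    "wuu": "zh",  # Wu
    "hak": "zh",  # Hakka
    "arz": "ar",  # Egyptian Arabic
}


def to_macrolanguage(code):
    """Return the macrolanguage code, or the code itself if none is listed."""
    code = code.lower()
    return MACROLANGUAGE.get(code, code)


print(to_macrolanguage("yue"))  # zh
print(to_macrolanguage("arz"))  # ar
print(to_macrolanguage("fr"))   # fr
```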

I really hope this term takes hold...because I really think it will simplify other discussions about language tags. After all, it was just this year that a language technology guru claimed that English had no "true dialects." I think he meant to say that English hasn't reached macrolanguage status yet.

ISO-639-3 Language Code Changes

In a post about Cantonese Language tags, I mentioned ISO-639-3 language codes. This is a new series of codes developed by the linguistic organization SIL which attempts to cover a broader spectrum of languages than had been named in previous registries.

Although I recommend these codes for anyone working with linguistic information, it should be noted that they are being revised. The latest set of changes is announced on the ISO-639-3 home page. You should check these pages when determining which codes to use.

Picking the Right Cantonese Language Tag

Language codes are important, but in my humble opinion, kind of confusingly implemented. A classic example is Cantonese, the language of Hong Kong, which has three competing language codes.

The codes are the result of the fact that there isn’t a good consensus on whether Cantonese is a language or a dialect. Which one is best? It depends on what you’re doing...

  • zh-HK (ISO-639) - the oldest and safest code to use because software knows what it is
  • zh-yue (IANA) - to tag the script/language as Chinese, but add dialect/language information.
  • yue (ISO-639-3) - to tag content as separate languages (with local dialects). You may need to convert to zh-HK though.

Read below for the gritty details.

Cantonese Language or Dialect?

As most Chinese specialists know, the language to buy fruit in Hong Kong is quite different from the language to buy fruit in Shanghai or Beijing. When my aunt traveled to Beijing, she learned some basic shopping terms, but by the time they got to Shanghai, the tour guide told her to not bother.

Linguists tend to call these separate linguistic forms languages because the ability to understand speech from different regions is low to non-existent. In fact the names are Mandarin (Beijing), Cantonese (Hong Kong) and Wu (Shanghai). If you learn Chinese in the U.S., you are probably learning Mandarin, which is the national standard (even in Taiwan). If you want to do business in Hong Kong though, you need to take a separate Cantonese class.

Speakers from China, on the other hand, call them dialects. They understand that they are very different, but think they are forms of the same master language because they are written in the same script (and they all do descend from a mother Proto-Chinese language spoken centuries ago). As far as the Chinese are concerned, we really have to worry about just one language only.

The interesting dilemma is that because Hong Kong was a British colony for so long, Cantonese gained some prominence as the business language of Hong Kong. And apparently there are local quirks to the Hong Kong writing system. So the tech community decided long ago that a separate code was needed. But...what should it be?

zh-HK

The first pass was zh-HK or Chinese as spoken in the colony of Hong Kong which was created under the original ISO-639 language code scheme. At the time of ISO-639, only national dialectal differences were allowed to be recognized. Hong Kong was a British colony so had its own country code.

This is the code used by the Microsoft Spell checker for instance; none of the other codes are recognized by Microsoft (even though they are better in some senses). This code will probably exist as long as Unicode does...

The problem is that there was no way to encode the other languages/dialects of China because the regions did not have their own country codes...and sometimes this was necessary.

zh-yue

At some point the language technology groups realized that dialects weren’t restricted to countries, so alternate dialect tags were created including this one. By the way yue is the (Mandarin) Chinese form for Cantonese.

All the Chinese forms got dialect tags (even Shanghai or zh-wuu), so it is an improvement. On the other hand it’s still not linguistically accurate (they’re really not dialects). Even worse, few major vendors have implemented these tags. So you can tag your content with a better tag, but the applications may get confused ...

yue

This tag says Cantonese is its own language. And so is Wu of Shanghai (wuu). Awesome! This code is from the latest language tag scheme (ISO-639-3) which was developed more by linguists to reflect linguistic reality.

It’s good for noting script differences (yue-Hant, yue-Latn) or regional Cantonese dialects.

But as with zh-yue, Microsoft and other vendors do not recognize it yet and for all I know, may never recognize it. There’s a good chance your browser may get a little confused if it sees yue instead of zh-HK.

Does that mean the linguists are wasting their time? Probably not. For linguistic database/archive applications, you probably would want to use the more accurate yue tag, especially in keyword metadata.

The trick would be that during PUBLICATION, you might need a utility that also marks your yue content as zh-HK or whatever.

Stupid? Probably, but it wouldn’t be the first time a Unicode specialist had to account for backwards compatibility.
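The publication utility mentioned above could be as simple as the following Python sketch. The PUBLISH_FALLBACK table and the function name are hypothetical, and the regular expression only handles double-quoted lang attributes; treat it as an illustration of the idea rather than a robust HTML rewriter.

```python
import re

# Tags to swap at publication time, and what to swap them for.
# The choice of zh-HK as the fallback follows the suggestion in this entry.
PUBLISH_FALLBACK = {"yue": "zh-HK", "zh-yue": "zh-HK"}

# Matches lang="..." (and xml:lang="...") but not hreflang="...".
LANG_ATTR = re.compile(r'(\blang=")([^"]*)(")')


def downgrade_lang_attributes(html):
    """Rewrite lang attributes to the older tags that vendors recognize."""
    def swap(match):
        tag = match.group(2)
        return match.group(1) + PUBLISH_FALLBACK.get(tag.lower(), tag) + match.group(3)
    return LANG_ATTR.sub(swap, html)


print(downgrade_lang_attributes('<p lang="yue">...</p>'))
# <p lang="zh-HK">...</p>
```

The archive copy keeps the more accurate yue tag; only the published copy is downgraded.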

Other Chinese Codes

Documented at http://tlt.its.psu.edu/suggestions/international/bylanguage/chinese.html#dialect...with much more neutral language.

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
