April 2008 Archives

Language Codes: Dialect vs. Macrolanguage

A while ago, I wrote about the difficulty of defining language tags for varieties like Cantonese: even though it's called a dialect, it's really a separate language.

The SIL group is using a term that I think should become more common: the macrolanguage. A macrolanguage is basically a set of related languages that share a common "identity" even though speakers of the different varieties can't normally understand each other.

Macrolanguages happen when a language spreads to different regions and changes, but the cultural or political unity remains. Other macrolanguages include Arabic, Cree, Hmong, Quechua (as spoken in the Incan Empire), and Norwegian. I suspect that you could throw in some other candidates like German and Italian (and we'd have more if the Roman Empire had made it to the 21st century).

In any case, the ISO 639-3 language tag standard has a set of macrolanguage mappings which show how related languages map to a shared macrolanguage, so that either Mandarin Chinese (cmn) or Cantonese (yue) can also be called Chinese (zh or zho).
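
As a rough sketch of what such a mapping looks like in practice, here is a tiny Python lookup. The table holds only the two individual-language codes mentioned above, and the names MACROLANGUAGE_OF, TWO_LETTER, macrolanguage and same_macrolanguage are made up for this example; the complete mapping table is the one published with ISO 639-3.

```python
# Hand-written excerpt for illustration only, not the full ISO 639-3 table.
MACROLANGUAGE_OF = {
    "cmn": "zho",  # Mandarin Chinese -> Chinese
    "yue": "zho",  # Cantonese -> Chinese
}

# Two-letter equivalents of three-letter macrolanguage codes.
TWO_LETTER = {"zho": "zh"}


def macrolanguage(code):
    """Return the macrolanguage code for a language, or the code itself if unmapped."""
    return MACROLANGUAGE_OF.get(code, code)


def same_macrolanguage(a, b):
    """True if both codes fall under the same macrolanguage."""
    return macrolanguage(a) == macrolanguage(b)


if __name__ == "__main__":
    for code in ("cmn", "yue"):
        macro = macrolanguage(code)
        print(code, "->", macro, "or", TWO_LETTER.get(macro, macro))
    print(same_macrolanguage("cmn", "yue"))
```

A lookup like this lets software treat text tagged cmn or yue as Chinese when only the broader tag is supported, without pretending the two are the same language.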

I really hope this term takes hold, because I think it will simplify other discussions about language tags. After all, it was just this year that a language technology guru claimed that English had no "true dialects." I think he meant to say that English hasn't reached macrolanguage status yet.

What's New in Unicode 5.1?

Unicode version 5.1 was recently released, and it includes some new code blocks as well as new specifications. As with all new versions of Unicode, there will be a time lag until the new items can be incorporated into fonts and utilities, but here is a partial list of new items.

If you're interested in the new characters, the best place to view them is at http://www.unicode.org/charts/

New Plane 0 Scripts

  • Cham (Cambodia/Vietnam)
  • Kayah Li (Thailand/Myanmar)
  • Lepcha (India)
  • Ol Chiki/Santali (India)
  • Rejang (Indonesia)
  • Saurashtra (India)
  • Sundanese (Indonesia)
  • Vai (Liberia)

Script Extensions

These blocks add characters to previously encoded scripts.

  • Cyrillic Extended-A
  • Cyrillic Extended-B
  • Arabic - characters for math, 4 Qur'anic characters, and multiple characters for different languages
  • Indic - Malayalam, Tamil character sequences, Devanagari chandra a,
    Sanskrit sounds in Gurmukhi, Oriya, Telugu
  • Latin - characters for minority languages and capital German sharp S (rare)
  • Math Symbols
  • Medievalist Punctuation - for research
  • Myanmar Additions
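
The capital sharp S (U+1E9E) makes a handy spot check for whether a tool knows about the 5.1 additions. A minimal sketch, assuming a Python 3 interpreter, whose built-in unicodedata module is based on a later Unicode version; the constant name CAPITAL_SHARP_S is just for this example.

```python
import unicodedata

CAPITAL_SHARP_S = "\u1E9E"  # added in Unicode 5.1
SMALL_SHARP_S = "\u00DF"    # the familiar lowercase sharp s

# The character name as recorded in the Unicode character database.
print(unicodedata.name(CAPITAL_SHARP_S))

# The capital form lowercases to the ordinary sharp s.
print(CAPITAL_SHARP_S.lower() == SMALL_SHARP_S)

# The default uppercase of the small form is still "SS", not the new capital.
print(SMALL_SHARP_S.upper())
```

The last line is why the list above calls the character rare: standard case conversion does not produce it, so it only appears when someone types or inserts it deliberately.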

New Plane 1 Ancient Scripts and Miscellaneous Symbols

  • Carian (Anatolia/Turkey)
  • Lycian (Anatolia/Turkey)
  • Lydian (Anatolia/Turkey)
  • Phaistos Disc (Crete)
  • Domino Tile Symbols
  • Mahjong Tile Symbols
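
The plane of a character follows directly from its code point: each plane holds 0x10000 (65,536) code points, so the plane number is the code point divided by 0x10000. A small sketch, using block starting points as listed in the Unicode code charts; the sample table is only an illustrative handful of the new blocks, and the function name plane is an assumption for this example.

```python
def plane(code_point):
    """Return the Unicode plane number (0 to 16) for a code point."""
    return code_point >> 16  # same as integer division by 0x10000


# First code point of a few blocks added in Unicode 5.1.
SAMPLE_BLOCK_STARTS = {
    "Vai": 0xA500,
    "Cham": 0xAA00,
    "Lycian": 0x10280,
    "Mahjong Tiles": 0x1F000,
}

if __name__ == "__main__":
    for name, start in SAMPLE_BLOCK_STARTS.items():
        print(f"{name}: U+{start:04X} is in Plane {plane(start)}")
```

The distinction matters in practice because Plane 1 characters need more than 16 bits, so older software that assumes one 16-bit unit per character may not handle them.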

About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
