August 2007 Archives

Notes on Japanese Scripts


I'm not a Japanese expert by any means, but here are some of my notes on what I've discovered about Japanese scripts.

Japanese uses an East Asian writing system, but it differs significantly from the Chinese script because it uses three phonetic scripts in addition to the Chinese kanji characters.

Multiple Scripts

The Japanese script is considered one of the most complex because it combines four writing systems in one. Fortunately, three of them are phonetic, but you cannot be considered educated until you can also read Chinese Kanji. The scripts are:

  • Katakana - Based on Chinese, but each symbol is a syllable. Used for foreign words or technical vocabulary.
  • Hiragana - Also based on Chinese, but rounder. Each symbol is also a syllable. Often used for grammatical endings.
  • Rōmaji - Roman (English) alphabet, often mixed in with other scripts in modern Japan
  • Kanji - the set of Chinese characters used in Japanese. However, not all Japanese characters are the same as the characters used for Chinese (hanzi) (Japan Reference)

Phonetic scripts developed in Japan partly as a way to write Japanese grammatical endings (okurigana) not found in Chinese.
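To see how the scripts mix inside a single word, here is a minimal Python sketch that guesses the script of each character from its Unicode block. The function names `script_of` and `scripts_in` are my own, and the ranges cover only the main Hiragana, Katakana and CJK Unified Ideographs blocks, so treat it as a rough illustration rather than a complete classifier.

```python
def script_of(ch):
    """Guess the script of one character from its Unicode block (rough sketch)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:   # Hiragana block
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:   # Katakana block
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:   # CJK Unified Ideographs (main block only)
        return "kanji"
    if ch.isascii() and ch.isalpha():   # plain A-Z / a-z
        return "romaji"
    return "other"


def scripts_in(text):
    """Return the guessed script for every character in text."""
    return [script_of(ch) for ch in text]


# The verb "to eat": a kanji stem followed by hiragana endings (okurigana).
print(scripts_in("食べる"))
# The loanword "television", written in katakana.
print(scripts_in("テレビ"))
```

Letters with macrons (such as the ō in Rōmaji) fall outside ASCII, so this sketch reports them as "other".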

Still more

In addition to the forms found on the Web, there are a few more variants:

  • Furigana - Kanji characters with miniature Katakana or Hiragana above or below to show the phonetic pronunciation.
  • Hentaigana - an archaic syllabary found in soba noodle shops, diplomas, invitations and other times when a formal script might be used. Can also refer to a style of Japanese calligraphy.
  • Manyogana - Another syllabary with Chinese Kanji used only for their phonetic value (not their meaning). These were used in ancient poetry.

Information about these additional scripts can be found at these sites:

As of September 2006, neither Hentaigana nor Manyogana blocks had been developed in Unicode, but there may be non-Unicode fonts that could be used.

Computing Set up

If you just want to set up Japanese on your Windows or Mac computer, see the Penn State Japanese Set Up Page.


North Korea Applies for Internet Domain Code


Although the country code KP (Democratic People's Republic of Korea/North Korea) has been available for some time, there had been no agency in North Korea to administer any .kp Web sites. This has been seen as a sign of the official policy to isolate North Korea from outside influences.

In Aug 2007, ICANN reported that it had received a request to "delegate this domain," but said no decision had been reached as of Aug 14, 2007.

According to Prof. Kim Young-Soo, North Korea does have some access to the Internet, but it is probably available to only a few of the highest-level government officials, including Kim Jong-il.

References


Using UTC vs. Local Server Time


The concept of time zones isn't exactly a Unicode issue, but it does relate to issues of globalization.

This blog entry from 4 Guys from Rolla explains the advantages of storing times/dates in UTC format vs. local time. The first one mentioned is that if your servers switch time zones, your data will still be the same.

Quick UTC Primer

UTC time zones are defined in terms of Greenwich Mean Time (GMT), from the 0° longitude line established at Great Britain's Royal Observatory, Greenwich.

If you live in London, then you live in UTC or GMT. If you live in Paris, which is one time zone to the east, then you live in UTC +1 (or one hour later than London). If you live in Philadelphia (Eastern Time Zone), then you live in UTC -5 (i.e. five hours behind London).

The idea of using UTC is to flatten time zones and place everyone in the GMT (London) time zone, but then add information about how many hours to add or subtract in order to convert to local time. If you have operations in multiple time zones, looking at the UTC time can help you determine the sequence of events much better than local time alone.
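As a rough illustration of the sequencing point, the Python sketch below sorts three made-up events first by their local wall-clock time and then by UTC. The event times, the January date and the fixed offsets (UTC +1 for Paris, UTC -5 for Philadelphia, i.e. standard time) are all assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed standard-time offsets from the primer above (no daylight saving).
LONDON = timezone.utc
PARIS = timezone(timedelta(hours=1))
PHILADELPHIA = timezone(timedelta(hours=-5))

# Hypothetical events, each recorded in its own local time.
events = [
    ("Philadelphia", datetime(2007, 1, 15, 9, 0, tzinfo=PHILADELPHIA)),
    ("Paris", datetime(2007, 1, 15, 13, 30, tzinfo=PARIS)),
    ("London", datetime(2007, 1, 15, 13, 0, tzinfo=LONDON)),
]

# Sorting by the local clock reading alone ignores the offsets.
by_wall_clock = sorted(events, key=lambda e: e[1].replace(tzinfo=None))

# Converting to UTC first gives the true order of events.
by_utc = sorted(events, key=lambda e: e[1].astimezone(timezone.utc))

print([name for name, _ in by_wall_clock])
print([name for name, _ in by_utc])
```

The Philadelphia event has the earliest clock reading (9:00), yet in UTC it is 14:00 and therefore the last of the three to happen.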

About Daylight Savings

Interestingly, even though many countries (Japan is one exception) implement daylight saving time in the summer, UTC does not. Right now (Aug 2007) London is at UTC +1 (1 hour ahead), but in the fall it will return to UTC 0.

For inhabitants of the Eastern Time Zone, the summer offset is UTC -4, and it will return to UTC -5 in November.

That means right now, my EDT time of 2:35 PM converts to 6:35 PM UTC (or 18:35 UTC in military time).
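The same conversion can be checked with a few lines of Python. This is only a sketch: the date of Aug 14, 2007 is my assumption (any date during daylight saving would do), and EDT is modeled as a fixed UTC -4 offset rather than looked up in a time zone database.

```python
from datetime import datetime, timedelta, timezone

# EDT modeled as a fixed offset of UTC -4 (assumption: summer time is in effect).
EDT = timezone(timedelta(hours=-4), "EDT")

local_time = datetime(2007, 8, 14, 14, 35, tzinfo=EDT)   # 2:35 PM EDT
utc_time = local_time.astimezone(timezone.utc)

print(utc_time.strftime("%H:%M UTC"))   # 18:35 UTC
```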

References

NASA - http://science.nasa.gov/Realtime/Rocket_Sci/clocks/time-gmt.html
Federation of the Swiss Watch Industry - http://www.fhs.ch/en/worldclock.php


Vulgar Fractions in Unicode


I've gotten some messages recently for the TLT International Page asking why I did not have codes for fractions (e.g. ½ vs. 1/2) listed, so I did some experimentation and self-reflection.

In the end, I think the fraction codes are interesting, but not generally needed. If you do need fractions to be formatted for typography purposes, CSS is actually the best solution for formatting most simple fractions.

THE TERM VULGAR FRACTION

Actually, the entities I am referring to are called vulgar fractions in Unicode/typography jargon. As far as I can tell, vulgar fractions are just fractions and are meant to contrast with decimal numbers (e.g. 1/2 = .5). I assume the term vulgar refers to usage among the general public (from the original Latin meaning of "people") vs. the scientific community, who presumably stick to decimals.

If you are not concerned about typography, a simple number + slash system is acceptable.

ENTITY CODES AND THEIR PROBLEMS

There are entity codes assigned to them in Unicode, but for Web purposes, I'm a little dubious about using them for the following reasons.
  1. They are inconsistently implemented. The codes for 1/2 (#189/U+00BD), 1/4 (#188/U+00BC) and 3/4 (#190/U+00BE) are in the Latin-1 Supplement block, while the codes for all the other vulgar fractions (thirds, fifths, sixths, eighths) are in the Number Forms block (the 8500's/U+2150s). That means not all fonts support all fractions equally. One font may have only the three Latin-1 fractions but be missing the others. Or the angle of the slash may be different.

  2. I noticed that Dreamweaver in particular has problems deciding how to display &frac12; (½) vs. &#8531; (⅓). It's not just me, by the way - this was also noticed by Lars Bruzelius on the CSS Discuss List.

  3. Key mathematical information could be lost. A fraction code point combines two numbers (numerator and denominator) into one precomposed entry. This is why MathML makes fractions with both a numerator and a denominator.
  4. Screen readers might not recognize entity codes. Screen readers are always a little behind the curve in terms of new Web standards. Although a modern screen reader might understand the code for 1/2 (½), it will likely not know what to do with &#8531; for "1/3". On the other hand, "number slash number" is more likely to make sense to a visually impaired user.
  5. Not all fractions are encoded. Many common fractions have codes, but not all of them do. If you want 1/7 or 4/9, you're out of luck and have to use the "fraction slash" (U+2044) instead.
  6. Legibility can be an issue. When using vulgar fraction codes, the numbers will be much smaller (another potential accessibility issue) and resizing them could be tricky. The CSS solution below allows for better control over your sizing.
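If you want to poke at these code points yourself, here is a small Python sketch using the standard `unicodedata` module. The helper names `describe` and `plain_text` are my own. It shows the code number, the official character name and the numeric value, and it uses NFKC normalization to split a precomposed fraction back into numerator, fraction slash (U+2044) and denominator, which can then be turned into the plain "number slash number" form.

```python
import unicodedata


def describe(ch):
    """Collect basic Unicode facts about a single vulgar fraction character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "decimal": ord(ch),                      # the number used in &#...; entities
        "name": unicodedata.name(ch),
        "value": unicodedata.numeric(ch),
        # Compatibility normalization yields digit + U+2044 + digit.
        "decomposed": unicodedata.normalize("NFKC", ch),
    }


def plain_text(ch):
    """Fallback form: replace the fraction slash with an ordinary slash."""
    return unicodedata.normalize("NFKC", ch).replace("\u2044", "/")


for fraction in "½¼¾⅓⅛":
    info = describe(fraction)
    print(info["code_point"], info["decimal"], info["name"], plain_text(fraction))
```

Note that `plain_text` is meant for a lone fraction character; a mixed number such as 1½ would need a space inserted before the fraction to stay readable.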

A CSS SOLUTION

As I said before, the "number slash number" solution is usually fine for most documents, but you can use CSS to make prettier, smaller-scale fractions...but it is a wee cumbersome.
Note: This solution was originally developed by Lars Bruzelius

First you have to shrink the numerator and the denominator to something like 75% - the slash stays at 100%. Then you have to raise the numerator up slightly (by .5 ex). You can also adjust the letter spacing depending on your font.

.den {font-size: 75%;}
.num {font-size: 75%; vertical-align:.5ex}

In the HTML the code looks like this:

<span class="num">1</span>/<span class="den">7</span>

And here's what it looks like:

1/7
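If you have many fractions to mark up, typing the spans by hand gets tedious. The Python sketch below generates the same markup; the function name `fraction_html` is hypothetical, and it assumes the `num` and `den` classes from the style rules above.

```python
def fraction_html(numerator, denominator):
    """Wrap an integer numerator and denominator in the num/den spans."""
    return (
        f'<span class="num">{int(numerator)}</span>'
        "/"
        f'<span class="den">{int(denominator)}</span>'
    )


print(fraction_html(1, 7))
# <span class="num">1</span>/<span class="den">7</span>
```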

So although I didn't like CSS for superscripts, I do think it is just the trick for vulgar fractions.


Site Explaining Western European Character Sets


I can't believe I missed this, but Unicode guru Alan Wood has a great chart explaining the differences between Windows-1252 (ANSI) vs. Latin 1 (ISO-8859-1) vs. Mac Roman.

http://www.alanwood.net/demos/charsetdiffs.html

For people new to Unicode, this is the chart that explains why non-English characters don't always come out the same between Mac and Windows. Accented letters like é were assigned different code numbers in Mac Roman vs. Windows (in both Latin-1 and Windows-1252, é is 233, but in Mac Roman it is 142).
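You can reproduce a row of that kind of chart with Python's built-in codecs. This is a minimal sketch: `byte_values` is my own helper name, and `cp1252`, `latin-1` and `mac_roman` are Python's codec names for Windows-1252, ISO-8859-1 and Mac Roman.

```python
def byte_values(ch):
    """Return the byte (as hex) that each legacy encoding uses for ch."""
    result = {}
    for encoding in ("cp1252", "latin-1", "mac_roman"):
        try:
            result[encoding] = ch.encode(encoding).hex()
        except UnicodeEncodeError:
            result[encoding] = None   # character does not exist in this encoding
    return result


print(byte_values("é"))

# What happens when Mac Roman bytes are misread as Windows-1252:
garbled = "é".encode("mac_roman").decode("cp1252")
print(garbled)
```

For é, the two Windows/ISO encodings agree on byte e9 while Mac Roman uses 8e, so a file saved in one and opened as the other shows a different character (here Ž).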

These days Mac and Windows can usually translate between each other's encodings (technically both are Unicode)...but the glitches still occur from time to time.


About The Blog

I am a Penn State technology specialist with a degree in linguistics and have maintained the Penn State Computing with Accents page since 2000.

See Elizabeth Pyatt's Homepage (ejp10@psu.edu) for a profile.

Comments

The standard commenting utility has been disabled due to hungry spam. If you have a comment, please feel free to drop me a line at (ejp10@psu.edu).
