Thursday, December 14, 2006

Have You Seen Those 581 Characters?

If you are familiar with the 80-20 rule, you probably know that you don't need to learn a whole lot of Chinese characters before you can read and understand 80% of the content in Chinese. No kidding! Back in May, a Chinese government agency unveiled, with much fanfare, the results of an extensive study on the "State of Chinese Language" in the media and daily life.

Among the findings is that the 581 most frequently used characters cover 80% of all content in Chinese media. Going a little further, if you learn 934 characters, you will understand 90% of the Chinese that is out there. With this fact in hand, a Chinese official proudly claimed that it is indeed not difficult to learn Chinese.
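To make the "581 characters cover 80%" idea concrete, here is a minimal sketch of how such a coverage number could be computed from any pile of text. This is only my illustration, not the method used in the official study. The function name `chars_needed` and the tiny sample string are made up for the example.

```python
from collections import Counter

def chars_needed(text, target):
    """Return how many of the most frequent characters are needed
    so that together they account for at least `target` (0..1)
    of all non-whitespace characters in `text`."""
    # Count every character except whitespace (punctuation is NOT filtered here).
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    covered = 0
    # Walk the characters from most to least frequent, accumulating coverage.
    for rank, (ch, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)

# A toy sample: 的 x4, 一 x3, 是 x2, 不 x1 (10 characters in total).
sample = "的的的的一一一是是不"
print(chars_needed(sample, 0.8))  # the top 3 characters cover 9 of 10
```

Run over a large body of real media text, the same loop would report how many characters reach 80% or 90% coverage.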

I came upon this piece of "news" while looking for a list of the most frequently used Chinese characters, so that I could use them to supplement a more primitive test I did with my daughter earlier. I searched for many days and came up empty. The Internet is really not that great yet, in Chinese anyway.

There is a table of common Chinese characters available, originally published in 1988. It has 2500 characters grouped by number of strokes, along with another 1000 less common ones. (The study in May found that some 300 of the 2500 characters have changed, and a correction set is available.)

However, the problem is that this 2500-character table does not come with any per-character frequency data, so it is impossible to derive a "more frequently used" subset from it. As a whole, it is simply too large to be of any use to us.

Although the news that 581 characters cover 80% of Chinese content was widely reported on the (Chinese) Internet, there is not a single trace to be found as to what, exactly, those characters are! You can download the official report, along with some of the data, here. But a list of the 581, or 934, is not among them!

Maybe I am not the only one puzzled by the lack of this data in the public domain. A professor once wondered why standard dictionaries do not include a usage-frequency list. But then I happened upon a fellow by the name of Huang Yong (黄勇), who had done this:

He wrote a little computer program that takes the 2500 common characters in the official list and feeds them into Google search one by one. He then interprets the hits, or search counts, as each character's usage count and sorts the characters accordingly.

Personally, I am not sure these search counts are equivalent to usage. His crude method also has its own flaws, or limitations, as he documents on his page. Nevertheless, it generates a nice table of characters, sorted in descending order of search count. You can also download his program and regenerate the table at any time.
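For the curious, the general idea can be sketched in a few lines. This is not Huang Yong's actual program, which I have not reproduced here. The `hit_count` argument is a hypothetical stand-in for whatever asks a search engine how many results a character gets, and the numbers in `fake_hits` are invented purely so the example runs without touching the network.

```python
from typing import Callable, Iterable, List, Tuple

def rank_by_hits(chars: Iterable[str],
                 hit_count: Callable[[str], int]) -> List[Tuple[str, int]]:
    """Look up a hit count for each character and return
    (character, count) pairs sorted from most to fewest hits."""
    counts = [(ch, hit_count(ch)) for ch in chars]
    return sorted(counts, key=lambda pair: pair[1], reverse=True)

# Made-up numbers standing in for real search results (assumption).
fake_hits = {"的": 900, "一": 700, "鑫": 5, "是": 800}

# Iterating a dict yields its keys; fake_hits.get plays the role of hit_count.
table = rank_by_hits(fake_hits, fake_hits.get)

# Taking the first N rows gives a "top N most frequent" list.
top3 = [ch for ch, _ in table[:3]]
print(top3)
```

Swapping in a real lookup for `hit_count` would be the only change needed, though the result would carry the same caveat: search hits are only a rough proxy for usage.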

All right, so it is not scientific. But it is better than nothing. It is probably good enough for our purpose of identifying the first 300 or 500 or 900 most frequently used Chinese characters anyway.

In the meantime, if you happen to see those pesky 581 characters wandering around, drop me a note, okay?

Thanks!
