Friday, December 15, 2006

We Think We Have A (But Not The) List Now

So this is how things usually work. I posted the report last night, lamenting that I could not find a list of Most-Frequently-Used characters. This morning my wife promptly sent me a link she had discovered. Lo and behold, it's a list of the 500 Most-Frequently-Used characters! Sorted by their frequencies!! Right here!!!

But hold on. It sounds a little fishy. The author of that page said the list was originally from a TsingHua University study, but that he had modified it to "better reflect the oversea Chinese environment". Of course, there is neither a reference to the original list nor any explanation of what he changed or how.

So now I am curious. Taking the hint from that page, I googled the phrase "汉字频度表" ("Chinese character frequency table"). This time I did get a lot more hits, including some that seem to refer to the same TsingHua study. But I still could not locate the original study itself. The most important questions for me here are when the study was conducted and how it was sampled. From what I could gather, the study was probably done in the early 80s, when TsingHua was involved in creating a Chinese character table (the equivalent of ASCII) for computer processing.

Then I came upon an interesting piece of information. Someone had done a comparison of three different versions of Most-Frequently-Used Chinese characters. The first is an early study done in 1977. The second might be the one we are looking at, simply referred to as "modified version from a TsingHua University Reference cited by Mr. ChenShuYuan". (So I suppose the mysterious origin of this data evaded this author also.) The third is yet another version, whose origin the comparison's author simply "had forgot", but it was from "the Internet".

Anyway, the comparison does not inspire much confidence in the data, as the three versions differ significantly from each other. The share of characters common to all three versions is 40% for the top 10 characters, 66% for the top 50, 62% for the top 100, etc. It gets a little better further down the list, but still hovers around 80% all the way.
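
For anyone who wants to repeat this kind of check, here is a minimal sketch of how such an overlap figure could be computed. I am assuming the ratio means "the number of characters that appear in the top N of every list, divided by N"; the comparison page does not spell out its method, so that definition is my guess. The function name `top_n_overlap` and the three sample strings are made up for illustration and are not real frequency rankings.

```python
def top_n_overlap(ranked_lists, n):
    """Return the fraction of the top-n characters shared by every list.

    ranked_lists: sequences of characters, each sorted from most to
                  least frequent (strings work fine).
    n:            how many leading characters of each list to compare.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Take the first n characters of each list as a set.
    top_sets = [set(ranked[:n]) for ranked in ranked_lists]
    # Keep only the characters present in every one of those sets.
    common = set.intersection(*top_sets)
    return len(common) / n


# Hypothetical placeholder data, NOT taken from any of the three versions.
list_a = "的一是不了人我在有他"
list_b = "的是一了不在人有我这"
list_c = "一的是了我不人在他有"

for n in (5, 10):
    ratio = top_n_overlap([list_a, list_b, list_c], n)
    print(f"top {n}: {ratio:.0%} in common")
```

Passing only two lists gives a pairwise figure instead. Note that this only asks whether a character appears in the top N at all; it ignores how far its rank differs between lists.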

The author of this comparison did a nice job of pointing out the historical difference in sample context between the versions. The first version was done in 1977, at the tail end of the Cultural Revolution. Its sample had to be skewed by the charged political context of the time. The other two versions, he found, must have been done in Deng Xiaoping's reform era, as can be seen from the rise and fall of certain characteristic characters (oops, sorry). Indeed, the second and third versions agree much more closely: their ratio of common characters is around 90%.

It's quite fascinating to look at all this. The stuff I can find here looks like amateur work, but I suppose some linguists must be doing serious scholarly studies with all the data. It can be safely assumed that none of these lists comes from the most recent study announced by the government this May.

So, we have a list, or two lists, but not the list. Why is it this hard to find a good list of Most-Frequently-Used characters, with its source and reliability information?
