Re: Multiple encoding used with Unihan database

Jim Breen Tue, 17 Apr 2012 16:42:59 -0700

[email protected] wrote:
> Subject: Multiple encoding used with Unihan database
> I frequently consult the Unihan database to get detailed information
> about Japanese and Chinese characters, and I have noticed that at
> least some pages are encoded in more than one encoding, that is to
> say, although the main encoding is in "UTF-8" (as one would expect on
> the Unihan site), certain characters on those pages are encoded in
> "ISO-8859-1", which makes them unreadable until one forces a change
> of the encoding.
>
> I just looked at these pages:
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=58b3
> (character: 墳)
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5893
> (character: 墓)
>
> The wrongly encoded characters appear here in the Hanyu Pinyin
> column: the accented letters are from the ISO-8859-1 charset and not
> from UTF-8 and will only become legible if one changes the encoding
> setting to ISO-8859-1 (which, of course, renders much of the rest of
> the page unusable).
>
> kHanyuPinyin 10485.060:fén,fèn
> kHanyuPinyin 10470.090:mù
>
> I suspect that the providers of this information would like all of
> it to be encoded in UTF-8 and that the current encoding scheme is
> just an accident. :-)


This is very odd. The Unihan data files, which can be downloaded and which
presumably drive that WWW service, have that information correctly encoded.

Quoting from Unihan_Readings.txt (Unicode 6.0):

U+58B3  kCantonese      fan4
U+58B3  kDefinition     grave, mound; bulge; bulging
U+58B3  kHangul         분
U+58B3  kHanyuPinlu     fen2(46)
U+58B3  kHanyuPinyin    10485.060:fén,fèn
U+58B3  kJapaneseKun    HAKA
U+58B3  kJapaneseOn     FUN
U+58B3  kKorean         PWUN
U+58B3  kMandarin       FEN2
U+58B3  kTang           *bhiən
U+58B3  kVietnamese     phần
U+58B3  kXHC1983        0322.071:fén

My guess is that the WWW service is using a pre-release version
which had some encoding errors.
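
To illustrate the symptom described in the quoted message, here is a small
Python sketch. It does not reproduce what the WWW service actually does; it
only assumes that the accented Pinyin letters were written out as ISO-8859-1
bytes on a page that is otherwise UTF-8. The variable names are made up for
the illustration.

```python
# Illustration only: how a page declared as UTF-8 can show unreadable
# accented letters if part of it was written out as ISO-8859-1.
pinyin = "fén"

utf8_bytes = pinyin.encode("utf-8")         # two bytes for the accented e
latin1_bytes = pinyin.encode("iso-8859-1")  # a single 0xE9 byte for it

# A browser that treats the page as UTF-8 cannot decode the lone 0xE9
# byte, so it shows a replacement character instead of the letter.
as_seen_in_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Forcing ISO-8859-1 repairs that byte but garbles the genuine UTF-8
# text on the page, such as the Han character itself.
han = "墳"
as_seen_in_latin1 = han.encode("utf-8").decode("iso-8859-1")

print(repr(as_seen_in_utf8))
print(repr(as_seen_in_latin1))
```

This matches the report: the Pinyin is only legible after switching to
ISO-8859-1, and that switch makes the rest of the page unreadable.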

My advice is to download the data and search it directly.
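
A minimal sketch of what searching the data directly might look like in
Python. It assumes the downloaded Unihan_Readings.txt is UTF-8 text with one
tab-separated "code point, field, value" record per line and "#" comment
lines; the function names lookup and lookup_in_file are hypothetical, not
part of any Unicode tool.

```python
def lookup(lines, codepoint):
    """Collect all fields for one code point, e.g. 'U+58B3'.

    `lines` is any iterable of text lines in the assumed format:
    code point, field name and value separated by tabs.
    """
    fields = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) == 3 and parts[0] == codepoint:
            fields[parts[1]] = parts[2]
    return fields


def lookup_in_file(path, codepoint):
    """Same as lookup(), reading from a file decoded as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return lookup(handle, codepoint)
```

With the file in the current directory, something like
lookup_in_file("Unihan_Readings.txt", "U+58B3") should return a dictionary
containing the kHanyuPinyin entry quoted above.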

Jim Breen

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
