[email protected] wrote:
> Subject: Multiple encoding used with Unihan database
>
> I frequently consult the Unihan database to get detailed information
> about Japanese and Chinese characters, and I have noticed that at
> least some pages are encoded in more than one encoding, that is to
> say, although the main encoding is in "UTF-8" (as one would expect on
> the Unihan site), certain characters on those pages are encoded in
> "ISO-8859-1", which makes them unreadable until one forces a change
> of the encoding.
>
> I just looked at these pages:
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=58b3
> (character: 墳)
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=5893
> (character: 墓)
>
> The wrongly encoded characters appear here in the Hanyu Pinyin
> column: the accented letters are from the ISO-8859-1 charset and not
> from UTF-8 and will only become legible if one changes the encoding
> setting to ISO-8859-1 (which renders, of course, much of the rest of
> the page unusable).
>
> kHanyuPinyin 10485.060:fén,fèn
> kHanyuPinyin 10470.090:mù
>
> I suspect that the providers of this information would like to see
> all of it encoded in UTF-8 and that the current encoding scheme
> is just an accident. :-)
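For illustration only, the symptom described above can be reproduced in a few lines of Python. This is a sketch of the general mechanism (the same string serialised in two different encodings and then read back under the wrong one), not a diagnosis of what the Unihan server actually does; the helper name `view_as` is made up for this example.

```python
# Sketch: how "fén,fèn" looks when bytes and declared encoding disagree.
# view_as() is a hypothetical helper, not part of any Unihan tooling.

def view_as(data: bytes, encoding: str) -> str:
    """Decode bytes under a given encoding, replacing undecodable bytes."""
    return data.decode(encoding, errors="replace")


pinyin = "fén,fèn"

utf8_bytes = pinyin.encode("utf-8")          # é -> C3 A9, è -> C3 A8
latin1_bytes = pinyin.encode("iso-8859-1")   # é -> E9,    è -> E8

# ISO-8859-1 bytes read by a browser that was told the page is UTF-8:
# each lone E9/E8 byte is invalid UTF-8 and becomes U+FFFD.
broken_in_utf8 = view_as(latin1_bytes, "utf-8")

# UTF-8 bytes read after forcing the browser to ISO-8859-1:
# each two-byte sequence shows up as two unrelated Latin-1 characters.
broken_in_latin1 = view_as(utf8_bytes, "iso-8859-1")

print(repr(broken_in_utf8))    # 'f\ufffdn,f\ufffdn'
print(repr(broken_in_latin1))  # 'fÃ©n,fÃ¨n'
```

This matches the report: under UTF-8 the pinyin column is illegible, and switching to ISO-8859-1 fixes that column while garbling everything that really was UTF-8.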
This is very odd. The Unihan data files, which can be downloaded and
which presumably drive that WWW service, have that information
correctly coded. Quoting from Unihan_Readings.txt (Unicode 6.0):

U+58B3  kCantonese    fan4
U+58B3  kDefinition   grave, mound; bulge; bulging
U+58B3  kHangul       분
U+58B3  kHanyuPinlu   fen2(46)
U+58B3  kHanyuPinyin  10485.060:fén,fèn
U+58B3  kJapaneseKun  HAKA
U+58B3  kJapaneseOn   FUN
U+58B3  kKorean       PWUN
U+58B3  kMandarin     FEN2
U+58B3  kTang         *bhiən
U+58B3  kVietnamese   phần
U+58B3  kXHC1983      0322.071:fén

My guess is the WWW service is using a pre-release version which had
some coding errors.

My advice is to download the data and search it directly.

Jim Breen

-- 
Jim Breen
Adjunct Snr Research Fellow, Clayton School of IT, Monash University
Webmaster: Hawthorn Rowing Club, Treasurer: Japanese Studies Centre
Graduate student: Language Technology Group, University of Melbourne
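As a rough illustration of searching the downloaded data directly, the following Python sketch indexes records of the form shown above. It assumes the file is UTF-8 text with tab-separated fields (code point, field name, value) and `#` comment lines; the function names and the local path `Unihan_Readings.txt` are placeholders for this example, not part of any official tool.

```python
# Sketch: look up the readings of one character in Unihan_Readings.txt.
# Assumes tab-separated "U+XXXX<TAB>field<TAB>value" records in UTF-8.
from typing import Dict, Iterable


def parse_unihan(lines: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Index Unihan-style records by code point, then by field name."""
    table: Dict[str, Dict[str, str]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue  # skip anything that is not a three-field record
        codepoint, field, value = parts
        table.setdefault(codepoint, {})[field] = value
    return table


def load_unihan(path: str) -> Dict[str, Dict[str, str]]:
    """Read a downloaded data file, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return parse_unihan(handle)


def lookup_char(table: Dict[str, Dict[str, str]], char: str) -> Dict[str, str]:
    """Return all fields recorded for a single character (empty if none)."""
    return table.get("U+%04X" % ord(char), {})


# Two of the records quoted above, used here instead of the real file.
SAMPLE = [
    "# sample records",
    "U+58B3\tkHanyuPinyin\t10485.060:fén,fèn",
    "U+58B3\tkMandarin\tFEN2",
]

sample_table = parse_unihan(SAMPLE)
print(lookup_char(sample_table, "墳"))
```

With the real file in the current directory, `lookup_char(load_unihan("Unihan_Readings.txt"), "墳")` would be the equivalent call; opening the file with an explicit `encoding="utf-8"` avoids depending on the platform default.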

