Deng Xiyue <manphiz-gu...@users.alioth.debian.org> writes:

> Anthony Wong <ypw...@gmail.com> writes:
>
>> 2009/2/17 Arne Goetje <a...@linux.org.tw>
>>>
>>> Matt Kraai wrote:
>>> > On Mon, Feb 16, 2009 at 11:58:20PM +0800, Arne Goetje wrote:
>>> >> Matt Kraai wrote:
>>> >>> We already appear to use a single source version for all three Chinese
>>> >>> translations: Big5. Whether it's possible to change to UTF-8 is for
>>> >>> someone more familiar with Chinese to say. It's not sufficient to
>>> >>> just switch the encoding of this file, though:
>>> >>>
>>> >>> $ make
>>> >>> cd . && wml -q -D CUR_YEAR=2009 -o undef...@ucnucnhkucntw:20090214.zh-cn.html....@g+w -o undef...@uhkucnhkuhktw:20090214.zh-hk.html....@g+w -o undef...@utwucntwuhktw:20090214.zh-tw.html....@g+w --prolog=../../bin/fix_big5.pl 20090214.wml
>>> >>> * Converting: [zh_CN.GB2312], /usr/bin/iconv: illegal input sequence at position 233
>>> >>> make: *** [20090214.zh-cn.html] Error 1
>>> >>>
>>> >> Doesn't surprise me. A number of characters which are present in Big5
>>> >> are not present in GB2312 (and vice versa). Using iconv to convert those
>>> >> characters will lead to such errors.
>>> >>
>>> >> zh-autoconvert might give better results.
>>> >>
>>> >> Else, if you can give me the link to the source, then I can take a look.
>>> >
>>> > Sure, it's available in the webwml CVS module at
>>> > chinese/News/2009/20090214.wml. You can find instructions for
>>> > accessing the repository at
>>> >
>>> > http://www.debian.org/devel/website/using_cvs
>>> >
>>>
>>> OK, attached are the results for review.
>>>
>>> Build-Depends: zh-autoconvert
>>>
>>> To convert from Big5 into GB2312:
>>> autob5 -o gb < 20090214.wml > 20090214_gb2312.wml
>>> To convert from Big5 into UTF-8:
>>> autob5 -o utf8 < 20090214.wml > 20090214_zht_utf8.wml
>>> To convert from Big5 into simplified Chinese UTF-8:
>>> autob5 -o gb < 20090214.wml | autogb -o utf8 > 20090214_zhs_utf8.wml
>>>
>>> I used the latter two commands to generate the attached files.
>>>
>>> The difference between iconv and zh-autoconvert is that iconv simply
>>> tries to convert the codepoints one to one and zh-autoconvert uses a
>>> dictionary to map traditional characters to their simplified
>>> counterparts. Since the database is quite old, it may not work for
>>> simplified <-> traditional mappings where simplified characters have
>>> been added later (GBK) or where the document contains HKSCS characters,
>>> which use the Big5 Private Use Area. Those characters cannot be converted.
>>> I have long wanted to create a new library where a full Unicode
>>> compatible mapping takes place. Unfortunately I don't have the time for
>>> that. But if there are any volunteers out there, I'm willing to
>>> coordinate such a project.
>>>
>>> Cheers
>>> Arne
>>
>> Hi all,
>>
>> I have been thinking that using Big5 as the primary encoding for both
>> TC (Traditional Chinese) and SC (Simplified Chinese) versions of
>> Debian website are detrimental to user contributions. To summarize the
>> current situation of the Chinese versions of Debian website,
>> translations must be done in Big5 WML files, TC version is basically
>> converted simply from WML to HTML, but to generate the SC versions,
>> Big5 files must be converted to GB2312 first. It is done so due to the
>> one-to-many SC-TC mappings problem.
>> To deal with the differences of
>> terms for the same meaning in TC and SC, like 文件 and 檔案, we use a
>> simple mapping table written in Perl and for some terms that are
>> rarely used, inline WML substitution syntax is used, like
>> [CN:文件:][HKTW:檔案:].
>>
>> This puts a hurdle for SC users to submit translations to Debian,
>> because they write in SC but then have to use whatever method to
>> convert it to Big5 for submission. And there is also the possibility
>> that the converted Big5 file may not contain proper TC
>> words/phrases. It also gives people the impression that SC
>> contributors are treated like "second-class citizens" (am I too
>> sensitive?). Not to mention that Big5 and GB2312 are both considered
>> as outdated encodings now and should better be replaced by UTF-8, to
>> make the same file accessible to both TC and SC users.
>>
>> I suggest 1. to convert all existing Chinese WML files for the Debian website
>> from Big5 to UTF-8, and 2. to use MediaWiki's Chinese conversion table to do
>> both TC-SC and SC-TC conversions. This way, we no longer need to care which
>> script the translators use and the burden for them to use Big5 is lifted.
>>
>> For MediaWiki's Chinese conversion system, please see:
>>
>> 1. http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_traditional_Chinese
>> 2. http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/ZhConversion.php?revision=47314&view=markup
>> 3. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hant
>> 4. http://zh.wikipedia.org/wiki/MediaWiki:Conversiontable/zh-hans
>>
>>
>> Any comments?
>>
>> --
>> Anthony
>
> It is great to migrate to UTF-8 encoding to ease encoding conversion.
> However, I'm a little bit concerned with solution for automatic dialect
> handling in mediawiki, which is complicated and possibly error-prone.
> It'll be good if the inline diversion solution currently in use can be
> retained. Plus, several diversions that are synonyms can be unified, as
> the example given above. Ideas?
>
For example, instead of Anthony Wong's [CN:文件:][HKTW:檔案:], which
does need to be differentiated, I mean this one:

我們將[CN:盡最大努力:][HKTW:力盡所能:]

At least in mainland China, both versions are used, so I guess it
can be unified to "力盡所能" :)

Regards,
Deng Xiyue
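
To make the effect of such an inline diversion concrete, here is a minimal Python sketch of how the tags might be resolved per region. It is not how WML itself processes them: the function name resolve_slices, the regular expression, and the reading of a tag such as HKTW as a run of two-letter region codes are all assumptions made for illustration, based only on the two tag forms quoted in this thread.

```python
import re

# Hypothetical pattern for the inline tags quoted in this thread, e.g.
# [CN:...:] or [HKTW:...:]. The tag is read as a run of two-letter
# region codes; this is an assumption, not WML's real slice grammar.
SLICE_RE = re.compile(r"\[((?:CN|HK|TW)+):(.*?):\]", re.DOTALL)


def resolve_slices(text: str, region: str) -> str:
    """Keep the body of every tag that names `region`; drop the others."""
    def pick(match: re.Match) -> str:
        tag, body = match.group(1), match.group(2)
        # Split "HKTW" into {"HK", "TW"}.
        regions = {tag[i:i + 2] for i in range(0, len(tag), 2)}
        return body if region in regions else ""
    return SLICE_RE.sub(pick, text)


source = "我們將[CN:盡最大努力:][HKTW:力盡所能:]"
for region in ("CN", "HK", "TW"):
    print(region, resolve_slices(source, region))
```

With the diversion in place, the CN output reads 我們將盡最大努力 and the HK and TW outputs read 我們將力盡所能. If the two phrases are unified as suggested, the source line becomes plain text with no tags, and the sketch returns it unchanged for every region. Conversion of the CN output to simplified characters is a separate step and is not covered here.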