Hi all, (with my Debian Chinese Team hat on)
(see bottom...)

On Sunday, 2020-05-03 at 22:57 +0200, Holger Wansing wrote:
> Hi,
>
> Laura Arjona Reina <larj...@debian.org> wrote:
> > There are some issues with some Chinese pages when they are built in a
> > buster machine.
> > We need to fix those issues (at least the "Malformed UTF-8 character
> > [...] at ../../bin/tocn.pl [...]" ones) so DSA can upgrade the
> > www-master machine to buster. See the summary of the log at the bottom
> > to know which files produce this error.
> > I have no idea of how to fix the issues, so any help from the Chinese
> > team or web team mates is greatly appreciated.
> > Additional issues may arise (e.g. I still didn't test the release-notes
> > or doc-manual), any help testing is welcome too, please create bug
> > reports for each different issue or update the existing ones. Thanks!
> >
> > LONG VERSION
> >
> > I've done a test build of the /english and /chinese subdirs in a buster
> > machine, and I have noticed some warnings/errors related to the Chinese
> > pages (some, not all of them).
> >
> > It would be desirable to upgrade www-master machine to buster as soon as
> > possible, so any help with this (from website or Chinese team members)
> > is very appreciated.
> >
> > Below you can find an extract of the build log, including only the
> > files for which I got some error or warning message.
> >
> > After the build, I have compared the problematic HTML files of a build
> > in stretch and a build in buster with a diff tool, to see if there were
> > significant changes in the html output due to these issues.
> >
> > Here are my results:
> >
> > * For the messages of the type ", [zh_TW]Invalid UTF8: " when building,
> > I couldn't note any difference between the output of a stretch build and
> > the output of a buster build.
> >
> > I would say this is not a blocker for the buster upgrade of www-master.
>
> Don't know what I did different than Laura, but here some of the built html
> files with "Invalid UTF8: ... " messages are lacking much of the content,
> compared to the one currently at www-master.
> So maybe they are also serious.
>
> > * For the messages of the type "Malformed UTF-8 character [...] at
> > ../../bin/tocn.pl [...]" I have seen important changes in the HTML diff,
> > I think the output in the stretch build is totally broken (fortunately,
> > there are not many files in that situation).
> >
> > I would say this is a blocker for the buster upgrade of www-master, but
> > I would prefer somebody of the Chinese team to confirm (try to build
> > those files in a buster machine, and review the output).
>
> Maybe someone from the chinese people can solve this, but if not, I want
> to propose a possible (temporary) solution:
>
> If I delete the files below from the webwml/chinese tree, I can build
> chinese without any errors. So, probably we can go with a workaround like
> this:
> delete these files, to remove these upgrade blockers out of the way, upgrade
> wolkenstein to buster, and then try to re-add the files step-by-step, maybe
> with some modifications at some point, to get the original situation back.

Thanks for raising this issue. These build errors might have multiple causes,
but I have narrowed one of them down to a (possible) regression in wml. Let's
fix this issue first before talking about the others.

=======================================
$ wml --version
This is WML Version 2.12.2
Copyright (c) 1996-2001 Ralf S. Engelschall.
Copyright (c) 1999-2001 Denis Barbier.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
$ cat /etc/issue
Debian GNU/Linux bullseye/sid \n \l

$ cat a.wml
<p>
包
</p>
$ hexdump -C a.wml
00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
0000000d
$ wml a.wml > test.txt
$ cat test.txt
<p>
�
</p>
$ hexdump -C test.txt
00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|
0000000c
$
==================================================

The single character in the a.wml above is U+5305 [1], namely "CJK Unified
Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
"0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
and the "0x85" was dropped. That's surely a regression.

I am using Debian Unstable but similar things also happen in Buster.

I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
this regression in both Sid/Testing and Stable?

--
Regards,
Boyuan Yang

[1] https://www.compart.com/en/unicode/U+5305
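
P.S. For anyone who wants to double-check the byte-level claim without a
wml installation, here is a minimal Python 3 sketch. It is an illustration
only: it does not reproduce the wml bug itself, and the helper name
is_valid_utf8 and the variable names are made up for this example. The two
byte strings are copied from the hexdump output above.

```python
# Illustration only: compare the bytes of a.wml and test.txt as shown
# in the hexdump output, and check which of them is valid UTF-8.

def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# U+5305 encodes to three bytes in UTF-8.
full = "\u5305".encode("utf-8")
# Dropping the trailing byte leaves an incomplete sequence.
truncated = full[:-1]

# Bytes taken from the two hexdumps in the report.
wml_input = bytes.fromhex("3c 70 3e 0a e5 8c 85 0a 3c 2f 70 3e 0a")
wml_output = bytes.fromhex("3c 70 3e 0a e5 8c 0a 3c 2f 70 3e 0a")

print("U+5305 as UTF-8:", full.hex(" "))
print("input valid UTF-8: ", is_valid_utf8(wml_input))
print("output valid UTF-8:", is_valid_utf8(wml_output))
# A lenient decoder substitutes U+FFFD for the broken sequence,
# which is what the terminal displays for test.txt.
print(wml_output.decode("utf-8", errors="replace"))
```

The output file is one byte shorter than the input, and the only difference
is the missing 0x85, which is consistent with the "Malformed UTF-8
character" messages seen during the build.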