Control: clone -1 -2
Control: reassign -2 wml 2.12.2~ds1-2
Control: retitle -2 wml: Regression in "htmlstrip -O2" (default) with Chinese language

Hi,

Boyuan Yang wrote:
> Thanks for raising this issue.

Thanks from me, too. I wasn't aware of such a regression, sorry.

> These build errors might have multiple causes,
> but I stripped the issue down to a (possible) regression of wml. Let's fix
> this issue first before talking about others.
>
> =======================================
> $ wml --version
> This is WML Version 2.12.2
> Copyright (c) 1996-2001 Ralf S. Engelschall.
> Copyright (c) 1999-2001 Denis Barbier.
>
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> GNU General Public License for more details.
> $ cat /etc/issue
> Debian GNU/Linux bullseye/sid \n \l
>
> $ cat a.wml
> <p>
> 包
> </p>
> $ hexdump -C a.wml
> 00000000  3c 70 3e 0a e5 8c 85 0a  3c 2f 70 3e 0a           |<p>.....</p>.|
> 0000000d
> $ wml a.wml > test.txt
> $ cat test.txt
> <p>
> �
> </p>
> $ hexdump -C test.txt
> 00000000  3c 70 3e 0a e5 8c 0a 3c  2f 70 3e 0a              |<p>....</p>.|
> 0000000c
> $
[…]
> I am using Debian Unstable but similar things also happen in Buster.

Can confirm that this is a regression between Stretch and Buster. :-(

> The single character in the a.wml above is U+5305 [1], namely "CJK Unified
> Ideograph-5305", a commonly-used Chinese character. Its UTF-8 encoding is
> "0xE5 0x8C 0x85". However after wml transformation, only "0xE5 0x8C" was kept
> and the "0x85" was dropped. That's surely a regression.

Ack.

I figured out that it's pass 8 of WML's 9 passes:

→ cat a.wml | wml -p1-8
<p>
�
</p>
→ cat a.wml | wml -p1-7
<p>
包
</p>
→ cat a.wml | wml -p1-7,9
<p>
包
</p>
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip
�
→

Pass 8 is htmlstrip, something similar to uglifyjs, but for HTML. Since that
pass should only be there for delivery performance and disk space reasons,
it can likely be left out easily.

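
As a quick cross-check of the byte loss quoted above, here is a minimal Python sketch. It is not part of WML, and the helper name is_valid_utf8 is made up for illustration. It only uses the byte values from the two hexdumps and confirms that the original three bytes decode to U+5305, while the two bytes left after pass 8 are a truncated, invalid UTF-8 sequence.

```python
# Byte values copied from the two hexdumps quoted above (the part between
# the newlines). Nothing here calls wml or htmlstrip.
original = bytes([0xE5, 0x8C, 0x85])  # "包", U+5305, as found in a.wml
stripped = bytes([0xE5, 0x8C])        # what ended up in test.txt


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for name, data in (("a.wml", original), ("test.txt", stripped)):
        status = "valid" if is_valid_utf8(data) else "INVALID"
        print(f"{name}: {data.hex(' ')} -> {status} UTF-8")
```

The same check could be run over whole generated HTML files to find pages that were damaged by the default optimisation level.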

So I see multiple ways to more or less quickly fix this issue in the Debian
web:

* Always call wml with "-p1-7,9".
* Call wml with "-p1-7,9" if any of the affected languages is built.
* Add <nostrip>…</nostrip> containers in the header and footer templates
  for the affected languages.

To be more precise, it's optimisation level 2 of htmlstrip:

→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 0
包
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
包
→ echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2
�
→

The man page says:

  Level 2:
    Good stripping: Same as level 1 plus compression of multiple
    whitespaces (more then one in sequence) to single whitespaces
    [txt,tag] and stripping of trailing whitespaces at the of of a line
    [txt,tag,pre]. This level is the default because while providing good
    optimization the HTML markup is not destroyed and remains human
    readable.

So instead of skipping htmlstrip completely, "-O1" could be passed to wml
wherever I suggested passing "-p1-7,9", as wml passes this option on to
htmlstrip:

→ cat a.wml | wml -O1
<p>
包
</p>

> I cc-ed the wml maintainer in Debian. Axel, is there any possibility to solve
> this regression in both Sid/Testing and Stable?

I think the above is a good first workaround on buster.

With this mail, I clone the bug report and will try to figure out what
change in htmlstrip caused the regression and/or how it can be fixed.

However, I currently have issues building more recent upstream versions of
WML, which is the reason why wml in Unstable hasn't seen an update yet. A
more recent version is in git, but IIRC there was another release or two
recently, which I haven't looked at yet.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <a...@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE