Package: release.debian.org
Severity: normal
Tags: buster
User: release.debian....@packages.debian.org
Usertags: pu
X-Debbugs-Cc: Axel Beckert <a...@debian.org>, debian-www@lists.debian.org

Hi,

(a...@debian.org in x-d-cc, who agreed with my helping on this topic,
and debian-www@lists.debian.org for information)

[ Reason ]
The wml package in buster contains a regression from stretch that leads
to various Unicode-related fun. It can trigger Unicode validity issues
in Chinese, which was seen and worked around for the build of the Debian
website; but it can also misrender various languages, if a non-ASCII
character happens to be the last one on a line in the WML source. That
includes the rather frequent word “à” in French (affecting hundreds of
pages on the Debian website), or “υ” as the last letter of the last word
(seen in Greek). This was also reported for Russian.

Patching the Debian website to avoid running into these situations could
be feasible but would also be impractical, as new/updated translations
would have to be monitored. And that wouldn't fix the rendering for
unsuspecting wml users outside the Debian website use case.

Patching wml instead was discussed in this MR against webmaster-team's
webwml, which includes some examples of bad rendering, and many more
data points down the line (which are summed up below):
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596

[ Impact ]
Broken rendering when non-ASCII characters appear at the end of a line
in WML sources, which might be non-obvious (this wouldn't break a
build).

[ Tests ]
Obviously, I've used the Debian website as a “regression test” that
encompasses many files in various languages. My findings are available
here:
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596#note_240902
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596#note_240938

Basically, `file` can be used to determine whether rendering in
generated HTML files appears to be broken, mixing UTF-8 and
ISO-something (or similar) characters.
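
For readers who want to repeat a similar check without `file`, here is a minimal Python sketch. It is a rough stand-in, not the procedure used for the results reported here: it only flags files whose bytes are not valid UTF-8, whereas `file` reports a more detailed classification. The function name `find_non_utf8` and the `*.html` pattern are assumptions made for illustration.

```python
from pathlib import Path


def find_non_utf8(root: str, pattern: str = "*.html") -> list[Path]:
    """Return files under `root` matching `pattern` that are not valid UTF-8.

    Hypothetical helper: a strict UTF-8 decode fails as soon as a stray
    byte (for example a truncated multi-byte sequence) shows up.
    """
    broken = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            broken.append(path)
    return broken


if __name__ == "__main__":
    import sys

    # Usage: python3 check_utf8.py /path/to/generated/html
    for path in find_non_utf8(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(path)
```

An empty output on the generated tree would suggest that no page mixes UTF-8 with stray non-UTF-8 bytes.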

With this check, I confirmed that all occurrences of “Non-ISO
extended-ASCII” variations are being replaced with full UTF-8 files
(also variations, depending on long lines etc.).

I've also checked the expected changes are happening, with “broken
character” being replaced by “à” many many times in French (we have
700+ affected pages for that language alone).

Non-HTML files don't appear to change much, as expected (those were
inspected via diff, rather than relying on file's output).

The corpus of generated HTML is 64466 files, which seems decent enough
as a real-life regression test…

Finally, I've checked that *only with the patched wml package*,
reverting the workaround that was put in place for Chinese doesn't break
the HTML generation again, and even gets us a better rendering than with
the workaround. More details in:
  https://salsa.debian.org/webmaster-team/webwml/-/merge_requests/596#note_240938

[ Risks ]
I cannot say it will not regress or slightly change the output for some
specific users/files, but I would be quite surprised to see people show
up and complain that we fixed broken rendering…

[ Checklist ]
  [x] *all* changes are documented in the d/changelog
  [x] I reviewed all changes and I approve them
  [x] attach debdiff against the package in stable
  [x] the issue is verified as fixed in unstable

[ Changes ]
The package in buster is 2.12.2~ds1-2 (through an upload to unstable
that migrated into testing). The issue was fixed in the following upload
(2.12.2~ds1-3), which happened 1+ year later, with just a single patch.

I'm proposing to backport this specific upload to buster, hence the
rather obnoxious 2.12.2~ds1-3~deb10u1 version number. I've also
considered 2.12.2~ds1-2+deb10u1, which didn't look much better (and I'm
not sure going with 2.12.2~ds1-4 for cosmetic reasons would be
reasonable).

Thanks for considering!

Cheers,
-- 
Cyril Brulebois (k...@debian.org)            <https://debamax.com/>
D-I release manager -- Release team member -- Freelance Consultant

diff -Nru wml-2.12.2~ds1/debian/changelog wml-2.12.2~ds1/debian/changelog
--- wml-2.12.2~ds1/debian/changelog 2019-02-17 18:39:38.000000000 +0100
+++ wml-2.12.2~ds1/debian/changelog 2021-05-25 05:47:04.000000000 +0200
@@ -1,3 +1,20 @@
+wml (2.12.2~ds1-3~deb10u1) buster; urgency=medium
+
+  * Backport Unicode fix to buster, fixing rendering issues with e.g.
+    non-ASCII characters in various languages, as seen when building the
+    Debian website. Some examples include ‘υ’ in Greek and ‘à’ in French
+    when those characters are at the end of a line.
+
+ -- Cyril Brulebois <k...@debian.org>  Tue, 25 May 2021 05:47:04 +0200
+
+wml (2.12.2~ds1-3) unstable; urgency=medium
+
+  * Add patch to fix regression in Unicode handling (especially Chinese)
+    of "htmlstrip -O2" from Stretch to Buster by adding "no feature
+    'unicode_strings'". (Closes: #959761)
+
+ -- Axel Beckert <a...@debian.org>  Tue, 05 May 2020 14:48:19 +0200
+
 wml (2.12.2~ds1-2) unstable; urgency=medium
 
   * Recommend libgd-perl: wml::des::imgbg now uses GD.pm instead of the
diff -Nru wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch
--- wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch 1970-01-01 01:00:00.000000000 +0100
+++ wml-2.12.2~ds1/debian/patches/fix-unicode-handling-in-htmlstrip.patch 2021-05-25 05:46:53.000000000 +0200
@@ -0,0 +1,19 @@
+Description: Disable feature "unicode_strings" in pass 8 (htmlstrip)
+ It only works properly if file handles are set to utf8 binmode and we
+ can't expect that all input and output is UTF-8. So disable it
+ completely and go back to classic Perl ASCII-only \s meaning.
+Bug-Debian: https://bugs.debian.org/959761
+Author: Axel Beckert <a...@debian.org>
+Forwarded: no
+
+--- a/wml_include/TheWML/Backends/HtmlStrip/Main.pm
++++ b/wml_include/TheWML/Backends/HtmlStrip/Main.pm
+@@ -8,6 +8,8 @@
+ use warnings;
+ use 5.014;
+ 
++no feature qw(unicode_strings);
++
+ use parent 'TheWML::CmdLine::Base';
+ 
+ use Getopt::Long ();
diff -Nru wml-2.12.2~ds1/debian/patches/series wml-2.12.2~ds1/debian/patches/series
--- wml-2.12.2~ds1/debian/patches/series 2019-02-17 16:44:08.000000000 +0100
+++ wml-2.12.2~ds1/debian/patches/series 2021-05-25 05:46:53.000000000 +0200
@@ -6,3 +6,4 @@
 dont-use-usr-bin-env.patch
 fix-typos-found-by-lintian.patch
 fix-contrib-wml1to2-shebang-line.patch
+fix-unicode-handling-in-htmlstrip.patch
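
As an illustration of why the ASCII-only meaning of `\s` matters for the patch above, here is a small Python sketch of one plausible mechanism. This is an assumption, since the patch description does not spell it out. In UTF-8, “à” is the byte pair C3 A0 and “υ” is CF 85. If undecoded bytes are matched with Unicode semantics, A0 looks like U+00A0 (no-break space) and 85 looks like U+0085 (next line), both of which count as whitespace, so stripping trailing whitespace would cut a character in half. The function `strip_trailing_ws` is hypothetical, and Python's `re` module only mimics the Perl behaviour here.

```python
import re


def strip_trailing_ws(raw: bytes, unicode_semantics: bool) -> bytes:
    """Strip trailing whitespace from undecoded bytes (hypothetical helper).

    Decoding as Latin-1 maps every byte to the code point of the same
    value, imitating a program that treats raw bytes as characters.
    """
    text = raw.decode("latin-1")
    # re.ASCII restricts \s to [ \t\n\r\f\v]; without it, \s also matches
    # U+0085 and U+00A0.
    flags = 0 if unicode_semantics else re.ASCII
    return re.sub(r"\s+$", "", text, flags=flags).encode("latin-1")


line = "voilà".encode("utf-8")  # b'voil\xc3\xa0'

print(strip_trailing_ws(line, unicode_semantics=True))   # b'voil\xc3'
print(strip_trailing_ws(line, unicode_semantics=False))  # b'voil\xc3\xa0'
```

With Unicode semantics the trailing A0 byte is removed, leaving a dangling C3 that is no longer valid UTF-8; with ASCII-only semantics the line is left untouched.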