Hi,

found the culprit quicker than expected. I'm no longer sure, though, whether it's really a WML issue or whether it sits even deeper:
Axel Beckert wrote:
> → echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 1
> 包
> → echo 包 | /usr/share/wml/exec/wml_p8_htmlstrip -O 2
> �

Level 2 actually only consists of these two regular expressions being
applied:

* s|(\S+)[ \t]{2,}|$1 |sg
* s|\s+\n|\n|sg

It's the latter one (a really simple regexp) which causes the breakage.
But not always. It depends on which Perl version compatibility level is
used:

→ echo 包 | perl -pe 's|\s+\n|\n|sg;'
包
→ echo 包 | perl -pE 's|\s+\n|\n|sg;'
�

"-E" instead of "-e" means "use the most recent Perl version feature
set"; for this bug it is equivalent to "use 5.014;", as that's what is
used in htmlstrip.

From some point of view, we're lucky, because the feature set of Perl
5.14 wasn't that big: "say state switch unicode_strings".

It's obvious that neither say, state nor switch is causing this. So it
seems as if "use feature unicode_strings" is the culprit. Proof:

→ echo 包 | perl -pe 's|\s+\n|\n|sg;'
包
→ echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
�

Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
perl package (not the whole Debian Perl Team); maybe they have some
insight into what actually goes wrong here and whether that's indeed a
Perl bug.

I'm leaving #959761 open in wml, as I now have an idea how to fix this
there (adding "no feature unicode_strings" to htmlstrip, in the hope
that this doesn't do any collateral damage):

→ echo 包 | perl -pE 'no feature unicode_strings; s|\s+\n|\n|sg;'
包

Regards, Axel
-- 
 ,''`.  |  Axel Beckert <a...@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329 6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486 202E C09E 1D89 9593 0EDE
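A possible mechanism behind the unicode_strings result above, offered as an unconfirmed sketch rather than a diagnosis: 包 is the three bytes E5 8C 85 in UTF-8, and U+0085 (NEL) counts as whitespace under Unicode rules. If the input is matched as undecoded bytes while \s follows Unicode rules, the trailing 0x85 byte would be eaten by \s+\n, leaving a truncated sequence that a terminal shows as �. The sketch below models this in Python rather than Perl (Python bytes patterns use ASCII-only \s, str patterns use Unicode \s); the function names are made up for illustration and are not part of WML or htmlstrip.

```python
import re

# "包" followed by a newline, as the raw UTF-8 bytes that echo would emit.
RAW = "包\n".encode("utf-8")  # bytes E5 8C 85 0A


def strip_ascii_rules(data: bytes) -> bytes:
    # Bytes pattern: \s only matches ASCII whitespace, so 0x85 is left alone.
    return re.sub(rb"\s+\n", b"\n", data)


def strip_unicode_rules_on_bytes(data: bytes) -> bytes:
    # Assumption: "Unicode rules on undecoded input" is modelled by mapping
    # each byte to the code point of the same value (latin-1), so byte 0x85
    # becomes U+0085 (NEL), which Unicode classifies as whitespace.
    text = data.decode("latin-1")
    return re.sub(r"\s+\n", "\n", text).encode("latin-1")


if __name__ == "__main__":
    print(strip_ascii_rules(RAW))             # input passes through unchanged
    print(strip_unicode_rules_on_bytes(RAW))  # last byte of the character is gone
```

If this reading is right, decoding the input as UTF-8 before matching might be an alternative to "no feature unicode_strings"; that is untested here.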