(not a Perl maintainer here)

-=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> 包
> → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> �
>
> Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> perl package (not the whole Debian Perl Team), maybe they have some
> insight what actually goes wrong here and if that's indeed a Perl
> bug.

Seems like a user (wml) bug to me (improper handling of UTF-8 encoded
data):

    → echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
    包赠传阅加者

From perlrun(1):

    -C [number/list]
        The -C flag controls some of the Perl Unicode features.

        As of 5.8.1, the -C can be followed either by a number or a list
        of option letters. The letters, their numeric values, and effects
        are as follows; listing the letters is equal to summing the
        numbers.

            I     1   STDIN is assumed to be in UTF-8
            O     2   STDOUT will be in UTF-8
            E     4   STDERR will be in UTF-8
            S     7   I + O + E

Perhaps the strings in wml need to be decoded from UTF-8 so that they
aren't treated as a sequence of independent bytes? U+0085 is "Next line
(NEL)", which seems to be treated as whitespace here: the UTF-8 encoding
of 包 (U+5305) is the bytes E5 8C 85, so with unicode_strings the
trailing 0x85 would be matched by \s and stripped.

(
Strangely, replacing -CS with a call to STDIN->binmode("UTF-8") doesn't
help:

    echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
    �

Explicitly using Encode helps:

    echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
    Wide character in print at -e line 1, <> line 1.
    包

(the wide character warning is expected, because STDOUT is not
instructed how to encode Unicode characters)
)

-- 
dam