Hi Damyan,

Damyan Ivanov wrote:
> (not a Perl maintainer here)

Did help nevertheless. Just didn't want to spam the whole Perl Team
with potential Perl bugs. ;-)

> -=| Axel Beckert, 05.05.2020 03:34:28 +0200 |=-
> > → echo 包 | perl -pe 's|\s+\n|\n|sg;'
> > 包
> > → echo 包 | perl -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> > �
> >
> > Which kinda sounds like a Perl bug. Cc'ing the maintainers of Debian's
> > perl package (not the whole Debian Perl Team), maybe they have some
> > insight what actually goes wrong here and if that's indeed a Perl
> > bug.
>
> Seems like a user (wml) bug to me (improper handling of UTF-8 encoded data):
>
> → echo 包赠传阅加者 | perl -CS -M"feature unicode_strings" -pe 's|\s+\n|\n|sg;'
> 包赠传阅加者
>
> From perlrun(1):
>
>     -C [number/list]
>          The -C flag controls some of the Perl Unicode features.
>
>          As of 5.8.1, the -C can be followed either by a number or a list
>          of option letters. The letters, their numeric values, and effects
>          are as follows; listing the letters is equal to summing the
>          numbers.
>
>              I     1   STDIN is assumed to be in UTF-8
>              O     2   STDOUT will be in UTF-8
>              E     4   STDERR will be in UTF-8
>              S     7   I + O + E

Thanks! I was not aware of the -C option...

> Perhaps the strings in wml need to be decoded from UTF-8 so that they
> aren't treated as a sequence of independent bytes?

... and would have expected that "use feature unicode_strings;" already
activates all of this.

> U+0085 is "Next line (NEL)", which seems to be treated as "\n".

I see.

> Strangely, replacing -CS with a call to STDIN->binmode("UTF-8")
> doesn't help:
>
> echo 包 | perl -E 'STDIN->binmode("UTF-8"); while(<>) { s|\s+\n|\n|sg; print }'
> �
>
> Explicitly using Encode helps:
>
> echo 包 | perl -E 'use Encode qw(decode_utf8); while(<>) { $_ = decode_utf8($_); s|\s+\n|\n|sg; print }'
> Wide character in print at -e line 1, <> line 1.
> 包

Thanks, will try to use whatever works from these.

		Regards, Axel
-- 
 ,''`.  |  Axel Beckert <a...@debian.org>, https://people.debian.org/~abe/
: :' :  |  Debian Developer, ftp.ch.debian.org Admin
`. `'   |  4096R: 2517 B724 C5F6 CA99 5329  6E61 2FF9 CD59 6126 16B5
  `-    |  1024D: F067 EA27 26B9 C3FC 1486  202E C09E 1D89 9593 0EDE