On Sun, Jan 08 2023, Carlo Marcelo Arenas Belón wrote:
> When UTF is enabled for a PCRE match, the corresponding flags are
> added to the pcre2_compile() call, but PCRE2_UCP wasn't included.
>
> This prevents extending the meaning of the character classes to
> include those new valid characters
On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:
> You almost never want "everything Unicode considers a digit", and if you
> do using e.g. \p{Nd} instead of \d would be better in terms of
> expressing your intent.
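
To make the \d versus \p{Nd} distinction concrete, here is a minimal sketch using Python's re module as a stand-in. This is an analogy only, not PCRE2: Python's re.ASCII flag plays roughly the role of compiling without PCRE2_UCP, and because Python's re has no \p{Nd}, the explicit variant uses unicodedata.category() instead. The sample string and the helper names are hypothetical.

```python
import re
import unicodedata

# Hypothetical sample: two ASCII digits, a space, then two
# Arabic-Indic digits (U+0663, U+0664).
SAMPLE = "42 \u0663\u0664"


def find_digits(text, ascii_only):
    """Return every character matched by \\d.

    With ascii_only=True, \\d is limited to [0-9], which is roughly
    what a PCRE2 pattern compiled without PCRE2_UCP does. With
    ascii_only=False, \\d matches any Unicode decimal digit.
    """
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\d", text, flags)


def find_decimal_digits_explicit(text):
    """Spell out the intent "any Unicode decimal digit" explicitly.

    Stand-in for \\p{Nd}: keep characters whose general category is Nd.
    """
    return [ch for ch in text if unicodedata.category(ch) == "Nd"]


if __name__ == "__main__":
    print(find_digits(SAMPLE, ascii_only=True))
    print(find_digits(SAMPLE, ascii_only=False))
    print(find_decimal_digits_explicit(SAMPLE))
```

The first call returns only the ASCII digits, while the other two also return the Arabic-Indic ones. That is the trade-off under discussion: whether \d should quietly widen, or whether a pattern that wants the wide meaning should have to say so.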
For GNU grep, PCRE2_UCP is needed because of examples like what Gro-Tsen
and Karl Pet
On Mon, Jan 09 2023, Paul Eggert wrote:
> On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:
>
>> You almost never want "everything Unicode considers a digit", and if you
>> do using e.g. \p{Nd} instead of \d would be better in terms of
>> expressing your intent.
>
> For GNU grep, PCRE2_UCP is need
Here's a shell session illustrating the problem on Fedora 37, which has
GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.
$ export LC_ALL=en_US.utf8
$ printf '\300\n' | grep '\b'
grep: (standard input): binary file matches
$ printf '\300\n' | grep -P '\b'
$
Plain grep fin
On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:
/b:
155781
(*UCP)/b:
46035
/s:
0
(*UCP)/s:
0
/w:
142468
(*UCP)/w:
9706
So the output still differs, and some of those differences may or may
not be wan