bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P

2023-01-09 Thread Ævar Arnfjörð Bjarmason
On Sun, Jan 08 2023, Carlo Marcelo Arenas Belón wrote:
> When UTF is enabled for a PCRE match, the corresponding flags are
> added to the pcre2_compile() call, but PCRE2_UCP wasn't included.
>
> This prevents extending the meaning of the character classes to
> include those new valid characters

bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P

2023-01-09 Thread Paul Eggert
On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:
> You almost never want "everything Unicode considers a digit", and if you
> do using e.g. \p{Nd} instead of \d would be better in terms of
> expressing your intent.

For GNU grep, PCRE2_UCP is needed because of examples like what Gro-Tsen and Karl Pet
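
The difference between \d and \p{Nd} discussed in this message can be illustrated outside PCRE2. The following is a minimal sketch in Python, whose `re` module is a different engine and is used here only as an analogy: for str patterns its \d matches any Unicode decimal digit by default (roughly the behavior PCRE2_UCP enables), while the re.ASCII flag restricts it to [0-9]. The sample string and the helper name `digits` are assumptions made for illustration and do not come from the thread.

```python
import re

# Hypothetical sample: ASCII digit 3, ARABIC-INDIC DIGIT THREE, and a letter.
SAMPLE = "3\u0663x"


def digits(text, ascii_only=False):
    """Return the characters in text matched by the digit class."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\d", text, flags)


if __name__ == "__main__":
    print(digits(SAMPLE))                   # Unicode-aware digit class
    print(digits(SAMPLE, ascii_only=True))  # ASCII-only digit class
```

With the Unicode-aware class both digits are reported, whereas the ASCII-only class reports just the "3". Whether the wider meaning is wanted is exactly the point the thread argues about.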

bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P

2023-01-09 Thread Ævar Arnfjörð Bjarmason
On Mon, Jan 09 2023, Paul Eggert wrote:
> On 1/9/23 03:35, Ævar Arnfjörð Bjarmason wrote:
>
>> You almost never want "everything Unicode considers a digit", and if you
>> do using e.g. \p{Nd} instead of \d would be better in terms of
>> expressing your intent.
>
> For GNU grep, PCRE2_UCP is need

bug#60697: GNU grep mishandles \b near encoding errors

2023-01-09 Thread Paul Eggert
Here's a shell session illustrating the problem on Fedora 37, which has GNU grep 3.7. The same bug is still in bleeding-edge GNU grep.

  $ export LC_ALL=en_US.utf8
  $ printf '\300\n' | grep '\b'
  grep: (standard input): binary file matches
  $ printf '\300\n' | grep -P '\b'
  $

Plain grep fin
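
The input in the session above is the single byte with octal value 300 (0xC0) followed by a newline. As a sketch of why that counts as an encoding error in a UTF-8 locale, Python's strict UTF-8 decoder rejects it, because 0xC0 never appears in valid UTF-8. This is an analogy only and is not the code path GNU grep uses. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # The byte sequence produced by printf '\300\n' in the session.
    print(is_valid_utf8(b"\300\n"))
    # A well-formed two-byte sequence (U+00C6) for comparison.
    print(is_valid_utf8("\u00c6\n".encode("utf-8")))
```

The check says nothing about how \b should behave next to such a byte, which is the question the bug report raises.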

bug#60690: [PATCH v2] grep: correctly identify utf-8 characters with \{b, w} in -P

2023-01-09 Thread Paul Eggert
On 1/9/23 11:51, Ævar Arnfjörð Bjarmason wrote:

  /b: 155781
  (*UCP)/b: 46035
  /s: 0
  (*UCP)/s: 0
  /w: 142468
  (*UCP)/w: 9706

So the output still differs, and some of those differences may or may not be wan