bug#60618: unicode characters are not identified as such for \w and \b with -P

Jim Meyering Fri, 06 Jan 2023 23:38:40 -0800

On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering <[email protected]> wrote:
> On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas <[email protected]> wrote:
> > Reported to PCRE[1] with mention of GNU grep being also affected.
> >
> > [1] https://github.com/PCRE2Project/pcre2/issues/185
>
> Yikes. This is a big deal.
> Thank you for the patch and added test.
> I made a tiny comment tweak and this test logic change that was
> required to make the new test pass with the fixed version.
>
> -grep -Po 'r\w' in > out && fail=1
> +grep -Po 'r\w' in > out || fail=1
>
> Also, "make syntax-check" required swapping the compare arguments, e.g.,
>
> -compare out exp || fail=1
> +compare exp out || fail=1
>
> Every bug fix needs a NEWS entry, so I added this:
>
>   With -P, some non-ASCII UTF8 characters were not recognized as
>   word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
>   given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
>   this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
>   After the fix, it prints the correct results: "rú:ú".
>
> Finally, I expanded the ChangeLog entry and gave credit where due.
>
> I'll push this tomorrow:


Must also mention Karl Pettersson in the ChangeLog:

pcre: use UCP in UTF mode

This fixes a serious bug affecting word-boundary and word-constituent regular
expressions when the desired match involves non-ASCII UTF8 characters.
* src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF.
* tests/pcre-utf8-w: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Mention this.
Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
This bug was present from grep-2.5, when --perl-regexp (-P) support was added.
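
For readers without a PCRE2 build at hand, the before/after behavior
described in the NEWS entry can be approximated with Python's re module.
This is only an analogy, not grep's code: re.ASCII stands in for "UTF
without UCP" (so \w and \b consider only [A-Za-z0-9_]), and the default
str matching stands in for "UTF with UCP". The helper names grep_o and
demo are hypothetical, made up for this sketch.

```python
import re


def grep_o(pattern: str, text: str, ucp: bool) -> str:
    """Return all non-overlapping matches joined by spaces.

    Loosely mimics `echo $(grep -o PATTERN)`. This uses Python's re
    engine, not PCRE2: re.ASCII is an assumed stand-in for compiling
    without the UCP flag.
    """
    flags = 0 if ucp else re.ASCII
    return " ".join(re.findall(pattern, text, flags))


def demo(ucp: bool) -> str:
    """Rebuild the NEWS one-liner: $(f 'r\\w'):$(f '.\\b') on "Perú"."""
    word = grep_o(r"r\w", "Perú", ucp)
    boundary = grep_o(r".\b", "Perú", ucp)
    return word + ":" + boundary


if __name__ == "__main__":
    print(demo(ucp=False))  # prints ":r"    (ú not treated as a word character)
    print(demo(ucp=True))   # prints "rú:ú"  (ú treated as a word character)
```

With ASCII-only classes, "ú" is a non-word character, so 'r\w' finds
nothing and the only word boundary that '.\b' can use sits between "r"
and "ú". With Unicode classes, "ú" is word-constituent and the boundary
moves to the end of the string.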
