On Fri, Jan 6, 2023 at 11:28 PM Jim Meyering <j...@meyering.net> wrote:
> On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas <care...@gmail.com> wrote:
> > Reported to PCRE[1] with mention of GNU grep being also affected.
> >
> > [1] https://github.com/PCRE2Project/pcre2/issues/185
>
> Yikes. This is a big deal.
> Thank you for the patch and added test.
> I made a tiny comment tweak and this test logic change that was
> required to make the new test pass with the fixed version.
>
> -grep -Po 'r\w' in > out && fail=1
> +grep -Po 'r\w' in > out || fail=1
>
> Also, make syntax-check required to change e.g.,
>
> -compare out exp || fail=1
> +compare exp out || fail=1
>
> Every bug fix needs a NEWS entry, so I added this:
>
>   With -P, some non-ASCII UTF8 characters were not recognized as
>   word-constituent due to our omission of the PCRE_UCP flag. E.g.,
>   given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
>   this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
>   After the fix, it prints the correct results: "rú:ú".
>
> Finally, I expanded the ChangeLog entry and gave credit where due.
>
> I'll push this tomorrow:

Must also mention Karl Pettersson in the ChangeLog:

    pcre: use UCP in UTF mode

    This fixes a serious bug affecting word-boundary and word-constituent
    regular expressions when the desired match involves non-ASCII UTF8
    characters.

    * src/pcresearch.c: Set PCRE2_UCP together with PCRE2_UTF
    * tests/pcre-utf8-w: New file.
    * tests/Makefile.am (TESTS): Add it.
    * NEWS (Bug fixes): Mention this.
    Reported by Gro-Tsen https://twitter.com/gro_tsen/status/1610972356972875777
    via Karl Pettersson in https://github.com/PCRE2Project/pcre2/issues/185
    This bug was present from grep-2.5, when --perl-regexp (-P) support
    was added.
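
For anyone who wants to see the before/after behavior without a patched grep at hand, the same distinction can be reproduced with a rough analogy in Python. This is only a sketch under stated assumptions: Python's re module is not PCRE2, and re.ASCII merely stands in for "UTF mode without PCRE2_UCP", while the default Unicode matching stands in for PCRE2_UTF together with PCRE2_UCP. The helper name f is borrowed from the NEWS entry; it is not part of grep.

```python
import re

TEXT = "Perú"  # same input as the NEWS example


def f(pattern, flags=0):
    """Concatenate all matches, loosely mimicking `grep -o` on one line."""
    return "".join(re.findall(pattern, TEXT, flags))


# ASCII-only \w and \b: stand-in (assumption) for PCRE2_UTF without PCRE2_UCP.
before = f(r"r\w", re.ASCII) + ":" + f(r".\b", re.ASCII)

# Unicode-aware \w and \b: stand-in (assumption) for PCRE2_UTF | PCRE2_UCP.
after = f(r"r\w") + ":" + f(r".\b")

print(before)  # :r
print(after)   # rú:ú
```

With ASCII-only classes, ú is not a word character, so r\w finds nothing and the only word boundary that follows a character falls after the r. With Unicode-aware classes the boundary moves to the end of the word, which matches the results quoted in the NEWS entry.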