bug#47264: [PATCH] pcre: migrate to pcre2

Carlo Arenas Mon, 08 Nov 2021 01:48:16 -0800

On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert <[email protected]> wrote:
>
> On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote:
> > Mostly a bug by bug translation of the original code to the PCRE2 API.
> > but includes a couple of fixes as well that might be worth doing in
> > independent patches, if a straight translation is preferable.
>
> Yes, that might be preferable so that if there are problems we can
> bisect better.


Working on it, I will open independent bugs then for them, but I have
to admit I am a little concerned that since they are all touching the
same code it might be more difficult to keep track of the end result
and the need to merge conflict resolution.

Let me know how to help otherwise.

> > Includes backward compatibility and could be made to build all the way
> > to 10.00, but assumes a recent version ~10.30; the configure rule sets
> > a strict minimum of 10.34 as that is required to pass all tests, even
> > if the issues are minimal and likely to be real bugs that the old PCRE
> > just hide, there is likely more work pending in this area.
>
> pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others
> can chime in).

Ironically, I am the user of one of those, as my debian10 developer
box uses 10.32, and indeed CentOS 7 is on even an older 10.23.
Forcing 10.34 to compile would allow us to clean up the code
significantly, and was what I was aiming for originally, but it seems
that the added support needed for older versions isn't that difficult
either.
I didn't want anyone hitting on those old PCRE2 bugs though with this
first release, hence why the configure rule is there for now (even if
I am likely going to remove it for the next version)

> > +  unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4);
>
> We don't need to multiply by 4 any more, right? Because we no longer
> need to do the trick of replacing NUL with \000 in the pattern.

I couldn't find a rationale for it in the emails or commit history,
but will clean it up with the other suggestions for next release.

> > +      flags |= PCRE2_UTF;
> > +#ifdef PCRE2_NEVER_BACKSLASH_C
> > +      /* do not match individual code units but only UTF-8  */
> > +      flags |= PCRE2_NEVER_BACKSLASH_C;
> > +#endif
> > +#ifdef PCRE2_MATCH_INVALID_UTF
> > +      /* consider invalid UTF-8 as a barrier, instead of error  */
> > +      flags |= PCRE2_MATCH_INVALID_UTF;
> > +#endif
>
> Which versions of PCRE2 lack these flags? Should we bother to support
> these old versions?

PCRE2_NEVER_BACKSLASH_C is from 10.20 (~2015) and likely to be
everywhere, the #ifdef is just to help backporters to even earlier
versions (10.00 requires even more of those which I didn't include
here).

Interestingly enough, the CentOS 7 test (which was patched on top of
10.37, because current git doesn't build there anymore) passed all
tests, and only needed a few more of those.

\C is supported with -P in the PCRE version now though, is removing that ok?

> If memory serves grep currently takes care to not pass invalid UTF-8 in
> the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete?

not sure I understand what you mean, but PCRE2_MATCH_INVALID_UTF is
definitely something that helps with binary search because it will try
harder to do matches in the text found, instead of bailing out with an
error.

it still has to do a scan and verify UTF-8 is valid, so you still have
the performance hit that doing binary matching with a C locale avoids.

Carlo

bug#47264: [PATCH] pcre: migrate to pcre2

Reply via email to