On Sun, Nov 7, 2021 at 4:30 PM Paul Eggert <egg...@cs.ucla.edu> wrote: > > On 11/7/21 11:26, Carlo Marcelo Arenas Belón wrote: > > Mostly a bug by bug translation of the original code to the PCRE2 API. > > but includes a couple of fixes as well that might be worth doing in > > independent patches, if a straight translation is preferable. > > Yes, that might be preferable so that if there are problems we can > bisect better.
Working on it, I will open independent bugs then for them, but I have to admit I am a little concerned that since they are all touching the same code it might be more difficult to keep track of the end result and the need to merge conflict resolution. Let me know how to help otherwise. > > Includes backward compatibility and could be made to build all the way > > to 10.00, but assumes a recent version ~10.30; the configure rule sets > > a strict minimum of 10.34 as that is required to pass all tests, even > > if the issues are minimal and likely to be real bugs that the old PCRE > > just hide, there is likely more work pending in this area. > > pcre2 10.34 is two years old. Seems old enough to me (tho perhaps others > can chime in). Ironically, I am the user of one of those, as my debian10 developer box uses 10.32, and indeed CentOS 7 is on even an older 10.23. Forcing 10.34 to compile would allow us to clean up the code significantly, and was what I was aiming for originally, but it seems that the added support needed for older versions isn't that difficult either. I didn't want anyone hitting on those old PCRE2 bugs though with this first release, hence why the configure rule is there for now (even if I am likely going to remove it for the next version) > > + unsigned char *re = xnmalloc (4, size + (fix_len_max + 4 - 1) / 4); > > We don't need to multiply by 4 any more, right? Because we no longer > need to do the trick of replacing NUL with \000 in the pattern. I couldn't find a rationale for it in the emails or commit history, but will clean it up with the other suggestions for next release. > > + flags |= PCRE2_UTF; > > +#ifdef PCRE2_NEVER_BACKSLASH_C > > + /* do not match individual code units but only UTF-8 */ > > + flags |= PCRE2_NEVER_BACKSLASH_C; > > +#endif > > +#ifdef PCRE2_MATCH_INVALID_UTF > > + /* consider invalid UTF-8 as a barrier, instead of error */ > > + flags |= PCRE2_MATCH_INVALID_UTF; > > +#endif > > Which versions of PCRE2 lack these flags? Should we bother to support > these old versions? PCRE2_NEVER_BACKSLASH_C is from 10.20 (~2015) and likely to be everywhere, the #ifdef is just to help backporters to even earlier versions (10.00 requires even more of those which I didn't include here). Interestingly enough, the CentOS 7 test (which was patched on top of 10.37, because current git doesn't build there anymore) passed all tests, and only needed a few more of those. \C is supported with -P in the PCRE version now though, is removing that ok? > If memory serves grep currently takes care to not pass invalid UTF-8 in > the buffer or pattern. Does PCRE2_MATCH_INVALID_UTF make this work obsolete? not sure I understand what you mean, but PCRE2_MATCH_INVALID_UTF is definitely something that helps with binary search because it will try harder to do matches in the text found, instead of bailing out with an error. it still has to do a scan and verify UTF-8 is valid, so you still have the performance hit that doing binary matching with a C locale avoids. Carlo