bug#47264: [PATCH v2] pcre: migrate to pcre2

Carlo Arenas Sun, 14 Nov 2021 20:45:09 -0800

On Sun, Nov 14, 2021 at 7:18 PM Paul Eggert <[email protected]> wrote:
> On 11/14/21 14:25, Carlo Arenas wrote:
> > using idx_t instead of size_t should be fine (it only halves the max
> > size of the objects managed), but I am concerned that assuming
> > PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> > might be risky (at least without a comment), and considering that is
> > part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.
>
> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
> forward compatibility to a potential future version of PCRE2 that may
> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
> default for PCRE2_SIZE_MAX, namely SIZE_MAX.


which is why I said it would be better to at least document that
assumption in a comment, as was done everywhere else the code relies
on assumptions about the pcre library.

Interestingly enough, this discussion gave me an idea for a feature in
PCRE where that value would be set to something other than SIZE_MAX,
which might break grep in a future release if it lands.

> > As I mentioned before, PCRE matches the Perl definition as mentioned
> > before in an early draft that also had this change reversed.
>
> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
> correspond to the intuitive meaning of "match words", and it doesn't
> correspond to how grep -w behaves for any grep that I know of.

It all comes from what perl defines[1] as a word character (\w), which
I presume dates from when perl was mostly used for text processing and
most of that text was computer code (which nowadays can include
Unicode).

> Which "early draft" are you talking about? This appears to be merely a
> bug in libpcre2's documentation and implementation.

https://lists.gnu.org/archive/html/grep-devel/2021-10/msg00000.html

> > I would suggest instead that -P should also follow perl convention
> > instead when used together with -w, but maybe that is something that a
> > -P feature flag could enable or disable as needed?
>
> I can't imagine anybody intuitively saying in an English locale that
> "%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a
> user really wants PCRE2's buggy interpretation, they can simply surround
> their regexp with "\b(?:" and ")\b" and not use -w; so there's no need
> to have a different flag for pcre2grep's bizarre interpretation of -w.
>
> Here's another reason why pcre2grep -w is obviously busted:
>
> $ pcre2grep -w ',' <<'EOF'
>  > a,a
>  > a, a
>  > a,
>  > EOF
> a,a
>
> Why is "," a word in the first input line, but not in the second or
> third? pcre2grep is simply wrong here.

that is indeed likely a "bug", but it is one that PCRE shares with perl
(and at least JavaScript, Java, .NET, Python and Ruby):

  $ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'
  ,

but it is also because the feature is not being used correctly: ','
is not a word, and therefore logically none of them should match.
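
A minimal sketch of the same behaviour in Python, one of the engines
listed above. It assumes that Python's re module is an acceptable
stand-in for PCRE2 in this simple case; matches_as_word is a
hypothetical helper that only imitates the documented "\b(?:" ... ")\b"
wrapping and is not part of grep or PCRE2.

```python
import re


def matches_as_word(pattern: str, line: str) -> bool:
    """Imitate the documented -w wrapping: \\b(?:PATTERN)\\b (illustrative only)."""
    wrapped = r"\b(?:" + pattern + r")\b"
    return re.search(wrapped, line) is not None


# The three input lines from the pcre2grep example quoted above.
lines = ["a,a", "a, a", "a,"]
selected = [line for line in lines if matches_as_word(",", line)]
print(selected)
```

Only the first line is selected, because \b needs a word character on
exactly one side of each boundary, and only in 'a,a' is the comma
flanked by word characters on both sides.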

> > Note that "word" definition also has a different meaning in a post
> > Unicode world
>
> Yes, but that's an independent issue.

for the '%' example it was not, as the reason it was not matched as
expected by grep is that '%' has a Unicode property marking it as a
punctuation character.
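
As a rough illustration of that property, the sketch below uses
Python's unicodedata and re modules (again only as a stand-in, not
PCRE2 or grep itself) to show the Unicode general category of a few
characters next to whether \w accepts them. The helper name describe
is made up for this example.

```python
import re
import unicodedata


def describe(ch: str) -> tuple[str, bool]:
    """Return (Unicode general category, whether \\w matches) for one character."""
    return unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None


for ch in "a_%,":
    category, is_word_char = describe(ch)
    print(repr(ch), category, is_word_char)
```

Both '%' and ',' are reported as "Po" (other punctuation) and are
rejected by \w, while '_' is connector punctuation ("Pc") and is still
accepted as a word character.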

Carlo

[1] https://perldoc.perl.org/perlrebackslash#%5Cb%7B%7D,-%5Cb,-%5CB%7B%7D,-%5CB
