On Sun, Nov 14, 2021 at 7:18 PM Paul Eggert <egg...@cs.ucla.edu> wrote:
> On 11/14/21 14:25, Carlo Arenas wrote:
> > using idx_t instead of size_t should be fine (if only halves the max
> > size of the objects managed), but I am concerned that assuming
> > PCRE2_SIZE_MAX is always equivalent to SIZE_MAX (as done in patch 4)
> > might be risky (at least without a comment), and considering that is
> > part of the API anyway might be better if kept as PCRE2_SIZE_MAX IMHO.
>
> This shouldn't be a problem in practice. Surely PCRE2_SIZE_MAX is for
> forward compatibility to a potential future version of PCRE2 that may
> define PCRE2_SIZE to be some other type. For PCRE2 10.20 and earlier
> PCRE2_SIZE is hardwired to size_t, so there is only one plausible
> default for PCRE2_SIZE_MAX, namely SIZE_MAX.

which is why I mentioned that it would be better to at least document
that in a comment, as was done everywhere else where assumptions about
the pcre library were relied on. Interestingly enough, this discussion
gave me an idea for a feature in PCRE where that value would be set to
something other than SIZE_MAX, and that might break grep in a future
release if it lands.

> > As I mentioned before, PCRE matches the Perl definition as mentioned
> > before in an early draft that also had this change reversed.
>
> I see that PCRE2 documents that PCRE2_EXTRA_MATCH_WORD surrounds the
> pattern with "\b(?:" and ")\b". However, this is bogus: it doesn't
> correspond to the intuitive meaning of "match words", and it doesn't
> correspond to how grep -w behaves for any grep that I know of.

It all comes from what perl defines[1] as a word character (\w), which
I presume came from the fact that it was used early on for text
processing and most of that text was computer code (which nowadays can
include Unicode).

> Which "early draft" are you talking about? This appears to be merely a
> bug in libpcre2's documentation and implementation.

https://lists.gnu.org/archive/html/grep-devel/2021-10/msg00000.html

> > I would suggest instead that -P should also follow perl convention
> > instead when used together with -w, but maybe that is something that a
> > -P feature flag could enable or disable as needed?
>
> I can't imagine anybody intuitively saying in an English locale that
> "%%" is a word in the string "aa%%aa". PCRE2 is broken, that's all. If a
> user really wants PCRE2's buggy interpretation, they can simply surround
> their regexp with "\b(?:" and ")\b" and not use -w; so there's no need
> to have a different flag for pcre2grep's bizarre interpretation of -w.
>
> Here's another reason why pcre2grep -w is obviously busted:
>
> $ pcre2grep -w ',' <<'EOF'
> > a,a
> > a, a
> > a,
> > EOF
> a,a
>
> Why is "," a word in the first input line, but not in the second or
> third?
> pcre2grep is simply wrong here.

that is indeed likely a "bug", but it is one that PCRE shares with perl
(and at least JavaScript, Java, .NET, Python and Ruby):

$ echo 'a,a' | perl -nle '/\b(,)\b/ and print "$1"'
,

but it is also because the feature is not being used correctly, as ','
is not a word and therefore logically none of them should match.

> > Note that "word" definition also has a different meaning in a post
> > Unicode world
>
> Yes, but that's an independent issue.

for the '%' example it was not: the reason it was not matched as
expected by grep is that '%' has a Unicode property marking it as a
punctuation character.

Carlo

[1] https://perldoc.perl.org/perlrebackslash#%5Cb%7B%7D,-%5Cb,-%5CB%7B%7D,-%5CB
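
PS: for anyone who wants to reproduce the "\b(?:" ... ")\b" behaviour
quoted above without pcre2grep at hand, here is a rough sketch in
Python, one of the languages listed above as sharing the same \b
semantics. The helper name grep_w_pcre_style is made up for this
illustration and is not part of grep, pcre2grep or PCRE2; it only
imitates the wrapping that the documentation describes and assumes the
default Python definition of \w.

```python
import re


def grep_w_pcre_style(pattern, lines):
    # Hypothetical helper: wrap the pattern the way the quoted PCRE2
    # documentation describes for PCRE2_EXTRA_MATCH_WORD, then keep
    # every line where the wrapped pattern matches somewhere.
    wrapped = re.compile(r"\b(?:" + pattern + r")\b")
    return [line for line in lines if wrapped.search(line)]


# Same three input lines as the pcre2grep example in the thread.
# \b needs a word character on exactly one side, so only the line
# where ',' sits between two word characters is selected.
print(grep_w_pcre_style(",", ["a,a", "a, a", "a,"]))   # ['a,a']

# The "%%" case: '%' is not a word character, but both ends touch 'a'.
print(grep_w_pcre_style("%%", ["aa%%aa"]))             # ['aa%%aa']
```

In 'a, a' and 'a,' the character after ',' is a space or the end of the
line, so there is no word/non-word transition there and \b fails.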