2016-11-19 03:22:23 -0800, Paul Eggert:
> Stephane Chazelas wrote:
>
> > I don't know the details of why it's done that way, but I'm not
> > sure I can see how calling pcre_exec that way can be quicker
> > than calling it on each individual line/record.
>
> It can be hundreds of times faster in common cases. See:
>
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef
On that same sample, when comparing with pcregrep, an
implementation which does the right thing IMO and is what grep is
documented to do (that is, match each line against the regexp
passed as argument, and without PCRE_MULTILINE), I don't find a
x220 factor, more like a x2.5 factor:

$ time pcregrep "z.*a" k
pcregrep "z.*a" k  1.00s user 0.05s system 99% cpu 1.047 total
$ time ./grep -P "z.*a" k
./grep -P "z.*a" k  0.41s user 0.05s system 99% cpu 0.457 total

On the other hand, if you change the pattern to "z[^+]*a",
pcregrep still takes about one second, but GNU grep takes a lot
longer (I gave up after a few minutes).

With a simpler example:

$ time seq 10000 | ./grep -P '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.002 total
./grep -P '[^a]*1000[^a]*2000'  1.99s user 0.00s system 99% cpu 1.991 total
$ time seq 10000 | pcregrep '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.001 total
pcregrep '[^a]*1000[^a]*2000'  0.00s user 0.00s system 0% cpu 0.002 total

That's at least a x1000 factor.

If I understand correctly, you're calling pcre_exec on several
lines at once, and with PCRE_MULTILINE, as an optimisation for the
cases where the RE is unlikely to match: one call to pcre_exec can
go through many lines at a time, and if it doesn't match, that's
as many lines as you can safely skip.

The example above is one that shows it's an invalid optimisation.
seq 10000 obviously has no line that matches that pattern, but
because GNU grep tries to match it against a string that contains
multiple lines, it matches in several places, and grep then has to
call pcre_exec again to double-check each of those false
positives, which is counter-productive.

AFAICT, that optimisation only brings a modest speed-up in some
cases (and reduces performance in others), but it also reduces
functionality and breaks users' expectations. IMO, it's not worth
it.

[...]
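The difference between the two strategies can be reproduced
outside of grep. Below is a minimal sketch using Python's re
module as a stand-in for the pcre_exec calls; the function name
and structure are illustrative, not grep's or pcregrep's actual
code:

```python
import re

def per_line_grep(pattern, text):
    # The straightforward strategy: compile once, then try the
    # regexp on each line individually, as pcregrep does.
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]

text = '\n'.join(str(i) for i in range(1, 10001))   # like `seq 10000`

# No individual line matches the pattern from the example above:
assert per_line_grep(r'[^a]*1000[^a]*2000', text) == []

# But matched against one buffer holding many lines at once, the
# pattern does match, because [^a] (unlike `.`) also matches '\n'
# and so spans line boundaries; every such match is a false
# positive that then has to be re-checked line by line:
assert re.search(r'[^a]*1000[^a]*2000', text) is not None

# Per-line matching still finds genuine matches as expected:
print(per_line_grep(r'z.*a', 'zebra\nfoo\nplaza'))  # -> ['zebra', 'plaza']
```

Note that a negated class like [^a] matches a newline by default
in both PCRE and Python's re, which is why the multi-line buffer
matches even though no single line does.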
> > grep -xzP '(?m)a'
>
> I don't think grep can address this problem, as in general that
> would require interpreting the PCRE pattern at run-time and grep
> should not be delving into PCRE internals. Uses of (?m) lead to
> unspecified behavior in grep, and applications should not rely on
> any particular behavior in this area.

I agree, grep should not try to outsmart libpcre; it should just
call it on every line (or NUL-delimited record with -z), so that
users can expect the regexp they pass to grep to be matched
against those lines/records.

> This is firmly in the Perl
> tradition, as the Perl documentation for this part of the regular
> expression syntax says "The stability of these extensions varies
> widely. Some ... are experimental and may change without warning or
> be completely removed."

No, those s, m, i... flags are at the core of perl regexp
matching; they are certainly not experimental. Most of the ones
that are "experimental" in perl are not available in PCRE.

> Also, the grep manual says that -P "is
> highly experimental".

User beware, that's all. It's highly experimental because grep is
not using PCRE the straightforward way. The simple "grep -P re"
could be implemented in a few lines of code (the equivalent of a
fgets+pcre_exec loop).

PCRE is the modern de-facto regular expression standard. Most
modern languages have regexps that are more or less compatible
with it (when they don't link to libpcre directly). If you look at
the trends on stackoverflow.com or unix.stackexchange.com, you'll
notice that nowadays GNU grep is more often called with -P than
not. I don't think you can just dismiss it as "not GNU grep's
core functionality".

-- 
Stephane