2016-11-19 03:22:23 -0800, Paul Eggert:
> Stephane Chazelas wrote:
>
> > I don't know the details of why it's done that way, but I'm not
> > sure I can see how calling pcre_exec that way can be quicker
> > than calling it on each individual line/record.
>
> It can be hundreds of times faster in common cases. See:
>
> http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef
On that same sample, when comparing with pcregrep, an
implementation which does the right thing IMO and is what grep is
documented to do (that is, match each line against the regexp
passed as argument, and without PCRE_MULTILINE), I don't find a
x220 factor, more like a x2.5 factor:

$ time pcregrep "z.*a" k
pcregrep "z.*a" k  1.00s user 0.05s system 99% cpu 1.047 total
$ time ./grep -P "z.*a" k
./grep -P "z.*a" k  0.41s user 0.05s system 99% cpu 0.457 total

On the other hand, if you change the pattern to "z[^+]*a",
pcregrep still takes about one second, but GNU grep takes a lot
longer (I gave up after a few minutes).

With a simpler example:

$ time seq 10000 | ./grep -P '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.002 total
./grep -P '[^a]*1000[^a]*2000'  1.99s user 0.00s system 99% cpu 1.991 total
$ time seq 10000 | pcregrep '[^a]*1000[^a]*2000'
seq 10000  0.00s user 0.00s system 0% cpu 0.001 total
pcregrep '[^a]*1000[^a]*2000'  0.00s user 0.00s system 0% cpu 0.002 total

That's at least a x1000 factor.

If I understand correctly, you're calling pcre_exec on several
lines at once, and with PCRE_MULTILINE, as an optimisation for the
cases where the RE is unlikely to match: one call to pcre_exec can
go through many lines at a time, and if it doesn't match, that's
as many lines as you can safely skip.

The example above is one that shows it's an invalid optimisation.
seq 10000 obviously has no line that matches that pattern, but
because GNU grep tries to match it against a string that contains
multiple lines, it matches in several places, and grep then has to
call pcre_exec again to double-check each of those false
positives, which is counter-productive.

AFAICT, that optimisation only brings a modest speed-up in some
cases (and reduces performance in others), but it also reduces
functionality and breaks users' expectations. IMO, it's not worth
it.

[...]
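The difference between the two strategies can be reproduced
outside of grep. Below is a minimal sketch using Python's re
module as a stand-in for the pcre_exec calls; the function name
and structure are illustrative, not grep's or pcregrep's actual
code:

```python
import re

def per_line_grep(pattern, text):
    # The straightforward strategy: compile once, then try the
    # regexp on each line individually, as pcregrep does.
    rx = re.compile(pattern)
    return [line for line in text.splitlines() if rx.search(line)]

text = '\n'.join(str(i) for i in range(1, 10001))   # like `seq 10000`

# No individual line matches the pattern from the example above:
assert per_line_grep(r'[^a]*1000[^a]*2000', text) == []

# But matched against one buffer holding many lines at once, the
# pattern does match, because [^a] (unlike `.`) also matches '\n'
# and so spans line boundaries; every such match is a false
# positive that then has to be re-checked line by line:
assert re.search(r'[^a]*1000[^a]*2000', text) is not None

# Per-line matching still finds genuine matches as expected:
print(per_line_grep(r'z.*a', 'zebra\nfoo\nplaza'))  # -> ['zebra', 'plaza']
```

Note that a negated class like [^a] matches a newline by default
in both PCRE and Python's re, which is why the multi-line buffer
matches even though no single line does.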
> > grep -xzP '(?m)a'
>
> I don't think grep can address this problem, as in general that
> would require interpreting the PCRE pattern at run-time and grep
> should not be delving into PCRE internals. Uses of (?m) lead to
> unspecified behavior in grep, and applications should not rely on
> any particular behavior in this area.

I agree, grep should not try to outsmart libpcre; it should just
call it on every line (or NUL-delimited record with -z), so that
users can expect the regexp they pass to grep to be matched
against those lines/records.

> This is firmly in the Perl
> tradition, as the Perl documentation for this part of the regular
> expression syntax says "The stability of these extensions varies
> widely. Some ... are experimental and may change without warning or
> be completely removed."

No, those s, m, i... flags are at the core of perl regexp
matching; they are certainly not experimental. Most of the ones
that are "experimental" in perl are not available in PCRE.

> Also, the grep manual says that -P "is
> highly experimental".

User beware, that's all. It's highly experimental because grep is
not using PCRE the straightforward way. The simple "grep -P re"
could be implemented in a few lines of code (the equivalent of a
fgets+pcre_exec loop).

PCRE is the modern de-facto regular expression standard. Most
modern languages have regexps that are more or less compatible
with it (when they don't link to libpcre directly). If you look at
the trends on stackoverflow.com or unix.stackexchange.com, you'll
notice that nowadays GNU grep is more often called with -P than
not. I don't think you can just dismiss it as "not GNU grep's
core functionality".

-- 
Stephane