Stephane Chazelas wrote:
I don't know the details of why it's done that way, but I'm not sure I can see how calling pcre_exec that way can be quicker than calling it on each individual line/record.
It can be hundreds of times faster in common cases. See: http://git.savannah.gnu.org/cgit/grep.git/commit/?id=f6603c4e1e04dbb87a7232c4b44acc6afdf65fef
Note that this is still wrong:

$ printf 'a\nb\0' | ./src/grep -zxP a
a
b
Thanks, fixed by installing the attached.
Removing PCRE_MULTILINE (and going back to calling pcre_exec on every record separately) would help, except in cases where the user does: grep -xzP '(?m)a'
I don't think grep can address this problem, as in general that would require interpreting the PCRE pattern at run-time and grep should not be delving into PCRE internals. Uses of (?m) lead to unspecified behavior in grep, and applications should not rely on any particular behavior in this area. This is firmly in the Perl tradition, as the Perl documentation for this part of the regular expression syntax says "The stability of these extensions varies widely. Some ... are experimental and may change without warning or be completely removed." Also, the grep manual says that -P "is highly experimental". User beware, that's all.
From 882e652c8988ef9380d043ecfca96953e6c30009 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sat, 19 Nov 2016 03:12:56 -0800
Subject: [PATCH] grep: fix -zxP bug

* NEWS: Document this.
* src/pcresearch.c (Pcompile): Search a line at a time if -x is
used, since -x uses ^ and $.
* tests/pcre: Test this.
---
 NEWS             |  6 +++---
 src/pcresearch.c | 34 ++++++++++++++++++++--------------
 tests/pcre       |  1 +
 3 files changed, 24 insertions(+), 17 deletions(-)

diff --git a/NEWS b/NEWS
index 978ec55..4972c01 100644
--- a/NEWS
+++ b/NEWS
@@ -10,9 +10,9 @@ GNU grep NEWS -*- outline -*-
   >/dev/null" where PROGRAM dies when writing into a broken pipe.
   [bug introduced in grep-2.26]
 
-  grep -Pz no longer rejects patterns containing ^ and $, and is
-  more cautious about special patterns like (?-m) and (*FAIL).
-  [bug introduced in grep-2.23]
+  grep -Pz no longer rejects patterns containing ^ and $, is more
+  cautious about special patterns like (?-m) and (*FAIL), and works
+  when combined with -x.  [bug introduced in grep-2.23]
 
   grep -m0 -L PAT FILE now outputs "FILE".
   [bug introduced in grep-2.5]
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 439945a..01616c2 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -128,22 +128,28 @@ Pcompile (char const *pattern, size_t size)
 
   if (! eolbyte)
     {
-      bool escaped = false;
-      bool after_unescaped_left_bracket = false;
-      for (p = pattern; *p; p++)
-        if (escaped)
-          escaped = after_unescaped_left_bracket = false;
-        else
-          {
-            if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket)
-                || (*p == '(' && (p[1] == '?' || p[1] == '*')))
+      bool line_at_a_time = match_lines;
+      if (! line_at_a_time)
+        {
+          bool escaped = false;
+          bool after_unescaped_left_bracket = false;
+          for (p = pattern; *p; p++)
+            if (escaped)
+              escaped = after_unescaped_left_bracket = false;
+            else
               {
-                flags = (flags & ~ PCRE_MULTILINE) | PCRE_DOLLAR_ENDONLY;
-                break;
+                if (*p == '$' || (*p == '^' && !after_unescaped_left_bracket)
+                    || (*p == '(' && (p[1] == '?' || p[1] == '*')))
+                  {
+                    line_at_a_time = true;
+                    break;
+                  }
+                escaped = *p == '\\';
+                after_unescaped_left_bracket = *p == '[';
               }
-            escaped = *p == '\\';
-            after_unescaped_left_bracket = *p == '[';
-          }
+        }
+      if (line_at_a_time)
+        flags = (flags & ~ PCRE_MULTILINE) | PCRE_DOLLAR_ENDONLY;
     }
 
   *n = '\0';
diff --git a/tests/pcre b/tests/pcre
index 653ef22..a290099 100755
--- a/tests/pcre
+++ b/tests/pcre
@@ -17,5 +17,6 @@ echo | grep -zP '\s$' || fail=1
 echo '.ab' | returns_ 1 grep -Pwx ab || fail=1
 echo x | grep -Pz '[^a]' || fail=1
 printf 'x\n\0' | returns_ 1 grep -zP 'x$' || fail=1
+printf 'a\nb\0' | grep -zxP a && fail=1
 
 Exit $fail
-- 
2.7.4