Hi,

it seems that for long files that start with non-binary data, when the PCRE matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary data and then switches to TEXTBIN_BINARY. But with -Pc, once it is in TEXTBIN_BINARY mode it exits on the next match, which gives bogus -Pc results.
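To make the described behaviour concrete, here is a minimal sketch in C. It is a toy model, not grep's code: the names line_contains, count_matches_model and stop_after_binary_match are invented for illustration, and a plain substring search stands in for the PCRE matcher. It only shows how stopping at the first match after binary data is detected makes a count come out too low.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy model, NOT grep's real code.  Return true if the
   LEN-byte line at LINE contains the NUL-terminated string NEEDLE.  */
static bool
line_contains (const char *line, size_t len, const char *needle)
{
  size_t nlen = strlen (needle);
  if (nlen > len)
    return false;
  for (size_t i = 0; i + nlen <= len; i++)
    if (memcmp (line + i, needle, nlen) == 0)
      return true;
  return false;
}

/* Count the lines of BUF (LEN bytes) that contain NEEDLE.  The input
   is treated as text until a line containing a NUL byte is seen, and
   as binary from then on.  If STOP_AFTER_BINARY_MATCH is true, the
   scan ends at the first match found while in binary mode, which is
   the early exit the report describes.  */
static size_t
count_matches_model (const char *buf, size_t len, const char *needle,
                     bool stop_after_binary_match)
{
  bool binary = false;
  size_t count = 0;
  size_t pos = 0;

  while (pos < len)
    {
      const char *line = buf + pos;
      const char *nl = memchr (line, '\n', len - pos);
      size_t line_len = nl ? (size_t) (nl - line) : len - pos;

      /* A NUL byte switches the model into binary mode for good.  */
      if (memchr (line, '\0', line_len))
        binary = true;

      if (line_contains (line, line_len, needle))
        {
          count++;
          if (binary && stop_after_binary_match)
            break;
        }

      pos += line_len + 1;      /* skip the line and its newline */
    }
  return count;
}
```

In this model, an input with one matching line before the NUL bytes and two after them is counted in full when stop_after_binary_match is false, but the count stops at the first match after the NULs when it is true. The exact numbers grep prints depend on its buffering, so the model is not meant to reproduce the 1 versus 2 of the report.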
Reproducer:
$ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
1
$ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
2

The ./filtered.txt is a sufficiently long text file that contains some NULs after the first 32kB of text, e.g.:
https://bugzilla.redhat.com/attachment.cgi?id=1080646

Original downstream bugzilla:
https://bugzilla.redhat.com/attachment.cgi?id=1080646

Attached is my attempt to fix it, but it may not be the right way to fix it. In particular, the question is whether grep should stop when it finds binary data or not. But at least grep -Pc and grep -P | wc -l should behave the same.

thanks & regards

Jaroslav
From d65bb028e65d22329619e4e8b49d05c2b2535420 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Jaroslav=20=C5=A0karvada?= <jskar...@redhat.com>
Date: Thu, 26 Nov 2015 19:01:33 +0100
Subject: [PATCH] grep: do not stop on binary data if counting in PCRE
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Red Hat bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1269014

Signed-off-by: Jaroslav Škarvada <jskar...@redhat.com>
---
 src/grep.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/grep.c b/src/grep.c
index 2c5e09a..957e71d 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1445,7 +1445,8 @@ grep (int fd, struct stat const *st)
               if (binary_files == WITHOUT_MATCH_BINARY_FILES)
                 return 0;
               textbin = tb;
-              done_on_match = out_quiet = true;
+              if (!count_matches)
+                done_on_match = out_quiet = true;
               nul_zapper = eol;
               skip_nuls = skip_empty_lines;
             }
-- 
2.4.3