On Fri, 27 Nov 2015 06:29:31 -0500 (EST)
Jaroslav Skarvada <jskar...@redhat.com> wrote:

> Hi,
> 
> It seems that for long files which start with non-binary data, when the
> PCRE matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds
> binary data, then it switches to TEXTBIN_BINARY. But in -Pc mode, once in
> TEXTBIN_BINARY, it exits on the next match, causing bogus -Pc results.
> 
> Reproducer:
> $ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
> 1
> $ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
> 2
> 
> The ./filtered.txt is a sufficiently long text file that contains some NULs
> after the first 32 kB of text, e.g.
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
> 
> Original downstream bugzilla:
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
> 
> Attached is my attempt to fix it, but it may not be the right way
> to fix it. In particular, the question is whether grep should stop when
> it finds binary data or not. But at least grep -Pc and grep -P | wc -l
> should behave the same.
> 
> thanks & regards
> 
> Jaroslav

I see that filtered.txt is a binary file, as it includes NULs at line 647.
However, its first 32768 bytes are correctly encoded.

If the first 32768 bytes of a file are correctly encoded, grep -P marks
it not as TEXTBIN_TEXT but as TEXTBIN_UNKNOWN, and once grep finds the
first match, it marks the file as TEXTBIN_TEXT.  However, grep -P -c does
not perform that last step.

So you can get the number of matched lines with grep -a -P -c.
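
As a rough illustration of that workaround (a sketch, not part of the
patch): the commands below build a sample file like the one in the
attached test and check that, with -a, the -c count agrees with the
number of printed lines. The file name sample.txt and the pattern are
placeholders, and a grep built with PCRE support is assumed.

```shell
# Placeholder file: a matching line, more than 32 KiB of valid text,
# a line containing a NUL, more text, and a second matching line.
printf 'a\n%032768d\nb\0\n%032768d\na\n' 0 0 > sample.txt

# -a forces text mode, so counting and printing see the same lines.
count_c=$(LC_ALL=C grep -a -P -c 'a' sample.txt)
count_wc=$(LC_ALL=C grep -a -P 'a' sample.txt | wc -l | tr -d ' ')

echo "grep -a -P -c:      $count_c"
echo "grep -a -P | wc -l: $count_wc"
```

Both numbers should be the same here, since -a sidesteps the
text/binary detection entirely.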

Thanks,
Norihiro
From 6e4aa5ddf0f81cfd86303b958d3c0f93c350a028 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Thu, 26 Nov 2015 10:17:47 +0900
Subject: [PATCH] grep -P / grep -Pc consistent results

If the first 32768 bytes of a file are correctly encoded, grep -P marks
it not as TEXTBIN_TEXT but as TEXTBIN_UNKNOWN, and once grep finds the
first match, marks it as TEXTBIN_TEXT.  However, grep -P -c does not do
that last step.

Reported by Jaroslav Skarvada in http://debbugs.gnu.org/22028

* src/grep.c (grep): Fix this.
* tests/pcre-count: Add new test.
* tests/Makefile.am: Add test for this.
* NEWS: Mention it.
---
 NEWS              |  4 ++++
 src/grep.c        | 14 +++++++-------
 tests/Makefile.am |  1 +
 tests/pcre-count  | 23 +++++++++++++++++++++++
 4 files changed, 35 insertions(+), 7 deletions(-)
 create mode 100755 tests/pcre-count

diff --git a/NEWS b/NEWS
index ac632d7..f498a5b 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,10 @@ GNU grep NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Bug fixes
+
+  grep -P and grep -Pc now give consistent results.
+  [bug introduced in grep-2.21]
 
 * Noteworthy changes in release 2.22 (2015-11-01) [stable]
 
diff --git a/src/grep.c b/src/grep.c
index 2c5e09a..cd1826c 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1415,13 +1415,13 @@ grep (int fd, struct stat const *st)
         }
 
       /* Detect whether leading context is adjacent to previous output.  */
-      if (lastout)
-        {
-          if (textbin == TEXTBIN_UNKNOWN)
-            textbin = TEXTBIN_TEXT;
-          if (beg != lastout)
-            lastout = 0;
-        }
+      if (beg != lastout)
+        lastout = 0;
+
+      /* If the file's textbin has not been determined yet, assume
+         it's text if a matching line has already been found.  */
+      if (textbin == TEXTBIN_UNKNOWN && nlines)
+        textbin = TEXTBIN_TEXT;
 
       /* Handle some details and read more data to scan.  */
       save = residue + lim - beg;
diff --git a/tests/Makefile.am b/tests/Makefile.am
index d379821..2865871 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -105,6 +105,7 @@ TESTS =                                             \
   pcre                                         \
   pcre-abort                                   \
   pcre-context                                 \
+  pcre-count                                   \
   pcre-infloop                                 \
   pcre-invalid-utf8-input                      \
   pcre-jitstack                                        \
diff --git a/tests/pcre-count b/tests/pcre-count
new file mode 100755
index 0000000..78e1c7c
--- /dev/null
+++ b/tests/pcre-count
@@ -0,0 +1,23 @@
+#! /bin/sh
+# Ensure that grep -P and grep -Pc give consistent results.
+# This bug affected grep versions 2.21 through 2.22.
+#
+# Copyright (C) 2015 Free Software Foundation, Inc.
+#
+# Copying and distribution of this file, with or without modification,
+# are permitted in any medium without royalty provided the copyright
+# notice and this notice are preserved.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_pcre_
+
+fail=0
+
+printf 'a\n%032768d\nb\0\n%032768d\na\n' 0 0 > in
+
+LC_ALL=C grep -P 'a' in | wc -l > exp
+
+LC_ALL=C grep -Pc 'a' in > out || fail=1
+compare exp out || fail=1
+
+Exit $fail
-- 
2.4.6
