bug#20526: grep BUG: text file is detected as binary

Paul Eggert Fri, 08 Jan 2016 07:29:51 -0800

Paul Eggert wrote:

> I missed the possibility of a unibyte encoding where not all bytes are valid
> unibyte characters.

I found a significant performance problem related to that bug and bug fix, and
installed the attached further patch 0001. Come to think of it, this issue
should be in NEWS too, so I added the attached patch 0002.

From d1160ec6d239b2e0f20c2fb3395e3b70963bf916 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Jan 2016 21:28:23 -0800
Subject: [PATCH 1/2] grep: improve unibyte -P performance

This is a followon to the recent changes prompted by Bug#20526.
In <http://bugs.gnu.org/bug=20526#86> Norihiro Tanaka pointed out
that grep mistakenly assumed that unibyte locales cannot have
encoding errors.  Here, the mistake hurt performance significantly.
On Fedora 23 x86-64 in the C locale, this patch improved grep's
performance by a factor of 7 when run as "grep -P 'z.*a'" on the
output of "yes $(printf '\200\n') | head -n 1000000000".
* src/pcresearch.c (multibyte_locale) [HAVE_LIBPCRE]: New static var.
(Pcompile): Set it.
(Pexecute): Use it to avoid the need to call
buf_has_encoding_errors in unibyte locales.
---
 src/pcresearch.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/src/pcresearch.c b/src/pcresearch.c
index c0b8678..1fae94d 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -84,6 +84,8 @@ jit_exec (char const *subject, int search_bytes, int search_offset,
 /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty
    string matches when that flag is used.  */
 static int empty_match[2];
+
+static bool multibyte_locale;
 #endif
 
 void
@@ -104,10 +106,14 @@ Pcompile (char const *pattern, size_t size)
   char const *p;
   char const *pnul;
 
-  if (using_utf8 ())
-    flags |= PCRE_UTF8;
-  else if (MB_CUR_MAX != 1)
-    error (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 locales"));
+  if (1 < MB_CUR_MAX)
+    {
+      if (! using_utf8 ())
+        error (EXIT_TROUBLE, 0,
+               _("-P supports only unibyte and UTF-8 locales"));
+      multibyte_locale = true;
+      flags |= PCRE_UTF8;
+    }
 
   /* FIXME: Remove these restrictions.  */
   if (memchr (pattern, '\n', size))
@@ -194,12 +200,16 @@ Pexecute (char *buf, size_t size, size_t *match_size,
      error.  */
   char const *subject = buf;
 
-  /* If the input is free of encoding errors a multiline search is
+  /* If the input is unibyte or is free of encoding errors a multiline search is
      typically more efficient.  Otherwise, a single-line search is
      typically faster, so that pcre_exec doesn't waste time validating
      the entire input buffer.  */
-  bool multiline = ! buf_has_encoding_errors (buf, size - 1);
-  buf[size - 1] = eolbyte;
+  bool multiline = true;
+  if (multibyte_locale)
+    {
+      multiline = ! buf_has_encoding_errors (buf, size - 1);
+      buf[size - 1] = eolbyte;
+    }
 
   for (; p < buf + size; p = line_start = line_end + 1)
     {
-- 
2.5.0
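
For readers skimming the diff, the decision that patch 1/2 makes in Pexecute
can be sketched in isolation roughly as follows. This is only an illustration,
not the code in src/pcresearch.c: use_multiline and has_encoding_errors are
hypothetical names, and has_encoding_errors is a simplified stand-in (built on
mbrlen) for grep's buf_has_encoding_errors, whose real implementation differs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for grep's buf_has_encoding_errors: scan the
   buffer with mbrlen and report whether any byte sequence is invalid
   or incomplete in the current locale.  */
static bool
has_encoding_errors (char const *buf, size_t size)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);
  size_t i = 0;
  while (i < size)
    {
      size_t n = mbrlen (buf + i, size - i, &st);
      if (n == (size_t) -1 || n == (size_t) -2)
        return true;
      i += n ? n : 1;           /* mbrlen returns 0 for a NUL byte.  */
    }
  return false;
}

/* Hypothetical helper mirroring the patched logic: in a unibyte locale
   always search the whole buffer at once and skip the scan; in a
   multibyte locale do so only if the buffer is free of encoding
   errors.  */
static bool
use_multiline (bool multibyte_locale, char const *buf, size_t size)
{
  assert (buf != NULL);
  if (! multibyte_locale)
    return true;
  return ! has_encoding_errors (buf, size);
}
```

In the patch itself the flag is the static multibyte_locale that Pcompile sets
when 1 < MB_CUR_MAX, so a caller of this sketch would pass that condition as
the first argument. The point of the short-circuit is that the scan over the
buffer is never run in unibyte locales.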

From ca68df394ba1d9359c0e4d825394ab875c7fe1c2 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Thu, 7 Jan 2016 21:34:00 -0800
Subject: [PATCH 2/2] doc: mention unibyte encoding fix

* NEWS: Document recent fix for encoding errors in unibyte locales.
---
 NEWS | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/NEWS b/NEWS
index f572a0c..a0f6bbb 100644
--- a/NEWS
+++ b/NEWS
@@ -18,6 +18,11 @@ GNU grep NEWS                                    -*- outline -*-
   grep -c no longer stops counting when finding binary data.
   [bug introduced in grep-2.21]
 
+  grep no longer outputs encoding errors in unibyte locales.
+  For example, if the byte '\x81' is not a valid character in a
+  unibyte locale, grep treats the byte as binary data.
+  [bug introduced in grep-2.21]
+
   grep -oP is no longer susceptible to an infinite loop when processing
   invalid UTF8 just before a match.
   [bug introduced in grep-2.22]
-- 
2.5.0
