Once grep -P has found its first match, the TEXTBIN_UNKNOWN optimization
is no longer used.  Therefore, if grep -P finds a match early, it is very
slow in a UTF-8 locale.

  $ time -p grep -P ^1$ <(seq 999999)
  1
  real 14.55
  user 13.77
  sys 1.12

Likewise, grep -Pa does not use the TEXTBIN_UNKNOWN optimization.
Therefore, it is also very slow in a UTF-8 locale.

  $ time -p grep -Pa a <(seq 999999)
  real 14.53
  user 13.65
  sys 1.35

This change defers abandoning the TEXTBIN_UNKNOWN optimization until
grep -P actually finds a binary character.
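
To illustrate the idea, here is a minimal, simplified model of the deferred
decision.  It is only a sketch and not the code in src/grep.c:
classify_buffer and next_textbin are hypothetical names, the NUL-byte test
is an assumed stand-in for the real buffer_textbin check, and text_mode
models the -a option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for grep's text/binary state.  */
enum textbin { TEXTBIN_BINARY = -1, TEXTBIN_UNKNOWN = 0, TEXTBIN_TEXT = 1 };

/* Hypothetical classifier: treat a buffer as binary if it contains a
   NUL byte.  The real check in grep also looks at encoding errors.  */
static enum textbin
classify_buffer (char const *buf, size_t len)
{
  return memchr (buf, '\0', len) ? TEXTBIN_BINARY : TEXTBIN_UNKNOWN;
}

/* Return the state after reading one more buffer.  NLINES is the number
   of lines output so far; TEXT_MODE models the -a option.  */
static enum textbin
next_textbin (enum textbin cur, char const *buf, size_t len,
              long nlines, bool text_mode)
{
  assert (buf != NULL);

  if (cur != TEXTBIN_UNKNOWN)
    return cur;                 /* Decision already made; keep it.  */

  if (classify_buffer (buf, len) != TEXTBIN_BINARY)
    return TEXTBIN_UNKNOWN;     /* Only text so far: stay on the fast path.  */

  /* A binary byte appeared.  If lines were already output, or -a is in
     effect, fall back to text mode; otherwise treat the file as binary.  */
  return (nlines > 0 || text_mode) ? TEXTBIN_TEXT : TEXTBIN_BINARY;
}
```

In this model, an input such as the output of seq never leaves
TEXTBIN_UNKNOWN, so the fast multi-line search stays available even after
the first match or with -a.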

It yields a speedup of more than 10x.

  $ time -p src/grep -P ^1$ <(seq 999999)
  1
  real 0.97
  user 0.79
  sys 0.24

  $ time -p src/grep -Pa a <(seq 999999)
  real 0.98
  user 0.23
  sys 0.99

BTW, this change conflicts with the proposal in bug#22028.

From 2cf98594e1b7ce7490d0b6d7551f52d65ccd44a4 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Thu, 26 Nov 2015 15:34:13 +0900
Subject: [PATCH] grep: improve performance for grep -P in UTF-8

grep -P switches to line-by-line search after the first match is found
or when the -a option is specified, and that is very slow.  With this
change, it keeps using multi-line search in those cases until a
non-text character is found.

* src/grep.c (grep): Do it.
* NEWS: Mention it.
---
 NEWS       |  6 ++++++
 src/grep.c | 28 ++++++++++++++--------------
 2 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/NEWS b/NEWS
index ac632d7..a9a7042 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,12 @@ GNU grep NEWS                                    -*- outline -*-
 
 * Noteworthy changes in release ?.? (????-??-??) [?]
 
+** Improvements
+
+  Performance has improved for grep -P in UTF-8 locales.  Commands
+  like the following now run more than 10x faster:
+    grep -P ^1$ <(seq 999999)
+    grep -aP a <(seq 999999)
 
 * Noteworthy changes in release 2.22 (2015-11-01) [stable]
 
diff --git a/src/grep.c b/src/grep.c
index 2c5e09a..a1ee183 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -1345,7 +1345,7 @@ grep (int fd, struct stat const *st)
       return 0;
     }
 
-  if (binary_files == TEXT_BINARY_FILES)
+  if (binary_files == TEXT_BINARY_FILES && execute != Pexecute)
     textbin = TEXTBIN_TEXT;
   else
     {
@@ -1415,13 +1415,8 @@ grep (int fd, struct stat const *st)
         }
 
       /* Detect whether leading context is adjacent to previous output.  */
-      if (lastout)
-        {
-          if (textbin == TEXTBIN_UNKNOWN)
-            textbin = TEXTBIN_TEXT;
-          if (beg != lastout)
-            lastout = 0;
-        }
+      if (beg != lastout)
+        lastout = NULL;
 
       /* Handle some details and read more data to scan.  */
       save = residue + lim - beg;
@@ -1442,12 +1437,17 @@ grep (int fd, struct stat const *st)
           enum textbin tb = buffer_textbin (bufbeg, buflim - bufbeg);
           if (textbin_is_binary (tb))
             {
-              if (binary_files == WITHOUT_MATCH_BINARY_FILES)
-                return 0;
-              textbin = tb;
-              done_on_match = out_quiet = true;
-              nul_zapper = eol;
-              skip_nuls = skip_empty_lines;
+              if (nlines || binary_files == TEXT_BINARY_FILES)
+                textbin = TEXTBIN_TEXT;
+              else
+                {
+                  if (binary_files == WITHOUT_MATCH_BINARY_FILES)
+                    return 0;
+                  textbin = tb;
+                  done_on_match = out_quiet = true;
+                  nul_zapper = eol;
+                  skip_nuls = skip_empty_lines;
+                }
             }
         }
     }
-- 
2.4.6
