I found that grep -Fw is extremely slow, regardless of whether the
locale is multibyte.

$ yes 'abcdefg hijklmn opqrstu vwxyz' | head -100000 >k
$ time -p env LC_ALL=C grep -Fw vwxy k
real 14.03
user 12.51
sys 0.74
$ time -p env LC_ALL=ja_JP.eucJP grep -Fw vwxy k
real 14.29
user 12.67
sys 0.50

$ time -p env LC_ALL=C grep -w vwxy k
real 0.11
user 0.01
sys 0.09
$ time -p env LC_ALL=ja_JP.eucJP grep -w vwxy k
real 0.89
user 0.71
sys 0.15

The first patch fixes the problem.  The second patch switches grep -Fw
to the grep matcher in single-byte locales.

In single-byte locales, the grep matcher also uses the DFA (not regex)
for word matching, and it is very fast, as the grep -w timings above show.
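
The idea behind the first patch can be sketched outside of grep as
follows.  This is only an illustration, not grep's code: line_start and
prev_char_start are hypothetical names, and prev_char_start is a
stand-in for mb_prev_wc that assumes the previous character has to be
found by decoding forward from a known character boundary.  Under that
assumption the cost of each lookup is proportional to the distance from
the starting point, so starting at the beginning of the line rather
than the beginning of the buffer bounds it by the line length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: return the first byte of the line containing POS.
   BUF is the start of the buffer and EOL is the line terminator byte.  */
static char const *
line_start (char const *buf, char const *pos, char eol)
{
  assert (buf <= pos);
  char const *bol = pos;
  while (buf < bol && bol[-1] != eol)
    --bol;
  return bol;
}

/* Hypothetical stand-in for a previous-character lookup: return the
   start of the character that ends just before POS, decoding forward
   from START, which must be a character boundary.  Return POS if
   START == POS.  The loop runs once per character in [START, POS).  */
static char const *
prev_char_start (char const *start, char const *pos)
{
  mbstate_t st;
  memset (&st, 0, sizeof st);
  char const *p = start;
  char const *prev = pos;
  while (p < pos)
    {
      size_t n = mbrlen (p, pos - p, &st);
      if (n == (size_t) -1 || n == (size_t) -2 || n == 0)
        {
          /* Treat an invalid, incomplete, or null byte as one character.  */
          n = 1;
          memset (&st, 0, sizeof st);
        }
      prev = p;
      p += n;
    }
  return prev;
}
```

A caller would compute bol = line_start (buf, match, eol) once per
candidate match and pass bol, rather than buf, as the starting point
of the lookup.
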
From 7f693ddf06280a0e638f97a1810d454c20c62716 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Tue, 13 Oct 2015 09:42:57 +0900
Subject: [PATCH 1/2] grep: improvement of performance of grep -Fw

After finding a match, grep -Fw checks whether the preceding character
is a word character by examining from the head of the buffer.  That is
extremely slow.  Now, when grep finds a potential match, it seeks back
to the previous newline and examines from there.
---
 src/kwsearch.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/kwsearch.c b/src/kwsearch.c
index 5a91eb6..045ef46 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -124,7 +124,11 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
       if (match_words)
         for (try = beg; ; )
           {
-            if (wordchar (mb_prev_wc (buf, try, buf + size)))
+            char const *bol;
+            bol = beg;
+            while (buf < bol && bol[-1] != eol)
+              --bol;
+            if (wordchar (mb_prev_wc (bol, try, buf + size)))
               break;
             if (wordchar (mb_next_wc (try + len, buf + size)))
               {
-- 
2.4.6

From f43fe791dd4aa8f2ca079ed461349f24de32276a Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Tue, 13 Oct 2015 09:19:10 +0900
Subject: [PATCH 2/2] grep: use grep matcher for grep -Fw in single byte
 locales

In single-byte locales, the grep matcher uses kwset and the DFA for word
matching.  It is faster than the kwset matcher, which calls kwsexec
many times until it finds a word match.  So use the grep matcher for
grep -Fw in single-byte locales.

* src/grep.c (main): Convert the fgrep pattern to a grep pattern for
grep -Fw in single-byte locales.
---
 src/grep.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index d8ea70f..0ca0d9a 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2563,9 +2563,12 @@ main (int argc, char **argv)
 
   /* If fgrep in a multibyte locale, then use grep if either
      (1) case is ignored (where grep is typically faster), or
-     (2) the pattern has an encoding error (where fgrep might not work).  */
-  if (compile == Fcompile && MB_CUR_MAX > 1
-      && (match_icase || contains_encoding_error (keys, keycc)))
+     (2) the pattern matches words (where grep is typically faster), or
+     (3) the pattern has an encoding error (where fgrep might not work).  */
+  if (compile == Fcompile
+      && ((MB_CUR_MAX > 1 && (match_icase
+                              || contains_encoding_error (keys, keycc)))
+          || (MB_CUR_MAX == 1 && match_words)))
     {
       size_t new_keycc;
       char *new_keys;
-- 
2.4.6
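
The selection rule described by the comment in the second patch can be
read as a standalone predicate, sketched below.  The function and
parameter names are hypothetical and do not appear in grep.  The inner
parentheses are written out because && binds tighter than || in C, and
the sketch assumes that the word-matching case is meant to apply only
when -F is in effect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: should a -F search be handed to the grep
   matcher?  MB_CUR_MAX_VALUE stands for the value of MB_CUR_MAX in the
   current locale; the remaining flags mirror the conditions named in
   the patch comment.  */
static bool
fgrep_should_use_grep (bool is_fgrep, int mb_cur_max_value,
                       bool match_icase, bool match_words,
                       bool has_encoding_error)
{
  assert (1 <= mb_cur_max_value);
  return (is_fgrep
          && ((mb_cur_max_value > 1
               && (match_icase || has_encoding_error))
              || (mb_cur_max_value == 1 && match_words)));
}
```

With this reading, a plain grep -w in a single-byte locale (is_fgrep
false) is never converted, while grep -Fw in a single-byte locale is.
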
