I found that grep -Fw is extremely slow, regardless of whether the locale is multibyte or not.

$ yes 'abcdefg hijklmn opqrstu vwxyz' | head -100000 >k
$ time -p env LC_ALL=C grep -Fw vwxy k
real 14.03
user 12.51
sys 0.74
$ time -p env LC_ALL=ja_JP.eucJP grep -Fw vwxy k
real 14.29
user 12.67
sys 0.50
$ time -p env LC_ALL=C grep -w vwxy k
real 0.11
user 0.01
sys 0.09
$ time -p env LC_ALL=ja_JP.eucJP grep -w vwxy k
real 0.89
user 0.71
sys 0.15

The first patch fixes the problem. The second patch switches grep -Fw to
the grep matcher in single-byte locales. In single-byte locales, the DFA
(not regex) is also used for word matching, and it is very fast, as the
results above show.

From 7f693ddf06280a0e638f97a1810d454c20c62716 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Tue, 13 Oct 2015 09:42:57 +0900
Subject: [PATCH 1/2] grep: improvement of performance of grep -Fw

After each match, grep -Fw checks whether the preceding character is a
word character by scanning from the head of the buffer, which is
extremely slow.  Now, when grep finds a potential match, it seeks back
to the previous newline and examines from there.
---
 src/kwsearch.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/src/kwsearch.c b/src/kwsearch.c
index 5a91eb6..045ef46 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -124,7 +124,11 @@ Fexecute (char const *buf, size_t size, size_t *match_size,
       if (match_words)
         for (try = beg; ; )
           {
-            if (wordchar (mb_prev_wc (buf, try, buf + size)))
+            char const *bol;
+            bol = beg;
+            while (buf < bol && bol[-1] != eol)
+              --bol;
+            if (wordchar (mb_prev_wc (bol, try, buf + size)))
               break;
             if (wordchar (mb_next_wc (try + len, buf + size)))
               {
-- 
2.4.6
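
For illustration only, here is a minimal standalone sketch of the idea in the first patch. It is not grep's code: line_start and prev_char_start are hypothetical names, and prev_char_start only stands in for the kind of forward scan needed to locate the character before a match in a multibyte-aware way. The cost of that scan is proportional to the distance from the starting point, so starting at the beginning of the current line instead of the head of the buffer bounds the work by the line length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not from grep): walk back from POS to the byte
   just after the previous EOL, or to BUF if there is no earlier EOL.  */
static char const *
line_start (char const *buf, char const *pos, char eol)
{
  assert (buf <= pos);
  while (buf < pos && pos[-1] != eol)
    --pos;
  return pos;
}

/* Hypothetical helper (not from grep): return the start of the last
   character that begins before CUR, found by scanning forward from
   START, which must be a character boundary.  Returns START when
   START == CUR.  Invalid, incomplete, and NUL bytes count as one byte.
   The loop runs once per character between START and CUR.  */
static char const *
prev_char_start (char const *start, char const *cur)
{
  mbstate_t st;
  char const *p = start;
  char const *prev = start;

  memset (&st, 0, sizeof st);
  while (p < cur)
    {
      size_t n = mbrlen (p, (size_t) (cur - p), &st);
      if (n == (size_t) -1 || n == (size_t) -2 || n == 0)
        {
          n = 1;
          memset (&st, 0, sizeof st);   /* reset after an error */
        }
      prev = p;
      p += n;
    }
  return prev;
}
```

Calling prev_char_start (line_start (buf, cur, eol), cur) gives the same answer as prev_char_start (buf, cur) whenever the EOL byte cannot occur inside a multibyte character, but it never looks at bytes before the current line.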

From f43fe791dd4aa8f2ca079ed461349f24de32276a Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Tue, 13 Oct 2015 09:19:10 +0900
Subject: [PATCH 2/2] grep: use grep matcher for grep -Fw in single byte
 locales

In single byte locales, KWset and DFA are used for word matching by the
grep matcher.  This is faster than the kwset matcher, which calls
kwsexec many times until it finds a word match.  So use the grep
matcher for grep -Fw in single byte locales.

* src/grep.c (main): Change pattern for fgrep into grep for grep -Fw
in single byte locales.
---
 src/grep.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index d8ea70f..0ca0d9a 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -2563,9 +2563,12 @@ main (int argc, char **argv)
 
   /* If fgrep in a multibyte locale, then use grep if either
      (1) case is ignored (where grep is typically faster), or
-     (2) the pattern has an encoding error (where fgrep might not work).  */
-  if (compile == Fcompile && MB_CUR_MAX > 1
-      && (match_icase || contains_encoding_error (keys, keycc)))
+     (2) the pattern matches words (where grep is typically faster), or
+     (3) the pattern has an encoding error (where fgrep might not work).  */
+  if (compile == Fcompile
+      && (MB_CUR_MAX > 1 && (match_icase
+                             || contains_encoding_error (keys, keycc)))
+      || (MB_CUR_MAX == 1 && match_words))
     {
       size_t new_keycc;
       char *new_keys;
-- 
2.4.6
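
As a reading aid, the matcher selection described in the second patch can be written as a small standalone predicate. This is only a sketch under an assumption: that the switch to the grep matcher is meant to apply only when -F was requested. The function name and its parameters are hypothetical and stand in for the tests on compile, MB_CUR_MAX, match_icase, contains_encoding_error, and match_words. Since a && b || c parses as (a && b) || c in C, the sketch spells out the grouping explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate (not grep's code): should a -F search be
   handed to the grep matcher instead of the kwset matcher?
   Assumption: every case applies only when -F was requested.  */
static bool
fgrep_should_use_grep (bool is_fgrep, int mb_cur_max, bool match_icase,
                       bool has_encoding_error, bool match_words)
{
  assert (mb_cur_max >= 1);

  if (!is_fgrep)
    return false;

  /* Multibyte locale: switch for -i or a pattern with an encoding error.  */
  if (mb_cur_max > 1)
    return match_icase || has_encoding_error;

  /* Single byte locale: switch for -w.  */
  return match_words;
}
```

With this grouping, a plain grep -w (no -F) in a single byte locale is left alone, while grep -Fw in the same locale is switched to the grep matcher.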