On Sun, 13 Jan 2019 08:45:47 +0900 Norihiro Tanaka <nori...@kcn.ne.jp> wrote:
> Hi,
>
> grep uses KWset matcher for multiple word matching.  It is very slow when
> most of the parts matched to a pattern are not words.  So, if a part firstly
> matched to pattern is not a word, use the grep matcher to match for its line.
>
> By the way, if START_PTR is set, grep matcher uses regex matcher which is
> very slow to match words.  Therefore, we use grep matcher when only START_PTR
> is not set.
>
> Example, although it is a very extreme case...
>
> $ cat >pat <<EOF
> 0
> 00 0
> 00 00 0
> 00 00 00 0
> 00 00 00 00 0
> 00 00 00 00 00 0
> 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> EOF
> $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp
>
> $ env LC_ALL=C time -p src/grep -wf pat inp
> real 5.75
> user 5.67
> sys 0.02
>
> Retry after applied the patch.
>
> $ env LC_ALL=C time -p src/grep -wf pat inp
> real 0.32
> user 0.31
> sys 0.00
>
> Thanks,
> Norihiro

I have fixed the previous patch.  This change should not be applied in
multibyte locales: there, the grep matcher does word matching with a regex
whose pattern contains an inverted character class, and that is very slow.
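To make the slowdown in the quoted example concrete, here is a minimal
sketch of a whole-word test applied to fixed-string candidates.  It is not
the code in src/kwsearch.c: it assumes the C locale, uses strstr in place of
the KWset matcher, and the helper names (is_wordchar, is_whole_word,
count_rejected_candidates) are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A word constituent in the C locale: letter, digit, or underscore.  */
static bool
is_wordchar (unsigned char c)
{
  return c == '_' || isalnum (c);
}

/* Return true if LINE[START .. START+LEN) is bounded on both sides by a
   non-word character or by an edge of the line.  */
static bool
is_whole_word (char const *line, size_t line_len, size_t start, size_t len)
{
  assert (start <= line_len && len <= line_len - start);
  bool left_ok = start == 0 || !is_wordchar (line[start - 1]);
  bool right_ok = (start + len == line_len
                   || !is_wordchar (line[start + len]));
  return left_ok && right_ok;
}

/* Hypothetical helper: count the occurrences of NEEDLE in LINE that a
   fixed-string search finds but the whole-word test then rejects.  */
static size_t
count_rejected_candidates (char const *line, char const *needle)
{
  size_t line_len = strlen (line);
  size_t needle_len = strlen (needle);
  size_t rejected = 0;

  if (needle_len == 0)
    return 0;

  /* Advance one byte past each hit so overlapping hits are also seen.  */
  for (char const *p = line; (p = strstr (p, needle)) != NULL; p++)
    if (!is_whole_word (line, line_len, (size_t) (p - line), needle_len))
      rejected++;

  return rejected;
}
```

With the line "00 00 00" and the needle "0", every one of the six hits sits
next to another "0", so all six are rejected.  A line that produces only
such candidates is the case where handing the whole line to the other
matcher once can be cheaper than testing the candidates one by one.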
From 2148a2e62c775c899836de6aca1f1bddf44caa12 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Sun, 13 Jan 2019 07:53:32 +0900
Subject: [PATCH] grep: fix slow multiple word matching

grep uses KWset matcher for multiple word matching.  It is very slow when
most of the parts matched to a pattern are not words.  So, if a part firstly
matched to pattern is not a word, use the grep matcher to match for its line.

By the way, if START_PTR is set, grep matcher uses regex matcher which is
very slow to match words.  Therefore, we use grep matcher when only START_PTR
is not set.

* src/kwsearch.c (Fexecute): If a part matched firstly to pattern is not
word, we use the grep matcher to match for its line.
---
 src/kwsearch.c |   17 +++++++++++++++++
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/src/kwsearch.c b/src/kwsearch.c
index 42567e9..3c6a822 100644
--- a/src/kwsearch.c
+++ b/src/kwsearch.c
@@ -239,6 +239,22 @@ Fexecute (void *vcp, char const *buf, size_t size, size_t *match_size,
               else
                 goto success;
             }
+          if (!start_ptr && !localeinfo.multibyte)
+            {
+              if (! kwsearch->re)
+                {
+                  fgrep_to_grep_pattern (&kwsearch->pattern, &kwsearch->size);
+                  kwsearch->re = GEAcompile (kwsearch->pattern, kwsearch->size,
+                                             RE_SYNTAX_GREP);
+                }
+              end = memchr (beg + len, eol, (buf + size) - (beg + len));
+              end = end ? end + 1 : buf + size;
+              if (EGexecute (kwsearch->re, beg, end - beg, match_size, NULL)
+                  != (size_t) -1)
+                goto success_match_words;
+              beg = end - 1;
+              break;
+            }
           if (!len)
             break;
           offset = kwsexec (kwset, beg, --len, &kwsmatch, true);
@@ -259,6 +275,7 @@ Fexecute (void *vcp, char const *buf, size_t size, size_t *match_size,
  success:
   end = memchr (beg + len, eol, (buf + size) - (beg + len));
   end = end ? end + 1 : buf + size;
+ success_match_words:
   beg = memrchr (buf, eol, beg - buf);
   beg = beg ? beg + 1 : buf;
   len = end - beg;
-- 
1.7.1