On Thu, 26 Nov 2020 21:41:20 -1000 Jim Meyering <j...@meyering.net> wrote:
> On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering <j...@meyering.net> wrote: > > > > On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <j...@meyering.net> wrote: > > > Thank you for the fine bug report. > > > The grep-3.6 bug you've exposed is due to the fact that your input > > > triggers excessive hash collisions when using the code modeled after > > > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase > > > take O(N^2) time for N patterns. In the attached, I've switched grep > > > to use the djb2 hash function, and that resolves the problem. I'll > > > also add a NEWS entry and a test before pushing this. > > > > Timings suggest that grep-3.6's preprocessing came closer to O(N^3). > > Here's an example that would take 2-3 days with grep-3.6 and only > > seconds with this fix: > > > > : | grep -Ff <(seq 6400000 | tr 0-9 A-J) > > > > Here's a complete patch. > > I'll push it later today. > > Pushed along with two gnulib-related changes. The fix has improved some performance. However, it's still quite slow compared to version 3.3, and that can be remedied. It converts to grep only if the potential match does not match the word frequently.
From 1bfcdca658bd91dd6b8e6e3a96c9e77678bb4d2e Mon Sep 17 00:00:00 2001 From: Norihiro Tanaka <nori...@kcn.ne.jp> Date: Thu, 3 Dec 2020 17:22:50 +0900 Subject: [PATCH] grep: improvement of grep convertion for fgrep * src/kwsearch.c (struct kwsearch): Add new members. (Fcompile): Initialize them. (Fexecute): Convert to grep only if the potential match does not match the word frequently. --- src/kwsearch.c | 17 +++++++++++++++-- 1 files changed, 15 insertions(+), 2 deletions(-) diff --git a/src/kwsearch.c b/src/kwsearch.c index 685502d..a7542fd 100644 --- a/src/kwsearch.c +++ b/src/kwsearch.c @@ -41,6 +41,11 @@ struct kwsearch /* The user's pattern compiled as a regular expression, or null if it has not been compiled. */ void *re; + + /* The number of matched and not matched with word separators. These + are used to decide whether convert patterns to grep or not. */ + size_t word_success; + size_t word_failure; }; /* Compile the -F style PATTERN, containing SIZE bytes that are @@ -97,6 +102,8 @@ Fcompile (char *pattern, size_t size, reg_syntax_t ignored, bool exact) kwsearch->pattern = pattern; kwsearch->size = size; kwsearch->re = NULL; + kwsearch->word_success = 0; + kwsearch->word_failure = 0; return kwsearch; } @@ -181,7 +188,9 @@ Fexecute (void *vcp, char const *buf, size_t size, size_t *match_size, else goto success; } - if (!start_ptr && !localeinfo.multibyte) + if (!start_ptr && !localeinfo.multibyte + && (kwsearch->word_failure & ~0xffff) + && kwsearch->word_success < kwsearch->word_failure) { if (! kwsearch->re) { @@ -209,7 +218,11 @@ Fexecute (void *vcp, char const *buf, size_t size, size_t *match_size, struct kwsmatch shorter_match; if (kwsexec (kwset, beg, --len, &shorter_match, true) != 0) - break; + { + kwsearch->word_success++; + break; + } + kwsearch->word_failure++; len = shorter_match.size; } -- 1.7.1