On Thu, Dec 3, 2020 at 12:26 AM Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > On Thu, 26 Nov 2020 21:41:20 -1000 > Jim Meyering <j...@meyering.net> wrote: > > > On Thu, Nov 26, 2020 at 9:03 AM Jim Meyering <j...@meyering.net> wrote: > > > > > > On Wed, Nov 25, 2020 at 3:12 PM Jim Meyering <j...@meyering.net> wrote: > > > > Thank you for the fine bug report. > > > > The grep-3.6 bug you've exposed is due to the fact that your input > > > > triggers excessive hash collisions when using the code modeled after > > > > gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase > > > > take O(N^2) time for N patterns. In the attached, I've switched grep > > > > to use the djb2 hash function, and that resolves the problem. I'll > > > > also add a NEWS entry and a test before pushing this. > > > > > > Timings suggest that grep-3.6's preprocessing came closer to O(N^3). > > > Here's an example that would take 2-3 days with grep-3.6 and only > > > seconds with this fix: > > > > > > : | grep -Ff <(seq 6400000 | tr 0-9 A-J) > > > > > > Here's a complete patch. > > > I'll push it later today. > > > > Pushed along with two gnulib-related changes. > > The fix has improved some performance. However, it's still quite slow > compared to version 3.3, and that can be remedied. > > It converts to grep only if the potential match does not match the word > frequently.
Thank you for that patch. Can you say a little more about the domain of the problem? I.e., is it specific to invocations with "-w"? Can you provide an example that exhibits the performance improvement, with timings?