On Thu, Nov 19, 2020 at 7:32 PM Frank Heckenbach <f.heckenb...@fh-soft.de> wrote: > I have a use case where I run grep with a large number of search > patterns on a large text file. It works well with grep-3.3, but with > grep-3.4 it quickly burned through GBs of memory and almost locked > up my system due to swapping. > > To avoid attaching those large files, I could mostly reproduce the > effects like this: > > ulimit -d 5000000 # avoid system lockup due to excessive swapping > export LC_ALL=C # make sure no Unicode case conversions are needed > > % time ./grep-3.3 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo > > real 0m0.054s > user 0m0.048s > sys 0m0.012s > > % time ./grep-3.4 -Fwif <(seq 30000 | tr 0-9 A-J) <<<foo > ./grep-3.4: Memory exhausted > Aborted > > real 0m1.291s > user 0m0.696s > sys 0m0.599s > > % time ./grep-3.6 -Fwif <(seq 300000 | tr 0-9 A-J) <<<foo > > real 0m13.162s > user 0m12.955s > sys 0m0.211s > > grep-3.3 behaves well, even with much larger number of patterns. > Time seems to grow linearly, and memory usage is constant. > > grep-3.4 behaves the worst of these 3 versions. Even with just 30000 > patterns it exceeds the ulimit of 5 GB. > > grep-3.6 behaves a bit better than 3.4, but still bad. Time seems to > be quadratic in the number of patterns, and though memory usage in > this case seems to be almost constant, in my actual use case it also > runs out of memory where grep-3.3 works well with just a few 100 MB > used. > > Without "-i", grep-3.4 seems to run as fast as grep-3.3, but > grep-3.6 is almost as slow as with "-i". > > So there might actually be two different issues here, one that > affects 3.4 with "-i" and one that affects 3.6 with or without "-i".
Thank you for the fine bug report. The grep-3.6 bug you've exposed is due to the fact that your input triggers excessive hash collisions when using the code modeled after gnulib/lib/hash-pjw.c. That made the new pattern-preprocessing phase take O(N^2) time for N patterns. In the attached, I've switched grep to use the djb2 hash function, and that resolves the problem. I'll also add a NEWS entry and a test before pushing this.
0001-grep-avoid-performance-regression-with-many-patterns.patch
Description: Binary data