On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > Hi, > Performace for as following case is fixed in bug#43040. > > $ yes 0 | head -100000 | sed '$s/././' >pat > $ grep -vf pat /dev/null > > However, still slow and a lot of memory wasted for the following cases. > > $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words > > This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7. > Though it's an optimization for patterns with backreferences, it seems > to cause performance degradation in many cases due to regex > implementation issues. > > grep needs regex engine when patterns is not supported by DFA engine, > and when either given only matching (-o) or color option (--color) is > given. > > In other words, if none of them are met, grep only uses regex to check > the syntax. grep avoids compilation of regex not to check syntax by this > patch.
Yikes. Thank you! That exposes (and fixes in this common case) a problem that makes grep require memory that is quadratic in the number of regular expressions. To illustrate, I ran some timings. With only 80,000 lines of /usr/share/dict/linux.words, the following would use 100GB of RSS and take 3 minutes. With the fix, it used less than 400MB and took less than one second. head -$N /usr/share/dict/linux.words > w; grep -vf w w N Mem(k): Old New 20000 6341188 (2.4s) 103168 40000 25241288 (9.29s) 199188 (0.31s) 80000 100547432 (180s) 392872 (0.66s) I've just pushed the gnulib-adjusting patch and will push the other soon. I'll also add a test and a NEWS item in a separate patch.