On Mon, 21 Sep 2020 17:33:25 -0700 Jim Meyering <j...@meyering.net> wrote:
> On Sun, Sep 20, 2020 at 6:34 PM Jim Meyering <j...@meyering.net> wrote: > > > > On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > > > Hi, > > > Performace for as following case is fixed in bug#43040. > > > > > > $ yes 0 | head -100000 | sed '$s/././' >pat > > > $ grep -vf pat /dev/null > > > > > > However, still slow and a lot of memory wasted for the following cases. > > > > > > $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words > > > > > > This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7. > > > Though it's an optimization for patterns with backreferences, it seems > > > to cause performance degradation in many cases due to regex > > > implementation issues. > > > > > > grep needs regex engine when patterns is not supported by DFA engine, > > > and when either given only matching (-o) or color option (--color) is > > > given. > > > > > > In other words, if none of them are met, grep only uses regex to check > > > the syntax. grep avoids compilation of regex not to check syntax by this > > > patch. > > > > Yikes. Thank you! > > That exposes (and fixes in this common case) a problem that makes grep > > require memory that is quadratic in the number of regular expressions. > > > > To illustrate, I ran some timings. > > With only 80,000 lines of /usr/share/dict/linux.words, the following > > would use 100GB of RSS and take 3 minutes. With the fix, it used less > > than 400MB and took less than one second. > > > > head -$N /usr/share/dict/linux.words > w; grep -vf w w > > > > N Mem(k): Old New > > 20000 6341188 (2.4s) 103168 > > 40000 25241288 (9.29s) 199188 (0.31s) > > 80000 100547432 (180s) 392872 (0.66s) > > > > I've just pushed the gnulib-adjusting patch and will push the other soon. > > I'll also add a test and a NEWS item in a separate patch. > > Here are the two patches (tested on top of a third that updates to > latest gnulib). I'll await an 'ok' from Norihiro Tanaka before > pushing, since commit-log metadata is essentially immutable once > pushed. Great, thank you. I confirmed it.