On Sun, Sep 20, 2020 at 6:34 PM Jim Meyering <j...@meyering.net> wrote: > > On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > > Hi, > > Performace for as following case is fixed in bug#43040. > > > > $ yes 0 | head -100000 | sed '$s/././' >pat > > $ grep -vf pat /dev/null > > > > However, still slow and a lot of memory wasted for the following cases. > > > > $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words > > > > This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7. > > Though it's an optimization for patterns with backreferences, it seems > > to cause performance degradation in many cases due to regex > > implementation issues. > > > > grep needs regex engine when patterns is not supported by DFA engine, > > and when either given only matching (-o) or color option (--color) is > > given. > > > > In other words, if none of them are met, grep only uses regex to check > > the syntax. grep avoids compilation of regex not to check syntax by this > > patch. > > Yikes. Thank you! > That exposes (and fixes in this common case) a problem that makes grep > require memory that is quadratic in the number of regular expressions. > > To illustrate, I ran some timings. > With only 80,000 lines of /usr/share/dict/linux.words, the following > would use 100GB of RSS and take 3 minutes. With the fix, it used less > than 400MB and took less than one second. > > head -$N /usr/share/dict/linux.words > w; grep -vf w w > > N Mem(k): Old New > 20000 6341188 (2.4s) 103168 > 40000 25241288 (9.29s) 199188 (0.31s) > 80000 100547432 (180s) 392872 (0.66s) > > I've just pushed the gnulib-adjusting patch and will push the other soon. > I'll also add a test and a NEWS item in a separate patch.
Here are the two patches (tested on top of a third that updates to latest gnulib). I'll await an 'ok' from Norihiro Tanaka before pushing, since commit-log metadata is essentially immutable once pushed.
0002-tests-test-for-many-regexp-N-2-RSS-regression.patch
Description: Binary data
0001-grep-avoid-unnecessary-regex-compilation.patch
Description: Binary data