On 2016-12-12 Paul Eggert wrote:
> grep should just work. It is
> not working now and we need to fix that
By that rationale, the commit that causes the problems for me should be
backed out as a first step, and then a proper solution should be found.
If you look at the commit in isolation, it:

  - at best speeds up some use cases ("case A") by 20%
  - at worst slows down some use cases ("case B") by 10000%+

This all should be approached from the standpoint that the pre-commit
state is correct and the post-commit state is buggy, and then we should
try to find the proper solution that regains the speedup in case A
without killing case B. All I'm talking about here is perspective... the
end result may be the same.

> Instead, we should fix things so that grep -F and plain grep are both
> fast when given equivalent patterns.

Agreed, but it would appear that no amount of optimization of the dfa
algorithm (minutes/hours) is going to get us anywhere near the speed of
the old -F algorithm (2s) in my use case.

Then there's this...

On 2016-12-12 Norihiro Tanaka wrote:
> memory in calculation of `D->follow' in dfaanalyze(). In my machine,
> dfa uses about 500MB, and regex uses about 1GB.
>
> $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null

Even if a patch solves the time issue, perhaps the memory issue must be
considered as well, as Norihiro suggests. On my box, the above gives me
these RSS figures (KB):

  pre-commit:   141248
  post-commit: 1410152

An order of magnitude more memory, which is ironic since pre-commit also
runs faster :-)

I would assume that if someone's -f file size is even bigger than mine
(or their box has a small amount of RAM), the memory requirements could
be more than their box can handle (as per RHBZ 1338497). My point being:
wouldn't any dfa solution require a much larger amount of RAM? Any
heuristic switching would have to take both time and RAM into account,
making the choice even harder, and perhaps necessitating something like
--optimize-for-ram and --optimize-for-time switches.

On 2016-12-11 Bruno Haible wrote:
> It is wrong to put the burden of the algorithm choice on the user.

OK, you convinced me. In my mind I was thinking of -F as more of an
algorithm choice than a "does the pattern contain regex metacharacters"
choice. That is clearly wrong.

> There are at least two approaches.
>
> 1) If you have vastly different run times (2 sec vs. 10 min or so), [...]
> 2) You can run a set of benchmarks, indexed by 2 parameters (m,n)

I love both of your ideas. Though #1 is for sure "kludgy", it would
solve the problem for my use case. Perhaps #1 could be invoked only when
a heuristic applies, such as the -f file being larger than X bytes or
words.

On 2016-12-12 Norihiro Tanaka wrote:
> We can not know N before searching. If input is a regular file, we
> can examine it with stat(), but it may be STDIN.

It seems to me that N is vastly less important here than M in terms of
runtime. I could be wrong, but perhaps just looking at M is good enough.
If not, then you are correct: a heuristic solution would be difficult to
determine.

I'm going to run some test cases and benchmarks to see exactly how M
vs. N impacts runtime, and whether some heuristics are viable (a rough
sketch of the script I have in mind is below).

Thanks everyone for all your great ideas and thoughtful discussion!
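
For what it's worth, here is the rough shape of the benchmark script I
have in mind. It is only a sketch and untested; it assumes GNU coreutils
(shuf) and GNU time installed as /usr/bin/time, and the pattern counts
and file names are arbitrary placeholders, not numbers from this thread.
It only exercises pattern compilation (searching /dev/null, as in
Norihiro's example), but the same loop could be pointed at a real input
file to vary N as well:

    #!/bin/sh
    # Sketch: vary M (number of fixed-string patterns) while keeping N
    # (the input) constant, recording wall time and peak RSS for both
    # the -F and plain-grep code paths.
    WORDS=/usr/share/dict/words
    INPUT=/dev/null          # swap in a real file to vary N as well

    for m in 1000 10000 100000; do
        shuf -n "$m" "$WORDS" > patterns.$m
        for opts in "-F -w" "-w"; do
            echo "== M=$m  grep $opts =="
            # GNU time -v reports elapsed time and max RSS on stderr;
            # discard grep's own output, keep only the two stats lines.
            /usr/bin/time -v env LC_ALL=C grep $opts -f patterns.$m "$INPUT" \
                2>&1 >/dev/null | grep -E 'Elapsed|Maximum resident'
        done
    done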