bug#22357: grep -f not only huge memory usage, but also huge time cost

Norihiro Tanaka Mon, 12 Dec 2016 06:53:58 -0800

On Sun, 11 Dec 2016 18:55:32 +0100
Bruno Haible <br...@clisp.org> wrote:


> Norihiro Tanaka wrote:
> > dfa matcher is not always slower than kws matcher. ...
> > It's a trade-off.  Can you have any idea to select the better
> > matcher for both two cases?
> 
> There are at least two approaches.
> 
> 1) If you have vastly different run times (2 sec vs. 10 min or so), the
> program can set up two threads and run each algorithm in one thread. Buffer
> the output. When a thread terminates, use its output and kill the other 
> thread.
> 
> Now that is still suboptimal, because on average this approach will lose a
> factor of 2, just because it does not know which is the best algorithm.
> 
> 2) You can run a set of benchmarks, indexed by 2 parameters (m,n)
>     m = number of keywords,
>     n = total number of bytes of the files to be searched,
> run each of the two algorithms, and note the best algorithm. This way
> you fill a matrix of results (in the (m,n) plane). Now find a formula
> that roughly tells you for given (m,n) whether you are in one area of
> the plane (where kwset is better) or in the other area of the plane
> (where dfa is better). Finally, code this formula into the 'grep' program.

We can not know N before searching.  If input is a regular file, we can
examine it with stat(), but it may be STDIN.

BTW, following test case uses a lot of memory with grep matcher
specially, as grep compiles both dfa and regex.  dfa uses a lot of memory
in calculation of `D->follow' in dfaanalyze().  In my machine, dfa uses
about 500MB, and regex uses about 1GB.

$ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null

bug#22357: grep -f not only huge memory usage, but also huge time cost

Reply via email to