On Sun, 11 Dec 2016 18:55:32 +0100 Bruno Haible <br...@clisp.org> wrote:
> Norihiro Tanaka wrote: > > dfa matcher is not always slower than kws matcher. ... > > It's a trade-off. Can you have any idea to select the better > > matcher for both two cases? > > There are at least two approaches. > > 1) If you have vastly different run times (2 sec vs. 10 min or so), the > program can set up two threads and run each algorithm in one thread. Buffer > the output. When a thread terminates, use its output and kill the other > thread. > > Now that is still suboptimal, because on average this approach will lose a > factor of 2, just because it does not know which is the best algorithm. > > 2) You can run a set of benchmarks, indexed by 2 parameters (m,n) > m = number of keywords, > n = total number of bytes of the files to be searched, > run each of the two algorithms, and note the best algorithm. This way > you fill a matrix of results (in the (m,n) plane). Now find a formula > that roughly tells you for given (m,n) whether you are in one area of > the plane (where kwset is better) or in the other area of the plane > (where dfa is better). Finally, code this formula into the 'grep' program. We can not know N before searching. If input is a regular file, we can examine it with stat(), but it may be STDIN. BTW, following test case uses a lot of memory with grep matcher specially, as grep compiles both dfa and regex. dfa uses a lot of memory in calculation of `D->follow' in dfaanalyze(). In my machine, dfa uses about 500MB, and regex uses about 1GB. $ env LC_ALL=C grep -F -w -f /usr/share/dict/words /dev/null