Apologies in advance if this is more of a "discuss" question, but it looks like a particular use-case shows a marked change in performance between recent versions of grep.
A colleague mentioned a performance issue with grep to me, and it's puzzling me a bit. It turns out that he was using "grep -Fvif" to find lines in one file that are not present in another. Up until grep 2.18 this appears to run in roughly linear time, taking less than 50ms to compare files of up to about 20,000 lines. With grep 2.19 and later, even relatively small files are quite slow, and runtime (and memory use) grow far faster than linearly (e.g. 300ms to compare 200 lines, 1.5s to compare 400 lines, 5s to compare 600 lines).

I've shown my colleague how to use sort and diff (and "comm", which I think is vastly underrated; there's a sketch at the end of this message), but it made me wonder whether this is a reasonable thing to expect grep to be able to do, and whether such a performance drop should be seen as a bug.

The way he was using it, he had two (unsorted) data sets of about 6000 rows each, with most lines being common, and he was just running:

  grep -Fvif FILE1 FILE2

In his case the older version of grep took well under a second to run, but after he upgraded his machine it ran for 20 minutes before exhausting swap and segfaulting.

To compare performance between versions, I've found the following test useful (vary N to try different data-file sizes):

  N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F

Steve.
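P.S. For completeness, a minimal sketch of the sort/comm alternative I showed him. FILE1 and FILE2 stand for his two data files; this assumes whole-line, case-sensitive comparison is acceptable (unlike -Fi, comm does no substring or case-insensitive matching), and the <(...) process substitution needs bash or similar:

  # lines of FILE2 that do not appear anywhere in FILE1
  comm -13 <(sort FILE1) <(sort FILE2)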
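And a rough way to watch the runtime scale is just to loop the test above over a few sizes (mktemp stands in for the /tmp/zz.$$ temporary file; the sizes chosen are arbitrary):

  for N in 200 400 600 800; do
      F=$(mktemp)
      seq -f '%g bottles of beer on the wall' 1 $N > $F
      echo "N=$N"
      time grep -Fvif $F $F
      rm $F
  done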