bug#21763: poor performance since grep 2.19 when comparing files with grep

Norihiro Tanaka Mon, 26 Oct 2015 17:08:22 -0700

On Mon, 26 Oct 2015 11:39:54 -0700
Jim Meyering <j...@meyering.net> wrote:


> On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve
> <s.benn...@lancaster.ac.uk> wrote:
> > Apologies in advance if this is more of a "discuss" question, but it looks 
> > like a particular use-case shows a marked change in performance between 
> > recent versions of grep.
> >
> > A colleague mentioned a performance issue with grep to me, and its puzzling 
> > me a bit.
> > It turns out that he was using "grep -Fvif" to find lines in one file that 
> > are not present in another.
> >
> > Up until grep 2.18 this seems to work with linear performance and it takes 
> > less than 50ms to compare files up to about 20,000 lines.
> > With grep 2.19 and later, ever relatively small files are quite slow, 
> > runtime (and memory use) increases exponentially (e.g. 300ms to compare 200 
> > lines, 1.5s to compare 400 lines, 5s to compare 600 lines).
> >
> > I've shown my colleague how to use sort and diff (and "comm", which I think 
> > is vastly underrated), but it made me wonder if this is a reasonable thing 
> > to expect grep to be able to do, and whether such a performance drop should 
> > be seen as a bug.
> >
> > The way he was using it, he had two (unsorted) data sets (about 6000 rows 
> > in each), with most lines being common, and he was just using:
> >     grep -Fvif FILE1 FILE2
> > In his case, the older version of grep took way less than a second to run, 
> > but after he had upgraded his machine it took 20 minutes before running out 
> > of swap and seg faulting.
> >
> > In terms of comparing performance, I've found that the following works to 
> > compare performance (vary N to try different sized data files):
> >     N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; 
> > time grep -Fvif $F $F; rm $F
> 
> Thank you for reporting that.
> Interesting: that progression (time vs. increasing N) is clearly
> quadratic or worse when using a multibyte locale, but is linear with
> LC_ALL=C. I suspect when you run "locale", it reports something like
> en_US.utf8.
> 
> I.e., if you have no need for multi-byte matching, set LC_ALL=C, and
> that idiom will be very quick, even for a million lines:
> 
>   $ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1
> $N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F
>   0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata
> 131536maxresident)k
>   0inputs+0outputs (0major+32587minor)pagefaults 0swaps
> 
> Currently, I am not planning even to investigate this for the imminent 
> release.

In grep 2.19 or later, grep -Fi is re-written to grep -i and corresponding
regular expression in only multibyte-locales.

Now, assume the case of N=2.

--
1 bottles of beer on the wall
2 bottles of beer on the wall
--

"grep -Fif $F" to "grep -e 'bottles of beer on the wall\|2 bottles of -
beer on the wall' $F"

By the way, we may expect to "grep -e '\(1\|2\) bottles of beer on the -
wall'", but grep does not do it.  It will cause slow down.

bug#21763: poor performance since grep 2.19 when comparing files with grep

Reply via email to