bug#31074: Grep -i is slow

Geoff Kuenning Thu, 05 Apr 2018 22:33:49 -0700

The -i switch is slow when searching large files.  I haven't dug into
the code in detail, although it seems that dfa.c is trying to build an
intelligent case-agnostic DFA when -i is specified.  But that doesn't
seem to be working.  Perhaps that's because I'm running the UTF-8
character set?  Although I don't see why that would affect the DFA.


Here's an example of timing several greps of a 151M file named "rawindex",
which has already been read so that it is in the file system buffer cache.
In each case the grep finds a single match, since the matched line is
actually all lowercase; for privacy, I have omitted the match lines
themselves.

A straightforward match takes only 199 ms even with two .* patterns.
Adding -i blows that up to 6917 ms.  Finally, when I write an explicit
case-agnostic pattern to force how the DFA is built, it does run slower
(532 ms), but it's nowhere near the -i time.

mallet:514> time grep outgoing.*harris.*dcraw rawindex 

real    0m0.199s
user    0m0.170s
sys     0m0.029s

mallet:515> time grep -i outgoing.*harris.*dcraw rawindex 

real    0m6.917s
user    0m6.879s
sys     0m0.036s

mallet:516> time grep '[Oo][Uu][Tt][Gg][Oo][Ii][Nn][Gg].*[Hh][Aa][Rr][Rr][Ii][Ss].*[Dd][Cc][Rr][Aa][Ww]' rawindex

real    0m0.532s
user    0m0.491s
sys     0m0.040s
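
For anyone wanting to repeat the comparison with other patterns, here is a
minimal sketch of how the explicit bracket pattern in the third command could
be generated from the lowercase one. The helper name expand_case and the
sample input line are hypothetical, not part of the original report, and the
helper only handles characters that awk's toupper() changes (in practice,
lowercase letters); it does not understand escapes or existing bracket
expressions.

```shell
#!/bin/sh
# expand_case (hypothetical helper): rewrite each lowercase letter c in a
# pattern as the bracket expression [Cc]; all other characters pass through.
expand_case() {
    printf '%s\n' "$1" | awk '{
        out = ""
        for (i = 1; i <= length($0); i++) {
            c = substr($0, i, 1)
            if (toupper(c) != c)
                out = out "[" toupper(c) c "]"   # letter with an uppercase form
            else
                out = out c                      # metacharacters, digits, etc.
        }
        print out
    }'
}

pattern=$(expand_case 'outgoing.*harris.*dcraw')
printf '%s\n' "$pattern"

# Sanity check on a made-up line (not taken from rawindex): the expanded
# pattern should match regardless of case, without using -i.
printf 'Outgoing/Harris/DCRAW.jpg\n' | grep -c "$pattern"
```

With the function defined, the third timing could be reproduced as
time grep "$(expand_case 'outgoing.*harris.*dcraw')" rawindex. Running the
-i case with LC_ALL=C set in the environment would be one way to check whether
the UTF-8 locale is involved; that is a suggestion, not something measured
in this report.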
-- 
    Geoff Kuenning   ge...@cs.hmc.edu   http://www.cs.hmc.edu/~geoff/

The DMCA criminalizes curiosity.  It would put Susie in jail for
taking her stereo apart to see how it works.
