On Tue, Nov 26, 2019 at 2:38 PM Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > On Sun, 13 Jan 2019 08:45:47 +0900 > Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > > grep uses KWset matcher for multiple word matching. It is very slow when > > most of the parts matched to a pattern are not words. So, if a part firstly > > matched to pattern is not a word, use the grep matcher to match for its > > line. > > > > By the way, if START_PTR is set, grep matcher uses regex matcher which is > > very slow to match words. Therefore, we use grep matcher when only > > START_PTR > > is not set. > > > > Example, although it is a very extreme case... > > > > $ cat >pat <<EOF > > 0 > > 00 0 > > 00 00 0 > > 00 00 00 0 > > 00 00 00 00 0 > > 00 00 00 00 00 0 > > 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 00 00 0 > > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > > EOF > > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp > > > > $ env LC_ALL=C time -p src/grep -wf pat inp > > real 5.75 > > user 5.67 > > sys 0.02 > > > > Retry after applied the patch. > > > > $ env LC_ALL=C time -p src/grep -wf pat inp > > real 0.32 > > user 0.31 > > sys 0.00 > > > > Thanks, > > Norihiro > > I fix previous patch. > > This change should not be applied for multibyte locales, as grep matcher > uses regex with pattern with invert charclass in word matching in > multibyte locales and it is very slow.
Good timing, in both senses :-) Thank you. This confirms a ~15x speedup for that test case: $ t="$(yes 00 |head -14|paste -sd' ') 0"; while :; do echo $t; test "$t" = 0 && break; t=${t#00 }; done > pat;cat pat 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 00 0 00 00 00 00 00 00 00 0 00 00 00 00 00 00 0 00 00 00 00 00 0 00 00 00 00 0 00 00 00 0 00 00 0 00 0 0 $ env LC_ALL=C time -p grep -wf pat inp ... before: real 4.75 after: real 0.30 Similarly, retrying my counterexample, I now see a significant improvement (needed a larger input file to demonstrate): $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -5000000 > inp Best of ten, running this, before vs after: $ env LC_ALL=C time -p grep -w 000 inp 0.62s vs 0.43s I have revised the wording in the commit message. Please ACK that. The only other change was in formatting, to split a line that went one byte past the 80-column maximum. l'd like to push the attached some time tomorrow.
kws.diff
Description: Binary data