On Wed, 18 Dec 2019 18:55:01 -0800 Jim Meyering <j...@meyering.net> wrote:
> On Tue, Nov 26, 2019 at 2:38 PM Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > > On Sun, 13 Jan 2019 08:45:47 +0900 > > Norihiro Tanaka <nori...@kcn.ne.jp> wrote: > > > grep uses KWset matcher for multiple word matching. It is very slow when > > > most of the parts matched to a pattern are not words. So, if a part > > > firstly > > > matched to pattern is not a word, use the grep matcher to match for its > > > line. > > > > > > By the way, if START_PTR is set, grep matcher uses regex matcher which is > > > very slow to match words. Therefore, we use grep matcher when only > > > START_PTR > > > is not set. > > > > > > Example, although it is a very extreme case... > > > > > > $ cat >pat <<EOF > > > 0 > > > 00 0 > > > 00 00 0 > > > 00 00 00 0 > > > 00 00 00 00 0 > > > 00 00 00 00 00 0 > > > 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 00 00 00 00 00 0 > > > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > > > EOF > > > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp > > > > > > $ env LC_ALL=C time -p src/grep -wf pat inp > > > real 5.75 > > > user 5.67 > > > sys 0.02 > > > > > > Retry after applied the patch. > > > > > > $ env LC_ALL=C time -p src/grep -wf pat inp > > > real 0.32 > > > user 0.31 > > > sys 0.00 > > > > > > Thanks, > > > Norihiro > > > > I fix previous patch. > > > > This change should not be applied for multibyte locales, as grep matcher > > uses regex with pattern with invert charclass in word matching in > > multibyte locales and it is very slow. > > Good timing, in both senses :-) Thank you. > This confirms a ~15x speedup for that test case: > > $ t="$(yes 00 |head -14|paste -sd' ') 0"; while :; do echo $t; test > "$t" = 0 && break; t=${t#00 }; done > pat;cat pat > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 00 0 > 00 00 00 00 00 00 0 > 00 00 00 00 00 0 > 00 00 00 00 0 > 00 00 00 0 > 00 00 0 > 00 0 > 0 > $ env LC_ALL=C time -p grep -wf pat inp > ... > before: real 4.75 > after: real 0.30 > > Similarly, retrying my counterexample, I now see a significant > improvement (needed a larger input file to demonstrate): > > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -5000000 > inp > > Best of ten, running this, before vs after: > $ env LC_ALL=C time -p grep -w 000 inp > 0.62s vs 0.43s > > I have revised the wording in the commit message. Please ACK that. The > only other change was in formatting, to split a line that went one > byte past the 80-column maximum. > > l'd like to push the attached some time tomorrow. Thanks, I agree to your adjuestment for commit message.