On Sun, Jul 19, 2015 at 12:42 AM, Norihiro Tanaka <nori...@kcn.ne.jp> wrote:
> On Sat, 18 Jul 2015 22:15:33 -0700
> Jim Meyering <j...@meyering.net> wrote:
>
>> Hello,
>> Thank you for the patches in this report:
>>
>> http://bugs.gnu.org/19306
>>
>> Please excuse my delay in getting back to you on this.
>> Would you revise each of those to include a test case
>> that demonstrates the problem/fix?
>
> Thanks for your reviewing of this report.
>
> This is not bug fix. It avoids that BACKREF is found in the process of
> DFAEXEC and passed to regex in multibyte locale. In other words, if a
> pattern includes BACKREF, grep does not try to use DFA from the
> beginning.
>
> I confirmed about 10% speed-up for a test case in attachment.
>
> Before patching: real 7.29 user 7.26 sys 0.02
> After patching : real 6.57 user 6.55 sys 0.01
>
> KWset and DFA superset succeeds for all rows in the test case, and DFA
> for multibyte succeeds, too. However, all rows are rejected in regex.
>
> After patching, grep does not try DFA for multibyte, as pattern includes
> BACKREF.
>
> In addtion, I believe that DFA is simplified by removal of handling for
> BACKREF from dfaanalyze(), dfassbuild() and dfaexec().

Thank you for the additional information and the test script.

I like most of this patch, but not the fact that it causes the
word-delim-multibyte test to fail. I do see that also applying your
following patch makes that test pass once again. However, it does so
by forcing a new class of regexps (any that use \b, \< or \>) out of
the DFA and into the slower regex matcher. That feels like too large
a performance penalty in general. Can you quantify it?
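
One way to put a number on that penalty is to time the unpatched and
patched binaries on patterns that use \b, \< and \> in a multibyte
locale. The following is only a minimal sketch and not part of the
patch series: the helper names (best_time, compare), the two binary
paths passed on the command line, the synthetic input, and the
en_US.UTF-8 locale are all assumptions made for illustration.

```python
import os
import subprocess
import sys
import time


def best_time(argv, data, env=None, repeats=5):
    """Return the fastest wall-clock time (seconds) over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # grep exits with status 1 when nothing matches, so the exit
        # status is deliberately ignored (no check=True).
        subprocess.run(argv, input=data, env=env,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def compare(grep_old, grep_new, pattern, data, locale="en_US.UTF-8"):
    """Time both binaries on the same pattern; return (old, new, new/old)."""
    # Assumption: `locale` names a multibyte locale installed on this host.
    env = dict(os.environ, LC_ALL=locale)
    old = best_time([grep_old, "-c", pattern], data, env)
    new = best_time([grep_new, "-c", pattern], data, env)
    return old, new, new / old


# Hypothetical usage: python3 compare.py ./grep-before ./grep-after
if __name__ == "__main__" and len(sys.argv) == 3:
    grep_old, grep_new = sys.argv[1], sys.argv[2]
    # Synthetic input; a real measurement should use representative text.
    data = ("foo bar baz qux\n" * 200_000).encode("utf-8")
    for pattern in (r"\bqux\b", r"\<qux", r"qux\>"):
        old, new, ratio = compare(grep_old, grep_new, pattern, data)
        print(f"{pattern!r}: before {old:.3f}s after {new:.3f}s "
              f"ratio {ratio:.2f}")
```

A ratio well above 1.0 for these patterns would indicate that moving
them from the DFA to the regex matcher carries a measurable cost. Taking
the best of several runs reduces noise from other system activity.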