This patch improves performance for input strings that don't match even the first part of a pattern. It has little effect on grep, which uses a superset of DFA, but gawk becomes about 40% faster.

$ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k

(before)
real 2.85
user 2.79
sys 0.05

(after)
real 1.70
user 1.64
sys 0.06

I think that this improvement should have been performed in bug#17576.

From 2cf24a4e084c873f7ae3f184251b8dca1a55e851 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Mon, 20 Oct 2014 23:20:15 +0900
Subject: [PATCH] dfa: improvement for checking of multibyte character boundary

When a newline is found, we can skip the check of a multibyte character
boundary before that character, as we assume a newline is a single-byte
character.  The improvement speeds up matching by about 40% for input
strings which don't match even the first part of a pattern.

* src/dfa.c (skip_remains_mb): If an input character is a newline, skip
checking for a multibyte character boundary up to there.
---
 src/dfa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/dfa.c b/src/dfa.c
index 58a4b83..b9f065f 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -3252,6 +3252,8 @@ skip_remains_mb (struct dfa *d, unsigned char const *p,
                  unsigned char const *mbp, char const *end)
 {
   wint_t wc;
+  if (*p == eolbyte)
+    return p;
   while (mbp < p)
     mbp += mbs_to_wchar (&wc, (char const *) mbp,
                          end - (char const *) mbp, d);
-- 
2.1.1
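
For readers without the dfa.c source at hand, here is a minimal standalone sketch of the idea in the patch. It is not the dfa.c code: toy_mblen is a hypothetical stand-in for mbs_to_wchar that uses a simplified EUC-like rule (any byte >= 0x80 starts a 2-byte character), eolbyte is passed as a parameter rather than read from a global, the final "return mbp" is assumed from the loop's purpose, and the steps counter exists only to show how much scanning is avoided.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for dfa.c's mbs_to_wchar: return the byte length
   of the character starting at S.  Toy EUC-like rule, an assumption made
   for this sketch only: a byte >= 0x80 starts a 2-byte character.  */
static size_t
toy_mblen (unsigned char const *s, unsigned char const *end)
{
  if (*s >= 0x80 && end - s >= 2)
    return 2;
  return 1;
}

/* Return the first character boundary at or after P, given that MBP is a
   known boundary at or before P.  STEPS counts decoded characters and is
   not part of the real function.  */
static unsigned char const *
skip_remains_mb_sketch (unsigned char const *p, unsigned char const *mbp,
                        unsigned char const *end, unsigned char eolbyte,
                        size_t *steps)
{
  assert (p < end);             /* P must be dereferenceable.  */

  /* The shortcut added by the patch: a newline is assumed to be a
     single-byte character, so P is already on a boundary.  */
  if (*p == eolbyte)
    return p;

  /* Otherwise walk forward one character at a time from MBP.  */
  while (mbp < p)
    {
      mbp += toy_mblen (mbp, end);
      ++*steps;
    }
  return mbp;
}
```

With the shortcut, a call whose p points at a newline returns immediately instead of decoding every character between mbp and p. That is the case the patch targets when a line never matches even the first part of the pattern.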