bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary

Norihiro Tanaka Mon, 20 Oct 2014 08:05:57 -0700

This patch improves performance for input strings that do not match
even the first part of a pattern.  It has little effect on grep, which
uses a superset of the DFA, but gawk speeds up by about 40%.


$ time -p env LC_ALL=ja_JP.eucJP ./gawk '/k/ { print }' ../k

(before)
  real 2.85  user 2.79  sys 0.05

(after)
  real 1.70  user 1.64  sys 0.06

I think this improvement should have been included in bug#17576.
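
To illustrate the idea, here is a minimal sketch of the early return,
not the actual dfa.c code.  It assumes a toy encoding in which any byte
with the high bit set starts a two-byte character; toy_char_len,
decode_steps and skip_remains_mb_sketch are hypothetical names that
stand in for mbs_to_wchar and skip_remains_mb.  Because a newline is
assumed to be a single-byte character, a pointer to it is already on a
character boundary and nothing before it needs to be decoded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for mbs_to_wchar: in this toy encoding a byte
   with the high bit set starts a two-byte character, and every other
   byte is a single-byte character.  */
static size_t
toy_char_len (unsigned char const *s, unsigned char const *end)
{
  return (*s >= 0x80 && end - s >= 2) ? 2 : 1;
}

/* Counts how many characters were decoded, to make the saving visible.  */
static unsigned long decode_steps;

/* Return the first character boundary at or after P, given that MBP is
   a known boundary at or before P.  */
static unsigned char const *
skip_remains_mb_sketch (unsigned char const *p, unsigned char const *mbp,
                        unsigned char const *end, unsigned char eolbyte)
{
  assert (mbp <= p && p < end);

  /* Newline is assumed to be a single-byte character, so P is already
     a boundary and the decoding loop can be skipped.  */
  if (*p == eolbyte)
    return p;

  while (mbp < p)
    {
      mbp += toy_char_len (mbp, end);
      decode_steps++;
    }
  return mbp;
}
```

With this sketch, a call where p points at the newline returns at once
and leaves decode_steps unchanged, while a call where p points into the
middle of a two-byte character walks forward from mbp and returns the
start of the following character.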

From 2cf24a4e084c873f7ae3f184251b8dca1a55e851 Mon Sep 17 00:00:00 2001
From: Norihiro Tanaka <nori...@kcn.ne.jp>
Date: Mon, 20 Oct 2014 23:20:15 +0900
Subject: [PATCH] dfa: improvement for checking of multibyte character boundary

When a newline is found, we can skip checking the multibyte character
boundary before it, because newline is assumed to be a single-byte
character.

The improvement gives a speedup of about 40% for input strings that do
not match even the first part of a pattern.

* src/dfa.c (skip_remains_mb): If the input character is a newline, skip
the multibyte character boundary check up to that point.
---
 src/dfa.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/dfa.c b/src/dfa.c
index 58a4b83..b9f065f 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -3252,6 +3252,8 @@ skip_remains_mb (struct dfa *d, unsigned char const *p,
                  unsigned char const *mbp, char const *end)
 {
   wint_t wc;
+  if (*p == eolbyte)
+    return p;
   while (mbp < p)
     mbp += mbs_to_wchar (&wc, (char const *) mbp,
                          end - (char const *) mbp, d);
-- 
2.1.1
