bug#19173: [PATCH] dfa: speed-up for long pattern

Jim Meyering Sat, 18 Jul 2015 11:45:38 -0700

On Sat, Jul 4, 2015 at 8:40 PM, Jim Meyering <j...@meyering.net> wrote:
> On Mon, Nov 24, 2014 at 5:15 PM, Norihiro Tanaka <nori...@kcn.ne.jp> wrote:
...
> Thank you for that patch.
> I have rebased it and made some small improvements:
> I combined an if+do loop into a single for-loop and moved
> some declarations "down". I constructed a reproducer that
> does not require two large inputs to demonstrate the
> performance improvement.
>
> I've also begun to reword the commit log, but am out of time
> for this evening, so will post this here, for now.
>
> I am still trying to convince myself that this is a strict
> improvement, i.e., that the O(N^2) strstr calls avoided
> by this change served no purpose.


I have pushed that.
I have also pushed a test (attached) for this performance fix in
a separate commit. As with most performance-measuring tests,
it may be fragile, especially when many tests are run in parallel,
but it passes for me on a few different systems.

From d940cf045c4e777d37b19ee4639623f2c4a7e062 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyer...@fb.com>
Date: Sun, 5 Jul 2015 11:43:47 -0700
Subject: [PATCH] tests: add a test for the performance fix

* tests/long-pattern-perf: New file.
* tests/Makefile.am (TESTS): Add it.
---
 tests/Makefile.am       |  1 +
 tests/long-pattern-perf | 50 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 51 insertions(+)
 create mode 100755 tests/long-pattern-perf

diff --git a/tests/Makefile.am b/tests/Makefile.am
index 629d322..c95d5a9 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -80,6 +80,7 @@ TESTS =						\
   khadafy					\
   kwset-abuse					\
   long-line-vs-2GiB-read			\
+  long-pattern-perf				\
   match-lines					\
   max-count-overread				\
   max-count-vs-context				\
diff --git a/tests/long-pattern-perf b/tests/long-pattern-perf
new file mode 100755
index 0000000..cba6553
--- /dev/null
+++ b/tests/long-pattern-perf
@@ -0,0 +1,50 @@
+#!/bin/sh
+# grep-2.21 would incur a 100x penalty for 10x increase in regexp length
+
+# Copyright 2015 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+require_timeout_
+require_hi_res_time_
+
+fail=0
+
+echo x > in || framework_failure_
+# We could use seq -s '' (avoiding the tr filter), but I
+# suspect some version of seq does not honor that option.
+# Note that we want 10x the byte count (not line count) in the larger file.
+seq 5000  | tr -d '\012' > re     || framework_failure_
+seq 40000 | tr -d '\012' > re-10x || framework_failure_
+
+start=$(hi_res_time_)
+grep -f re in; st=$?
+stop=$(hi_res_time_)
+test $st = 1 || fail=1
+
+# Increasing the length of the regular expression by a factor
+# of 10 should cause no more than a 10x increase in duration.
+# However, we'll draw the line at 30x to avoid false-positives.
+# Use an integer; some 'timeout' implementations have trouble with
+# floating-point.
+n_sec=$(
+  $AWK 'BEGIN { print 1 + int (30 * ('$stop' - '$start'))}' < /dev/null
+)
+
+# Expect no match, i.e., exit status of 1.  Anything else is an error.
+timeout $n_sec grep -f re-10x in; st=$?
+test $st = 1 || fail=1
+
+Exit $fail
-- 
2.3.7

bug#19173: [PATCH] dfa: speed-up for long pattern

Reply via email to