On Sun, Feb 14, 2016 at 12:02 PM, Ulya Fokanova <skvad...@gmail.com> wrote:
> I've explored the following case:
>
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z '^[1-4]*$' | wc -c
>    6
>
> It's a bug (there should be no match).
>
> This is what grep does:
>
>  * tries to build DFA (as in dfa.c)
>  * fails to expand character range [1-4] because of multibyte
>    locale en_US.utf-8 and gives up building DFA (marks [1-4] as BACKREF
>    that suppresses all dfa.c-related code), note the difference with
>    [1234] case in which there's no need to expand multibyte range
>  * falls back to Regex (gnulib extension of regex.h)
>  * Regex doesn't support '-z' semantics (the closest configuration to
>    '-z' is RE_NEWLINE_ALT, which is already included in RE_SYNTAX_GREP
>    set), so '\n' is treated as newline and match erroneously succeeds
>
> I think this should be worked around in grep: before calling 're_search' it
> should split the input string by 'eolbyte'.
>
> The bug is also present with the PCRE engine:
>
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1234]*$' | wc -c
>    6
>    $ printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -P '^[1-4]*$' | wc -c
>    6
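
The expected behaviour from the report can be checked with a small shell
sketch. It assumes a grep that supports -z, and z_match_bytes is a
hypothetical helper name used only for illustration. The en_US.utf-8 locale
is requested as in the report; if it is not installed, grep runs in the C
locale instead, so the multibyte code path is not exercised.

```shell
#!/bin/sh
# Sketch: count the bytes that grep -z prints for one NUL-terminated record.
# z_match_bytes is a hypothetical helper, not part of grep or its test suite.
z_match_bytes () {
  # $1 is the pattern. The input is the single record "12\n34" plus its
  # NUL terminator, six bytes in total.
  printf '12\n34\0' | LC_ALL=en_US.utf-8 grep -z -- "$1" | wc -c | tr -d ' '
}

# Both anchors: with -z the newline is ordinary data, so ^ and $ should only
# match at the record boundaries and no output is expected.
z_match_bytes '^[1-4]*$'

# Leading anchor only: the record starts with a digit in the range, so the
# whole record (terminator included) should be printed.
z_match_bytes '^[1-4]'
```

With the bug described above, the first call reports 6 bytes instead of 0,
because the embedded newline is treated as a line boundary.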

Thank you for the analysis and the report.
I have fixed the regex-oriented problem with the attached
patch, but not yet the case using -P -z (PCRE + --null-data):
From 3ce8b39e3137d3cdcf8cec84dc89788037e76742 Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyer...@fb.com>
Date: Sat, 20 Feb 2016 12:50:27 -0800
Subject: [PATCH] grep -z: avoid erroneous match with regexp anchor and \n in
 text

* src/dfasearch.c (EGexecute): Clear the newline_anchor bit when
eolbyte is not '\n'.
* tests/z-anchor-newline: New file.
* tests/Makefile.am (TESTS): Add it.
* NEWS (Bug fixes): Describe it.
Originally reported by Ulrich Mueller in
https://bugs.gentoo.org/show_bug.cgi?id=574662
Reported to us by Sergei Trofimovich as http://debbugs.gnu.org/22655
---
 NEWS                   | 13 +++++++++++++
 src/dfasearch.c        |  1 +
 tests/Makefile.am      |  3 ++-
 tests/z-anchor-newline | 43 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 59 insertions(+), 1 deletion(-)
 create mode 100755 tests/z-anchor-newline

diff --git a/NEWS b/NEWS
index feca5c5..ae238be 100644
--- a/NEWS
+++ b/NEWS
@@ -2,6 +2,19 @@ GNU grep NEWS                                    -*- outline -*-

 * Noteworthy changes in release ?.? (????-??-??) [?]

+** Bug fixes
+
+  grep -z would match strings it should not.  To trigger the bug, you'd
+  have to use a regular expression including an anchor (^ or $) and a
+  feature like a range or a backreference, causing grep to forego its DFA
+  matcher and resort to using re_search.  With a multibyte locale, that
+  matcher could mistakenly match a string containing a newline.
+  For example, this command:
+    printf 'a\nb\0' | LC_ALL=en_US.utf-8 grep -z '^[a-b]*b'
+  would mistakenly match and print all four input bytes.  After the fix,
+  there is no match, as expected.
+  [bug introduced in grep-2.7]
+

 * Noteworthy changes in release 2.23 (2016-02-04) [stable]

diff --git a/src/dfasearch.c b/src/dfasearch.c
index e04a2df..d348d44 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -342,6 +342,7 @@ EGexecute (char *buf, size_t size, size_t *match_size,
       for (i = 0; i < pcount; i++)
         {
           patterns[i].regexbuf.not_eol = 0;
+          patterns[i].regexbuf.newline_anchor = eolbyte == '\n';
           start = re_search (&(patterns[i].regexbuf),
                              beg, end - beg - 1,
                              ptr - beg, end - ptr - 1,
diff --git a/tests/Makefile.am b/tests/Makefile.am
index a38303c..5a2c0f0 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -141,7 +141,8 @@ TESTS =						\
   word-delim-multibyte				\
   word-multi-file				\
   word-multibyte				\
-  yesno
+  yesno						\
+  z-anchor-newline

 EXTRA_DIST =					\
   $(TESTS)					\
diff --git a/tests/z-anchor-newline b/tests/z-anchor-newline
new file mode 100755
index 0000000..b4dfebc
--- /dev/null
+++ b/tests/z-anchor-newline
@@ -0,0 +1,43 @@
+#!/bin/sh
+# grep -z with an anchor in the regex could mistakenly match text
+# including a newline.
+
+# Copyright 2016 Free Software Foundation, Inc.
+
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+. "${srcdir=.}/init.sh"; path_prepend_ ../src
+
+require_en_utf8_locale_
+require_compiled_in_MB_support
+LC_ALL=en_US.UTF-8
+
+printf 'a\nb\0' > in || framework_failure_
+
+fail=0
+
+env > /t/x
+# These three would all mistakenly match, because the [a-b] range
+# forced the non-DFA (regexp-using) code path.
+returns_ 1 grep -z '^[a-b]*$' in || fail=1
+returns_ 1 grep -z 'a[a-b]*$' in || fail=1
+returns_ 1 grep -z '^[a-b]*b' in || fail=1
+
+# Test these for good measure; they exercise the DFA code path
+# and always worked
+returns_ 1 grep -z '^[ab]*$' in || fail=1
+returns_ 1 grep -z 'a[ab]*$' in || fail=1
+returns_ 1 grep -z '^[ab]*b' in || fail=1
+
+Exit $fail
-- 
2.6.4
