On 5/22/22 15:22, Paul Eggert wrote:

We've already uncovered one area, where \a doesn't work as expected and where a warning diagnostic would be helpful.

I installed the attached patch to cause 'grep' to warn about these. Comments welcome. Most of the changes were in Gnulib's dfa module, which see.


Here's another one, where an oddly-placed '*' doesn't work as one would expect:

$ printf '*\na\n*a\n' | grep '\(*\)'
*
*a
$ printf '*\na\n*a\n' | grep -E '(*)'
grep: Unmatched ( or \(
$ printf '*\na\n*a\n' | grep '\(*a\)'
*a
$ printf '*\na\n*a\n' | grep -E '(*a)'
a
*a

I plan to look at this next. We shouldn't warn about BREs like '\(*\)' and '\(*a\)' as these conform to POSIX and work fine. But it makes sense to warn about EREs like '(*)', '(*a)', '(+)', '(+a)', '({1})', '({1}a)' as POSIX does not specify their behavior, their semantics are unpredictable with GNU grep, and it's plausible that people are making mistakes in this area.
From 42db5cc8f58620b4c9c58a91c7683279c50503f9 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Mon, 23 May 2022 12:22:15 -0700
Subject: [PATCH 1/2] build: update gnulib submodule to latest

---
 gnulib | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gnulib b/gnulib
index b19a107..55d1a73 160000
--- a/gnulib
+++ b/gnulib
@@ -1 +1 @@
-Subproject commit b19a10775e54f8ed17e3a8c08a72d261d8c26244
+Subproject commit 55d1a73c1a79e94c443f51798c4c76449a0c7d62
-- 
2.34.1

From e7f8e8eb1fd41b308ee10741bbd8068acc1847c2 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Mon, 23 May 2022 12:38:42 -0700
Subject: [PATCH 2/2] grep: warn about stray backslashes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This papers over a problem reported by Benno Schulenberg and
Tomasz Dziendzielski <https://bugs.gnu.org/39678> involving
regular expressions like \a that have unspecified behavior.
* src/dfasearch.c (dfawarn): Just output a warning.
Don’t exit, as DFA_CONFUSING_BRACKETS_ERROR now
does that for us, and we need the ability to warn
without exiting to diagnose \a etc.
(GEAcompile): Use new dfa options DFA_CONFUSING_BRACKETS_ERROR and
DFA_STRAY_BACKSLASH_WARN.
---
 NEWS            | 6 ++++++
 src/dfasearch.c | 7 ++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index 38ac035..112d85b 100644
--- a/NEWS
+++ b/NEWS
@@ -11,6 +11,12 @@ GNU grep NEWS                                    -*- outline -*-
   release 2.5.3 (2007), now warn that they are obsolescent and should
   be replaced by grep -E and grep -F.
 
+  Regular expressions with stray backslashes now cause warnings, as
+  their unspecified behavior can lead to unexpected results.
+  For example, '\a' and 'a' are not always equivalent
+  <https://bugs.gnu.org/39768>.  The warnings are intended as a
+  transition aid; they are likely to be errors in future releases.
+
   Regular expressions like [:space:] are now errors even if
   POSIXLY_CORRECT is set, since POSIX now allows the GNU behavior.
 
diff --git a/src/dfasearch.c b/src/dfasearch.c
index 8f48296..7547a8a 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -53,10 +53,10 @@ dfaerror (char const *mesg)
   die (EXIT_TROUBLE, 0, "%s", mesg);
 }
 
-_Noreturn void
+void
 dfawarn (char const *mesg)
 {
-  dfaerror (mesg);
+  error (0, 0, _("warning: %s"), mesg);
 }
 
 /* If the DFA turns out to have some set of fixed strings one of
@@ -196,7 +196,8 @@ GEAcompile (char *pattern, idx_t size, reg_syntax_t syntax_bits,
 
   if (match_icase)
     syntax_bits |= RE_ICASE;
-  int dfaopts = eolbyte ? 0 : DFA_EOL_NUL;
+  int dfaopts = (DFA_CONFUSING_BRACKETS_ERROR | DFA_STRAY_BACKSLASH_WARN
+                 | (eolbyte ? 0 : DFA_EOL_NUL));
   dfasyntax (dc->dfa, &localeinfo, syntax_bits, dfaopts);
   bool bs_safe = !localeinfo.multibyte | localeinfo.using_utf8;
 
-- 
2.34.1

Reply via email to