bug#39678: POSIXLY_CORRECT removal, and oddball regex doc

Paul Eggert Sun, 22 May 2022 15:25:46 -0700

On 5/21/22 11:40, Jim Meyering wrote:

In my experience, there are many lurking uses of things like '\a', and
I would like to ease into this gently, so I much prefer your latter
approach: warn now, and change grep's exit status later.


Sounds good.

When I started looking into that, I discovered that the grep manual doesn't cover these lurkers well. And although I installed a patch yesterday about this, after looking at the POSIX spec again today I discovered that I'd missed quite a few lurkers. So I just now installed the attached documentation fix, which attempts to cover all the remaining problem regexps, and to give us room to add warnings for some of them soon.

We shouldn't warn about all these problems, not without a --pedantic flag or something like that (something I'm probably too busy to add). But I expect it'd be good to warn about areas where grep's semantics don't match any reasonable expectation.

We've already uncovered one area, where \a doesn't work as expected and where a warning diagnostic would be helpful. Here's another one, where an oddly-placed '*' doesn't work as one would expect:


$ printf '*\na\n*a\n' | grep '\(*\)'
*
*a
$ printf '*\na\n*a\n' | grep -E '(*)'
grep: Unmatched ( or \(
$ printf '*\na\n*a\n' | grep '\(*a\)'
*a
$ printf '*\na\n*a\n' | grep -E '(*a)'
a
*a

Although not a POSIX violation, here 'grep -E' is "wrong" for any reasonable definition of "wrong" that I can think of. The attached patch changes the doc to say that this regular expression has unspecified behavior (something that POSIX allows).
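
For comparison, here is a minimal sketch of how a script could sidestep the leading '*' entirely, assuming the intent is to match a literal asterisk at the start of a group. Escaping the '*' or putting it in a bracket expression makes it an ordinary character in both basic and extended syntax. The variable names are made up for illustration.

```shell
# Same three input lines as in the session above: "*", "a", "*a".
input=$(printf '*\na\n*a\n')

# An escaped asterisk is an ordinary character in both syntaxes,
# so its position inside the group no longer matters.
bre_out=$(printf '%s\n' "$input" | grep '\(\*a\)')
ere_out=$(printf '%s\n' "$input" | grep -E '(\*a)')

# A bracket expression is another way to spell a literal asterisk.
bracket_out=$(printf '%s\n' "$input" | grep -E '([*]a)')

printf 'BRE: %s\nERE: %s\nbracket: %s\n' "$bre_out" "$ere_out" "$bracket_out"
```

With these spellings, each of the three commands should select only the line '*a'.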


(Who would have thought regular expressions were so complicated? :-)

From a860bd39e384ed6111bc63fe6aabeb7f7120e6d5 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sun, 22 May 2022 14:59:53 -0700
Subject: [PATCH] doc: document regex corner cases better

* doc/grep.texi (Environment Variables)
(Fundamental Structure, Character Classes and Bracket Expressions)
(Special Backslash Expressions, Back-references and Subexpressions)
(Basic vs Extended): Say more precisely what happens with
problematic regular expressions.
(Problematic Expressions): New section.
---
 NEWS          |   5 ++
 doc/grep.texi | 224 +++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 173 insertions(+), 56 deletions(-)

diff --git a/NEWS b/NEWS
index bf2ee50..38ac035 100644
--- a/NEWS
+++ b/NEWS
@@ -26,6 +26,11 @@ GNU grep NEWS                                    -*- outline -*-
   The -s option no longer suppresses "binary file matches" messages.
   [Bug#51860 introduced in grep 3.5]
 
+** Documentation improvements
+
+  The manual now covers unspecified behavior in patterns like \x, (+),
+  and range expressions outside the POSIX locale.
+
 
 * Noteworthy changes in release 3.7 (2021-08-14) [stable]
 
diff --git a/doc/grep.texi b/doc/grep.texi
index a717e32..69b52dc 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -265,8 +265,7 @@ begin and end with word constituents, it differs from surrounding a
 regular expression with @samp{\<} and @samp{\>}.  For example, although
 @samp{grep -w @@} matches a line containing only @samp{@@}, @samp{grep
 '\<@@\>'} cannot match any line because @samp{@@} is not a
-word constituent.  @xref{The Backslash Character and Special
-Expressions}.
+word constituent.  @xref{Special Backslash Expressions}.
 
 @item -x
 @itemx --line-regexp
@@ -830,8 +829,8 @@ is specified by examining the three environment variables
 in that order.
 The first of these variables that is set specifies the locale.
 For example, if @env{LC_ALL} is not set,
-but @env{LC_COLLATE} is set to @samp{pt_BR},
-then the Brazilian Portuguese locale is used
+but @env{LC_COLLATE} is set to @samp{pt_BR.UTF-8},
+then a Brazilian Portuguese locale is used
 for the @env{LC_COLLATE} category.
 As a special case for @env{LC_MESSAGES} only, the environment variable
 @env{LANGUAGE} can contain a colon-separated list of languages that
@@ -1176,10 +1175,11 @@ pages, but work only if PCRE is available in the system.
 @menu
 * Fundamental Structure::
 * Character Classes and Bracket Expressions::
-* The Backslash Character and Special Expressions::
+* Special Backslash Expressions::
 * Anchoring::
 * Back-references and Subexpressions::
 * Basic vs Extended::
+* Problematic Expressions::
 * Character Encoding::
 * Matching Non-ASCII::
 @end menu
@@ -1259,9 +1259,10 @@ the resulting regular expression
 matches any string formed by concatenating two substrings
 that respectively match the concatenated expressions.
 
-Two regular expressions may be joined by the infix operator @samp{|};
-the resulting regular expression
-matches any string matching either alternate expression.
+@cindex alternatives in regular expressions
+Two regular expressions may be joined by the infix operator @samp{|}.
+The resulting regular expression matches any string matching either of
+the two expressions, which are called @dfn{alternatives}.
 
 Repetition takes precedence over concatenation,
 which in turn takes precedence over alternation.
@@ -1269,14 +1270,8 @@ A whole expression may be enclosed in parentheses
 to override these precedence rules and form a subexpression.
 An unmatched @samp{)} matches just itself.
 
-Some strings are not valid regular expressions and cause
-@command{grep} to issue a diagnostic and fail.  For example, @samp{xy\1}
-is invalid because there is no parenthesized subexpression for the
-back-reference @samp{\1} to refer to.  Also, some regular expressions
-have unspecified behavior and should be avoided in portable scripts
-even if @command{grep} does not currently diagnose them.  For example,
-@samp{xy\0} has unspecified behavior because @samp{0} is not a special
-character and there is no documentation for the behavior of @samp{\0}.
+Not every character string is a valid regular expression.
+@xref{Problematic Expressions}.
 
 @node Character Classes and Bracket Expressions
 @section Character Classes and Bracket Expressions
@@ -1442,7 +1437,7 @@ represents the close character class symbol.
 
 @item -
 represents the range if it's not first or last in a list or the ending point
-of a range.
+of a range.  To make the @samp{-} a list item, it is best to put it last.
 
 @item ^
 represents the characters not in the list.
@@ -1451,8 +1446,8 @@ character a list item, place it anywhere but first.
 
 @end table
 
-@node The Backslash Character and Special Expressions
-@section The Backslash Character and Special Expressions
+@node Special Backslash Expressions
+@section Special Backslash Expressions
 @cindex backslash
 
 The @samp{\} character followed by a special character is a regular
@@ -1524,8 +1519,6 @@ for example, @samp{(a)*\1} fails to match @samp{a}.
 If the parenthesized subexpression matches more than one substring,
 the back-reference refers to the last matched substring;
 for example, @samp{^(ab*)*\1$} matches @samp{ababbabb} but not @samp{ababbab}.
-The back-reference @samp{\@var{n}} is invalid
-if preceded by fewer than @var{n} subexpressions.
 When multiple regular expressions are given with
 @option{-e} or from a file (@samp{-f @var{file}}),
 back-references are local to each expression.
@@ -1536,65 +1529,181 @@ back-references are local to each expression.
 @section Basic vs Extended Regular Expressions
 @cindex basic regular expressions
 
-In basic regular expressions the characters @samp{?}, @samp{+},
+Basic regular expressions differ from extended regular expressions
+in the following ways:
+
+@itemize
+@item
+The characters @samp{?}, @samp{+},
 @samp{@{}, @samp{|}, @samp{(}, and @samp{)} lose their special meaning;
 instead use the backslashed versions @samp{\?}, @samp{\+}, @samp{\@{},
 @samp{\|}, @samp{\(}, and @samp{\)}.  Also, a backslash is needed
-before an interval expression's closing @samp{@}}, and an unmatched
-@code{\)} is invalid.
+before an interval expression's closing @samp{@}}.
 
-Portable scripts should avoid the following constructs, as
-POSIX says they produce unspecified results:
+@item
+An unmatched @samp{\)} is invalid.
 
-@itemize @bullet
 @item
-An extended regular expression that uses back-references.
+If an unescaped @samp{^} appears neither first, nor directly after
+@samp{\(} or @samp{\|}, it is treated like an ordinary character and
+is not an anchor.
+
 @item
-A basic regular expression that uses @samp{\?}, @samp{\+}, or @samp{\|}.
+If an unescaped @samp{$} appears neither last, nor directly before
+@samp{\|} or @samp{\)}, it is treated like an ordinary character and
+is not an anchor.
+
 @item
-An empty parenthesized regular expression like @samp{()}.
+If an unescaped @samp{*} appears first, or appears directly after
+@samp{\(} or @samp{\|} or anchoring @samp{^}, it is treated like an
+ordinary character and is not a repetition operator.
+@end itemize
+
+@node Problematic Expressions
+@section Problematic Regular Expressions
+
+@cindex invalid regular expressions
+@cindex unspecified behavior in regular expressions
+Some strings are @dfn{invalid regular expressions} and cause
+@command{grep} to issue a diagnostic and fail.  For example, @samp{xy\1}
+is invalid because there is no parenthesized subexpression for the
+back-reference @samp{\1} to refer to.
+
+Also, some regular expressions have @dfn{unspecified behavior} and
+should be avoided even if @command{grep} does not currently diagnose
+them.  For example, @samp{xy\0} has unspecified behavior because
+@samp{0} is not a special character and @samp{\0} is not a special
+backslash expression (@pxref{Special Backslash Expressions}).
+Unspecified behavior can be particularly problematic because the set
+of matched strings might be only partially specified, or not be
+specified at all, or the expression might even be invalid.
+
+The following regular expression constructs are invalid on all
+platforms conforming to POSIX, so portable scripts can assume that
+@command{grep} rejects these constructs:
+
+@itemize @bullet
 @item
-An empty alternative (as in, e.g, @samp{a|}).
+A basic regular expression containing a back-reference @samp{\@var{n}}
+preceded by fewer than @var{n} closing parentheses.  For example,
+@samp{\(a\)\2} is invalid.
+
 @item
-A repetition operator that immediately follows an empty expression,
-unescaped @samp{$}, or another repetition operator.
+A bracket expression containing @samp{[:} that does not start a
+character class; and similarly for @samp{[=} and @samp{[.}.  For
+example, @samp{[a[:b]} and @samp{[a[:ouch:]b]} are invalid.
+@end itemize
+
+GNU @command{grep} treats the following constructs as invalid.
+However, other @command{grep} implementations might allow them, so
+portable scripts should not rely on their being invalid:
+
+@itemize @bullet
+@item
+Unescaped @samp{\} at the end of a regular expression.
+
 @item
-An interval expression with a repetition count greater than 255.
+Unescaped @samp{[} that does not start a bracket expression.
+
+@item
+A @samp{\@{} in a basic regular expression that does not start an
+interval expression.
+
 @item
 A basic regular expression with unbalanced @samp{\(} or @samp{\)},
 or an extended regular expression with unbalanced @samp{(}.
+
+@item
+In the POSIX locale, a range expression like @samp{z-a} that
+represents zero elements.  A non-GNU @command{grep} might treat it as
+a valid range that never matches.
+
+@item
+An interval expression with a repetition count greater than 32767.
+(The portable POSIX limit is 255, and even interval expressions with
+smaller counts can be impractically slow on all known implementations.)
+
 @item
 A bracket expression that contains at least three elements, the first
 and last of which are both @samp{:}, or both @samp{.}, or both
-@samp{=}.  For example, it is unspecified whether the bracket expression
-@samp{[:alpha:]} is equivalent to @samp{[[:alpha:]]}, equivalent to
-@samp{[:ahlp]}, or invalid.
+@samp{=}.  For example, a non-GNU @command{grep} might treat
+@samp{[:alpha:]} like @samp{[[:alpha:]]}, or like @samp{[:ahlp]}.
+@end itemize
+
+The following constructs have well-defined behavior in GNU
+@command{grep}.  However, they have unspecified behavior elsewhere, so
+portable scripts should avoid them:
+
+@itemize @bullet
 @item
-A range expression like @samp{z-a} that represents zero elements;
-it might never match, or it might be invalid.
+Special backslash expressions like @samp{\<} and @samp{\b}.
+@xref{Special Backslash Expressions}.
+
 @item
-A range expression outside the POSIX locale.
+A basic regular expression that uses @samp{\?}, @samp{\+}, or @samp{\|}.
+
 @item
-A backslash escaping an ordinary character (e.g., @samp{\S}),
-unless it is a back-reference.
+An extended regular expression that uses back-references.
+
 @item
-An unescaped backslash at the end of a regular expression.
+An empty regular expression, subexpression, or alternative.  For
+example, @samp{(a|bc|)} is not portable; a portable equivalent is
+@samp{(a|bc)?}.
+
 @item
-An unescaped @samp{[} that is not part of a bracket expression.
+In a basic regular expression, an anchoring @samp{^} that appears
+directly after @samp{\(}, or an anchoring @samp{$} that appears
+directly before @samp{\)}.
+
 @item
-A @samp{\@{} in a basic regular expression (or an unescaped @samp{@{}
-in an extended regular expression) that does not start an interval
-expression.
+In a basic regular expression, a repetition operator that
+directly follows another repetition operator.
+
+@item
+In an extended regular expression, unescaped @samp{@{}
+that does not begin a valid interval expression.
+GNU @command{grep} treats the @samp{@{} as an ordinary character.
+
+@item
+A null character or an encoding error in either pattern or input data.
+@xref{Character Encoding}.
+
+@item
+An input file that ends in a non-newline character,
+where GNU @command{grep} silently supplies a newline.
 @end itemize
 
-@cindex interval expressions
-GNU @samp{grep@ -E} treats @samp{@{} as special
-only if it begins a valid interval expression.
-For example, the command
-@samp{grep@ -E@ '@{1'} searches for the two-character string @samp{@{1}
-instead of reporting a syntax error in the regular expression.
-POSIX allows this behavior as an extension, but portable scripts
-should avoid it.
+The following constructs have unspecified behavior, in both GNU
+and other @command{grep} implementations.  Scripts should avoid
+them whenever possible.
+
+@itemize
+@item
+A backslash escaping an ordinary character, unless it is a
+back-reference like @samp{\1} or a special backslash expression like
+@samp{\<} or @samp{\b}.  @xref{Special Backslash Expressions}.  For
+example, @samp{\x} has unspecified behavior now, and a future version
+of @command{grep} might specify @samp{\x} to have a new behavior.
+
+@item
+A repetition operator that appears directly after an anchor, or at the
+start of a complete regular expression, parenthesized subexpression,
+or alternative.  For example, @samp{+|^*(+a|?-b)} has unspecified
+behavior, whereas @samp{\+|^\*(\+a|\?-b)} is portable.
+
+@item
+A range expression outside the POSIX locale.  For example, in some
+locales @samp{[a-z]} might match some characters that are not
+lowercase letters, or might not match some lowercase letters, or might
+be invalid.  With GNU @command{grep} it is not documented whether
+these range expressions use native code points, or use the collating
+sequence specified by the @env{LC_COLLATE} category, or have some
+other interpretation.  Outside the POSIX locale, it is portable to use
+@samp{[[:lower:]]} to match a lower-case letter, or
+@samp{[abcdefghijklmnopqrstuvwxyz]} to match an ASCII lower-case
+letter.
+
+@end itemize
 
 @node Character Encoding
 @section Character Encoding
@@ -1900,7 +2009,10 @@ other patterns cause @command{grep} to match every line.
 
 To match empty lines, use the pattern @samp{^$}.  To match blank
 lines, use the pattern @samp{^[[:blank:]]*$}.  To match no lines at
-all, use the command @samp{grep -f /dev/null}.
+all, use an extended regular expression like @samp{a^} or @samp{$a}.
+To match every line, a portable script should use a pattern like
+@samp{^} instead of the empty pattern, as POSIX does not specify the
+behavior of the empty pattern.
 
 @item
 How can I search in both standard input and in files?
-- 
2.34.1
