On 11/9/20 9:34 AM, JIang Yuancheng wrote:
grep -E “.*{10,}{10,}{10,}{10,}{10,}” can exhaust stack space then stack overflow comes out. (Tested on latest version 3.6)
This is a longstanding issue with the regex matcher. I installed the attached patch to document the issue better. Fortunately, the problem is mostly limited to contrived examples.
>From f0d97db2a2104c5fd558178713054f3f267623b2 Mon Sep 17 00:00:00 2001 From: Paul Eggert <egg...@cs.ucla.edu> Date: Fri, 27 Aug 2021 18:20:58 -0700 Subject: [PATCH] doc: document interval expression limitations * doc/grep.texi (Basic vs Extended, Performance): Document limitations of interval expressions (Bug#44538). --- doc/grep.texi | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/doc/grep.texi b/doc/grep.texi index b92ecb7..e5b9fd8 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -1526,7 +1526,7 @@ before an interval expression's closing @samp{@}}, and an unmatched @code{\)} is invalid. Portable scripts should avoid the following constructs, as -POSIX says they produce undefined results: +POSIX says they produce unspecified results: @itemize @bullet @item @@ -1541,6 +1541,8 @@ Empty alternatives (as in, e.g, @samp{a|}). Repetition operators that immediately follow empty expressions, unescaped @samp{$}, or other repetition operators. @item +Interval expressions containing repetition counts greater than 255. +@item A backslash escaping an ordinary character (e.g., @samp{\S}), unless it is a back-reference. @item @@ -1965,6 +1967,17 @@ bracket expressions like @samp{[a-z]} and @samp{[[=a=]b]}, can be surprisingly inefficient due to difficulties in fast portable access to concepts like multi-character collating elements. +@cindex interval expressions +Interval expressions may be implemented internally via repetition. +For example, @samp{^(a|bc)@{2,4@}$} might be implemented as +@samp{^(a|bc)(a|bc)((a|bc)(a|bc)?)?$}. A large repetition count may +exhaust memory or greatly slow matching. Even small counts can cause +problems if cascaded; for example, @samp{grep -E +".*@{10,@}@{10,@}@{10,@}@{10,@}@{10,@}"} is likely to overflow a +stack. Fortunately, regular expressions like these are typically +artificial, and cascaded repetitions do not conform to POSIX so cannot +be used in portable programs anyway. + @cindex back-references A back-reference such as @samp{\1} can hurt performance significantly in some cases, since back-references cannot in general be implemented -- 2.30.2