Hi. Dimitry Andric <dimi...@andric.com> wrote:
> Yes, it still reproduces when I configure the latest grep using > --without-included-regex: I assume you meant --with-included-regex? If you really used --without-included-regex, that doesn't prove anything... :-) It's interesting, as gawk uses the same regex, but with different flags. It might be worth trying to understand which of the syntax bits is causing this. Thanks, Arnold > > 1400 for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;) > 1: idx = 0 > (gdb) > 1402 update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, > nmatch); > 1: idx = 0 > (gdb) > 1404 if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node) > 1: idx = 0 > (gdb) > 1405 || (fs && re_node_set_contains (&eps_via_nodes, cur_node))) > 1: idx = 0 > (gdb) > 1428 cur_node = proceed_next_node (mctx, nmatch, pmatch, > prev_idx_match, > 1: idx = 0 > (gdb) > 1432 if (__glibc_unlikely (cur_node < 0)) > 1: idx = 0 > (gdb) > 1400 for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;) > 1: idx = 0 > (gdb) > 1402 update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, > nmatch); > 1: idx = 0 > (gdb) > 1404 if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node) > 1: idx = 0 > (gdb) > 1405 || (fs && re_node_set_contains (&eps_via_nodes, cur_node))) > 1: idx = 0 > > The endless loop looks the same. grep's regexec.c is mostly the same as > glibc's, except for the latter having a bunch of #if RE_ENABLE_I18N > directives added. > > -Dimitry > > > On 28 Mar 2023, at 08:46, arn...@skeeve.com wrote: > > > > This does not reproduce with gawk, even when I force use of the regex > > matcher. > > > > What happens if grep is built with the included regex files instead of > > relying on the ones in the local glibc? > > > > Arnold > > > > Dimitry Andric <dimi...@andric.com> wrote: > > > >> On 27 Mar 2023, at 11:14, Koen Claessen <k...@chalmers.se> wrote: > >>> > >>> Running the command: > >>> > >>> echo a | grep -E -w '((()|a)|())*' > >>> > >>> does not terminate, and uses a LOT of processor time, for all versions of > >>> grep I have tried. > >>> > >>> This is the smallest case that could be found; simplifying anything in the > >>> input and/or expression leads to correct behavior. > >> > >> Reproducible with GNU grep 3.7 on Ubuntu 22: > >> > >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > >> COMMAND > >> 93938 dim 20 0 9.0m 2.1m 2.0m R 100.0 0.0 0:08.32 grep > >> -E -w ((()|a)|())* > >> > >> It seems that at least on Ubuntu, grep in this mode uses glibc's > >> regexec(3), and it is that implementation which ends up in an endless loop. > >> > >> It loops between lines 1415, 1417 and 1443, but idx and cur_node never > >> change from 0: > >> > >> 1378 static reg_errcode_t > >> 1379 __attribute_warn_unused_result__ > >> 1380 set_regs (const regex_t *preg, const re_match_context_t *mctx, > >> size_t nmatch, > >> 1381 regmatch_t *pmatch, bool fl_backtrack) > >> 1382 { > >> ... > >> 1415 for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;) > >> 1416 { > >> 1417 update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, > >> nmatch); > >> 1418 > >> 1419 if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node) > >> 1420 || (fs && re_node_set_contains (&eps_via_nodes, > >> cur_node))) > >> ... > >> 1442 /* Proceed to next node. */ > >> 1443 cur_node = proceed_next_node (mctx, nmatch, pmatch, > >> prev_idx_match, > >> 1444 &idx, cur_node, > >> 1445 &eps_via_nodes, fs); > >> 1446 > >> 1447 if (__glibc_unlikely (cur_node < 0)) > >> ... > >> 1465 } > >> 1466 } > >> > >> -Dimitry > >> > >> P.S.: Interestingly this does not reproduce with BSD grep, which returns > >> immediately with "a". > >> >