Hello All. I believe that the code in dfa.c that deals with character ranges is incorrect with respect to Rational Range Interpretation. This shows up in the following test case:
$ echo \\ | src/grep -Xawk '[\[-\]]' $ Whereas with gawk: $ echo \\ | gawk '/[\[-\]]/' \ >From ascii(7): ... 133 91 5B [ ... 134 92 5C \ '\\' ... 135 93 5D ] So gawk is correct here. (This is on a GLIBC system; in private email Jim reported different behavior on Mac OS X.) In the grep master, the code in question is in dfa.c:parse_bracket_exp, lines 1110 - 1135: { /* Defer to the system regex library about the meaning of range expressions. */ regex_t re; char pattern[6] = { '[', 0, '-', 0, ']', 0 }; char subject[2] = { 0, 0 }; c1 = c; if (case_fold) { c1 = tolower (c1); c2 = tolower (c2); } pattern[1] = c1; pattern[3] = c2; regcomp (&re, pattern, REG_NOSUB); for (c = 0; c < NOTCHAR; ++c) { if ((case_fold && isupper (c))) continue; subject[0] = c; if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH) setbit_case_fold_c (c, ccl); } regfree (&re); } This code lets the regex routines decide what characters match a particular range expression. If the regex routines are not obeying RRI, then dfa.c will not either. Yet, grep now supports RRI. (To me this argues that grep's configure should be checking the system regex routines for correct RRI support, and automatically using the included routines if the system routines are not good. Gawk goes further and simply always uses the included regex routines, *guaranteeing* consistent behavior across systems. But that's a parenthetical issue.) In addition, the call to regcomp could fail, but this isn't being checked. When I add an error report to the call, I get the following on one of the gawk test cases: "[.c.]" ~ /[a-[.e.]]/ --> 1 dfa.c:1176: regcomp(/[a-[]/) failed: Invalid range end Since this relates to [. and .] which dfa and regex don't really support, there's a gap somewhere, but the point is that if regcomp fails, nobody notices. What does regexec do if regcomp fails? Beats me... Next, let's take a harder look at this: for (c = 0; c < NOTCHAR; ++c) { if ((case_fold && isupper (c))) continue; subject[0] = c; if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH) setbit_case_fold_c (c, ccl); } Since c is 0 on the first iteration, regexec is called with subject equal to [ '\0' '\0' ]. The first thing regexec does is length = strlen(string); which in this case will be zero. We really want a length of 1 where the first byte is zero (no arbitrary limits, eh?). Bug in the regexec interface, methinks, but in any case, testing 0 is fruitless. However, this code begs a deeper question. If we're doing RRI, then by definition only the values between the low member of the range and the high member of the range can match the range expression. So why loop over everything from 0 to 255? Thus, gawk replaces the above code with the following: c1 = c; if (case_fold) { c1 = tolower (c1); c2 = tolower (c2); } for (c = c1; c <= c2; c++) setbit_case_fold_c (c, ccl); This sets the bits for exactly those characters in the range. No more, no less. And it doesn't rely on the system regex routines, which makes compiling the dfa go faster. Grep only compiles its dfa once, but gawk can compile arbitrarily many dfa's, since it can match expressions that are computed dynamically. I'm not sure if this analysis covers all the problems with the current code. But I do think that gawk's code is the correct thing to be doing for RRI. Additionally, I recommend that grep's configure check for good RRI support in the system regex routines and switch to the included ones if the system ones don't support it. Finally, the following diff lets grep check the other awk syntax variants. Feel free to apply it. For the above test case, all three give the same results. I hope all this is of interest. Thanks! Arnold ----------------------------------------------------- diff --git a/src/grep.c b/src/grep.c index 1b2198f..12644a2 100644 --- a/src/grep.c +++ b/src/grep.c @@ -19,10 +19,24 @@ Acompile (char const *pattern, size_t size) GEAcompile (pattern, size, RE_SYNTAX_AWK); } +static void +GAcompile (char const *pattern, size_t size) +{ + GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK); +} + +static void +PAcompile (char const *pattern, size_t size) +{ + GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK); +} + struct matcher const matchers[] = { { "grep", Gcompile, EGexecute }, { "egrep", Ecompile, EGexecute }, { "awk", Acompile, EGexecute }, + { "gawk", GAcompile, EGexecute }, + { "posixawk", PAcompile, EGexecute }, { "fgrep", Fcompile, Fexecute }, { "perl", Pcompile, Pexecute }, { NULL, NULL, NULL },