bug#16481: dfa.c and Rational Range Interpretation

Aharon Robbins Fri, 17 Jan 2014 05:42:34 -0800

Hello All.

I believe that the code in dfa.c that deals with character ranges
is incorrect with respect to Rational Range Interpretation.
This shows up in the following test case:


        $ echo \\ | src/grep -Xawk '[\[-\]]'
        $ 

Whereas with gawk:

        $ echo \\ | gawk '/[\[-\]]/'
        \

>From ascii(7):

      ...       133   91    5B    [
      ...       134   92    5C    \  '\\'
      ...       135   93    5D    ]

So gawk is correct here.  (This is on a GLIBC system; in private email
Jim reported different behavior on Mac OS X.)

In the grep master, the code in question is in dfa.c:parse_bracket_exp,
lines 1110 - 1135:

            {
              /* Defer to the system regex library about the meaning
                 of range expressions.  */
              regex_t re;
              char pattern[6] = { '[', 0, '-', 0, ']', 0 };
              char subject[2] = { 0, 0 };
              c1 = c;
              if (case_fold)
                {
                  c1 = tolower (c1);
                  c2 = tolower (c2);
                }

              pattern[1] = c1;
              pattern[3] = c2;
              regcomp (&re, pattern, REG_NOSUB);
              for (c = 0; c < NOTCHAR; ++c)
                {
                  if ((case_fold && isupper (c)))
                    continue;
                  subject[0] = c;
                  if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
                    setbit_case_fold_c (c, ccl);
                }
              regfree (&re);
            }

This code lets the regex routines decide what characters match
a particular range expression. If the regex routines are not
obeying RRI, then dfa.c will not either.  Yet, grep now supports RRI.

(To me this argues that grep's configure should be checking the
system regex routines for correct RRI support, and automatically
using the included routines if the system routines are not good. Gawk
goes further and simply always uses the included regex routines,
*guaranteeing* consistent behavior across systems. But that's a
parenthetical issue.)

In addition, the call to regcomp could fail, but this isn't being
checked. When I add an error report to the call, I get the following
on one of the gawk test cases:

  "[.c.]" ~ /[a-[.e.]]/ --> 1
  dfa.c:1176: regcomp(/[a-[]/) failed: Invalid range end

Since this relates to [. and .] which dfa and regex don't really
support, there's a gap somewhere, but the point is that if regcomp
fails, nobody notices. What does regexec do if regcomp fails?
Beats me...

Next, let's take a harder look at this:

              for (c = 0; c < NOTCHAR; ++c)
                {
                  if ((case_fold && isupper (c)))
                    continue;
                  subject[0] = c;
                  if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
                    setbit_case_fold_c (c, ccl);
                }

Since c is 0 on the first iteration, regexec is called with subject
equal to [ '\0' '\0' ].  The first thing regexec does is

        length = strlen(string);

which in this case will be zero. We really want a length of 1 where
the first byte is zero (no arbitrary limits, eh?).  Bug in the regexec
interface, methinks, but in any case, testing 0 is fruitless.

However, this code begs a deeper question. If we're doing RRI, then by
definition only the values between the low member of the range and
the high member of the range can match the range expression.  So why
loop over everything from 0 to 255?

Thus, gawk replaces the above code with the following:

              c1 = c;
              if (case_fold)
                {
                  c1 = tolower (c1);
                  c2 = tolower (c2);
                }
              for (c = c1; c <= c2; c++)
                setbit_case_fold_c (c, ccl);

This sets the bits for exactly those characters in the range. No more,
no less. And it doesn't rely on the system regex routines, which makes
compiling the dfa go faster.

Grep only compiles its dfa once, but gawk can compile arbitrarily many
dfa's, since it can match expressions that are computed dynamically.

I'm not sure if this analysis covers all the problems with the current
code.  But I do think that gawk's code is the correct thing to be
doing for RRI.

Additionally, I recommend that grep's configure check for good RRI
support in the system regex routines and switch to the included ones
if the system ones don't support it.

Finally, the following diff lets grep check the other awk syntax
variants.  Feel free to apply it. For the above test case, all three
give the same results.

I hope all this is of interest.

Thanks!

Arnold
-----------------------------------------------------
diff --git a/src/grep.c b/src/grep.c
index 1b2198f..12644a2 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -19,10 +19,24 @@ Acompile (char const *pattern, size_t size)
   GEAcompile (pattern, size, RE_SYNTAX_AWK);
 }
 
+static void
+GAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_GNU_AWK);
+}
+
+static void
+PAcompile (char const *pattern, size_t size)
+{
+  GEAcompile (pattern, size, RE_SYNTAX_POSIX_AWK);
+}
+
 struct matcher const matchers[] = {
   { "grep",    Gcompile, EGexecute },
   { "egrep",   Ecompile, EGexecute },
   { "awk",     Acompile, EGexecute },
+  { "gawk",    GAcompile, EGexecute },
+  { "posixawk", PAcompile, EGexecute },
   { "fgrep",   Fcompile, Fexecute },
   { "perl",    Pcompile, Pexecute },
   { NULL, NULL, NULL },

bug#16481: dfa.c and Rational Range Interpretation

Reply via email to