bug#16481: dfa.c and Rational Range Interpretation

Aharon Robbins Sat, 25 Jan 2014 10:29:00 -0800

Hi Paul & Jim,

> > What happens if you compile them in and run the grep test suite?
>
> The test suite passes, but grep is bigger and (I presume) slower.  The 
> GREP-related changes are for performance, and shouldn't affect behavior.
>
> How about if we apply the attached patch to dfa.c, in both gawk and 
> grep?  I tried it just now, and gawk passed all its tests too.  Or, if 
> there's some reason this patch would introduce a bug into gawk, I'd like 
> to fix the grep test cases to detect the bug.


The code in question occurs in two functions, parse_bracket_exp() and atom().

The first instance is in parse_bracket_exp(), building a range expression,
where we may have multibyte characters.

      ....
      if (c1 == '-' && c2 != ']')
        {
          if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS))
            FETCH_WC (c2, wc2, _("unbalanced ["));

          if (MB_CUR_MAX > 1)
            {
              /* When case folding map a range, say [m-z] (or even [M-z])
                 to the pair of ranges, [m-z] [M-Z].  */
              REALLOC_IF_NECESSARY (work_mbc->range_sts,
                                    range_sts_al, work_mbc->nranges + 1);
              REALLOC_IF_NECESSARY (work_mbc->range_ends,
                                    range_ends_al, work_mbc->nranges + 1);
              work_mbc->range_sts[work_mbc->nranges] =
                case_fold ? towlower (wc) : (wchar_t) wc;
              work_mbc->range_ends[work_mbc->nranges++] =
                case_fold ? towlower (wc2) : (wchar_t) wc2;

#ifndef GREP
              if (case_fold && (iswalpha (wc) || iswalpha (wc2)))
                {
                  REALLOC_IF_NECESSARY (work_mbc->range_sts,
                                        range_sts_al, work_mbc->nranges + 1);
                  work_mbc->range_sts[work_mbc->nranges] = towupper (wc);
                  REALLOC_IF_NECESSARY (work_mbc->range_ends,
                                        range_ends_al, work_mbc->nranges + 1);
                  work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2);
                }
#endif
            }

To me this looks like when doing case folding (grep -i, IGNORECASE in gawk),
we turn the m.b. equivalent of [a-c] into [a-cA-C].  This would seem to be
necessary for correctness, and the question is why does grep not need it?

The next such bit is later on in the same function:

      if (case_fold && iswalpha (wc))
        {
          wc = towlower (wc);
          if (!setbit_wc (wc, ccl))
            {
              REALLOC_IF_NECESSARY (work_mbc->chars, chars_al,
                                    work_mbc->nchars + 1);
              work_mbc->chars[work_mbc->nchars++] = wc;
            }
#ifdef GREP
          continue;
#else
          wc = towupper (wc);
#endif
        }
      if (!setbit_wc (wc, ccl))
        {
          REALLOC_IF_NECESSARY (work_mbc->chars, chars_al,
                                work_mbc->nchars + 1);
          work_mbc->chars[work_mbc->nchars++] = wc;
        }
    }
  while ((wc = wc1, (c = c1) != ']'));

This too looks related to case folding and ranges; if I read it
correctly, when case folding it added the lower case version and
now it has to add the uppercase version of the charcter.

Then, in atom():  (Why the bizarre leading `if (0)'?)

static void
atom (void)
{
  if (0)
    {
      /* empty */
    }
  else if (MBS_SUPPORT && tok == WCHAR)
    {
      addtok_wc (case_fold ? towlower (wctok) : wctok);
#ifndef GREP
      if (case_fold && iswalpha (wctok))
        {
          addtok_wc (towupper (wctok));
          addtok (OR);
        }
#endif

      tok = lex ();
    }

Here too, we're doing case folding, have added the lower case character
and need to add the upper case one.

I think to test out this code you'd need a character set where the lower
and upper case counterparts are multibyte characters and grep -i is
in effect.  But I suspect that grep has so much other code to special
case grep -i that this code in dfa.c is never reached.

In short, I don't think it's right to remove this code, but I don't
know how to test it to prove that, either.

HTH,

Arnold

bug#16481: dfa.c and Rational Range Interpretation

Reply via email to