Hi Paul & Jim, > > What happens if you compile them in and run the grep test suite? > > The test suite passes, but grep is bigger and (I presume) slower. The > GREP-related changes are for performance, and shouldn't affect behavior. > > How about if we apply the attached patch to dfa.c, in both gawk and > grep? I tried it just now, and gawk passed all its tests too. Or, if > there's some reason this patch would introduce a bug into gawk, I'd like > to fix the grep test cases to detect the bug.
The code in question occurs in two functions, parse_bracket_exp() and atom(). The first instance is in parse_bracket_exp(), building a range expression, where we may have multibyte characters. .... if (c1 == '-' && c2 != ']') { if (c2 == '\\' && (syntax_bits & RE_BACKSLASH_ESCAPE_IN_LISTS)) FETCH_WC (c2, wc2, _("unbalanced [")); if (MB_CUR_MAX > 1) { /* When case folding map a range, say [m-z] (or even [M-z]) to the pair of ranges, [m-z] [M-Z]. */ REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); work_mbc->range_sts[work_mbc->nranges] = case_fold ? towlower (wc) : (wchar_t) wc; work_mbc->range_ends[work_mbc->nranges++] = case_fold ? towlower (wc2) : (wchar_t) wc2; #ifndef GREP if (case_fold && (iswalpha (wc) || iswalpha (wc2))) { REALLOC_IF_NECESSARY (work_mbc->range_sts, range_sts_al, work_mbc->nranges + 1); work_mbc->range_sts[work_mbc->nranges] = towupper (wc); REALLOC_IF_NECESSARY (work_mbc->range_ends, range_ends_al, work_mbc->nranges + 1); work_mbc->range_ends[work_mbc->nranges++] = towupper (wc2); } #endif } To me this looks like when doing case folding (grep -i, IGNORECASE in gawk), we turn the m.b. equivalent of [a-c] into [a-cA-C]. This would seem to be necessary for correctness, and the question is why does grep not need it? The next such bit is later on in the same function: if (case_fold && iswalpha (wc)) { wc = towlower (wc); if (!setbit_wc (wc, ccl)) { REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, work_mbc->nchars + 1); work_mbc->chars[work_mbc->nchars++] = wc; } #ifdef GREP continue; #else wc = towupper (wc); #endif } if (!setbit_wc (wc, ccl)) { REALLOC_IF_NECESSARY (work_mbc->chars, chars_al, work_mbc->nchars + 1); work_mbc->chars[work_mbc->nchars++] = wc; } } while ((wc = wc1, (c = c1) != ']')); This too looks related to case folding and ranges; if I read it correctly, when case folding it added the lower case version and now it has to add the uppercase version of the charcter. Then, in atom(): (Why the bizarre leading `if (0)'?) static void atom (void) { if (0) { /* empty */ } else if (MBS_SUPPORT && tok == WCHAR) { addtok_wc (case_fold ? towlower (wctok) : wctok); #ifndef GREP if (case_fold && iswalpha (wctok)) { addtok_wc (towupper (wctok)); addtok (OR); } #endif tok = lex (); } Here too, we're doing case folding, have added the lower case character and need to add the upper case one. I think to test out this code you'd need a character set where the lower and upper case counterparts are multibyte characters and grep -i is in effect. But I suspect that grep has so much other code to special case grep -i that this code in dfa.c is never reached. In short, I don't think it's right to remove this code, but I don't know how to test it to prove that, either. HTH, Arnold