Hi Paul.

> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16...@debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <egg...@cs.ucla.edu>
>
> I'm afraid there are several problems in the dfa code. I still don't
> have a handle on all of them, but here's my first patch to deal with the
> first major one I found. Patterns like [a-[.z.]], which caused 'grep'
> to dump core until recently, still aren't being handled correctly, and
> there are several closely related bugs here. I've taken the liberty of
> pushing the attached patch.

Thanks. This looks promising. A few comments / questions.

> +/* Return true if the current locale is known to be a unibyte locale
> +   without multicharacter collating sequences and where range
> +   comparisons simply use the native encoding. These locales can be
> +   processed more efficiently. */
> +
> +static bool
> +using_simple_locale (void)
> +{
> +  /* True if the native character set is known to be compatible with
> +     the C locale. The following test isn't perfect, but it's good
> +     enough in practice, as only ASCII and EBCDIC are in common use
> +     and this test correctly accepts ASCII and rejects EBCDIC. */
> +  enum { native_c_charset =
> +    ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> +     && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> +     && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> +     && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> +     && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> +     && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> +     && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> +     && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> +     && '}' == 125 && '~' == 126)
> +  };

What a mouthful! Is all that really necessary?

> +  if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.

> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>    /* Fetch bracket. */
>    FETCH_WC (c, wc, _("unbalanced ["));
>    if (c1 == ':')
> -    /* build character class. */
> +    /* Build character class. POSIX allows character
> +       classes to match multicharacter collating elements,
> +       but the regex code does not support that, so do not
> +       worry about that possibility. */

I thought GLIBC did support them?

I will try this out in gawk, sometime in the next few days and let you
know how it goes.

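To be clear about the parenthesization comment: C gives '&' higher
precedence than '&&', so the condition as posted already parses the way
it is meant to, and the parentheses would be purely for the reader. A
minimal sketch of that, where DEMO_CHAR_CLASSES is a made-up stand-in
for RE_CHAR_CLASSES and the function names are hypothetical (none of
this is from dfa.c):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a syntax flag such as RE_CHAR_CLASSES;
   the value is made up for this sketch. */
enum { DEMO_CHAR_CLASSES = 1 << 2 };

/* The condition spelled as in the patch: no parentheses around '&'. */
static bool
unparenthesized (int c1, unsigned int syntax_bits)
{
  return c1 == ':' && syntax_bits & DEMO_CHAR_CLASSES;
}

/* The suggested spelling, with the bitwise test parenthesized. */
static bool
parenthesized (int c1, unsigned int syntax_bits)
{
  return c1 == ':' && (syntax_bits & DEMO_CHAR_CLASSES);
}

/* Assert that both spellings agree for every combination tried. */
static void
check_same_result (void)
{
  static int const chars[] = { ':', '.', '=' };
  static unsigned int const bits[] = { 0u, DEMO_CHAR_CLASSES, ~0u };

  for (size_t i = 0; i < sizeof chars / sizeof chars[0]; i++)
    for (size_t j = 0; j < sizeof bits / sizeof bits[0]; j++)
      assert (unparenthesized (chars[i], bits[j])
              == parenthesized (chars[i], bits[j]));
}
```

Calling check_same_result () from a small main should pass silently, so
this is a readability request, not a bug report.
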
Thanks for the work!

Arnold