bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions

Aharon Robbins Thu, 27 Feb 2014 12:33:02 -0800

Hi Paul.

> Subject: bug#16895: [PATCH] grep: fix multiple bugs with bracket expressions
> To: 16...@debbugs.gnu.org
> Date: Thu, 27 Feb 2014 09:34:33 -0800
> From: Paul Eggert <egg...@cs.ucla.edu>
>
> I'm afraid there are several problems in the dfa code.  I still don't 
> have a handle on all of them, but here's my first patch to deal with the 
> first major one I found.  Patterns like [a-[.z.]], which caused 'grep' 
> to dump core until recently, still aren't being handled correctly, and 
> there are several closely related bugs here.  I've taken the liberty of 
> pushing the attached patch.


Thanks. This looks promising. A few comments / questions.

> +/* Return true if the current locale is known to be a unibyte locale
> +   without multicharacter collating sequences and where range
> +   comparisons simply use the native encoding.  These locales can be
> +   processed more efficiently.  */
> +
> +static bool
> +using_simple_locale (void)
> +{
> +  /* True if the native character set is known to be compatible with
> +     the C locale.  The following test isn't perfect, but it's good
> +     enough in practice, as only ASCII and EBCDIC are in common use
> +     and this test correctly accepts ASCII and rejects EBCDIC.  */
> +  enum { native_c_charset =
> +    ('\b' == 8 && '\t' == 9 && '\n' == 10 && '\v' == 11 && '\f' == 12
> +     && '\r' == 13 && ' ' == 32 && '!' == 33 && '"' == 34 && '#' == 35
> +     && '%' == 37 && '&' == 38 && '\'' == 39 && '(' == 40 && ')' == 41
> +     && '*' == 42 && '+' == 43 && ',' == 44 && '-' == 45 && '.' == 46
> +     && '/' == 47 && '0' == 48 && '9' == 57 && ':' == 58 && ';' == 59
> +     && '<' == 60 && '=' == 61 && '>' == 62 && '?' == 63 && 'A' == 65
> +     && 'Z' == 90 && '[' == 91 && '\\' == 92 && ']' == 93 && '^' == 94
> +     && '_' == 95 && 'a' == 97 && 'z' == 122 && '{' == 123 && '|' == 124
> +     && '}' == 125 && '~' == 126)
> +  };

What a mouthful!  Is all that really necessary?
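
For reference, here is a cut-down sketch of the idiom as I read it, assuming nothing beyond what the quoted hunk shows. The names demo_c_charset and demo_using_c_charset and the shortened comparison list are mine, purely for illustration. Every comparison is an integer constant expression, so the compiler folds the whole thing and there is no run-time cost, however long the list is.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: a shortened form of the character-set test.
   The enumerator is an integer constant expression, so it is
   evaluated by the compiler, not at run time.  */
enum { demo_c_charset =
    ('\n' == 10 && ' ' == 32 && '0' == 48 && 'A' == 65 && 'a' == 97
     && '~' == 126)
};

/* Hypothetical helper: reports whether the shortened test passed
   on the platform this was compiled for.  */
static bool
demo_using_c_charset (void)
{
  return demo_c_charset != 0;
}
```

On an ASCII host the helper returns true. A shorter list like this one checks fewer code points, so it is a weaker test than the one in the patch.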

> +          if ((c1 == ':' && syntax_bits & RE_CHAR_CLASSES)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I'd suggest parentheses around the bit with the bitwise operator,
both for readability and to match the rest of the code.
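
To be clear, this is purely a readability point: '&' binds tighter than '&&', so the two forms mean the same thing. A minimal sketch, assuming a made-up flag value (DEMO_CHAR_CLASSES is hypothetical; the real RE_CHAR_CLASSES comes from regex.h) and made-up function names:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value, for illustration only.  */
#define DEMO_CHAR_CLASSES 0x4UL

/* As written in the patch: relies on & binding tighter than &&.  */
static bool
class_ok_bare (int c1, unsigned long syntax_bits)
{
  return c1 == ':' && syntax_bits & DEMO_CHAR_CLASSES;
}

/* With the suggested parentheses: same result, clearer intent.  */
static bool
class_ok_parens (int c1, unsigned long syntax_bits)
{
  return c1 == ':' && (syntax_bits & DEMO_CHAR_CLASSES);
}

/* Confirm that both forms agree on a few representative inputs.  */
static void
check_same (void)
{
  static const int chars[] = { ':', '.', '=' };
  static const unsigned long bits[] = { 0, DEMO_CHAR_CLASSES, 0x3UL, 0x7UL };
  for (unsigned i = 0; i < sizeof chars / sizeof chars[0]; i++)
    for (unsigned j = 0; j < sizeof bits / sizeof bits[0]; j++)
      assert (class_ok_bare (chars[i], bits[j])
              == class_ok_parens (chars[i], bits[j]));
}
```

The parenthesized form just saves the reader from having to recall the precedence table.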

> @@ -1000,7 +1043,10 @@ parse_bracket_exp (void)
>                /* Fetch bracket.  */
>                FETCH_WC (c, wc, _("unbalanced ["));
>                if (c1 == ':')
> -                /* build character class.  */
> +                /* Build character class.  POSIX allows character
> +                   classes to match multicharacter collating elements,
> +                   but the regex code does not support that, so do not
> +                   worry about that possibility.  */

I thought GLIBC did support them?

I will try this out in gawk sometime in the next few days and
let you know how it goes.

Thanks for the work!

Arnold
