bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales

Norihiro Tanaka Fri, 19 Dec 2014 06:01:41 -0800

On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <vinc...@vinc17.net> wrote:
> > 
> >   0xE0 0xC2 0xFF
> >   0xED 0xA0 0xFF
> >   0xF0 0xBF 0xFF 0xFF
> 
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).


I took them from pcre_valid_utf8(), but I made some mistakes.  The
corrected sequences are as follows.

  0xE0 0xAF 0xBF
  0xED 0xA0 0xBF
  0xF0 0x8F 0xBF 0xBF

By the way, they correspond to the following code in pcre_valid_utf8().

    if (c == 0xe0 && (d & 0x20) == 0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR14;
      }

    ........

    if (c == 0xf0 && (d & 0x30) == 0)
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR13;
      }
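
As a rough illustration of which byte pairs these checks reject, the four
conditions can be wrapped in a small standalone function.  This is only a
sketch: classify_lead_pair() and the SKETCH_* names are hypothetical and not
part of PCRE, and it assumes the second byte is already known to be a
continuation byte (0x80..0xBF).  The real pcre_valid_utf8() also validates
the sequence length and every continuation byte, which is omitted here.

```c
#include <assert.h>

/* Hypothetical result codes for this sketch.  PCRE itself reports
   these cases as PCRE_UTF8_ERR16, ERR14, ERR17 and ERR13.  */
enum sketch_result
{
  SKETCH_OK = 0,
  SKETCH_OVERLONG_3,  /* same condition as PCRE_UTF8_ERR16 */
  SKETCH_SURROGATE,   /* same condition as PCRE_UTF8_ERR14 */
  SKETCH_OVERLONG_4,  /* same condition as PCRE_UTF8_ERR17 */
  SKETCH_TOO_LARGE    /* same condition as PCRE_UTF8_ERR13 */
};

/* c is the lead byte, d is the byte that follows it.  */
static enum sketch_result
classify_lead_pair (unsigned char c, unsigned char d)
{
  /* Precondition assumed by this sketch: d is a continuation byte.  */
  assert ((d & 0xc0) == 0x80);

  if (c == 0xe0 && (d & 0x20) == 0)
    return SKETCH_OVERLONG_3;   /* 3-byte form of a shorter code point */
  if (c == 0xed && d >= 0xa0)
    return SKETCH_SURROGATE;    /* U+D800..U+DFFF */
  if (c == 0xf0 && (d & 0x30) == 0)
    return SKETCH_OVERLONG_4;   /* 4-byte form of a shorter code point */
  if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
    return SKETCH_TOO_LARGE;    /* above U+10FFFF */
  return SKETCH_OK;
}
```

For example, classify_lead_pair (0xed, 0xa0) yields SKETCH_SURROGATE and
classify_lead_pair (0xf0, 0x8f) yields SKETCH_OVERLONG_4, matching the
second and third sequences listed above.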

> BTW,
> 
>   printf "\xF4\xBF\xBF\xBF\n" | grep .
> 
> finds a match, and this appears to be a bug (grep should follow
> the current standard).

I agree that it is a bug.  mbrlen() in glibc returns (size_t) -1 for
the sequence.
