On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <vinc...@vinc17.net> wrote:

> >
> > 0xE0 0xC2 0xFF
> > 0xED 0xA0 0xFF
> > 0xF0 0xBF 0xFF 0xFF
>
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).

I got them from pcre_valid_utf8(), but I made some mistakes.  The
sequences I meant are as follows.

  0xE0 0xAF 0xBF
  0xED 0xA0 0xBF
  0xF0 0x8F 0xBF 0xBF

By the way, they correspond to the following code in pcre_valid_utf8().

    if (c == 0xe0 && (d & 0x20) == 0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR14;
      }
    ........
    if (c == 0xf0 && (d & 0x30) == 0)
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR13;
      }

> BTW,
>
>   printf "\xF4\xBF\xBF\xBF\n" | grep .
>
> finds a match, and this appears to be a bug (grep should follow
> the current standard).

I agree that it is a bug, as you say.  mbrlen() in glibc returns
(size_t) -1 for the sequence.
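
To make the four quoted tests easier to try out by hand, here is a rough sketch that isolates them in one small function.  It is not PCRE code: second_byte_check() and self_check() are hypothetical names of mine, the return values merely mirror the numbers in the PCRE_UTF8_ERRnn names, and the sketch assumes that d has already been checked to be a continuation byte (0x80..0xBF), which the real function does earlier.

```c
#include <assert.h>

/* Hypothetical helper, not part of PCRE.  c is the lead byte and d the
   byte that follows it; d is assumed to be in 0x80..0xBF already.
   Returns 0 when none of the four quoted tests fires, otherwise the
   number from the matching PCRE_UTF8_ERRnn name. */
int second_byte_check(unsigned char c, unsigned char d)
{
    /* 3-byte overlong form: would encode a code point below U+0800 */
    if (c == 0xe0 && (d & 0x20) == 0)
        return 16;

    /* UTF-16 surrogates, U+D800..U+DFFF */
    if (c == 0xed && d >= 0xa0)
        return 14;

    /* 4-byte overlong form: would encode a code point below U+10000 */
    if (c == 0xf0 && (d & 0x30) == 0)
        return 17;

    /* code point above U+10FFFF */
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
        return 13;

    return 0;
}

/* Boundary values on both sides of each test. */
void self_check(void)
{
    assert(second_byte_check(0xe0, 0x9f) == 16);
    assert(second_byte_check(0xe0, 0xa0) == 0);
    assert(second_byte_check(0xed, 0x9f) == 0);
    assert(second_byte_check(0xed, 0xa0) == 14);
    assert(second_byte_check(0xf0, 0x8f) == 17);
    assert(second_byte_check(0xf0, 0x90) == 0);
    assert(second_byte_check(0xf4, 0x8f) == 0);
    assert(second_byte_check(0xf4, 0xbf) == 13);  /* the \xF4\xBF... case */
    assert(second_byte_check(0xf5, 0x80) == 13);
}
```

Calling self_check() from a main() exercises each test at its boundary.  The (0xf4, 0xbf) case is the lead of the \xF4\xBF\xBF\xBF sequence from the printf example, which the last quoted test rejects.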