On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote:
> I got them from pcre_valid_utf8(), but I made some mistakes. They are
> as following.
>
> 0xE0 0xAF 0xBF

This one is valid UTF-8 and corresponds to the code point U+0BFF,
and the following matches:

  $ printf "\xE0\xAF\xBF\n" | grep -P .

> 0xED 0xA0 0xBF

OK, this is in the surrogate area (it would decode to U+D83F), and
it doesn't match with PCRE.

> 0xF0 0x8F 0xBF 0xBF

This one is an overlong encoding of U+FFFF (whose shortest form is
0xEF 0xBF 0xBF), so it is invalid too.

> > BTW,
> >
> > printf "\xF4\xBF\xBF\xBF\n" | grep .
> >
> > finds a match, and this appears to be a bug (grep should follow
> > the current standard).
>
> I also see it is a bug as you say. mbrlen() in glibc returns (size_t) -1
> for the sequence.

Ditto with:

  printf "\xED\xA0\xBF\n" | grep .

(surrogate area).

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)