On 2014-12-20 10:31:46 +0900, Norihiro Tanaka wrote:
> On Fri, 19 Dec 2014 23:00:38 +0900
> Norihiro Tanaka <nori...@kcn.ne.jp> wrote:
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
> Binary file (standard input) matches
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
> $
>
> regex also behaves same as grep -G, e.g. sed only using regex returns the
> line. Therefore, I think that what a character in the surrogate area
> matches a period with grep -G is not a bug, although the behavior might
> not obey a standard.
>
> $ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'
>
> By the way, mbrlen() returns (size_t) -1 for the character.

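For reference, the bytes ED A0 BF would decode to U+D83F, which lies in the surrogate range U+D800..U+DFFF that RFC 3629 excludes from UTF-8. Below is a minimal Python sketch of that check. The helper names is_utf8_surrogate and strict_decode_ok are hypothetical, chosen only for illustration; this is not how grep, sed or PCRE implement it.

```python
def is_utf8_surrogate(seq: bytes) -> bool:
    """True if seq is a 3-byte sequence that would decode to U+D800..U+DFFF.

    Such sequences always have the form ED A0..BF 80..BF.
    """
    return (
        len(seq) == 3
        and seq[0] == 0xED
        and 0xA0 <= seq[1] <= 0xBF
        and 0x80 <= seq[2] <= 0xBF
    )


def strict_decode_ok(seq: bytes) -> bool:
    """True if Python's strict UTF-8 codec accepts seq."""
    try:
        seq.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = b"\xED\xA0\xBF"  # the bytes used in the printf commands above
print(is_utf8_surrogate(sample), strict_decode_ok(sample))  # prints: True False
```

Python's strict decoder rejects the sequence, which is consistent with mbrlen() returning (size_t) -1 for it.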
IMHO, both grep and sed should be fixed to obey RFC 3629, which
specifies UTF-8. And other tools too (iconv...).

> OTOH, if a character in the surrogate area does not match a period in
> PCRE, I think that the character should not also match a period grep -P.

I agree.

-- 
Vincent Lefèvre <vinc...@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)