On Fri, 19 Dec 2014 23:00:38 +0900
Norihiro Tanaka <nori...@kcn.ne.jp> wrote:

> I also see it is a bug as you say. mbrlen() in glibc returns (size_t) -1
> for the sequence.

$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
Binary file (standard input) matches
$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
$

regex also behaves the same as grep -G; e.g. sed, which uses only regex,
returns the line. Therefore, I think it is not a bug that a character in
the surrogate area matches a period with grep -G, although the behavior
might not conform to a standard.

$ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'

By the way, mbrlen() returns (size_t) -1 for the character.

OTOH, if a character in the surrogate area does not match a period in
PCRE, I think that the character should not match a period with grep -P
either.
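
For reference, here is a small sketch of why strict decoders reject the byte sequence used in the commands above. It only illustrates the encoding arithmetic and is not what grep, regex, or glibc do internally; the helper names (decode_3byte, is_surrogate, strict_utf8_ok) are made up for this example.

```python
def decode_3byte(seq: bytes) -> int:
    """Unpack a 3-byte UTF-8 bit pattern into a code point, with no validity checks."""
    if len(seq) != 3 or seq[0] & 0xF0 != 0xE0 or any(b & 0xC0 != 0x80 for b in seq[1:]):
        raise ValueError("not a 3-byte UTF-8 pattern")
    return ((seq[0] & 0x0F) << 12) | ((seq[1] & 0x3F) << 6) | (seq[2] & 0x3F)


def is_surrogate(cp: int) -> bool:
    """True if the code point lies in the surrogate area U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def strict_utf8_ok(seq: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        seq.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


seq = b"\xED\xA0\xBF"
cp = decode_3byte(seq)
print(f"U+{cp:04X}", is_surrogate(cp), strict_utf8_ok(seq))
# prints: U+D83F True False
```

The bytes have the shape of a valid 3-byte sequence, but the code point they carry is a surrogate, so a strict decoder refuses them. That is consistent with mbrlen() returning (size_t) -1 here.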