bug#23234: unexpected results with charset handling in GNU grep 2.23

Eric Blake Wed, 06 Apr 2016 14:05:44 -0700

On 04/06/2016 01:25 PM, Björn JACKE wrote:
> Let's take this example using grep 2.23:
> 
> # echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?


[As a side point, 'echo -e' is non-portable; better is to use printf.]
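A minimal sketch of the printf form, assuming a POSIX shell. The helper name utf8_sample is only illustrative, and the octal escapes spell out the UTF-8 bytes of "ä" so the result does not depend on the terminal or on how echo treats -e:

```shell
# Octal escapes \303\244 are the UTF-8 encoding of "ä" (U+00E4), so the
# bytes emitted do not depend on the script's encoding or on echo's flags.
utf8_sample() { printf 'test\nt\303\244st\ntest\n'; }

# Count the bytes produced: 5 + 6 + 5 for the three lines.
bytes=$(utf8_sample | wc -c | tr -d ' ')
echo "$bytes"
```

Piping utf8_sample through iconv -f utf8 -t latin1 would then give the same input as in the report.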

Hmm.  POSIX says that a file is binary if it does not end in newline, if
it contains embedded NUL, or if it contains an encoding error.  But it
also says that LC_ALL=C is _required_ to treat all 256 byte values as
valid characters (ASCII is only required to treat 7-bit characters as
valid, and may reject 8-bit bytes, but LC_ALL=C is _not_ ASCII).  This
indeed looks like a bug in current grep.git, as I can reproduce it:

$ git rev-parse HEAD
2ba6ab34da05d3aebc5e7e3dfaedb1cf3ddc5a73
$ printf "test\ntäst\ntest\n" | iconv -f utf8 -t latin1 |
   LC_ALL=C src/grep "st"
test
Binary file (standard input) matches

Looks like we don't have something quite right in claiming that 0xe4 is
not a valid character when in the single-byte C locale.

> I really hope this change will be reverted as soon as possible. I would rather
> prefer GNU grep to become posix compliant and not do any binary detection by
> default actually.

The change of treating encoding errors as binary files will NOT be
reverted, but here, you HAVE pointed out a bug where we are treating
something as binary that is NOT an encoding error (because by
definition, LC_ALL=C has no encoding errors - all 256 byte values are
characters).  So this is indeed a bug to be fixed.
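In the meantime, a possible workaround (my suggestion here, not part of any fix) is the -a (--text) option, which makes grep treat the input as text regardless of what the binary detection decides. The sketch below builds the Latin-1 input directly with an octal escape, so iconv is not needed; the helper name latin1_sample is only illustrative:

```shell
# \344 is the octal escape for byte 0xe4, which is "ä" in Latin-1.
latin1_sample() { printf 'test\nt\344st\ntest\n'; }

# -a (--text) treats the input as text, bypassing binary-file detection;
# -c prints the number of matching lines instead of the lines themselves.
matches=$(latin1_sample | LC_ALL=C grep -a -c 'st')
echo "$matches"
```

All three lines contain "st", so the count should be 3. Once the bug is fixed, the same count should come out in the C locale without -a.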

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
