On 04/06/2016 01:25 PM, Björn JACKE wrote:
> Let's take this example using grep 2.23:
>
> # echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ;
> echo $?

[As a side point, 'echo -e' is non-portable; better is to use printf.]

Hmm. POSIX says that a file is binary if it does not end in newline, if
it contains embedded NUL, or if it contains an encoding error. But it
also says that LC_ALL=C is _required_ to treat all 256 byte values as
valid characters (ASCII is only required to treat 7-bit characters as
valid, and may reject 8-bit bytes, but LC_ALL=C is _not_ ASCII).

This indeed looks like a bug in current grep.git, as I can reproduce it:

$ git rev-parse HEAD
2ba6ab34da05d3aebc5e7e3dfaedb1cf3ddc5a73
$ printf "test\ntäst\ntest\n" | iconv -f utf8 -t latin1 | LC_ALL=C src/grep "st"
test
Binary file (standard input) matches

Looks like we don't have something quite right in claiming that 0xe4 is
not a valid character when in the single-byte C locale.

> I really hope this change will be reverted as soon as possible. I would rather
> prefer GNU grep to become posix compliant and not do any binary detection by
> default actually.

The change of treating encoding errors as binary files will NOT be
reverted, but here, you HAVE pointed out a bug where we are treating
something as binary that is NOT an encoding error (because by
definition, LC_ALL=C has no encoding errors - all 256 byte values are
characters). So this is indeed a bug to be fixed.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
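
For anyone who wants to retry the reproduction without depending on iconv or on the terminal's encoding, here is a minimal sketch. It assumes a POSIX sh and printf, which accept octal escapes in the format string, so the Latin-1 byte 0xe4 can be written directly as \344. The function name latin1_sample is made up for illustration. What grep prints depends on the grep and libc versions installed, so no expected output is shown.

```shell
#!/bin/sh
# Emit "test", "t<0xE4>st", "test", one per line.
# \344 is octal for 0xE4, the Latin-1 encoding of 'ä'.
# (latin1_sample is a hypothetical helper name, not part of grep.)
latin1_sample() {
    printf 'test\nt\344st\ntest\n'
}

# Dump the raw bytes so the 0xe4 byte can be confirmed by eye.
latin1_sample | od -An -tx1

# Run the search in the C locale and report grep's exit status.
latin1_sample | LC_ALL=C grep 'st'
echo "exit status: $?"
```

If all 256 byte values are treated as characters in the C locale, grep should print the three matching lines rather than a "Binary file ... matches" notice.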