bug#23234: unexpected results with charset handling in GNU grep 2.23

Björn JACKE Wed, 06 Apr 2016 13:45:31 -0700

Hi,

this change in GNU grep 2.23 has severe consequences:


> Binary files are now less likely to generate diagnostics and more
> likely to yield text matches.  grep now reports "Binary file FOO
> matches" and suppresses further output instead of outputting a line
> containing an encoding error; hence grep can now report matching text
> before a later binary match.  Formerly, grep reported FOO to be
> binary when it found an encoding error in FOO before generating
> output for FOO, which meant it never reported both matching text and
> matching binary data; this was less useful for searching text
> containing encoding errors in non-matching lines.

I got a report that the build of the German spellcheck dictionary got broken.
It turned out that this happened after the update of GNU grep to 2.23:

https://bugzilla.redhat.com/show_bug.cgi?id=1316359

Actually, the mentioned change leaves no reliable way to grep lines out of
any text file that contains non-ASCII characters.

Until now it was quite safe to use grep in the C locale, also for non-ASCII
text. Now, after that change, the locale charmap has to match the encoding of
the input file. Unfortunately the only locale that is guaranteed to exist is
the C locale. We cannot assume that any other locale definitions exist on an
unknown system. For a script that wants to use grep, this is a big problem
now.

Let's take this example using grep 2.23:

# echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?
test
Binary file (standard input) matches
0

There are several problems here. Someone might want to assume that the locale
definitions for en_US.ISO-8859-1 exist. Unfortunately such an assumption cannot
be made. Whatever locale is requested, its definition might not be there, and
we will then fall back to the C locale in any case.

The result is that we get the first matching line in the example. Instead of
the second matching line, which contains a non-ASCII character, grep prints
the text "Binary file (standard input) matches" on stdout (which might even be
a valid matching line of the input file!), and the following matches are
skipped. (Finally, the return code is 0. As the grepping stopped early, a
return code >1 might be desirable, but I don't want to dive into that point
right now.)
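
For anyone who wants to reproduce the input without depending on iconv or
"echo -e", here is a minimal sketch. The helper name make_input is only
illustrative. The sketch also uses GNU grep's -a (--binary-files=text) option,
which asks grep to treat the input as text. Note that -a is not part of POSIX
grep, so this is at best a stopgap for scripts that can already rely on GNU
grep, not an answer to the portability problem described above.

```shell
# Hypothetical helper: emit the three test lines, with the second one
# encoded as ISO-8859-1 (\344 is the octal escape for byte 0xE4, "ä").
make_input() {
    printf 'test\nt\344st\ntest\n'
}

# -a is a GNU extension (not POSIX): treat the input as text, so no
# "Binary file ... matches" message is printed in place of a match.
# -c prints the number of matching lines instead of the lines themselves.
count=$(make_input | LC_ALL=C grep -a -c 'st')
echo "matching lines: $count"
```

With -a, all three input lines should be counted as matches, whereas the
plain "LC_ALL=C grep" call in the transcript above printed only the first one.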


Let me draw a bigger picture: Have a look at what a POSIX compliant grep is
expected to do:
http://pubs.opengroup.org/onlinepubs/009604499/utilities/grep.html

Read the description section, especially:

--snip--
By default, an input line shall be selected if any pattern, treated as an
entire basic regular expression (BRE) as described in the Base Definitions
volume of IEEE Std 1003.1-2001, Section 9.3, Basic Regular Expressions, matches
any part of the line excluding the terminating <newline>;
--snap--

That means a POSIX compliant grep should not try to be too smart and tell the
user that a binary file matches the search pattern (people can use "strings" if
they want). It should just output the line. From that perspective GNU grep was
not POSIX compliant before either, but that was obviously not a big problem for
most people. With the recent change and the issues described above, though, I
think a lot of scripts using (GNU) grep will break.

I really hope this change will be reverted as soon as possible. Actually, I
would prefer GNU grep to become POSIX compliant and not do any binary
detection by default.

Cheers
Björn
