Hi, this change in GNU grep 2.23 has severe consequences:
> Binary files are now less likely to generate diagnostics and more
> likely to yield text matches. grep now reports "Binary file FOO
> matches" and suppresses further output instead of outputting a line
> containing an encoding error; hence grep can now report matching text
> before a later binary match. Formerly, grep reported FOO to be
> binary when it found an encoding error in FOO before generating
> output for FOO, which meant it never reported both matching text and
> matching binary data; this was less useful for searching text
> containing encoding errors in non-matching lines.

I got a report that the build of the German spellcheck dictionary was
broken. It turned out that this happened after the update of GNU grep
to 2.23:

https://bugzilla.redhat.com/show_bug.cgi?id=1316359

The change mentioned above leaves no reliable way to grep lines out of
any text file that contains non-ASCII characters. Until now it was
quite safe to use grep in the C locale, even for non-ASCII text. After
this change, the locale charmap has to match the encoding of the input
file. Unfortunately, the only locale that is guaranteed to exist is the
C locale. We cannot assume that any other locale definitions exist on
an unknown system. For a script that wants to use grep, this is now a
big problem.

Let's take this example using grep 2.23:

  # echo -e "test\ntäst\ntest" | iconv -f utf8 -t latin1 | LC_ALL=C grep "st" ; echo $?
  test
  Binary file (standard input) matches
  0

There are several problems here.

Someone might want to assume that the locale definitions for
en_US.ISO-8859-1 exist. Unfortunately, such an assumption cannot be
made. Whatever locale is requested, its definition might not be
installed, and we then fall back to the C locale anyway.

The result is that we get the first matching line in the example.
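
For anyone who wants to check how a particular grep behaves, the example can also be built without `echo -e` or `iconv`. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a POSIX shell whose `printf` accepts octal escapes (`\344` is the byte for "ä" in ISO-8859-1).

```shell
#!/bin/sh
# Hypothetical helpers, not part of the original report.

# Emit three lines; the middle one contains byte 0xE4 ("ä" in ISO-8859-1),
# which is not valid on its own in UTF-8 and is not ASCII.
sample_input() {
    printf 'test\nt\344st\ntest\n'
}

# Run grep in the C locale on that input, as in the report's example.
# Prints whatever grep writes to stdout.
grep_c_locale() {
    sample_input | LC_ALL=C grep 'st'
}

# Count the lines grep wrote to stdout. A grep that prints every matching
# line yields 3; a smaller number means some output was suppressed.
count_output_lines() {
    grep_c_locale | wc -l | tr -d ' '
}
```

Calling `grep_c_locale; echo $?` shows the raw output and exit status. According to the example in this report, grep 2.23 prints `test` followed by the "Binary file (standard input) matches" notice; other grep versions and implementations may behave differently.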
The second matching line, which contains a non-ASCII character, produces
the text "Binary file (standard input) matches" on stdout (which might
even be a valid matching line of the input file!), and the following
matches are skipped. (Finally, the return code is 0. As the grepping
stopped early, a return code >1 might be desirable, but I don't want to
dive into that point right now.)

Let me draw a bigger picture. Have a look at what a POSIX-compliant grep
is expected to do:

http://pubs.opengroup.org/onlinepubs/009604499/utilities/grep.html

Read the description section, especially:

--snip--
By default, an input line shall be selected if any pattern, treated as
an entire basic regular expression (BRE) as described in the Base
Definitions volume of IEEE Std 1003.1-2001, Section 9.3, Basic Regular
Expressions, matches any part of the line excluding the terminating
<newline>;
--snap--

That means a POSIX-compliant grep should not try to be too smart and
tell the user that a binary file matches the search pattern (people can
use "strings" if they want). It should just output the line. From that
perspective GNU grep was not POSIX compliant before either, but that was
obviously not a big problem for most people.

With the recent change and the issues described above, though, I think a
lot of scripts using (GNU) grep will break.

I really hope this change will be reverted as soon as possible. Actually,
I would prefer GNU grep to become POSIX compliant and not do any binary
detection by default.

Cheers
Björn