Paul Eggert wrote:
> Perhaps we can improve the behavior of grep by changing its heuristic
> slightly.
> Currently grep reports "Binary file FOO matches" if it finds binary
> data in FOO before it finds the first match. Instead, perhaps we
> could change grep to report "Binary file FOO matches" when it sees
> that it's about to generate binary *output* copied from FOO,
> regardless of whether this output represents the first match. That
> is, when grep sees that it's about to output binary
> data, grep instead outputs "Binary file FOO matches" and then stops
> output for FOO (even if it already output some lines for ordinary
> matches in FOO).

Another option would be to escape the problematic binary data (but how
would we escape the escape character?) or maybe even replace it with
U+FFFD if our output is UTF-8 (though this has its own problems when
trying to determine what was really matched).

> This approach would fix the problem of grep trashing the output
> stream, and it should be less drastic than grep's current approach,
> in that it would make grep more likely to do what Kamil Dudka is
> asking for (assuming grep is given mostly valid input interspersed
> with small amounts of binary data).

+1

When grep is the last component of a pipeline, it isn't too bad. The
danger comes when grep feeds later stages of a pipeline instead.
Sebastian's Makefile is one such case.

Another silly example: we might have a list of people and be interested
in knowing how many of their names begin with J (excluding pseudonyms):

    printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > defendants-2015-05-15
    grep ^J defendants-2015-05-* | sort -u | grep -vc "John Doe"

This works perfectly, until the day someone provides an incorrectly
encoded entry:

    printf 'Pedro P\xe9rez\n' >> defendants-2015-05-15

and havoc ensues. It's something that should never happen, but someone
else prepared the file for you, or it comes from a third party (and
sometimes it only makes sense for the files to be ANSI, yet one day
unencoded high bytes turn up).
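
For what it's worth, here is a minimal sketch of how a script can protect
itself today, independently of whatever heuristic grep ends up with. It
is only one possible workaround, not something proposed earlier in this
thread: it forces the C locale and passes -a (--text) so grep always
emits the matching lines. The scratch directory is an assumption made so
the example does not touch real data, and the stray byte is written in
octal (\351 is the same byte as \xe9) because not every printf
understands \x escapes.

```shell
# Hypothetical scratch directory, used only for this illustration.
dir=$(mktemp -d)
cd "$dir" || exit 1

printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > defendants-2015-05-15
# Append the badly encoded entry: \351 is the byte 0xE9 (Latin-1 "e acute").
printf 'Pedro P\351rez\n' >> defendants-2015-05-15

# -a (--text) makes grep treat the file as text, so the matching lines are
# never replaced by a "Binary file ... matches" message.
# LC_ALL=C makes every byte a valid character for grep and sort alike.
count=$(LC_ALL=C grep -a '^J' defendants-2015-05-* | LC_ALL=C sort -u | grep -vc 'John Doe')
echo "$count"   # prints 2 (John Smith and Johannes Meixner)
```

The trade-off is that -a will happily copy genuinely binary lines to the
output when they match, which is exactly the "trashing the output
stream" problem, so this only makes sense when the pattern cannot match
the damaged lines or the consumer can cope with raw bytes.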