bug#20526: BUG: text file is detected as binary

Ángel González Wed, 20 May 2015 20:59:06 -0700

Paul Eggert wrote:
> Perhaps we can improve the behavior of grep by changing its heuristic
> slightly.  Currently grep reports "Binary file FOO matches" if it
> finds binary data in FOO before it finds the first match.  Instead,
> perhaps we could change grep to report "Binary file FOO matches" when
> it sees that it's about to generate binary *output* copied from FOO,
> regardless of whether this output represents the first match.  That
> is, when grep sees that it's about to output binary data, grep
> instead outputs "Binary file FOO matches" and then stops output for
> FOO (even if it already output some lines for ordinary matches in
> FOO).



Another option would be to escape the problematic binary data (but how
would we escape the escape character itself?) or maybe even replace it
with U+FFFD if our output is UTF-8 (though that makes it harder to
determine what was really matched).


> This approach would fix the problem of grep trashing the output 
> stream, and it should be less drastic than grep's current approach, 
> in that it would make grep more likely to do what Kamil Dudka is 
> asking for (assuming grep is given mostly valid input interspersed 
> with small amounts of binary data).


+1

When grep is the last component of a pipeline, it isn't too bad. The
danger comes when grep sits in the middle of a pipeline and its output
feeds other commands. Sebastian's Makefile is one such case. Another
silly example: we might have a list of people and be interested in
knowing how many of their names begin with J (excluding pseudonyms):

 printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > defendants-2015-05-15
 grep ^J defendants-2015-05-* | sort -u | grep -vc "John Doe"

works perfectly, until the day someone provides an incorrectly encoded entry:
 printf 'Pedro P\xe9rez\n' >> defendants-2015-05-15
and havoc ensues.
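
For what it's worth, here is a sketch of one workaround for the example
above. It assumes GNU grep, whose -a (--text) option processes a binary
file as if it were text. The temporary directory and the count variable
are only there for illustration, and the stray byte is written in octal
(\351, the same byte as \xe9) so that any printf accepts it.

```shell
# Sketch only: assumes GNU grep, where -a (--text) treats binary files as text.
dir=$(mktemp -d)
printf 'John Smith\nJohannes Meixner\nPaul Eggert\nJohn Doe\n' > "$dir/defendants-2015-05-15"
# Append the entry containing a raw high byte that is not valid UTF-8.
printf 'Pedro P\351rez\n' >> "$dir/defendants-2015-05-15"
# LC_ALL=C keeps grep's byte handling independent of the caller's locale.
count=$(LC_ALL=C grep -a '^J' "$dir"/defendants-2015-05-* | sort -u | grep -vc 'John Doe')
echo "$count"
```

This should print 2 (John Smith and Johannes Meixner). The price is the
one Paul mentions: if a matching line did contain the stray byte, -a
would pass it straight through to the next command in the pipeline.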

It's something that should never happen, but sometimes someone else
prepared the file for you, or it comes from a third party (and
sometimes it only makes sense for such files to be ANSI, yet one day
there are unencoded high bytes).
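
One way to catch such a file before it ever reaches grep would be to
validate its encoding first. A minimal sketch, assuming the files are
supposed to be UTF-8 and that iconv exits with a non-zero status on an
invalid input sequence (the glibc one does); is_valid_utf8 is just a
name I made up:

```shell
# Hypothetical helper: succeeds only if the file converts cleanly from UTF-8 to UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

dir=$(mktemp -d)
file=$dir/defendants-2015-05-15
printf 'John Smith\nJohn Doe\n' > "$file"
if is_valid_utf8 "$file"; then before=valid; else before=invalid; fi
# Append an entry with a lone high byte (octal \351), which is not valid UTF-8.
printf 'Pedro P\351rez\n' >> "$file"
if is_valid_utf8 "$file"; then after=valid; else after=invalid; fi
echo "$before $after"
```

A pipeline could run such a check first and refuse to count anything
when it fails, instead of silently producing a wrong number.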
