The new heuristic for detecting 'Binary files' should be reverted to the
old one (before 2.20), as the new one has too great a potential to make
important tasks fail silently.
One of the most important use cases of grep is processing file lists,
e.g. in the pipeline find | grep | tar. This is often done by backup
software, e.g. in the Debian package 'backup2l'.
The new behaviour of grep -- to output 'Binary file matches' after
output has started -- has silently broken the 'backup2l' script and has the
potential of silently breaking many other backup scripts as well.
Test case:
$ find /etc/ssl/certs/ | LANG= grep pem
Outcome:
grep will stop with 'Binary file (standard input) matches' after
outputting a small percentage of the existing .pem files.
Expected behaviour:
grep should list all .pem files.
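The test case above depends on what happens to be in /etc/ssl/certs.
A self-contained variant can be sketched as follows. The file names
are made up for illustration, and whether the plain grep call stops
early depends on the grep version and the locale in effect. The -a
call should print every match regardless.

```shell
# Build a small file list in which one name contains a byte (0xFF)
# that is neither ASCII nor valid UTF-8. The names are hypothetical.
list=$(mktemp)
printf 'certs/a.pem\ncerts/b\377.pem\ncerts/c.pem\n' > "$list"

# Plain grep: with an affected grep version, the output may end in
# 'Binary file (standard input) matches' instead of listing all names.
LANG= grep pem < "$list"

# With -a (--text), grep treats the input as text, so every matching
# line is counted.
matched=$(LANG= grep -a -c pem < "$list")
echo "matches with -a: $matched"

rm -f "$list"
```

If the first grep prints fewer than three names while the second
reports three matches, the installed grep shows the behaviour
described here.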
This silent truncation is particularly insidious because users may not
notice that their backup archives are a bit smaller than before or that
their backups complete a bit faster, while many thousands of files may be
missing.
Q: Why do you use LANG= ?
A: To illustrate the problem and because 'backup2l' does that.
Q: Why don't people use the -a switch?
A: People may not notice anything wrong with their backups until they
need them.
Q: Why don't you file a bug against 'backup2l'?
A: I will. But this is such a common use case that I suspect that many
of the backup scripts that people wrote just for themselves are now broken.
Q: Why don't you just set the correct locale?
A: Even then it suffices to have one bogus-encoded filename somewhere to
break your whole backup. It is easy to pick up such a filename from the
internet or from song or picture metadata.
Regards
--
Marcello Perathoner