The new heuristic for detecting 'Binary files' should be reverted to the old one (before 2.20), as the new one has too great a potential to make important tasks fail silently.

One of the most important use cases of grep is processing file lists,
e.g. in the pipe: find | grep | tar. This is often done by backup software, e.g. in the Debian package 'backup2l'.

The new behaviour of grep -- to print 'Binary file (standard input) matches' and stop after output has already started -- has silently broken the 'backup2l' script and has the potential to silently break many other backup scripts as well.


Test case:

$ find /etc/ssl/certs/ | LANG= grep pem

Outcome:

grep will stop with 'Binary file (standard input) matches' after outputting a small percentage of the existing .pem files.

Expected behaviour:

grep should list all .pem files.
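
For readers without a suitable /etc/ssl/certs, the same situation can be sketched with a made-up file list. The three names and the helper make_list are hypothetical, and whether the truncation actually shows up depends on the grep version and on the locale settings (a set LC_ALL overrides LANG=).

```shell
# Hypothetical file list: three .pem names. The second one contains
# the single byte 0xE9 (octal 351), which is not valid UTF-8 or ASCII.
make_list() {
    printf 'certs/a.pem\ncerts/caf\351.pem\ncerts/b.pem\n'
}

# Same pipe as in the test case above, fed from the made-up list.
# An affected grep reports a binary match instead of listing all
# three names.
make_list | LANG= grep pem
```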


This behaviour is particularly insidious because users may not notice that their backup archives are a bit smaller than before or that their backups complete a bit faster, while many thousands of files may be missing.



Q: Why do you use LANG= ?

A: To illustrate the problem and because 'backup2l' does that.

Q: Why don't people use the -a switch?

A: People may not notice anything wrong with their backups until they need them.
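
For a script that someone does know to fix, the -a switch would be used roughly as follows. This is only a sketch: the function name filter_pem and the file names are made up, and it does nothing for the scripts nobody knows are broken.

```shell
# Hypothetical filter step of a backup pipe. -a makes grep treat the
# input as text, so it is not reported as a binary file.
filter_pem() {
    LANG= grep -a pem
}

# File list with one bogus-encoded name (byte 0xE9, octal 351).
# All three names pass through the filter.
printf 'certs/a.pem\ncerts/caf\351.pem\ncerts/b.pem\n' | filter_pem
```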

Q: Why don't you file a bug against 'backup2l'?

A: I will. But this is such a common use case that I suspect that many of the backup scripts that people wrote just for themselves are now broken.

Q: Why don't you just set the correct locale?

A: Even then, a single file name with a bogus encoding somewhere is enough to break your whole backup. It is easy to pick up such a file from the internet or from song or picture metadata.



Regards

--
Marcello Perathoner



