bug#22838: New 'Binary file' detection considered harmful

Eric Blake Mon, 29 Feb 2016 14:38:21 -0800

On 02/29/2016 01:11 PM, Marcello Perathoner wrote:

>> Yes, locale dependencies on standard behavior can be annoying.
>>
> 
> You assume that a user will only ever want to grep text files encoded in
> the machine's locale. That is not so.


You've been relying on undefined behavior, and it caught up with you.
It's the same as asking us to keep use-after-free "working" in a
multithreaded program because it has always "worked" in your older
single-threaded program when nothing was perturbing the memory between
free() and its latent use.  A latent bug in your usage is still a bug in
your usage, even if it took a change in grep's defaults to expose your
problem.

Meanwhile, grep 2.23 has improved the heuristics: it now complains
about a binary file only when it would otherwise output encoding
errors, rather than blindly complaining about the encoding error up
front and stopping processing immediately. That alleviates some of the
worst effects of the change exposed by your undefined usage: you can
still grep for validly encoded text and get reasonable results, so long
as the valid text doesn't share lines with invalid encodings.
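
As a rough illustration of that heuristic (a sketch only, not grep's
actual implementation), the following Python models "only complain when
an encoding error would reach the output". The function name grep_like,
the default encoding, and the exact notice text are assumptions made
for this example.

```python
import re

def grep_like(lines, pattern, encoding="utf-8"):
    """Return matching lines as text, mimicking the described heuristic.

    `lines` is an iterable of bytes. Non-matching lines are never
    decoded, so invalid bytes in them are harmless. Output stops with a
    notice only when a *matching* line cannot be decoded.
    """
    regex = re.compile(pattern.encode(encoding))
    out = []
    for raw in lines:
        if not regex.search(raw):
            continue  # not selected for output, so encoding is irrelevant
        try:
            out.append(raw.decode(encoding).rstrip("\n"))
        except UnicodeDecodeError:
            # Printing this line would emit an encoding error: give up here.
            out.append("Binary file matches")
            break
    return out

# ISO-8859-1 bytes (\xfc, \xdf) are invalid as UTF-8.
sample = [
    b"gr\xfc\xdfe\n",        # invalid, but does not match: ignored
    b"hello world\n",        # valid and matching: printed
    b"hello gr\xfc\xdfe\n",  # matching but invalid: triggers the notice
    b"hello again\n",        # never reached
]
result = grep_like(sample, "hello")
```

With this sample, the valid matching line is returned first, and the
notice only appears once a matching line with invalid bytes is reached.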

> 
> As a German user I have on my disk files in many encodings: utf-8,
> iso-8859-1, win-1252, iso-8859-15, encodings that are now defunct like
> CP850, CP847, "German 7-bit ASCII" that replaced braces with Umlauts,
> old WordStar files that used control characters inside.
> 
> Since 2.21 I will now have to always specify -a or LC_ALL=C when
> grepping my files.

Yes, but then you are no longer relying on undefined behavior, and
therefore have a leg to stand on if we break that behavior.
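
To show why byte-oriented matching sidesteps the problem, here is a
minimal Python analogy to what -a or LC_ALL=C ask for: every byte is
treated as data, so no line can be "invalidly encoded". This is an
analogy under stated assumptions, not a description of grep internals;
grep_bytes is a hypothetical helper name.

```python
import re

def grep_bytes(lines, pattern):
    """Match a bytes pattern against bytes lines without any decoding."""
    regex = re.compile(pattern)  # bytes pattern, so matching is per byte
    return [raw for raw in lines if regex.search(raw)]

# The second line contains ISO-8859-1 bytes that are not valid UTF-8.
mixed = [b"hello world\n", b"hello gr\xfc\xdfe\n", b"other\n"]
matches = grep_bytes(mixed, rb"hello")

# If the encoding of a file is known, decoding it explicitly is the
# other well-defined option.
decoded = matches[1].decode("iso-8859-1").rstrip("\n")
```

Both matching lines come back unchanged as bytes, and the caller decides
which encoding, if any, to apply afterwards.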

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
