bug#19242: latest grep considers text files as binary

Paul Eggert Sun, 22 Mar 2015 17:43:36 -0700

Thomas Wolff wrote:

Hi Paul and Jim,


Thanks for your previous quick responses on this matter and excuse my very late
additional statement.

However, the arguments are not convincing.
The new behavior violates the principle of least astonishment which is well
established in software design.

That cuts both ways. Older versions of grep could dump core when given improperly encoded text, which is even more astonishing. The new version is an improvement in that particular area. It is not clear how grep could be modified to avoid the core dumps while still preserving the old behavior in question.

It is not convincing that a text file is not considered a text file for a few
bytes that are not properly encoded in the current locale. Also the quoted POSIX
clause does not support that claim.

Not by itself, but from the chain of definitions it's clear that a text file must contain properly encoded text. The quoted POSIX clause (3.397) says that a text file contains "characters", and an earlier clause (3.87) defines "character" to be "A sequence of one or more bytes representing a single graphic symbol or control code. Note: This term corresponds to the ISO C standard term multi-byte character".


http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87

Because encoding errors are not characters, they are not text.
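
As a rough illustration of that chain of definitions, here is a minimal sketch in Python. It is not how grep is implemented; the helper name is_text is made up for this example, and UTF-8 is assumed as the locale's encoding. It models only the "properly encoded" part of the definition and ignores the other requirements on text files.

```python
def is_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Return True if all of data decodes to characters in the given encoding.

    Hypothetical helper: it checks only for encoding errors and does not
    look at NUL bytes or line lengths.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        # At least one byte sequence is not a character in this encoding.
        return False
    return True


# "café" encoded as UTF-8 is a sequence of characters.
print(is_text("café\n".encode("utf-8")))

# The same word encoded as Latin-1 contains a lone 0xE9 byte, which is
# an encoding error when the data is read as UTF-8.
print(is_text("café\n".encode("latin-1")))
```

Under this model a single stray byte is enough to make the whole input "not text", which is the situation the original report describes.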

And, considering the "pipe security" argument, shall all classic Unix tools now
get additional options -a, so that something like
     grep 'bla' | sed -e 'expr' | tr '' '' | grep -v 'argl'
would in future look like
     grep -a 'bla' | sed -a -e 'expr' | tr -a '' '' | grep -a -v 'argl'
?


It shouldn't be needed for tr, as tr's input is not required to be a text file.

GNU sed doesn't worry about whether files are text or binary. I expect this is because the problem of spitting out random binary data tends to be less of an issue for 'sed' in practice. However, portable scripts should not assume that 'sed' will work on arbitrary binary data.

What about backwards compatibility of scripts then?
This is breaking decades of Unix tradition of modular tools for the mere
dogmatics of some peculiar and strict locale theory.

UTF-8 does tend to have that effect, yes. From the traditional Unix point of view, patterns like 'a.b' are "broken" with modern grep in UTF-8 locales, since the "." no longer matches only single bytes. This has been true for decades, not just for 'grep' but also for 'sed' etc. These days, though, users tend to be more interested in dealing with multibyte characters than in insisting on circa-1977 semantics in all cases.
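
To make the 'a.b' point concrete, here is a small sketch in Python. Using Python's re module as a stand-in for grep is an assumption of this example, not a description of grep's internals: a bytes pattern gives the byte-oriented semantics, and a str pattern gives the character-oriented semantics.

```python
import re

line = "aéb"                  # 'é' occupies two bytes in UTF-8: 0xC3 0xA9
raw = line.encode("utf-8")    # four bytes in total

# Byte-oriented matching: "." consumes exactly one byte, so "a.b" cannot
# span the two-byte 'é' and there is no match.
byte_match = re.search(rb"a.b", raw)

# Character-oriented matching: "." consumes one character, however many
# bytes it occupies, so "a.b" matches the whole line.
char_match = re.search(r"a.b", line)

print(byte_match is None, char_match is not None)
```

The two interpretations disagree on the same input, so a script written against one of them can change behavior under the other.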

If you insist on this priority of locale strategy over Unix tradition,
please offer at least a compatibility option that does not break scripts,
i.e. an environment setting that enforces compatible behaviour (like other tools
have, e.g. LS_COLORS etc).


Instead of an environment variable I suggest using a script.  Please see:

http://bugs.gnu.org/19998#8

As a last remark, I wonder why my report does not show up in
http://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
and apparently I cannot submit anything there myself. Please get the issue
documented there.

I unarchived that bug report and am quoting the entire new part of your message, which should do the trick.

Kind regards,
Thomas
