Thomas Wolff wrote:
> Hi Paul and Jim,
> Thanks for your previous quick responses on this matter and excuse my very late
> additional statement.
> However, the arguments are not convincing.
> The new behavior violates the principle of least astonishment which is well
> established in software design.

That cuts both ways. Older versions of grep could dump core when given
improperly encoded text, which is even more astonishing. The new version is an
improvement in that particular area. It is not clear how grep could be modified
to avoid the core dumps while still preserving the old behavior in question.

> It is not convincing that a text file is not considered a text file for a few
> bytes that are not properly encoded in the current locale. Also the quoted POSIX
> clause does not support that claim.

Not by itself, but from the chain of definitions it's clear that a text file
must contain properly encoded text. The quoted POSIX clause (3.397) says that a
text file contains "characters", and an earlier clause (3.87) defines
"character" to be "A sequence of one or more bytes representing a single graphic
symbol or control code. Note: This term corresponds to the ISO C standard term
multi-byte character".
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_87
Because encoding errors are not characters, they are not text.
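
To make that chain of definitions concrete, here is a minimal Python sketch of the idea. It is not grep's actual implementation; the helper name is_text is hypothetical, and UTF-8 is assumed as the locale's encoding. Data counts as text here only if every byte sequence decodes to a character.

```python
def is_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Hypothetical helper: True if all of data decodes as characters."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        # At least one byte sequence is an encoding error, not a character.
        return False
    return True


# Properly encoded UTF-8: every byte sequence represents a character.
print(is_text("na\u00efve\n".encode("utf-8")))   # True

# The same word in Latin-1: the lone byte 0xEF is an encoding error in UTF-8.
print(is_text(b"na\xefve\n"))                    # False

# The identical bytes are text again under a Latin-1 interpretation.
print(is_text(b"na\xefve\n", "latin-1"))         # True
```

The last call shows that the classification depends on the encoding in effect, not on the bytes alone.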

> And, considering the "pipe security" argument, shall all classic Unix tools now
> get additional options -a, so that something like
>   grep 'bla' | sed -e 'expr' | tr '' '' | grep -v 'argl'
> would in future look like
>   grep -a 'bla' | sed -a -e 'expr' | tr -a '' '' | grep -a -v 'argl'
> ?

It shouldn't be needed for tr, as tr's input is not required to be a text file.
GNU sed doesn't worry about whether files are text or binary. I expect this is
because the problem of spitting out random binary data tends to be less of an
issue for 'sed' in practice. However, portable scripts should not assume that
'sed' will work on arbitrary binary data.

> What about backwards compatibility of scripts then?
> This is breaking decades of Unix tradition of modular tools for the mere
> dogmatics of some peculiar and strict locale theory.

UTF-8 does tend to have that effect, yes. From the traditional Unix point of
view, patterns like 'a.b' are "broken" with modern grep in UTF-8 locales, since
the "." no longer matches only single bytes. This has been true for decades,
not just for 'grep' but also for 'sed' etc. These days, though, users tend to
be more interested in dealing with multibyte characters than in insisting on
circa-1977 semantics in all cases.
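
The difference is easy to see in a small sketch. Python's re module stands in for grep here, which is an assumption made purely for illustration: a str pattern gives character semantics, as in a UTF-8 locale, while a bytes pattern gives the traditional one-byte-per-"." semantics.

```python
import re

text = "a\u00e9b"            # three characters: 'a', e-acute, 'b'
raw = text.encode("utf-8")   # four bytes: the e-acute takes two bytes in UTF-8

# Character semantics: "." matches one character, however many bytes it has.
print(bool(re.fullmatch(r"a.b", text)))    # True

# Byte semantics: "." matches exactly one byte, so 'a.b' no longer matches.
print(bool(re.fullmatch(rb"a.b", raw)))    # False

# Under byte semantics the pattern needs one "." per byte of the character.
print(bool(re.fullmatch(rb"a..b", raw)))   # True
```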

> If you insist on this priority of locale strategy over Unix tradition,
> please offer at least a compatibility option that does not break scripts,
> i.e. an environment setting that enforces compatible behaviour (like other tools
> have, e.g. LS_COLORS etc).

Instead of an environment variable I suggest using a script. Please see:
http://bugs.gnu.org/19998#8

> As a last remark, I wonder why my report does not show up in
> http://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
> and apparently I cannot submit anything there myself. Please get the issue
> documented there.

I unarchived that bug report and am quoting the entire new part of your message,
which should do the trick.

> Kind regards,
> Thomas