Martin Hoch wrote:
> I noticed that grep 2.21-1 regards ISO-8859-15 encoded files as binary, if
> LC_ALL is set to en_US.UTF.
> I am not sure if this is a bug or an expected behaviour change in 2.21-1.
It's an expected change. Although this was documented in NEWS:
  If a file contains data improperly encoded for the current locale,
  and this is discovered before any of the file's contents are output,
  grep now treats the file as binary.
the grep manual is not so clear about it. I installed the attached patch to try
to fix that.
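To make the reported behaviour concrete, here is a rough Python sketch of the heuristic as the patch describes it. It is only an illustration, not grep's actual code: the helper name looks_binary and its parameters are made up for this example, and it ignores the rule that only data read before a line is selected for output is considered.

```python
def looks_binary(data: bytes, encoding: str = "utf-8", null_data: bool = False) -> bool:
    """Approximate the documented rule (hypothetical helper, not grep's code)."""
    # A null byte counts as non-text unless null-terminated input
    # (the -z / --null-data case) is assumed.
    if not null_data and b"\x00" in data:
        return True
    # Bytes that are improperly encoded for the locale's encoding
    # also count as non-text.
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


# "café" encoded as ISO-8859-15: the byte 0xE9 is not valid UTF-8 here.
latin9_line = b"caf\xe9\n"

print(looks_binary(latin9_line, "utf-8"))        # True
print(looks_binary(latin9_line, "iso-8859-15"))  # False
print(looks_binary(b"a\x00b\n", "utf-8"))        # True
```

The same bytes are flagged under a UTF-8 encoding but not under ISO-8859-15, which matches what the report observed when LC_ALL selected a UTF-8 locale.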
From 9ae1e287730366f49a08b09027f9dc65254d1bf9 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Mon, 15 Dec 2014 23:09:27 -0800
Subject: [PATCH] doc: document binary-data heuristic better
Problem reported by Martin Hoch in: http://bugs.gnu.org/19388
* doc/grep.texi (File and Directory Selection):
Document what non-text bytes are.
(Usage): Fix cross reference.
---
doc/grep.texi | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/doc/grep.texi b/doc/grep.texi
index 63016bd..acd5be8 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -596,6 +596,11 @@ If a file's allocation metadata,
or if its data read before a line is selected for output,
indicate that the file contains binary data,
assume that the file is of type @var{type}.
+Non-text bytes indicate binary data; these are either data bytes
+improperly encoded for the current locale, or null bytes when the
+@option{-z} (@option{--null-data}) option is not given (@pxref{Other
+Options}).
+
By default, @var{type} is @samp{binary},
and @command{grep} normally outputs either
a one-line message saying that a binary file matches,
@@ -1721,8 +1726,8 @@ Standard grep cannot do this, as it is fundamentally line-based.
Therefore, merely using the @code{[:space:]} character class does not
match newlines in the way you might expect.
-With the GNU @command{grep} option @code{-z} (@pxref{File and
-Directory Selection}), the input is terminated by null bytes. Thus,
+With the GNU @command{grep} option @option{-z} (@option{--null-data}), each
+input ``line'' is terminated by a null byte; @pxref{Other Options}. Thus,
you can match newlines in the input, but typically if there is a match
the entire input is output, so this usage is often combined with
output-suppressing options like @option{-q}, e.g.:
--
1.9.3