On 2024-09-20 22:41, Paul Eggert wrote:
I have the sneaking suspicion that the script is assuming properties of 'grep' that are not documented and that are not guaranteed.

In looking into the code a bit more, I can see some places where that is what is happening.

A couple of things.

First, grep 3.11 uses buffer sizes that depend on earlier files that it has scanned, and this affects whether grep decides later files are binary. This can lead to the sort of confusion that you mentioned. There are performance reasons to think that grep should not grow buffer sizes for later files merely because earlier files had very long lines, as huge buffers can hurt performance; so I installed onto the development repository on Savannah the first attached patch to fix that. As a side effect this may fix the symptoms you observed.

Second, 'grep' is not a good tool for determining whether a file is text or binary, since the definition of "text" vs "binary" is application-specific and grep's definition is suitable for 'grep' and it's problematic to use it elsewhere. I installed the second attached patch to try to document this better.

Hope this helps.

Boldly closing this bug as fixed; if I'm wrong we can reopen it.
From 8fb15fb5bff35ff069ce108b2ff987dc7183de37 Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sat, 21 Sep 2024 12:32:53 -0700
Subject: [PATCH 1/2] grep: avoid huge reads

The previous code could call 'read' with a nearly unbounded size
if the input had long lines, and this unbounded size persisted
from one file to the next once the input buffer grew.
This could have bad effects on the CPU's data cache,
and also could cause 'grep' to make counterintuitive decisions as
to whether a file is binary <https://bugs.gnu.org/73360>.
Instead, pick a good read size and stick with it; this is
more consistent, and more likely to fit in a cache.
* src/grep.c (good_readsize): New static var.
(GOOD_READSIZE_MIN): Rename from INITIAL_BUFSIZE.  All uses changed.
(fillbuf): Read good_readsize bytes rather than trying to
fill the rest of the input buffer.
(drain_input): Read good_readsize rather than GOOD_READSIZE_MIN
bytes.
(main): Initialize good_readsize.
---
 src/grep.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index 4592b20..912bce4 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -872,6 +872,7 @@ static int bufdesc;		/* File descriptor. */
 static char *bufbeg;		/* Beginning of user-visible stuff. */
 static char *buflim;		/* Limit of user-visible stuff. */
 static idx_t pagesize;		/* alignment of memory pages */
+static idx_t good_readsize;	/* good size to pass to 'read' */
 static off_t bufoffset;		/* Read offset.  */
 static off_t after_last_match;	/* Pointer after last matching line that
                                    would have been output if we were
@@ -880,8 +881,15 @@ static bool skip_nuls;		/* Skip '\0' in data.  */
 static bool skip_empty_lines;	/* Skip empty lines in data.  */
 static intmax_t totalnl;	/* Total newline count before lastnl. */
 
-/* Initial buffer size, not counting slop. */
-enum { INITIAL_BUFSIZE = 96 * 1024 };
+/* Minimum value for good_readsize.
+   If it's too small, there are more syscalls;
+   if too large, it wastes memory and likely cache.
+   Use 96 KiB as it gave good results in a benchmark in 2018
+   (see 2018-09-06 commit labeled "grep: triple initial buffer size: 32k->96k")
+   even though the same benchmark in 2024 found no significant
+   difference for values from 32 KiB to 1024 KiB on Ubuntu 24.04.1 LTS
+   with an Intel Xeon W-1350.  */
+enum { GOOD_READSIZE_MIN = 96 * 1024 };
 
 /* Return VAL aligned to the next multiple of ALIGNMENT.  VAL can be
    an integer or a pointer.  Both args must be free of side effects.  */
@@ -946,9 +954,9 @@ fillbuf (idx_t save, struct stat const *st)
 {
   char *readbuf;
 
-  /* After BUFLIM, we need room for at least a page of data plus a
+  /* After BUFLIM, we need room for a good-sized read plus a
      trailing uword.  */
-  idx_t min_after_buflim = pagesize + uword_size;
+  idx_t min_after_buflim = good_readsize + uword_size;
 
   if (min_after_buflim <= buffer + bufalloc - buflim)
     readbuf = buflim;
@@ -957,8 +965,8 @@ fillbuf (idx_t save, struct stat const *st)
       char *newbuf;
 
       /* For data to be searched we need room for the saved bytes,
-         plus at least a page of data to read.  */
-      idx_t minsize = save + pagesize;
+         plus at least a good-sized read.  */
+      idx_t minsize = save + good_readsize;
 
       /* Add enough room so that the buffer is aligned and has room
          for byte sentinels fore and aft, and so that a uword can
@@ -1001,15 +1009,12 @@ fillbuf (idx_t save, struct stat const *st)
 
   clear_asan_poison ();
 
-  idx_t readsize = buffer + bufalloc - uword_size - readbuf;
-  readsize -= readsize % pagesize;
-
   ptrdiff_t fillsize;
   bool cc = true;
 
   while (true)
     {
-      fillsize = safe_read (bufdesc, readbuf, readsize);
+      fillsize = safe_read (bufdesc, readbuf, good_readsize);
       if (fillsize < 0)
         {
           fillsize = 0;
@@ -1769,12 +1774,12 @@ drain_input (int fd, struct stat const *st)
 #ifdef SPLICE_F_MOVE
       /* Should be faster, since it need not copy data to user space.  */
       nbytes = splice (fd, nullptr, STDOUT_FILENO, nullptr,
-                       INITIAL_BUFSIZE, SPLICE_F_MOVE);
+                       good_readsize, SPLICE_F_MOVE);
       if (0 <= nbytes || errno != EINVAL)
         {
           while (0 < nbytes)
             nbytes = splice (fd, nullptr, STDOUT_FILENO, nullptr,
-                             INITIAL_BUFSIZE, SPLICE_F_MOVE);
+                             good_readsize, SPLICE_F_MOVE);
           return nbytes == 0;
         }
 #endif
@@ -2994,7 +2999,8 @@ main (int argc, char **argv)
   if (! (0 < psize && psize <= (IDX_MAX - uword_size) / 2))
     abort ();
   pagesize = psize;
-  bufalloc = ALIGN_TO (INITIAL_BUFSIZE, pagesize) + pagesize + uword_size;
+  good_readsize = ALIGN_TO (GOOD_READSIZE_MIN, pagesize);
+  bufalloc = good_readsize + pagesize + uword_size;
   buffer = ximalloc (bufalloc);
 
   if (fts_options & FTS_LOGICAL && devices == READ_COMMAND_LINE_DEVICES)
-- 
2.43.0

From 944c2eccc7e88299c47ada1d319fbb5705bc713d Mon Sep 17 00:00:00 2001
From: Paul Eggert <egg...@cs.ucla.edu>
Date: Sat, 21 Sep 2024 22:37:33 -0700
Subject: [PATCH 2/2] =?UTF-8?q?doc:=20warn=20re=20using=20=E2=80=98grep?=
 =?UTF-8?q?=E2=80=99=20to=20detect=20binary=20files?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

This is in response to a bug report by Rodrigo Jorge
<https://bugs.gnu.org/73360>.
* doc/grep.texi (File and Directory Selection):
Warn that ‘grep’ shouldn’t be used to determine whether
a file is binary for other applications’ purposes, as
their definition of “binary” may well differ.
Improve documentation for discovery of null input.
---
 doc/grep.texi | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/doc/grep.texi b/doc/grep.texi
index 7d10f86..9c46e76 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -631,9 +631,11 @@ When some output is suppressed, @command{grep} follows any output
 with a message to standard error saying that a binary file matches.
 
 If @var{type} is @samp{without-match},
-when @command{grep} discovers null input binary data
-it assumes that the rest of the file does not match;
+when @command{grep} discovers null binary data in an input file
+it assumes that any unprocessed input does not match;
 this is equivalent to the @option{-I} option.
+In this case the region of unprocessed input starts no later than the
+null binary data, and continues to end of file.
 
 If @var{type} is @samp{text},
 @command{grep} processes binary data as if it were text;
@@ -649,6 +651,16 @@ is not matched when @var{type} is @samp{text}.  Conversely, when
 @var{type} is @samp{binary} the pattern @samp{.} (period) might not
 match a null byte.
 
+The heuristic that @command{grep} uses to intuit whether input is
+binary is specific to @command{grep} and may well be unsuitable for
+other applications, as it depends on command-line options, on locale,
+and on hardware and operating system characteristics such as system
+page size and input buffering.  For example, if the input consists of
+a matching text line followed by nonmatching data that contains a null
+byte, @command{grep} might either output the matching line or treat
+the file as binary, depending on whether the unprocessed input happens
+to include the matching text line.
+
 @emph{Warning:} The @option{-a} (@option{--binary-files=text}) option
 might output binary garbage, which can have nasty side effects if the
 output is a terminal and if the terminal driver interprets some of it
@@ -2004,7 +2016,7 @@ Note that on some platforms,
 except the available memory.
 
 @item
-Why does @command{grep} report ``Binary file matches''?
+Why does @command{grep} report ``binary file matches''?
 
 If @command{grep} listed all matching ``lines'' from a binary file, it
 would probably generate output that is not useful, and it might even
-- 
2.43.0

Reply via email to