It seems like we have two suggestions: parallelism across different files, and parallelism within large files.

- Parallelism across different files is tricky in two ways: you need threads and a mutex on the file name stream, and, for parallel directory traversal, some sort of thread-safe queue to pass the file names from the producers to the grep consumers.
- You may also need a following consumer layer to ensure the output lines are in order, or at the very least not commingled. A big FILE* buffer with an fflush() per line can ensure each line goes out as a single write(), but you give up the original ordering unless you arrange or sort the output.
- You probably want to set a thread count limit.
- You might want to start with one file name producer, one grep consumer-producer, and one arrange/sort consumer, then add more threads to whichever upstream side is emptying or filling a fixed-size queue.
- But of course, a lot of this is already available from "parallel" if you make a study of it!
- I made a C pipe fitting I called xdemux to take a stream of lines, such as file names, from stdin and spread it in rotation to N downstream popen() pipes to a given command, like xargs grep (a rough sketch of the idea follows this list). N can be set to 2 x your local core count so it is less likely to block on IO, paging, or congestion.
- I also wrote a simpler, line-oriented, faster xargs, fxargs!
- I also wrote a C tool I called pipebuf to buffer stdin to stdout so one slow consumer does not stop others from getting work, though more parallelism is the simpler solution.
- Intel Hyper-Threaded CPUs expose two hardware threads per core, so you can usefully run roughly twice as many parallel workers as you have physical cores.
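To make the xdemux idea a little more concrete, here is a minimal, untested sketch (not the real tool): it reads file name lines from stdin and deals them out round-robin to a fixed number of popen() pipes, each running an xargs grep. The pipe count, the "TODO" pattern, and the missing error recovery are illustrative assumptions only.

#include <stdio.h>
#include <stdlib.h>

#define NPIPES 8                         /* e.g. 2 x local core count */

int main(void)
{
    FILE *pipes[NPIPES];
    char line[4096];                     /* assumes each file name fits in one fgets() */
    int i;

    for (i = 0; i < NPIPES; i++) {
        /* each child reads file names on its stdin and greps them */
        pipes[i] = popen("xargs grep -H -- TODO", "w");
        if (pipes[i] == NULL) { perror("popen"); exit(1); }
    }

    /* rotate incoming file name lines across the downstream pipes */
    for (i = 0; fgets(line, sizeof line, stdin) != NULL; i = (i + 1) % NPIPES)
        fputs(line, pipes[i]);

    for (i = 0; i < NPIPES; i++)
        pclose(pipes[i]);                /* wait for each grep to drain and exit */

    return 0;
}

Fed by something like "find . -type f", that keeps eight greps busy, but their output lines will interleave, which is exactly where the arrange/sort consumer mentioned above would come in.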
- Parallelism within large files reminds me of AbInitio ETL, which I assume divides a file into N portions, with each thread responsible for any line that starts in its portion, even if it ends in another (a rough sketch of that boundary rule is in the P.S. below). Merging the output to present hits in order requires some sort of buffering or sorting. For the very simplest grep question (is it in there at all?), you need to design it so you can call off the other threads on the first hit.

Doing both of the above simultaneously would be a lot! Either one alone is a lot of effort to focus on what is just one of many simple tools, and other tools might want a similar enhancement! :D

File read speeds vary wildly: network drives on networks of various speed and congestion, spinning hard drives of various RPM and bit density, solid state drives, and then files already cached in DRAM (most read IO uses mmap64()), not to mention the many levels of motherboard and CPU cache. I wrote an mmap64()-based fgrep, and it turned out to be so "good" on a big file list that ALL the other processes on the group's server got swapped out big time (without any parallelism)!

-----Original Message-----
From: Paul Jackson <p...@usa.net>
To: Paul Eggert <egg...@cs.ucla.edu>; 60...@debbugs.gnu.org
Sent: Mon, Jan 2, 2023 9:56 pm
Subject: bug#60506: feature: parallel grep --recursive

<< a parallel grep to search a single large file >>

I'm but one user, and a rather idiosyncratic user at that, but for my usage patterns, the specialized logic that it would take to run a parallelized grep on a large file would likely not shrink the elapsed time enough to justify the coding, documentation, and maintenance effort. I would expect the time to read the large file in from disk to dominate the total elapsed time in any case.

(Or maybe I am just jealous that I didn't think of that parallel grep use case myself <grin>.)

--
Paul Jackson
p...@usa.net
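P.S. For anyone curious about the "portions" idea above, here is a rough, untested, single-threaded sketch of the boundary rule (each chunk owns only the lines that start inside it, even when such a line runs past the chunk's end). The fixed chunk count, the plain substring match, and all of the names are illustrative assumptions on my part, not a proposed grep patch; a real version would hand each chunk to its own thread.

#define _GNU_SOURCE                      /* for memmem() */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>

static void scan_chunk(const char *data, size_t len, size_t start, size_t end,
                       const char *needle)
{
    size_t pos = start;

    /* A line straddling the lower boundary started in the previous chunk,
     * so skip ahead to the first line that starts inside [start, end). */
    if (start > 0 && data[start - 1] != '\n') {
        while (pos < len && data[pos] != '\n')
            pos++;
        pos++;
    }

    while (pos < end && pos < len) {
        const char *nl = memchr(data + pos, '\n', len - pos);
        size_t linelen = nl ? (size_t)(nl - (data + pos)) : len - pos;

        if (memmem(data + pos, linelen, needle, strlen(needle)) != NULL) {
            fwrite(data + pos, 1, linelen, stdout);
            putchar('\n');
        }
        pos += linelen + 1;              /* the line may end past 'end' */
    }
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s needle file\n", argv[0]);
        return 2;
    }

    int fd = open(argv[2], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0) { perror(argv[2]); return 2; }

    const char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 2; }

    size_t size = st.st_size, nchunks = 4;   /* sequential here, one thread per chunk in real life */
    for (size_t i = 0; i < nchunks; i++)
        scan_chunk(data, size, i * size / nchunks, (i + 1) * size / nchunks, argv[1]);

    return 0;
}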