bug#20768: RFC: Multithreaded grep

Zev Weiss Mon, 08 Jun 2015 05:17:21 -0700

Hello,

After what looks like 16+ years as an entry in the TODO file (back to
the first commit in the git history), I decided to see if I couldn't
get grep going multithreaded.  I now have a working version, so I
figured I'd try to get some feedback on it in its current form and
hopefully ultimately get it included in GNU grep.


At a high level: I've added a new flag, -M/--parallel[=N], that
enables multithreaded operation.  The N worker threads (defaulting to
the number of available CPUs) operate in parallel at file granularity,
so it's only of use when operating on multiple files (perhaps via
-R/-r/--directories=recurse).  The main thread opens each file and
adds it to a global queue of files to be searched ('workqueue' in
src/grep.c), and the worker threads then pull items off the queue to
search through.

For some rough performance numbers, on a 3.5GHz 4-core/8-thread
Haswell running Linux 3.16 with a warm pagecache:

 $ time ./src/grep -r FOOB $DIR
 ...

 real    0m13.740s
 user    0m12.424s
 sys     0m2.376s


 $ time ./src/grep -M -r FOOB $DIR
 ...

 real    0m3.348s
 user    0m19.428s
 sys     0m3.480s

$DIR here is a 15GB tree (24299 directories, 305282 files) of assorted
code -- git repos, tarballs, etc.  [Side note: this workload (perhaps
grep in general, not sure) suffered a major performance regression
between v2.20 and v2.21, which I bisected to commit cd36abd4 -- commit
dfff75a4 then recovered some of the lost performance, but it's still
substantially slower than it had been.]

If preferred I can send all the patches directly to the mailing list
(or as one combined diff, whatever's convenient), but for now it's
currently viewable as a series of mostly fairly fine-grained git
commits in the 'multithread' branch at:

 https://github.com/zevweiss/grep

(A few of the small cleanup commits might also be of use independently
of the multithreading effort.)

Some notes, caveats and such:

- It doesn't presently build -- all my testing & development has
  involved running the final link command manually, because (despite
  hours of bashing my head against it) I've been unable to convince
  the bootstrap/autotools/gnulib/etc agglomeration to add -lpthread
  to the command line.  Any advice here would be appreciated.

- There's a slight quirk in command-line interpretation: since the
  '-M' short flag takes an optional numeric argument, 'grep -M4' will
  be interpreted as '--parallel=4', not '--parallel --context=4'.
  This seems safe from a backwards-compatibility standpoint (since no
  existing scripts should be using '-M'), but could potentially
  produce surprising results if existing scripts are modified in
  particular ways (e.g. changing 'grep -R4' to 'grep -RM4').  I'm
  open to suggestions on more graceful ways to handle this.

- I'm not really at all familiar with the internal workings of grep's
  actual text-search guts, so some of the identifiers I've chosen in
  the parts that had to touch those bits (or elsewhere, for that
  matter) may not be the best.

- Some of the un-sharing (moving global state into thread-private
  structs) is probably a bit heavy-handed; i.e. some things are now
  thread-private that could have remained global, but (at least for
  now) it was easier to just do it that way.

- Along similar lines, it would probably be nicer to have a way to
  just copy the result of `compile()' rather than re-compiling for
  every thread.

- Testing: I've extended a few of the tests in the testsuite to also
  run multithreaded, but the coverage isn't exactly extensive.

- I haven't (yet) completed the FSF's copyright-assignment process,
  though I have sent the initial email requesting the paperwork.

- Commit messages are in a different style than used in the grep git
  repo; I'll fix them up depending on how exactly they should be
  bundled together.

- Though I have attempted to follow the GNU coding style, my own
  usual coding style differs from it quite a bit, and there may as a
  result be some stylistic misfits here and there.  I'm happy to fix
  any of these anyone points out.


Any/all feedback welcome.


Thanks,
Zev Weiss

bug#20768: RFC: Multithreaded grep

Reply via email to