Hello Paul, Glen, Jim, all,

* Paul Eggert wrote on Fri, Apr 03, 2009 at 09:57:54PM CEST:
> Of course we cannot reasonably expect this one performance improvement
> to make 'sort' run 16x faster on a 16-CPU machine. That is because the
> improvement parallelizes only one part of 'sort'. Even assuming the
> parallelized part gets done in essentially zero time, the rest of 'sort'
> is still sequential (notably, input and output), and so it will dominate
> the wall-clock time. Obviously we would like to parallelize all of
> 'sort', but the idea is to do one step at a time, focusing on the
> easiest parts first, and showing real improvements at each step. I
> would be quite happy if this particular improvement gave us a 2x
> wall-clock improvement.
I've looked at this in a bit more detail; no big conclusion, but maybe
a few more hints that could help.

I am now pretty confident that your patch implements the threading
correctly.  When I insert a call to an expensive_computation () in the
sortlines function right after `swap' is computed, all timings look
good in relation to each other.  For the expensive computation, I used
a function in a separate translation unit doing some floating point,
and compiled that without optimization (a sketch of such a function
appears after the diff below).

'valgrind --tool=cachegrind' shows a very low cache miss rate of 0.2%
max with both 1 and 2 threads, indicating there isn't anything funny
going on here.  So IMHO the only remaining issues that could be
relevant for the original timings are:

- false data sharing (though I don't see where that could happen),
- memory bandwidth limitations of the specific system, or
- suboptimal NUMA memory distribution.

That said, I did manage to get 'valgrind --tool=helgrind sort
--threads=2' to report a potential error, with a large data set.  I'm
quite sure it is a false positive, but I post the output below for
reference.  The 'valgrind --tool=drd' tool could not reproduce it,
though.

Anyway, I tried another experiment.  I replicated some large data set
8 times.  I then timed a manual merge:

  sort -m <( sort data ) <( sort data ) ... (8 times) >/dev/null

and compared that with:

  sort --threads=P 8-fold-data

  manual merge:  16.75 s
  --threads=8:   44.11 s
  --threads=1:   81.32 s

This comparison isn't entirely fair, as the splitting of the data into
pieces was done as a precomputation.  However, the difference is so
pronounced that, even taking the splitting into account, the manual
(process-based) merge would still be quite a bit faster.  So it seems
to me that some approach going in that direction could be beneficial;
a rough sketch of what I have in mind follows.
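Concretely, assuming plain C strings with qsort and strcmp standing in
for sort's own line representation and compare function, and a simple
linear-scan merge (a heap would serve better for many chunks), the
idea looks like this.  All names here (struct chunk, sort_chunk,
merge_chunks) are made up for this sketch, and a real implementation
would also need to overlap input and output and handle keys and temp
files.  Compile with -pthread:

/* Split an array of lines into NTHREADS chunks, sort each chunk in
   its own thread, then merge the sorted chunks sequentially.  */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct chunk { char **lines; size_t n; };

static int
cmp (const void *a, const void *b)
{
  return strcmp (*(char * const *) a, *(char * const *) b);
}

static void *
sort_chunk (void *arg)
{
  struct chunk *c = arg;
  qsort (c->lines, c->n, sizeof *c->lines, cmp);
  return NULL;
}

/* K-way merge by repeatedly picking the smallest head line.  */
static void
merge_chunks (struct chunk *c, size_t nchunks, char **out)
{
  size_t *pos = calloc (nchunks, sizeof *pos);
  for (;;)
    {
      size_t i, best = nchunks;
      for (i = 0; i < nchunks; i++)
        if (pos[i] < c[i].n
            && (best == nchunks
                || strcmp (c[i].lines[pos[i]], c[best].lines[pos[best]]) < 0))
          best = i;
      if (best == nchunks)
        break;
      *out++ = c[best].lines[pos[best]++];
    }
  free (pos);
}

int
main (void)
{
  char *data[] = { "pear", "apple", "fig", "plum", "kiwi", "date" };
  enum { N = 6, NTHREADS = 2 };
  struct chunk c[NTHREADS];
  pthread_t t[NTHREADS];
  char *merged[N];
  int i;

  /* Split the line array evenly; the last chunk takes the remainder.  */
  for (i = 0; i < NTHREADS; i++)
    {
      c[i].lines = data + i * (N / NTHREADS);
      c[i].n = (i == NTHREADS - 1 ? N - i * (N / NTHREADS) : N / NTHREADS);
      pthread_create (&t[i], NULL, sort_chunk, &c[i]);
    }
  for (i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);

  merge_chunks (c, NTHREADS, merged);
  for (i = 0; i < N; i++)
    puts (merged[i]);
  return 0;
}

This mirrors what the sort -m pipeline above does with processes, only
with threads and without the extra I/O.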
Other than that, have you thought of implementing something like what
is described in <http://www.it.uom.gr/teaching/dbpp/text/node127.html>?

Cheers,
Ralf

This is the (non-portable) diff I used when running the helgrind
command below.  With it, sort.c:2608 is the last sortlines invocation
within sortlines.

diff --git a/src/sort.c b/src/sort.c
index 82aa7e8..c0a0462 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1,5 +1,5 @@
 /* sort - sort lines of text (with all kinds of options).
-   Copyright (C) 1988, 1991-2008 Free Software Foundation, Inc.
+   Copyright (C) 1988, 1991-2009 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -2520,6 +2520,7 @@ sortlines_thread (void *data)
   struct thread_args const *args = data;
   sortlines (args->lines, args->nlines, args->temp, args->nthreads,
              args->to_temp);
+  fprintf(stderr, "thread %lu done\n", pthread_self ());
   return NULL;
 }
 
@@ -2540,7 +2541,7 @@ sortlines_thread (void *data)
    usual 2*N lines.  Knuth writes that this memory optimization was
    originally published by D. A. Bell, Comp J. 1 (1958), 75.
 
-   This function is inline so that its tests for multthreadedness and
+   This function is inline so that its tests for multithreadedness and
    inplacedness can be optimized away in common cases.  */
 
 static void
@@ -2554,6 +2555,9 @@ sortlines (struct line *restrict lines, size_t nlines,
          <http://lists.gnu.org/archive/html/bug-coreutils/2005-10/msg00086.html>
          in the IBM xlc 6.0.0.0 compiler in 64-bit mode.  */
       int swap = (0 < compare (&lines[-1], &lines[-2]));
+      /* let's make this more expensive, artificially */
+      //int expensive_computation (int);
+      //swap = expensive_computation (swap);
       if (to_temp)
         {
           temp[-1] = lines[-1 - swap];
@@ -2561,9 +2565,9 @@ sortlines (struct line *restrict lines, size_t nlines,
         }
       else if (swap)
         {
-          temp[-1] = lines[-1];
+          struct line tmp = lines[-1];
           lines[-1] = lines[-2];
-          lines[-2] = temp[-1];
+          lines[-2] = tmp;
         }
     }
   else
@@ -2579,16 +2583,26 @@ sortlines (struct line *restrict lines, size_t nlines,
       struct thread_args args = {hi, nhi, temp - nlo, child_subthreads,
                                  to_temp};
       if (child_subthreads != 0 && SUBTHREAD_LINES_HEURISTIC <= nlines
+          && fprintf(stderr, "thread %lu is creating thread for %zu lines\n",
+                     pthread_self (), nlines)
           && pthread_create (&thread, NULL, sortlines_thread, &args) == 0)
         {
           /* Guarantee that nlo and nhi are each at least 2.  */
           verify (4 <= SUBTHREAD_LINES_HEURISTIC);
           sortlines (lo, nlo, temp, my_subthreads, !to_temp);
+          fprintf (stderr, "thread %lu joined by thread %lu\n", thread, pthread_self ());
           pthread_join (thread, NULL);
         }
       else
         {
+          static __thread int seen = 0;
+          if (!seen)
+            {
+              fprintf (stderr, "thread %lu is sorting %zu lines\n",
+                       pthread_self (), nlines);
+              seen = 1;
+            }
           sortlines (hi, nhi, temp - (to_temp ? nlo : 0), 1, to_temp);
           if (1 < nlo)
             sortlines (lo, nlo, temp, 1, !to_temp);
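For reference, the expensive_computation () called by the commented-out
hook in the diff is not part of the patch itself; as described above,
it was a function in a separate translation unit doing some floating
point, compiled without optimization.  It was along these lines
(reconstructed; the loop count is arbitrary, and the file should be
built with -O0):

/* expensive.c -- artificial CPU load.  Kept in its own translation
   unit and compiled without optimization so the compiler cannot
   inline it or fold the loop away.  */

int
expensive_computation (int swap)
{
  volatile double x = 1.0;
  int i;
  for (i = 0; i < 1000; i++)
    x = x * 1.0000001 + 0.5;
  /* Return the input unchanged so sortlines still sorts correctly.  */
  return swap;
}

And this is the helgrind run with the instrumented sort: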
$ \time valgrind --tool=helgrind --trace-addr=0x5d3da5e --trace-level=2 ./sort --threads=2 T >/dev/null
==26798== Helgrind, a thread error detector.
==26798== Copyright (C) 2007-2008, and GNU GPL'd, by OpenWorks LLP et al.
==26798== Using LibVEX rev 1404, a library for dynamic binary translation.
==26798== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==26798== Using valgrind-3.4.0.SVN, a dynamic binary instrumentation framework.
==26798== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==26798== For more details, rerun with: -v
==26798==
==26798==
==26798== Thread #1 is the program's root thread
==26798==
==26798== TRACE: 0x59b0030 pa 255239968 thr#1 :: NoAccess --> New
==26798==    at 0x4C232CB: malloc (vg_replace_malloc.c:207)
==26798==    by 0x4043DE: initbuf (sort.c:1404)
==26798==    by 0x407FC1: sort (sort.c:2851)
==26798==    by 0x409B7E: main (sort.c:3670)
==26798==
==26798== TRACE: 0x5d3da58 wr 8 thr#1 :: New --> Exclusive(thr#1)
==26798==    at 0x5317B40: __read_nocancel (in /lib/libc-2.7.so)
==26798==    by 0x52BBA8F: _IO_file_xsgetn (in /lib/libc-2.7.so)
==26798==    by 0x52BB10D: fread_unlocked (in /lib/libc-2.7.so)
==26798==    by 0x404942: fillbuf (sort.c:1610)
==26798==    by 0x408190: sort (sort.c:2857)
==26798==    by 0x409B7E: main (sort.c:3670)
thread 67380784 is creating thread for 788552 lines
thread 357677392 is sorting 394276 lines himself
thread 67380784 is sorting 394276 lines himself
==26798==
==26798== TRACE: 0x5d3da5c rd 4 thr#1 :: Exclusive(thr#1) --> Exclusive(thr#1)
==26798==    at 0x52C9080: strlen (in /lib/libc-2.7.so)
==26798==    by 0x52CCB0F: strcoll_l (in /lib/libc-2.7.so)
==26798==    by 0x40D64C: memcoll (memcoll.c:56)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==
==26798== Thread #2 was created
==26798==    at 0x5325AFE: clone (in /lib/libc-2.7.so)
==26798==    by 0x5038A11: pthread_create@@GLIBC_2.2.5 (in /lib/libpthread-2.7.so)
==26798==    by 0x4C26983: pthread_create@* (hg_intercepts.c:213)
==26798==    by 0x4074A8: sortlines (sort.c:2585)
==26798==    by 0x408089: sort (sort.c:2876)
==26798==    by 0x409B7E: main (sort.c:3670)
==26798==
==26798== TRACE: 0x5d3da5e rd 1 thr#2 :: Exclusive(thr#1) --> ShRO(#Tset=2,#Lset=0)
==26798==    at 0x40D596: memcoll (memcoll.c:50)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798== Possible data race during write of size 1 at 0x5d3da5e
==26798==    at 0x40D5B2: memcoll (memcoll.c:53)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==  Old state: shared-readonly by threads #1, #2
==26798==  New state: shared-modified by threads #1, #2
==26798==  Reason: this thread, #2, holds no consistent locks
==26798==  Location 0x5d3da5e has never been protected by any lock
==26798==
==26798== TRACE: 0x5d3da5e wr 1 thr#2 :: ShRO(#Tset=2,#Lset=0) --> ShMod(#Tset=2,#Lset=0)
==26798==    at 0x40D5B2: memcoll (memcoll.c:53)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
thread 357677392 done
thread 357677392 joined by thread 67380784
==26798==
==26798== TRACE: 0x59b0030 pa 255239968 thr#1 :: Exclusive(thr#1) --> NoAccess
==26798==    at 0x4C22E4E: free (vg_replace_malloc.c:323)
==26798==    by 0x4081BF: sort (sort.c:2913)
==26798==    by 0x409B7E: main (sort.c:3670)
==26798==
==26798== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 10 from 2)
176.15user 0.41system 2:57.74elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+8outputs (0major+82377minor)pagefaults 0swaps
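P.S.: Regarding the reported race: the one-byte read and write inside
memcoll match the fact that gnulib's memcoll may temporarily modify
the byte just past each input block -- it NUL-terminates the buffers
so that it can call strcoll, and restores the saved bytes before
returning.  Schematically, the pattern is something like the
following; this is an illustration only, not the actual gnulib code,
and the function name is made up:

#include <string.h>

static int
compare_with_temporary_terminators (char *s1, size_t s1len,
                                    char *s2, size_t s2len)
{
  /* Save the bytes just past each block, then NUL-terminate.  */
  char saved1 = s1[s1len];
  char saved2 = s2[s2len];
  int diff;

  s1[s1len] = '\0';
  s2[s2len] = '\0';
  diff = strcoll (s1, s2);

  /* Restore the saved bytes before returning.  */
  s1[s1len] = saved1;
  s2[s2len] = saved2;
  return diff;
}

If two threads compare lines whose buffers happen to abut, both can
touch such a boundary byte without holding a lock, which is exactly
the kind of unsynchronized one-byte write helgrind flags here.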