Hello Paul, Glen, Jim, all,

* Paul Eggert wrote on Fri, Apr 03, 2009 at 09:57:54PM CEST:
> Of course we cannot reasonably expect this one performance improvement
> to make 'sort' run 16x faster on a 16-CPU machine. That is because the
> improvement parallelizes only one part of 'sort'. Even assuming the
> parallelized part gets done in essentially zero time, the rest of 'sort'
> is still sequential (notably, input and output), and so it will dominate
> the wall-clock time. Obviously we would like to parallelize all of
> 'sort', but the idea is to do one step at a time, focusing on the
> easiest parts first, and showing real improvements at each step. I
> would be quite happy if this particular improvement gave us a 2x
> wall-clock improvement.
I've looked at this in a bit more detail; no big conclusion, but maybe
a few more hints that could help.

I am now pretty confident that your patch implements the threading
correctly.  When I insert a call to an expensive_computation () in the
sortlines function right after `swap' is computed, all timings look
good in relation to each other.  For the expensive computation, I used
a function in a separate translation unit doing some floating point,
and compiled that without optimization (a sketch of such a function
appears after the diff below).

'valgrind --tool=cachegrind' shows a very low cache miss rate of 0.2%
max with both 1 and 2 threads, indicating there isn't anything funny
going on here.  So IMHO the only remaining issues that could be
relevant for the original timings are:

- false data sharing (though I don't see where that could happen),
- memory bandwidth limitations of the specific system, or
- suboptimal NUMA memory distribution.

That said, I did manage to get 'valgrind --tool=helgrind sort
--threads=2' to report a potential error, with a large data set.  I'm
quite sure it is a false positive, but I post the output below for
reference.  The 'valgrind --tool=drd' tool could not reproduce it,
though.

Anyway, I tried another experiment.  I replicated some large data set
8 times.  I then timed a manual merge:

  sort -m <( sort data ) <( sort data ) ... (8 times) >/dev/null

and compared that with:

  sort --threads=P 8-fold-data

  manual merge:  16.75 s
  --threads=8:   44.11 s
  --threads=1:   81.32 s

This comparison isn't entirely fair, as the splitting of the data into
pieces was done as a precomputation.  However, the difference is so
pronounced that, even taking the splitting into account, the manual
(process-based) merge would still be quite a bit faster.  So it seems
to me that some approach going in that direction could be beneficial;
a rough sketch of what I have in mind follows.
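Concretely, assuming plain C strings with qsort and strcmp standing in
for sort's own line representation and compare function, and a simple
linear-scan merge (a heap would serve better for many chunks), the
idea looks like this.  All names here (struct chunk, sort_chunk,
merge_chunks) are made up for this sketch, and a real implementation
would also need to overlap input and output and handle keys and temp
files.  Compile with -pthread:

/* Split an array of lines into NTHREADS chunks, sort each chunk in
   its own thread, then merge the sorted chunks sequentially.  */

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct chunk { char **lines; size_t n; };

static int
cmp (const void *a, const void *b)
{
  return strcmp (*(char * const *) a, *(char * const *) b);
}

static void *
sort_chunk (void *arg)
{
  struct chunk *c = arg;
  qsort (c->lines, c->n, sizeof *c->lines, cmp);
  return NULL;
}

/* K-way merge by repeatedly picking the smallest head line.  */
static void
merge_chunks (struct chunk *c, size_t nchunks, char **out)
{
  size_t *pos = calloc (nchunks, sizeof *pos);
  for (;;)
    {
      size_t i, best = nchunks;
      for (i = 0; i < nchunks; i++)
        if (pos[i] < c[i].n
            && (best == nchunks
                || strcmp (c[i].lines[pos[i]], c[best].lines[pos[best]]) < 0))
          best = i;
      if (best == nchunks)
        break;
      *out++ = c[best].lines[pos[best]++];
    }
  free (pos);
}

int
main (void)
{
  char *data[] = { "pear", "apple", "fig", "plum", "kiwi", "date" };
  enum { N = 6, NTHREADS = 2 };
  struct chunk c[NTHREADS];
  pthread_t t[NTHREADS];
  char *merged[N];
  int i;

  /* Split the line array evenly; the last chunk takes the remainder.  */
  for (i = 0; i < NTHREADS; i++)
    {
      c[i].lines = data + i * (N / NTHREADS);
      c[i].n = (i == NTHREADS - 1 ? N - i * (N / NTHREADS) : N / NTHREADS);
      pthread_create (&t[i], NULL, sort_chunk, &c[i]);
    }
  for (i = 0; i < NTHREADS; i++)
    pthread_join (t[i], NULL);

  merge_chunks (c, NTHREADS, merged);
  for (i = 0; i < N; i++)
    puts (merged[i]);
  return 0;
}

This mirrors what the sort -m pipeline above does with processes, only
with threads and without the extra I/O.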
Other than that, have you thought of implementing something like what
is described in <http://www.it.uom.gr/teaching/dbpp/text/node127.html>?

Cheers,
Ralf

This is the (non-portable) diff I used when running the helgrind
command below.  With it, sort.c:2608 is the last sortlines invocation
within sortlines.

diff --git a/src/sort.c b/src/sort.c
index 82aa7e8..c0a0462 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1,5 +1,5 @@
 /* sort - sort lines of text (with all kinds of options).
-   Copyright (C) 1988, 1991-2008 Free Software Foundation, Inc.
+   Copyright (C) 1988, 1991-2009 Free Software Foundation, Inc.
 
    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -2520,6 +2520,7 @@ sortlines_thread (void *data)
   struct thread_args const *args = data;
   sortlines (args->lines, args->nlines, args->temp, args->nthreads,
              args->to_temp);
+  fprintf(stderr, "thread %lu done\n", pthread_self ());
   return NULL;
 }
 
@@ -2540,7 +2541,7 @@ sortlines_thread (void *data)
    usual 2*N lines.  Knuth writes that this memory optimization was
    originally published by D. A. Bell, Comp J. 1 (1958), 75.
 
-   This function is inline so that its tests for multthreadedness and
+   This function is inline so that its tests for multithreadedness and
    inplacedness can be optimized away in common cases.  */
 
 static void
@@ -2554,6 +2555,9 @@ sortlines (struct line *restrict lines, size_t nlines,
          <http://lists.gnu.org/archive/html/bug-coreutils/2005-10/msg00086.html>
          in the IBM xlc 6.0.0.0 compiler in 64-bit mode.  */
       int swap = (0 < compare (&lines[-1], &lines[-2]));
+      /* let's make this more expensive, artificially */
+      //int expensive_computation (int);
+      //swap = expensive_computation (swap);
       if (to_temp)
         {
           temp[-1] = lines[-1 - swap];
@@ -2561,9 +2565,9 @@ sortlines (struct line *restrict lines, size_t nlines,
         }
       else if (swap)
         {
-          temp[-1] = lines[-1];
+          struct line tmp = lines[-1];
           lines[-1] = lines[-2];
-          lines[-2] = temp[-1];
+          lines[-2] = tmp;
         }
     }
   else
@@ -2579,16 +2583,26 @@ sortlines (struct line *restrict lines, size_t nlines,
       struct thread_args args = {hi, nhi, temp - nlo, child_subthreads,
                                  to_temp};
       if (child_subthreads != 0 && SUBTHREAD_LINES_HEURISTIC <= nlines
+          && fprintf(stderr, "thread %lu is creating thread for %zu lines\n",
+                     pthread_self (), nlines)
           && pthread_create (&thread, NULL, sortlines_thread, &args) == 0)
         {
           /* Guarantee that nlo and nhi are each at least 2.  */
           verify (4 <= SUBTHREAD_LINES_HEURISTIC);
           sortlines (lo, nlo, temp, my_subthreads, !to_temp);
+          fprintf (stderr, "thread %lu joined by thread %lu\n", thread, pthread_self ());
           pthread_join (thread, NULL);
         }
       else
         {
+          static __thread int seen = 0;
+          if (!seen)
+            {
+              fprintf (stderr, "thread %lu is sorting %zu lines\n",
+                       pthread_self (), nlines);
+              seen = 1;
+            }
           sortlines (hi, nhi, temp - (to_temp ? nlo : 0), 1, to_temp);
           if (1 < nlo)
             sortlines (lo, nlo, temp, 1, !to_temp);
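For reference, the expensive_computation () called by the commented-out
hook in the diff is not part of the patch itself; as described above,
it was a function in a separate translation unit doing some floating
point, compiled without optimization.  It was along these lines
(reconstructed; the loop count is arbitrary, and the file should be
built with -O0):

/* expensive.c -- artificial CPU load.  Kept in its own translation
   unit and compiled without optimization so the compiler cannot
   inline it or fold the loop away.  */

int
expensive_computation (int swap)
{
  volatile double x = 1.0;
  int i;
  for (i = 0; i < 1000; i++)
    x = x * 1.0000001 + 0.5;
  /* Return the input unchanged so sortlines still sorts correctly.  */
  return swap;
}

And this is the helgrind run with the instrumented sort: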
$ \time valgrind --tool=helgrind --trace-addr=0x5d3da5e --trace-level=2 ./sort --threads=2 T >/dev/null
==26798== Helgrind, a thread error detector.
==26798== Copyright (C) 2007-2008, and GNU GPL'd, by OpenWorks LLP et al.
==26798== Using LibVEX rev 1404, a library for dynamic binary translation.
==26798== Copyright (C) 2004-2008, and GNU GPL'd, by OpenWorks LLP.
==26798== Using valgrind-3.4.0.SVN, a dynamic binary instrumentation framework.
==26798== Copyright (C) 2000-2008, and GNU GPL'd, by Julian Seward et al.
==26798== For more details, rerun with: -v
==26798==
==26798==
==26798== Thread #1 is the program's root thread
==26798==
==26798== TRACE: 0x59b0030 pa 255239968 thr#1 :: NoAccess --> New
==26798==    at 0x4C232CB: malloc (vg_replace_malloc.c:207)
==26798==    by 0x4043DE: initbuf (sort.c:1404)
==26798==    by 0x407FC1: sort (sort.c:2851)
==26798==    by 0x409B7E: main (sort.c:3670)
==26798==
==26798== TRACE: 0x5d3da58 wr 8 thr#1 :: New --> Exclusive(thr#1)
==26798==    at 0x5317B40: __read_nocancel (in /lib/libc-2.7.so)
==26798==    by 0x52BBA8F: _IO_file_xsgetn (in /lib/libc-2.7.so)
==26798==    by 0x52BB10D: fread_unlocked (in /lib/libc-2.7.so)
==26798==    by 0x404942: fillbuf (sort.c:1610)
==26798==    by 0x408190: sort (sort.c:2857)
==26798==    by 0x409B7E: main (sort.c:3670)
thread 67380784 is creating thread for 788552 lines
thread 357677392 is sorting 394276 lines himself
thread 67380784 is sorting 394276 lines himself
==26798==
==26798== TRACE: 0x5d3da5c rd 4 thr#1 :: Exclusive(thr#1) --> Exclusive(thr#1)
==26798==    at 0x52C9080: strlen (in /lib/libc-2.7.so)
==26798==    by 0x52CCB0F: strcoll_l (in /lib/libc-2.7.so)
==26798==    by 0x40D64C: memcoll (memcoll.c:56)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==
==26798== Thread #2 was created
==26798==    at 0x5325AFE: clone (in /lib/libc-2.7.so)
==26798==    by 0x5038A11: pthread_create@@GLIBC_2.2.5 (in /lib/libpthread-2.7.so)
==26798==    by 0x4C26983: pthread_create@* (hg_intercepts.c:213)
==26798==    by 0x4074A8: sortlines (sort.c:2585)
==26798==    by 0x408089: sort (sort.c:2876)
==26798==    by 0x409B7E: main (sort.c:3670)
==26798==
==26798== TRACE: 0x5d3da5e rd 1 thr#2 :: Exclusive(thr#1) --> ShRO(#Tset=2,#Lset=0)
==26798==    at 0x40D596: memcoll (memcoll.c:50)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798== Possible data race during write of size 1 at 0x5d3da5e
==26798==    at 0x40D5B2: memcoll (memcoll.c:53)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==  Old state: shared-readonly by threads #1, #2
==26798==  New state: shared-modified by threads #1, #2
==26798==  Reason: this thread, #2, holds no consistent locks
==26798==  Location 0x5d3da5e has never been protected by any lock
==26798==
==26798== TRACE: 0x5d3da5e wr 1 thr#2 :: ShRO(#Tset=2,#Lset=0) --> ShMod(#Tset=2,#Lset=0)
==26798==    at 0x40D5B2: memcoll (memcoll.c:53)
==26798==    by 0x40AF43: xmemcoll (xmemcoll.c:43)
==26798==    by 0x406153: compare (sort.c:2122)
==26798==    by 0x40727B: sortlines (sort.c:2557)
==26798==    by 0x4075A2: sortlines (sort.c:2606)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
==26798==    by 0x4075D2: sortlines (sort.c:2608)
thread 357677392 done
thread 357677392 joined by thread 67380784
==26798==
==26798== TRACE: 0x59b0030 pa 255239968 thr#1 :: Exclusive(thr#1) --> NoAccess
==26798==    at 0x4C22E4E: free (vg_replace_malloc.c:323)
==26798==    by 0x4081BF: sort (sort.c:2913)
==26798==    by 0x409B7E: main (sort.c:3670)
==26798==
==26798== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 10 from 2)
176.15user 0.41system 2:57.74elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+8outputs (0major+82377minor)pagefaults 0swaps
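P.S.: Regarding the reported race: the one-byte read and write inside
memcoll match the fact that gnulib's memcoll may temporarily modify
the byte just past each input block -- it NUL-terminates the buffers
so that it can call strcoll, and restores the saved bytes before
returning.  Schematically, the pattern is something like the
following; this is an illustration only, not the actual gnulib code,
and the function name is made up:

#include <string.h>

static int
compare_with_temporary_terminators (char *s1, size_t s1len,
                                    char *s2, size_t s2len)
{
  /* Save the bytes just past each block, then NUL-terminate.  */
  char saved1 = s1[s1len];
  char saved2 = s2[s2len];
  int diff;

  s1[s1len] = '\0';
  s2[s2len] = '\0';
  diff = strcoll (s1, s2);

  /* Restore the saved bytes before returning.  */
  s1[s1len] = saved1;
  s2[s2len] = saved2;
  return diff;
}

If two threads compare lines whose buffers happen to abut, both can
touch such a boundary byte without holding a lock, which is exactly
the kind of unsynchronized one-byte write helgrind flags here.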