Hi Darren,

On Fri, 2013-11-22 at 21:55 -0800, Darren Hart wrote:
> On Fri, 2013-11-22 at 16:56 -0800, Davidlohr Bueso wrote:
> > We have been dealing with a customer database workload on a large
> > 12 TB, 240-core, 16-socket NUMA system that exhibits high amounts
> > of contention on some of the locks that serialize internal futex
> > data structures. This workload especially suffers in the wakeup
> > paths, where waiting on the corresponding hb->lock can account for
> > up to ~60% of the time. The result of such calls can mostly be
> > classified as (i) nothing to wake up and (ii) waking up a large
> > number of tasks.
>
> With as many cores as you have, have you done any analysis of how
> effective the hashing algorithm is, and would more buckets relieve some
> of the contention.... ah, I see below that you did. Nice work.
>
> > Before these patches are applied, we can see this pathological behavior:
> >
> >  37.12%  826174  xxx  [kernel.kallsyms]  [k] _raw_spin_lock
> >          --- _raw_spin_lock
> >           |
> >           |--97.14%-- futex_wake
> >           |          do_futex
> >           |          sys_futex
> >           |          system_call_fastpath
> >           |          |
> >           |          |--99.70%-- 0x7f383fbdea1f
> >           |          |          yyy
> >
> >  43.71%  762296  xxx  [kernel.kallsyms]  [k] _raw_spin_lock
> >          --- _raw_spin_lock
> >           |
> >           |--53.74%-- futex_wake
> >           |          do_futex
> >           |          sys_futex
> >           |          system_call_fastpath
> >           |          |
> >           |          |--99.40%-- 0x7fe7d44a4c05
> >           |          |          zzz
> >           |
> >           |--45.90%-- futex_wait_setup
> >           |          futex_wait
> >           |          do_futex
> >           |          sys_futex
> >           |          system_call_fastpath
> >           |          0x7fe7ba315789
> >           |          syscall
>
> Sorry to be dense, can you spell out how 60% falls out of these numbers?
By adding the respective percentages of the futex_wake() _raw_spin_lock
entries above: 37.12% * 97.14% + 43.71% * 53.74% ~= 36.1% + 23.5% ~= 60%.

> > With these patches, contention is practically non-existent:
> >
> >  0.10%  49  xxx  [kernel.kallsyms]  [k] _raw_spin_lock
> >         --- _raw_spin_lock
> >          |
> >          |--76.06%-- futex_wait_setup
> >          |          futex_wait
> >          |          do_futex
> >          |          sys_futex
> >          |          system_call_fastpath
> >          |          |
> >          |          |--99.90%-- 0x7f3165e63789
> >          |          |          syscall
> > ...
> >          |--6.27%-- futex_wake
> >          |          do_futex
> >          |          sys_futex
> >          |          system_call_fastpath
> >          |          |
> >          |          |--54.56%-- 0x7f317fff2c05
> > ...
> >
> > Patches 1 & 2 are cleanups and micro-optimizations.
> >
> > Patch 3 addresses the well known issue of the global hash table.
> > By creating a larger and NUMA-aware table, we can reduce the false
> > sharing and collisions, thus reducing the chance of different futexes
> > using the same hb->lock.
> >
> > Patch 4 reduces contention on the corresponding hb->lock by not trying
> > to acquire it if there are no blocked tasks in the waitqueue.
> > This particularly deals with point (i) above, where we see that it is
> > not uncommon for up to 90% of wakeup calls to end up returning 0,
> > indicating that no tasks were woken.
>
> Can you determine how much benefit comes from 3 and how much additional
> benefit comes from 4?

While I don't have specific per-patch data, there are indications that the
workload mostly deals with a handful of futexes. So it's pretty safe to
assume that patch 4 is the one with the most benefit for _this_ particular
workload.

> > Patch 5 resurrects a two year old idea from Peter Zijlstra to delay
> > the waking of the blocked tasks so that it is done without holding the
> > hb->lock: https://lkml.org/lkml/2011/9/14/118
> >
> > This is useful for locking primitives that can effect multiple wakeups
> > per operation and want to avoid the futex's internal spinlock contention
> > by delaying the wakeups until we've released the hb->lock.
> > This particularly deals with point (ii) above, where we can observe that
> > on occasion the wake calls end up waking 125 to 200 waiters in what we
> > believe are RW locks in the application.
> >
> > This patchset has also been tested on smaller systems for a variety of
> > benchmarks, including java workloads, kernel builds and custom
> > bang-the-hell-out-of-hb-locks programs. So far, no functional or
> > performance regressions have been seen. Furthermore, no issues were found
> > when running the different tests in the futextest suite:
> > http://git.kernel.org/cgit/linux/kernel/git/dvhart/futextest.git/
>
> Excellent. Would you be able to contribute any of these (C only please)
> to the stress test group?

Sure.

Thanks,
Davidlohr
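For readers skimming the thread, here is a minimal sketch of the patch 3
idea. All names and constants are invented for illustration (this is not
the actual patch) and the NUMA-aware placement is omitted: the point is
simply that the table is sized from the number of possible CPUs instead of
being a small fixed-size global array, so unrelated futexes are less likely
to collide on the same bucket and its hb->lock.

/* Hypothetical sketch only -- not the actual kernel patch. */
#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct futex_bucket {
	atomic_t		waiters;	/* tasks queued (or queueing) here */
	spinlock_t		lock;
	struct list_head	chain;
};

static struct futex_bucket *futex_table;
static unsigned long futex_hashsize;

static int __init futex_table_init(void)
{
	unsigned long i;

	/* Scale with the machine rather than using a fixed 256 buckets. */
	futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());

	futex_table = kcalloc(futex_hashsize, sizeof(*futex_table), GFP_KERNEL);
	if (!futex_table)
		return -ENOMEM;

	for (i = 0; i < futex_hashsize; i++) {
		atomic_set(&futex_table[i].waiters, 0);
		spin_lock_init(&futex_table[i].lock);
		INIT_LIST_HEAD(&futex_table[i].chain);
	}
	return 0;
}

A real implementation would presumably use a large-system-hash style
allocator rather than kcalloc and would also spread the buckets across
nodes, but that is beyond the scope of this sketch.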
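In the same spirit, a sketch of the patch 4 fast path, reusing the
hypothetical futex_bucket layout above: the waker peeks at a per-bucket
waiter count and skips hb->lock entirely when it is zero, which is exactly
case (i) where up to 90% of the wake calls return 0.

/* Hypothetical sketch only, using the futex_bucket layout from above. */

/* Waiter side: advertise ourselves before taking the lock and queueing. */
static void bucket_queue_prepare(struct futex_bucket *hb)
{
	atomic_inc(&hb->waiters);
	/*
	 * Make the increment visible before we go on to queue and block,
	 * so a concurrent waker that reads waiters == 0 cannot miss us.
	 * (The waiter decrements hb->waiters again when it unqueues.)
	 */
	smp_mb__after_atomic();
}

/* Waker side: skip the lock entirely when the bucket is empty. */
static int bucket_wake(struct futex_bucket *hb, int nr_wake)
{
	int woken = 0;

	if (!atomic_read(&hb->waiters))
		return 0;		/* case (i): nothing to wake, lock untouched */

	spin_lock(&hb->lock);
	/* ... walk hb->chain and wake up to nr_wake matching waiters ... */
	spin_unlock(&hb->lock);

	return woken;
}

The subtle part is the ordering: the waiter must be visible in hb->waiters
before it can block, otherwise a concurrent waker could read zero and skip
a wakeup that was actually needed; the barrier comment above only gestures
at that.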
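Finally, a sketch of the deferred-wakeup idea behind patch 5 (again with
invented names, not Peter's original patch): waiters selected for wakeup
are parked on a private list while hb->lock is held, and the actual
wake_up_process() calls happen only after the lock has been dropped, so
freshly woken tasks do not immediately pile up on the same spinlock.

/* Hypothetical sketch only -- not the actual deferred-wakeup implementation. */
#include <linux/sched.h>
#include <linux/sched/task.h>

#define WAKE_BATCH	128	/* arbitrary bound for the sketch */

struct deferred_wake {
	struct task_struct	*tasks[WAKE_BATCH];
	int			nr;
};

/* Called with hb->lock held: remember the task instead of waking it now. */
static void deferred_wake_add(struct deferred_wake *dw, struct task_struct *p)
{
	if (dw->nr < WAKE_BATCH) {
		get_task_struct(p);	/* keep the task alive until we wake it */
		dw->tasks[dw->nr++] = p;
	} else {
		wake_up_process(p);	/* overflow: fall back to an immediate wakeup */
	}
}

/* Called after hb->lock has been dropped. */
static void deferred_wake_flush(struct deferred_wake *dw)
{
	int i;

	for (i = 0; i < dw->nr; i++) {
		wake_up_process(dw->tasks[i]);
		put_task_struct(dw->tasks[i]);
	}
	dw->nr = 0;
}

The wake path then becomes: take hb->lock, unqueue each matching waiter
with deferred_wake_add(), drop the lock, and only then call
deferred_wake_flush(), so the 125-200 wakeups mentioned above no longer
happen with the bucket lock held.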