On Thu, 2014-03-20 at 11:36 -0700, Linus Torvalds wrote:
> On Thu, Mar 20, 2014 at 10:18 AM, Davidlohr Bueso <davidl...@hp.com> wrote:
> >
> > Comparing with the patch I sent earlier this morning, looks equivalent,
> > and fwiw, passes my initial qemu bootup, which is the first way of
> > detecting anything stupid going on.
> >
> > So, Srikar, please try this patch out, as opposed to mine, you don't
> > have to first revert the commit in question.
>
> Ok, so it boots for me too, so hopefully it isn't totally broken.
>
> However, since it's just closing a race, and since getting the counts
> wrong should easily result in it *working* but always taking the slow
> path (for example), I'd really like people to also verify that it
> fixes the actual performance issue (ie assuming it fixes powerpc
> behavior for Srikar, I'd like to get it double-checked that it also
> avoids the spinlock in the common case).
Oh, it does. This atomics technique was tested at a customer's site and
considered ready for upstream. To refresh, we were originally seeing
massive contention on the hb->lock and an enormous amount of 0 returns
from futex_wake, indicating that spinners were piling up just to realize
that the plist was empty! While I don't have any official numbers, I can
confirm that perf showed this issue was addressed with the atomics
variant. Yes, such pathological behavior points to problems in the
design/implementation of the userspace locking primitives, but allowing
the kernel not to be affected by suboptimal uses of futexes is
definitely a plus.

As tglx suggested at the time, I also made sure that adding the barriers
when doing the key refcounting didn't impose any serious restrictions on
performance either.

Now, what required re-testing everything at the time was your suggestion
to replace this approach with the more elegant spin_is_locked() test.
Both approaches showed pretty much identical performance (and
correctness, at least on x86), and to this day we see *significantly*
less time spent in kernel space dealing with futexes.

> Because if the
> increment/decrement pairings end up being wrong, we could have a
> situation where the waiter count just ends up bogus, and it all works
> from a correctness standpoint but not from the intended performance
> optimization.
>
> No way I can test that sanely on my single-socket machine. Davidlohr?

Not this patch, no :( -- we could never blindly reproduce the customer's
workload. The only patch I was able to create test cases for is the
larger hash table one, which simply alleviates collisions; that one is
now part of perf-bench.

Thanks,
Davidlohr
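For readers who want the gist of the technique under discussion, below is a
minimal, hypothetical userspace sketch of the waiter-counting fast path. It
is not the kernel/futex.c patch: C11 atomics and a pthread mutex stand in
for the kernel's atomic waiter count and the hb->lock spinlock, and the
names (hash_bucket, bucket_wait, bucket_wake) are made up for illustration.
The point is simply that the wake side reads the counter first and skips
the lock entirely when nobody is queued, while the wait side bumps the
counter before queueing.

/*
 * Illustrative sketch only -- not the actual kernel patch. C11 atomics
 * and a pthread mutex stand in for the kernel's atomic waiter count
 * and the hb->lock spinlock. All names are hypothetical.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

struct hash_bucket {
	atomic_int waiters;	/* tasks queued (or about to queue) */
	pthread_mutex_t lock;	/* stand-in for hb->lock */
	int queued;		/* stand-in for the plist of waiters */
};

/* Wait side: advertise ourselves *before* queueing. */
void bucket_wait(struct hash_bucket *hb)
{
	/* seq_cst RMW stands in for the barrier pairing the real code needs */
	atomic_fetch_add(&hb->waiters, 1);

	pthread_mutex_lock(&hb->lock);
	hb->queued++;
	/* ... block until woken, then dequeue ... */
	hb->queued--;
	pthread_mutex_unlock(&hb->lock);

	atomic_fetch_sub(&hb->waiters, 1);
}

/* Wake side: skip the lock entirely when nobody is waiting. */
int bucket_wake(struct hash_bucket *hb)
{
	int woken = 0;

	if (atomic_load(&hb->waiters) == 0)
		return 0;	/* the common, uncontended case: no lock taken */

	pthread_mutex_lock(&hb->lock);
	if (hb->queued) {
		/* ... wake one queued waiter ... */
		woken = 1;
	}
	pthread_mutex_unlock(&hb->lock);
	return woken;
}

int main(void)
{
	struct hash_bucket hb = {
		.waiters = 0,
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.queued = 0,
	};

	/* No waiters queued, so the wake path never touches the lock. */
	printf("woken: %d\n", bucket_wake(&hb));
	return 0;
}

The subtlety Linus points at is keeping the increment/decrement perfectly
paired with queueing and dequeueing: if they drift, the count goes bogus
and, as he says, everything still works from a correctness standpoint but
the waker ends up always taking the slow path, losing the optimization.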