Andres Freund <and...@anarazel.de> writes:
> On 2018-07-20 15:35:39 -0400, Tom Lane wrote:
>> In any case, I strongly resist making performance-based changes on
>> the basis of one test on one kernel and one hardware platform.
> Sure, it'd be good to do more of that. But from a theoretical POV it's
> quite logical that posix semas sharing cachelines is bad for
> performance, if there's any contention. When backed by futexes -
> i.e. all non ancient linux machines - the hot path just does a cmpxchg
> of the *userspace* data (I've copied the relevant code below).

Here's the thing: the hot path is of little or no interest, because
if we are in the sema code at all, we are expecting to block.  The
only case where we wouldn't block is if the lock manager decided the
current process needs to sleep, but some other process already
released us by the time we reach the futex/kernel call.  Certainly
that will happen some of the time, but it's not likely to be the way
to bet.  So I'm very dubious of any arguments based on the speed of
the "uncontended" path.

It's possible that the bigger picture here is that the kernel boys
optimized the "uncontended" path to the point where they broke
performance of the blocking path.  It's hard to see how they could
have broken it to the point of being slower than the SysV sema API,
though.

Anyway, I think we need to test first and patch second.  I'm working
on getting some numbers on my own machines now.

On my RHEL6 machine, with unmodified HEAD and 8 sessions (since I've
only got 8 cores) but other parameters matching Mithun's example,
I just got

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: prepared
number of clients: 8
number of threads: 8
duration: 1800 s
number of transactions actually processed: 29001016
latency average = 0.497 ms
tps = 16111.575661 (including connections establishing)
tps = 16111.623329 (excluding connections establishing)

which is interesting because vmstat was pretty consistently reporting
around 500000 context swaps/second during the run, or circa 30
cs/transaction.  We'd have a minimum of 14 cs/transaction just
between client and server (due to seven SQL commands per transaction
in TPC-B), so that seems on the low side; not much lock contention
here, it seems.  I wonder what the corresponding ratio was in
Mithun's runs.

			regards, tom lane
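
[For reference, a simplified sketch of the futex-backed wait being
discussed.  This is not the glibc code Andres quoted; the names
(my_sema, my_sema_wait) are made up, it assumes Linux, and it only
illustrates the shape of the two paths: an uncontended wait is a
single userspace compare-and-swap, while the contended case falls
through to a futex syscall and sleeps.]

/* Hypothetical illustration only -- not glibc's sem_wait(). */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct
{
    _Atomic unsigned int value;   /* semaphore count, in user memory */
} my_sema;

static void
my_sema_wait(my_sema *sem)
{
    unsigned int cur = atomic_load(&sem->value);

    for (;;)
    {
        /* Fast path: count is positive, grab a unit with a cmpxchg
         * of the userspace word; no kernel entry needed. */
        while (cur > 0)
        {
            if (atomic_compare_exchange_weak(&sem->value, &cur, cur - 1))
                return;
        }

        /* Slow path: count is zero, ask the kernel to put us to sleep
         * until a post bumps the count and wakes the futex.  The kernel
         * rechecks that the value is still 0 before sleeping. */
        syscall(SYS_futex, &sem->value, FUTEX_WAIT, 0, NULL, NULL, 0);
        cur = atomic_load(&sem->value);
    }
}

[The post side would be the mirror image: cmpxchg the count upward and
issue FUTEX_WAKE if anyone might be sleeping.  Tom's point above is
that in PostgreSQL's usage the waiter has already been told to sleep
by the lock manager, so it usually does reach the FUTEX_WAIT call,
and the speed of the cmpxchg-only fast path matters little.]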