On Jun 23, 2011, at 23:40, Robert Haas wrote:
>>> I tried rewriting the LWLocks using CAS. It actually seems to make
>>> things slightly worse on the tests I've done so far, perhaps because I
>>> didn't make it respect spins_per_delay. Perhaps fetch-and-add would
>>> be better, but I'm not holding my breath. Everything I'm seeing so
>>> far leads me to the belief that we need to get rid of the contention
>>> altogether, not just contend more quickly.
I got curious and put a tool together to benchmark this. The code can be found at https://github.com/fgp/lockbench. The tool starts N workers who each acquire a lock in SHARE mode, increment a per-worker counter and release the lock again, rinse, repeat. That is done for T seconds, after which it reports the sum of the individual counters, the elapsed (wall) time, and the sums of the user and system times consumed by the workers.

The lock implementations currently supported are

  atomicinc:     RW-lock using atomic increment/decrement instructions
                 (locked xaddl) to acquire and release the lock in SHARE
                 mode in the non-contested case. The contested case is
                 unimplemented.

  cmpxchng:      Like atomicinc, but uses a "locked cmpxchg" loop instead
                 of "locked xaddl".

  spin:          RW-lock built around a simple cmpxchg-based spinlock
                 (i.e., to lock or unlock, the spinlock is taken, the
                 shared-lockers count is incremented or decremented, and
                 the spinlock is released). Similar to pg_lwlock, but
                 without the sleep() in the contested path of the
                 spinlock. The contested case of the RW-lock is again
                 unimplemented.

  none:          No locking.

  pg_lwlock:     The current Postgres LWLock implementation. The contested
                 case doesn't work because the workers currently don't
                 have a valid MyProc.

  pg_lwlock_cas: Like pg_lwlock, but with your CAS patch applied.
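To make the comparison concrete, here is a rough sketch of what the SHARE-mode fast paths of atomicinc and cmpxchng boil down to. This is not the actual lockbench code - it uses GCC's __sync builtins (which the compiler turns into locked xaddl / cmpxchgl on x86) rather than inline assembly, and the rwlock type and function names are made up for illustration. As in the benchmark, the contested (exclusive-holder) case is omitted.

#include <stdint.h>

typedef struct
{
    uint32_t    shared_count;   /* number of share-lockers */
} rwlock;

/* atomicinc: one locked xaddl to acquire, another to release */
static void
share_acquire_atomicinc(rwlock *l)
{
    __sync_fetch_and_add(&l->shared_count, 1);
    /* a complete RW-lock would check for a held or pending EXCLUSIVE lock here */
}

static void
share_release_atomicinc(rwlock *l)
{
    __sync_fetch_and_sub(&l->shared_count, 1);
}

/* cmpxchng: the same increment, done with a locked cmpxchg retry loop */
static void
share_acquire_cmpxchng(rwlock *l)
{
    for (;;)
    {
        uint32_t    old = l->shared_count;

        if (__sync_bool_compare_and_swap(&l->shared_count, old, old + 1))
            break;
        /* another core changed the count between the read and the CAS; retry */
    }
}

The difference the benchmark is probing is that the xaddl version always performs exactly one atomic read-modify-write per acquire, while the cmpxchg loop has to redo its read-modify-write whenever another core modifies the counter between the read and the CAS - which is presumably why cmpxchng falls further behind atomicinc as more workers are added.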
On a 4-core Intel Xeon system (Xeon X3430, 2.4 GHz, 8MB cache) I get the following results:

---------------------------- 1 Increment per Cycle ---------------------------
             |    1 worker   |   2 workers   |   3 workers   |   4 workers   |
             |wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
             |   /       /   |   /       /   |   /       /   |   /       /   |
             | cycles  cycles| cycles  cycles| cycles  cycles| cycles  cycles|
-------------------------------------------------------------------------------
none         |2.5e-09 2.5e-09|1.3e-09 2.5e-09|8.5e-10 2.5e-09|6.8e-10 2.5e-09|
atomicinc    |1.8e-08 1.8e-08|2.8e-08 5.6e-08|2.7e-08 8.1e-08|2.9e-08 1.1e-07|
cmpxchng     |3.0e-08 3.0e-08|7.1e-08 1.4e-07|6.9e-08 2.0e-07|7.2e-08 2.9e-07|
spin         |5.0e-08 5.0e-08|3.0e-07 5.9e-07|3.8e-07 1.1e-06|4.0e-07 1.5e-06|
pg_lwlock    |2.8e-08 2.8e-08|2.9e-08 3.0e-08|3.0e-08 3.2e-08|2.9e-08 3.1e-08|
pg_lwlock_cas|2.6e-08 2.6e-08|6.2e-08 1.2e-07|7.7e-08 2.3e-07|7.8e-08 3.1e-07|
-------------------------------------------------------------------------------

These results allow the following conclusions to be drawn.

First, our current pg_lwlock is quite good at preserving resources, i.e. at not spinning excessively - at least for up to four workers. It's the only lock implementation that consistently uses about 30ns of processor time per acquire/release cycle; only atomicinc with one worker (i.e. no cache-line fighting) manages to beat that. It does, however, effectively serialize execution with this workload - the consumed user time is approximately equal to the elapsed wall-clock time even with 4 workers, meaning that most of the time 3 of those 4 workers are sleeping.

Second, pg_lwlock_cas and cmpxchng perform similarly, which comes as no surprise, since they effectively use the same algorithm. One might expect spin and pg_lwlock to perform similarly as well, but they don't. The reason is probably that spin uses a rather brain-dead spinning method, while pg_lwlock uses "rep; nop".

Now to atomicinc, which is (the fast path of) the most optimized RW-lock possible, at least on x86 and x86_64. One interesting result here is that the wall seconds per cycle are suspiciously similar for atomicinc and pg_lwlock. This proves, I think, that the limiting factor in both of these tests is the speed at which cache-line invalidation can occur. It doesn't seem to matter whether the other cores are idle (as in the pg_lwlock case) or trying to obtain ownership of the cache line themselves (as in the atomicinc case).

Finally, the no-locking test (none) shows that, even though the counter is written to shared memory after every increment, memory bandwidth isn't an issue here, since performance scales nearly linearly with the number of workers.

Here's the result of another run, this time with the per-worker counter being incremented 100 times in each acquire/release cycle.

--------------------------- 100 Increments per Cycle -------------------------
             |    1 worker   |   2 workers   |   3 workers   |   4 workers   |
             |wallsec usersec|wallsec usersec|wallsec usersec|wallsec usersec|
             |   /       /   |   /       /   |   /       /   |   /       /   |
             | cycles  cycles| cycles  cycles| cycles  cycles| cycles  cycles|
-------------------------------------------------------------------------------
none         |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|6.5e-08 2.6e-07|
atomicinc    |2.5e-07 2.5e-07|1.3e-07 2.6e-07|8.6e-08 2.5e-07|6.5e-08 2.6e-07|
cmpxchng     |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.6e-08 2.6e-07|7.8e-08 3.1e-07|
spin         |3.0e-07 3.0e-07|1.5e-07 3.0e-07|1.1e-07 3.2e-07|3.4e-07 1.4e-06|
pg_lwlock    |2.5e-07 2.5e-07|1.3e-07 2.5e-07|8.4e-08 2.5e-07|1.2e-07 4.8e-07|
pg_lwlock_cas|2.6e-07 2.6e-07|1.6e-07 3.1e-07|1.1e-07 3.2e-07|8.7e-08 3.4e-07|
-------------------------------------------------------------------------------

This makes the overhead of the locking disappear in the noise in the single-worker case. With 4 workers, however, differences materialize. The no-locking case still shows an approximately linear speedup, which I think again rules out memory-bandwidth effects. Now atomicinc is the only implementation that doesn't perform significantly worse than the no-lock case. The closest competitors are the CAS-based implementations cmpxchng and pg_lwlock_cas, with about 20% and 33% slowdown respectively (in both wall seconds and user seconds per cycle). The current LWLock implementation falls behind now, taking about twice as many seconds per cycle as atomicinc.

In other words, with this workload an atomicinc-based RW-lock, taken in SHARE mode, should add virtually no overhead, while every other implementation adds at least 20%.

It'd be interesting to get benchmark results on machines with more than four cores as well, but I don't have access to any such machine at the moment. So if someone could run this on an 8- or 16-core beast, that'd be great. The code compiles on Mac OS X and Ubuntu 10.04. It's fairly portable; to make the postgres parts compile on other platforms, it might be necessary to adjust pg_stub/pg_config_manual.h.

best regards,
Florian Pflug