On Jun 28, 2011, at 22:18 , Robert Haas wrote:
> On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <f...@phlo.org> wrote:
>> [ testing of various spinlock implementations ]
>
> I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core
> loaner from Nate Boley:
Cool, thanks!

> 100 counter increments per cycle
> [ results table, mangled by line wrapping; reposted below ]

Here's the same table, formatted with spaces.

worker              1               2               4               8              16              32
time             wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none          2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc     3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng      3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin          4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock     3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

And here's the throughput table calculated from your results, i.e. the wall
time per cycle for 1 worker divided by the wall time per cycle for N workers.

workers          2     4     8    16    32
none           1.9   3.5   6.7    13    26
atomicinc      1.4   2.6   2.6   2.4   2.4
cmpxchng       1.1   1.1   1.2   0.9   0.8
spin           1.5   2.6   0.3   0.1   0.0
pg_lwlock      1.4   2.5   1.0   0.2   0.1
pg_lwlock_cas  1.3   2.6   2.6   1.9   1.5

Hm, so in the best case we get 2.6x the throughput of a single core, and that
only for 4 and 8 workers (1.4e-7 seconds/cycle vs. 3.6e-7). In that case there
also seems to be little difference between pg_lwlock{_cas} and atomicinc.
atomicinc at least manages to sustain that throughput as the worker count
increases, while for the others the throughput actually *decreases*.

What totally puzzles me is that your results don't show any trace of a
decreased system load for the pg_lwlock implementation, which I'd have
expected due to the sleep() in the contested path. Here are the user vs. wall
time ratios - I'd have expected to see values significantly below the number
of workers for pg_lwlock:

workers          1     2     4     8    16    32
none           1.0   2.0   4.0   7.9    16    31
atomicinc      1.0   2.0   3.9   7.9    15    33
cmpxchng       1.0   2.0   4.1   7.9    16    31
spin           1.0   2.0   3.9   7.8    16    31
pg_lwlock      1.0   2.0   4.1   7.9    16    31
pg_lwlock_cas  1.0   2.0   4.1   7.9    16    31

> I wrote a little script to show to reorganize this data in a
> possibly-easier-to-understand format - ordering each column from
> lowest to highest, and showing each algorithm as a multiple of the
> cheapest value for that column:

If you're OK with that, I'd like to add that to the lockbench repo.

> There seems to be something a bit funky in your 3-core data, but
> overall I read this data to indicate that 4 cores aren't really enough
> to see a severe problem with spinlock contention.

Hm, it starts to show if you lower the number of counter increments per cycle
(the D constant in run.sh). But yeah, it's never as bad as the 32-core results
above.
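For anyone reading along who hasn't looked at the harness, here's roughly what
the three non-lwlock variants boil down to - a minimal sketch using GCC's
__sync builtins; the function and variable names are made up for illustration
and the actual lockbench code may differ in detail:

/* Illustrative sketch only -- not the actual lockbench code. */
#include <stdint.h>

static volatile uint64_t counter = 0;
static volatile int lock = 0;            /* 0 = free, 1 = held */

/* "atomicinc": one locked add per increment, no retry loop */
static void inc_atomicinc(void)
{
    __sync_fetch_and_add(&counter, 1);
}

/* "cmpxchng": read, add, then retry the compare-and-swap until no
 * other worker has modified the counter in between */
static void inc_cmpxchng(void)
{
    uint64_t oldval, newval;
    do {
        oldval = counter;
        newval = oldval + 1;
    } while (__sync_val_compare_and_swap(&counter, oldval, newval) != oldval);
}

/* "spin": a test-and-set spinlock around a plain increment */
static void inc_spin(void)
{
    while (__sync_lock_test_and_set(&lock, 1) != 0)
        ;                                /* busy-wait while the lock is held */
    counter++;
    __sync_lock_release(&lock);
}

That also gives a hand-wavy explanation for the scaling behaviour above: the
locked add always makes forward progress under contention, while the CAS loop
and the spinlock can burn an unbounded number of retries per increment once
many workers collide.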
best regards,
Florian Pflug