Hi Pádraig,

> Wow that's much better on a 40 core system:
>
> Before your patch:
> =================
> $ time ./test-lock
> Starting test_lock ... OK
> Starting test_rwlock ... OK
> Starting test_recursive_lock ... OK
> Starting test_once ... OK
>
> real 1m32.547s
> user 1m32.455s
> sys 13m21.532s
>
> After your patch:
> =================
> $ time ./test-lock
> Starting test_lock ... OK
> Starting test_rwlock ... OK
> Starting test_recursive_lock ... OK
> Starting test_once ... OK
>
> real 0m3.364s
> user 0m3.087s
> sys 0m25.477s
Wow, a 30x speed increase by using a lock instead of 'volatile'! Thanks for the testing. I cleaned up the patch to do less code duplication and pushed it.

Still, I wonder about the cause of this speed difference. It must be the reads from the 'volatile' variable that are problematic, because the program writes to the 'volatile' variable only 6 times in total.

What happens when a program reads from a 'volatile' variable at address xy in a multi-processor system? It must broadcast to all other CPUs "please flush your internal write caches", wait for these flushes to complete, and then do a read at address xy. But the same procedure must also happen when taking a lock at address xy. So, where does the speed difference come from? The 'volatile' handling must be implemented in a terrible way: either GCC generates inefficient instructions, or the hardware executes these instructions in a horrible way.

What is the hardware of your 40-core machine (just for reference)?

Bruno
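To make the comparison concrete, here is a minimal, self-contained sketch of the two waiting patterns being discussed: a checker thread that spins on a 'volatile' completion flag versus one that reads the flag under a mutex. This is not the actual test-lock.c code; it uses plain pthreads rather than the gnulib glthread macros, and identifiers such as done_volatile and done_lock are made up for illustration.

/* Sketch only: contrast a checker thread polling a 'volatile' flag
   with one that reads the flag under a mutex.  */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Variant 1: completion signalled through a 'volatile' variable.
   Every iteration reloads a contended cache line, and the tight loop
   gives the hardware no chance to back off.  */
static volatile int done_volatile = 0;

static void *checker_volatile (void *arg)
{
  while (!done_volatile)
    {
      /* ... verify the data structure under test ... */
    }
  return NULL;
}

/* Variant 2: completion flag read under a lock.  The lock/unlock pair
   serializes the checkers and throttles how often the shared cache
   line is touched.  */
static int done_locked = 0;
static pthread_mutex_t done_lock = PTHREAD_MUTEX_INITIALIZER;

static void *checker_locked (void *arg)
{
  for (;;)
    {
      int done;
      pthread_mutex_lock (&done_lock);
      done = done_locked;
      pthread_mutex_unlock (&done_lock);
      if (done)
        break;
      /* ... verify the data structure under test ... */
    }
  return NULL;
}

int main (void)
{
  pthread_t t1, t2;
  pthread_create (&t1, NULL, checker_volatile, NULL);
  pthread_create (&t2, NULL, checker_locked, NULL);
  sleep (1);                 /* let the checkers run for a while */
  pthread_mutex_lock (&done_lock);
  done_locked = 1;           /* the writes are few, as in the test */
  pthread_mutex_unlock (&done_lock);
  done_volatile = 1;
  pthread_join (t1, NULL);
  pthread_join (t2, NULL);
  puts ("OK");
  return 0;
}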
0001-lock-test-Fix-performance-problem-on-multi-core-mach.patch