On 24/12/16 17:52, Bruno Haible wrote:
> Hi Pádraig,
>
>> Wow that's much better on a 40 core system:
>>
>> Before your patch:
>> =================
>> $ time ./test-lock
>> Starting test_lock ... OK
>> Starting test_rwlock ... OK
>> Starting test_recursive_lock ... OK
>> Starting test_once ... OK
>>
>> real    1m32.547s
>> user    1m32.455s
>> sys     13m21.532s
>>
>> After your patch:
>> =================
>> $ time ./test-lock
>> Starting test_lock ... OK
>> Starting test_rwlock ... OK
>> Starting test_recursive_lock ... OK
>> Starting test_once ... OK
>>
>> real    0m3.364s
>> user    0m3.087s
>> sys     0m25.477s
>
> Wow, a 30x speed increase by using a lock instead of 'volatile'!
>
> Thanks for the testing. I cleaned up the patch to do less
> code duplication and pushed it.
>
> Still, I wonder about the cause of this speed difference.
> It must be the read from the 'volatile' variable that is problematic,
> because the program writes to 'volatile' variable only 6 times in total.
>
> What happens when a program reads from a 'volatile' variable
> at address xy in a multi-processor system? It must do a broadcast
> to all other CPUs "please flush your internal write caches", wait
> for these flushes to be completed, and then do a read at address xy.
> But the same procedure must also happen when taking a lock at
> address xy. So, where does the speed difference come from?
> The 'volatile' handling must be implemented in a terrible way;
> either GCC generates inefficient instructions? or these instructions
> are executed in a horrible way by the hardware?
>
> What is the hardware of your 40-core machine (just for reference)?
Might be NUMA related?

CPU is Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz

Attached is the output from:
  lstopo --no-legend -v -p --of png > test-lock.png

thanks again,
Pádraig
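
A minimal sketch of the pattern being discussed may help: shared "accounts"
that are only ever touched while holding a pthread mutex, where a 'volatile'
qualifier adds nothing for correctness and only forces every access to go
through memory.  The names, thread counts and iteration counts below are
illustrative, not the actual gnulib test-lock.c; build with -DUSE_VOLATILE
to time the volatile variant against the plain one.

/* volatile-sketch.c
   Plain variant:     cc -O2 -pthread volatile-sketch.c
   Volatile variant:  cc -O2 -pthread -DUSE_VOLATILE volatile-sketch.c  */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ACCOUNT_COUNT 8
#define MUTATOR_COUNT 8
#define ROUNDS 1000000

#ifdef USE_VOLATILE
/* 'volatile' forces every access through memory, even though the mutex
   already provides all the ordering the checker needs.  */
static volatile int account[ACCOUNT_COUNT];
#else
static int account[ACCOUNT_COUNT];
#endif
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int done;

/* Move money between accounts; the total must stay zero.  */
static void *
mutator_thread (void *arg)
{
  for (int i = 0; i < ROUNDS; i++)
    {
      pthread_mutex_lock (&lock);
      int from = i % ACCOUNT_COUNT;
      int to = (i + 1) % ACCOUNT_COUNT;
      account[from] -= 7;
      account[to] += 7;
      pthread_mutex_unlock (&lock);
    }
  return NULL;
}

/* Repeatedly verify the invariant, always under the lock.  */
static void *
checker_thread (void *arg)
{
  int stop = 0;
  while (!stop)
    {
      pthread_mutex_lock (&lock);
      int sum = 0;
      for (int i = 0; i < ACCOUNT_COUNT; i++)
        sum += account[i];
      stop = done;
      pthread_mutex_unlock (&lock);
      if (sum != 0)
        {
          fprintf (stderr, "invariant violated\n");
          abort ();
        }
    }
  return NULL;
}

int
main (void)
{
  pthread_t checker;
  pthread_t mutators[MUTATOR_COUNT];

  pthread_create (&checker, NULL, checker_thread, NULL);
  for (int i = 0; i < MUTATOR_COUNT; i++)
    pthread_create (&mutators[i], NULL, mutator_thread, NULL);
  for (int i = 0; i < MUTATOR_COUNT; i++)
    pthread_join (mutators[i], NULL);

  pthread_mutex_lock (&lock);
  done = 1;
  pthread_mutex_unlock (&lock);
  pthread_join (checker, NULL);

  puts ("OK");
  return 0;
}

On a many-core (and especially NUMA) box the volatile variant should show
more memory traffic in the hot loops; whether it reproduces anything close
to the 30x gap reported above will depend on the hardware.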