On Sat, Jun 29, 2013 at 10:06:11AM +0300, Alexander Motin wrote:
> I understand that lock attempt will steal cache line from lock owner.
> What I don't very understand is why avoiding it helps performance in
> this case. Indeed, having mutex on own cache line will not let other
> cores to steal also bswlist, but it also means that bswlist should be
> prefetched separately (and profiling shows resource stalls there). Or in
> this case separate speculative prefetch will be better then forced one
> which could be stolen? Is there cases when it is not, or the only reason
> to not pad all global mutexes is only saving memory?
I can speculate that this is a case where speculative execution helps. If the mutex and the list head are on different cache lines, the CPU can speculatively read the head and then prove that executing the read before the lock acquisition does not break the ordering rules (because the lock protects the head, another core cannot modify the head if the lock acquisition was successful).

I think this is very similar to the reason why locked instructions used as barriers are faster than the explicit barriers: the CPU can still do speculative execution after the lock prefix if the ordering is provably consistent. Please see section 8.4.5 of the Intel IA32 architecture optimization manual for the recommendations (but not much explanation).

Yes, I think putting all locks on dedicated cache lines is a waste; only hot locks need this.
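
To make the layout concrete, here is a minimal userland sketch of a hot lock and the list head it protects placed on separate cache lines. Everything in it is an assumption for illustration: the 64-byte line size, the names struct padded_list and padded_list_pop, and the C11 atomic_flag spin lock standing in for a kernel mutex. It is not the actual bswlist code.

```c
#include <assert.h>    /* static_assert */
#include <stdalign.h>  /* alignas */
#include <stdatomic.h> /* atomic_flag */
#include <stddef.h>    /* offsetof, NULL */

/* Assumption: 64-byte cache lines, as on common x86 CPUs. */
#define CACHE_LINE 64

/* Hypothetical list element. */
struct buf {
	struct buf *next;
};

/*
 * Hypothetical padded layout: the hot lock sits alone on its cache
 * line and the protected head starts on the following line, so a
 * contending core that grabs the lock line does not also take the
 * line holding the head.
 */
struct padded_list {
	alignas(CACHE_LINE) atomic_flag lock;
	alignas(CACHE_LINE) struct buf *head;
};

static_assert(offsetof(struct padded_list, head) % CACHE_LINE == 0,
    "head must start on its own cache line");

/* Pop the first element under the lock; returns NULL if the list is empty. */
static struct buf *
padded_list_pop(struct padded_list *l)
{
	struct buf *b;

	/* Spin until the flag was previously clear (acquire ordering). */
	while (atomic_flag_test_and_set_explicit(&l->lock,
	    memory_order_acquire))
		;
	/* The head lives on a different line than the lock word. */
	b = l->head;
	if (b != NULL)
		l->head = b->next;
	atomic_flag_clear_explicit(&l->lock, memory_order_release);
	return (b);
}
```

The cost is visible in sizeof(struct padded_list): two full lines instead of a few bytes, which is why this only makes sense for hot locks.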