On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.
Thanks for spending time on this. I have a new patchset (will send it as a reply to this e-mail in a few minutes) that has good performance. Its main ideas:

- Use transactions that start on ldrex and finish on strex. On an exception, end (instead of abort) the ongoing transaction, if any. There's little point in aborting, since the subsequent retries would end up in the same exception anyway. This means the translation of the corresponding blocks might happen via the fallback path. That's OK, given that subsequent executions of the TBs will (likely) complete via HTM.

- For the fallback path, add a stop-the-world primitive that stops all other CPUs, without requiring the calling CPU to exit the CPU loop. Not breaking out of the loop keeps the code simple--we can just keep translating/executing normally, with the guarantee that no other CPU can run until we're done.

- The fallback path of the transaction stops the world and then continues execution (from ldrex) as the only running CPU.

- Only retry when the hardware hints that we may do so. This ends up being rare (I can only get dozens of retries under heavy contention, for instance with 'atomic_add-bench -r 1').

Limitations: for now this is user-mode only, and I have paid no attention to paired atomics. Also, I'm making no checks for unusual (undefined?) guest code, such as stray ldrex/strex instructions thrown in there.

Performance optimizations like you suggest (e.g. starting a TB on ldrex, or using TCG ops for beginning/ending the transaction) could be implemented, but at least on Intel TSX (the only one I've tried so far [*]), the transaction buffer seems big enough not to make these optimizations a necessity.

[*] I tried running HTM primitives on the gcc compile farm's Power8, but I get an illegal instruction fault on tbegin. I've filed an issue to report it: https://gna.org/support/?3369

Some observations:

- The peak number of retries I see is for atomic_add-bench -r 1 -n 16 (on an 8-thread machine), at ~90 retries. So I set the limit to 100.

- The lowest success rate I've seen is ~98%, again for atomic_add-bench under high contention.

Some numbers:

- atomic_add's performance is lower for HTM than for cmpxchg, although under contention the two become very similar. The reason for the gap is that xbegin/xend takes more cycles than cmpxchg, especially under little or no contention; this explains the large difference for threads=1.
  http://imgur.com/5kiT027
  As a side note, contended transactions seem to scale worse than contended cmpxchg when exploiting SMT, but I wouldn't read much into that.

- For more realistic workloads that gap goes away, as the relative impact of the cmpxchg or transaction delays is lower. For QHT, 1000 keys:
  http://imgur.com/l6vcowu
  And for SPEC (note that despite being single-threaded, SPEC executes a lot of atomics, e.g. from mutexes and from forking):
  http://imgur.com/W49YMhJ
  Performance is essentially identical to that of cmpxchg, but of course with HTM we get correct emulation.

(Rough sketches of the HTM retry loop, the stop-the-world primitive, and the cmpxchg baseline are appended below.)

Thanks for reading this far!

		Emilio
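Appendix: three rough sketches, in simplified form, of the mechanisms discussed above. They are illustrative only; the names and structure are not the patchset's code.

Sketch 1: how a guest ldrex/add/strex sequence can map onto an Intel TSX (RTM) transaction with the hint-driven, bounded retry policy described above. cpu_stop_the_world()/cpu_resume_the_world() are placeholders for the fallback-path primitives (see sketch 2), and the guest work is reduced to a plain add. Build with -mrtm.

#include <immintrin.h>
#include <stdint.h>

#define HTM_RETRY_LIMIT 100   /* peak observed was ~90 retries */

/* Placeholders for the fallback-path primitives (see sketch 2). */
void cpu_stop_the_world(void);
void cpu_resume_the_world(void);

/* Emulate "ldrex; add; strex" on a guest word; returns the old value. */
uint32_t htm_ldrex_add_strex(uint32_t *addr, uint32_t inc)
{
    int retries = 0;
    uint32_t old;

    for (;;) {
        unsigned int status = _xbegin();     /* transaction starts at ldrex */
        if (status == _XBEGIN_STARTED) {
            old = *addr;                     /* ldrex: transactional load   */
            *addr = old + inc;               /* guest work inside the txn   */
            _xend();                         /* strex: commit, i.e. success */
            return old;
        }
        /* Retry only when the hardware hints that a retry may succeed. */
        if (!(status & _XABORT_RETRY) || ++retries > HTM_RETRY_LIMIT) {
            break;
        }
    }

    /* Fallback: redo the sequence from ldrex as the only running CPU. */
    cpu_stop_the_world();
    old = *addr;
    *addr = old + inc;
    cpu_resume_the_world();
    return old;
}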
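Sketch 2: the stop-the-world primitive in the abstract. The CPU taking the fallback path halts every other vCPU thread, each of which checks a flag at a safe point (e.g. between TBs), and keeps running itself without leaving its execution loop.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t stw_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  stw_cond = PTHREAD_COND_INITIALIZER;
static bool world_stopped;
static int  cpus_paused;
static int  cpus_total;      /* set when the vCPU threads are created */

/* Every vCPU thread polls this at a safe point, e.g. between TBs. */
void cpu_check_stop(void)
{
    pthread_mutex_lock(&stw_lock);
    if (world_stopped) {
        cpus_paused++;
        pthread_cond_broadcast(&stw_cond);   /* wake the stopping CPU */
        while (world_stopped) {
            pthread_cond_wait(&stw_cond, &stw_lock);
        }
        cpus_paused--;
    }
    pthread_mutex_unlock(&stw_lock);
}

/* Called on the fallback path; the caller keeps running afterwards. */
void cpu_stop_the_world(void)
{
    pthread_mutex_lock(&stw_lock);
    world_stopped = true;
    while (cpus_paused < cpus_total - 1) {   /* wait for the others to park */
        pthread_cond_wait(&stw_cond, &stw_lock);
    }
    pthread_mutex_unlock(&stw_lock);
}

void cpu_resume_the_world(void)
{
    pthread_mutex_lock(&stw_lock);
    world_stopped = false;
    pthread_cond_broadcast(&stw_cond);
    pthread_mutex_unlock(&stw_lock);
}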
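Sketch 3: for reference, a cmpxchg-based emulation of the same sequence, along the lines of what the numbers above compare against, using the GCC/Clang __atomic builtins.

#include <stdint.h>

uint32_t cmpxchg_ldrex_add_strex(uint32_t *addr, uint32_t inc)
{
    uint32_t old, val;

    do {
        old = __atomic_load_n(addr, __ATOMIC_ACQUIRE);             /* ldrex */
        val = old + inc;
    } while (!__atomic_compare_exchange_n(addr, &old, val, false,  /* strex */
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
    return old;
}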