On Thu, Aug 18, 2016 at 08:38:47 -0700, Richard Henderson wrote:
> A couple of other notes, as I've thought about this some more.
Thanks for spending time on this. I have a new patchset (will send it as a reply to this e-mail in a few minutes) that has good performance. Its main ideas:

- Use transactions that start on ldrex and finish on strex. On an exception, end (instead of abort) the ongoing transaction, if any. There's little point in aborting, since the subsequent retries would end up in the same exception anyway. This means the translation of the corresponding blocks might happen via the fallback path. That's OK, given that subsequent executions of the TBs will (likely) complete via HTM.

- For the fallback path, add a stop-the-world primitive that stops all other CPUs, without requiring the calling CPU to exit the CPU loop. Not breaking out of the loop keeps the code simple--we can just keep translating/executing normally, with the guarantee that no other CPU can run until we're done.

- The fallback path of the transaction stops the world and then continues execution (from ldrex) as the only running CPU.

- Only retry when the hardware hints that we may do so. This ends up being rare (I can only get dozens of retries under heavy contention, for instance with 'atomic_add-bench -r 1').

Limitations: for now this is user-mode only, and I have paid no attention to paired atomics. Also, I'm making no checks for unusual (undefined?) guest code, such as stray ldrex/strex instructions thrown in there.

Performance optimizations like you suggest (e.g. starting a TB on ldrex, or using TCG ops for beginning/ending the transaction) could be implemented, but at least on Intel TSX (the only one I've tried so far [*]), the transaction buffer seems big enough not to make these optimizations a necessity.

[*] I tried running HTM primitives on the gcc compile farm's Power8, but I get an illegal instruction fault on tbegin. I've filed an issue to report it: https://gna.org/support/?3369

Some observations:

- The peak number of retries I see is for atomic_add-bench -r 1 -n 16 (on an 8-thread machine), at ~90 retries. So I set the limit to 100.

- The lowest success rate I've seen is ~98%, again for atomic_add-bench under high contention.

Some numbers:

- atomic_add's performance is lower for HTM than for cmpxchg, although under contention the two become very similar. The reason for the gap is that xbegin/xend takes more cycles than cmpxchg, especially under little or no contention; this explains the large difference for threads=1.
  http://imgur.com/5kiT027
  As a side note, contended transactions seem to scale worse than contended cmpxchg when exploiting SMT, but I wouldn't read much into that.

- For more realistic workloads that gap goes away, as the relative impact of the cmpxchg or transaction delays is lower. For QHT, 1000 keys:
  http://imgur.com/l6vcowu
  And for SPEC (note that despite being single-threaded, SPEC executes a lot of atomics, e.g. from mutexes and from forking):
  http://imgur.com/W49YMhJ
  Performance is essentially identical to that of cmpxchg, but of course with HTM we get correct emulation.

(Rough sketches of the HTM retry loop, the stop-the-world primitive, and the cmpxchg baseline are appended below.)

Thanks for reading this far!

		Emilio
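Appendix: three rough sketches, in simplified form, of the mechanisms discussed above. They are illustrative only; the names and structure are not the patchset's code.

Sketch 1: how a guest ldrex/add/strex sequence can map onto an Intel TSX (RTM) transaction with the hint-driven, bounded retry policy described above. cpu_stop_the_world()/cpu_resume_the_world() are placeholders for the fallback-path primitives (see sketch 2), and the guest work is reduced to a plain add. Build with -mrtm.

#include <immintrin.h>
#include <stdint.h>

#define HTM_RETRY_LIMIT 100   /* peak observed was ~90 retries */

/* Placeholders for the fallback-path primitives (see sketch 2). */
void cpu_stop_the_world(void);
void cpu_resume_the_world(void);

/* Emulate "ldrex; add; strex" on a guest word; returns the old value. */
uint32_t htm_ldrex_add_strex(uint32_t *addr, uint32_t inc)
{
    int retries = 0;
    uint32_t old;

    for (;;) {
        unsigned int status = _xbegin();     /* transaction starts at ldrex */
        if (status == _XBEGIN_STARTED) {
            old = *addr;                     /* ldrex: transactional load   */
            *addr = old + inc;               /* guest work inside the txn   */
            _xend();                         /* strex: commit, i.e. success */
            return old;
        }
        /* Retry only when the hardware hints that a retry may succeed. */
        if (!(status & _XABORT_RETRY) || ++retries > HTM_RETRY_LIMIT) {
            break;
        }
    }

    /* Fallback: redo the sequence from ldrex as the only running CPU. */
    cpu_stop_the_world();
    old = *addr;
    *addr = old + inc;
    cpu_resume_the_world();
    return old;
}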
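Sketch 2: the stop-the-world primitive in the abstract. The CPU taking the fallback path halts every other vCPU thread, each of which checks a flag at a safe point (e.g. between TBs), and keeps running itself without leaving its execution loop.

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t stw_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  stw_cond = PTHREAD_COND_INITIALIZER;
static bool world_stopped;
static int  cpus_paused;
static int  cpus_total;      /* set when the vCPU threads are created */

/* Every vCPU thread polls this at a safe point, e.g. between TBs. */
void cpu_check_stop(void)
{
    pthread_mutex_lock(&stw_lock);
    if (world_stopped) {
        cpus_paused++;
        pthread_cond_broadcast(&stw_cond);   /* wake the stopping CPU */
        while (world_stopped) {
            pthread_cond_wait(&stw_cond, &stw_lock);
        }
        cpus_paused--;
    }
    pthread_mutex_unlock(&stw_lock);
}

/* Called on the fallback path; the caller keeps running afterwards. */
void cpu_stop_the_world(void)
{
    pthread_mutex_lock(&stw_lock);
    world_stopped = true;
    while (cpus_paused < cpus_total - 1) {   /* wait for the others to park */
        pthread_cond_wait(&stw_cond, &stw_lock);
    }
    pthread_mutex_unlock(&stw_lock);
}

void cpu_resume_the_world(void)
{
    pthread_mutex_lock(&stw_lock);
    world_stopped = false;
    pthread_cond_broadcast(&stw_cond);
    pthread_mutex_unlock(&stw_lock);
}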
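Sketch 3: for reference, a cmpxchg-based emulation of the same sequence, along the lines of what the numbers above compare against, using the GCC/Clang __atomic builtins.

#include <stdint.h>

uint32_t cmpxchg_ldrex_add_strex(uint32_t *addr, uint32_t inc)
{
    uint32_t old, val;

    do {
        old = __atomic_load_n(addr, __ATOMIC_ACQUIRE);             /* ldrex */
        val = old + inc;
    } while (!__atomic_compare_exchange_n(addr, &old, val, false,  /* strex */
                                          __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST));
    return old;
}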