Hi all,

These are patches I've been working on for some time now. Since emulation of atomic instructions has recently been getting attention ([1], [2]), I'm submitting them for comment.
[1] http://thread.gmane.org/gmane.comp.emulators.qemu/314406
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561

Main features of this design:

- Performance and scalability are the main design goals: guest code should scale as well as it would scale running natively on the host. For this, a host lock is (if necessary) assigned to each 16-byte aligned chunk of the physical address space. The assignment (i.e. lock allocation + registration) only happens after an atomic operation is performed on a particular physical address. To keep track of this sparse set of locks, a lockless radix tree is used, so lookups are fast and scalable.

- Translation helpers are employed to call the 'aie' module, which is the common code that accesses the radix tree, locking the appropriate entry depending on the access' physical address.

- No special host atomic instructions (e.g. cmpxchg16b) are required; mutexes and include/qemu/atomic.h are all that's needed.

- Usermode and full-system are supported with the same code. Note that the newly-added tiny_set module is necessary to properly emulate LL/SC, since the number of "cpus" (i.e. threads) is unbounded in usermode; for full-system mode a bitmap would have been sufficient.

- ARM: Stores concurrent with LL/SC primitives are initially not dealt with. This is a deliberate choice, since I'm assuming most sane code will only handle data atomically using LL/SC primitives. However, SWP can be used, so whenever a SWP instruction is issued, stores start checking that they do not clash with concurrent SWP instructions. This is implemented via pre/post-store helpers. I've stress-tested this with a heavily contended guest lock (64 cores), and it works fine. Executing non-trivial pre/post-store helpers adds a 5% perf overhead to Linux bootup, and is negligible on regular programs. Anyway, most sane code doesn't use SWP (Linux bootup certainly doesn't), so this overhead is rarely seen.
- x86: Instead of acquiring the same host lock every time LOCK is found, the acquisition of an AIE lock (via the radix tree) is done once the address of the ensuing load/store is known. Loads perform this check at compile-time. Stores are emulated using the same trick as in ARM: non-atomic stores are executed as atomic stores iff a prior atomic operation has been executed on their target address. This, for instance, ensures that a regular store cannot race with a cmpxchg. This has very small overhead (negligible with OpenSSL's bntest in user-only), and scales like native code.

- Barriers: not emulated yet. They're needed to correctly run non-trivial lockless code (I'm using concurrencykit's testbenches). The strongly-ordered-guest-on-weakly-ordered-host problem remains; my guess is that we'll have to sacrifice single-threaded performance to make it work (e.g. using pre/post ld/st helpers).

- 64-bit guest on 32-bit host: not supported yet. Note that 64-bit loads/stores on a 32-bit host are not atomic, yet 64-bit code might have been written assuming that they are. Checks for this will be needed.

- Other ISAs: not done yet, but they should be like either ARM or x86.

- License of new files: is there a preferred license for new code?

- Please tolerate the lack of comments in the code and commit logs; when preparing this RFC I thought it better to put all the info here. If this weren't an RFC I'd have done it differently.

Thanks for reading this far, comments welcome!

		Emilio