On 08/05/2015 23:02, Emilio G. Cota wrote:
Hi all,
These are patches I've been working on for some time now.
Since emulation of atomic instructions has recently been getting
attention ([1], [2]), I'm submitting them for comment.
[1] http://thread.gmane.org/gmane.comp.emulators.qemu/314406
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
Main features of this design:
- Performance and scalability are the main design goals: guest code should
scale as well as it would running natively on the host.
For this, a host lock is (if necessary) assigned to each 16-byte
aligned chunk of the physical address space.
The assignment (i.e. lock allocation + registration) only happens
after an atomic operation on a particular physical address
is performed. To keep track of this sparse set of locks,
a lockless radix tree is used, so lookups are fast and scalable.
- Translation helpers are employed to call the 'aie' module, which is
the common code that accesses the radix tree, locking the appropriate
entry depending on the access' physical address.
- No special host atomic instructions (e.g. cmpxchg16b) are required;
mutexes and include/qemu/atomic.h are all that's needed.
- Usermode and full-system are supported with the same code. Note that
the newly-added tiny_set module is necessary to properly emulate LL/SC,
since the number of "cpus" (i.e. threads) is unbounded in usermode;
for full-system mode a bitmap would have been sufficient.
- ARM: Stores concurrent with LL/SC primitives are initially not dealt
with.
This is my choice, since I'm assuming most sane code will only
handle data atomically using LL/SC primitives. However, SWP can
be used, so whenever a SWP instruction is issued, stores start checking
that they do not clash with concurrent SWP instructions. This is
implemented via pre/post-store helpers. I've stress-tested this with a
heavily contended guest lock (64 cores), and it works fine. Executing
non-trivial pre/post-store helpers adds a 5% perf overhead to Linux
bootup, and is negligible on regular programs. Anyway, most
sane code doesn't use SWP (Linux bootup certainly doesn't), so this
overhead is rarely seen.
- x86: Instead of acquiring the same host lock every time LOCK is found,
the acquisition of an AIE lock (via the radix tree) is done when the
address of the ensuing load/store is known.
Loads perform this check at compile-time.
Stores are emulated using the same trick as on ARM: non-atomic stores
are executed as atomic stores iff an atomic operation has previously
been executed on their target address. This for instance ensures
that a regular store cannot race with a cmpxchg.
This has very small overhead (negligible with OpenSSL's bntest in
user-only), and scales as native code.
- Barriers: not emulated yet. They're needed to correctly run non-trivial
lockless code (I'm using concurrencykit's testbenches).
The strongly-ordered-guest-on-weakly-ordered-host problem remains; my
guess is that we'll have to sacrifice single-threaded performance to
make it work (e.g. using pre-post ld/st helpers).
- 64-bit guest on 32-bit host: Not supported yet. Note that 64-bit
loads/stores on a 32-bit host are not atomic, yet 64-bit guest code might
have been written assuming that they are. Checks for this will be needed.
- Other ISAs: not done yet, but they should be like either ARM or x86.
- License of new files: is there a preferred license for new code?
I think it's GPL V2 or later.
Fred
- Please tolerate the lack of comments in the code and commit logs; when
preparing this RFC I thought it better to put all the info
here. If this weren't an RFC I'd have done it differently.
Thanks for reading this far, comments welcome!
Emilio