Hi all,

These are patches I've been working on for some time now. Since emulation of atomic instructions has recently been getting attention ([1], [2]), I'm submitting them for comment.
[1] http://thread.gmane.org/gmane.comp.emulators.qemu/314406
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561

Main features of this design:

- Performance and scalability are the main design goals: guest code should scale as well as it would scale running natively on the host. For this, a host lock is (if necessary) assigned to each 16-byte aligned chunk of the physical address space. The assignment (i.e. lock allocation + registration) only happens after an atomic operation is performed on a particular physical address. To keep track of this sparse set of locks, a lockless radix tree is used, so lookups are fast and scalable.

- Translation helpers are employed to call the 'aie' module, which is the common code that accesses the radix tree, locking the appropriate entry depending on the access' physical address.

- No special host atomic instructions (e.g. cmpxchg16b) are required; mutexes and include/qemu/atomic.h are all that's needed.

- Usermode and full-system are supported with the same code. Note that the newly-added tiny_set module is necessary to properly emulate LL/SC, since the number of "cpus" (i.e. threads) is unbounded in usermode; for full-system mode a bitmap would have been sufficient.

- ARM: Stores concurrent with LL/SC primitives are initially not dealt with. This is a deliberate choice, since I'm assuming most sane code will only handle data atomically using LL/SC primitives. However, SWP can be used, so whenever a SWP instruction is issued, stores start checking that they do not clash with concurrent SWP instructions. This is implemented via pre/post-store helpers. I've stress-tested this with a heavily contended guest lock (64 cores), and it works fine. Executing non-trivial pre/post-store helpers adds a 5% perf overhead to Linux bootup, and is negligible on regular programs. Anyway, most sane code doesn't use SWP (Linux bootup certainly doesn't), so this overhead is rarely seen.
- x86: Instead of acquiring the same host lock every time LOCK is found, the acquisition of an AIE lock (via the radix tree) is done once the address of the ensuing load/store is known. Loads perform this check at compile-time. Stores are emulated using the same trick as in ARM: non-atomic stores are executed as atomic stores iff a prior atomic operation has been executed on their target address. This, for instance, ensures that a regular store cannot race with a cmpxchg. This has very small overhead (negligible with OpenSSL's bntest in user-only), and scales like native code.

- Barriers: not emulated yet. They're needed to correctly run non-trivial lockless code (I'm using concurrencykit's testbenches). The strongly-ordered-guest-on-weakly-ordered-host problem remains; my guess is that we'll have to sacrifice single-threaded performance to make it work (e.g. using pre/post ld/st helpers).

- 64-bit guest on 32-bit host: not supported yet. Note that 64-bit loads/stores on a 32-bit host are not atomic, yet 64-bit code might have been written assuming that they are. Checks for this will be needed.

- Other ISAs: not done yet, but they should be like either ARM or x86.

- License of new files: is there a preferred license for new code?

- Please tolerate the lack of comments in the code and commit logs; when preparing this RFC I thought it better to put all the info here. If this weren't an RFC I'd have done it differently.

Thanks for reading this far, comments welcome!

		Emilio