On 08/05/2015 23:02, Emilio G. Cota wrote:
Hi all,
These are patches I've been working on for some time now.
Since emulation of atomic instructions has recently been getting
attention ([1], [2]), I'm submitting them for comment.
[1] http://thread.gmane.org/gmane.comp.emulators.qemu/314406
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
Main features of this design:
- Performance and scalability are the main design goals: guest code should
scale as well as it would running natively on the host.
For this, a host lock is (if necessary) assigned to each 16-byte
aligned chunk of the physical address space.
The assignment (i.e. lock allocation + registration) only happens
after an atomic operation on a particular physical address
is performed. To keep track of this sparse set of locks,
a lockless radix tree is used, so lookups are fast and scalable.
- Translation helpers are employed to call the 'aie' module, which is
the common code that accesses the radix tree, locking the appropriate
entry depending on the access' physical address.
- No special host atomic instructions (e.g. cmpxchg16b) are required;
mutexes and include/qemu/atomic.h are all that's needed.
- Usermode and full-system are supported with the same code. Note that
the newly-added tiny_set module is necessary to properly emulate LL/SC,
since the number of "cpus" (i.e. threads) is unbounded in usermode;
for full-system mode a bitmap would have been sufficient.
- ARM: Stores concurrent with LL/SC primitives are initially not dealt
with.
This is my choice, since I'm assuming most sane code will only
handle data atomically using LL/SC primitives. However, SWP can
be used, so whenever a SWP instruction is issued, stores start checking
that they do not clash with concurrent SWP instructions. This is
implemented via pre/post-store helpers. I've stress-tested this with a
heavily contended guest lock (64 cores), and it works fine. Executing
non-trivial pre/post-store helpers adds a 5% perf overhead to Linux
bootup, and is negligible on regular programs. Anyway, most
sane code doesn't use SWP (Linux bootup certainly doesn't), so this
overhead is rarely seen.
- x86: Instead of acquiring the same host lock every time LOCK is found,
the acquisition of an AIE lock (via the radix tree) is done when the
address of the ensuing load/store is known.
Loads perform this check at compile-time.
Stores are emulated using the same trick as on ARM: non-atomic stores
are executed as atomic stores iff an atomic operation has previously
been executed on their target address. This for instance ensures
that a regular store cannot race with a cmpxchg.
This has very small overhead (negligible with OpenSSL's bntest in
user-only), and scales as native code.
- Barriers: not emulated yet. They're needed to correctly run non-trivial
lockless code (I'm using concurrencykit's testbenches).
The strongly-ordered-guest-on-weakly-ordered-host problem remains; my
guess is that we'll have to sacrifice single-threaded performance to
make it work (e.g. using pre-post ld/st helpers).
- 64-bit guest on 32-bit host: Not supported yet. Note that 64-bit
loads/stores on a 32-bit host are not atomic, yet 64-bit guest code might
have been written assuming that they are. Checks for this will be needed.
- Other ISAs: not done yet, but they should be like either ARM or x86.
- License of new files: is there a preferred license for new code?
I think it's GPL V2 or later.
Fred
- Please tolerate the lack of comments in the code and commit logs; when
preparing this RFC I thought it better to put all the info
here. If this weren't an RFC I'd have done it differently.
Thanks for reading this far, comments welcome!
Emilio