Hi all,

Here is the MTTCG code I've been working on out-of-tree for the last
few months.
The patchset applies on top of pbonzini's mttcg branch, commit
ca56de6f. Fetch the branch from:
  https://github.com/bonzini/qemu/commits/mttcg

The highlights of the patchset are as follows:

- The first 5 patches are direct fixes to bugs that exist only in the
  mttcg branch.

- Patches 6-12 fix issues in the master branch.

- The remaining patches are really the meat of this patchset. The main
  features are:

  * Support of MTTCG for both user and system mode.

  * Design: per-CPU TB jump list protected by a seqlock. If the TB is
    not found there, check the global, RCU-protected 'hash table'
    (i.e. fixed number of buckets); if it is not there either, grab the
    lock, check again, and only then generate the code and add the TB
    to the hash table (see the lookup sketch after this list). It makes
    sense that Paolo's recent work on the mttcg branch ended up being
    almost identical to this--it's simple and it scales well.

  * tb_lock must be held every time code is generated. The rationale is
    that most of the time QEMU is executing code, not generating it.

  * tb_flush: do it once all other CPUs have been put to sleep by
    calling synchronize_rcu(). We also instrument tb_lock to make sure
    that only one tb_flush request can happen at a given time. For this
    a mechanism to schedule work is added to supersede
    cpu_sched_safe_work, which cannot work in usermode. Here I've toyed
    with an alternative version that doesn't force the flushing CPU to
    exit, but in order to make this work we have to save/restore the
    RCU read lock while tb_lock is held in order to avoid deadlocks.
    This isn't too pretty, but it's good to know that the option is
    there.

  * I focused on x86 since it is a complex ISA and we support many
    cores via -smp. I work on a 64-core machine, so concurrency bugs
    show up relatively easily. Atomics are modeled using spinlocks,
    i.e. one host lock per guest cache line (see the locking sketch
    after this list). Note that spinlocks are way better than mutexes
    for this--throughput on 64 cores is 2X with spinlocks on highly
    concurrent workloads (synchrobench, see below).

    Advantages:

    + Scalability. Unrelated atomics (e.g. atomics on the same page but
      on different cache lines) cannot interfere with each other. Of
      course if the guest code has false sharing (i.e. atomics on the
      same cache line), then there's not much the host can do about
      that. This is an improved version of what I sent in May:
        https://lists.gnu.org/archive/html/qemu-devel/2015-05/msg01641.html
      Performance numbers are below.

    + No requirements on the capabilities of the host machine, e.g. no
      need for a host cmpxchg instruction. That is, we'd have no
      problem running x86 code on a weaker host (say ARM/PPC), although
      of course we'd have to sprinkle quite a few memory barriers. Note
      that the current MTTCG relies on cmpxchg(), which would be
      insufficient to run x86 code on ARM/PPC since that cmpxchg could
      very well race with a regular store (whereas on x86 it cannot).

    + Works unchanged for both system and user mode. As far as I can
      tell, the TLB-based approach that Alvise is working on couldn't
      be used without the TLB--correct me if I'm wrong, it's been quite
      some time since I looked at that work.

    Disadvantages:

    - Overhead is added to every guest store. Depending on how frequent
      stores are, this can end up being significant single-threaded
      overhead (I've measured from a few % up to ~50%). Note that this
      overhead applies to strong memory models such as x86's, where
      regular stores may race with atomic instructions and must remain
      coherent with them; weaker memory models such as ARM/PPC's
      wouldn't have this overhead.
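To make the lookup flow above concrete, here is a rough sketch of the
fast path. This is purely illustrative: the helper names and signatures
(tb_jmp_cache_lookup, tb_htable_lookup, tb_jmp_cache_insert,
tb_gen_code_sketch, the tb_cache_lock field) are placeholders made up
for this email and do not match the actual code in the patches.

/* Sketch only -- placeholder names/signatures, not the patch code. */
static TranslationBlock *tb_find_sketch(CPUState *cpu, target_ulong pc)
{
    TranslationBlock *tb;
    unsigned start;

    /* 1. Per-CPU jump cache, protected by a seqlock: lockless readers
     *    simply retry if they raced with a concurrent update. */
    do {
        start = seqlock_read_begin(&cpu->tb_cache_lock);
        tb = tb_jmp_cache_lookup(cpu, pc);
    } while (seqlock_read_retry(&cpu->tb_cache_lock, start));
    if (tb) {
        return tb;
    }

    /* 2. Global, RCU-protected hash table with a fixed number of
     *    buckets: still no lock taken. */
    rcu_read_lock();
    tb = tb_htable_lookup(pc);
    rcu_read_unlock();

    /* 3. Slow path: take tb_lock, re-check (another vCPU may have
     *    translated this pc meanwhile), and only then generate code
     *    and insert the new TB into the hash table. */
    if (!tb) {
        tb_lock();
        tb = tb_htable_lookup(pc);
        if (!tb) {
            tb = tb_gen_code_sketch(cpu, pc);
        }
        tb_unlock();
    }

    /* Publish the result into the per-CPU cache on the seqlock's
     * write side. */
    seqlock_write_lock(&cpu->tb_cache_lock);
    tb_jmp_cache_insert(cpu, pc, tb);
    seqlock_write_unlock(&cpu->tb_cache_lock);
    return tb;
}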
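The per-cache-line locking for atomics can be pictured with the
self-contained sketch below. Everything here (names, lock-array size,
hashing the cache line into a fixed array of spinlocks, so that
distinct lines may occasionally share a lock) is made up for
illustration and is not the code in the patches:

#include <stdatomic.h>
#include <stdint.h>

#define CACHE_LINE_SHIFT  6                 /* 64-byte cache lines */
#define N_LINE_LOCKS      (1 << 12)         /* fixed array of locks */

static _Atomic int line_locks[N_LINE_LOCKS];

/* Map a guest physical address to the spinlock guarding its cache
 * line; all accesses to the same guest line pick the same lock. */
static inline _Atomic int *line_lock_for(uint64_t guest_paddr)
{
    uint64_t line = guest_paddr >> CACHE_LINE_SHIFT;
    return &line_locks[line & (N_LINE_LOCKS - 1)];
}

static inline void guest_line_lock(uint64_t guest_paddr)
{
    _Atomic int *lock = line_lock_for(guest_paddr);
    /* Spin until we atomically flip the lock from 0 to 1. */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire)) {
        /* busy-wait; a real implementation would relax the CPU here */
    }
}

static inline void guest_line_unlock(uint64_t guest_paddr)
{
    atomic_store_explicit(line_lock_for(guest_paddr), 0,
                          memory_order_release);
}

/*
 * Every guest store and every guest atomic then brackets the actual
 * memory access:
 *
 *     guest_line_lock(paddr);
 *     ... perform the store / read-modify-write ...
 *     guest_line_unlock(paddr);
 *
 * which is what lets plain stores race safely with atomics (as x86
 * allows), at the cost of the per-store overhead discussed above.
 */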
* Performance

I've used four C/C++ benchmarks from synchrobench:
  https://github.com/gramoli/synchrobench
I'm running them with these arguments:
  -u 0 -f 1 -d 10000 -t $n_threads

Here are two comparisons:

* usermode vs. native
  http://imgur.com/RggzgyU
* qemu-system vs. qemu-KVM
  http://imgur.com/H9iH06B
  (full-system is run with -m 4096)

Throughput is normalised for each of the four configurations over its
throughput with 1 thread.

For the single-thread performance overhead of instrumenting writes I
used two apps from PARSEC, both with the 'large' input. [Note that for
the multithreaded tests I did not use PARSEC; it doesn't scale at all
on large systems.]

blackscholes, 1 thread, stores are ~8% of executed instructions:
  pbonzini/mttcg+Patches1-5: 62.922099012 seconds ( +- 0.05% )
  +entire patchset:          67.680987626 seconds ( +- 0.35% )
That's about an 8% perf overhead.

swaptions, 1 thread, stores are ~7% of executed instructions:
  pbonzini/mttcg+Patches1-5: 144.542495834 seconds ( +- 0.49% )
  +entire patchset:          157.673401200 seconds ( +- 0.25% )
That's about a 9% perf overhead.

All tests use taskset to pack threads into CPUs in the same NUMA node
where possible. All tests are run on a 64-core (4x16) AMD Opteron 6376
with turbo core disabled.

* Known Issues

- In system mode, when run with a high number of threads, segfaults on
  translated code happen every now and then. Is there anything useful
  I can do with the segfaulting address? For example:

    (gdb) bt
    #0  0x00007fbf8013d89f in ?? ()
    #1  0x0000000000000000 in ?? ()

  Also, are there any things that should be protected by tb_lock but
  aren't? The only potential issue I've thought of so far is direct
  jumps racing with tb_phys_invalidate, but I need to analyze this in
  more detail.

* Future work

- Run on a PowerPC host to see how bad the barrier sprinkling has to
  be. I have access to a host, so I should get to this in the next few
  days. However, ppc-usermode doesn't work with multiple threads--help
  would be appreciated, see this thread:
    http://lists.gnu.org/archive/html/qemu-ppc/2015-06/msg00164.html

- Support more ISAs. I have done ARM, SPARC and PPC, but haven't tested
  them much, so I'm keeping them out of this patchset.

Thanks,

		Emilio