On 24/09/2015 10:32, Alvise Rigo wrote:
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL), StoreConditional (SC) operation. Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.
Alvise, Emilio,

I have a doubt about your patches for ll/sc emulation that I hope you
can clarify.

From 10000ft, both approaches rely on checking a flag during stores.
This flag is split between the TLB and the CPUState for Alvise's
patches (in order to exploit the existing fast-path checks), and lives
entirely in the radix tree for Emilio's.  However, the idea is the
same.

Now, the patches are okay for serial emulation, but I am not sure it's
possible to do lock-free ll/sc emulation, because there is a race.  If
we check the flag before the store, the race is as follows:

   CPU0                         CPU1
   -------------------------------------------------------
   check flag
                                load locked: set flag
                                load value (normal load on CPU)
   store
                                store conditional (normal store on CPU)

where the sc doesn't fail.  For completeness, if we check it afterwards
(which would be possible with Emilio's approach, though not with the
TLB-based one):

   CPU0                         CPU1
   -------------------------------------------------------
                                load locked: set bit
                                load value (normal load on CPU)
   store
                                store conditional (normal store on CPU)
   check flag

and again the sc doesn't fail.

Most solutions I can think of are impractical:

- hardware ll/sc in CPU1.  x86 doesn't have it.

- hardware transactional memory in CPU0, checking the bit after the
  store and aborting the transaction (I think).  HTM just doesn't
  exist.

- some kind of store-in-progress (SIP) flag that ll can test and use
  to force failure of the corresponding sc.  For example, each CPU
  could store a (last_store_address, last_store_value) tuple.  If the
  value that LL loads disagrees with any CPU's tuple, the LL would
  direct the SC to fail.  A store would look like:

      store value to last_store_value
      smp_wmb()
      store address to last_store_address
      smp_mb()
      load TLB or radix tree

  The memory barrier orders the store to the SIP flag before the load
  from the TLB, and is probably too expensive. :(

- some array of atomic global generation counts, incremented
  unconditionally on every store and checked between ll and sc.
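For concreteness, here is how the SIP store sequence above could look
in C11 atomics (a rough sketch only: the SIPSlot type and all names
are invented, and smp_wmb()/smp_mb() are mapped onto C11 fences):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-CPU SIP tuple; type and field names are
 * illustrative, not from the patches. */
typedef struct {
    _Atomic uintptr_t last_store_address;
    _Atomic uint64_t  last_store_value;
} SIPSlot;

void sip_publish_and_store(SIPSlot *self, _Atomic uint64_t *mem,
                           uintptr_t addr, uint64_t val)
{
    /* store value to last_store_value */
    atomic_store_explicit(&self->last_store_value, val,
                          memory_order_relaxed);
    /* smp_wmb() */
    atomic_thread_fence(memory_order_release);
    /* store address to last_store_address */
    atomic_store_explicit(&self->last_store_address, addr,
                          memory_order_relaxed);
    /* smp_mb(): the full fence that orders publishing the SIP tuple
     * before the TLB/radix-tree lookup -- this store/load ordering is
     * the expensive part on every store. */
    atomic_thread_fence(memory_order_seq_cst);
    /* ... consult the TLB or radix tree here, then do the store ... */
    atomic_store_explicit(mem, val, memory_order_relaxed);
}
```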
  Cacheline bounce fiesta, hence extremely slow. :(

Tell me I'm wrong. :)

If I'm right, we can still keep the opcodes and implement them with a
simple cmpxchg.  It would provide a nice generic tool to implement
atomic operations, and it will work correctly if the target has ll/sc.
However, ll/sc-on-cmpxchg (e.g., ARM-on-x86) would be susceptible to
the ABA problem.

Paolo
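P.S. To make the ABA concern concrete, here is a rough C11 sketch of
ll/sc emulated with cmpxchg (all names invented; this is not the
patches' code).  An A->B->A sequence by another CPU between the LL and
the SC leaves the cmpxchg succeeding, whereas hardware ll/sc would
fail the SC:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state: the value observed by the last LL. */
typedef struct {
    uint32_t linked_val;
} CPULLState;

/* Emulated load-linked: remember the value we saw. */
uint32_t emu_ll(CPULLState *cpu, _Atomic uint32_t *addr)
{
    cpu->linked_val = atomic_load(addr);
    return cpu->linked_val;
}

/* Emulated store-conditional via cmpxchg: succeeds iff the location
 * still holds the value LL observed.  This is where ABA bites: if
 * another CPU stored B and then A again in between, the compare still
 * matches and the SC wrongly succeeds. */
bool emu_sc(CPULLState *cpu, _Atomic uint32_t *addr, uint32_t newval)
{
    uint32_t expected = cpu->linked_val;
    return atomic_compare_exchange_strong(addr, &expected, newval);
}
```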