> On 18/12/2014 13:24, Alexander Graf wrote:
>> That's the nice thing about transactions - they guarantee that no other
>> CPU accesses the same cache line at the same time. So you're safe
>> against other vcpus even without blocking them manually.
>>
>> For the non-transactional implementation we probably would need an "IPI
>> others and halt them until we're done with the critical section"
>> approach. But I really wouldn't concentrate on making things fast on old
>> CPUs.
>
> The non-transactional implementation can use softmmu to trap access to
> the page from other VCPUs. This makes it possible to implement (at the
> cost of speed) the same semantics on all hosts.
>
> Paolo
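Just to check I've understood the softmmu suggestion, below is the sort of mechanism I have in mind, written as a stand-alone toy in plain C rather than as real QEMU code. Every name in it (ExclMonitor, load_exclusive32(), slow_store32() and so on) is invented for illustration and none of it corresponds to the actual memory API:

/*
 * Illustrative sketch only - not real QEMU code.  Names such as
 * ExclMonitor, load_exclusive32() and slow_store32() are invented for
 * this example; the real memory API looks quite different.
 *
 * The idea: every guest store from every vCPU is forced down a slow
 * path, so a plain store to a monitored address can break another
 * vCPU's reservation, just as it does on hardware.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t lock;
    bool armed;          /* is some vCPU holding a reservation?      */
    int owner;           /* which vCPU armed it                      */
    uintptr_t addr;      /* guest address covered by the reservation */
} ExclMonitor;

static ExclMonitor monitor = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* "ldrex": arm the monitor and do the load */
static uint32_t load_exclusive32(int cpu, uint32_t *p)
{
    pthread_mutex_lock(&monitor.lock);
    monitor.armed = true;
    monitor.owner = cpu;
    monitor.addr = (uintptr_t)p;
    uint32_t val = *p;
    pthread_mutex_unlock(&monitor.lock);
    return val;
}

/* "strex": only succeeds if our reservation is intact; returns 0 on success */
static int store_exclusive32(int cpu, uint32_t *p, uint32_t val)
{
    int fail = 1;
    pthread_mutex_lock(&monitor.lock);
    if (monitor.armed && monitor.owner == cpu && monitor.addr == (uintptr_t)p) {
        *p = val;
        fail = 0;
    }
    monitor.armed = false;          /* reservation is consumed either way */
    pthread_mutex_unlock(&monitor.lock);
    return fail;
}

/* plain store, forced down the slow path: may break someone's reservation */
static void slow_store32(int cpu, uint32_t *p, uint32_t val)
{
    pthread_mutex_lock(&monitor.lock);
    if (monitor.armed && monitor.addr == (uintptr_t)p && monitor.owner != cpu) {
        monitor.armed = false;      /* another vCPU wrote to the monitored address */
    }
    *p = val;
    pthread_mutex_unlock(&monitor.lock);
}

int main(void)
{
    uint32_t mem = 0;

    uint32_t v = load_exclusive32(/*cpu=*/0, &mem);   /* vCPU 0: ldrex     */
    slow_store32(/*cpu=*/1, &mem, 42);                /* vCPU 1: plain str */
    int failed = store_exclusive32(0, &mem, v + 1);   /* vCPU 0: strex     */

    printf("strex %s, mem=%u\n", failed ? "failed" : "succeeded", mem);
    return 0;
}

The point is exactly yours: because every store goes down the slow path, a plain store from another vCPU breaks the reservation and the strex fails, just as it would on hardware - at the cost of making all stores slow.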
I believe what you're describing, using transactional memory or using softmmu, amounts to either option 3 below or option 4. Relying on it totally was option 4. It seems to me that the problem with that option is that support for some hosts will be a pain, and covering everything will take some time :-(

Option 3 suggests that we build a ‘slow path’ mechanism first, make sure that works (as a backup), and then add optimisations for specific hosts/guests afterwards. To me that still seems preferable? To make option 4 a little more concrete, I've put a rough mprotect sketch at the bottom of this mail, after my signature.

Cheers

Mark.

> On 18 Dec 2014, at 13:24, Alexander Graf <ag...@suse.de> wrote:
>
>
>
> On 18.12.14 10:12, Mark Burton wrote:
>>
>>> On 17 Dec 2014, at 17:39, Peter Maydell <peter.mayd...@linaro.org> wrote:
>>>
>>> On 17 December 2014 at 16:29, Mark Burton <mark.bur...@greensocs.com> wrote:
>>>>> On 17 Dec 2014, at 17:27, Peter Maydell <peter.mayd...@linaro.org> wrote:
>>>>> I think a mutex is fine, personally -- I just don't want
>>>>> to see fifteen hand-hacked mutexes in the target-* code.
>>>>>
>>>>
>>>> Which would seem to favour the helper function approach?
>>>> Or am I missing something?
>>>
>>> You need at least some support from QEMU core -- consider
>>> what happens with this patch if the ldrex takes a data
>>> abort, for instance.
>>>
>>> And if you need the "stop all other CPUs while I do this"
>>
>> It looks like a corner case, but working this through: the ‘simple’ put-a-mutex-around-the-atomic-instructions approach would indeed need to ensure
>> that no other core was doing anything - that just happens to be true for
>> QEMU today (or we would have to put a mutex around all writes) - in order
>> to cover the case where a store exclusive could potentially fail if a
>> non-atomic instruction wrote (a different value) to the same address. This
>> is currently guaranteed by the implementation in QEMU - how useful it is I
>> don’t know, but if we break it, we run the risk that something will fail (at
>> the least, we could not claim to have kept things the same).
>>
>> This also has implications for the idea of adding TCG ops, I think...
>> The ideal scenario is that we could ‘fall back’ on the same semantics that
>> are there today - allowing specific target/host combinations to be optimised
>> (and to improve their functionality).
>> But that means that, from within the TCG op, we would need a mechanism
>> to cause the other TCG vCPUs to take an exit… etc. etc. In the end, I’m sure it’s
>> possible, but it feels so awkward.
>
> That's the nice thing about transactions - they guarantee that no other
> CPU accesses the same cache line at the same time. So you're safe
> against other vcpus even without blocking them manually.
>
> For the non-transactional implementation we probably would need an "IPI
> others and halt them until we're done with the critical section"
> approach. But I really wouldn't concentrate on making things fast on old
> CPUs.
>
> Also keep in mind that for the UP case we can always omit all the magic
> - we only need to detect when we move into an SMP case (linux-user clone
> or -smp on system).
>
>>
>> To recap where we are (for my own benefit if nobody else's):
>> we have several propositions in terms of implementing atomic instructions.
>>
>> 1/ We wrap the atomic instructions in a mutex using helper functions (this
>> is the approach others have taken; it’s simple, but it is not clean, as
>> stated above).
>
> This is horrible. Imagine you have this split approach with a load
> exclusive and then a store, where the load starts mutex usage and the
> store stops it.
> At that point, if the store creates a segfault, you'll be
> left with a dangling mutex.
>
> This stuff really belongs in the TCG core.
>
>>
>> 1.5/ We add a mechanism to ensure that when the mutex is taken, all other
>> cores are ‘stopped’.
>>
>> 2/ We add some TCG ops to effectively do the same thing, but this would give
>> us the benefit of being able to provide better implementations. This is
>> attractive, but we would end up needing ops to cover at least exclusive
>> load/store and atomic compare-exchange. To me this looks less than elegant
>> (being pulled close to the target, rather than being able to generalise),
>> but it’s not clear how we would implement the operations as we would like,
>> with a machine instruction, unless we did split them out along these lines.
>> This approach also (probably) requires the 1.5 mechanism above.
>
> I'm still in favor of just forcing the semantics of transactions onto
> this. If the host doesn't implement transactions, tough luck - do the
> "halt all others" IPI.
>
>>
>> 3/ We have discussed a ‘h/w’ approach to the problem. In this case, all
>> atomic instructions are forced to take the slow path, and additional
>> flags are added to the memory API. We then deal with the issue closer to the
>> memory, where we can record who has a lock on a memory address. For this to
>> work, we would also either
>> a) need to add an mprotect-type approach to ensure no ‘non-atomic’ writes
>> occur, or
>> b) need to force all cores to mark the page with the exclusive memory as IO
>> or similar, to ensure that all write accesses follow the slow path.
>>
>> 4/ There is an option to implement exclusive operations within the TCG using
>> mprotect (and signal handlers). I have some concerns about this: would we need
>> to have support for each host O/S?… I also think we might end up with a
>> lot of protected regions causing a lot of SIGSEGVs because an errant guest
>> doesn’t behave well - basically we will need to see the impact on
>> performance. Finally, this will be really painful to deal with for cases
>> where the exclusive memory is held in what QEMU considers IO space!!!
>> In other words, putting the mprotect inside TCG looks to me like it’s
>> mutually exclusive to supporting a memory-based scheme like (3).
>
> Again, I don't think it's worth caring about legacy host systems too
> much. In a few years from now transactional memory will be commodity,
> just like KVM is today.
>
>
> Alex
>
>> My personal preference is for 3b): it is “safe” - it’s where the hardware is.
>> 3a is an optimisation of that.
>> To me, (2) is an optimisation again. We are effectively saying: if you are
>> able to do this directly, then you don’t need to pass via the slow path.
>> Otherwise, you always have the option of reverting to the slow path.
>>
>> Frankly, 1 and 1.5 are hacks - they are not optimisations, they are just
>> dirty hacks. However, their saving grace is that they are hacks that exist
>> and “work”. I dislike patching the hack, but it did seem to offer the
>> fastest solution to get around this problem - at least for now. I am no
>> longer convinced.
>>
>> 4/ is something I’d like other people’s views on too… Is it a better
>> approach? What about the slow path?
>>
>> I increasingly begin to feel that we should really approach this from the
>> other end, and provide a ‘correct’ solution using the memory - then worry
>> about making that faster…
>>
>> Cheers
>>
>> Mark.
>>
>>> semantics linux-user currently uses then that definitely needs
>>> core code support. (Maybe linux-user is being over-zealous
>>> there; I haven't thought about it.)
>>>
>>> -- PMM

+44 (0)20 7100 3485 x 210
+33 (0)5 33 52 01 77 x 210
+33 (0)603762104
mark.burton
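P.S. Since I asked for views on option 4, here is roughly the shape I was imagining for the mprotect/signal-handler scheme, again as a stand-alone toy with invented names rather than anything resembling real QEMU code. It write-protects the whole host page backing the guest address on an exclusive load, and lets the SIGSEGV handler break the reservation and restore write access so the offending store can retry. Even the toy shows the pain points I mentioned: it is host-OS specific, it protects a whole page rather than a single address, and it has no answer for exclusives that land in what QEMU treats as IO space.

/*
 * Toy sketch of option 4 (mprotect + SIGSEGV) - not real QEMU code.
 * All names are invented.  A real implementation would be per guest
 * page, per host OS, and would still need a story for IO regions.
 */
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static uint8_t *page;                     /* host page backing the guest address */
static size_t page_size;
static volatile sig_atomic_t excl_valid;  /* is the reservation still intact?    */

static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    uint8_t *fault = (uint8_t *)si->si_addr;
    if (fault >= page && fault < page + page_size) {
        /* Some store hit the monitored page: break the reservation and
         * make the page writable again so the faulting store can retry
         * when the handler returns.  (mprotect() in a signal handler is
         * not strictly async-signal-safe - this is only a sketch.) */
        excl_valid = 0;
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
    } else {
        _exit(1);    /* a genuine, unexpected segfault */
    }
}

/* "ldrex": read the value, then write-protect the whole page */
static uint32_t load_exclusive(uint32_t *p)
{
    uint32_t val = *p;
    excl_valid = 1;
    mprotect(page, page_size, PROT_READ);
    return val;
}

/* "strex": returns 0 on success.  A real version would have to stop the
 * other vCPUs (or hold a lock) between re-enabling writes and storing. */
static int store_exclusive(uint32_t *p, uint32_t val)
{
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
    if (!excl_valid) {
        return 1;
    }
    *p = val;
    excl_valid = 0;
    return 0;
}

int main(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) {
        return 1;
    }

    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = segv_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    uint32_t *word = (uint32_t *)page;

    uint32_t v = load_exclusive(word);  /* vCPU 0: ldrex                        */
    word[1] = 0xdead;                   /* "another vCPU": plain store - traps, */
                                        /* breaks the reservation, then retries */
    int failed = store_exclusive(word, v + 1);   /* vCPU 0: strex now fails     */

    printf("strex %s\n", failed ? "failed" : "succeeded");
    return 0;
}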