On 07.10.2015 16:52, Frederic Konrad wrote:
> Hi Claudio,
>
> I'll rebase soon tomorrow with a bit of luck ;).
>
> Thanks,
> Fred

a respectful ping on this one :-)

I am looking at http://git.greensocs.com/fkonrad/mttcg.git branch multi_tcg_v7_bugfixed,
is there something new?

Ciao,

Claudio

>
> On 07/10/2015 14:46, Claudio Fontana wrote:
>> Hello Frederic,
>>
>> On 11.08.2015 08:27, Frederic Konrad wrote:
>>> On 11/08/2015 08:15, Benjamin Herrenschmidt wrote:
>>>> On Mon, 2015-08-10 at 17:26 +0200, fred.kon...@greensocs.com wrote:
>>>>> From: KONRAD Frederic <fred.kon...@greensocs.com>
>>>>>
>>>>> This is the 7th round of the MTTCG patch series.
>>>>>
>>>>>
>>>>> It can be cloned from:
>>>>> g...@git.greensocs.com:fkonrad/mttcg.git branch multi_tcg_v7.
>> would it be possible to rebase on latest qemu? I wonder if mttcg is
>> diverging a bit too much from mainline,
>> which will make it more difficult to rebase later..(Or did I get confused
>> about all these repos?)
>>
>> Thank you!
>>
>> Claudio
>>
>>>>> This patch-set try to address the different issues in the global picture of
>>>>> MTTCG, presented on the wiki.
>>>>>
>>>>> == Needed patch for our work ==
>>>>>
>>>>> Some preliminaries are needed for our work:
>>>>> * current_cpu doesn't make sense in mttcg so a tcg_executing flag is added to
>>>>>   the CPUState.
>>>> Can't you just make it a TLS ?
>>> True that can be done as well. But the tcg_exec_flags has a second meaning saying
>>> "you can't start executing code right now because I want to do a safe_work".
>>>>> * We need to run some work safely when all VCPUs are outside their execution
>>>>>   loop. This is done with the async_run_safe_work_on_cpu function introduced
>>>>>   in this series.
>>>>> * QemuSpin lock is introduced (on posix only yet) to allow a faster handling of
>>>>>   atomic instruction.
>>>> How do you handle the memory model ?
>>>> IE , ARM and PPC are OO while x86
>>>> is (mostly) in order, so emulating ARM/PPC on x86 is fine but emulating
>>>> x86 on ARM or PPC will lead to problems unless you generate memory
>>>> barriers with every load/store ..
>>> For the moment we are trying to do the first case.
>>>> At least on POWER7 and later on PPC we have the possibility of setting
>>>> the attribute "Strong Access Ordering" with mremap/mprotect (I dont'
>>>> remember which one) which gives us x86-like memory semantics...
>>>>
>>>> I don't know if ARM supports something similar. On the other hand, when
>>>> emulating ARM on PPC or vice-versa, we can probably get away with no
>>>> barriers.
>>>>
>>>> Do you expose some kind of guest memory model info to the TCG backend so
>>>> it can decide how to handle these things ?
>>>>
>>>>> == Code generation and cache ==
>>>>>
>>>>> As Qemu stands, there is no protection at all against two threads attempting to
>>>>> generate code at the same time or modifying a TranslationBlock.
>>>>> The "protect TBContext with tb_lock" patch address the issue of code generation
>>>>> and makes all the tb_* function thread safe (except tb_flush).
>>>>> This raised the question of one or multiple caches. We choosed to use one
>>>>> unified cache because it's easier as a first step and since the structure of
>>>>> QEMU effectively has a ‘local’ cache per CPU in the form of the jump cache, we
>>>>> don't see the benefit of having two pools of tbs.
>>>>>
>>>>> == Dirty tracking ==
>>>>>
>>>>> Protecting the IOs:
>>>>> To allows all VCPUs threads to run at the same time we need to drop the
>>>>> global_mutex as soon as possible. The io access need to take the mutex. This is
>>>>> likely to change when
>>>>> http://thread.gmane.org/gmane.comp.emulators.qemu/345258
>>>>> will be upstreamed.
>>>>>
>>>>> Invalidation of TranslationBlocks:
>>>>> We can have all VCPUs running during an invalidation.
>>>>> Each VCPU is able to clean
>>>>> it's jump cache itself as it is in CPUState so that can be handled by a simple
>>>>> call to async_run_on_cpu. However tb_invalidate also writes to the
>>>>> TranslationBlock which is shared as we have only one pool.
>>>>> Hence this part of invalidate requires all VCPUs to exit before it can be done.
>>>>> Hence the async_run_safe_work_on_cpu is introduced to handle this case.
>>>> What about the host MMU emulation ? Is that multithreaded ? It has
>>>> potential issues when doing things like dirty bit updates into guest
>>>> memory, those need to be done atomically. Also TLB invalidations on ARM
>>>> and PPC are global, so they will need to invalidate the remote SW TLBs
>>>> as well.
>>>>
>>>> Do you have a mechanism to synchronize with another thread ? IE, make it
>>>> pop out of TCG if already in and prevent it from getting in ? That way
>>>> you can "remotely" invalidate its TLB...
>>> Yes that's what the safe_work is doing. Ask everybody to exit prevent VCPUs to
>>> resume (tcg_exec_flag) and do the work when everybody is outside cpu-exec.
>>>
>>>>> == Atomic instruction ==
>>>>>
>>>>> For now only ARM on x64 is supported by using an cmpxchg instruction.
>>>>> Specifically the limitation of this approach is that it is harder to support
>>>>> 64bit ARM on a host architecture that is multi-core, but only supports 32 bit
>>>>> cmpxchg (we believe this could be the case for some PPC cores).
>>>> Right, on the other hand 64-bit will do fine. But then x86 has 2-value
>>>> atomics nowadays, doesn't it ? And that will be hard to emulate on
>>>> anything. You might need to have some kind of global hashed lock list
>>>> used by atomics (hash the physical address) as a fallback if you don't
>>>> have a 1:1 match between host and guest capabilities.
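
For reference, here is a rough sketch in C of the "global hashed lock list"
fallback suggested above. Every name in it (atomic_locks, atomic_lock_for,
fallback_cmpxchg64, the table size of 64) is hypothetical and not taken from
the MTTCG series or from QEMU. It only shows the idea: hash the guest physical
address to pick a lock, then do the compare-and-swap under that lock.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch; none of these names exist in QEMU or MTTCG. */
#define ATOMIC_LOCK_COUNT 64u   /* power of two, so the index can be masked */

static pthread_mutex_t atomic_locks[ATOMIC_LOCK_COUNT];

/* Must be called once before any VCPU thread uses the fallback. */
static void atomic_locks_init(void)
{
    for (unsigned i = 0; i < ATOMIC_LOCK_COUNT; i++) {
        pthread_mutex_init(&atomic_locks[i], NULL);
    }
}

/*
 * Pick a lock from the guest physical address. The low 4 bits are dropped
 * so that every byte of one aligned 16-byte unit maps to the same lock.
 */
static pthread_mutex_t *atomic_lock_for(uint64_t guest_paddr)
{
    uint64_t h = (guest_paddr >> 4) * 0x9E3779B97F4A7C15ull;
    return &atomic_locks[(h >> 32) & (ATOMIC_LOCK_COUNT - 1)];
}

/*
 * Fallback 64-bit compare-and-swap for hosts without a native one.
 * Returns true if *host_ptr held 'expected' and was replaced by 'desired'.
 */
static bool fallback_cmpxchg64(uint64_t guest_paddr, uint64_t *host_ptr,
                               uint64_t expected, uint64_t desired)
{
    pthread_mutex_t *lock = atomic_lock_for(guest_paddr);
    bool swapped = false;

    pthread_mutex_lock(lock);
    if (*host_ptr == expected) {
        *host_ptr = desired;
        swapped = true;
    }
    pthread_mutex_unlock(lock);
    return swapped;
}
```

Note that this is only atomic with respect to other accesses that take the
same lock. A plain store from another VCPU to the same location bypasses it,
so a real implementation would need some way to route those through the slow
path as well.
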
>>> VOS did a "Slow path for atomic instruction translation" series you can find here:
>>> https://lists.gnu.org/archive/html/qemu-devel/2015-08/msg00971.html
>>>
>>> Which will be used in the end.
>>>
>>> Thanks,
>>> Fred
>>>> Cheers,
>>>> Ben.
>

--
Claudio Fontana
Server Virtualization Architect
Huawei Technologies Duesseldorf GmbH
Riesstraße 25 - 80992 München