Mark Burton <mark.bur...@greensocs.com> writes:

> I think we SHOULD use the wiki - and keep it current. A lot of what you
> have is in the wiki too, but I'd like to see the wiki updated.
> We will add our stuff there too…
I'll do a pass today and update it to point to lists, discussions and WIP
trees.

> Cheers
> Mark.
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <alex.ben...@linaro.org> wrote:
>>
>> Frederic Konrad <fred.kon...@greensocs.com> writes:
>>
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>>
>>> Hi Alex,
>>>
>>> I've completed some of the points below. We will also work on a design
>>> decisions document to add to this one.
>>>
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>>
>> Well hopefully there is cross-over as I started with the wiki as a basis
>> ;-)
>>
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>>
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build
>>>> on that. It's almost certainly incomplete but I thought it would be
>>>> worth posting for wider discussion early rather than later.
>>>>
>>>> One obvious omission at the moment is the lack of discussion about
>>>> other non-TLB shared data structures in QEMU (I'm thinking of the
>>>> various dirty page tracking bits; I'm sure there are more).
>>>>
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current GreenSocs patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>>
>>>> I have now started digging into the GreenSocs code in earnest and the
>>>> plan is that eventually the design and the implementation will
>>>> converge on a final documented complete solution ;-)
>>>>
>>>> Anyway, as ever, I look forward to the comments and discussion:
>>>>
>>>> STATUS: DRAFTING
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single-threaded and dealt with
>>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as the systems being
>>>> emulated gained additional cores and per-core performance gains for
>>>> host systems started to level off.
>>>>
>>>> Memory Consistency
>>>> ==================
>>>>
>>>> Between emulated guests and host systems there is a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems, the same is not true for
>>>> the reverse setup.
>>>>
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host, although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>>
>>>> Memory Barriers
>>>> ---------------
>>>>
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to any memory operations as well as just loads or stores.
>>>>
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
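As an aside, the kind of guarantee [1] describes can be pinned down with a
small example. Here is a minimal sketch of the classic flag/payload pattern,
written with plain C11 atomics rather than QEMU's own atomic helpers; the
names and values are purely illustrative:

  /* Producer/consumer flag-and-payload pattern: the payload must become
   * visible to other cores before the flag, and the consumer must not
   * read the payload until it has observed the flag. */
  #include <stdatomic.h>
  #include <stdbool.h>

  static int payload;
  static atomic_bool flag;

  static void producer(void)
  {
      payload = 42;   /* plain store */
      /* release ordering: the payload store cannot be reordered after
       * this flag store */
      atomic_store_explicit(&flag, true, memory_order_release);
  }

  static bool consumer(int *out)
  {
      /* acquire ordering: the payload load cannot be reordered before
       * this flag load */
      if (atomic_load_explicit(&flag, memory_order_acquire)) {
          *out = payload;   /* guaranteed to observe 42 */
          return true;
      }
      return false;
  }

A tcg_memory_barrier op would have to give the generated code the same kind
of guarantee regardless of which host we are running on.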
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However they can be used
>>>> by themselves to provide safe lockless access by ensuring, for
>>>> example, that a signal flag will always be set after its payload.
>>>>
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>>
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core non-SMP strongly
>>>> ordered backends this could become a NOP.
>>>>
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
>>>>
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>>
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>>> are often seen when code modification has taken place, to ensure the
>>>> changes take effect.
>>>>
>>>> Synchronisation Primitives
>>>> --------------------------
>>>>
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>>
>>>> The first type offers a simple atomic instruction which guarantees
>>>> that some sort of test-and-conditional-store will be truly atomic
>>>> w.r.t. other cores sharing access to the memory. The classic example
>>>> is the x86 cmpxchg instruction.
>>>>
>>>> The second type offers a pair of load/store instructions which
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair, where the strex instruction will return a flag indicating a
>>>> successful store only if no other CPU has accessed the memory region
>>>> since the ldrex.
>>>>
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block, so
>>>> they will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>   - atomics
>>>>     - Introduce some atomic TCG ops for the common semantics
>>>>     - The default fallback helper function will use qemu_atomics
>>>>     - Each backend can then add a more efficient implementation
>>>>   - load/store exclusive
>>>>     [AJB:
>>>>      There are currently a number of proposals of interest:
>>>>      - GreenSocs tweaks to ldst ex (using locks)
>>>>      - Slow-path for atomic instruction translation [2]
>>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>>     ]
>>>>
>>>> Shared Data Structures
>>>> ======================
>>>>
>>>> Global TCG State
>>>> ----------------
>>>>
>>>> We need to protect the entire code generation cycle, including any
>>>> post-generation patching of the translated code. This also implies a
>>>> shared translation buffer which contains code running on all cores.
>>>> Any execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to
>>>> flush code or jumps from the tb_cache.
>>>>
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>>
>>> Actually from my point of view jump cache modification requires more
>>> than a lock, as other VCPU threads can be executing code during the
>>> modification.
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush,
>>> tb_flush and tb_invalidate, which need all CPUs to be halted anyway.
>>
>> How about:
>>
>> DESIGN REQUIREMENT:
>>   - Code generation and patching will be protected by a lock
>>   - Jump cache modification will assert all CPUs are halted
>>
>>>> Memory maps and TLBs
>>>> --------------------
>>>>
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>>
>>>> - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>> - Dirty page tracking (for code gen, migration and display)
>>>> - Virtual TLB (for translating guest address -> real address)
>>>>
>>>> There is both a fast path walked by the generated code and a slow
>>>> path when resolution is required. When the TLB tables are updated we
>>>> need to ensure the updates are done in a safe way, by bringing all
>>>> executing threads to a halt before making the modifications.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - TLB Flush All/Page
>>>>   - can be across-CPUs
>>>>   - will need all other CPUs brought to a halt
>>>> - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>>   - This is a per-CPU table - by definition it can't race
>>>>   - updated by its own thread when the slow-path is forced
>>>
>>> Actually, as we have approximately the same behaviour for all of these
>>> memory handling operations, e.g. tb_flush, tb_*_invalidate and
>>> tlb_*_flush, which all play with the TranslationBlock and the jump
>>> cache across CPUs, I think we have to add a generic "exit and do
>>> something" mechanism for the CPU threads.
>>> So every VCPU thread has a list of things to do when it exits (such as
>>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for the
>>> other CPUs and flushing only one entry for tb_invalidate).
>>
>> Sounds like I should write an additional section to describe the
>> process of halting CPUs and carrying out deferred per-CPU actions, as
>> well as ensuring we can tell when they are all halted (there is a rough
>> sketch of one possible mechanism further down in this mail).
>>
>>>> Emulated hardware state
>>>> -----------------------
>>>>
>>>> Currently the hardware emulation has no protection against multiple
>>>> accesses. However, guest systems accessing emulated hardware should be
>>>> carrying out their own locking to prevent multiple CPUs confusing the
>>>> hardware. Of course there is no guarantee that there couldn't be a
>>>> broken guest that doesn't lock, so you could get racing accesses to
>>>> the hardware.
>>>>
>>>> There is also the class of paravirtualized hardware (VIRTIO) that
>>>> works in a purely MMIO mode, often setting flags directly in guest
>>>> memory as a result of a guest-triggered transaction.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - Access to IO Memory should be serialised by an IOMem mutex
>>>> - The mutex should be recursive (e.g. allowing the same thread to
>>>>   relock it)
>>>
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>>
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works, I'm uneasy about making
>> such a radical change upstream given how widely the global mutex is
>> used, hence the suggestion to have an explicit IOMem mutex.
>>
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
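On the recursive mutex point: it doesn't need anything exotic. Here is a
minimal sketch of an explicit IOMem lock using plain pthreads; the iomem_*
names are hypothetical and a real patch would presumably go through QEMU's
own qemu-thread wrappers instead:

  #include <pthread.h>

  static pthread_mutex_t iomem_lock;

  static void iomem_lock_init(void)
  {
      pthread_mutexattr_t attr;

      pthread_mutexattr_init(&attr);
      /* A recursive mutex tracks an owner and a depth count, so a thread
       * that already holds it (e.g. an MMIO access issued from within
       * another MMIO handler) can take it again without deadlocking. */
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&iomem_lock, &attr);
      pthread_mutexattr_destroy(&attr);
  }

  static void iomem_region_lock(void)
  {
      pthread_mutex_lock(&iomem_lock);
  }

  static void iomem_region_unlock(void)
  {
      pthread_mutex_unlock(&iomem_lock);
  }

That would keep the recursion confined to the IO path rather than changing
the semantics of the global mutex for everyone.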
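And going back to the "exit and do something" mechanism Fred describes
above, here is a rough sketch of the shape a deferred per-CPU work list
could take. All of the names (DeferredWork, CPUWorkState, cpu_queue_work,
cpu_drain_work) are hypothetical rather than existing QEMU APIs; it is
only meant to pin the idea down for the design document:

  #include <pthread.h>
  #include <stdlib.h>

  typedef struct DeferredWork {
      void (*func)(void *opaque);  /* e.g. flush this CPU's tb_jmp_cache */
      void *opaque;
      struct DeferredWork *next;
  } DeferredWork;

  typedef struct CPUWorkState {
      pthread_mutex_t lock;
      DeferredWork *queue;
      int exit_request;            /* checked by the execution loop */
  } CPUWorkState;

  /* Called by another thread, e.g. the one initiating a cross-CPU
   * tlb_flush. A real implementation would also kick the target vCPU
   * out of the translated code so the request is seen promptly. */
  static void cpu_queue_work(CPUWorkState *cpu,
                             void (*func)(void *), void *opaque)
  {
      DeferredWork *w = malloc(sizeof(*w));

      w->func = func;
      w->opaque = opaque;
      pthread_mutex_lock(&cpu->lock);
      w->next = cpu->queue;
      cpu->queue = w;
      cpu->exit_request = 1;
      pthread_mutex_unlock(&cpu->lock);
  }

  /* Called by the vCPU thread itself once it is back in the outer loop. */
  static void cpu_drain_work(CPUWorkState *cpu)
  {
      pthread_mutex_lock(&cpu->lock);
      while (cpu->queue) {
          DeferredWork *w = cpu->queue;
          cpu->queue = w->next;
          w->func(w->opaque);
          free(w);
      }
      cpu->exit_request = 0;
      pthread_mutex_unlock(&cpu->lock);
  }

Whoever initiates a cross-CPU flush would queue the per-CPU work on every
vCPU and then wait until all of them have drained their queues before
touching the shared structures.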
>>
>>> Thanks,
>>
>> Thanks for your quick review :-)
>>
>>> Fred
>>>
>>>> IO Subsystem
>>>> ------------
>>>>
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - The dataplane should continue to be protected by the iothread locks
>>>>
>>>> References
>>>> ==========
>>>>
>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>
>> --
>> Alex Bennée

--
Alex Bennée