I think we should use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated. We will add our stuff there too…
Cheers
Mark.

> On 15 Jun 2015, at 12:06, Alex Bennée <alex.ben...@linaro.org> wrote:
>
>
> Frederic Konrad <fred.kon...@greensocs.com> writes:
>
>> On 12/06/2015 18:37, Alex Bennée wrote:
>>> Hi,
>>
>> Hi Alex,
>>
>> I've completed some of the points below. We will also work on a design
>> decisions document to add to this one.
>>
>> We probably want to merge that with what we did on the wiki?
>> http://wiki.qemu.org/Features/tcg-multithread
>
> Well hopefully there is cross-over as I started with the wiki as a basis
> ;-)
>
> Do we want to just keep the wiki as the live design document or put
> pointers to the current drafts? I'm hoping eventually the page will just
> point to the design in the doc directory at git.qemu.org.
>
>>> One thing that Peter has been asking for is a design document for the
>>> way we are going to approach multi-threaded TCG emulation. I started
>>> with the information that was captured on the wiki and tried to build
>>> on that. It's almost certainly incomplete but I thought it would be
>>> worth posting for wider discussion early rather than later.
>>>
>>> One obvious omission at the moment is the lack of discussion about
>>> other non-TLB shared data structures in QEMU (I'm thinking of the
>>> various dirty page tracking bits, I'm sure there is more).
>>>
>>> I've also deliberately tried to avoid documenting the design decisions
>>> made in the current Greensocs patch series. This is so we can
>>> concentrate on the big picture before getting side-tracked into the
>>> implementation details.
>>>
>>> I have now started digging into the Greensocs code in earnest and the
>>> plan is that eventually the design and the implementation will
>>> converge on a final, documented, complete solution ;-)
>>>
>>> Anyway, as ever, I look forward to the comments and discussion:
>>>
>>> STATUS: DRAFTING
>>>
>>> Introduction
>>> ============
>>>
>>> This document outlines the design for multi-threaded TCG emulation.
>>> The original TCG implementation was single threaded and dealt with
>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>> lot of things but became increasingly limited as emulated systems
>>> gained additional cores and per-core performance gains for host
>>> systems started to level off.
>>>
>>> Memory Consistency
>>> ==================
>>>
>>> Between emulated guests and host systems there is a range of memory
>>> consistency models. While emulating weakly ordered systems on strongly
>>> ordered hosts shouldn't cause any problems, the same is not true for
>>> the reverse setup.
>>>
>>> The proposed design currently does not address the problem of
>>> emulating strong ordering on a weakly ordered host, although even on
>>> strongly ordered systems software should be using synchronisation
>>> primitives to ensure correct operation.
>>>
>>> Memory Barriers
>>> ---------------
>>>
>>> Barriers (sometimes known as fences) provide a mechanism for software
>>> to enforce a particular ordering of memory operations from the point
>>> of view of external observers (e.g. another processor core). They can
>>> apply to any memory operations as well as just loads or stores.
>>>
>>> The Linux kernel has an excellent write-up on the various forms of
>>> memory barrier and the guarantees they can provide [1].
>>>
>>> Barriers are often wrapped around synchronisation primitives to
>>> provide explicit memory ordering semantics. However they can be used
>>> by themselves to provide safe lockless access by ensuring, for
>>> example, that a signal flag will always be set after a payload.
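>>>
>>> As a purely illustrative sketch (not proposed code, just the pattern
>>> in C11 atomics with placeholder names), the flag-after-payload idiom
>>> pairs a release store with an acquire load:
>>>
>>>     #include <stdatomic.h>
>>>     #include <stdbool.h>
>>>
>>>     static int payload;
>>>     static atomic_bool flag;
>>>
>>>     /* Producer: write the payload first, then publish the flag. The
>>>      * release store is the barrier that stops the payload write
>>>      * being reordered after the flag update. */
>>>     static void producer(void)
>>>     {
>>>         payload = 42;
>>>         atomic_store_explicit(&flag, true, memory_order_release);
>>>     }
>>>
>>>     /* Consumer: the acquire load pairs with the release store, so if
>>>      * the flag is seen as set the payload is guaranteed visible. */
>>>     static void consumer(void)
>>>     {
>>>         if (atomic_load_explicit(&flag, memory_order_acquire)) {
>>>             (void)payload;  /* safe to read here */
>>>         }
>>>     }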
>>>
>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>
>>> This would enforce a strong load/store ordering so all loads/stores
>>> complete at the memory barrier. On single-core non-SMP strongly
>>> ordered backends this could become a NOP.
>>>
>>> There may be a case for further refinement if this causes performance
>>> bottlenecks.
>>>
>>> Memory Control and Maintenance
>>> ------------------------------
>>>
>>> This includes a class of instructions for controlling system cache
>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>> are often seen when code modification has taken place to ensure the
>>> changes take effect.
>>>
>>> Synchronisation Primitives
>>> --------------------------
>>>
>>> There are two broad types of synchronisation primitives found in
>>> modern ISAs: atomic instructions and exclusive regions.
>>>
>>> The first type offers a simple atomic instruction which guarantees
>>> that some sort of test and conditional store will be truly atomic
>>> w.r.t. other cores sharing access to the memory. The classic example
>>> is the x86 cmpxchg instruction.
>>>
>>> The second type offers a pair of load/store instructions which
>>> guarantee that a region of memory has not been touched between the
>>> load and store instructions. An example of this is ARM's ldrex/strex
>>> pair, where the strex instruction will return a flag indicating a
>>> successful store only if no other CPU has accessed the memory region
>>> since the ldrex.
>>>
>>> Traditionally TCG has generated a series of operations that work
>>> because they are within the context of a single translation block so
>>> will have completed before another CPU is scheduled. However with
>>> the ability to have multiple threads running to emulate multiple CPUs
>>> we will need to explicitly expose these semantics.
>>>
>>> DESIGN REQUIREMENTS:
>>>  - atomics
>>>    - Introduce some atomic TCG ops for the common semantics
>>>    - The default fallback helper function will use qemu_atomics
>>>      (see the illustrative sketch below)
>>>    - Each backend can then add a more efficient implementation
>>>  - load/store exclusive
>>>    [AJB:
>>>     There are currently a number of proposals of interest:
>>>      - Greensocs tweaks to ldst ex (using locks)
>>>      - Slow-path for atomic instruction translation [2]
>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>    ]
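>>>
>>> As a purely illustrative sketch of the fallback idea (the helper name
>>> and signature here are placeholders, not a proposed API), a guest
>>> 32-bit cmpxchg could be backed by a host atomic compare-and-swap:
>>>
>>>     #include <stdbool.h>
>>>     #include <stdint.h>
>>>
>>>     /* Hypothetical fallback helper: emulate a guest cmpxchg by doing
>>>      * the compare-and-swap atomically on the host. A real helper
>>>      * would resolve the guest address through the TLB first. */
>>>     static uint32_t helper_atomic_cmpxchg32(uint32_t *haddr,
>>>                                             uint32_t cmp,
>>>                                             uint32_t newval)
>>>     {
>>>         uint32_t expected = cmp;
>>>
>>>         /* On success *haddr becomes newval; either way 'expected' is
>>>          * left holding the value actually found in memory. */
>>>         __atomic_compare_exchange_n(haddr, &expected, newval, false,
>>>                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
>>>         return expected;
>>>     }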
>>>
>>>
>>> Shared Data Structures
>>> ======================
>>>
>>> Global TCG State
>>> ----------------
>>>
>>> We need to protect the entire code generation cycle including any
>>> post-generation patching of the translated code. This also implies a
>>> shared translation buffer which contains code running on all cores.
>>> Any execution path that comes to the main run loop will need to hold
>>> a mutex for code generation. This also includes times when we need to
>>> flush code or jumps from the tb_cache.
>>>
>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>> and jump cache modification
>>
>> Actually from my point of view jump cache modification requires more
>> than a lock, as another VCPU thread can be executing code during the
>> modification.
>>
>> Fortunately this happens "only" with tlb_flush, tlb_page_flush,
>> tb_flush and tb_invalidate, which need all CPUs to be halted anyway.
>
> How about:
>
> DESIGN REQUIREMENT:
>  - Code generation and patching will be protected by a lock
>  - Jump cache modification will assert all CPUs are halted
>
>>>
>>> Memory maps and TLBs
>>> --------------------
>>>
>>> The memory handling code is fairly critical to the speed of memory
>>> access in the emulated system.
>>>
>>>  - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>  - Dirty page tracking (for code gen, migration and display)
>>>  - Virtual TLB (for translating guest address->real address)
>>>
>>> There is both a fast path walked by the generated code and a slow
>>> path when resolution is required. When the TLB tables are updated we
>>> need to ensure they are done in a safe way by bringing all executing
>>> threads to a halt before making the modifications.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>  - TLB Flush All/Page
>>>    - can be across-CPUs
>>>    - will need all other CPUs brought to a halt
>>>  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>    - This is a per-CPU table - by definition can't race
>>>    - updated by its own thread when the slow-path is forced
>>
>> Actually, as we have approximately the same behaviour for all of these
>> memory handling operations, e.g. tb_flush, tb_*_invalidate and
>> tlb_*_flush, which all manipulate the TranslationBlock and jump cache
>> across CPUs, I think we have to add a generic "exit and do something"
>> mechanism for the CPU threads.
>> So every VCPU thread has a list of things to do when it exits (such as
>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for other
>> CPUs and flushing only one entry for tb_invalidate).
>
> Sounds like I should write an additional section to describe the process
> of halting CPUs and carrying out deferred per-CPU actions as well as
> ensuring we can tell when they are all halted.
>
>>> Emulated hardware state
>>> -----------------------
>>>
>>> Currently the hardware emulation has no protection against multiple
>>> accesses. However guest systems accessing emulated hardware should be
>>> carrying out their own locking to prevent multiple CPUs confusing the
>>> hardware. Of course there is no guarantee that there couldn't be a
>>> broken guest that doesn't lock, so you could get racing accesses to
>>> the hardware.
>>>
>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>> a purely MMIO mode, often setting flags directly in guest memory as a
>>> result of a guest-triggered transaction.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>  - Access to IO Memory should be serialised by an IOMem mutex
>>>  - The mutex should be recursive (e.g. allowing pid to relock itself)
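>>>
>>> A purely illustrative sketch of what such a recursive IOMem mutex
>>> could look like (using plain pthreads here; the name is a placeholder
>>> and this is not the mechanism the patch series implements):
>>>
>>>     #include <pthread.h>
>>>
>>>     static pthread_mutex_t iomem_mutex;
>>>
>>>     /* Initialise the lock as recursive so the same thread can safely
>>>      * re-take it, e.g. when one MMIO access triggers another. */
>>>     static void iomem_mutex_init(void)
>>>     {
>>>         pthread_mutexattr_t attr;
>>>
>>>         pthread_mutexattr_init(&attr);
>>>         pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>>>         pthread_mutex_init(&iomem_mutex, &attr);
>>>         pthread_mutexattr_destroy(&attr);
>>>     }
>>>
>>>     /* Every emulated IO access would then be wrapped as:
>>>      *   pthread_mutex_lock(&iomem_mutex);
>>>      *   ... dispatch to the device model ...
>>>      *   pthread_mutex_unlock(&iomem_mutex);
>>>      */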
>>
>> That might be done with the global mutex as it is today?
>> We need changes here anyway to have VCPU threads running in parallel.
>
> I'm not sure re-using the global mutex is a good idea. I've had to hack
> the global mutex to allow recursive locking to get around the virtio
> hang I discovered last week. While it works I'm uneasy making such a
> radical change upstream given how widely the global mutex is used, hence
> the suggestion to have an explicit IOMem mutex.
>
> Actually I'm surprised the iothread mutex just re-uses the global one.
> I guess I need to talk to the IO guys as to why they took that
> decision.
>
>>
>> Thanks,
>
> Thanks for your quick review :-)
>
>> Fred
>>
>>> IO Subsystem
>>> ------------
>>>
>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>> be no additional locking required once we reach the Block Driver.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>  - The dataplane should continue to be protected by the iothread locks
>>>
>>>
>>> References
>>> ==========
>>>
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>
> --
> Alex Bennée

+44 (0)20 7100 3485 x 210
+33 (0)5 33 52 01 77x 210
+33 (0)603762104
mark.burton