I think we should use the wiki - and keep it current. A lot of what you have is in the wiki too, but I’d like to see the wiki updated. We will add our stuff there too…
Cheers
Mark.

> On 15 Jun 2015, at 12:06, Alex Bennée <alex.ben...@linaro.org> wrote:
>
>
> Frederic Konrad <fred.kon...@greensocs.com> writes:
>
>> On 12/06/2015 18:37, Alex Bennée wrote:
>>> Hi,
>>
>> Hi Alex,
>>
>> I've completed some of the points below. We will also work on a design
>> decisions document to add to this one.
>>
>> We probably want to merge that with what we did on the wiki?
>> http://wiki.qemu.org/Features/tcg-multithread
>
> Well hopefully there is cross-over as I started with the wiki as a basis
> ;-)
>
> Do we want to just keep the wiki as the live design document or put
> pointers to the current drafts? I'm hoping eventually the page will just
> point to the design in the doc directory at git.qemu.org.
>
>>> One thing that Peter has been asking for is a design document for the
>>> way we are going to approach multi-threaded TCG emulation. I started
>>> with the information that was captured on the wiki and tried to build
>>> on that. It's almost certainly incomplete but I thought it would be
>>> worth posting for wider discussion early rather than later.
>>>
>>> One obvious omission at the moment is the lack of discussion about
>>> other non-TLB shared data structures in QEMU (I'm thinking of the
>>> various dirty page tracking bits, I'm sure there is more).
>>>
>>> I've also deliberately tried to avoid documenting the design decisions
>>> made in the current Greensocs patch series. This is so we can
>>> concentrate on the big picture before getting side-tracked into the
>>> implementation details.
>>>
>>> I have now started digging into the Greensocs code in earnest and the
>>> plan is that eventually the design and the implementation will
>>> converge on a final, documented, complete solution ;-)
>>>
>>> Anyway, as ever, I look forward to the comments and discussion:
>>>
>>> STATUS: DRAFTING
>>>
>>> Introduction
>>> ============
>>>
>>> This document outlines the design for multi-threaded TCG emulation.
>>> The original TCG implementation was single threaded and dealt with
>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>> lot of things but became increasingly limited as emulated systems
>>> gained additional cores and per-core performance gains for host
>>> systems started to level off.
>>>
>>> Memory Consistency
>>> ==================
>>>
>>> Between emulated guests and host systems there is a range of memory
>>> consistency models. While emulating weakly ordered systems on strongly
>>> ordered hosts shouldn't cause any problems, the same is not true for
>>> the reverse setup.
>>>
>>> The proposed design currently does not address the problem of
>>> emulating strong ordering on a weakly ordered host, although even on
>>> strongly ordered systems software should be using synchronisation
>>> primitives to ensure correct operation.
>>>
>>> Memory Barriers
>>> ---------------
>>>
>>> Barriers (sometimes known as fences) provide a mechanism for software
>>> to enforce a particular ordering of memory operations from the point
>>> of view of external observers (e.g. another processor core). They can
>>> apply to any memory operations as well as just loads or stores.
>>>
>>> The Linux kernel has an excellent write-up on the various forms of
>>> memory barrier and the guarantees they can provide [1].
>>>
>>> Barriers are often wrapped around synchronisation primitives to
>>> provide explicit memory ordering semantics. However they can be used
>>> by themselves to provide safe lockless access by ensuring, for
>>> example, that a signal flag will always be set after a payload.
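>>>
>>> As a purely illustrative sketch (not proposed code, just the pattern
>>> in C11 atomics with placeholder names), the flag-after-payload idiom
>>> pairs a release store with an acquire load:
>>>
>>>     #include <stdatomic.h>
>>>     #include <stdbool.h>
>>>
>>>     static int payload;
>>>     static atomic_bool flag;
>>>
>>>     /* Producer: write the payload first, then publish the flag. The
>>>      * release store is the barrier that stops the payload write
>>>      * being reordered after the flag update. */
>>>     static void producer(void)
>>>     {
>>>         payload = 42;
>>>         atomic_store_explicit(&flag, true, memory_order_release);
>>>     }
>>>
>>>     /* Consumer: the acquire load pairs with the release store, so if
>>>      * the flag is seen as set the payload is guaranteed visible. */
>>>     static void consumer(void)
>>>     {
>>>         if (atomic_load_explicit(&flag, memory_order_acquire)) {
>>>             (void)payload;  /* safe to read here */
>>>         }
>>>     }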
>>>
>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>
>>> This would enforce a strong load/store ordering so all loads/stores
>>> complete at the memory barrier. On single-core non-SMP strongly
>>> ordered backends this could become a NOP.
>>>
>>> There may be a case for further refinement if this causes performance
>>> bottlenecks.
>>>
>>> Memory Control and Maintenance
>>> ------------------------------
>>>
>>> This includes a class of instructions for controlling system cache
>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>> are often seen when code modification has taken place to ensure the
>>> changes take effect.
>>>
>>> Synchronisation Primitives
>>> --------------------------
>>>
>>> There are two broad types of synchronisation primitives found in
>>> modern ISAs: atomic instructions and exclusive regions.
>>>
>>> The first type offers a simple atomic instruction which guarantees
>>> that some sort of test and conditional store will be truly atomic
>>> w.r.t. other cores sharing access to the memory. The classic example
>>> is the x86 cmpxchg instruction.
>>>
>>> The second type offers a pair of load/store instructions which
>>> guarantee that a region of memory has not been touched between the
>>> load and store instructions. An example of this is ARM's ldrex/strex
>>> pair, where the strex instruction will return a flag indicating a
>>> successful store only if no other CPU has accessed the memory region
>>> since the ldrex.
>>>
>>> Traditionally TCG has generated a series of operations that work
>>> because they are within the context of a single translation block so
>>> will have completed before another CPU is scheduled. However with
>>> the ability to have multiple threads running to emulate multiple CPUs
>>> we will need to explicitly expose these semantics.
>>>
>>> DESIGN REQUIREMENTS:
>>>  - atomics
>>>    - Introduce some atomic TCG ops for the common semantics
>>>    - The default fallback helper function will use qemu_atomics
>>>      (see the illustrative sketch below)
>>>    - Each backend can then add a more efficient implementation
>>>  - load/store exclusive
>>>    [AJB:
>>>     There are currently a number of proposals of interest:
>>>      - Greensocs tweaks to ldst ex (using locks)
>>>      - Slow-path for atomic instruction translation [2]
>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>    ]
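>>>
>>> As a purely illustrative sketch of the fallback idea (the helper name
>>> and signature here are placeholders, not a proposed API), a guest
>>> 32-bit cmpxchg could be backed by a host atomic compare-and-swap:
>>>
>>>     #include <stdbool.h>
>>>     #include <stdint.h>
>>>
>>>     /* Hypothetical fallback helper: emulate a guest cmpxchg by doing
>>>      * the compare-and-swap atomically on the host. A real helper
>>>      * would resolve the guest address through the TLB first. */
>>>     static uint32_t helper_atomic_cmpxchg32(uint32_t *haddr,
>>>                                             uint32_t cmp,
>>>                                             uint32_t newval)
>>>     {
>>>         uint32_t expected = cmp;
>>>
>>>         /* On success *haddr becomes newval; either way 'expected' is
>>>          * left holding the value actually found in memory. */
>>>         __atomic_compare_exchange_n(haddr, &expected, newval, false,
>>>                                     __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
>>>         return expected;
>>>     }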
>>>
>>>
>>> Shared Data Structures
>>> ======================
>>>
>>> Global TCG State
>>> ----------------
>>>
>>> We need to protect the entire code generation cycle including any
>>> post-generation patching of the translated code. This also implies a
>>> shared translation buffer which contains code running on all cores.
>>> Any execution path that comes to the main run loop will need to hold
>>> a mutex for code generation. This also includes times when we need to
>>> flush code or jumps from the tb_cache.
>>>
>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>> and jump cache modification
>>
>> Actually from my point of view jump cache modification requires more
>> than a lock, as another VCPU thread can be executing code during the
>> modification.
>>
>> Fortunately this happens "only" with tlb_flush, tlb_page_flush,
>> tb_flush and tb_invalidate, which need all CPUs to be halted anyway.
>
> How about:
>
> DESIGN REQUIREMENT:
>  - Code generation and patching will be protected by a lock
>  - Jump cache modification will assert all CPUs are halted
>
>>>
>>> Memory maps and TLBs
>>> --------------------
>>>
>>> The memory handling code is fairly critical to the speed of memory
>>> access in the emulated system.
>>>
>>>  - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>  - Dirty page tracking (for code gen, migration and display)
>>>  - Virtual TLB (for translating guest address->real address)
>>>
>>> There is both a fast path walked by the generated code and a slow
>>> path when resolution is required. When the TLB tables are updated we
>>> need to ensure they are done in a safe way by bringing all executing
>>> threads to a halt before making the modifications.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>  - TLB Flush All/Page
>>>    - can be across-CPUs
>>>    - will need all other CPUs brought to a halt
>>>  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>    - This is a per-CPU table - by definition can't race
>>>    - updated by its own thread when the slow-path is forced
>>
>> Actually, as we have approximately the same behaviour for all of these
>> memory handling operations, e.g. tb_flush, tb_*_invalidate and
>> tlb_*_flush, which all manipulate the TranslationBlock and jump cache
>> across CPUs, I think we have to add a generic "exit and do something"
>> mechanism for the CPU threads.
>> So every VCPU thread has a list of things to do when it exits (such as
>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for other
>> CPUs and flushing only one entry for tb_invalidate).
>
> Sounds like I should write an additional section to describe the process
> of halting CPUs and carrying out deferred per-CPU actions as well as
> ensuring we can tell when they are all halted.
>
>>> Emulated hardware state
>>> -----------------------
>>>
>>> Currently the hardware emulation has no protection against multiple
>>> accesses. However guest systems accessing emulated hardware should be
>>> carrying out their own locking to prevent multiple CPUs confusing the
>>> hardware. Of course there is no guarantee that there couldn't be a
>>> broken guest that doesn't lock, so you could get racing accesses to
>>> the hardware.
>>>
>>> There is the class of paravirtualized hardware (VIRTIO) that works in
>>> a purely MMIO mode, often setting flags directly in guest memory as a
>>> result of a guest-triggered transaction.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>  - Access to IO Memory should be serialised by an IOMem mutex
>>>  - The mutex should be recursive (e.g. allowing pid to relock itself)
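>>>
>>> A purely illustrative sketch of what such a recursive IOMem mutex
>>> could look like (using plain pthreads here; the name is a placeholder
>>> and this is not the mechanism the patch series implements):
>>>
>>>     #include <pthread.h>
>>>
>>>     static pthread_mutex_t iomem_mutex;
>>>
>>>     /* Initialise the lock as recursive so the same thread can safely
>>>      * re-take it, e.g. when one MMIO access triggers another. */
>>>     static void iomem_mutex_init(void)
>>>     {
>>>         pthread_mutexattr_t attr;
>>>
>>>         pthread_mutexattr_init(&attr);
>>>         pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
>>>         pthread_mutex_init(&iomem_mutex, &attr);
>>>         pthread_mutexattr_destroy(&attr);
>>>     }
>>>
>>>     /* Every emulated IO access would then be wrapped as:
>>>      *   pthread_mutex_lock(&iomem_mutex);
>>>      *   ... dispatch to the device model ...
>>>      *   pthread_mutex_unlock(&iomem_mutex);
>>>      */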
>>
>> That might be done with the global mutex as it is today?
>> We need changes here anyway to have VCPU threads running in parallel.
>
> I'm not sure re-using the global mutex is a good idea. I've had to hack
> the global mutex to allow recursive locking to get around the virtio
> hang I discovered last week. While it works I'm uneasy making such a
> radical change upstream given how widely the global mutex is used, hence
> the suggestion to have an explicit IOMem mutex.
>
> Actually I'm surprised the iothread mutex just re-uses the global one.
> I guess I need to talk to the IO guys as to why they took that
> decision.
>
>>
>> Thanks,
>
> Thanks for your quick review :-)
>
>> Fred
>>
>>> IO Subsystem
>>> ------------
>>>
>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>> be no additional locking required once we reach the Block Driver.
>>>
>>> DESIGN REQUIREMENTS:
>>>
>>>  - The dataplane should continue to be protected by the iothread locks
>>>
>>>
>>> References
>>> ==========
>>>
>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>
> --
> Alex Bennée

+44 (0)20 7100 3485 x 210
+33 (0)5 33 52 01 77x 210
+33 (0)603762104
mark.burton