Mark Burton <mark.bur...@greensocs.com> writes:

> I think we SHOULD use the wiki - and keep it current. A lot of what you
> have is in the wiki too, but I'd like to see the wiki updated.
> We will add our stuff there too…
I'll do a pass today and update it to point to lists, discussions and WIP
trees.

> Cheers
> Mark.
>
>> On 15 Jun 2015, at 12:06, Alex Bennée <alex.ben...@linaro.org> wrote:
>>
>> Frederic Konrad <fred.kon...@greensocs.com> writes:
>>
>>> On 12/06/2015 18:37, Alex Bennée wrote:
>>>> Hi,
>>>
>>> Hi Alex,
>>>
>>> I've completed some of the points below. We will also work on a design
>>> decisions document to add to this one.
>>>
>>> We probably want to merge that with what we did on the wiki?
>>> http://wiki.qemu.org/Features/tcg-multithread
>>
>> Well hopefully there is cross-over as I started with the wiki as a basis
>> ;-)
>>
>> Do we want to just keep the wiki as the live design document or put
>> pointers to the current drafts? I'm hoping eventually the page will just
>> point to the design in the doc directory at git.qemu.org.
>>
>>>> One thing that Peter has been asking for is a design document for the
>>>> way we are going to approach multi-threaded TCG emulation. I started
>>>> with the information that was captured on the wiki and tried to build
>>>> on that. It's almost certainly incomplete but I thought it would be
>>>> worth posting for wider discussion early rather than later.
>>>>
>>>> One obvious omission at the moment is the lack of discussion about
>>>> other non-TLB shared data structures in QEMU (I'm thinking of the
>>>> various dirty page tracking bits; I'm sure there are more).
>>>>
>>>> I've also deliberately tried to avoid documenting the design decisions
>>>> made in the current GreenSocs patch series. This is so we can
>>>> concentrate on the big picture before getting side-tracked into the
>>>> implementation details.
>>>>
>>>> I have now started digging into the GreenSocs code in earnest and the
>>>> plan is that eventually the design and the implementation will
>>>> converge on a final documented complete solution ;-)
>>>>
>>>> Anyway, as ever, I look forward to the comments and discussion:
>>>>
>>>> STATUS: DRAFTING
>>>>
>>>> Introduction
>>>> ============
>>>>
>>>> This document outlines the design for multi-threaded TCG emulation.
>>>> The original TCG implementation was single-threaded and dealt with
>>>> multiple CPUs with simple round-robin scheduling. This simplified a
>>>> lot of things but became increasingly limited as the systems being
>>>> emulated gained additional cores and per-core performance gains for
>>>> host systems started to level off.
>>>>
>>>> Memory Consistency
>>>> ==================
>>>>
>>>> Between emulated guests and host systems there is a range of memory
>>>> consistency models. While emulating weakly ordered systems on strongly
>>>> ordered hosts shouldn't cause any problems, the same is not true for
>>>> the reverse setup.
>>>>
>>>> The proposed design currently does not address the problem of
>>>> emulating strong ordering on a weakly ordered host, although even on
>>>> strongly ordered systems software should be using synchronisation
>>>> primitives to ensure correct operation.
>>>>
>>>> Memory Barriers
>>>> ---------------
>>>>
>>>> Barriers (sometimes known as fences) provide a mechanism for software
>>>> to enforce a particular ordering of memory operations from the point
>>>> of view of external observers (e.g. another processor core). They can
>>>> apply to any memory operations as well as just loads or stores.
>>>>
>>>> The Linux kernel has an excellent write-up on the various forms of
>>>> memory barrier and the guarantees they can provide [1].
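As an aside, the kind of guarantee [1] describes can be pinned down with a
small example. Here is a minimal sketch of the classic flag/payload pattern,
written with plain C11 atomics rather than QEMU's own atomic helpers; the
names and values are purely illustrative:

  /* Producer/consumer flag-and-payload pattern: the payload must become
   * visible to other cores before the flag, and the consumer must not
   * read the payload until it has observed the flag. */
  #include <stdatomic.h>
  #include <stdbool.h>

  static int payload;
  static atomic_bool flag;

  static void producer(void)
  {
      payload = 42;   /* plain store */
      /* release ordering: the payload store cannot be reordered after
       * this flag store */
      atomic_store_explicit(&flag, true, memory_order_release);
  }

  static bool consumer(int *out)
  {
      /* acquire ordering: the payload load cannot be reordered before
       * this flag load */
      if (atomic_load_explicit(&flag, memory_order_acquire)) {
          *out = payload;   /* guaranteed to observe 42 */
          return true;
      }
      return false;
  }

A tcg_memory_barrier op would have to give the generated code the same kind
of guarantee regardless of which host we are running on.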
>>>> Barriers are often wrapped around synchronisation primitives to
>>>> provide explicit memory ordering semantics. However they can be used
>>>> by themselves to provide safe lockless access by ensuring, for
>>>> example, that a signal flag will always be set after its payload.
>>>>
>>>> DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
>>>>
>>>> This would enforce a strong load/store ordering so all loads/stores
>>>> complete at the memory barrier. On single-core non-SMP strongly
>>>> ordered backends this could become a NOP.
>>>>
>>>> There may be a case for further refinement if this causes performance
>>>> bottlenecks.
>>>>
>>>> Memory Control and Maintenance
>>>> ------------------------------
>>>>
>>>> This includes a class of instructions for controlling system cache
>>>> behaviour. While QEMU doesn't model cache behaviour these instructions
>>>> are often seen when code modification has taken place, to ensure the
>>>> changes take effect.
>>>>
>>>> Synchronisation Primitives
>>>> --------------------------
>>>>
>>>> There are two broad types of synchronisation primitives found in
>>>> modern ISAs: atomic instructions and exclusive regions.
>>>>
>>>> The first type offers a simple atomic instruction which guarantees
>>>> that some sort of test-and-conditional-store will be truly atomic
>>>> w.r.t. other cores sharing access to the memory. The classic example
>>>> is the x86 cmpxchg instruction.
>>>>
>>>> The second type offers a pair of load/store instructions which
>>>> guarantee that a region of memory has not been touched between the
>>>> load and store instructions. An example of this is ARM's ldrex/strex
>>>> pair, where the strex instruction will return a flag indicating a
>>>> successful store only if no other CPU has accessed the memory region
>>>> since the ldrex.
>>>>
>>>> Traditionally TCG has generated a series of operations that work
>>>> because they are within the context of a single translation block, so
>>>> they will have completed before another CPU is scheduled. However with
>>>> the ability to have multiple threads running to emulate multiple CPUs
>>>> we will need to explicitly expose these semantics.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>   - atomics
>>>>     - Introduce some atomic TCG ops for the common semantics
>>>>     - The default fallback helper function will use qemu_atomics
>>>>     - Each backend can then add a more efficient implementation
>>>>   - load/store exclusive
>>>>     [AJB:
>>>>      There are currently a number of proposals of interest:
>>>>      - GreenSocs tweaks to ldst ex (using locks)
>>>>      - Slow-path for atomic instruction translation [2]
>>>>      - Helper-based Atomic Instruction Emulation (AIE) [3]
>>>>     ]
>>>>
>>>> Shared Data Structures
>>>> ======================
>>>>
>>>> Global TCG State
>>>> ----------------
>>>>
>>>> We need to protect the entire code generation cycle, including any
>>>> post-generation patching of the translated code. This also implies a
>>>> shared translation buffer which contains code running on all cores.
>>>> Any execution path that comes to the main run loop will need to hold a
>>>> mutex for code generation. This also includes times when we need to
>>>> flush code or jumps from the tb_cache.
>>>>
>>>> DESIGN REQUIREMENT: Add locking around all code generation, patching
>>>> and jump cache modification
>>>
>>> Actually from my point of view jump cache modification requires more
>>> than a lock, as other VCPU threads can be executing code during the
>>> modification.
>>> Fortunately this happens "only" with tlb_flush, tlb_page_flush,
>>> tb_flush and tb_invalidate, which need all CPUs to be halted anyway.
>>
>> How about:
>>
>> DESIGN REQUIREMENT:
>>   - Code generation and patching will be protected by a lock
>>   - Jump cache modification will assert all CPUs are halted
>>
>>>> Memory maps and TLBs
>>>> --------------------
>>>>
>>>> The memory handling code is fairly critical to the speed of memory
>>>> access in the emulated system.
>>>>
>>>> - Memory regions (dividing up access to PIO, MMIO and RAM)
>>>> - Dirty page tracking (for code gen, migration and display)
>>>> - Virtual TLB (for translating guest address -> real address)
>>>>
>>>> There is both a fast path walked by the generated code and a slow
>>>> path when resolution is required. When the TLB tables are updated we
>>>> need to ensure the updates are done in a safe way, by bringing all
>>>> executing threads to a halt before making the modifications.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - TLB Flush All/Page
>>>>   - can be across-CPUs
>>>>   - will need all other CPUs brought to a halt
>>>> - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
>>>>   - This is a per-CPU table - by definition it can't race
>>>>   - updated by its own thread when the slow-path is forced
>>>
>>> Actually, as we have approximately the same behaviour for all of these
>>> memory handling operations, e.g. tb_flush, tb_*_invalidate and
>>> tlb_*_flush, which all play with the TranslationBlock and the jump
>>> cache across CPUs, I think we have to add a generic "exit and do
>>> something" mechanism for the CPU threads.
>>> So every VCPU thread has a list of things to do when it exits (such as
>>> clearing its own tb_jmp_cache during a tlb_flush, or waiting for the
>>> other CPUs and flushing only one entry for tb_invalidate).
>>
>> Sounds like I should write an additional section to describe the
>> process of halting CPUs and carrying out deferred per-CPU actions, as
>> well as ensuring we can tell when they are all halted (there is a rough
>> sketch of one possible mechanism further down in this mail).
>>
>>>> Emulated hardware state
>>>> -----------------------
>>>>
>>>> Currently the hardware emulation has no protection against multiple
>>>> accesses. However, guest systems accessing emulated hardware should be
>>>> carrying out their own locking to prevent multiple CPUs confusing the
>>>> hardware. Of course there is no guarantee that there couldn't be a
>>>> broken guest that doesn't lock, so you could get racing accesses to
>>>> the hardware.
>>>>
>>>> There is also the class of paravirtualized hardware (VIRTIO) that
>>>> works in a purely MMIO mode, often setting flags directly in guest
>>>> memory as a result of a guest-triggered transaction.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - Access to IO Memory should be serialised by an IOMem mutex
>>>> - The mutex should be recursive (e.g. allowing the same thread to
>>>>   relock it)
>>>
>>> That might be done with the global mutex as it is today?
>>> We need changes here anyway to have VCPU threads running in parallel.
>>
>> I'm not sure re-using the global mutex is a good idea. I've had to hack
>> the global mutex to allow recursive locking to get around the virtio
>> hang I discovered last week. While it works, I'm uneasy about making
>> such a radical change upstream given how widely the global mutex is
>> used, hence the suggestion to have an explicit IOMem mutex.
>>
>> Actually I'm surprised the iothread mutex just re-uses the global one.
>> I guess I need to talk to the IO guys as to why they took that
>> decision.
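On the recursive mutex point: it doesn't need anything exotic. Here is a
minimal sketch of an explicit IOMem lock using plain pthreads; the iomem_*
names are hypothetical and a real patch would presumably go through QEMU's
own qemu-thread wrappers instead:

  #include <pthread.h>

  static pthread_mutex_t iomem_lock;

  static void iomem_lock_init(void)
  {
      pthread_mutexattr_t attr;

      pthread_mutexattr_init(&attr);
      /* A recursive mutex tracks an owner and a depth count, so a thread
       * that already holds it (e.g. an MMIO access issued from within
       * another MMIO handler) can take it again without deadlocking. */
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
      pthread_mutex_init(&iomem_lock, &attr);
      pthread_mutexattr_destroy(&attr);
  }

  static void iomem_region_lock(void)
  {
      pthread_mutex_lock(&iomem_lock);
  }

  static void iomem_region_unlock(void)
  {
      pthread_mutex_unlock(&iomem_lock);
  }

That would keep the recursion confined to the IO path rather than changing
the semantics of the global mutex for everyone.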
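And going back to the "exit and do something" mechanism Fred describes
above, here is a rough sketch of the shape a deferred per-CPU work list
could take. All of the names (DeferredWork, CPUWorkState, cpu_queue_work,
cpu_drain_work) are hypothetical rather than existing QEMU APIs; it is
only meant to pin the idea down for the design document:

  #include <pthread.h>
  #include <stdlib.h>

  typedef struct DeferredWork {
      void (*func)(void *opaque);  /* e.g. flush this CPU's tb_jmp_cache */
      void *opaque;
      struct DeferredWork *next;
  } DeferredWork;

  typedef struct CPUWorkState {
      pthread_mutex_t lock;
      DeferredWork *queue;
      int exit_request;            /* checked by the execution loop */
  } CPUWorkState;

  /* Called by another thread, e.g. the one initiating a cross-CPU
   * tlb_flush. A real implementation would also kick the target vCPU
   * out of the translated code so the request is seen promptly. */
  static void cpu_queue_work(CPUWorkState *cpu,
                             void (*func)(void *), void *opaque)
  {
      DeferredWork *w = malloc(sizeof(*w));

      w->func = func;
      w->opaque = opaque;
      pthread_mutex_lock(&cpu->lock);
      w->next = cpu->queue;
      cpu->queue = w;
      cpu->exit_request = 1;
      pthread_mutex_unlock(&cpu->lock);
  }

  /* Called by the vCPU thread itself once it is back in the outer loop. */
  static void cpu_drain_work(CPUWorkState *cpu)
  {
      pthread_mutex_lock(&cpu->lock);
      while (cpu->queue) {
          DeferredWork *w = cpu->queue;
          cpu->queue = w->next;
          w->func(w->opaque);
          free(w);
      }
      cpu->exit_request = 0;
      pthread_mutex_unlock(&cpu->lock);
  }

Whoever initiates a cross-CPU flush would queue the per-CPU work on every
vCPU and then wait until all of them have drained their queues before
touching the shared structures.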
>>
>>> Thanks,
>>
>> Thanks for your quick review :-)
>>
>>> Fred
>>>
>>>> IO Subsystem
>>>> ------------
>>>>
>>>> The I/O subsystem is heavily used by KVM and has seen a lot of
>>>> improvements to offload I/O tasks to dedicated IOThreads. There should
>>>> be no additional locking required once we reach the Block Driver.
>>>>
>>>> DESIGN REQUIREMENTS:
>>>>
>>>> - The dataplane should continue to be protected by the iothread locks
>>>>
>>>> References
>>>> ==========
>>>>
>>>> [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
>>>> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/334561
>>>> [3] http://thread.gmane.org/gmane.comp.emulators.qemu/335297
>>
>> --
>> Alex Bennée

--
Alex Bennée