Hi Alex,

On Thu, Sep 10, 2015 at 6:19 PM, Alex Bennée <alex.ben...@linaro.org> wrote:
>
> alvise rigo <a.r...@virtualopensystems.com> writes:
>
>> Hi Paolo,
>>
>> A brief update on this. I have a first implementation of the idea you
>> proposed, though it's not working really well. The failing rate of SCs
>> for some reason is very high.
>
> Due to high memory contention on the EXCL page?
I guess it's also due to the fact that we were using the bitmap itself
to determine whether a SC operation should fail or not. The SC should
instead fail when another vCPU has an intersecting exclusive range set,
not based on the state of the bitmap.

>
>> Instead of trying to fix it, I came up with this alternative design:
>> we still use 8 bits per page and we group the smp_cpus in clusters of
>> maximum size ceiling(smp_cpus / 8).
>> Given a byte describing a certain page, its first bit will represent
>> the vCPUs of index 0, 9, 17, ... and so on for the other bits.
>>
>> Note that systems with eight vCPUs or fewer will not be affected by
>> this simplified model.
>>
>> Now, unless I'm missing some advantages of your proposal other than
>> the less aggressive memory consumption, I would go with this design.
>
> BTW have you had a look at Emilio's series? I'm about half way through
> reviewing it but I'm looking forward to seeing how his implementation
> differs from yours. Maybe there are some ideas in there that will
> inspire?

I still have to take an in-depth look at it. However, unless I'm
missing something, the problem of the fast path not being interceptable
is still there, isn't it?

Regards,
alvise

>
>>
>> Regards,
>> alvise
>>
>> On Wed, Aug 12, 2015 at 2:36 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>
>>>
>>> On 12/08/2015 09:31, alvise rigo wrote:
>>>> I think that tlb_flush_entry is not enough, since in theory another
>>>> vCPU could have a different TLB address referring to the same phys
>>>> address.
>>>
>>> You're right, this is a TLB so it's virtually-indexed. :( I'm not sure
>>> what happens on ARM, since it has a virtually indexed (VIVT or VIPT)
>>> cache, but indeed it would be a problem when implementing e.g. CMPXCHG
>>> using the TCG ll/sc ops.
>>>
>>> I'm a bit worried about adding such a big bitmap. It's only used on
>>> TCG, but it is also allocated on KVM, and on KVM you can have hundreds
>>> of VCPUs. Wasting 200 bits per guest memory page (i.e.
>>> ~0.6% of guest memory) is obviously not a great idea. :(
>>>
>>> Perhaps we can use a bytemap instead:
>>>
>>> - 0..253 = TLB_EXCL must be set in all VCPUs except CPU n. A VCPU that
>>>   loads the TLB for this vaddr does not have to set it.
>>>
>>> - 254 = TLB_EXCL must be set in all VCPUs. A VCPU that loads the TLB
>>>   for this vaddr has to set it.
>>>
>>> - 255 = TLB_EXCL not set in at least two VCPUs
>>>
>>> Transitions:
>>>
>>> - ll transitions: anything -> 254
>>>
>>> - sc transitions: 254 -> current CPU_ID
>>>
>>> - TLB_EXCL store transitions: 254 -> current CPU_ID
>>>
>>> - tlb_set_page transitions: CPU_ID other than current -> 255
>>>
>>> The initial value is 255 on SMP guests, 0 on UP guests.
>>>
>>> The algorithms are very similar to yours, just using this approximate
>>> representation.
>>>
>>> ll algorithm:
>>>     llsc_value = bytemap[vaddr]
>>>     if llsc_value == CPU_ID
>>>         do nothing
>>>     else if llsc_value < 254
>>>         flush TLB of CPU llsc_value
>>>     else if llsc_value == 255
>>>         flush all TLBs
>>>     set TLB_EXCL
>>>     bytemap[vaddr] = 254
>>>     load
>>>
>>> tlb_set_page algorithm:
>>>     llsc_value = bytemap[vaddr]
>>>     if llsc_value == CPU_ID or llsc_value == 255
>>>         do nothing
>>>     else if llsc_value == 254
>>>         set TLB_EXCL
>>>     else
>>>         # two CPUs without TLB_EXCL
>>>         bytemap[vaddr] = 255
>>>
>>> TLB_EXCL slow path algorithm:
>>>     if bytemap[vaddr] == 254
>>>         bytemap[vaddr] = CPU_ID
>>>     else
>>>         # two CPUs without TLB_EXCL
>>>         bytemap[vaddr] = 255
>>>     clear TLB_EXCL in this CPU
>>>     store
>>>
>>> sc algorithm:
>>>     if bytemap[vaddr] == CPU_ID or bytemap[vaddr] == 254
>>>         bytemap[vaddr] = CPU_ID
>>>         clear TLB_EXCL in this CPU
>>>         store
>>>         succeed
>>>     else
>>>         fail
>>>
>>> clear algorithm:
>>>     if bytemap[vaddr] == 254
>>>         bytemap[vaddr] = CPU_ID
>>>
>>> The UP case is optimized because bytemap[vaddr] will always be 0 or 254.
>>>
>>> The algorithm requires the LL to be cleared e.g. on exceptions.
>>> Paolo
>>>
>>>> alvise
>>>>
>>>> On Tue, Aug 11, 2015 at 6:32 PM, Paolo Bonzini <pbonz...@redhat.com> wrote:
>>>>>
>>>>>
>>>>> On 11/08/2015 18:11, alvise rigo wrote:
>>>>>>>> Why flush the entire cache (I understand you mean TLB)?
>>>>>> Sorry, I meant the TLB.
>>>>>> If for each removal of an exclusive entry we also set the bit to 1, we
>>>>>> force the following LL to make a tlb_flush() on every vCPU.
>>>>>
>>>>> What if you only flush one entry with tlb_flush_entry (on every vCPU)?
>>>>>
>>>>> Paolo

> --
> Alex Bennée
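[Editor's note: Paolo's bytemap transitions quoted above can be sketched as a
toy state machine in C. This is purely an illustrative model, not QEMU code:
the names (bytemap, llsc_*, LLSC_ALL, LLSC_NONE) are invented here, pages are
identified by a plain index instead of a vaddr, and the actual TLB flushes and
TLB_EXCL flag manipulation are reduced to comments.]

```c
/* Toy model of the per-page bytemap proposed above (illustrative only). */
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define NR_PAGES  8
#define LLSC_ALL  254   /* TLB_EXCL set in all vCPUs (state after an ll) */
#define LLSC_NONE 255   /* TLB_EXCL clear in at least two vCPUs */

static uint8_t bytemap[NR_PAGES];

/* Initial value: 255 on SMP guests, 0 on UP guests. */
static void llsc_init(bool smp)
{
    memset(bytemap, smp ? LLSC_NONE : 0, sizeof(bytemap));
}

/* ll: make every other vCPU trap stores to this page, then load. */
static void llsc_ll(int cpu, int page)
{
    uint8_t v = bytemap[page];
    if (v != cpu) {
        if (v < LLSC_ALL) {
            /* flush TLB of CPU v (omitted in this sketch) */
        } else if (v == LLSC_NONE) {
            /* flush all TLBs (omitted) */
        }
    }
    /* set TLB_EXCL for this vaddr in our TLB (omitted) */
    bytemap[page] = LLSC_ALL;
}

/* tlb_set_page: a second CPU without TLB_EXCL degrades the byte to 255. */
static void llsc_tlb_set_page(int cpu, int page)
{
    uint8_t v = bytemap[page];
    if (v == cpu || v == LLSC_NONE) {
        /* do nothing */
    } else if (v == LLSC_ALL) {
        /* set TLB_EXCL in the new entry (omitted) */
    } else {
        bytemap[page] = LLSC_NONE;  /* two CPUs now lack TLB_EXCL */
    }
}

/* TLB_EXCL store slow path: the first store downgrades 254 -> CPU_ID. */
static void llsc_excl_store(int cpu, int page)
{
    bytemap[page] = (bytemap[page] == LLSC_ALL) ? (uint8_t)cpu : LLSC_NONE;
    /* clear TLB_EXCL in this CPU, then perform the store (omitted) */
}

/* sc: succeeds only if no other vCPU has stored since our ll. */
static bool llsc_sc(int cpu, int page)
{
    uint8_t v = bytemap[page];
    if (v == cpu || v == LLSC_ALL) {
        bytemap[page] = (uint8_t)cpu;
        /* clear TLB_EXCL in this CPU and perform the store (omitted) */
        return true;
    }
    return false;
}
```

Tracing the contended case in this model: after llsc_ll(0, page) the byte is
254; a store by vCPU 1 through the slow path downgrades it to 1, so vCPU 0's
sc finds neither its own ID nor 254 and fails, which is the intended
semantics of the proposal.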