Hi Julien,

Thanks for your review.

> > The first one is addressed by relaxing VMALLS12E1IS -> VMALLE1IS.
> > Each CPU have their own private TLBs, so flush between vCPU of the
> > same domains is required to avoid translations from vCPUx to "leak"
> > to the vCPUy.
>
> This doesn't really tell me why we don't need the flush the S2. The key
> point is (barring altp2m) the stage-2 is common between all the vCPUs of
> a VM.

Alright, I'll update the commit message in version 2.

> > This can be achieved by using VMALLE1. If FEAT_nTLBPA
> > is present then VMALLE1 can also be avoided.
>
> I had a look at the Arm Arm and I can't figure out why it is fine to
> skip the flush. Can you provide a pointer? BTW, in general, it is useful
> to quote the Arm Arm for the reviewer and future reader. It makes easier
> to find what you are talking about.

Okay. This was pointed out by Mohamed Mediouni. From the Arm ARM:
> Translation table entry caching that is used for stage 1 translations and
> is indexed by the intermediate physical address of the location holding the
> translation table entry. However, FEAT_nTLBPA allows software
> discoverability of whether such caches exist, such that if FEAT_nTLBPA is
> implemented, such caching is not implemented.

> > +/*
> > + * FLush TLB by IPA. This will likely be used in a loop, so the caller
> > + * is responsible to use the appropriate memory barriers before/after
> > + * the sequence.
>
> If the goal is to call TLB_HELPER_IPA() in a loop, then the current
> implementation is too expensive.
>
> If the CPU doesn't need the repeat TLBI workaround, then you only need
> to do the dsb; isb once.
>
> If the CPU need the repeat TLBI workaround, looking at the Cortex A76
> errata doc (https://developer.arm.com/documentation/SDEN885749/latest/)
> then I think you might be able to do:
>
> "Flush TLBs"
> "DSB"
> "ISB"
> "Flush TLBs"
> "DSB"
> "ISB"

Yes, I deliberately did not use dsb/isb inside the TLB_HELPER_IPA() helper. As
the comment explains, the caller is responsible for the barriers because the
helper can be invoked in a loop, so dsb() and isb() belong before and after
that loop rather than inside it. (I forgot the isb() in my patch; I'll add
it.) For the repeat TLBI workaround I kept the same sequence as
TLB_HELPER_VA(), which also matches the Linux kernel:
https://github.com/torvalds/linux/blob/master/arch/arm64/include/asm/tlbflush.h#L32
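
As a rough sketch of the calling pattern described above, with the barriers issued once around the loop instead of once per page: the barrier and TLBI functions below are stand-in stubs that only count calls so the snippet builds on any host, and the names flush_range_sketch(), stub_dsb(), stub_isb() and stub_tlbi_ipa() are hypothetical, not the ones used in the patch. The exact barrier placement in v2 may differ.

```c
/* Stand-in stubs: they only count calls so the sequence can be checked. */
static unsigned int nr_dsb, nr_isb, nr_tlbi;

static void stub_dsb(void) { nr_dsb++; }
static void stub_isb(void) { nr_isb++; }
static void stub_tlbi_ipa(unsigned long ipa) { (void)ipa; nr_tlbi++; }

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K granule */

/* Barriers are issued once around the loop, not per invalidated page. */
static void flush_range_sketch(unsigned long ipa, unsigned long page_count)
{
    unsigned long i;

    stub_dsb();                 /* order prior page-table updates */

    for ( i = 0; i < page_count; i++ )
    {
        stub_tlbi_ipa(ipa);     /* one invalidate per page, no barrier here */
        ipa += 1UL << SKETCH_PAGE_SHIFT;
    }

    stub_dsb();                 /* wait for the invalidations to complete */
    stub_isb();                 /* resynchronise the instruction stream */
}
```

With this shape, flushing 8 pages costs 8 invalidate calls but still only two dsb and one isb.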

> > diff --git a/xen/arch/arm/include/asm/mmu/p2m.h 
> > b/xen/arch/arm/include/asm/mmu/p2m.h
> > index 58496c0b09..fc2e08bbe8 100644
> > --- a/xen/arch/arm/include/asm/mmu/p2m.h
> > +++ b/xen/arch/arm/include/asm/mmu/p2m.h
> > @@ -10,6 +10,10 @@ extern unsigned int p2m_root_level;
> >
> >   struct p2m_domain;
> >   void p2m_force_tlb_flush_sync(struct p2m_domain *p2m);
> > +#ifdef CONFIG_ARM_64
>
> We should also handle Arm 32-bit. Barring nTLBA, the code should be the
> same.

Okay, the nTLBPA feature is also available on Arm 32-bit. I'll update this.

> > diff --git a/xen/arch/arm/mmu/p2m.c b/xen/arch/arm/mmu/p2m.c
> > index 51abf3504f..28268fb67f 100644
> > --- a/xen/arch/arm/mmu/p2m.c
> > +++ b/xen/arch/arm/mmu/p2m.c
> > @@ -235,7 +235,12 @@ void p2m_restore_state(struct vcpu *n)
> >        * when running multiple vCPU of the same domain on a single pCPU.
> >        */
> >       if ( *last_vcpu_ran != INVALID_VCPU_ID && *last_vcpu_ran != 
> > n->vcpu_id )
> > +#ifdef CONFIG_ARM_64
> > +        if ( system_cpuinfo.mm64.ntlbpa != MM64_NTLBPA_SUPPORT_IMP )
>
> If we decide to use nTLBA, then we should introduce a capability so the
> check can be patched at aboot time.

Alright, I need to look into how a CPU capability is added in Xen. Is there a
commit I can use as a reference?

> > +        /*
> > +         * ARM64_WORKAROUND_AT_SPECULATE: We need to stop AT to allocate
> > +         * TLBs entries because the context is partially modified. We
> > +         * only need the VMID for flushing the TLBs, so we can generate
> > +         * a new VTTBR with the VMID to flush and the empty root table.
> > +         */
> > +        if ( !cpus_have_const_cap(ARM64_WORKAROUND_AT_SPECULATE) )
> > +            vttbr = p2m->vttbr;
> > +        else
> > +            vttbr = generate_vttbr(p2m->vmid, empty_root_mfn);
> > +
> > +        WRITE_SYSREG64(vttbr, VTTBR_EL2);
> > +
> > +        /* Ensure VTTBR_EL2 is synchronized before flushing the TLBs */
> > +        isb();
> > +    }
>
> I don't really like the idea to duplicate the AT speculation logic.
> Could we try to consolidate by introducing helper to load and unload the
> VTTBR?

Okay, I'll add load_vttbr() and restore_vttbr() helpers.

> > +
> > +    /* Ensure prior page-tables updates have completed */
> > +    dsb(ishst);
> > +
> > +    /* Invalidate stage-2 TLB entries by IPA range */
> > +    for ( i = 0; i < page_count; i++ ) {
> > +        flush_guest_tlb_one_s2(ipa);
> > +        ipa += 1UL << PAGE_SHIFT;
> > +    }
>
> In theory, __p2m_set_entry() could modify large region. For 1GB region
> it means the loop would send 262144 TLB instructions. This seems quite a
> lot.
>
> If the region is a superpage, then you might be able to send a single
> TLB instruction (need to confirm from the ARM ARM).
>
> If the region contains multiple mapping, then I wonder whether it would
> be better to flush the full S2. Not sure what would be the threshold.

__p2m_set_entry() invokes p2m_force_tlb_flush_range_sync() only after splitting
the superpage, so I think the invalidation has to be done at the normal page
size. IPAS2E1 takes no argument to specify a superpage size, only a base
address, with translation granules of 4K, 16K and 64K.
I'll do some profiling and report the threshold between full S2 invalidation
and IPA-based S2 invalidation for my use case.
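
To make the trade-off concrete, here is a small sketch of the arithmetic and of the kind of decision the profiling would feed. The helper names tlbi_count_for_size() and prefer_full_s2_flush() are hypothetical, a 4K granule is assumed, and the threshold is a parameter because no value has been measured yet.

```c
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4K granule */

/* Number of per-page invalidations needed to cover a region of this size. */
static unsigned long long tlbi_count_for_size(unsigned long long size_bytes)
{
    return size_bytes >> SKETCH_PAGE_SHIFT;
}

/*
 * Prefer a full stage-2 flush once the per-page loop would exceed the
 * threshold. The threshold is a placeholder to be set from profiling.
 */
static bool prefer_full_s2_flush(unsigned long long page_count,
                                 unsigned long long threshold)
{
    return page_count > threshold;
}
```

For a 1GB region this gives the 262144 invalidations mentioned in the review, which any realistic threshold would turn into a full flush.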

> > @@ -1090,8 +1169,13 @@ static int __p2m_set_entry(struct p2m_domain *p2m,
> >           p2m_remove_pte(entry, p2m->clean_pte);
> >
> >       if ( removing_mapping )
> > +#ifdef CONFIG_ARM_64
> > +        p2m_force_tlb_flush_range_sync(p2m, gfn_x(sgfn) << PAGE_SHIFT,
> > +                                       1UL << page_order);
> > +#else
> >           /* Flush can be deferred if the entry is removed */
> >           p2m->need_flush |= !!lpae_is_valid(orig_pte);
> > +#endif
>
> To emphasis on what I wrote above, this is one of the reason I would
> strongly prefer if we had support for p2m_force_flush_range_sync() on
> Arm 32-bit. This would make the code a lot simpler and easier to reason.

IPA-based TLBI (TLBIIPAS2) exists on Arm 32-bit only from Armv8-A onwards.
To simplify, p2m_force_tlb_flush_range_sync() can wrap
p2m_force_tlb_flush_sync() on Arm 32-bit for older architectures where the
instruction is unsupported. How is an architecture-specific feature
implemented in Xen? For example, this one is supported only from Armv8-A
onwards and range TLBI only from Armv8.4-A onwards. Any reference example
would be helpful.
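
The fallback I have in mind could look roughly like the sketch below. The flush functions are counting stubs, flush_range_or_full() is a hypothetical name, and the has_ipa_tlbi flag stands in for whatever feature or capability check Xen would actually use; none of this is existing Xen API.

```c
#include <stdbool.h>

/* Counting stubs standing in for the real flush routines. */
static unsigned int nr_full_flush, nr_range_flush;

static void stub_full_flush(void) { nr_full_flush++; }

static void stub_range_flush(unsigned long ipa, unsigned long page_count)
{
    (void)ipa;
    (void)page_count;
    nr_range_flush++;
}

/*
 * Use the IPA-based range flush when the instruction is available,
 * otherwise fall back to flushing the whole stage-2.
 */
static void flush_range_or_full(bool has_ipa_tlbi, unsigned long ipa,
                                unsigned long page_count)
{
    if ( has_ipa_tlbi )
        stub_range_flush(ipa, page_count);
    else
        stub_full_flush();
}
```

This would let callers such as __p2m_set_entry() use one entry point on both Arm 32-bit and Arm 64-bit without an #ifdef at the call site.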

Regards,
Haseeb
