Hi Will,

On Thu, 2025-09-18 at 21:21 +0100, Will Deacon wrote:
> Hi Patrick,
> We chatted briefly at KVM Forum, so I wanted to chime in here too from
> the arm64 side.
> 
> On Fri, Sep 12, 2025 at 09:17:37AM +0000, Roy, Patrick wrote:
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
>> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
>> guest_memfd itself being supported, but also on whether Linux supports
>> manipulating the direct map at page granularity at all. This is possible
>> most of the time; outliers are arm64, where it is impossible if the
>> direct map has been set up using hugepages (arm64 cannot break these
>> apart due to break-before-make semantics), and powerpc, which does not
>> select ARCH_HAS_SET_DIRECT_MAP (and does not support guest_memfd anyway).
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The use case for also removing the
>> direct map entries of the shared parts of guest_memfd is a special type
>> of non-CoCo VM where host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated.  In this
>> setup, KVM retains access to guest memory via userspace mappings of
>> guest_memfd, which are reflected back into KVM's memslots via
>> userspace_addr. This is needed for things like MMIO emulation on x86_64
>> to work.
>>
>> Do not perform TLB flushes after direct map manipulations. This is
>> because TLB flushes resulted in an up to 40x elongation of page faults
>> in guest_memfd (scaling with the number of CPU cores), or a 5x
>> elongation of memory population. TLB flushes are not needed for
>> functional correctness (the virt->phys mapping technically stays
>> "correct"; the kernel should simply not use it for a while). On the
>> other hand, it means that the desired protection from Spectre-style
>> attacks is not perfect, as an attacker could try to prevent a stale TLB
>> entry from getting evicted, keeping it alive until the page it refers
>> to is used by the guest for some sensitive data, and then targeting it
>> using a Spectre gadget.
> 
> I'm really not keen on this last part (at least, for arm64).
> 
> If you're not going to bother invalidating the TLB after unmapping from
> the direct map because of performance reasons, you're better off just
> leaving the direct map intact and getting even better performance. On
> arm64, that would mean you could use block mappings too.

Not until we have hardware with the newest BBM goodies, I thought. When I
checked earlier this year, a defconfig had the direct map set up at 4K
granularity.

> On the other hand, if you actually care about the security properties
> from the unmap then you need the invalidation so that the mapping
> doesn't linger around. With "modern" CPU features such as pte
> aggregation and shared TLB walk caches it's not unlikely that these
> entries will persist a lot longer than you think and it makes the
> security benefits of this series impossible to reason about.

I agree it's not 100% protection, but it is still better than the status quo. I
would also love to have the TLB flushes, but sadly the performance impact of
them would make this completely unusable for Amazon :/

Hmm, thinking about it now though, IIRC the performance problems were mostly
because all CPUs needed to acknowledge the flush before the issuing CPU could
continue. Is there a way to have "fire and forget" flushes, where we don't wait
for acknowledgements?

> As a compromise, could we make the TLB invalidation an architecture
> opt-in so that we can have it enabled on arm64, please?

How about instead of an architecture opt-in, we have some sort of opt-out flag
userspace can set? Similar to the PFNMAP stuff KVM already has.

> Will

Best,
Patrick
