Hi Will,

On Thu, 2025-09-18 at 21:21 +0100, Will Deacon wrote:
> Hi Patrick,
> We chatted briefly at KVM Forum, so I wanted to chime in here too from
> the arm64 side.
>
> On Fri, Sep 12, 2025 at 09:17:37AM +0000, Roy, Patrick wrote:
>> Add GUEST_MEMFD_FLAG_NO_DIRECT_MAP flag for KVM_CREATE_GUEST_MEMFD()
>> ioctl. When set, guest_memfd folios will be removed from the direct map
>> after preparation, with direct map entries only restored when the folios
>> are freed.
>>
>> To ensure these folios do not end up in places where the kernel cannot
>> deal with them, set AS_NO_DIRECT_MAP on the guest_memfd's struct
>> address_space if GUEST_MEMFD_FLAG_NO_DIRECT_MAP is requested.
>>
>> Add KVM_CAP_GUEST_MEMFD_NO_DIRECT_MAP to let userspace discover whether
>> guest_memfd supports GUEST_MEMFD_FLAG_NO_DIRECT_MAP. Support depends on
>> guest_memfd itself being supported, but also on whether Linux supports
>> manipulating the direct map at page granularity at all (possible most of
>> the time, outliers being arm64 where it's impossible if the direct map
>> has been set up using hugepages, as arm64 cannot break these apart due to
>> break-before-make semantics, and powerpc, which does not select
>> ARCH_HAS_SET_DIRECT_MAP, which also doesn't support guest_memfd anyway
>> though).
>>
>> Note that this flag causes removal of direct map entries for all
>> guest_memfd folios independent of whether they are "shared" or "private"
>> (although current guest_memfd only supports either all folios in the
>> "shared" state, or all folios in the "private" state if
>> GUEST_MEMFD_FLAG_MMAP is not set). The use case for removing direct map
>> entries of also the shared parts of guest_memfd is a special type of
>> non-CoCo VM where host userspace is trusted to have access to all of
>> guest memory, but where Spectre-style transient execution attacks
>> through the host kernel's direct map should still be mitigated.
>> In this setup, KVM retains access to guest memory via userspace
>> mappings of guest_memfd, which are reflected back into KVM's memslots
>> via userspace_addr. This is needed for things like MMIO emulation on
>> x86_64 to work.
>>
>> Do not perform TLB flushes after direct map manipulations. This is
>> because TLB flushes resulted in an up to 40x elongation of page faults in
>> guest_memfd (scaling with the number of CPU cores), or a 5x elongation
>> of memory population. TLB flushes are not needed for functional
>> correctness (the virt->phys mapping technically stays "correct", the
>> kernel should simply not use it for a while). On the other hand, it means
>> that the desired protection from Spectre-style attacks is not perfect,
>> as an attacker could try to prevent a stale TLB entry from getting
>> evicted, keeping it alive until the page it refers to is used by the
>> guest for some sensitive data, and then targeting it using a
>> Spectre gadget.
>
> I'm really not keen on this last part (at least, for arm64).
>
> If you're not going to bother invalidating the TLB after unmapping from
> the direct map because of performance reasons, you're better off just
> leaving the direct map intact and getting even better performance. On
> arm64, that would mean you could use block mappings too.

Not until we have hardware with the newest BBM goodies, I thought. When I
checked earlier this year, a defconfig had the direct map set up at 4k
granularity.

> On the other hand, if you actually care about the security properties
> from the unmap then you need the invalidation so that the mapping
> doesn't linger around. With "modern" CPU features such as pte
> aggregation and shared TLB walk caches it's not unlikely that these
> entries will persist a lot longer than you think and it makes the
> security benefits of this series impossible to reason about.

I agree it's not 100% protection, but it is still better than the status
quo. I would also love to have the TLB flushes, but sadly the performance
impact of them would make this completely unusable for Amazon :/

Mh, thinking about it now though, iirc the performance problems were
mostly because all CPUs needed to acknowledge the flush before the issuing
CPU could continue. Is there a way to have "fire and forget" flushes,
where we don't wait for acknowledgements?

> As a compromise, could we make the TLB invalidation an architecture
> opt-in so that we can have it enabled on arm64, please?

How about instead of an architecture opt-in, we have some sort of opt-out
flag userspace can set? Similar to the PFNMAP stuff KVM already has.

> Will

Best,
Patrick