Yan Zhao <[email protected]> writes:
> On Fri, Jun 26, 2026 at 08:28:32AM -0700, Ackerley Tng wrote:
>> Yan Zhao <[email protected]> writes:
>>
>> > On Thu, Jun 25, 2026 at 05:07:23PM -0700, Ackerley Tng wrote:
>> >> Yan Zhao <[email protected]> writes:
>> >>
>> >> > On Wed, Jun 24, 2026 at 04:00:32PM -0700, Ackerley Tng wrote:
>> >> >> Sean Christopherson <[email protected]> writes:
>> >> >>
>> >> >> > On Tue, Jun 23, 2026, Yan Zhao wrote:
>> >> >> >> On Tue, Jun 23, 2026 at 01:16:14PM +0800, Yan Zhao wrote:
>> >> >> >> > On Mon, Jun 22, 2026 at 06:22:45PM -0700, Sean Christopherson
>> >> >> >> > wrote:
>> >> >> >> > > On Mon, Jun 22, 2026, Yan Zhao wrote:
>> >> >> >> > > > On Thu, Jun 18, 2026 at 05:32:00PM -0700, Ackerley Tng via B4
>> >> >> >> > > > Relay wrote:
>> >> >> >> > > > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> >> >> >> > > > > index ffe9d0db58c59..56d10333c61a7 100644
>> >> >> >> > > > > --- a/arch/x86/kvm/vmx/tdx.c
>> >> >> >> > > > > +++ b/arch/x86/kvm/vmx/tdx.c
>> >> >> >> > > > > @@ -3198,8 +3198,12 @@ static int
>> >> >> >> > > > > tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn,
>> >> >> >> > > > > kvm_pfn_t pfn,
>> >> >> >> > > > > if (KVM_BUG_ON(kvm_tdx->page_add_src, kvm))
>> >> >> >> > > > > return -EIO;
>> >> >> >> > > > >
>> >> >> >> > > > > - if (!src_page)
>> >> >> >> > > > > - return -EOPNOTSUPP;
>> >> >> >> > > > > + if (!src_page) {
>> >> >> >> > > > > + if (!gmem_in_place_conversion)
>> >> >> >> > > > When userspace turns on gmem_in_place_conversion while
>> >> >> >> > > > creating guest_memfd
>> >> >> >> > > > without the MMAP flag, the absence of src_page should still
>> >> >> >> > > > be treated as an
>> >> >> >> > > > error.
>> >> >> >> > >
>> >> >> >> > > Why MMAP?
>> >> >> >> > Hmm, I was showing a scenario that in-place conversion couldn't
>> >> >> >> > occur.
>> >> >> >> > I didn't mean that with the MMAP flag, mmap() and user write must
>> >> >> >> > occur.
>> >> >> >> >
>> >> >> >> > > Shouldn't this be a general "if (!src_page && !up-to-date)"?
>> >> >> >> > > Just
>> >> >> >> > > because userspace _can_ mmap() the memory doesn't mean
>> >> >> >> > > userspace _has_ mmap()'d
>> >> >> >> > > and written memory. And when write() lands, MMAP wouldn't be
>> >> >> >> > > necessary to
>> >> >> >> > > initialize the memory.
>> >> >> >> > Do you mean using up-to-date flag as below?
>> >> >> >
>> >> >> > Yes? I didn't actually look at the implementation details.
>> >> >> >
>> >> >> >> > if (!src_page) {
>> >> >> >> > src_page = pfn_to_page(pfn);
>> >> >> >> > if (!folio_test_uptodate(page_folio(src_page)))
>> >> >> >> > return -EOPNOTSUPP;
>> >> >> >> > }
>> >> >>
>> >> >> Yan is right that with the earlier patch "Zero page while getting pfn",
>> >> >> folio_test_uptodate() here will always return true.
>> >> >>
>> >> >> Actually, this is an alternative fix for the issue Sashiko pointed out
>> >> >> on v7 where userspace can do a populate() (either TDX or SNP) without
>> >> >> first allocating the page, with src_address == NULL, and leak
>> >> >> uninitialized memory into the guest.
>> >> >>
>> >> >> Advantage of using the uptodate check in populate: if the host never
>> >> >> allocates the page, populate doesn't incur zeroing before writing the
>> >> >> page anyway in populate().
>> >> >>
>> >> >> Disadvantage: Both TDX and SNP will have to implement this uptodate
>> >> >> check. guest_memfd can't check centrally because for SNP, for a
>> >> >> PAGE_TYPE_ZERO, !src_page should be allowed with a !uptodate page since
>> >> >> firmware will zero and there's no leakage of uninitialized host memory?
>> >> > Another disadvantage: the uptodate flag is per-folio. What if the folio
>> >> > is only partially initialized by the userspace especially after huge
>> >> > page is
>> >> > supported?
>> >> >
>> >>
>> >> Good point on huge pages!
>> >>
>> >> The uptodate flag on the folio in guest_memfd means "this folio has been
>> >> written to". As of now (before patch at [1]), this happens when
>> >>
>> >> + folio is zeroed on first use by userspace
>> >> + folio is zeroed on first use of the guest
>> >> + folio is populated
>> >>
>> >> When huge pages are supported, the folio can't partially be initialized?
>> >>
>> >> On allocation, if any part is shared, we split the page. The parts are
>> >> separate folios that have their own uptodate flags.
>> >>
>> >> On splitting, if the huge page is uptodate, the split pages will also be
>> >> uptodate. If the huge page is not uptodate, the split pages won't be
>> >> uptodate, but that's ok since they will be marked uptodate on first use.
>> >>
>> >> On merging, the non-uptodate parts have to be zeroed and then marked
>> > If that's true, it would be good.
>> >
>> >> uptodate. Any parts that are in use would have been marked uptodate
>> >> already, so there's no overwriting data that is in use. I'll need to
>> >> think more about when it's safe to zero.
>> >>
>> >> I'm still on the fence between the two options
>> >>
>> >> 1. Using uptodate check in populate to reject src_pages that have never
>> >> been written to or
>> >> 2. Always zero before populate
>> > 2 does not work?
>> > The flow is
>> > 1. mmap gmem_fd, make GFN shared, and write initial content.
>> > 2. convert GFN to private
>> > 3. invoke ioctl to trigger populate.
>> >
>>
>> This flow is correct, is what users of in-place conversion should do.
>>
>> "Always" is the wrong word, I should have said "zero if not uptodate
>> before populate", as in, with patch at [1].
>>
>> By doing the zeroing in __kvm_gmem_get_pfn instead, by the time populate
>> gets the pfn, the page would be zeroed, either because userspace faulted
>> it in, and the zeroing happened in kvm_gmem_fault_user_mapping(), or if
>> userspace never faulted it in, the zeroing would happen because
>> populate() allocated the page.
>
> I see.
>
>> >> but whether the uptodate flag is per-folio or not doesn't affect these
>> >> two options in terms of fixing the leak of uninitialized host memory,
>> >> right?
>> > yes, provided "On merging, the non-uptodate parts have to be zeroed and
>> > then
>> > marked uptodate".
>> >
>>
>> Thank you so much for bringing this up, I hadn't considered this
>> before. I'll do that when I get to guest_memfd hugepage restructuring.
>>
>> >> >
>> >> >> >> Another concern with this fix is that:
>> >> >> >> commit "KVM: guest_memfd: Zero page while getting pfn" [1] always
>> >> >> >> marks the
>> >> >> >> folio uptodate before reaching post_populate().
>> >> >> >>
>> >> >> >> [1]
>> >> >> >> https://lore.kernel.org/all/[email protected]/
>> >> >> >>
>> >> >> >> > One concern is that TDX now does not much care about the
>> >> >> >> > up-to-date flag since
>> >> >> >> > TDX doesn't rely on the flag to clear pages on conversions.
>> >> >> >> > I'm not sure if the flag can be reliably checked in this case.
>> >> >> >> > e.g.,
>> >> >> >> > now the whole folio is marked up-to-date even if only part of it
>> >> >> >> > is faulted by
>> >> >> >> > user access.
>> >> >> >> > Ensuring that the up-to-date flag works correctly with huge page
>> >> >> >> > support seems
>> >> >> >> > to have more effort than introducing a dedicated flag for TDX.
>> >> >> >> >
>> >> >> >> > > > Additionally, to properly enable in-place copying for the TDX
>> >> >> >> > > > initial memory
>> >> >> >> > > > region, userspace must not only specify source_addr to NULL,
>> >> >> >> > > > but also follow
>> >> >> >> > > > a specific sequence (where steps 1/2/3/7 are required only
>> >> >> >> > > > for in-place copy):
>> >> >> >> > > > 1. create guest_memfd with MMAP flag
>> >> >> >> > > > 2. mmap the guest_memfd.
>> >> >> >> > > > 3. convert the initial memory range to shared.
>> >> >> >> > > > 4. copy initial content to the source page.
>> >> >> >> > > > 5. convert the initial memory range to private
>> >> >> >> > > > 6. invoke ioctl KVM_TDX_INIT_MEM_REGION.
>> >> >> >> > > > 7. do not unmap the source backend.
>> >> >> >> > > >
>> >> >> >> > > > So, would it be reasonable to introduce a dedicated flag that
>> >> >> >> > > > allows userspace
>> >> >> >> > > > to explicitly opt into the in-place copy functionality? e.g.,
>> >> >> >> > >
>> >> >> >> > > Why? It's userspace's responsibility to get the above right.
>> >> >> >> > > If userspace fails
>> >> >> >> > > to provide a src_page when it doesn't want in-place copy,
>> >> >> >> > > that's a userspace bug.
>> >> >>
>> >> >> Yan, is your concern that userspace forgot to update the code and
>> >> >> forgets to provide a src_page, and if we keep the "Zero page while
>> >> > Yes. Previously, it would be rejected after GUP fails.
>> >> >
>> >>
>> >> I see, didn't realize previously it would be rejected because GUP
>> >> fails. GUP failed because it wasn't faulted into the host?
>> > GUP fails if 0 is not a valid user address.
>> > But GUP would not fail if 0 is a valid address. e.g., in below scenario:
>> >
>> > #include <sys/mman.h>
>> > #include <stdio.h>
>> > int main(void)
>> > {
>> > void *p=mmap((void*)0,4096,PROT_READ|PROT_WRITE,
>> > MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS,-1,0);
>> > if (p==MAP_FAILED) {
>> > perror("mmap");
>> > return 1;
>> > }
>> > *(char*)0='Y';
>> > printf("addr0=%p val=%c\n",p,*(char*)0);
>> > return 0;
>> > }
>> >
>> >
>> >> That's kind of orthogonal, I don't think GUP fail leading to rejecting
>> >> populate was meant to help userspace catch these issues. GUP would also
>> >> fail if the user did mmap(), write to it, unmap using
>> >> madvise(MADV_DONTNEED), then forget and pass 0 as src_address.
>> > The original uAPI did not explicitly define 0 as an invalid uaddr. Whether
>> > 0 was
>> > rejected depended on whether the user mmap()'d address 0. If 0 was a valid
>> > mapping, populate() could proceed.
>> >
>> > commit 2a62345b3052 ("KVM: guest_memfd: GUP source pages prior to
>> > populating
>> > guest memory") changed the behavior though. It would return -EOPNOTSUPP
>> > for a 0
>> > uaddr.
>> >
>>
>> I see, I only looked at this after commit 2a62345b3052.
>>
>> > But if a user configures 0 uaddr as valid, writes to it, and then passes 0
>> > as
>> > source_addr(not from gmem), I'm not sure if it's good for the kernel to
>> > silently
>> > treat 0 uaddr as an identifier for in-place copy from the private PFN in
>> > gmem.
>> >
>>
>> I'd say the original uAPI perhaps just didn't document 0 as an
>> unsupported uaddr. Given that commit 2a62345b3052 already merged, uAPI
>> was perhaps accidentally changed and no customer complained, I think we
>> can move forward with 0 as an invalid src_address? I wouldn't think
>> anyone relies on 0 intentionally being a valid address.
>>
>> I could document that, if it helps?
> What about just documenting that 0 is an unsupported uaddr which will be
> re-purposed as an indicator to use the target pfn as the source, regardless of
> whether gmem_in_place_conversion is true? i.e.,
>
> if (!src_page)
> src_page = pfn_to_page(pfn);
>
> I don't get why the two scenarios should be treated differently:
> 1. gmem_in_place_conversion==true, shared memory is not from gmem
> 2. gmem_in_place_conversion==false, shared memory is not from gmem
>
> In both case, a 0 uaddr could be mapped to a valid page not from gmem.
This is true, but this check isn't about whether the page is from gmem.
> So why not update the uAPI to handle both cases consistently? :)
>
Wait, but before this series, if region.src_address = 0, src_page = NULL
and that's not supported so it returns -EOPNOTSUPP.
If that's dropped, then suddenly if region.src_address = 0 and
!gmem_in_place_conversion, tdx_gmem_post_populate() will now load the
memory (zeroed) after [1] into the guest? I don't think we want to
change that behavior.
I could document that 0 is an unsupported uaddr only for TDX, and only
when gmem_in_place_conversion = false.
Since it is unsupported only when gmem_in_place_conversion = false, the
check two lines marked with <<==== can't go away?
if (!src_page) {
if (!gmem_in_place_conversion) <<====
return -EOPNOTSUPP; <<====
src_page = pfn_to_page(pfn);
}
Also, for SNP, src_address == 0 is permitted (and desired, I believe, to
avoid a pointless kernel memcpy) if the type of population is
KVM_SEV_SNP_PAGE_TYPE_ZERO.
>> >> >> getting pfn" patch, ends up with the guest silently having a zero page?
>> >> >> I think that would be found quite early in userspace VMM testing...
>>
>> [...snip...]
>>