Hi,

<...>

> > Regarding pKVM's use case, with the shim approach I believe this can be
> > done by allowing userspace mmap() the "hidden" memfd, but with a ton of
> > restrictions piled on top.
> >
> > My first thought was to make the uAPI a set of KVM ioctls so that KVM
> > could tightly control usage without taking on too much complexity in the
> > kernel, but working through things, routing the behavior through the shim
> > itself might not be all that horrific.
> >
> > IIRC, we discarded the idea of allowing userspace to map the "private"
> > fd because things got too complex, but with the shim it doesn't seem
> > _that_ bad.
>
> What's the exact use case? Is it just to pre-populate the memory?

Prepopulate memory and access memory that could go back and forth from
being shared to being private.

Cheers,
/fuad

> > E.g. on the memfd side:
> >
> >   1. The entire memfd must be mapped, and at most one mapping is allowed,
> >      i.e. mapping is all or nothing.
> >
> >   2. Acquiring a reference via get_pfn() is disallowed if there's a
> >      mapping for the restricted memfd.
> >
> >   3. Add notifier hooks to allow downstream users to further restrict
> >      things.
> >
> >   4. Disallow splitting VMAs, e.g. to force userspace to munmap()
> >      everything in one shot.
> >
> >   5. Require that there are no outstanding references at munmap(). Or if
> >      this can't be guaranteed by userspace, maybe add some way for
> >      userspace to wait until it's ok to convert to private? E.g. so that
> >      get_pfn() doesn't need to do an expensive check every time.
>
> Hmm. I haven't looked at the code to see if this would really work, but I
> think this could be done more in line with how the rest of the kernel works
> by using the rmap infrastructure. When the pKVM memfd is in not-yet-private
> mode, just let it be mmapped as usual (but don't allow any form of GUP or
> pinning). Then have an ioctl to switch to shared mode that takes locks or
> sets flags so that no new faults can be serviced and does
> unmap_mapping_range.
>
> As long as the shim arranges to have its own vm_ops, I don't immediately see
> any reason this can't work.
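
To make the ordering in the proposal above concrete, here is a rough
userspace model of it. This is only a sketch, not kernel code: every toy_*
name is made up for illustration, the error codes are placeholders, and I am
assuming the mode-switch ioctl moves the memfd out of not-yet-private mode.
Zeroing the PTE count stands in for unmap_mapping_range().

```c
#include <assert.h>
#include <errno.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_mode { TOY_NOT_YET_PRIVATE, TOY_PRIVATE };

struct toy_memfd {
	enum toy_mode mode;
	unsigned long present_ptes;	/* pages currently mapped in userspace */
};

/* Fault via the shim's own vm_ops: serviced only before the switch. */
static int toy_fault(struct toy_memfd *f)
{
	if (f->mode != TOY_NOT_YET_PRIVATE)
		return -EFAULT;	/* placeholder for a failed fault */
	f->present_ptes++;
	return 0;
}

/* GUP/pinning is never allowed on this memfd, in either mode. */
static int toy_pin(const struct toy_memfd *f)
{
	(void)f;
	return -EPERM;
}

/*
 * Mode-switch ioctl: set the flag first so no new faults are serviced,
 * then zap the existing mappings.
 */
static int toy_convert(struct toy_memfd *f)
{
	if (f->mode == TOY_PRIVATE)
		return -EINVAL;
	f->mode = TOY_PRIVATE;
	f->present_ptes = 0;	/* stand-in for unmap_mapping_range() */
	return 0;
}

/* Usage: fault a page in, switch modes, then check faults are refused. */
static int toy_demo(void)
{
	struct toy_memfd f = { .mode = TOY_NOT_YET_PRIVATE, .present_ptes = 0 };

	assert(toy_fault(&f) == 0);
	assert(toy_pin(&f) == -EPERM);
	assert(toy_convert(&f) == 0);
	assert(f.present_ptes == 0);
	assert(toy_fault(&f) == -EFAULT);
	return 0;
}
```

The point of the model is the order inside toy_convert(): if the mappings
were zapped before the flag was set, a racing fault could repopulate a page
after the zap. A real implementation would also need the locking the quoted
text mentions, which this sketch leaves out.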