On Fri, Mar 04, 2022 at 11:24:30AM -0800, Andy Lutomirski wrote:
> On 2/23/22 04:05, Steven Price wrote:
> > On 23/02/2022 11:49, Chao Peng wrote:
> > > On Thu, Feb 17, 2022 at 11:09:35AM -0800, Andy Lutomirski wrote:
> > > > On Thu, Feb 17, 2022, at 5:06 AM, Chao Peng wrote:
> > > > > On Fri, Feb 11, 2022 at 03:33:35PM -0800, Andy Lutomirski wrote:
> > > > > > On 1/18/22 05:21, Chao Peng wrote:
> > > > > > > From: "Kirill A. Shutemov" <kirill.shute...@linux.intel.com>
> > > > > > >
> > > > > > > Introduce a new seal F_SEAL_INACCESSIBLE indicating the content
> > > > > > > of the file is inaccessible from userspace through ordinary MMU
> > > > > > > access (e.g., read/write/mmap). However, the file content can be
> > > > > > > accessed via a different mechanism (e.g. KVM MMU) indirectly.
> > > > > > >
> > > > > > > It provides semantics required for KVM guest private memory
> > > > > > > support, in which a file descriptor with this seal set is going
> > > > > > > to be used as the source of guest memory in confidential
> > > > > > > computing environments such as Intel TDX/AMD SEV but may not be
> > > > > > > accessible from host userspace.
> > > > > > >
> > > > > > > At this time only shmem implements this seal.
> > > > > > >
> > > > > >
> > > > > > I don't dislike this *that* much, but I do dislike this.
> > > > > > F_SEAL_INACCESSIBLE essentially transmutes a memfd into a
> > > > > > different type of object. While this can apparently be done
> > > > > > successfully and without races (as in this code), it's at least
> > > > > > awkward. I think that either creating a special inaccessible memfd
> > > > > > should be a single operation that creates the correct type of
> > > > > > object, or there should be a clear justification for why it's a
> > > > > > two-step process.
> > > > >
> > > > > Now one justification may be from Steven's comment to patch-00: for
> > > > > ARM usage it can be used by creating a normal memfd, (partially)
> > > > > populating it with initial guest memory content (e.g. firmware), and
> > > > > then applying F_SEAL_INACCESSIBLE just before the first launch of
> > > > > the guest in KVM (definitely the current code needs to be changed to
> > > > > support that).
> > > >
> > > > Except we don't allow F_SEAL_INACCESSIBLE on a non-empty file, right?
> > > > So this won't work.
> > >
> > > Hmm, right, if we set F_SEAL_INACCESSIBLE on a non-empty file, we will
> > > need to make sure access to existing mmap-ed areas is prevented, but
> > > that is hard.
> > >
> > > >
> > > > In any case, the whole confidential VM initialization story is a bit
> > > > muddy. From the earlier emails, it sounds like ARM expects the host to
> > > > fill in guest memory and measure it. From my recollection of Intel's
> > > > scheme (which may well be wrong, and I could easily be confusing it
> > > > with SGX), TDX instead measures what is essentially a transcript of
> > > > the series of operations that initializes the VM. These are
> > > > fundamentally not the same thing even if they accomplish the same end
> > > > goal. For TDX, we unavoidably need an operation (ioctl or similar)
> > > > that initializes things according to the VM's instructions, and ARM
> > > > ought to be able to use roughly the same mechanism.
> > >
> > > Yes, TDX requires an ioctl. Steven may comment on the ARM part.
> >
> > The Arm story is evolving so I can't give a definite answer yet.
> > Our current prototyping works by creating the initial VM content in a
> > memslot as with a normal VM and then calling an ioctl which throws the
> > big switch and converts all the (populated) pages to be protected. At
> > this point the RMM performs a measurement of the data that the VM is
> > being populated with.
> >
> > The above (in our prototype) suffers from all the expected problems of
> > a malicious VMM being able to trick the host kernel into accessing
> > those pages after they have been protected (causing a fault detected
> > by the hardware).
> >
> > The ideal (from our perspective) approach would be to follow the same
> > flow but where the VMM populates a memfd rather than normal anonymous
> > pages. The memfd could then be sealed and the pages converted to
> > protected ones (with the RMM measuring them in the process).
> >
> > The question becomes: how is that memfd populated? It would be nice if
> > that could be done using normal operations on a memfd (i.e. using
> > mmap()) and therefore this code could be (relatively) portable. This
> > would mean that any pages mapped from the memfd would either need to
> > block the sealing or be revoked at the time of sealing.
> >
> > The other approach is that we could of course implement a special
> > ioctl which effectively does a memcpy into the (created empty and
> > sealed) memfd and does the necessary dance with the RMM to measure the
> > contents. This would match the "transcript of the series of
> > operations" described above - but seems much less ideal from the
> > viewpoint of the VMM.
> 
> A VMM that supports Other Vendors will need to understand this sort of
> model regardless.
> 
> I don't particularly mind the idea of having the kernel consume a normal
> memfd and spit out a new object, but I find the concept of changing the
> type of the object in place, even if it has other references, and trying
> to control all the resulting races, to be somewhat alarming.
> 
> In pseudo-Rust, this is the difference between:
> 
> fn convert_to_private(in: &mut Memfd)
> 
> and
> 
> fn convert_to_private(in: Memfd) -> PrivateMemoryFd
> 
> This doesn't map particularly nicely to the kernel, though.
I understand the Rust semantics and the difficulty of handling the races.
Probably we should not expose F_SEAL_INACCESSIBLE to userspace; instead we
can use a new in-kernel flag to indicate the same thing. That flag should
be set only when the memfd is created with MFD_INACCESSIBLE.

Chao
> 
> --Andy
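
For reference, below is a rough userspace sketch contrasting the two flows
discussed in this thread: the two-step "populate, then seal" flow and the
single-step MFD_INACCESSIBLE creation proposed above. The helper names are
made up for illustration, and F_SEAL_INACCESSIBLE / MFD_INACCESSIBLE come
from this series rather than mainline headers, so the values defined here
are placeholders only and the real ABI may differ.

/*
 * Illustrative sketch only; constants below are placeholders, not ABI.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef F_SEAL_INACCESSIBLE
#define F_SEAL_INACCESSIBLE	0x0020	/* placeholder value */
#endif
#ifndef MFD_INACCESSIBLE
#define MFD_INACCESSIBLE	0x0008U	/* placeholder value */
#endif

/* Two-step flow: create a normal memfd, populate it, then seal it. */
static int create_sealed_guest_memfd(const void *image, size_t len)
{
	void *p;
	int fd;

	fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) < 0)
		goto err;

	/* Populate via a normal mapping, i.e. the mmap()-based flow
	 * described for Arm above. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto err;
	memcpy(p, image, len);
	munmap(p, len);

	/*
	 * "Throw the big switch": after this, ordinary read/write/mmap of
	 * the fd should fail. As discussed above, this step is the problem:
	 * the current patch does not allow the seal on a non-empty file,
	 * and any still-live mappings would have to be blocked or revoked.
	 */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_INACCESSIBLE) < 0)
		goto err;

	return fd;
err:
	close(fd);
	return -1;
}

/* Single-step flow (as proposed above): the fd is inaccessible from
 * birth, so there are never any host-userspace mappings to revoke. */
static int create_inaccessible_memfd(size_t len)
{
	int fd;

	fd = memfd_create("guest-mem", MFD_CLOEXEC | MFD_INACCESSIBLE);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	/* Population and measurement would then go through KVM (e.g. an
	 * ioctl that copies and measures), not through mmap() of this fd. */
	return fd;
}

With the single-step flow the fd has never been mappable, which avoids
having to block or revoke existing mappings at sealing time, but it also
means population has to go through KVM (or a dedicated ioctl) rather than
mmap(), along the lines of the "transcript of the series of operations"
model described above.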