On Thursday 28 Apr 2022 at 20:29:52 (+0800), Chao Peng wrote:
> + Michael in case he has comment from SEV side.
>
> On Mon, Apr 25, 2022 at 07:52:38AM -0700, Andy Lutomirski wrote:
> >
> > On Mon, Apr 25, 2022, at 6:40 AM, Chao Peng wrote:
> > > On Sun, Apr 24, 2022 at 09:59:37AM -0700, Andy Lutomirski wrote:
> > >>
> > >> 2. Bind the memfile to a VM (or at least to a VM technology). Now it's
> > >> in the initial state appropriate for that VM.
> > >>
> > >> For TDX, this completely bypasses the cases where the data is
> > >> prepopulated and TDX can't handle it cleanly. For SEV, it bypasses a
> > >> situation in which data might be written to the memory before we find
> > >> out whether that data will be unreclaimable or unmovable.
> > >
> > > This sounds like a stricter rule that avoids unclear semantics.
> > >
> > > So userspace needs to know what exactly happens for a 'bind' operation.
> > > This differs between technologies. E.g. for SEV, it may imply that
> > > after this call the memfile can be accessed (through mmap or whatever)
> > > from userspace, while for current TDX this should not be allowed.
> >
> > I think this is actually a good thing. While SEV, TDX, pKVM, etc achieve
> > similar goals and have broadly similar ways of achieving them, they really
> > are different, and having userspace be aware of the differences seems okay
> > to me.
> >
> > (Although I don't think that allowing userspace to mmap SEV shared pages is
> > particularly wise -- it will result in faults or cache incoherence
> > depending on the variant of SEV in use.)
> >
> > > And I feel we still need a third flow/operation to indicate the
> > > completion of the initialization on the memfile before the guest's
> > > first-time launch. SEV needs to check that previously mmap-ed areas are
> > > munmap-ed and prevent future userspace access. After this point, the
> > > memfile becomes a truly private fd.
> >
> > Even that is technology-dependent.
> > For TDX, this operation doesn't really exist. For SEV, I'm not sure
> > (I haven't read the specs in nearly enough detail). For pKVM, I guess
> > it does exist and isn't quite the same as a shared->private conversion.
> >
> > Maybe this could be generalized a bit as an operation "measure and make
> > private" that would be supported by the technologies for which it's useful.
>
> Then I think we need a callback instead of a static flag field. The
> backing store implements this callback and consumers change the flags
> dynamically through it. This implements a kind of state-machine flow.
>
> > >>
> > >> ----------------------------------------------
> > >>
> > >> Now I have a question, since I don't think anyone has really answered
> > >> it: how does this all work with SEV- or pKVM-like technologies in which
> > >> private and shared pages share the same address space? It sounds like
> > >> you're proposing to have a big memfile that contains private and shared
> > >> pages and to use that same memfile as pages are converted back and
> > >> forth. IO and even real physical DMA could be done on that memfile. Am
> > >> I understanding correctly?
> > >
> > > For the TDX case, and probably SEV as well, this memfile contains
> > > private memory only. But this design at least makes it possible for use
> > > cases like pKVM which want both private/shared memory in the same
> > > memfile and rely on other ways like mmap/munmap or mprotect to toggle
> > > private/shared instead of fallocate/hole punching.
> >
> > Hmm. Then we still need some way to get KVM to generate the correct SEV
> > pagetables. For TDX, there are private memslots and shared memslots, and
> > they can overlap. If they overlap and both contain valid pages at the same
> > address, then the results may not be what the guest-side ABI expects, but
> > everything will work.
> > So, when a single logical guest page transitions between shared and
> > private, no change to the memslots is needed. For SEV, this is not the
> > case: everything is in one set of pagetables, and there isn't a natural
> > way to resolve overlaps.
>
> I don't see that SEV has a problem. Note that in all the cases, both
> private and shared memory are in the same memslot. For a given GPA, if
> there is no private page, then the shared page will be used to establish
> the KVM pagetables, so this guarantees there are no overlaps.
>
> > If the memslot code becomes efficient enough, then the memslots could be
> > fragmented. Or the memfile could support private and shared data in the
> > same memslot. And if pKVM does this, I don't see why SEV couldn't also do
> > it and hopefully reuse the same code.
>
> For pKVM, that might be the case. For SEV, I don't think we require
> private/shared data in the same memfile. The same model that works for
> TDX should also work for SEV. Or maybe I misunderstood something here?
>
> > >>
> > >> If so, I think this makes sense, but I'm wondering if the actual memslot
> > >> setup should be different. For TDX, private memory lives in a logically
> > >> separate memslot space. For SEV and pKVM, it doesn't. I assume the API
> > >> can reflect this straightforwardly.
> > >
> > > I believe so. The flow should be similar, but we do need to pass
> > > different flags during the 'bind' to the backing store for different
> > > usages. That should be some new flags for pKVM, but the callbacks (the
> > > API here) between memfile_notifier and its consumers can be reused.
> >
> > And also some different flag in the operation that installs the fd as a
> > memslot?
> >
> > >>
> > >> And the corresponding TDX question: is the intent still that shared
> > >> pages aren't allowed at all in a TDX memfile? If so, that would be the
> > >> most direct mapping to what the hardware actually does.
> > >
> > > Exactly.
> > > TDX will still use fallocate/hole punching to turn the private page
> > > on/off. Once off, the traditional shared page becomes effective in KVM.
> >
> > Works for me.
> >
> > For what it's worth, I still think it should be fine to land all the TDX
> > memfile bits upstream as long as we're confident that SEV, pKVM, etc can be
> > added on without issues.
> >
> > I think we can increase confidence in this by either getting one other
> > technology's maintainers to get far enough along in the design to be
> > confident
>
> AFAICS, SEV shouldn't have any problem, but I would like to see the AMD
> people comment. pKVM definitely needs more work, but it isn't totally
> undoable. It would also be good if the pKVM people could comment.
Merging things incrementally sounds good to me if we can indeed get some
time to make sure it'll be a workable solution for other technologies.
I'm happy to prototype a pKVM extension to the proposed series to see if
there are any major blockers.

Thanks,
Quentin