Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
Hi Sean,

Thanks for taking the time to review the series. I took note of your comments across the series, and will incorporate them into the LPC discussion.

On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Sean Christopherson wrote:
> > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > > This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback.
> >
> > Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way.
> >
> > The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc.
> >
> > Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary.
>
> After speed reading the series.. An overview of the design, why you made certain choices, and the tradeoffs between various options is definitely needed.
>
> A few questions off the top of my head:
>
> - What is the split between userspace and KVM? How did you arrive at that split?

Our original design, which we discussed at KVM Forum 2023 [1] and is public [2], implemented most of VSM in-kernel. Notably, we introduced VTL awareness in struct kvm_vcpu. This turned out to be way too complex: vCPUs now have multiple CPU architectural states, events, APICs, MMUs, etc. First of all, the code is very intrusive; for example, most APIC APIs had to be reworked one way or another to accommodate the fact that there are multiple APICs available. Also, we were forced to introduce VSM-specific semantics into the x86 emulation code. But more importantly, the biggest pain has been dealing with state changes: they may be triggered remotely through requests, some are already fairly delicate as-is, and they involve a multitude of corner cases that almost never apply for a VTL-aware kvm_vcpu, especially once you factor in features like live migration. It's been a continuous source of regressions.

Memory protections were implemented through memory slot modifications. We introduced a downstream API that allows updating memory slots concurrently with vCPUs running; I think there was a similar proposal upstream from Red Hat some time ago. The result is complex, hard to generalize, and slow.

So we decided to move all this complexity outside of struct kvm_vcpu and, as much as possible, out of the kernel. We figured out the basic kernel building blocks that are absolutely necessary, and let user-space deal with the rest.

> - How much *needs* to be in KVM? I.e. how much can be pushed to userspace while maintaining good performance?

As I said above, the aim of the current design is to keep it as light as possible. The biggest move we made was moving VTL switching into user-space. We don't see any indication that performance is affected in a major way.
But we will know for sure once we finish the implementation and test it under real use-cases.

> - Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.

I do agree that we need to introduce some level of VTL awareness into memslots. There's the hugepages issue you pointed out, but it'll also be necessary once we look at how to implement overlay pages that are per-VTL. (A topic I didn't mention in the series as I thought I had managed to solve memory protections while avoiding the need for multiple slots.) What I have in mind is introducing a memory slot address space per-VTL, similar to how we do things with SMM.

It's important to note that the requirements for overlay pages and memory protections are very different. Overlay pages are scarce, and are set up once and never change (AFAICT), so we think stopping all vCPUs, updating slots, and resuming execution will provide good enough performance. Memory protections happen very frequently, generally with page granularity, and may be short-lived.

> - Is "struct kvm_vcpu" the best representation of an execution context (if I'm getting the terminology right)?

Le
Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
On Wed Nov 8, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS, this capability segments the VM's APIC ids into two. The lower bits, the physical APIC id, represent the part that's exposed to the guest. The higher bits, which are private to KVM, groups APICs together. APICs in different groups are isolated from each other, and IPIs can only be directed at APICs that share the same group as its source. Furthermore, groups are only relevant to IPIs, anything incoming from outside the local APIC complex: from the IOAPIC, MSIs, or PV-IPIs is targeted at the default APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source's vCPU APIC group with the ICR's destination ID and use that to resolve the target lAPIC.
>
> Is all of the above arbitrary KVM behavior or defined by the TLFS?

All this matches VSM's expectations of how interrupts are to be handled. But APIC groups is a concept we created with the aim of generalizing the behaviour as much as possible.

> > The APIC physical map is also made group aware in order to speed up this process. For the sake of simplicity, the logical map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI routing to the slower per-vCPU scan method.
>
> Why? I mean, I kinda sorta understand what it does for VSM, but it's not at all obvious why this information needs to be shoved into the APIC IDs. E.g. why not have an explicit group_id and then maintain separate optimization maps for each?

There are three tricks to APIC groups. One is IPI routing: for example, the ICR's physical destination is mixed with the source vCPU's APIC group to find the destination vCPU (see the sketch further down). Another is presenting a coherent APIC ID across VTLs; switching VTLs shouldn't change the guest's view of the APIC ID. And ultimately they keep track of the vCPU's VTL. I don't see why we couldn't decouple them, as long as we keep filtering the APIC ID before it reaches the guest.

> > This capability serves as a building block to implement virtualisation based security features like Hyper-V's Virtual Secure Mode (VSM). VSM introduces a para-virtualised switch that allows for guest CPUs to jump into a different execution context, this switches into a different CPU state, lAPIC state, and memory protections. We model this in KVM by
>
> Who is "we"? As a general rule, avoid pronouns. "we" and "us" in particular should never show up in a changelog. I genuinely don't know if "we" means userspace or KVM, and the distinction matters because it clarifies whether or not KVM is actively involved in the modeling versus KVM being little more than a dumb pipe to provide the plumbing.

Sorry, I've been actively trying to avoid pronouns since you already mentioned this in a previous review. This one slipped through the cracks.

> > using distinct kvm_vcpus for each context.
> >
> > Moreover, execution contexts are hierarchical and its APICs are meant to remain functional even when the context isn't 'scheduled in'.
>
> Please explain the relationship and rules of execution contexts. E.g. are execution contexts the same thing as VTLs? Do all "real" vCPUs belong to every execution context? If so, is that a requirement?

I left a note about this in my reply to your questions in the cover letter.
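Going back to the IPI-routing trick above, here's a stand-alone sketch of the idea. It's illustrative only: the group shift, the helper names, and the fact that it's plain userspace C are all made up for the example, not taken from the series.

  #include <stdint.h>
  #include <stdio.h>

  /*
   * Illustrative layout: the low bits are the physical APIC ID the guest
   * sees, everything at or above APIC_GROUP_SHIFT is a KVM-private group
   * (one group per VTL in the VSM use case).
   */
  #define APIC_GROUP_SHIFT 24
  #define APIC_GROUP_MASK  (~((1u << APIC_GROUP_SHIFT) - 1))

  /* What the guest reads from the APIC ID register: group bits filtered out. */
  static uint32_t guest_visible_apic_id(uint32_t vcpu_id)
  {
          return vcpu_id & ~APIC_GROUP_MASK;
  }

  /*
   * Physical-destination IPI routing: OR the sender's group into the ICR
   * destination ID, so an IPI can only land on an APIC in the same group.
   */
  static uint32_t resolve_ipi_dest(uint32_t src_vcpu_id, uint32_t icr_dest_id)
  {
          return icr_dest_id | (src_vcpu_id & APIC_GROUP_MASK);
  }

  int main(void)
  {
          uint32_t vtl1_cpu2 = (1u << APIC_GROUP_SHIFT) | 2;  /* group 1, APIC ID 2 */

          printf("guest-visible APIC ID: %u\n", guest_visible_apic_id(vtl1_cpu2));
          printf("IPI to dest 3 resolves to key 0x%x\n", resolve_ipi_dest(vtl1_cpu2, 3));
          return 0;
  }

In the series itself the mask lives in struct kvm and is configured through KVM_CAP_APIC_ID_GROUPS; the sketch only shows the OR-based resolution and the filtering of the group bits before the ID reaches the guest.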
> > For example, we have to keep track of timers' expirations, and interrupt execution of lesser priority contexts when relevant. Hence the need to alias physical APIC ids, while keeping the ability to target specific execution contexts.
> >
> > Signed-off-by: Anel Orazgaliyeva
> > Co-developed-by: Nicolas Saenz Julienne
> > Signed-off-by: Nicolas Saenz Julienne
> > ---
> >
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index e1021517cf04..542bd208e52b 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> >  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
> >  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
> >  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> > +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> > +                                    struct kvm_apic_id_groups *groups);
> >  void kvm_recalculate_apic_map(struct kvm *kvm);
> >  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
> >  void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> > @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
> >  	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
> >  }
> >
> > +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> > +{
> > +	return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;
>
> This is *extremely* misleading. KVM forces the x2APIC ID to match vcpu_id, but in xAPIC mo
Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role
On Wed Nov 8, 2023 at 5:26 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > With the upcoming introduction of per-VTL memory protections, make MMU roles VTL aware. This will avoid sharing PTEs between vCPUs that belong to different VTLs, and that have distinct memory access restrictions.
> >
> > Four bits are allocated to store the VTL number in the MMU role, since the TLFS states there is a maximum of 16 levels.
>
> How many does KVM actually allow/support? Multiplying the number of possible roots by 16x is a *major* change.

AFAIK in practice only VTL0/1 are used. I don't know if Microsoft will come up with more in the future. We could introduce a CAP that exposes the number of supported VTLs to user-space, and leave it as a compile option.
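For illustration, the gist of the patch boils down to something like the sketch below. This is not the actual kvm_mmu_page_role layout (the existing bits are elided and the names are made up); it only shows why a 4-bit VTL field can multiply the number of distinct roles, and why bounding it via a CAP or compile-time option would help.

  #include <linux/types.h>

  /*
   * Illustrative only -- not the real kvm_mmu_page_role. Making the VTL
   * part of the role keys roots/shadow pages per VTL, at the cost of up
   * to 16x more distinct roles if the full 4 bits are ever used.
   */
  union mmu_role_sketch {
          u32 word;
          struct {
                  u32 base_role_bits:28;  /* stand-in for the existing role fields */
                  u32 vtl:4;              /* TLFS allows up to 16 VTLs */
          };
  };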
Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
On Wed Nov 8, 2023 at 4:55 PM UTC, Sean Christopherson wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback.
>
> Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way.
>
> The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc.

Noted, I'll provide a design overview in the next submission.

> Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary.

It's clear to me how the current implementation of VTL devices is broken. But unnecessary? That made me think: we could inject the VTL into the memory attribute key, for example 'gfn | vtl << 58', and then use the generic API and a single xarray (rough sketch below).

Nicolas
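P.S. a rough sketch of the single-xarray encoding idea above, in case it makes the discussion easier. The helper names and the exact shift are made up for illustration; they are not from the series or from the existing memory-attributes code.

  #include <linux/types.h>

  /* gfns comfortably fit below bit 58, so the top bits can carry the VTL. */
  #define KVM_VSM_ATTR_VTL_SHIFT  58

  static inline u64 kvm_vsm_attr_key(u64 gfn, u8 vtl)
  {
          return gfn | ((u64)vtl << KVM_VSM_ATTR_VTL_SHIFT);
  }

  static inline u64 kvm_vsm_attr_gfn(u64 key)
  {
          return key & ((1ULL << KVM_VSM_ATTR_VTL_SHIFT) - 1);
  }

  static inline u8 kvm_vsm_attr_vtl(u64 key)
  {
          return key >> KVM_VSM_ATTR_VTL_SHIFT;
  }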
Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> > - What is the split between userspace and KVM? How did you arrive at that split?
>
> Our original design, which we discussed at KVM Forum 2023 [1] and is public [2], implemented most of VSM in-kernel. Notably, we introduced VTL awareness in struct kvm_vcpu.

...

> So we decided to move all this complexity outside of struct kvm_vcpu and, as much as possible, out of the kernel. We figured out the basic kernel building blocks that are absolutely necessary, and let user-space deal with the rest.

Sorry, I should have been more clear. What's the split in terms of responsibilities, i.e. what will KVM's ABI look like? E.g. if the vCPU=>VTLs setup is nonsensical, does KVM care?

My general preference is for KVM to be as permissive as possible, i.e. let userspace do whatever it wants so long as it doesn't place undue burden on KVM. But at the same time I don't want to end up in a similar boat as many of the paravirt features, where things just stop working if userspace or the guest makes a goof.

> > - Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.
>
> I do agree that we need to introduce some level of VTL awareness into memslots. There's the hugepages issue you pointed out, but it'll also be necessary once we look at how to implement overlay pages that are per-VTL. (A topic I didn't mention in the series as I thought I had managed to solve memory protections while avoiding the need for multiple slots.) What I have in mind is introducing a memory slot address space per-VTL, similar to how we do things with SMM.

(I hate memslot address spaces :-) )

Why not represent each VTL with a separate "struct kvm" instance? That would naturally provide per-VTL behavior for:

  - APIC groups
  - memslot overlays
  - memory attributes (and their impact on hugepages)
  - MMU pages

The only (obvious) issue with that approach would be cross-VTL operations. IIUC, sending IPIs across VTLs isn't allowed, but even if it were, that should be easy enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

GVA=>GPA translation would be trickier, but that patch suggests you want to handle that in userspace anyways. And if translation is a rare/slow path, maybe it could simply be punted to userspace?

  NOTE: The hypercall implementation is incomplete and only shared for completion. Additionally we'd like to move the VTL aware parts to user-space.

Ewww, and looking at what it would take to support cross-VM translations shows another problem with using vCPUs to model VTLs. Nothing prevents userspace from running a virtual CPU at multiple VTLs concurrently, which means that anything that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be modified while kvm_hv_xlate_va_walk() is running.

I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something holds the other kvm_vcpu/VTL's mutex. Though ideally, KVM would punt all cross-VTL operations to userspace. :-)
If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't make the locking and concurrency problems meaningfully easier or harder to solve. E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single virtual machine, to belong to the same process. That'd avoid headaches with mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a different kvm would need special handling?

Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken: the in_vtl field is consumed before send_ipi is read from userspace.

	union hv_input_vtl *in_vtl;
	u64 valid_bank_mask;
	u32 vector;
	bool all_cpus;
	u8 vtl;

	/* VTL is at the same offset on both IPI types */
	in_vtl = &send_ipi.in_vtl;
	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl :
	      kvm_hv_get_active_vtl(vcpu);

> >   E.g. if 90% of the state is guaranteed to be identical for a given vCPU across execution contexts, then modeling that with separate kvm_vcpu structures is very inefficient. I highly doubt it's 90%, but it might be quite high depending on how much the TFLS restricts the state of the vCPU, e.g. if it's 64-bit only.
>
> For the record here's the private VTL state (TLFS 15.11.1):
>
> "In general, each VTL has its own control registers, RIP register, RSP register, and MSRs:
>
> SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK, EFER, PAT, KERNEL_GSBASE