Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support

2023-11-10 Thread Nicolas Saenz Julienne
Hi Sean,
Thanks for taking the time to review the series. I took note of your
comments across the series, and will incorporate them into the LPC
discussion.

On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Sean Christopherson wrote:
> > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > > This RFC series introduces the necessary infrastructure to emulate VSM
> > > enabled guests. It is a snapshot of the progress we made so far, and its
> > > main goal is to gather design feedback.
> >
> > Heh, then please provide an overview of the design, and ideally context and/or
> > justification for various design decisions.  It doesn't need to be a proper design
> > doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> > but a few paragraphs and/or verbose bullet points would go a long way.
> >
> > The documentation in patch 33 provides an explanation of VSM itself, and a little
> > insight into how userspace can utilize the KVM implementation.  But the documentation
> > provides no explanation of the mechanics that KVM *developers* care about, e.g.
> > the use of memory attributes, how memory attributes are enforced, whether or not
> > an in-kernel local APIC is required, etc.
> >
> > Nor does the documentation explain *why*, e.g. why store a separate set of memory
> > attributes per VTL "device", which by the by is broken and unnecessary.
>
> After speed reading the series..  An overview of the design, why you made certain
> choices, and the tradeoffs between various options is definitely needed.
>
> A few questions off the top of my head:
>
>  - What is the split between userspace and KVM?  How did you arrive at that split?

Our original design, which we discussed at KVM Forum 2023 [1] and is
public [2], implemented most of VSM in-kernel. Notably, we introduced VTL
awareness in struct kvm_vcpu. This turned out to be way too complex: vCPUs
now have multiple CPU architectural states, events, APICs, MMUs, etc.
First of all, the code turned out to be very intrusive; for example,
most APIC APIs had to be reworked one way or another to accommodate
the fact that there are multiple APICs available. We were also forced to
introduce VSM-specific semantics in the x86 emulation code. But more
importantly, the biggest pain has been dealing with state changes: they
may be triggered remotely through requests, and some are already fairly
delicate as-is. They involve a multitude of corner cases that almost
never apply for a VTL-aware kvm_vcpu. Especially if you factor in
features like live migration, it's been a continuous source of
regressions.

Memory protections were implemented by using memory slot modifications.
We introduced a downstream API that allows updating memory slots
concurrently with vCPUs running. I think there was a similar proposal
upstream from Red Hat some time ago. The result is complex, hard to
generalize and slow.

So we decided to move all this complexity outside of struct kvm_vcpu
and, as much as possible, out of the kernel. We figured out the basic
kernel building blocks that are absolutely necessary, and let user-space
deal with the rest.

>  - How much *needs* to be in KVM?  I.e. how much can be pushed to userspace while
>    maintaining good performance?

As I said above, the aim of the current design is to keep it as light as
possible. The biggest move we made was moving VTL switching into
user-space. We don't see any indication that performance is affected in
a major way. But we will know for sure once we finish the implementation
and test it under real use-cases.
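
For illustration, the VMM side of a VTL switch could look roughly like the
sketch below: one vcpu fd per VTL for the same virtual CPU, with the
KVM_EXIT_HYPERV_VTL_CALL/RETURN exit reasons being placeholders for whatever
the final ABI ends up exposing (they are not part of today's UAPI):

#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <err.h>

/* Placeholder exit reasons, NOT part of the current UAPI. */
#define KVM_EXIT_HYPERV_VTL_CALL	0x1000
#define KVM_EXIT_HYPERV_VTL_RETURN	0x1001

struct virtual_cpu {
	int vcpu_fd[2];		/* one vcpu fd per VTL: [0] = VTL0, [1] = VTL1 */
	struct kvm_run *run[2];	/* mmap'ed kvm_run area for each fd */
	int active_vtl;
};

static void run_virtual_cpu(struct virtual_cpu *cpu)
{
	for (;;) {
		int vtl = cpu->active_vtl;

		if (ioctl(cpu->vcpu_fd[vtl], KVM_RUN, 0) < 0)
			err(1, "KVM_RUN (VTL%d)", vtl);

		switch (cpu->run[vtl]->exit_reason) {
		case KVM_EXIT_HYPERV_VTL_CALL:
			cpu->active_vtl = vtl + 1;	/* schedule the higher VTL in */
			break;
		case KVM_EXIT_HYPERV_VTL_RETURN:
			cpu->active_vtl = vtl - 1;	/* resume the lower VTL */
			break;
		default:
			/* MMIO, shutdown, etc. handled as usual */
			break;
		}
	}
}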

>  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
>    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
>    replicating state that needs to be tracked per-VTL?  Because of how memory
>    attributes affect hugepages, duplicating *memslots* might actually be easier
>    than teaching memslots to be VTL-aware.

I do agree that we need to introduce some level of VTL awareness into
memslots. There's the hugepages issue you pointed out. But it'll also be
necessary once we look at how to implement overlay pages that are
per-VTL. (A topic I didn't mention in the series as I thought I had
managed to solve memory protections while avoiding the need for multiple
slots.) What I have in mind is introducing a memory slot address space
per-VTL, similar to how we do things with SMM.
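
As a rough sketch of what that would look like from user-space, assuming the
SMM-style scheme where the address space id lives in the upper 16 bits of the
slot number (VTL1_AS_ID is made up for the example):

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define VTL1_AS_ID	1	/* hypothetical address space id for VTL1 */

/* Register the same host memory in the (hypothetical) VTL1 address space. */
static int map_ram_for_vtl1(int vm_fd, __u32 slot, __u64 gpa, __u64 size,
			    void *host_addr)
{
	struct kvm_userspace_memory_region region;

	memset(&region, 0, sizeof(region));
	/* As with SMM, the address space id goes in the slot's upper 16 bits. */
	region.slot = (VTL1_AS_ID << 16) | slot;
	region.guest_phys_addr = gpa;
	region.memory_size = size;
	region.userspace_addr = (unsigned long)host_addr;

	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}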

It's important to note that the requirements for overlay pages and memory
protections are very different. Overlay pages are scarce, and are set up
once and never change (AFAICT), so we think stopping all vCPUs, updating
slots, and resuming execution will provide good enough performance.
Memory protections happen very frequently, generally with page
granularity, and may be short-lived.

>  - Is "struct kvm_vcpu" the best representation of an execution context (if 
> I'm
>getting the terminology right)?

Le

Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS

2023-11-10 Thread Nicolas Saenz Julienne
On Wed Nov 8, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > From: Anel Orazgaliyeva 
> >
> > Introduce KVM_CAP_APIC_ID_GROUPS, this capability segments the VM's APIC
> > ids into two. The lower bits, the physical APIC id, represent the part
> > that's exposed to the guest. The higher bits, which are private to KVM,
> > groups APICs together. APICs in different groups are isolated from each
> > other, and IPIs can only be directed at APICs that share the same group
> > as its source. Furthermore, groups are only relevant to IPIs, anything
> > incoming from outside the local APIC complex: from the IOAPIC, MSIs, or
> > PV-IPIs is targeted at the default APIC group, group 0.
> >
> > When routing IPIs with physical destinations, KVM will OR the source's
> > vCPU APIC group with the ICR's destination ID and use that to resolve
> > the target lAPIC.
>
> Is all of the above arbitrary KVM behavior or defined by the TLFS?

All this matches VSM's expectations of how interrupts are to be handled.
But APIC groups are a concept we created with the aim of generalizing the
behaviour as much as possible.

> > The APIC physical map is also made group aware in
> > order to speed up this process. For the sake of simplicity, the logical
> > map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> > routing to the slower per-vCPU scan method.
>
> Why?  I mean, I kinda sorta understand what it does for VSM, but it's not at all
> obvious why this information needs to be shoved into the APIC IDs.  E.g. why not
> have an explicit group_id and then maintain separate optimization maps for each?

There are three tricks to APIC groups. One is IPI routing: for example, the
ICR physical destination is mixed with the source vCPU's APIC group to
find the destination vCPU. Another is presenting a coherent APIC ID
across VTLs; switching VTLs shouldn't change the guest's view of the
APIC ID. And ultimately it keeps track of the vCPU's VTL level. I don't see
why we couldn't decouple them, as long as we keep filtering the APIC ID
before it reaches the guest.
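
In other words, physical-destination routing boils down to something like the
sketch below (names are illustrative; the mask comes from whatever
KVM_CAP_APIC_ID_GROUPS configured):

#include <stdint.h>

/*
 * Sketch of physical-destination IPI resolution with APIC ID groups.
 * group_mask covers the upper, KVM-private bits of the APIC ID; the
 * guest-visible physical id sits in the lower bits.
 */
static uint32_t resolve_ipi_dest(uint32_t icr_dest, uint32_t src_apic_id,
				 uint32_t group_mask)
{
	/* The guest only ever writes the low, physical part of the ID... */
	uint32_t phys_dest = icr_dest & ~group_mask;

	/* ...so OR in the sender's group so the IPI stays within its VTL. */
	return (src_apic_id & group_mask) | phys_dest;
}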

> > This capability serves as a building block to implement virtualisation
> > based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> > introduces a para-virtualised switch that allows for guest CPUs to jump
> > into a different execution context, this switches into a different CPU
> > state, lAPIC state, and memory protections. We model this in KVM by
>
> Who is "we"?  As a general rule, avoid pronouns.  "we" and "us" in particular
> should never show up in a changelog.  I genuinely don't know if "we" means
> userspace or KVM, and the distinction matters because it clarifies whether or
> not KVM is actively involved in the modeling versus KVM being little more 
> than a
> dumb pipe to provide the plumbing.

Sorry, I've been actively trying to avoid pronouns since you already
mentioned it in a previous review. This one slipped through the cracks.

> > using distinct kvm_vcpus for each context.
> >
> > Moreover, execution contexts are hierarchical and its APICs are meant to
> > remain functional even when the context isn't 'scheduled in'.
>
> Please explain the relationship and rules of execution contexts.  E.g. are
> execution contexts the same thing as VTLs?  Do all "real" vCPUs belong to every
> execution context?  If so, is that a requirement?

I left a note about this in my reply to your questions in the cover
letter.

> > For example, we have to keep track of
> > timers' expirations, and interrupt execution of lesser priority contexts
> > when relevant. Hence the need to alias physical APIC ids, while keeping
> > the ability to target specific execution contexts.
> >
> > Signed-off-by: Anel Orazgaliyeva 
> > Co-developed-by: Nicolas Saenz Julienne 
> > Signed-off-by: Nicolas Saenz Julienne 
> > ---
>
>
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index e1021517cf04..542bd208e52b 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
> >  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
> >  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
> >  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> > +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> > +				    struct kvm_apic_id_groups *groups);
> >  void kvm_recalculate_apic_map(struct kvm *kvm);
> >  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
> >  void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> > @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
> >   return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
> >  }
> >
> > +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> > +{
> > + return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;
>
> This is *extremely* misleading.  KVM forces the x2APIC ID to match vcpu_id, but
> in xAPIC mo

Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role

2023-11-10 Thread Nicolas Saenz Julienne
On Wed Nov 8, 2023 at 5:26 PM UTC, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > With the upcoming introduction of per-VTL memory protections, make MMU
> > roles VTL aware. This will avoid sharing PTEs between vCPUs that belong
> > to different VTLs, and that have distinct memory access restrictions.
> >
> > Four bits are allocated to store the VTL number in the MMU role, since
> > the TLFS states there is a maximum of 16 levels.
>
> How many does KVM actually allow/support?  Multiplying the number of possible
> roots by 16x is a *major* change.

AFAIK in practice only VTL0/1 are used. I don't know if Microsoft will
come up with more in the future. We could introduce a CAP that exposes
the number of supported VTLs to user-space, and leave it as a compile-time
option.
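
If such a CAP existed (the name and number below are made up, it's not part of
today's UAPI), user-space would query it with the usual KVM_CHECK_EXTENSION
dance:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical capability, NOT part of the current UAPI. */
#define KVM_CAP_HYPERV_NUM_VTLS	240

/* Returns the number of VTLs KVM supports, or 0 if the cap is absent. */
static int query_supported_vtls(int kvm_fd)
{
	int n = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_HYPERV_NUM_VTLS);

	return n > 0 ? n : 0;
}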



Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support

2023-11-10 Thread Nicolas Saenz Julienne
On Wed Nov 8, 2023 at 4:55 PM UTC, Sean Christopherson wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM
> > enabled guests. It is a snapshot of the progress we made so far, and its
> > main goal is to gather design feedback.
>
> Heh, then please provide an overview of the design, and ideally context and/or
> justification for various design decisions.  It doesn't need to be a proper design
> doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> but a few paragraphs and/or verbose bullet points would go a long way.
>
> The documentation in patch 33 provides an explanation of VSM itself, and a little
> insight into how userspace can utilize the KVM implementation.  But the documentation
> provides no explanation of the mechanics that KVM *developers* care about, e.g.
> the use of memory attributes, how memory attributes are enforced, whether or not
> an in-kernel local APIC is required, etc.

Noted, I'll provide a design overview on the next submission.

> Nor does the documentation explain *why*, e.g. why store a separate set of memory
> attributes per VTL "device", which by the by is broken and unnecessary.

It's clear to me how the current implementation of VTL devices is
broken. But unnecessary? That made me think we could inject the VTL in
the memory attribute key, for example with 'gfn | vtl << 58', and then use
the generic API and a single xarray.
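
Something along the lines of this kernel-side sketch (names approximate, bit 58
is an arbitrary choice):

/*
 * Sketch: fold the VTL into the upper bits of the xarray index so a
 * single xarray can hold per-VTL memory attributes.  Bit 58 leaves
 * plenty of room for the gfn below it.
 */
#define KVM_VTL_ATTR_SHIFT	58

static inline unsigned long kvm_vtl_attr_index(gfn_t gfn, u8 vtl)
{
	return gfn | ((unsigned long)vtl << KVM_VTL_ATTR_SHIFT);
}

/* e.g. xa_load(&kvm->mem_attr_array, kvm_vtl_attr_index(gfn, vtl)) */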

Nicolas



Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support

2023-11-10 Thread Sean Christopherson
On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> >  - What is the split between userspace and KVM?  How did you arrive at that split?
> 
> Our original design, which we discussed in the KVM forum 2023 [1] and is
> public [2], implemented most of VSM in-kernel. Notably we introduced VTL
> awareness in struct kvm_vcpu.

...

> So we decided to move all this complexity outside of struct kvm_vcpu
> and, as much as possible, out of the kernel. We figured out the basic
> kernel building blocks that are absolutely necessary, and let user-space
> deal with the rest.

Sorry, I should have been more clear.  What's the split in terms of responsibilities,
i.e. what will KVM's ABI look like?  E.g. if the vCPU=>VTLs setup is nonsensical,
does KVM care?

My general preference is for KVM to be as permissive as possible, i.e. let
userspace do whatever it wants so long as it doesn't place undue burden on KVM.
But at the same time I don't want to end up in a similar boat as many of the paravirt
features, where things just stop working if userspace or the guest makes a goof.

> >  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
> >    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
> >    replicating state that needs to be tracked per-VTL?  Because of how memory
> >    attributes affect hugepages, duplicating *memslots* might actually be easier
> >    than teaching memslots to be VTL-aware.
> 
> I do agree that we need to introduce some level of VTL awareness into
> memslots. There's the hugepages issue you pointed out. But it'll also be
> necessary once we look at how to implement overlay pages that are
> per-VTL. (A topic I didn't mention in the series as I thought I had
> managed to solve memory protections while avoiding the need for multiple
> slots.) What I have in mind is introducing a memory slot address space
> per-VTL, similar to how we do things with SMM.

No (I hate memslot address spaces :-) )

Why not represent each VTL with a separate "struct kvm" instance?  That would
naturally provide per-VTL behavior for:

  - APIC groups
  - memslot overlays
  - memory attributes (and their impact on hugepages)
  - MMU pages
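
From userspace's point of view that'd be nothing more exotic than the rough
sketch below (existing ioctls only, error handling kept minimal):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <err.h>

/* One VM fd, i.e. one "struct kvm", per VTL, with the same vcpu id in each. */
static void create_vtl_vms(int *vm_fd, int *vcpu_fd, int nr_vtls, int vcpu_id)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);

	if (kvm_fd < 0)
		err(1, "/dev/kvm");

	for (int vtl = 0; vtl < nr_vtls; vtl++) {
		vm_fd[vtl] = ioctl(kvm_fd, KVM_CREATE_VM, 0);
		if (vm_fd[vtl] < 0)
			err(1, "KVM_CREATE_VM (VTL%d)", vtl);

		vcpu_fd[vtl] = ioctl(vm_fd[vtl], KVM_CREATE_VCPU, vcpu_id);
		if (vcpu_fd[vtl] < 0)
			err(1, "KVM_CREATE_VCPU (VTL%d)", vtl);
	}
}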

The only (obvious) issue with that approach would be cross-VTL operations.  IIUC,
sending IPIs across VTLs isn't allowed, but even if it were, that should be easy
enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

GVA=>GPA translation would be trickier, but that patch suggests you want to handle
that in userspace anyways.  And if translation is a rare/slow path, maybe it could
simply be punted to userspace?

  NOTE: The hypercall implementation is incomplete and only shared for
  completion. Additionally we'd like to move the VTL aware parts to
  user-space.

Ewww, and looking at what it would take to support cross-VM translations shows
another problem with using vCPUs to model VTLs.  Nothing prevents userspace from
running a virtual CPU at multiple VTLs concurrently, which means that anything
that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be
modified while kvm_hv_xlate_va_walk() is running.

I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something
holds the other kvm_vcpu/VTL's mutex.  Though ideally, KVM would punt all cross-VTL
operations to userspace.  :-)

If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't
make the locking and concurrency problems meaningfully easier or harder to solve.
E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single
virtual machine, to belong to the same process.  That'd avoid headaches with
mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a
different kvm would need special handling?

Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken, the
in_vtl field is consumed before send_ipi is read from userspace.

union hv_input_vtl *in_vtl;
u64 valid_bank_mask;
u32 vector;
bool all_cpus;
u8 vtl;

/* VTL is at the same offset on both IPI types */
in_vtl = &send_ipi.in_vtl;
vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);
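
The obvious reshuffle would be something along these lines (a sketch only,
reusing the names from the snippet above and the usual kvm_hv_send_ipi()
plumbing, not the actual fix):

struct hv_send_ipi send_ipi;
union hv_input_vtl *in_vtl;
u8 vtl;

/* Read the hypercall input first... */
if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi, sizeof(send_ipi))))
	return HV_STATUS_INVALID_HYPERCALL_INPUT;

/* ...and only then consume the VTL encoded in it. */
in_vtl = &send_ipi.in_vtl;
vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);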

> >    E.g. if 90% of the state is guaranteed to be identical for a given
> >    vCPU across execution contexts, then modeling that with separate
> >    kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
> >    but it might be quite high depending on how much the TLFS restricts
> >    the state of the vCPU, e.g. if it's 64-bit only.
> 
> For the record here's the private VTL state (TLFS 15.11.1):
> 
> "In general, each VTL has its own control registers, RIP register, RSP
>  register, and MSRs:
> 
>  SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP, STAR, LSTAR, CSTAR, SFMASK,
>  EFER, PAT, KERNEL_GSBASE