Re: [PATCH] checkpatch: allow tags between co-developed-by and their sign-off

2023-10-23 Thread Sean Christopherson
+Mateusz

On Mon, Oct 23, 2023, Przemek Kitszel wrote:
> Additional tags between Co-developed-by and corresponding Signed-off-by
> could include Reviewed-by tags collected by Submitter, which is also
> a Co-developer, but should sign-off at the very end of tags provided by
> the Submitter.

...

> Mateusz Polchlopek  has reported this to me.

Heh, there's a tag for that...

  Reported-by: Mateusz Polchlopek 

And it's usually a good idea to Cc the reporter in case there are questions they
can help answer.

> @@ -509,16 +509,18 @@ Example of a patch submitted by the From: author::
>   Signed-off-by: Second Co-Author 
>   Signed-off-by: From Author 
>  
> -Example of a patch submitted by a Co-developed-by: author::
> +Example of a patch submitted by a Co-developed-by: author, who also collected
> +a Reviewed-by: tag posted for earlier version::
>  
>   From: From Author 
>  
>   
>  
>   Co-developed-by: Random Co-Author 
>   Signed-off-by: Random Co-Author 
>   Signed-off-by: From Author 
>   Co-developed-by: Submitting Co-Author 
> + Reviewed-by: Some Reviewer 
>   Signed-off-by: Submitting Co-Author 

This is silly.  Allowing tags in-between Co-developed-by and Signed-off-by
unnecessarily complicates things, e.g. people already miss/forget the rule about
tightly coupling Co-developed-by with Signed-off-by.

And if we're being super pedantic about chronological history, arguably the
Reviewed-by should come before the Co-developed-by as adding the Reviewed-by is
a (trivial) modification to the patch that was done by the submitter.
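
I.e. if we really wanted to capture chronological order, the documented example
would need to be:

  Co-developed-by: Random Co-Author
  Signed-off-by: Random Co-Author
  Signed-off-by: From Author
  Reviewed-by: Some Reviewer
  Co-developed-by: Submitting Co-Author
  Signed-off-by: Submitting Co-Author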


Re: [RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> lapic.h has no dependencies on hyperv.h, so don't include it there.
> 
> Additionally, cpuid.c implicitly relied on hyperv.h's inclusion through
> lapic.h, so include it explicitly there.
> 
> Signed-off-by: Nicolas Saenz Julienne 
> ---

FWIW, feel free to post patches like this without the full context, I'm more than
happy to take patches that resolve header inclusion issues even if the issue(s)
only become visible with additional changes.

I'll earmark this one for 6.8.



Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Alexander Graf wrote:
> 
> On 08.11.23 12:18, Nicolas Saenz Julienne wrote:
> > Save the length of the instruction that triggered an EPT violation in
> > struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
> > intercept messages.
> > 
> > Signed-off-by: Nicolas Saenz Julienne 
> 
> 
> In v1, please do this for SVM as well :)

Why?  KVM caches values on VMX because VMREAD is measurably slower than memory
accesses, especially when running nested.  SVM has no such problems.  I wouldn't
be surprised if adding a "cache" is actually less performant due to increased
pressure and misses on the hardware cache.



Re: [RFC 25/33] KVM: Introduce a set of new memory attributes

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Alexander Graf wrote:
> 
> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
> > Introduce the following memory attributes:
> >   - KVM_MEMORY_ATTRIBUTE_READ
> >   - KVM_MEMORY_ATTRIBUTE_WRITE
> >   - KVM_MEMORY_ATTRIBUTE_EXECUTE
> >   - KVM_MEMORY_ATTRIBUTE_NO_ACCESS
> > 
> > Note that NO_ACCESS is necessary in order to make a distinction between
> > the lack of attributes for a gfn, which defaults to the memory
> > protections of the backing memory, versus explicitly prohibiting any
> > access to that gfn.
> 
> 
> If we negate the attributes (no read, no write, no execute), we can keep 0
> == default and 0b111 becomes "no access".

Yes, I suggested this in the initial discussion[*].  I think it makes sense to
have the uAPI flags have positive polarity, i.e. as above, but internally we can
invert things so that the default 000b gives full RWX protections.  Or we could
make the push for a range-based xarray implementation so that storing 111b for
all gfns is super cheap.

Regardless of how KVM stores the information internally, there's no need for a
NO_ACCESS flag in the uAPI.

[*] https://lore.kernel.org/all/zgfuqblao+ci9...@google.com
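
E.g. the internal inversion could be as simple as (purely illustrative sketch,
not actual KVM code):

	#define KVM_MEMORY_ATTRIBUTES_RWX	(KVM_MEMORY_ATTRIBUTE_READ |	\
						 KVM_MEMORY_ATTRIBUTE_WRITE |	\
						 KVM_MEMORY_ATTRIBUTE_EXECUTE)

	/* Store the *denied* permissions so that the xarray default, 0, == RWX. */
	static unsigned long kvm_attrs_to_denied(unsigned long attrs)
	{
		return ~attrs & KVM_MEMORY_ATTRIBUTES_RWX;
	}

	static unsigned long kvm_denied_to_attrs(unsigned long denied)
	{
		return ~denied & KVM_MEMORY_ATTRIBUTES_RWX;
	}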



Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> This RFC series introduces the necessary infrastructure to emulate VSM
> enabled guests. It is a snapshot of the progress we made so far, and its
> main goal is to gather design feedback.

Heh, then please provide an overview of the design, and ideally context and/or
justification for various design decisions.  It doesn't need to be a proper design
doc, and you can certainly point at other documentation for explaining VSM/VTLs,
but a few paragraphs and/or verbose bullet points would go a long way.

The documentation in patch 33 provides an explanation of VSM itself, and a little
insight into how userspace can utilize the KVM implementation.  But the documentation
provides no explanation of the mechanics that KVM *developers* care about, e.g.
the use of memory attributes, how memory attributes are enforced, whether or not
an in-kernel local APIC is required, etc.

Nor does the documentation explain *why*, e.g. why store a separate set of memory
attributes per VTL "device", which by the by is broken and unnecessary.

> Specifically on the KVM APIs we introduce. For a high level design overview,
> see the documentation in patch 33.
> 
> Additionally, this topic will be discussed as part of the KVM
> Micro-conference, in this year's Linux Plumbers Conference [2].
> 
> The series is accompanied by two repositories:
>  - A PoC QEMU implementation of VSM [3].
>  - VSM kvm-unit-tests [4].
> 
> Note that this isn't a full VSM implementation. For now it only supports
> 2 VTLs, and only runs on uniprocessor guests. It is capable of booting
> Windows Server 2016/2019, but is unstable during runtime.
> 
> The series is based on the v6.6 kernel release, and depends on the
> introduction of KVM memory attributes, which is being worked on
> independently in "KVM: guest_memfd() and per-page attributes" [5].

This doesn't actually apply on 6.6 with v14 of guest_memfd, because v14 of
guest_memfd is based on kvm-6.7-1.  Ah, and looking at your github repo, this
isn't based on v14 at all, it's based on v12.

That's totally fine, but the cover letter needs to explicitly, clearly, and
*accurately* state the dependencies.  I can obviously grab the full branch from
github, but that's not foolproof, e.g. if you accidentally delete or force push
to that branch.  And I also prefer to know that what I'm replying to on list is
the exact same code that I am looking at.

> A full Linux tree is also made available [6].
> 
> Series rundown:
>  - Patch 2 introduces the concept of APIC ID groups.
>  - Patches 3-12 introduce the VSM capability and basic VTL awareness into
>Hyper-V emulation.
>  - Patch 13 introduces vCPU polling support.
>  - Patches 14-31 use KVM's memory attributes to implement VTL memory
>    protections. Introduces the VTL KVM device and secure memory
>intercepts.
>  - Patch 32 is a temporary implementation of
>HVCALL_TRANSLATE_VIRTUAL_ADDRESS necessary to boot Windows 2019.
>  - Patch 33 introduces documentation.
> 
> Our intention is to integrate feedback gathered in the RFC and LPC while
> we finish the VSM implementation. In the future, we will split the series
> into distinct feature patch sets and upstream these independently.
> 
> Thanks,
> Nicolas
> 
> [1] 
> https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
> [2] https://lpc.events/event/17/sessions/166/#20231114
> [3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
> [4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
> [5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonz...@redhat.com/.
> [6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1. 

When providing github links, my preference is to format the pointers as:

   

or
   tags/

e.g.

  https://github.com/vianpl/linux vsm-rfc-v1

so that readers can copy+paste the full thing directly into `git fetch`.  It's a
minor thing, but AFAIK no one actually does review by clicking through github's
webview.

> There are also two small dependencies with
> https://marc.info/?l=kvm&m=167887543028109&w=2 and
> https://lkml.org/lkml/2023/10/17/972

Please use lore links, there's zero reason to use anything else these days.  For
those of us that use b4, lore links make life much easier.



Re: [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 631fd532c97a..4242588e3dfb 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -2385,9 +2385,10 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
>  }
>  
>  #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> +static inline unsigned long
> +kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn)

Do not wrap before the function name.  Linus has a nice explanation/rant on this[*].

[*] https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=ghjsy6nfmub_wa8fyd5ilbnxjo...@mail.gmail.com
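
I.e. the wrapped declaration above should instead be, line length be damned:

	static inline unsigned long kvm_get_memory_attributes(struct xarray *mem_attr_array,
							      gfn_t gfn)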



Re: [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> Pass the memory attribute array through struct kvm_mmu_notifier_arg and
> use it in kvm_arch_post_set_memory_attributes() instead of defaulting on
> kvm->mem_attr_array.
> 
> Signed-off-by: Nicolas Saenz Julienne 
> ---
>  arch/x86/kvm/mmu/mmu.c   | 8 
>  include/linux/kvm_host.h | 5 -
>  virt/kvm/kvm_main.c  | 1 +
>  3 files changed, 9 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c0fd3afd6be5..c2bec2be2ba9 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -7311,6 +7311,7 @@ static bool hugepage_has_attrs(struct xarray *mem_attr_array,
>  bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>struct kvm_gfn_range *range)
>  {
> + struct xarray *mem_attr_array = range->arg.mem_attr_array;
>   unsigned long attrs = range->arg.attributes;
>   struct kvm_memory_slot *slot = range->slot;
>   int level;
> @@ -7344,8 +7345,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>* misaligned address regardless of memory attributes.
>*/
>   if (gfn >= slot->base_gfn) {
> - if (hugepage_has_attrs(&kvm->mem_attr_array,
> -slot, gfn, level, attrs))
> + if (hugepage_has_attrs(mem_attr_array, slot,
> +gfn, level, attrs))

This is wildly broken.  The hugepage tracking is per VM, whereas the attributes
here are per-VTL.  I.e. KVM will (dis)allow hugepages based on whatever VTL last
changed its protections.



Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> Save the length of the instruction that triggered an EPT violation in
> struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory
> intercept messages.

This is silly and unnecessarily obfuscates *why* (as my response regarding SVM
shows), i.e. that this is "needed" because the value is consumed by a *different*
vCPU, not because of performance concerns.

It's also broken, AFAICT nothing prevents the intercepted vCPU from hitting a
different EPT violation before the target vCPU consumes exit_instruction_len.

Holy cow.  All of deliver_gpa_intercept() is wildly unsafe.  Aside from race
conditions, which in and of themselves are a non-starter, nothing guarantees that
the intercepted vCPU actually cached all of the information that is held in its VMCS.

The sane way to do this is to snapshot *all* information on the intercepted vCPU,
and then hand that off as a payload to the target vCPU.  That is, assuming the
cross-vCPU stuff is actually necessary.  At a glance, I don't see anything that
explains *why*.
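
E.g. a rough sketch, with all names and fields invented for illustration:

	/*
	 * Snapshot of the intercepted vCPU's exit information, captured while
	 * the intercepted vCPU's VMCS is loaded, i.e. before another exit can
	 * clobber the fields.
	 */
	struct hv_vsm_intercept_payload {
		u64 gpa;
		u32 instruction_len;
		u8 access_type;			/* read/write/execute */
		u8 instruction_byte_count;
		u8 instruction_bytes[15];	/* max x86 instruction length */
	};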



Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> With the upcoming introduction of per-VTL memory protections, make MMU
> roles VTL aware. This will avoid sharing PTEs between vCPUs that belong
> to different VTLs, and that have distinct memory access restrictions.
> 
> Four bits are allocated to store the VTL number in the MMU role, since
> the TLFS states there is a maximum of 16 levels.

How many does KVM actually allow/support?  Multiplying the number of possible
roots by 16x is a *major* change.



Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> From: Anel Orazgaliyeva 
> 
> Introduce KVM_CAP_APIC_ID_GROUPS, this capability segments the VM's APIC
> ids into two. The lower bits, the physical APIC id, represent the part
> that's exposed to the guest. The higher bits, which are private to KVM,
> groups APICs together. APICs in different groups are isolated from each
> other, and IPIs can only be directed at APICs that share the same group
> as its source. Furthermore, groups are only relevant to IPIs, anything
> incoming from outside the local APIC complex: from the IOAPIC, MSIs, or
> PV-IPIs is targeted at the default APIC group, group 0.
> 
> When routing IPIs with physical destinations, KVM will OR the source's
> vCPU APIC group with the ICR's destination ID and use that to resolve
> the target lAPIC.

Is all of the above arbitrary KVM behavior or defined by the TLFS?

> The APIC physical map is also made group aware in
> order to speed up this process. For the sake of simplicity, the logical
> map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI
> routing to the slower per-vCPU scan method.

Why?  I mean, I kinda sorta understand what it does for VSM, but it's not at all
obvious why this information needs to be shoved into the APIC IDs.  E.g. why not
have an explicit group_id and then maintain separate optimization maps for each?

> This capability serves as a building block to implement virtualisation
> based security features like Hyper-V's Virtual Secure Mode (VSM). VSM
> introduces a para-virtualised switch that allows for guest CPUs to jump
> into a different execution context, this switches into a different CPU
> state, lAPIC state, and memory protections. We model this in KVM by

Who is "we"?  As a general rule, avoid pronouns.  "we" and "us" in particular
should never show up in a changelog.  I genuinely don't know if "we" means
userspace or KVM, and the distinction matters because it clarifies whether or
not KVM is actively involved in the modeling versus KVM being little more than a
dumb pipe to provide the plumbing.

> using distinct kvm_vcpus for each context.
>
> Moreover, execution contexts are hierarchical and their APICs are meant to
> remain functional even when the context isn't 'scheduled in'.

Please explain the relationship and rules of execution contexts.  E.g. are
execution contexts the same thing as VTLs?  Do all "real" vCPUs belong to every
execution context?  If so, is that a requirement?

> For example, we have to keep track of
> timers' expirations, and interrupt execution of lesser priority contexts
> when relevant. Hence the need to alias physical APIC ids, while keeping
> the ability to target specific execution contexts.
> 
> Signed-off-by: Anel Orazgaliyeva 
> Co-developed-by: Nicolas Saenz Julienne 
> Signed-off-by: Nicolas Saenz Julienne 
> ---


> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index e1021517cf04..542bd208e52b 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long cr8);
>  void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu);
>  void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value);
>  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
> +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm,
> + struct kvm_apic_id_groups *groups);
>  void kvm_recalculate_apic_map(struct kvm *kvm);
>  void kvm_apic_set_version(struct kvm_vcpu *vcpu);
>  void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu);
> @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
>   return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
>  }
>  
> +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask;

This is *extremely* misleading.  KVM forces the x2APIC ID to match vcpu_id, but
in xAPIC mode the ID is fully writable.



Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support

2023-11-08 Thread Sean Christopherson
On Wed, Nov 08, 2023, Sean Christopherson wrote:
> On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote:
> > This RFC series introduces the necessary infrastructure to emulate VSM
> > enabled guests. It is a snapshot of the progress we made so far, and its
> > main goal is to gather design feedback.
> 
> Heh, then please provide an overview of the design, and ideally context and/or
> justification for various design decisions.  It doesn't need to be a proper design
> doc, and you can certainly point at other documentation for explaining VSM/VTLs,
> but a few paragraphs and/or verbose bullet points would go a long way.
> 
> The documentation in patch 33 provides an explanation of VSM itself, and a little
> insight into how userspace can utilize the KVM implementation.  But the documentation
> provides no explanation of the mechanics that KVM *developers* care about, e.g.
> the use of memory attributes, how memory attributes are enforced, whether or not
> an in-kernel local APIC is required, etc.
> 
> Nor does the documentation explain *why*, e.g. why store a separate set of memory
> attributes per VTL "device", which by the by is broken and unnecessary.

After speed reading the series..  An overview of the design, why you made certain
choices, and the tradeoffs between various options is definitely needed.

A few questions off the top of my head:

 - What is the split between userspace and KVM?  How did you arrive at that split?

 - How much *needs* to be in KVM?  I.e. how much can be pushed to userspace while
   maintaining good performance?
   
 - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
   in a VTL device and APIC ID groups, why not modify "struct kvm" to support
   replicating state that needs to be tracked per-VTL?  Because of how memory
   attributes affect hugepages, duplicating *memslots* might actually be easier
   than teaching memslots to be VTL-aware.

 - Is "struct kvm_vcpu" the best representation of an execution context (if I'm
   getting the terminology right)?  E.g. if 90% of the state is guaranteed to be
   identical for a given vCPU across execution contexts, then modeling that with
   separate kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
   but it might be quite high depending on how much the TLFS restricts the state
   of the vCPU, e.g. if it's 64-bit only.

The more info you can provide before LPC, the better, e.g. so that we can spend
time discussing options instead of you getting peppered with questions about the
requirements and whatnot.



Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support

2023-11-10 Thread Sean Christopherson
On Fri, Nov 10, 2023, Nicolas Saenz Julienne wrote:
> On Wed Nov 8, 2023 at 6:33 PM UTC, Sean Christopherson wrote:
> >  - What is the split between userspace and KVM?  How did you arrive at that split?
> 
> Our original design, which we discussed in the KVM forum 2023 [1] and is
> public [2], implemented most of VSM in-kernel. Notably we introduced VTL
> awareness in struct kvm_vcpu.

...

> So we decided to move all this complexity outside of struct kvm_vcpu
> and, as much as possible, out of the kernel. We figured out the basic
> kernel building blocks that are absolutely necessary, and let user-space
> deal with the rest.

Sorry, I should have been more clear.  What's the split in terms of responsibilities,
i.e. what will KVM's ABI look like?  E.g. if the vCPU=>VTLs setup is nonsensical,
does KVM care?

My general preference is for KVM to be as permissive as possible, i.e. let
userspace do whatever it wants so long as it doesn't place undue burden on KVM.
But at the same time I don't want to end up in a similar boat as many of the paravirt
features, where things just stop working if userspace or the guest makes a goof.

> >  - Why not make VTLs a first-party concept in KVM?  E.g. rather than bury info
> >    in a VTL device and APIC ID groups, why not modify "struct kvm" to support
> >    replicating state that needs to be tracked per-VTL?  Because of how memory
> >    attributes affect hugepages, duplicating *memslots* might actually be easier
> >    than teaching memslots to be VTL-aware.
> 
> I do agree that we need to introduce some level of VTL awareness into
> memslots. There's the hugepage issue you pointed out. But it'll be
> also necessary once we look at how to implement overlay pages that are
> per-VTL. (A topic I didn't mention in the series as I thought I had
> managed to solve memory protections while avoiding the need for multiple
> slots). What I have in mind is introducing a memory slot address space
> per-VTL, similar to how we do things with SMM.

No (I hate memslot address spaces :-) )

Why not represent each VTL with a separate "struct kvm" instance?  That would
naturally provide per-VTL behavior for:

  - APIC groups
  - memslot overlays
  - memory attributes (and their impact on hugepages)
  - MMU pages

The only (obvious) issue with that approach would be cross-VTL operations.  IIUC,
sending IPIs across VTLs isn't allowed, but even if it were, that should be easy
enough to solve, e.g. KVM already supports posting interrupts from non-KVM sources.

GVA=>GPA translation would be trickier, but that patch suggests you want to handle
that in userspace anyways.  And if translation is a rare/slow path, maybe it could
simply be punted to userspace?

  NOTE: The hypercall implementation is incomplete and only shared for
  completion. Additionally we'd like to move the VTL aware parts to
  user-space.

Ewww, and looking at what it would take to support cross-VM translations shows
another problem with using vCPUs to model VTLs.  Nothing prevents userspace from
running a virtual CPU at multiple VTLs concurrently, which means that anything
that uses kvm_hv_get_vtl_vcpu() is unsafe, e.g. walk_mmu->gva_to_gpa() could be
modified while kvm_hv_xlate_va_walk() is running.

I suppose that's not too hard to solve, e.g. mutex_trylock() and bail if something
holds the other kvm_vcpu/VTL's mutex.  Though ideally, KVM would punt all cross-VTL
operations to userspace.  :-)
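
E.g. something like this for the translation path (completely untested, and
kvm_hv_xlate_va_walk()'s signature is elided/simplified):

	/* Bail on contention instead of risking deadlock across VTL vCPUs. */
	if (!mutex_trylock(&target_vcpu->mutex))
		return -EBUSY;	/* or exit to userspace and let it retry */

	ret = kvm_hv_xlate_va_walk(target_vcpu);
	mutex_unlock(&target_vcpu->mutex);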

If punting to userspace isn't feasible, using a struct kvm per VTL probably wouldn't
make the locking and concurrency problems meaningfully easier or harder to solve.
E.g. KVM could require VTLs, i.e. "struct kvm" instances that are part of a single
virtual machine, to belong to the same process.  That'd avoid headaches with
mm_struct, at which point I don't _think_ getting and using a kvm_vcpu from a
different kvm would need special handling?

Heh, another fun one, the VTL handling in kvm_hv_send_ipi() is wildly broken, the
in_vtl field is consumed before send_ipi is read from userspace.

	union hv_input_vtl *in_vtl;
	u64 valid_bank_mask;
	u32 vector;
	bool all_cpus;
	u8 vtl;

	/* VTL is at the same offset on both IPI types */
	in_vtl = &send_ipi.in_vtl;
	vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu);
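
The read of the hypercall input needs to come first, e.g. (untested, and the
variable names are from memory):

	if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi, sizeof(send_ipi))))
		return HV_STATUS_INVALID_HYPERCALL_INPUT;

	/* Only _now_ does send_ipi.in_vtl hold actual guest data. */
	vtl = send_ipi.in_vtl.use_target_vtl ? send_ipi.in_vtl.target_vtl :
					       kvm_hv_get_active_vtl(vcpu);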

> >    E.g. if 90% of the state is guaranteed to be identical for a given
> >    vCPU across execution contexts, then modeling that with separate
> >    kvm_vcpu structures is very inefficient.  I highly doubt it's 90%,
> >    but it might be quite high depending on how much the TLFS restricts
> >    the state of the vCPU, e.g. if it's 64-bit only.

Re: [RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits

2023-11-28 Thread Sean Christopherson
On Tue, Nov 28, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-08 at 11:17 +, Nicolas Saenz Julienne wrote:
> > Include the fault's read, write and execute status when exiting to
> > user-space.
> > 
> > Signed-off-by: Nicolas Saenz Julienne 
> > ---
> >  arch/x86/kvm/mmu/mmu.c   | 4 ++--
> >  include/linux/kvm_host.h | 9 +++--
> >  include/uapi/linux/kvm.h | 6 ++
> >  3 files changed, 15 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 4e02d506cc25..feca077c0210 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -4300,8 +4300,8 @@ static inline u8 kvm_max_level_for_order(int order)
> >  static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> >   struct kvm_page_fault *fault)
> >  {
> > -   kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT,
> > - PAGE_SIZE, fault->write, fault->exec,
> > +   kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, PAGE_SIZE,
> > + fault->write, fault->exec, fault->user,
> >   fault->is_private);
> >  }
> >  
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 71e1e8cf8936..631fd532c97a 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -2367,14 +2367,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr)
> >  static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
> >  gpa_t gpa, gpa_t size,
> >  bool is_write, bool is_exec,
> > -bool is_private)
> > +bool is_read, bool is_private)
> 
> It almost feels like there is a need for a struct to hold all of those parameters.

The most obvious solution would be to make "struct kvm_page_fault" common, e.g.
ARM's user_mem_abort() fills RWX booleans just like x86 fills kvm_page_fault.
But I think it's best to wait to do something like that until after Anish's series
lands[*].  That way the conversion can be more of a pure refactoring.

[*] https://lore.kernel.org/all/20231109210325.3806151-1-amoor...@google.com




Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-01 Thread Sean Christopherson
On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> 
> I thought about it at the time. It's not that simple though, we should
> still let KVM set the hypercall bytecode, and other quirks like the Xen
> one.

Yeah, that Xen quirk is quite the killer.

Can you provide pseudo-assembly for what the final page is supposed to look like?
I'm struggling mightily to understand what this is actually trying to do.



Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-01 Thread Sean Christopherson
On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > >
> > > I thought about it at the time. It's not that simple though, we should
> > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > one.
> >
> > Yeah, that Xen quirk is quite the killer.
> >
> > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > I'm struggling mightily to understand what this is actually trying to do.
> 
> I'll make it as simple as possible (disregard 32bit support and that xen
> exists):
> 
> vmcall <-  Offset 0, regular Hyper-V hypercalls enter here
> ret
> mov rax,rcx  <-  VTL call hypercall enters here

I'm missing who/what defines "here" though.  What generates the CALL that points
at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?



Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-05 Thread Sean Christopherson
On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > > >
> > > > > I thought about it at the time. It's not that simple though, we should
> > > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > > one.
> > > >
> > > > Yeah, that Xen quirk is quite the killer.
> > > >
> > > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > > I'm struggling mightily to understand what this is actually trying to do.
> > >
> > > I'll make it as simple as possible (disregard 32bit support and that xen
> > > exists):
> > >
> > > vmcall <-  Offset 0, regular Hyper-V hypercalls enter here
> > > ret
> > > mov rax,rcx  <-  VTL call hypercall enters here
> >
> > I'm missing who/what defines "here" though.  What generates the CALL that points
> > at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> > we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?
> 
> Yes, sorry, I should've included some more context.
> 
> Here's a rundown (from memory) of how the first VTL call happens:
>  - CPU0 start running at VTL0.
>  - Hyper-V enables VTL1 on the partition.
>  - Hyper-V enabled VTL1 on CPU0, but doesn't yet switch to it. It passes
>the initial VTL1 CPU state alongside the enablement hypercall
>arguments.
>  - Hyper-V sets the Hypercall page overlay address through
>HV_X64_MSR_HYPERCALL. KVM fills it.
>  - Hyper-V gets the VTL-call and VTL-return offset into the hypercall
>page using the VP Register HvRegisterVsmCodePageOffsets (VP register
>handling is in user-space).

Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets via
a HvSetVpRegisters() hypercall.

I don't see a sane way to handle this in KVM if userspace handles HvSetVpRegisters().
E.g. if the guest requests offsets that don't leave enough room for KVM to shove
in its data, then presumably userspace needs to reject HvSetVpRegisters().  But
that requires userspace to know exactly how many bytes KVM is going to write at
each offset.

My vote is to have userspace do all the patching.  IIUC, all of this is going to
be mutually exclusive with kvm_xen_hypercall_enabled(), i.e. userspace doesn't
need to worry about setting RAX[31].  At that point, it's just VMCALL versus
VMMCALL, and userspace is more than capable of identifying whether its running
on Intel or AMD.
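
E.g. all userspace would need is something along these lines (offsets and
helpers hypothetical):

	static const u8 vmcall[]  = { 0x0f, 0x01, 0xc1 };	/* Intel: VMCALL  */
	static const u8 vmmcall[] = { 0x0f, 0x01, 0xd9 };	/* AMD:   VMMCALL */

	memcpy(page + vtl_call_offset, is_amd ? vmmcall : vmcall, sizeof(vmcall));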

>  - Hyper-V performs the first VTL-call, and has all it needs to move
>between VTL0/1.
> 
> Nicolas



Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page

2023-12-05 Thread Sean Christopherson
On Tue, Dec 05, 2023, Maxim Levitsky wrote:
> On Tue, 2023-12-05 at 11:21 -0800, Sean Christopherson wrote:
> > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > On Fri Dec 1, 2023 at 5:47 PM UTC, Sean Christopherson wrote:
> > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > On Fri Dec 1, 2023 at 4:32 PM UTC, Sean Christopherson wrote:
> > > > > > On Fri, Dec 01, 2023, Nicolas Saenz Julienne wrote:
> > > > > > > > To support this I think that we can add a userspace msr filter on the HV_X64_MSR_HYPERCALL,
> > > > > > > > although I am not 100% sure if a userspace msr filter overrides the in-kernel msr handling.
> > > > > > > 
> > > > > > > I thought about it at the time. It's not that simple though, we should
> > > > > > > still let KVM set the hypercall bytecode, and other quirks like the Xen
> > > > > > > one.
> > > > > > 
> > > > > > Yeah, that Xen quirk is quite the killer.
> > > > > > 
> > > > > > Can you provide pseudo-assembly for what the final page is supposed to look like?
> > > > > > I'm struggling mightily to understand what this is actually trying to do.
> > > > > 
> > > > > I'll make it as simple as possible (disregard 32bit support and that xen
> > > > > exists):
> > > > > 
> > > > > vmcall <-  Offset 0, regular Hyper-V hypercalls enter here
> > > > > ret
> > > > > mov rax,rcx  <-  VTL call hypercall enters here
> > > > 
> > > > I'm missing who/what defines "here" though.  What generates the CALL that points
> > > > at this exact offset?  If the exact offset is dictated in the TLFS, then aren't
> > > > we screwed with the whole Xen quirk, which inserts 5 bytes before that first VMCALL?
> > > 
> > > Yes, sorry, I should've included some more context.
> > > 
> > > Here's a rundown (from memory) of how the first VTL call happens:
> > >  - CPU0 start running at VTL0.
> > >  - Hyper-V enables VTL1 on the partition.
> > >  - Hyper-V enabled VTL1 on CPU0, but doesn't yet switch to it. It passes
> > >the initial VTL1 CPU state alongside the enablement hypercall
> > >arguments.
> > >  - Hyper-V sets the Hypercall page overlay address through
> > >HV_X64_MSR_HYPERCALL. KVM fills it.
> > >  - Hyper-V gets the VTL-call and VTL-return offset into the hypercall
> > >page using the VP Register HvRegisterVsmCodePageOffsets (VP register
> > >handling is in user-space).
> > 
> > Ah, so the guest sets the offsets by "writing" HvRegisterVsmCodePageOffsets 
> > via
> > a HvSetVpRegisters() hypercall.
> 
> No, you didn't understand this correctly. 
> 
> The guest writes the HV_X64_MSR_HYPERCALL, and in the response hyperv fills

When people say "Hyper-V", do y'all mean "root partition"?  If so, can we just
say "root partition"?  Part of my confusion is that I don't instinctively know
whether things like "Hyper-V enables VTL1 on the partition" are talking about the
root partition (or I guess parent partition?) or the hypervisor.  Functionally it
probably doesn't matter, it's just hard to reconcile things with the TLFS, which
is written largely to describe the hypervisor's behavior.

> the hypercall page, including the VSM thunks.
>
> Then the guest can _read_ the offsets hyperv chose there, by issuing another
> hypercall.

Hrm, now I'm really confused.  Ah, the TLFS contradicts itself.  The blurb for
AccessVpRegisters says:

  The partition can invoke the hypercalls HvSetVpRegisters and HvGetVpRegisters.

And HvSetVpRegisters confirms that requirement:

  The caller must either be the parent of the partition specified by PartitionId,
  or the partition specified must be “self” and the partition must have the
  AccessVpRegisters privilege

But it's absent from HvGetVpRegisters:

  The caller must be the parent of the partition specified by PartitionId or the
  partition specifying its own partition ID.

> In the current implementation, the offsets that the kernel chose are first
> exposed to the userspace via a new ioctl, and then the userspace exposes these
> offsets to the guest via that 'another hypercall' (reading a pseudo partition
> register 'HvRegisterVsmCodePageOffsets')
> 
> I personally don't know for sure anymore if the userspace or kernel based
> hypercall page is better here, it's ugly regardless :(

Hrm.  Requiring userspace to intercept the WRMSR will be a mess because then KVM
will have zero knowledge of the hypercall page, e.g. userspace would be forced to
intercept HV_X64_MSR_GUEST_OS_ID as well.  That's not the end of the world, but
it's not exactly ideal either.

What if we exit to userspace with a new kvm_hyperv_exit reason that requires
completion?  I.e. punt to userspace if VSM is enabled, but still record the data
in KVM?  Ugh, but even that's a mess because kvm_hv_set_msr_pw() is deep in the
WRMSR emulation call stack and can't easily signal that an exit to userspace is
needed.  Blech.



Re: [PATCH 16/18] KVM: x86: Take mem attributes into account when faulting memory

2024-08-22 Thread Sean Christopherson
On Thu, Aug 22, 2024, Nicolas Saenz Julienne wrote:
> On Sun Jun 9, 2024 at 3:49 PM UTC, Nicolas Saenz Julienne wrote:
> > Take access restriction memory attributes into account when faulting
> > guest memory. Prohibited memory accesses will cause a user-space fault
> > exit.
> >
> > Additionally, bypass a warning in the !tdp case. Access restrictions in
> > guest page tables might not necessarily match the host pte's when memory
> > attributes are in use.
> >
> > Signed-off-by: Nicolas Saenz Julienne 
> 
> I now realize that only taking into account memory attributes during
> faults isn't good enough for VSM. We should check the attributes anytime
> KVM takes GPAs as input for any action initiated by the guest. If the
> memory attributes are incompatible with such action, it should be
> stopped. Failure to do so opens side channels that unprivileged VTLs can
> abuse to infer information about privileged VTL. Some examples I came up
> with:
> - Guest page walks: VTL0 could install malicious directory entries that
>   point to GPAs only visible to VTL1. KVM will happily continue the
>   walk. Among other things, this could be used to infer VTL1's GVA->GPA
>   mappings.
> - PV interfaces like the Hyper-V TSC page or VP assist page, could be
>   used to modify portions of VTL1 memory.
> - Hyper-V hypercalls that take GPAs as input/output can be abused in a
>   myriad of ways. Including ones that exit into user-space.
> 
> We would be protected against all these if we implemented the memory
> access restrictions through the memory slots API. As is, it has the
> drawback of having to quiesce the whole VM for any non-trivial slot
> modification (i.e. VSM's memory protections). But if we found a way to
> speed up the slot updates we could rely on that, and avoid having to
> teach kvm_read/write_guest() and friends to deal with memattrs. Note
> that we would still need to use memory attributes to request for faults
> to exit onto user-space on those select GPAs. Any opinions or
> suggestions?
> 
> Note that, for now, I'll stick with the memory attributes approach to
> see what the full solution looks like.

FWIW, I suspect we'll be better off honoring memory attributes.  It's not just
the KVM side that has issues with memslot updates, my understanding is userspace
has also built up "slow" code with respect to memslot updates, in part because
it's such a slow path in KVM.
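
And teaching the access helpers about attributes doesn't look too awful, e.g. a
purely illustrative sketch (ignores accesses that cross page boundaries):

	static int kvm_read_guest_checked(struct kvm *kvm, gpa_t gpa, void *data,
					  unsigned long len)
	{
		if (!(kvm_get_memory_attributes(kvm, gpa_to_gfn(gpa)) &
		      KVM_MEMORY_ATTRIBUTE_READ))
			return -EFAULT;

		return kvm_read_guest(kvm, gpa, data, len);
	}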



Re: [PATCH v2 07/25] KVM: VMX: Set intercept for FRED MSRs

2024-09-12 Thread Sean Christopherson
On Thu, Sep 05, 2024, Xin Li wrote:
> On 6/12/2024 2:32 PM, Sean Christopherson wrote:
> > On Fri, Apr 19, 2024, Chao Gao wrote:
> > > On Wed, Feb 07, 2024 at 09:26:27AM -0800, Xin Li wrote:
> > > > Add FRED MSRs to the valid passthrough MSR list and set FRED MSRs intercept
> > > > based on FRED enumeration.
> > 
> > This needs a *much* more verbose explanation.  It's pretty darn obvious _what_
> > KVM is doing, but it's not at all clear _why_ KVM is passing through FRED MSRs.
> > E.g. why is FRED_SSP0 not included in the set of passthrough MSRs?
> > 
> > > > static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > > {
> > > > struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > > +   bool fred_enumerated;
> > > > 
> > > > kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_FRED);
> > > > +   fred_enumerated = guest_can_use(vcpu, X86_FEATURE_FRED);
> > > > 
> > > > -   if (guest_can_use(vcpu, X86_FEATURE_FRED)) {
> > > > +   if (fred_enumerated) {
> > > > vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
> > > > 		secondary_vm_exit_controls_setbit(vmx,
> > > > 						  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
> > > > @@ -7788,6 +7793,16 @@ static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
> > > > 						  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
> > > > 						  SECONDARY_VM_EXIT_LOAD_IA32_FRED);
> > > > }
> > > > +
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW, !fred_enumerated);
> > > > +	vmx_set_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW, !fred_enumerated);
> > > 
> > > Use a for-loop here? e.g.,
> > >   for (i = MSR_IA32_FRED_RSP0; i <= MSR_IA32_FRED_CONFIG; i++)
> > 
> > Hmm, I'd prefer to keep the open coded version.  It's not pretty, but I don't
> > expect this to have much, if any, maintenance cost.  And using a loop makes it
> > harder to both understand _exactly_ what's happening, and to search for relevant
> > code.  E.g. it's quite difficult to see that FRED_SSP0 is still intercepted (see
> > my comment regarding the changelog).
> 
> 
> I owe you an explanation; I have been thinking about figuring out a way
> to include FRED SSP0 in the FRED KVM patch set...
> 
> MSR_IA32_FRED_SSP0 is an alias of the CET MSR_IA32_PL0_SSP and likely to
> be used in the same way as FRED RSP0, i.e., host FRED SSP0 _should_ be
> restored in arch_exit_to_user_mode_prepare().  However as of today Linux
> has no plan to utilize kernel shadow stack thus no one cares host FRED
> SSP0 (no?).  But lets say anyway it is host's responsibility to manage
> host FRED SSP0, then KVM only needs to take care of guest FRED SSP0
> (just like how KVM should handle guest FRED RSP0) even before the
> supervisor shadow stack feature is advertised to guest.

Heh, I'm not sure what your question is, or if there even is a question.  KVM
needs to context switch FRED SSP0 if FRED is exposed to the guest, but presumably
that will be done through XSAVE state?  If that's the long term plan, I would
prefer to focus on merging CET virtualization first, and then land FRED virtualization
on top so that KVM doesn't have to carry intermediate code to deal with the aliased
MSR.

Ugh, but what happens if a CPU (or the host kernel) supports FRED but not CET SS?
Or is that effectively an illegal combination?

> Another question is should KVM handle userspace request to set/get FRED
> SSP0?  IMO, it should be part of CET state management.

Yes, KVM needs to allow userspace to get/set FRED SSP0.  In general, KVM needs to
allow reads/writes to MSRs even if they can be saved/restored through some other
means.  In most cases, including this one, it's a moot point, because KVM needs
to have the necessary code anyways, e.g. if KVM encounters a RDMSR/WRMSR while
emulating.



Re: [PATCH v8 18/27] mm: Introduce do_mmap_locked()

2019-08-19 Thread Sean Christopherson
On Tue, Aug 13, 2019 at 01:52:16PM -0700, Yu-cheng Yu wrote:
> There are a few places that need do_mmap() with mm->mmap_sem held.
> Create an in-line function for that.
> 
> Signed-off-by: Yu-cheng Yu 
> ---
>  include/linux/mm.h | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index bc58585014c9..275c385f53c6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2394,6 +2394,24 @@ static inline void mm_populate(unsigned long addr, unsigned long len)
>  static inline void mm_populate(unsigned long addr, unsigned long len) {}
>  #endif
>  
> +static inline unsigned long do_mmap_locked(struct file *file,
> + unsigned long addr, unsigned long len, unsigned long prot,
> + unsigned long flags, vm_flags_t vm_flags, struct list_head *uf)
> +{
> + struct mm_struct *mm = current->mm;
> + unsigned long populate;
> +
> + down_write(&mm->mmap_sem);
> + addr = do_mmap(file, addr, len, prot, flags, vm_flags, 0,
> +&populate, uf);
> + up_write(&mm->mmap_sem);
> +
> + if (populate)
> + mm_populate(addr, populate);
> +
> + return addr;
> +}

Any reason not to put this in cet.c, as suggested by PeterZ?  All of the
calls from CET have identical params except for @len, e.g. you can add
'static unsigned long cet_mmap(unsigned long len)' and bury most of the
copy-paste code in there.

https://lkml.kernel.org/r/20190607074707.gd3...@hirez.programming.kicks-ass.net
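
E.g. a rough sketch, where the exact prot/flags are whatever CET actually needs
(VM_SHSTK comes from this series):

	static unsigned long cet_mmap(unsigned long len)
	{
		return do_mmap_locked(NULL, 0, len, PROT_READ,
				      MAP_ANONYMOUS | MAP_PRIVATE, VM_SHSTK,
				      NULL);
	}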

> +
>  /* These take the mm semaphore themselves */
>  extern int __must_check vm_brk(unsigned long, unsigned long);
>  extern int __must_check vm_brk_flags(unsigned long, unsigned long, unsigned long);
> -- 
> 2.17.1
> 


Re: [PATCH] doc: kvm: fix return description of KVM_SET_MSRS

2019-09-03 Thread Sean Christopherson
On Mon, Sep 02, 2019 at 06:12:14PM +0800, Xiaoyao Li wrote:

It may seem silly, but a proper changelog would be helpful even here,
e.g. to explain how and when a positive return value can diverge from the
number of MSRs specified in struct kvm_msrs.

> Signed-off-by: Xiaoyao Li 
> ---
>  Documentation/virt/kvm/api.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 2d067767b617..a2efc19e0f4e 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -586,7 +586,7 @@ Capability: basic
>  Architectures: x86
>  Type: vcpu ioctl
>  Parameters: struct kvm_msrs (in)
> -Returns: 0 on success, -1 on error
> +Returns: number of msrs successfully set, -1 on error

Similar to the changelong comment, it'd be helpful to elaborate on the
positive return value, e.g.:

  Returns: number of msrs successfully set (see below), -1 on error

and then something in the free form text explaining how the ioctl stops
processing MSRs if setting an MSR fails.
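
That way it's clear to userspace that entries[r] identifies the first MSR that
was rejected, e.g. (userspace sketch):

	int r = ioctl(vcpu_fd, KVM_SET_MSRS, msrs);

	if (r < 0)
		err(1, "KVM_SET_MSRS");
	else if (r < msrs->nmsrs)
		fprintf(stderr, "failed to set MSR 0x%x\n", msrs->entries[r].index);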

>  Writes model-specific registers to the vcpu.  See KVM_GET_MSRS for the
>  data structures.
> -- 
> 2.19.1
> 


Re: [PATCH v2] doc: kvm: Fix return description of KVM_SET_MSRS

2019-09-04 Thread Sean Christopherson
On Wed, Sep 04, 2019 at 02:01:18PM +0800, Xiaoyao Li wrote:
> Userspace can use ioctl KVM_SET_MSRS to update a set of MSRs of guest.
> This ioctl sets specified MSRs one by one. Once it fails to set an MSR
> due to setting reserved bits, the MSR is not supported/emulated by kvm,
> or violating other restrictions, it stops further processing and returns
> the number of MSRs have been set successfully.
> 
> Signed-off-by: Xiaoyao Li 
> ---
> v2:
>   elaborate the changelog and description of ioctl KVM_SET_MSRS based on
>   Sean's comments.
> 
> ---
>  Documentation/virt/kvm/api.txt | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/virt/kvm/api.txt b/Documentation/virt/kvm/api.txt
> index 2d067767b617..4638e893dec0 100644
> --- a/Documentation/virt/kvm/api.txt
> +++ b/Documentation/virt/kvm/api.txt
> @@ -586,7 +586,7 @@ Capability: basic
>  Architectures: x86
>  Type: vcpu ioctl
>  Parameters: struct kvm_msrs (in)
> -Returns: 0 on success, -1 on error
> +Returns: number of msrs successfully set (see below), -1 on error
>  
>  Writes model-specific registers to the vcpu.  See KVM_GET_MSRS for the
>  data structures.
> @@ -595,6 +595,11 @@ Application code should set the 'nmsrs' member (which indicates the
>  size of the entries array), and the 'index' and 'data' members of each
>  array entry.
>  
> +It tries to set the MSRs in array entries[] one by one. Once failing to

Probably better to say 'If' as opposed to 'Once', don't want to imply that
userspace is incompetent :)

> +set an MSR (due to setting reserved bits, the MSR is not supported/emulated
> +by kvm, or violating other restrctions), 

Make it clear the list is not exhaustive, e.g.:

It tries to set the MSRs in array entries[] one by one.  If setting an MSR
fails, e.g. due to setting reserved bits, the MSR isn't supported/emulated by
KVM, etc..., it stops processing the MSR list and returns the number of MSRs
that have been set successfully.

> it stops setting following MSRs
> +and returns the number of MSRs have been set successfully.
> +
>  
>  4.20 KVM_SET_CPUID
>  
> -- 
> 2.19.1
> 


Re: [PATCH v5 11/11] intel_sgx: driver documentation

2017-11-27 Thread Sean Christopherson
On Tue, 2017-11-21 at 01:08 +0200, Jarkko Sakkinen wrote:
> On Sat, Nov 18, 2017 at 12:34:33AM +0100, Thomas Gleixner wrote:
> > 
> > This is architectural. From the cursory read of that series it seems there
> > are two parts to it:
> > 
> >   1) The actual core handling, which should be in arch/x86 because that
> >  hardly qualifies as a 'platform' device driver.
> > 
> >   2) The user space interface, which can be separated out perhaps.
> > 
> > I don't know how intertwingled they are, but that's hard to tell from the
> > actual patches w/o doing a deep inspection. Jarkko should be able to answer
> > that.
> > 
> > Thanks,
> > 
> > tglx
> Darren, tglx,
> 
> You can leave user space device as separate module as sgx_ioctl.c merely
> calls stuff that I have inside sgx_encl.c. VMA creation is bound to file
> operations.
> 
> My questions would be:
> 
> 1. What is your recommendation on the deployment under arch/x86?
> 2. Which parts should be compilable as a LKM? Only the user interface
>    or both parts?
> 
> /Jarkko

To enable KVM and a cgroup for EPC accounting, at a minimum arch/x86 needs to
manage the EPC pages (alloc/free/lrus/reclaim/etc...) and LE hash MSRs.  IMO,
ideally everything else would be left in the device driver, e.g. anything
involving ENCLS.  Keeping the majority of the driver out of arch/x86 minimizes
the footprint in arch/x86 and thereby the size of KVM's dependency required to
virtualize SGX, and allows the various SGX pieces, e.g. arch, driver and KVM, to
evolve more independently.

Preferably the arch/x86 code would not be a loadable module, e.g. to simplify
KVM support.

I have a branch based on Jarkko's patches (I believe it's up-to-date with v5)
that implements what I described.  I'd be happy to send RFC patches if that
would help.


Branches for those interested:

https://github.com/sean-jc/linux.git sgx/arch   - move core EPC to arch/x86
https://github.com/sean-jc/linux.git sgx/kvm    - KVM support for SGX
https://github.com/sean-jc/linux.git sgx/lc     - KVM support for Launch Control
https://github.com/sean-jc/linux.git sgx/cgroup - EPC cgroup


branch relationships:

        Jarkko's patches
               |
               |
           sgx/arch
           /      \
     sgx/kvm    sgx/cgroup
        /
    sgx/lc



Re: [PATCH v5 11/11] intel_sgx: driver documentation

2017-11-27 Thread Sean Christopherson
+ Cc: KVM, Paolo and Radim

On Mon, 2017-11-27 at 09:03 -0800, Sean Christopherson wrote:
> On Tue, 2017-11-21 at 01:08 +0200, Jarkko Sakkinen wrote:
> > 
> > On Sat, Nov 18, 2017 at 12:34:33AM +0100, Thomas Gleixner wrote:
> > > 
> > > 
> > > This is architectural. From the cursory read of that series it seems there
> > > are two parts to it:
> > > 
> > >   1) The actual core handling, which should be in arch/x86 because that
> > >  hardly qualifies as a 'platform' device driver.
> > > 
> > >   2) The user space interface, which can be separated out perhaps.
> > > 
> > > I don't know how intertwingled they are, but that's hard to tell from the
> > > actual patches w/o doing a deep inspection. Jarkko should be able to answer
> > > that.
> > > 
> > > Thanks,
> > > 
> > >   tglx
> > Darren, tglx,
> > 
> > You can leave user space device as separate module as sgx_ioctl.c merely
> > calls stuff that I have inside sgx_encl.c. VMA creation is bound to file
> > operations.
> > 
> > My questions would be:
> > 
> > 1. What is your recommendation on the deployment under arch/x86?
> > 2. Which parts should be compilable as a LKM? Only the user interface
> >    or both parts?
> > 
> > /Jarkko
> To enable KVM and a cgroup for EPC accounting, at a minimum arch/x86 needs to
> manage the EPC pages (alloc/free/lrus/reclaim/etc...) and LE hash MSRs.  IMO,
> ideally everything else would be left in the device driver, e.g. anything
> involving ENCLS.  Keeping the majority of the driver out of arch/x86 minimizes
> the footprint in arch/x86 and thereby the size of KVM's dependency required to
> virtualize SGX, and allows the various SGX pieces, e.g. arch, driver and KVM,
> to evolve more independently.
> 
> Preferably the arch/x86 code would not be a loadable module, e.g. to simplify
> KVM support.
> 
> I have a branch based on Jarkko's patches (I believe it's up-to-date with v5)
> that implements what I described.  I'd be happy to send RFC patches if that
> would help.
> 
> 
> Branches for those interested:
> 
> https://github.com/sean-jc/linux.git sgx/arch   - move core EPC to arch/x86
> https://github.com/sean-jc/linux.git sgx/kvm    - KVM support for SGX
> https://github.com/sean-jc/linux.git sgx/lc     - KVM support for Launch Control
> https://github.com/sean-jc/linux.git sgx/cgroup - EPC cgroup
> 
> 
> branch relationships:
> 
>         Jarkko's patches
>                |
>                |
>            sgx/arch
>            /      \
>      sgx/kvm    sgx/cgroup
>         /
>     sgx/lc



Re: [PATCH v11 00/13] Intel SGX1 support

2018-12-11 Thread Sean Christopherson
On Tue, Dec 11, 2018 at 10:10:38AM -0800, Dave Hansen wrote:
> On 12/10/18 3:12 PM, Josh Triplett wrote:
> >> Or maybe even python/shell scripts? It looked to me like virtual
> >> memory will be "interesting" for enclaves.
> > Memory management doesn't seem that hard to deal with.
> 
> The problems are:
> 
> 1. SGX enclave memory (EPC) is statically allocated at boot and can't
>grow or shrink
> 2. EPC is much smaller than regular RAM
> 3. The core VM has no comprehension of EPC use, thus can not help
>with its algorithms, like the LRU
> 4. The SGX driver implements its own VM which is substantially simpler
>than the core VM, but less feature-rich, fast, or scalable

I'd also add:

  5. Swapping EPC pages can only be done through SGX specific ISA that
 has strict concurrency requirements and enforces TLB flushing.
  6. There are specialized types of EPC pages that have different
 swapping requirements than regular EPC pages.
  7. EPC pages that are exposed to a KVM guest have yet another set of
 swapping requirements.

In other words, extending the core VM to SGX EPC is painfully difficult.


[PATCH] docs: Explicitly state that the 'Fixes:' tag shouldn't split lines

2019-02-19 Thread Sean Christopherson
...and use a commit with an obnoxiously long summary in the example to
make it abundantly clear that keeping the tag on a single line takes
priority over wrapping at 75 columns.  Without the explicit exemption,
one might assume splitting the tag is acceptable, even encouraged, e.g.
due to being conditioned by checkpatch's line length warning.

Per Stephen's scripts[1] and implied by commit bf4daf12a9fb ("checkpatch:
avoid some commit message long line warnings"), splitting the 'Fixes:'
tag across multiple lines is a no-no, presumably because parsing multi-line
tags is unnecessarily painful.

[1] https://lkml.kernel.org/r/20190216183433.71b7c...@canb.auug.org.au

Cc: Stephen Rothwell 
Signed-off-by: Sean Christopherson 
---
 Documentation/process/submitting-patches.rst | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 30dc00a364e8..be7d1829c3af 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -182,9 +182,11 @@ change five years from now.
 
 If your patch fixes a bug in a specific commit, e.g. you found an issue using
 ``git bisect``, please use the 'Fixes:' tag with the first 12 characters of
-the SHA-1 ID, and the one line summary.  For example::
+the SHA-1 ID, and the one line summary.  Do not split the tag across multiple
+lines, tags are exempt from the "wrap at 75 columns" rule in order to simplify
+parsing scripts.  For example::
 
-   Fixes: e21d2170f366 ("video: remove unnecessary platform_set_drvdata()")
+   Fixes: 54a4f0239f2e ("KVM: MMU: make kvm_mmu_zap_page() return the number of pages it actually freed")
 
 The following ``git config`` settings can be used to add a pretty format for
 outputting the above style in the ``git log`` or ``git show`` commands::
-- 
2.20.1
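
For convenience, the settings that the trailing context above refers to
(quoted from the same document, so treat as illustrative) are:

  [core]
  	abbrev = 12
  [pretty]
  	fixes = Fixes: %h ("%s")

after which "git log -1 --pretty=fixes <commit>" emits the whole tag on a
single line.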



Re: [PATCH V5 1/5] KVM: X86: Memory ROE documentation

2018-10-29 Thread Sean Christopherson
On Fri, Oct 26, 2018 at 05:12:19PM +0200, Ahmed Abd El Mawgood wrote:
> Following up with my previous threads on KVM assisted Anti rootkit
> protections.

All of the changelogs in this series need to be rewritten to adhere to
Documentation/process[1].  In particular, use imperative mood and
describe why/what is being done in a self-contained manner, e.g. the
above line isn't very helpful without a lot of prior context.

[1] 
https://github.com/torvalds/linux/blob/master/Documentation/process/submitting-patches.rst#2-describe-your-changes

> The current version doesn't address the attacks involving page
> remapping. It is still a design in progress; nevertheless, it will be in
> my later patch sets.

This series should be tagged RFC if its design is a WIP.


Re: [PATCH V5 5/5] KVM: Small Refactoring to kvm_free_memslot

2018-10-29 Thread Sean Christopherson
On Fri, Oct 26, 2018 at 05:12:23PM +0200, Ahmed Abd El Mawgood wrote:
> This should be a little bit more readable and prone to memory leaks

Describe what is being done, both in the subject line and continuing on in
the full changelog, e.g. "Small Refactoring to kvm_free_memslot" doesn't
provide any clue as to what is being done.  And this is not what I would
describe as refactoring, e.g. verifying the new behavior means tracing
through its impact on __kvm_set_memory_region().

Lastly, this should be sent as a separate patch.  There is no dependency
on the ROE code and if it actually addresses a potential memory leak (I
haven't actually reviewed the code itself) it should go in sooner rather
than later.

> 
> Signed-off-by: Ahmed Abd El Mawgood 
> ---
>  virt/kvm/kvm_main.c | 15 +++
>  1 file changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2d3011e8490e..79c98db03c84 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -550,11 +550,11 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
>   * Free any memory in @free but not in @dont.
>   */
>  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
> -   struct kvm_memory_slot *dont)
> +   struct kvm_memory_slot *dont,
> +   enum kvm_mr_change change)
>  {
> + if (change == KVM_MR_DELETE) {
>  #ifdef CONFIG_KVM_ROE
> - if (!dont) {
> - //TODO still this might leak
>   struct protected_chunk *pos, *n;
>   struct list_head *head = free->prot_list;
>   kvfree(free->roe_bitmap);
> @@ -564,10 +564,9 @@ static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
>   kvfree(pos);
>   }
>   kvfree(free->prot_list);
> - }
>  #endif
> - if (!dont || free->dirty_bitmap != dont->dirty_bitmap)
>   kvm_destroy_dirty_bitmap(free);
> + }
>  
>   kvm_arch_free_memslot(kvm, free, dont);
>  
> @@ -582,7 +581,7 @@ static void kvm_free_memslots(struct kvm *kvm, struct kvm_memslots *slots)
>   return;
>  
>   kvm_for_each_memslot(memslot, slots)
> - kvm_free_memslot(kvm, memslot, NULL);
> + kvm_free_memslot(kvm, memslot, NULL, KVM_MR_DELETE);
>  
>   kvfree(slots);
>  }
> @@ -1100,14 +1099,14 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  
>   kvm_arch_commit_memory_region(kvm, mem, &old, &new, change);
>  
> - kvm_free_memslot(kvm, &old, &new);
> + kvm_free_memslot(kvm, &old, &new, change);
>   kvfree(old_memslots);
>   return 0;
>  
>  out_slots:
>   kvfree(slots);
>  out_free:
> - kvm_free_memslot(kvm, &new, &old);
> + kvm_free_memslot(kvm, &new, &old, change);
>  out:
>   return r;
>  }
> -- 
> 2.18.1
> 


Re: [PATCH 0/7] KVM: x86: Introduce new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT

2024-10-14 Thread Sean Christopherson
On Fri, Oct 04, 2024, Nikolas Wipper wrote:
> This series introduces a new ioctl KVM_HYPERV_SET_TLB_FLUSH_INHIBIT. It
> allows hypervisors to inhibit remote TLB flushing of a vCPU coming from
> Hyper-V hyper-calls (namely HvFlushVirtualAddressSpace(Ex) and
> HvFlushVirtualAddressList(Ex)). It is required to implement the
> HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
> emulate VSM within KVM and QEMU. The hyper-call requires several new KVM
> APIs, one of which is KVM_HYPERV_SET_TLB_FLUSH_INHIBIT.
> 
> Once the inhibit flag is set, any processor attempting to flush the TLB on
> the marked vCPU, with a HyperV hyper-call, will be suspended until the
> flag is cleared again. During the suspension the vCPU will not run at all,
> neither receiving events nor running other code. It will wake up from
> suspension once the vCPU it is waiting on clears the inhibit flag. This
> behaviour is specified in Microsoft's "Hypervisor Top Level Functional
> Specification" (TLFS).
> 
> The vCPU will block execution during the suspension, making it transparent
> to the hypervisor.
 
s/hypervisor/VMM.  In the world of KVM, the typical terminology is that KVM
itself is the hypervisor, and the userspace side is the VMM.  It's not perfect,
but it's good enough and fairly ubiquitous at this point, and thus many readers
will be quite confused as to how a vCPU blocking is transparent to KVM :-)



Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state

2024-10-15 Thread Sean Christopherson
On Tue, Oct 15, 2024, Vitaly Kuznetsov wrote:
> Nikolas Wipper  writes:
> 
> > On 10.10.24 10:57, Vitaly Kuznetsov wrote:
> 
> ...
> 
> >>>  int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu);
> >>> +
> >>> +static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
> >>> +{
> >>> + return vcpu->arch.hyperv_enabled &&
> >>> +READ_ONCE(vcpu->arch.hyperv->suspended);
> >>
> >> I don't think READ_ONCE() means anything here, does it?
> >>
> >
> > It does prevent compiler optimisations and is actually required[1]. Also
> > it makes clear that this variable is shared, and may be accessed from
> > remote CPUs.
> >
> > [1] 
> > https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0124r6.html#Variable%20Access
> 
> It certainly does no harm but I think if we follow 'Loads from and
> stores to shared (but non-atomic) variables should be protected with the
> READ_ONCE(), WRITE_ONCE()' rule literally we will need to sprinkle them
> all over KVM/kernel ;-) And personally, this makes reading the code
> harder.
> 
> To my (very limited) knowledge, we really need READ_ONCE()s when we need
> to have some sort of a serialization, e.g. the moment when this read
> happens actually makes a difference. If we can e.g. use a local variable
> in the beginning of a function and replace all READ_ONCE()s with
> reading this local variable -- then we don't need READ_ONCE()s and are
> OK with possible compiler optimizations. Similar (reversed) thoughts go
> to WRITE_ONCE().
> 
> I think it's OK to keep them but it would be nice (not mandatory IMO,
> but nice) to have a comment describing which particular synchronization
> we are achieving (== the compiler optimization scenario we are protecting
> against). 
> 
> In this particular case, kvm_hv_vcpu_suspended() is inline so I briefly
> looked at all kvm_hv_vcpu_suspended() call sites (there are three) in
> your series but couldn't think of a place where the READ_ONCE() makes a
> real difference. kvm_hv_hypercall_complete() looks pretty safe
> anyway. kvm_hv_vcpu_unsuspend_tlb_flush() will be simplified
> significantly if we merge 'suspended' with 'waiting_on': instead of 
> 
>   kvm_for_each_vcpu(i, v, vcpu->kvm) {
>   vcpu_hv = to_hv_vcpu(v);
> 
>   if (kvm_hv_vcpu_suspended(v) &&
>   READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
> ...
> 
> you will have just
> 
>   kvm_for_each_vcpu(i, v, vcpu->kvm) {
>   vcpu_hv = to_hv_vcpu(v);
> 
>   if (vcpu_hv && vcpu_hv->waiting_on == vcpu->vcpu_id) {
> ...
> (and yes, I also think that READ_ONCE() is superfluous here, as real
> (non-speculative) write below can't happen _before_ the check )
> 
> The last one, kvm_vcpu_running(), should also be indifferent to
> READ_ONCE() in kvm_hv_vcpu_suspended(). I may had missed something, of
> course, but I hope you got my line of thought.

I don't think you're missing anything.  In general, all of this code is more
than a bit heavy-handed and lacks any kind of precision, which makes it
*really* hard to see what actually guarantees a vCPU won't get stuck blocking.

Writers synchronize SRCU and readers are required to acquire SRCU, but there's
no actual data tagged as being protected by SRCU, i.e. tlb_flush_inhibit should
be __rcu.

All of the {READ,WRITE}_ONCE() stuff provides some implicit compiler barriers,
but the actual protection to ensure a vCPU either observes inhibit=false or a
wake event is provided by the smp_wmb() in __kvm_make_request().

And from a performance perspective, synchronizing on kvm->srcu is going to be
susceptible to random slowdowns, because writers will have to wait until all
vCPUs drop SRCU, even if they have nothing to do with PV TLB flushes.  E.g. if
vCPUs are faulting in memory from swap, uninhibiting a TLB flush could be
stalled unnecessarily for an extended duration.

Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and
semantically misleading (there is no event to process).  At a glance,
KVM_REQ_UNBLOCK is likely more appropriate.
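
Roughly, the pairing that needs to hold looks like the following sketch (the
'tlb_flush_inhibit' flag is hypothetical, not the series' actual fields):

	/* Un-inhibiting vCPU: clear the flag, then kick the waiter.
	 * kvm_make_request() contains an smp_wmb(), so the waiter cannot
	 * observe the wake event without also observing inhibit == false.
	 */
	WRITE_ONCE(hv->tlb_flush_inhibit, false);
	kvm_make_request(KVM_REQ_UNBLOCK, waiter);

	/* Waiting vCPU: re-check the flag after arming the wakeup, so it
	 * either sees inhibit == false or consumes the pending request.
	 */
	if (!READ_ONCE(hv->tlb_flush_inhibit) ||
	    kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
		return;
	kvm_vcpu_block(vcpu);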

Before we spend too much time cleaning things up, I want to first settle on
the overall design, because it's not clear to me that punting
HvTranslateVirtualAddress to userspace is a net positive.  We agreed that
VTLs should be modeled primarily in userspace, but that doesn't automatically
make punting everything to userspace the best option, especially given the
discussion at KVM Forum with respect to implementing VTLs, VMPLs, TD
partitions, etc.

The cover letters for this series and KVM_TRANSLATE2 simply say they're needed
for HvTranslateVirtualAddress, but neither series nor Nicolas' patch to punt
HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split between userspace and
KVM.  And it very much is a split, because there are obviously a lot of details
around TlbFlushInhibit that bleed into KVM.

Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit?
The TLFS just says

  After the memory intercept routine performs instruction completion, it
  should clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend
  register.

but I can't find anything that says _how_ it clears TlbFlushInhibit.

Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state

2024-10-15 Thread Sean Christopherson
On Tue, Oct 15, 2024, Nicolas Saenz Julienne wrote:
> Hi Sean,
> 
> On Tue Oct 15, 2024 at 3:58 PM UTC, Sean Christopherson wrote:
> > Before we spend too much time cleaning things up, I want to first settle
> > on the overall design, because it's not clear to me that punting
> > HvTranslateVirtualAddress to userspace is a net positive.  We agreed that
> > VTLs should be modeled primarily in userspace, but that doesn't
> > automatically make punting everything to userspace the best option,
> > especially given the discussion at KVM Forum with respect to implementing
> > VTLs, VMPLs, TD partitions, etc.
> 
> Since you mention it, Paolo said he was going to prep a doc with an
> overview of the design we discussed there. Was it published? Did I miss
> it?

Nope, we're all hitting F5 mercilessly :-)



Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state

2024-10-15 Thread Sean Christopherson
On Tue, Oct 15, 2024, Nikolas Wipper wrote:
> On 15.10.24 17:58, Sean Christopherson wrote:
> > ...
> >
> > And from a performance perspective, synchronizing on kvm->srcu is going to
> > be susceptible to random slowdowns, because writers will have to wait
> > until all vCPUs drop SRCU, even if they have nothing to do with PV TLB
> > flushes.  E.g. if vCPUs are faulting in memory from swap, uninhibiting a
> > TLB flush could be stalled unnecessarily for an extended duration.
> >
> 
> This should be an easy fix, right? Just create an SRCU for the TLB flushes
> only.

Yes, this is a very solvable problem.  But while SRCU objects aren't expensive,
they aren't entirely free either.
 
> > Lastly, KVM_REQ_EVENT is a big hammer (triggers a lot of processing) and
> > semantically misleading (there is no event to process).  At a glance,
> > KVM_REQ_UNBLOCK is likely more appropriate.
> >
> > Before we spend too much time cleaning things up, I want to first settle
> > on the overall design, because it's not clear to me that punting
> > HvTranslateVirtualAddress to userspace is a net positive.  We agreed that
> > VTLs should be modeled primarily in userspace, but that doesn't
> > automatically make punting everything to userspace the best option,
> > especially given the discussion at KVM Forum with respect to implementing
> > VTLs, VMPLs, TD partitions, etc.
> >
> 
> I wasn't at the discussion, so maybe I'm missing something, but the hypercall
> still needs VTL awareness. 

Yeah, the KVM Forum discussion is relevant, because one of the key takeaways
from that discussion was that KVM will need some amount of VTL awareness.

> For one, it is primarily executed from VTL0 and primarily targets VTL1
> (primarily here means "thats what I see when I boot Windows Server 2019"), so
> it would need to know which vCPU is the corresponding VTL (this assumes one
> vCPU per VTL, as in the QEMU implementation).

Right, but even without the takeaways from KVM Forum, we need to look at the
big picture and come up with a design that makes the most sense.  E.g. if
making KVM aware of "struct kvm" objects that represent different VTLs in the
same VM greatly simplifies supporting HvTranslateVirtualAddress, then it's
likely worth doing.

> To make matters worse, the hypercall can also arbitrarily choose to target a
> different VP.  This would require a way to map (VP index, VTL) -> (vcpu_id)
> within KVM.

I don't think so.  The TLFS definition for TlbFlushInhibit give KVM a _lot_ of
wiggle room, e.g. KVM can retry the hypercall as many times as necessary.  To
honor TlbFlushInhibit, KVM just needs to ensure that flushes are blocked if any
vCPU at the target VTL is blocking flushes.  And to avoid hanging a vCPU, KVM
only needs to ensure a vCPU is awakened if it _might_ be able to make forward
progress.

I.e. I don't think KVM needs to be super precise when waking blocking vCPUs, and
thus there's no need to precisely track who is blocking whom.  I think :-)
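
E.g. a deliberately imprecise sketch along these lines (hypothetical helper,
not the series' code) would suffice, since a spuriously woken vCPU simply
re-checks the condition and re-blocks:

	static void kvm_hv_uninhibit_tlb_flushes(struct kvm *kvm)
	{
		struct kvm_vcpu *v;
		unsigned long i;

		/* Imprecise by design: kick every suspended vCPU and let
		 * each one recheck whether some vCPU at the target VTL
		 * still inhibits flushes; if so, it just blocks again.
		 */
		kvm_for_each_vcpu(i, v, kvm) {
			if (kvm_hv_vcpu_suspended(v))
				kvm_make_request(KVM_REQ_UNBLOCK, v);
		}
	}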
 
> > The cover letters for this series and KVM_TRANSLATE2 simply say they're
> > needed for HvTranslateVirtualAddress, but neither series nor Nicolas'
> > patch to punt HVCALL_TRANSLATE_VIRTUAL_ADDRESS[*] justifies the split
> > between userspace and KVM.  And it very much is a split, because there are
> > obviously a lot of details around TlbFlushInhibit that bleed into KVM.
> >
> > Side topic, what actually clears HvRegisterInterceptSuspend.TlbFlushInhibit?
> > The TLFS just says
> >
> >   After the memory intercept routine performs instruction completion, it
> >   should clear the TlbFlushInhibit bit of the HvRegisterInterceptSuspend
> >   register.
> >
> > but I can't find anything that says _how_ it clears TlbFlushInhibit.
> >
> 
> The register cannot be accessed using the HvSetVpRegisters hypercall, but
> the TLFS talks about it elsewhere. I'm assuming this is a formatting issue
> (there are a few elsewhere). In 15.5.1.3 it says
> 
>   To unlock the TLB, the higher VTL can clear this bit. Also, once a VP
>   returns to a lower VTL, it releases all TLB locks which it holds at the
>   time.
> 
> The QEMU implementation also just uninhibits on intercept exit, and that,
> at least, does not crash.

Hmm, it would be nice to bottom out on whether the higher VTL or the
VMM/hypervisor is responsible for clearing TlbFlushInhibit, because that may
influence KVM's design.



Re: [PATCH v5 4/5] KVM: selftests: Add test for PSCI SYSTEM_OFF2

2024-10-16 Thread Sean Christopherson
On Tue, Oct 15, 2024, Oliver Upton wrote:
> On Sat, Oct 12, 2024 at 10:28:10AM +0100, David Woodhouse wrote:
> > I suspect the real answer here is that the individual tests here be
> > calling ksft_test_result_pass(), and the system_off2 one should call
> > ksft_test_result_skip() if it skips?
> 
> modulo a few one-offs, KVM selftests doesn't use the kselftest harness
> so it isn't subject to this comment. Since there's no test plan, we can
> skip at any time.

FWIW, I have some ideas on how to use the nicer pieces of kselftest harness, if
anyone is looking for a project.  :-)

https://lore.kernel.org/all/zjuwqexpa5qvi...@google.com
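
For reference, the skip-vs-pass reporting discussed above boils down to
something like this sketch (kvm_has_cap() is the selftests helper; the
capability and test function names here are illustrative):

	if (kvm_has_cap(KVM_CAP_ARM_SYSTEM_OFF2)) {
		test_system_off2();
		ksft_test_result_pass("system_off2\n");
	} else {
		ksft_test_result_skip("SYSTEM_OFF2 not supported\n");
	}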



Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls

2024-11-13 Thread Sean Christopherson
On Wed, Nov 13, 2024, Paolo Bonzini wrote:
> On Tue, Nov 12, 2024 at 21:44, Doug Covelli wrote:
> 
> > > Split irqchip should be the best tradeoff. Without it, moves from cr8
> > > stay in the kernel, but moves to cr8 always go to userspace with a
> > > KVM_EXIT_SET_TPR exit. You also won't be able to use Intel
> > > flexpriority (in-processor accelerated TPR) because KVM does not know
> > > which bits are set in IRR. So it will be *really* every move to cr8
> > > that goes to userspace.
> >
> > Sorry to hijack this thread but is there a technical reason not to allow
> > CR8 based accesses to the TPR (not MMIO accesses) when the in-kernel local
> > APIC is not in use?
> 
> No worries, you're not hijacking :) The only reason is that it would be
> more code for a seldom used feature and anyway with worse performance. (To
> be clear, CR8 based accesses are allowed, but stores cause an exit in order
> to check the new TPR against IRR. That's because KVM's API does not have an
> equivalent of the TPR threshold as you point out below).
> 
> > Also I could not find these documented anywhere but with MSFT's APIC our
> > monitor relies on extensions for trapping certain events such as INIT/SIPI
> > plus LINT0 and SVR writes:
> >
> > UINT64 X64ApicInitSipiExitTrap: 1; //
> > WHvRunVpExitReasonX64ApicInitSipiTrap
> > UINT64 X64ApicWriteLint0ExitTrap  : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteLint1ExitTrap  : 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> > UINT64 X64ApicWriteSvrExitTrap: 1; //
> > WHvRunVpExitReasonX64ApicWriteTrap
> >
> 
> There's no need for this in KVM's in-kernel APIC model. INIT and SIPI are
> handled in the hypervisor and you can get the current state of APs via
> KVM_GET_MPSTATE. LINT0 and LINT1 are injected with KVM_INTERRUPT and
> KVM_NMI respectively, and they obey IF/PPR and NMI blocking respectively,
> plus the interrupt shadow; so there's no need for userspace to know when
> LINT0/LINT1 themselves change. The spurious interrupt vector register is
> also handled completely in kernel.
> 
> > I did not see any similar functionality for KVM.  Does anything like that
> > exist?  In any case we would be happy to add support for handling CR8
> > accesses w/o exiting w/o the in-kernel APIC along with some sort of a way
> > to configure the TPR threshold if folks are not opposed to that.
> 
> As far I know everybody who's using KVM (whether proprietary or open
> source) has had no need for that, so I don't think it's a good idea to make
> the API more complex. 

+1

> Performance of Windows guests is going to be bad anyway with userspace APIC.

Heh, on modern hardware, performance of any guest is going to suck with a
userspace APIC, compared to what is possible with an in-kernel APIC.

More importantly, I really, really don't want to encourage non-trivial usage of
a local APIC in userspace.  KVM's support for a userspace local APIC is very
poorly tested these days.  I have zero desire to spend any amount of time
reviewing and fixing issues that are unique to emulating the local APIC in
userspace.  And long term, I would love to force an in-kernel local APIC,
though I don't know if that's entirely feasible.



Re: [PATCH v2 07/25] KVM: VMX: Set intercept for FRED MSRs

2024-09-30 Thread Sean Christopherson
On Fri, Sep 27, 2024, Xin Li wrote:
> > > > When FRED is advertised to a guest, KVM should allow FRED SSP MSR
> > > > accesses by disabling FRED SSP MSR interception, no matter whether
> > > > supervisor shadow stacks are enabled or not.
> > > 
> > > KVM doesn't necessarily need to disable MSR interception, e.g. if the
> > > expectation is that the guest will rarely/never access the MSRs when
> > > CET is unsupported, then we're likely better off going with a
> > > trap-and-emulate model.  KVM needs to emulate RDMSR and WRMSR no matter
> > > what, e.g. in case the guest triggers a WRMSR when KVM is emulating,
> > > and so that userspace can get/set MSR values.
> > > 
> > > And this means that yes, FRED virtualization needs to land after CET
> > > virtualization, otherwise managing the conflicts/dependencies will be a
> > > nightmare.
> > > 
> 
> I still plan to send another iteration of the FRED patch set for review,
> however I haven't seen your x86 KVM changes land into Linus' tree, it
> will happen soon, right?

Yep, we squeaked into rc1, the pull request to Linus was delayed because of
travel and conferences.



Re: [PATCH v2 07/25] KVM: VMX: Set intercept for FRED MSRs

2024-09-25 Thread Sean Christopherson
On Wed, Sep 18, 2024, Xin Li wrote:
> > > MSR_IA32_FRED_SSP0 is an alias of the CET MSR_IA32_PL0_SSP and likely to
> > > be used in the same way as FRED RSP0, i.e., host FRED SSP0 _should_ be
> > > restored in arch_exit_to_user_mode_prepare().  However as of today Linux
> > > has no plan to utilize kernel shadow stack thus no one cares host FRED
> > > SSP0 (no?).  But lets say anyway it is host's responsibility to manage
> > > host FRED SSP0, then KVM only needs to take care of guest FRED SSP0
> > > (just like how KVM should handle guest FRED RSP0) even before the
> > > supervisor shadow stack feature is advertised to guest.
> > 
> > Heh, I'm not sure what your question is, or if there even is a question.  
> > KVM
> > needs to context switch FRED SSP0 if FRED is exposed to the guest, but 
> > presumably
> > that will be done through XSAVE state?  If that's the long term plan, I 
> > would
> > prefer to focus on merging CET virtualization first, and then land FRED 
> > virtualization
> > on top so that KVM doesn't have to carry intermediate code to deal with the 
> > aliased
> > MSR.
> 
> You mean the following patch set, right?

Yep, and presumably the KVM support as well:

https://lore.kernel.org/all/20240219074733.122080-1-weijiang.y...@intel.com

> https://lore.kernel.org/kvm/20240531090331.13713-1-weijiang.y...@intel.com/

...

> > Ugh, but what happens if a CPU (or the host kernel) supports FRED but not
> > CET SS?  Or is that effectively an illegal combination?
> 
> The FRED Spec says:
> 
> IA32_FRED_SSP1, IA32_FRED_SSP2, and IA32_FRED_SSP3 (MSR indices 1D1H–
> 1D3H). Together with the existing MSR IA32_PL0_SSP (MSR index 6A4H), these
> are the FRED SSP MSRs.
> 
> The FRED SSP MSRs are supported by any processor that enumerates
> CPUID.(EAX=7,ECX=1):EAX.FRED[bit 17] as 1. If such a processor does not
> support CET, FRED transitions will not use the MSRs (because shadow stacks
> are not enabled), but the MSRs would still be accessible using RDMSR and
> WRMSR.
> 
> 
> So they are independent, just that FRED SSP MSRs are NOT used if
> supervisor shadow stacks are not enabled (obviously Qemu can be
> configured to not advertise CET but FRED).
> 
> When FRED is advertised to a guest, KVM should allow FRED SSP MSR accesses
> by disabling FRED SSP MSR interception, no matter whether supervisor shadow
> stacks are enabled or not.

KVM doesn't necessarily need to disable MSR interception, e.g. if the
expectation is that the guest will rarely/never access the MSRs when CET is
unsupported, then we're likely better off going with a trap-and-emulate model.
KVM needs to emulate RDMSR and WRMSR no matter what, e.g. in case the guest
triggers a WRMSR when KVM is emulating, and so that userspace can get/set MSR
values.

And this means that yes, FRED virtualization needs to land after CET
virtualization, otherwise managing the conflicts/dependencies will be a
nightmare.



Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls

2024-11-07 Thread Sean Christopherson
On Mon, Nov 04, 2024, Zack Rusin wrote:
> On Mon, Nov 4, 2024 at 5:13 PM Paolo Bonzini  wrote:
> >
> > On Wed, Oct 30, 2024 at 4:35 AM Zack Rusin  wrote:
> > >
> > > VMware products handle hypercalls in userspace. Give KVM the ability
> > > to run VMware guests unmodified by forwarding all hypercalls to
> > > userspace.
> > >
> > > Enabling of the KVM_CAP_X86_VMWARE_HYPERCALL_ENABLE capability turns
> > > the feature on - it's off by default. This allows vmx's built on top
> > > of KVM to support VMware specific hypercalls.
> >
> > Hi Zack,
> 
> Hi, Paolo.
> 
> Thank you for looking at this.
> 
> > is there a spec of the hypercalls that are supported by userspace? I
> > would like to understand if there's anything that's best handled in
> > the kernel.
> 
> There's no spec but we have open headers listing the hypercalls.
> There's about a 100 of them (a few were deprecated), the full
> list starts here:
> https://github.com/vmware/open-vm-tools/blob/739c5a2f4bfd4cdda491e6a6f6869d88c0bd6972/open-vm-tools/lib/include/backdoor_def.h#L97
> They're not well documented, but the names are pretty self-explanatory.

At a quick glance, this one needs to be handled in KVM:

  BDOOR_CMD_VCPU_MMIO_HONORS_PAT

and these probably should be in KVM:

  BDOOR_CMD_GETTIME
  BDOOR_CMD_SIDT
  BDOOR_CMD_SGDT
  BDOOR_CMD_SLDT_STR
  BDOOR_CMD_GETTIMEFULL
  BDOOR_CMD_VCPU_LEGACY_X2APIC_OK
  BDOOR_CMD_STEALCLOCK

and these maybe? (it's not clear what they do, from the name alone)

  BDOOR_CMD_GET_VCPU_INFO
  BDOOR_CMD_VCPU_RESERVED
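
For readers without the VMware background: the "backdoor" is not a VMCALL,
it's a magic I/O access.  Roughly, from the guest side (a sketch based on the
open-vm-tools headers; per-command register usage varies):

	#include <stdint.h>

	#define BDOOR_MAGIC	0x564D5868	/* "VMXh" */
	#define BDOOR_PORT	0x5658

	static inline uint32_t bdoor_cmd(uint32_t cmd, uint32_t *out_ebx)
	{
		uint32_t eax = BDOOR_MAGIC, ebx = ~0u, ecx = cmd, edx = BDOOR_PORT;

		/* The IN never reaches a real port; the hypervisor intercepts
		 * the access and rewrites the registers with the result.
		 */
		asm volatile("inl %%dx, %%eax"
			     : "+a" (eax), "+b" (ebx), "+c" (ecx), "+d" (edx));
		if (out_ebx)
			*out_ebx = ebx;
		return eax;
	}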

> > If we allow forwarding _all_ hypercalls to userspace, then people will
> > use it for things other than VMware and there goes all hope of
> > accelerating stuff in the kernel in the future.

To some extent, that ship has sailed, no?  E.g. do KVM_XEN_HVM_CONFIG with
KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL set, and userspace can intercept pretty much
all hypercalls with very few side effects.
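
E.g. the existing opt-in, from userspace's side (a sketch using the documented
KVM_XEN_HVM_CONFIG API):

	struct kvm_xen_hvm_config cfg = {
		.flags = KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL,
	};

	/* After this, Xen hypercalls exit to userspace as KVM_EXIT_XEN. */
	ioctl(vm_fd, KVM_XEN_HVM_CONFIG, &cfg);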

> > So even having _some_ checks in the kernel before going out to
> > userspace would keep that door open, or at least try.
> 
> Doug just looked at this and I think I might have an idea on how to
> limit the scope at least a bit: if you think it would help we could
> limit forwarding of hypercalls to userspace only to those that that
> come with a BDOOR_MAGIC (which is 0x564D5868) in eax. Would that help?

I don't think it addresses Paolo's concern (if I understood Paolo's concern
correctly), but it would help from the perspective of allowing KVM to support
VMware hypercalls and Xen/Hyper-V/KVM hypercalls in the same VM.

I also think we should add CONFIG_KVM_VMWARE from the get-go, and if we're
feeling lucky, maybe even retroactively bury KVM_CAP_X86_VMWARE_BACKDOOR
behind that Kconfig.  That would allow limiting the exposure to VMware
specific code, e.g. if KVM does end up handling hypercalls in-kernel.  And it
might deter abuse to some extent.
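
A sketch of such a Kconfig entry (hypothetical wording, default up for
debate):

	config KVM_VMWARE
		bool "VMware virtualization extensions support"
		depends on KVM
		default y
		help
		  Enables KVM handling of VMware guests, e.g. the VMware
		  backdoor, and gates any VMware-specific capabilities.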



Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls

2025-02-03 Thread Sean Christopherson
On Mon, Feb 03, 2025, Doug Covelli wrote:
> On Mon, Feb 3, 2025 at 1:22 PM Paolo Bonzini  wrote:
> >
> > On Mon, Feb 3, 2025 at 5:35 PM Doug Covelli wrote:
> > > OK.  It seems like fully embracing the in-kernel APIC is the way to go,
> > > especially considering it really simplifies using KVM's support for
> > > nested virtualization.  Speaking of nested virtualization, we have been
> > > working on adding support for that and would like to propose a couple
> > > of changes:
> > >
> > > - Add an option for L0 to handle backdoor accesses from CPL3 code
> > > running in L2.  On a #GP, nested_vmx_l0_wants_exit can check if this
> > > option is enabled and KVM can handle the #GP like it would if it had
> > > been from L1 (exit to userlevel iff it is a backdoor access, otherwise
> > > deliver the fault to L2).  When combined with enable_vmware_backdoor
> > > this will allow L0 to optionally handle backdoor accesses from CPL3
> > > code running in L2.  This is needed for cases such as running VMware
> > > tools in a Windows VM with VBS enabled.  For other cases such as
> > > running tools in a Windows VM in an ESX VM we still want L1 to handle
> > > the backdoor accesses from L2.
> >
> > I think this makes sense and could be an argument to KVM_ENABLE_CAP.
> >
> > > - Extend KVM_EXIT_MEMORY_FAULT for permission faults (e.g. the guest
> > > attempting to write to a page that has been protected by userlevel
> > > calling mprotect()).  This is useful for cases where we want synchronous
> > > detection of guest writes, such as lazy snapshots (dirty page tracking
> > > is no good for this case).  Currently permission faults result in
> > > KVM_RUN returning EFAULT, which we handle by interpreting the
> > > instruction, as we do not know the guest physical address associated
> > > with the fault.
> >
> > Yes, this makes sense too, though you might want to look into
> > userfaultfd as well.
> >
> > We had something planned using attributes, but I don't see any issue
> > extending it to EFAULT. Maybe it would have to be yet another
> > KVM_ENABLE_CAP; considering that it would break your existing code,
> > there might be someone else in the wild doing it.
> 
> It looks like KVM_EXIT_MEMORY_FAULT was implemented in such a way that it
> won't break existing code:
> 
> Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in
> that it accompanies a return code of ‘-1’, not ‘0’! errno will always
> be set to EFAULT or EHWPOISON when KVM exits with
> KVM_EXIT_MEMORY_FAULT, userspace should assume kvm_run.exit_reason is
> stale/undefined for all other error numbers.
> 
> That being said we could certainly make this opt-in if that is preferable.

-EFAULT isn't the problem, KVM not being able to return useful information in
all situations is the issue.  Specifically, "guest" accesses that are emulated
by KVM are problematic, because the -EFAULT from e.g. __kvm_write_guest_page()
is disconnected from the code that actually kicks out to userspace.  In that
case, userspace will get KVM_EXIT_MMIO, not -EFAULT.  There are more problems
beyond KVM_EXIT_MMIO vs. -EFAULT, e.g. instructions that perform multiple memory
accesses, "failures" that are squashed and never propagated to userspace (PV
features tend to do this), page splits, etc.

In general, I don't expect most KVM access to guest memory to Just Work, as I
doubt KVM will behave as you want.

We spent a lot of time trying to sort out a viable approach in the context of
the USERFAULT_ON_MISSING series[1], and ultimately gave up (ignoring that we
postponed the entire series)[2], because we decided that fully solving KVM
accesses would require an absurd amount of effort and churn, and wasn't at all
necessary for the userfault use case.

What exactly needs to happen on "synchronous detection of guest writes"?  One
idea (which may be horribly flawed as I have put *very* little thought into it)
would be to implement a module (or KVM extension) that utilizes KVM's "external"
write-tracking APIs to get the synchronous notifications (see
arch/x86/include/asm/kvm_page_track.h).

[1] https://lore.kernel.org/all/zin6vqsebtrn1...@google.com
[2] https://lore.kernel.org/all/zr88w9w62qszd...@google.com
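
A bare-bones sketch of the write-tracking idea (the callback signature follows
the current kvm_page_track.h and may differ on older kernels; everything else
is hypothetical):

	static void track_write(gpa_t gpa, const u8 *new, int bytes,
				struct kvm_page_track_notifier_node *node)
	{
		/* Invoked synchronously when the guest writes a tracked gfn. */
	}

	static struct kvm_page_track_notifier_node track_node = {
		.track_write = track_write,
	};

	/* Register the notifier, then write-track the gfns of interest,
	 * e.g. via kvm_write_track_add_gfn().
	 */
	kvm_page_track_register_notifier(kvm, &track_node);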



Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls

2025-02-03 Thread Sean Christopherson
On Mon, Feb 03, 2025, Paolo Bonzini wrote:
> On 2/3/25 20:41, Sean Christopherson wrote:
> > -EFAULT isn't the problem, KVM not being able to return useful
> > information in all situations is the issue.
> 
> Yes, that's why I don't want it to be an automatically opted-in API.  If
> incremental improvements are possible, it may be useful to allow interested
> userspace to enable it early.  For example...
> 
> > Specifically, "guest" accesses that are emulated
> > by KVM are problematic, because the -EFAULT from e.g. 
> > __kvm_write_guest_page()
> > is disconnected from the code that actually kicks out to userspace.  In that
> > case, userspace will get KVM_EXIT_MMIO, not -EFAULT.  There are more 
> > problems
> > beyond KVM_EXIT_MMIO vs. -EFAULT, e.g. instructions that perform multiple 
> > memory
> > accesses,
> 
> those are obviously synchronous and I expect VMware to handle them already.
> 
> That said my preferred solution to just use userfaultfd, which is
> synchronous by definition.

Oh, right, userfaultfd would be far better than piggybacking write-tracking.



Re: [PATCH 00/15] KVM: x86: Introduce new ioctl KVM_TRANSLATE2

2024-12-11 Thread Sean Christopherson
On Tue, Sep 10, 2024, Nikolas Wipper wrote:
> This series introduces a new ioctl KVM_TRANSLATE2, which expands on
> KVM_TRANSLATE. It is required to implement Hyper-V's
> HvTranslateVirtualAddress hyper-call as part of the ongoing effort to
> emulate HyperV's Virtual Secure Mode (VSM) within KVM and QEMU. The hyper-
> call requires several new KVM APIs, one of which is KVM_TRANSLATE2, which
> implements the core functionality of the hyper-call. The rest of the
> required functionality will be implemented in subsequent series.
> 
> Other than translating guest virtual addresses, the ioctl allows the
> caller to control whether the access and dirty bits are set during the
> page walk. It also allows specifying an access mode instead of returning
> viable access modes, which enables setting the bits up to the level that
> caused a failure. Additionally, the ioctl provides more information about
> why the page walk failed, and which page table is responsible. This
> functionality is not available within KVM_TRANSLATE, and can't be added
> without breaking backwards compatibility, thus a new ioctl is required.

...

>  Documentation/virt/kvm/api.rst| 131 
>  arch/x86/include/asm/kvm_host.h   |  18 +-
>  arch/x86/kvm/hyperv.c |   3 +-
>  arch/x86/kvm/kvm_emulate.h|   8 +
>  arch/x86/kvm/mmu.h|  10 +-
>  arch/x86/kvm/mmu/mmu.c|   7 +-
>  arch/x86/kvm/mmu/paging_tmpl.h|  80 +++--
>  arch/x86/kvm/x86.c| 123 ++-
>  include/linux/kvm_host.h  |   6 +
>  include/uapi/linux/kvm.h  |  33 ++
>  tools/testing/selftests/kvm/Makefile  |   1 +
>  .../selftests/kvm/x86_64/kvm_translate2.c | 310 ++
>  virt/kvm/kvm_main.c   |  41 +++
>  13 files changed, 724 insertions(+), 47 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/kvm_translate2.c

...

> The simple reason for keeping this functionality in KVM, is that it already
> has a mature, production-level page walker (which is already exposed) and
> creating something similar QEMU would take a lot longer and would be much
> harder to maintain than just creating an API that leverages the existing
> walker.

I'm not convinced that implementing targeted support in QEMU (or any other VMM)
would be at all challenging or a burden to maintain.  I do think duplicating
functionality across multiple VMMs is undesirable, but that's an argument for
creating modular userspace libraries for such functionality.  E.g. I/O APIC
emulation is another one I'd love to move to a common library.

Traversing page tables isn't difficult.  Checking permission bits isn't complex.
Tedious, perhaps.  But not complex.  KVM's rather insane code comes from KVM's
desire to make the checks as performant as possible, because eking out every
little bit of performance matters for legacy shadow paging.  I doubt VSM needs
_that_ level of performance.

I say "targeted", because I assume the only use case for VSM is 64-bit 
non-nested
guests.  QEMU already has a rudimentary supporting for walking guest page 
tables,
and that code is all of 40 LoC.  Granted, it's heinous and lacks permission 
checks
and A/D updates, but I would expect a clean implementation with permission 
checks
and A/D support would clock in around 200 LoC.  Maybe 300.
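
For a sense of scale, a minimal walker along those lines might look like the
sketch below (4-level paging only, 4KiB pages, no permission checks or A/D
updates; read_gpa() is a hypothetical helper that reads guest physical
memory):

	static bool walk_gva(uint64_t cr3, uint64_t gva, uint64_t *gpa)
	{
		uint64_t table = cr3 & 0x000ffffffffff000ull;
		int shift;

		for (shift = 39; shift >= 12; shift -= 9) {
			uint64_t pte = read_gpa(table + (((gva >> shift) & 0x1ff) * 8));

			if (!(pte & 1))				/* not present */
				return false;
			if (shift > 12 && (pte & (1ull << 7)))	/* huge page, punt */
				return false;
			table = pte & 0x000ffffffffff000ull;
		}

		*gpa = table | (gva & 0xfff);
		return true;
	}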

And ignoring docs and selftests, that's roughly what's being added in this
series.  Much of the code being added is quite simple, but there are
non-trivial changes here as well.  E.g. the different ways of setting A/D bits.

My biggest concern is taking on ABI that restricts what KVM can do in its
walker.  E.g. I *really* don't like the PKU change.  Yeah, Intel doesn't
explicitly define architectural behavior, but diverging from hardware behavior
is rarely a good idea.

Similarly, the behavior of FNAME(protect_clean_gpte)() probably isn't desirable
for the VSM use case.



Re: [PATCH 14/15] KVM: x86: Implement KVM_TRANSLATE2

2024-12-11 Thread Sean Christopherson
On Tue, Sep 10, 2024, Nikolas Wipper wrote:
> +int kvm_arch_vcpu_ioctl_translate2(struct kvm_vcpu *vcpu,
> + struct kvm_translation2 *tr)
> +{
> + int idx, set_bit_mode = 0, access = 0;
> + struct x86_exception exception = { };
> + gva_t vaddr = tr->linear_address;
> + u16 status = 0;
> + gpa_t gpa;
> +
> + if (tr->flags & KVM_TRANSLATE_FLAGS_SET_ACCESSED)
> + set_bit_mode |= PWALK_SET_ACCESSED;
> + if (tr->flags & KVM_TRANSLATE_FLAGS_SET_DIRTY)
> + set_bit_mode |= PWALK_SET_DIRTY;
> + if (tr->flags & KVM_TRANSLATE_FLAGS_FORCE_SET_ACCESSED)
> + set_bit_mode |= PWALK_FORCE_SET_ACCESSED;
> +
> + if (tr->access & KVM_TRANSLATE_ACCESS_WRITE)
> + access |= PFERR_WRITE_MASK;
> + if (tr->access & KVM_TRANSLATE_ACCESS_USER)
> + access |= PFERR_USER_MASK;
> + if (tr->access & KVM_TRANSLATE_ACCESS_EXEC)
> + access |= PFERR_FETCH_MASK;

WRITE and FETCH accesses need to be mutually exclusive.
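
I.e. on top of the flag translation above, something like (illustrative
placement):

	/* A single access cannot be both a write and an instruction fetch. */
	if ((tr->access & KVM_TRANSLATE_ACCESS_WRITE) &&
	    (tr->access & KVM_TRANSLATE_ACCESS_EXEC))
		return -EINVAL;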



Re: [PATCH 2/3] KVM: x86: Add support for VMware guest specific hypercalls

2024-12-17 Thread Sean Christopherson
On Thu, Dec 12, 2024, Doug Covelli wrote:
> On Thu, Nov 14, 2024 at 10:45 AM Doug Covelli wrote:
> > > For LINT1, it should be less performance critical; if it's possible
> > > to just go through all vCPUs, and do KVM_GET_LAPIC to check who you
> > > should send a KVM_NMI to, then I'd do that.  I'd also accept a patch
> > > that adds a VM-wide KVM_NMI ioctl that does the same in the hypervisor
> > > if it's useful for you.
> >
> > Thanks for the patch - I'll get it a try but it might not be right away.
> >
> > > And since I've been proven wrong already, what do you need INIT/SIPI for?
> >
> > I don't think this one is as critical.  I believe the reason it was
> > added was so that we can synchronize startup of the APs with execution
> > of the BSP for guests that do not do a good job of that (Windows).
> >
> > Doug
> 
> We were able to get the in-kernel APIC working with our code using the split
> IRQ chip option with our virtual EFI FW even w/o the traps for SVR and LVT0
> writes.  Performance of Windows VMs is greatly improved as expected.
> Unfortunately our ancient legacy BIOS will not work with > 1 VCPU due to lack
> of support for IPIs with an archaic delivery mode of remote read which it uses
> to discover APs by attempting to read their APIC ID register.  MSFT WHP
> supports this functionality via an option,
> WHvPartitionPropertyCodeApicRemoteReadSupport.
> 
> Changing our legacy BIOS is not an option so in order to support Windows VMs
> with the legacy BIOS with decent performance we would either need to add
> support for remote reads of the APIC ID register to KVM or support CR8
> accesses w/o exiting w/o the in-kernel APIC.  Do you have a preference?

I didn't quite follow the CR8 access thing.  If the choice is between emulating
Remote Read IPIs and using a userspace local APIC, then I vote with both hands
for emulating Remote Reads, especially if we can do a half-assed version that
provides only what your crazy BIOS needs :-)

The biggest wrinkle I can think of is that KVM uses the Remote Read IPI encoding
for a paravirt vCPU kick feature, but I doubt that's used by Windows guests and
so can be sacrificed on the Altar of Ancient BIOS.



Re: [PATCH 04/17] KVM: arm64: Cleanup PMU includes

2025-06-02 Thread Sean Christopherson
On Mon, Jun 02, 2025, Colton Lewis wrote:
> * Delete kvm/arm_pmu.h. These functions are mostly internal to KVM and
>   should go in asm/kvm_host.h.

Ha!  I'm a hair too late, as usual.  I _just_ resurrected a patch[*] to move and
rename all of the <kvm/*.h> headers to <asm/*.h> and <linux/*.h>.  If only I had
posted on Friday when they were ready :-)

It's a relatively small series (mostly arm64 code movement), but it does touch
all architectures due to giving the same treatment to kvm/iodev.h (and purging
include/kvm entirely).

Any preference/thoughts on how to proceed?  My stuff obviously isn't urgent
since I sat on the patches for almost two years.  On the other hand, the
almost pure code movement would be a nice precursor to this patch, e.g. move
and rename to asm/kvm_pmu.h before extracting chunks of code into
asm/kvm_host.h.

[*] https://lore.kernel.org/all/20230916003118.2540661-15-sea...@google.com


Diff stats for context:
---
Anish Ghulati (1):
  KVM: arm64: Move arm_{psci,hypercalls}.h to an internal KVM path

Sean Christopherson (7):
  KVM: arm64: Include KVM headers to get forward declarations
  KVM: arm64: Move ARM specific headers in include/kvm to arch directory
  KVM: Move include/kvm/iodev.h to include/linux as kvm_iodev.h
  KVM: MIPS: Stop adding virt/kvm to the arch include path
  KVM: PPC: Stop adding virt/kvm to the arch include path
  KVM: s390: Stop adding virt/kvm to the arch include path
  KVM: Standardize include paths across all architectures

 MAINTAINERS| 1 -
 .../arm64/include/asm/kvm_arch_timer.h | 2 ++
 arch/arm64/include/asm/kvm_host.h  | 7 +++
 include/kvm/arm_pmu.h => arch/arm64/include/asm/kvm_pmu.h  | 2 ++
 .../kvm/arm_vgic.h => arch/arm64/include/asm/kvm_vgic.h| 2 +-
 arch/arm64/kvm/Makefile| 2 --
 arch/arm64/kvm/arch_timer.c| 5 ++---
 arch/arm64/kvm/arm.c   | 6 +++---
 {include => arch/arm64}/kvm/arm_hypercalls.h   | 0
 {include => arch/arm64}/kvm/arm_psci.h | 0
 arch/arm64/kvm/guest.c | 2 +-
 arch/arm64/kvm/handle_exit.c   | 2 +-
 arch/arm64/kvm/hyp/Makefile| 6 +++---
 arch/arm64/kvm/hyp/include/hyp/switch.h| 4 ++--
 arch/arm64/kvm/hyp/nvhe/switch.c   | 4 ++--
 arch/arm64/kvm/hyp/vhe/switch.c| 4 ++--
 arch/arm64/kvm/hypercalls.c| 4 ++--
 arch/arm64/kvm/pmu-emul.c  | 4 ++--
 arch/arm64/kvm/psci.c  | 4 ++--
 arch/arm64/kvm/pvtime.c| 2 +-
 arch/arm64/kvm/reset.c | 3 +--
 arch/arm64/kvm/trace_arm.h | 2 +-
 arch/arm64/kvm/trng.c  | 2 +-
 arch/arm64/kvm/vgic/vgic-debug.c   | 2 +-
 arch/arm64/kvm/vgic/vgic-init.c| 2 +-
 arch/arm64/kvm/vgic/vgic-irqfd.c   | 2 +-
 arch/arm64/kvm/vgic/vgic-kvm-device.c  | 2 +-
 arch/arm64/kvm/vgic/vgic-mmio-v2.c | 4 ++--
 arch/arm64/kvm/vgic/vgic-mmio-v3.c | 4 ++--
 arch/arm64/kvm/vgic/vgic-mmio.c| 6 +++---
 arch/arm64/kvm/vgic/vgic-v2.c  | 2 +-
 arch/arm64/kvm/vgic/vgic-v3-nested.c   | 3 +--
 arch/arm64/kvm/vgic/vgic-v3.c  | 2 +-
 arch/loongarch/include/asm/kvm_eiointc.h   | 2 +-
 arch/loongarch/include/asm/kvm_ipi.h   | 2 +-
 arch/loongarch/include/asm/kvm_pch_pic.h   | 2 +-
 arch/mips/include/asm/kvm_host.h   | 3 +--
 arch/mips/kvm/Makefile | 2 --
 arch/powerpc/kvm/Makefile  | 2 --
 arch/powerpc/kvm/mpic.c| 2 +-
 arch/riscv/kvm/Makefile| 2 --
 arch/riscv/kvm/aia_aplic.c | 2 +-
 arch/riscv/kvm/aia_imsic.c | 2 +-
 arch/s390/kvm/Makefile | 2 --
 arch/x86/kvm/Makefile  | 1 -
 arch/x86/kvm/i8254.h   | 2 +-
 arch/x86/kvm/ioapic.h  | 2 +-
 arch/x86/kvm/irq.h | 2 +-
 arch/x86/kvm/lapic.h   | 2 +-
 include/{kvm/iodev.h => linux/kvm_iodev.h} | 0
 virt/kvm/Makefile.kvm  | 2 ++
 virt/kvm/coalesced_mmio.c