[RFC 0/33] KVM: x86: hyperv: Introduce VSM support
Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature that leverages the hypervisor to create secure execution environments within a guest. VSM is documented as part of Microsoft's Hypervisor Top Level Functional Specification [1]. Security features that build upon VSM, like Windows Credential Guard, are enabled by default on Windows 11, and are becoming a prerequisite in some industries.

This RFC series introduces the necessary infrastructure to emulate VSM-enabled guests. It is a snapshot of the progress we have made so far, and its main goal is to gather design feedback, specifically on the KVM APIs we introduce. For a high-level design overview, see the documentation in patch 33. Additionally, this topic will be discussed as part of the KVM Microconference at this year's Linux Plumbers Conference [2].

The series is accompanied by two repositories:
 - A PoC QEMU implementation of VSM [3].
 - VSM kvm-unit-tests [4].

Note that this isn't a full VSM implementation. For now it only supports 2 VTLs, and only supports uniprocessor guests. It is capable of booting Windows Server 2016/2019, but is unstable during runtime.

The series is based on the v6.6 kernel release, and depends on the introduction of KVM memory attributes, which is being worked on independently in "KVM: guest_memfd() and per-page attributes" [5]. A full Linux tree is also made available [6].

Series rundown:
 - Patch 2 introduces the concept of APIC ID groups.
 - Patches 3-12 introduce the VSM capability and basic VTL awareness into Hyper-V emulation.
 - Patch 13 introduces vCPU polling support.
 - Patches 14-31 use KVM's memory attributes to implement VTL memory protections, and introduce the VTL KVM device and secure memory intercepts.
 - Patch 32 is a temporary implementation of HVCALL_TRANSLATE_VIRTUAL_ADDRESS, necessary to boot Windows Server 2019.
 - Patch 33 introduces documentation.

Our intention is to integrate the feedback gathered through this RFC and at LPC while we finish the VSM implementation. In the future, we will split the series into distinct feature patch sets and upstream these independently.

Thanks,
Nicolas

[1] https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf
[2] https://lpc.events/event/17/sessions/166/#20231114
[3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1
[4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1
[5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonz...@redhat.com/.
[6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1. There are also two small dependencies with https://marc.info/?l=kvm&m=167887543028109&w=2 and https://lkml.org/lkml/2023/10/17/972
[RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h
lapic.h has no dependency on hyperv.h, so don't include it there. Additionally, cpuid.c implicitly relied on hyperv.h's inclusion through lapic.h, so include it explicitly there. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/cpuid.c | 1 + arch/x86/kvm/lapic.h | 1 - 2 files changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 773132c3bf5a..eabd5e9dc003 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -28,6 +28,7 @@ #include "trace.h" #include "pmu.h" #include "xen.h" +#include "hyperv.h" /* * Unlike "struct cpuinfo_x86.x86_capability", kvm_cpu_caps doesn't need to be diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h index 0a0ea4b5dd8c..e1021517cf04 100644 --- a/arch/x86/kvm/lapic.h +++ b/arch/x86/kvm/lapic.h @@ -6,7 +6,6 @@ #include -#include "hyperv.h" #include "smm.h" #define KVM_APIC_INIT 0 -- 2.40.1
[RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
From: Anel Orazgaliyeva Introduce KVM_CAP_APIC_ID_GROUPS. This capability segments the VM's APIC ids into two parts. The lower bits, the physical APIC id, represent the part that's exposed to the guest. The higher bits, which are private to KVM, group APICs together. APICs in different groups are isolated from each other, and IPIs can only be directed at APICs that share the same group as their source. Furthermore, groups are only relevant to IPIs; anything incoming from outside the local APIC complex (the IOAPIC, MSIs, or PV-IPIs) is targeted at the default APIC group, group 0. When routing IPIs with physical destinations, KVM will OR the source vCPU's APIC group with the ICR's destination ID and use that to resolve the target lAPIC. The APIC physical map is also made group aware in order to speed up this process. For the sake of simplicity, the logical map is not built while KVM_CAP_APIC_ID_GROUPS is in use, and we defer IPI routing to the slower per-vCPU scan method. This capability serves as a building block to implement virtualisation-based security features like Hyper-V's Virtual Secure Mode (VSM). VSM introduces a para-virtualised switch that allows guest CPUs to jump into a different execution context; this switches into a different CPU state, lAPIC state, and memory protections. We model this in KVM by using distinct kvm_vcpus for each context. Moreover, execution contexts are hierarchical and their APICs are meant to remain functional even when the context isn't 'scheduled in'. For example, we have to keep track of timers' expirations, and interrupt execution of lesser-priority contexts when relevant. Hence the need to alias physical APIC ids, while keeping the ability to target specific execution contexts. Signed-off-by: Anel Orazgaliyeva Co-developed-by: Nicolas Saenz Julienne Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/kvm_host.h | 3 ++ arch/x86/include/uapi/asm/kvm.h | 5 +++ arch/x86/kvm/lapic.c| 59 - arch/x86/kvm/lapic.h| 33 ++ arch/x86/kvm/x86.c | 15 + include/uapi/linux/kvm.h| 2 ++ 6 files changed, 108 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dff10051e9b6..a2f224f95404 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1298,6 +1298,9 @@ struct kvm_arch { struct rw_semaphore apicv_update_lock; unsigned long apicv_inhibit_reasons; + u32 apic_id_group_mask; + u8 apic_id_group_shift; + gpa_t wall_clock; bool mwait_in_guest; diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index a448d0964fc0..f73d137784d7 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -565,4 +565,9 @@ struct kvm_pmu_event_filter { #define KVM_X86_DEFAULT_VM 0 #define KVM_X86_SW_PROTECTED_VM1 +/* for KVM_SET_APIC_ID_GROUPS */ +struct kvm_apic_id_groups { + __u8 n_bits; /* nr of bits used to represent group in the APIC ID */ +}; + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 3e977dbbf993..f55d216cb2a0 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -141,7 +141,7 @@ static inline int apic_enabled(struct kvm_lapic *apic) static inline u32 kvm_x2apic_id(struct kvm_lapic *apic) { - return apic->vcpu->vcpu_id; + return kvm_apic_id(apic->vcpu); } static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu) @@ -219,8 +219,8 @@ static int kvm_recalculate_phys_map(struct kvm_apic_map *new, bool *xapic_id_mismatch) { struct kvm_lapic *apic = vcpu->arch.apic; - u32
x2apic_id = kvm_x2apic_id(apic); - u32 xapic_id = kvm_xapic_id(apic); + u32 x2apic_id = kvm_apic_id_and_group(vcpu); + u32 xapic_id = kvm_apic_id_and_group(vcpu); u32 physical_id; /* @@ -299,6 +299,13 @@ static void kvm_recalculate_logical_map(struct kvm_apic_map *new, u16 mask; u32 ldr; + /* +* Using maps for logical destinations when KVM_CAP_APIC_ID_GROUPS is in +* use isn't supported. +*/ + if (kvm_apic_group(vcpu)) + new->logical_mode = KVM_APIC_MODE_MAP_DISABLED; + if (new->logical_mode == KVM_APIC_MODE_MAP_DISABLED) return; @@ -370,6 +377,25 @@ enum { DIRTY }; +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm, + struct kvm_apic_id_groups *groups) +{ + u8 n_bits = groups->n_bits; + + if (n_bits > 32) + return -EINVAL; + + kvm->arch.apic_id_group_mask = n_bits ? GENMASK(31, 32 - n_bits): 0; + /* +* Bitshifts >= than the width of the type are UD, so set the +* apic group shift to 0 when n_bits == 0. The group mask above will +*
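
For reference, a minimal user-space sketch of how a VMM might configure APIC ID groups. This is purely illustrative and not part of the patch; it assumes 'vm_fd' is an open KVM VM file descriptor and omits error handling beyond err():

    #include <err.h>
    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static void set_apic_id_groups(int vm_fd)
    {
            /* reserve one group bit: group 0 and group 1 (e.g. VTL0/VTL1) */
            struct kvm_apic_id_groups groups = { .n_bits = 1 };

            if (ioctl(vm_fd, KVM_SET_APIC_ID_GROUPS, &groups))
                    err(1, "KVM_SET_APIC_ID_GROUPS");
    }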
[RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
Prepare infrastructure to be able to return data through the XMM registers when Hyper-V hypercalls are issued in fast mode. The XMM registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and restored on successful hypercall completion. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/hyperv-tlfs.h | 2 +- arch/x86/kvm/hyperv.c | 33 +- include/uapi/linux/kvm.h | 6 ++ 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 2ff26f53cd62..af594aa65307 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -49,7 +49,7 @@ /* Support for physical CPU dynamic partitioning events is available*/ #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE BIT(3) /* - * Support for passing hypercall input parameter block via XMM + * Support for passing hypercall input and output parameter block via XMM * registers is available */ #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE BIT(4) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 238afd7335e4..e1bc861ab3b0 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall { u16 rep_idx; bool fast; bool rep; + bool xmm_dirty; sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS]; /* @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result) return ret; } +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm) +{ + int reg; + + kvm_fpu_get(); + for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) { + const sse128_t data = sse128(xmm[reg].low, xmm[reg].high); + _kvm_write_sse_reg(reg, &data); + } + kvm_fpu_put(); +} + +static bool kvm_hv_is_xmm_output_hcall(u16 code) +{ + return false; +} + static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) { - return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result); + bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT); + u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff; + u64 result = vcpu->run->hyperv.u.hcall.result; + + if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast) + kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm); + + return kvm_hv_hypercall_complete(vcpu, result); } static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) break; } + if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty) + kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm); + hypercall_complete: return kvm_hv_hypercall_complete(vcpu, ret); @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) vcpu->run->hyperv.u.hcall.input = hc.param; vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa; vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa; + if (hc.fast) + memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm)); vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace; return 0; } @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE; + ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE; ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE; ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d7a01766bf21..5ce06a1eee2b 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -192,6 +192,11 @@ struct
kvm_s390_cmma_log { __u64 values; }; +struct kvm_hyperv_xmm_reg { + __u64 low; + __u64 high; +}; + struct kvm_hyperv_exit { #define KVM_EXIT_HYPERV_SYNIC 1 #define KVM_EXIT_HYPERV_HCALL 2 @@ -210,6 +215,7 @@ struct kvm_hyperv_exit { __u64 input; __u64 result; __u64 params[2]; + struct kvm_hyperv_xmm_reg xmm[6]; } hcall; struct { __u32 msr; -- 2.40.1
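
As an illustration of the user-space side (not part of this patch): once an XMM-output-capable hypercall (e.g. HVCALL_GET_VP_REGISTERS, wired up later in the series) exits to the VMM with KVM_EXIT_HYPERV_HCALL, the VMM can return data by filling the new xmm[] array before re-entering the vCPU; KVM then restores it into the guest's XMM registers on completion. The HV_* constants below live in kernel headers, so a real VMM would carry its own definitions:

    /* 'run' is the vCPU's struct kvm_run after a KVM_EXIT_HYPERV exit */
    struct kvm_hyperv_exit *hv = &run->hyperv;
    int is_fast = !!(hv->u.hcall.input & HV_HYPERCALL_FAST_BIT);

    if (hv->type == KVM_EXIT_HYPERV_HCALL && is_fast) {
            /* illustrative 128-bit output value placed in XMM0 */
            hv->u.hcall.xmm[0].low  = 0xdeadbeefull;
            hv->u.hcall.xmm[0].high = 0;
    }
    hv->u.hcall.result = HV_STATUS_SUCCESS;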
[RFC 04/33] KVM: x86: hyper-v: Move hypercall page handling into separate function
The hypercall page patching is about to grow considerably, move it into its own function. No functional change intended. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 69 --- 1 file changed, 39 insertions(+), 30 deletions(-) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index e1bc861ab3b0..78d053042667 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -256,6 +256,42 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr) kvm_make_request(KVM_REQ_HV_EXIT, vcpu); } +static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) +{ + struct kvm *kvm = vcpu->kvm; + u8 instructions[9]; + int i = 0; + u64 addr; + + /* +* If Xen and Hyper-V hypercalls are both enabled, disambiguate +* the same way Xen itself does, by setting the bit 31 of EAX +* which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just +* going to be clobbered on 64-bit. +*/ + if (kvm_xen_hypercall_enabled(kvm)) { + /* orl $0x8000, %eax */ + instructions[i++] = 0x0d; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0x80; + } + + /* vmcall/vmmcall */ + static_call(kvm_x86_patch_hypercall)(vcpu, instructions + i); + i += 3; + + /* ret */ + ((unsigned char *)instructions)[i++] = 0xc3; + + addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK; + if (kvm_vcpu_write_guest(vcpu, addr, instructions, i)) + return 1; + + return 0; +} + static int synic_set_msr(struct kvm_vcpu_hv_synic *synic, u32 msr, u64 data, bool host) { @@ -1338,11 +1374,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data, if (!hv->hv_guest_os_id) hv->hv_hypercall &= ~HV_X64_MSR_HYPERCALL_ENABLE; break; - case HV_X64_MSR_HYPERCALL: { - u8 instructions[9]; - int i = 0; - u64 addr; - + case HV_X64_MSR_HYPERCALL: /* if guest os id is not set hypercall should remain disabled */ if (!hv->hv_guest_os_id) break; @@ -1351,34 +1383,11 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data, break; } - /* -* If Xen and Hyper-V hypercalls are both enabled, disambiguate -* the same way Xen itself does, by setting the bit 31 of EAX -* which is RsvdZ in the 32-bit Hyper-V hypercall ABI and just -* going to be clobbered on 64-bit. -*/ - if (kvm_xen_hypercall_enabled(kvm)) { - /* orl $0x8000, %eax */ - instructions[i++] = 0x0d; - instructions[i++] = 0x00; - instructions[i++] = 0x00; - instructions[i++] = 0x00; - instructions[i++] = 0x80; - } - - /* vmcall/vmmcall */ - static_call(kvm_x86_patch_hypercall)(vcpu, instructions + i); - i += 3; - - /* ret */ - ((unsigned char *)instructions)[i++] = 0xc3; - - addr = data & HV_X64_MSR_HYPERCALL_PAGE_ADDRESS_MASK; - if (kvm_vcpu_write_guest(vcpu, addr, instructions, i)) + if (patch_hypercall_page(vcpu, data)) return 1; + hv->hv_hypercall = data; break; - } case HV_X64_MSR_REFERENCE_TSC: hv->hv_tsc_page = data; if (hv->hv_tsc_page & HV_X64_MSR_TSC_REFERENCE_ENABLE) { -- 2.40.1
[RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
VTL call/return hypercalls have their own entry points in the hypercall page because they don't follow normal hyper-v hypercall conventions. Move the VTL call/return control input into ECX/RAX and set the hypercall code into EAX/RCX before calling the hypercall instruction in order to be able to use the Hyper-V hypercall entry function. Guests can read an emulated code page offsets register to know the offsets into the hypercall page for the VTL call/return entries. Signed-off-by: Nicolas Saenz Julienne --- My tree has the additional patch, we're still trying to understand under what conditions Windows expects the offset to be fixed. diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 54f7f36a89bf..9f2ea8c34447 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -294,6 +294,7 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) /* VTL call/return entries */ if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) { + i = 22; #ifdef CONFIG_X86_64 if (is_64_bit_mode(vcpu)) { /* --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/hyperv.c | 78 ++- include/asm-generic/hyperv-tlfs.h | 11 + 3 files changed, 90 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a2f224f95404..00cd21b09f8c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1105,6 +1105,8 @@ struct kvm_hv { u64 hv_tsc_emulation_status; u64 hv_invtsc_control; + union hv_register_vsm_code_page_offsets vsm_code_page_offsets; + /* How many vCPUs have VP index != vCPU index */ atomic_t num_mismatched_vp_indexes; diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 78d053042667..d4b1b53ea63d 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr) static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) { struct kvm *kvm = vcpu->kvm; - u8 instructions[9]; + struct kvm_hv *hv = to_kvm_hv(kvm); + u8 instructions[0x30]; int i = 0; u64 addr; @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) /* ret */ ((unsigned char *)instructions)[i++] = 0xc3; + /* VTL call/return entries */ + if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) { +#ifdef CONFIG_X86_64 + if (is_64_bit_mode(vcpu)) { + /* +* VTL call 64-bit entry prologue: +* mov %rcx, %rax +* mov $0x11, %ecx +* jmp 0: +*/ + hv->vsm_code_page_offsets.vtl_call_offset = i; + instructions[i++] = 0x48; + instructions[i++] = 0x89; + instructions[i++] = 0xc8; + instructions[i++] = 0xb9; + instructions[i++] = 0x11; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0xeb; + instructions[i++] = 0xe0; + /* +* VTL return 64-bit entry prologue: +* mov %rcx, %rax +* mov $0x12, %ecx +* jmp 0: +*/ + hv->vsm_code_page_offsets.vtl_return_offset = i; + instructions[i++] = 0x48; + instructions[i++] = 0x89; + instructions[i++] = 0xc8; + instructions[i++] = 0xb9; + instructions[i++] = 0x12; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0xeb; + instructions[i++] = 0xd6; + } else +#endif + { + /* +* VTL call 32-bit entry prologue: +* mov %eax, %ecx +* mov $0x11, %eax +* jmp 0: +*/ + hv->vsm_code_page_offsets.vtl_call_offset = i; + instructions[i++] = 0x89; + instructions[i++] = 0xc1; + instructions[i++] = 0xb8; + instructions[i++] = 0x11; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + 
instructions[i++] = 0x00;
[RFC 06/33] KVM: x86: hyper-v: Introduce VTL awareness to Hyper-V's PV-IPIs
HVCALL_SEND_IPI and HVCALL_SEND_IPI_EX allow targeting a specific VTL. Honour the requests. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 24 +--- arch/x86/kvm/trace.h | 20 include/asm-generic/hyperv-tlfs.h | 6 -- 3 files changed, 33 insertions(+), 17 deletions(-) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index d4b1b53ea63d..2cf430f6ddd8 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2230,7 +2230,7 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) } static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector, - u64 *sparse_banks, u64 valid_bank_mask) + u64 *sparse_banks, u64 valid_bank_mask, int vtl) { struct kvm_lapic_irq irq = { .delivery_mode = APIC_DM_FIXED, @@ -2245,6 +2245,9 @@ static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector, valid_bank_mask, sparse_banks)) continue; + if (kvm_hv_get_active_vtl(vcpu) != vtl) + continue; + /* We fail only when APIC is disabled */ kvm_apic_set_irq(vcpu, &irq, NULL); } @@ -2257,13 +2260,19 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) struct kvm *kvm = vcpu->kvm; struct hv_send_ipi_ex send_ipi_ex; struct hv_send_ipi send_ipi; + union hv_input_vtl *in_vtl; u64 valid_bank_mask; u32 vector; bool all_cpus; + u8 vtl; + + /* VTL is at the same offset on both IPI types */ + in_vtl = &send_ipi.in_vtl; + vtl = in_vtl->use_target_vtl ? in_vtl->target_vtl : kvm_hv_get_active_vtl(vcpu); if (hc->code == HVCALL_SEND_IPI) { if (!hc->fast) { - if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi, + if (unlikely(kvm_vcpu_read_guest(vcpu, hc->ingpa, &send_ipi, sizeof(send_ipi return HV_STATUS_INVALID_HYPERCALL_INPUT; sparse_banks[0] = send_ipi.cpu_mask; @@ -2278,10 +2287,10 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) all_cpus = false; valid_bank_mask = BIT_ULL(0); - trace_kvm_hv_send_ipi(vector, sparse_banks[0]); + trace_kvm_hv_send_ipi(vector, sparse_banks[0], vtl); } else { if (!hc->fast) { - if (unlikely(kvm_read_guest(kvm, hc->ingpa, &send_ipi_ex, + if (unlikely(kvm_vcpu_read_guest(vcpu, hc->ingpa, &send_ipi_ex, sizeof(send_ipi_ex return HV_STATUS_INVALID_HYPERCALL_INPUT; } else { @@ -2292,7 +2301,8 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) trace_kvm_hv_send_ipi_ex(send_ipi_ex.vector, send_ipi_ex.vp_set.format, -send_ipi_ex.vp_set.valid_bank_mask); +send_ipi_ex.vp_set.valid_bank_mask, +vtl); vector = send_ipi_ex.vector; valid_bank_mask = send_ipi_ex.vp_set.valid_bank_mask; @@ -2322,9 +2332,9 @@ static u64 kvm_hv_send_ipi(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) return HV_STATUS_INVALID_HYPERCALL_INPUT; if (all_cpus) - kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0); + kvm_hv_send_ipi_to_many(kvm, vector, NULL, 0, vtl); else - kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask); + kvm_hv_send_ipi_to_many(kvm, vector, sparse_banks, valid_bank_mask, vtl); ret_success: return HV_STATUS_SUCCESS; diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h index 83843379813e..ab8839c47bc7 100644 --- a/arch/x86/kvm/trace.h +++ b/arch/x86/kvm/trace.h @@ -1606,42 +1606,46 @@ TRACE_EVENT(kvm_hv_flush_tlb_ex, * Tracepoints for kvm_hv_send_ipi.
*/ TRACE_EVENT(kvm_hv_send_ipi, - TP_PROTO(u32 vector, u64 processor_mask), - TP_ARGS(vector, processor_mask), + TP_PROTO(u32 vector, u64 processor_mask, u8 vtl), + TP_ARGS(vector, processor_mask, vtl), TP_STRUCT__entry( __field(u32, vector) __field(u64, processor_mask) + __field(u8, vtl) ), TP_fast_assign( __entry->vector = vector; __entry->processor_mask = processor_mask; + __entry->vtl = vtl; ), - TP_printk("vector %x processor_mask 0x%llx",
[RFC 08/33] KVM: x86: Don't use hv_timer if CAP_HYPERV_VSM enabled
VSM's VTLs are modeled by using a distinct vCPU per VTL. While one VTL is running, the rest of the vCPUs are left idle. This doesn't play well with the approach of tracking emulated timer expiration by using the VMX preemption timer. Inactive VTLs' timers are still meant to run and inject interrupts regardless of the vCPU's runstate. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/lapic.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index f55d216cb2a0..8cc75b24381b 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -152,9 +152,10 @@ static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu) bool kvm_can_use_hv_timer(struct kvm_vcpu *vcpu) { - return kvm_x86_ops.set_hv_timer - && !(kvm_mwait_in_guest(vcpu->kvm) || - kvm_can_post_timer_interrupt(vcpu)); + return kvm_x86_ops.set_hv_timer && + !(kvm_mwait_in_guest(vcpu->kvm) || +kvm_can_post_timer_interrupt(vcpu)) && + !(kvm_hv_vsm_enabled(vcpu->kvm)); } static bool kvm_use_posted_timer_interrupt(struct kvm_vcpu *vcpu) -- 2.40.1
[RFC 07/33] KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_VSM
Introduce a new capability to enable Hyper-V Virtual Secure Mode (VSM) emulation support. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/hyperv.h | 5 + arch/x86/kvm/x86.c | 5 + include/uapi/linux/kvm.h| 1 + 4 files changed, 13 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 00cd21b09f8c..7712e31b7537 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1118,6 +1118,8 @@ struct kvm_hv { struct hv_partition_assist_pg *hv_pa_pg; struct kvm_hv_syndbg hv_syndbg; + + bool hv_enable_vsm; }; struct msr_bitmap_range { diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index f83b8db72b11..2bfed69ba0db 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -238,4 +238,9 @@ static inline int kvm_hv_verify_vp_assist(struct kvm_vcpu *vcpu) int kvm_hv_vcpu_flush_tlb(struct kvm_vcpu *vcpu); +static inline bool kvm_hv_vsm_enabled(struct kvm *kvm) +{ + return kvm->arch.hyperv.hv_enable_vsm; +} + #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 4cd3f00475c1..b0512e433032 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4485,6 +4485,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_HYPERV_CPUID: case KVM_CAP_HYPERV_ENFORCE_CPUID: case KVM_CAP_SYS_HYPERV_CPUID: + case KVM_CAP_HYPERV_VSM: case KVM_CAP_PCI_SEGMENT: case KVM_CAP_DEBUGREGS: case KVM_CAP_X86_ROBUST_SINGLESTEP: @@ -6519,6 +6520,10 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, } mutex_unlock(&kvm->lock); break; + case KVM_CAP_HYPERV_VSM: + kvm->arch.hyperv.hv_enable_vsm = true; + r = 0; + break; default: r = -EINVAL; break; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 5ce06a1eee2b..168b6ac6ebe5 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1226,6 +1226,7 @@ struct kvm_ppc_resize_hpt { #define KVM_CAP_GUEST_MEMFD 233 #define KVM_CAP_VM_TYPES 234 #define KVM_CAP_APIC_ID_GROUPS 235 +#define KVM_CAP_HYPERV_VSM 237 #ifdef KVM_CAP_IRQ_ROUTING -- 2.40.1
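
For completeness, a minimal user-space sketch of enabling the capability (illustrative only; 'vm_fd' is assumed to be an open KVM VM file descriptor):

    static int enable_hyperv_vsm(int vm_fd)
    {
            struct kvm_enable_cap cap = { .cap = KVM_CAP_HYPERV_VSM };

            return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
    }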
[RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
Introduce two helper functions. The first one queries a vCPU's VTL; the second one, given a struct kvm_vcpu and VTL pair, returns the corresponding 'sibling' struct kvm_vcpu at that VTL. We keep track of each VTL's state by having a distinct struct kvm_vcpu for each level. VTL vCPUs that belong to the same guest CPU share the same physical APIC id, but belong to different APIC groups, where the APIC group represents the vCPU's VTL. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 2bfed69ba0db..5433107e7cc8 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -23,6 +23,7 @@ #include #include "x86.h" +#include "lapic.h" /* "Hv#1" signature */ #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648 @@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu) return &vcpu->kvm->arch.hyperv.hv_syndbg; } +static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, int vtl) +{ + struct kvm *kvm = vcpu->kvm; + u32 target_id = kvm_apic_id(vcpu); + + kvm_apic_id_set_group(kvm, vtl, &target_id); + if (vcpu->vcpu_id == target_id) + return vcpu; + + return kvm_get_vcpu_by_id(kvm, target_id); +} + +static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu) +{ + return kvm_apic_group(vcpu); +} + static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu) { struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu); -- 2.40.1
[RFC 10/33] KVM: x86: hyper-v: Introduce KVM_HV_GET_VSM_STATE
HVCALL_GET_VP_REGISTERS exposes the VTL call hypercall page entry offsets to the guest. This hypercall is implemented in user-space while the hypercall page patching happens in-kernel. So expose it as part of the partition wide VSM state. NOTE: Alternatively there is the option of sharing this information through a VTL KVM device attribute (the device is introduced in subsequent patches). Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/uapi/asm/kvm.h | 5 + arch/x86/kvm/hyperv.c | 8 arch/x86/kvm/hyperv.h | 2 ++ arch/x86/kvm/x86.c | 18 ++ include/uapi/linux/kvm.h| 4 5 files changed, 37 insertions(+) diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index f73d137784d7..370483d5d5fd 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -570,4 +570,9 @@ struct kvm_apic_id_groups { __u8 n_bits; /* nr of bits used to represent group in the APIC ID */ }; +/* for KVM_HV_GET_VSM_STATE */ +struct kvm_hv_vsm_state { + __u64 vsm_code_page_offsets; +}; + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 2cf430f6ddd8..caaa859932c5 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2990,3 +2990,11 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, return 0; } + +int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state) +{ + struct kvm_hv* hv = &kvm->arch.hyperv; + + state->vsm_code_page_offsets = hv->vsm_code_page_offsets.as_u64; + return 0; +} diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 5433107e7cc8..b3d1113efe82 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -261,4 +261,6 @@ static inline bool kvm_hv_vsm_enabled(struct kvm *kvm) return kvm->arch.hyperv.hv_enable_vsm; } +int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state); + #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index b0512e433032..57f9c58e1e32 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7132,6 +7132,24 @@ int kvm_arch_vm_ioctl(struct file *filp, unsigned int ioctl, unsigned long arg) r = kvm_vm_ioctl_set_apic_id_groups(kvm, &groups); break; } + case KVM_HV_GET_VSM_STATE: { + struct kvm_hv_vsm_state vsm_state; + + r = -EINVAL; + if (!kvm_hv_vsm_enabled(kvm)) + goto out; + + r = kvm_vm_ioctl_get_hv_vsm_state(kvm, &vsm_state); + if (r) + goto out; + + r = -EFAULT; + if (copy_to_user(argp, &vsm_state, sizeof(vsm_state))) + goto out; + + r = 0; + break; + } default: r = -ENOTTY; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 168b6ac6ebe5..03f5c08fd7aa 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -2316,4 +2316,8 @@ struct kvm_create_guest_memfd { #define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE (1ULL << 0) #define KVM_SET_APIC_ID_GROUPS _IOW(KVMIO, 0xd7, struct kvm_apic_id_groups) + +/* Get/Set Hyper-V VSM state. Available with KVM_CAP_HYPERV_VSM */ +#define KVM_HV_GET_VSM_STATE _IOR(KVMIO, 0xd5, struct kvm_hv_vsm_state) + #endif /* __LINUX_KVM_H */ -- 2.40.1
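
A minimal user-space sketch of retrieving the state (illustrative only; 'vm_fd' is an open KVM VM file descriptor):

    struct kvm_hv_vsm_state state;

    if (ioctl(vm_fd, KVM_HV_GET_VSM_STATE, &state))
            err(1, "KVM_HV_GET_VSM_STATE");
    /*
     * state.vsm_code_page_offsets holds the VTL call/return offsets
     * patched into the hypercall page; the VMM can hand it back to the
     * guest when servicing HVCALL_GET_VP_REGISTERS.
     */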
[RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space
Let user-space handle HVCALL_GET_VP_REGISTERS and HVCALL_SET_VP_REGISTERS through the KVM_EXIT_HYPERV_HVCALL exit reason. Additionally, expose the cpuid bit. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + 2 files changed, 10 insertions(+) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index caaa859932c5..a3970d52eef1 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2456,6 +2456,9 @@ static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm) static bool kvm_hv_is_xmm_output_hcall(u16 code) { + if (code == HVCALL_GET_VP_REGISTERS) + return true; + return false; } @@ -2520,6 +2523,8 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc) case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX: case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX: case HVCALL_SEND_IPI_EX: + case HVCALL_GET_VP_REGISTERS: + case HVCALL_SET_VP_REGISTERS: return true; } @@ -2738,6 +2743,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) break; } goto hypercall_userspace_exit; + case HVCALL_GET_VP_REGISTERS: + case HVCALL_SET_VP_REGISTERS: + goto hypercall_userspace_exit; default: ret = HV_STATUS_INVALID_HYPERCALL_CODE; break; @@ -2903,6 +2911,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, ent->ebx |= HV_POST_MESSAGES; ent->ebx |= HV_SIGNAL_EVENTS; ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; + ent->ebx |= HV_ACCESS_VP_REGISTERS; ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE; ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE; diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index 40d7dc793c03..24ea699a3d8e 100644 --- a/include/asm-generic/hyperv-tlfs.h +++ b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_MANAGEMENT BIT(12) +#define HV_ACCESS_VP_REGISTERS BIT(17) #define HV_ENABLE_EXTENDED_HYPERCALLS BIT(20) #define HV_ISOLATION BIT(22) -- 2.40.1
[RFC 12/33] KVM: x86: hyper-v: Handle VSM hcalls in user-space
Let user-space handle all hypercalls that fall under the AccessVsm partition privilege flag. That is: - HVCALL_MODIFY_VTL_PROTECTION_MASK: - HVCALL_ENABLE_PARTITION_VTL: - HVCALL_ENABLE_VP_VTL: - HVCALL_VTL_CALL: - HVCALL_VTL_RETURN: The hypercalls are processed through the KVM_EXIT_HYPERV_HVCALL exit. Additionally, expose the cpuid bit. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 15 +++ include/asm-generic/hyperv-tlfs.h | 7 ++- 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index a3970d52eef1..a266c5d393f5 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2462,6 +2462,11 @@ static bool kvm_hv_is_xmm_output_hcall(u16 code) return false; } +static inline bool kvm_hv_is_vtl_call_return(u16 code) +{ + return code == HVCALL_VTL_CALL || code == HVCALL_VTL_RETURN; +} + static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) { bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT); @@ -2471,6 +2476,9 @@ static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast) kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm); + if (kvm_hv_is_vtl_call_return(code)) + return kvm_skip_emulated_instruction(vcpu); + return kvm_hv_hypercall_complete(vcpu, result); } @@ -2525,6 +2533,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc) case HVCALL_SEND_IPI_EX: case HVCALL_GET_VP_REGISTERS: case HVCALL_SET_VP_REGISTERS: + case HVCALL_MODIFY_VTL_PROTECTION_MASK: return true; } @@ -2745,6 +2754,11 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) goto hypercall_userspace_exit; case HVCALL_GET_VP_REGISTERS: case HVCALL_SET_VP_REGISTERS: + case HVCALL_MODIFY_VTL_PROTECTION_MASK: + case HVCALL_ENABLE_PARTITION_VTL: + case HVCALL_ENABLE_VP_VTL: + case HVCALL_VTL_CALL: + case HVCALL_VTL_RETURN: goto hypercall_userspace_exit; default: ret = HV_STATUS_INVALID_HYPERCALL_CODE; @@ -2912,6 +2926,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, ent->ebx |= HV_SIGNAL_EVENTS; ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; ent->ebx |= HV_ACCESS_VP_REGISTERS; + ent->ebx |= HV_ACCESS_VSM; ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE; ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE; diff --git a/include/asm-generic/hyperv-tlfs.h b/include/asm-generic/hyperv-tlfs.h index 24ea699a3d8e..a8b5c8a84bbc 100644 --- a/include/asm-generic/hyperv-tlfs.h +++ b/include/asm-generic/hyperv-tlfs.h @@ -89,6 +89,7 @@ #define HV_ACCESS_STATSBIT(8) #define HV_DEBUGGING BIT(11) #define HV_CPU_MANAGEMENT BIT(12) +#define HV_ACCESS_VSM BIT(16) #define HV_ACCESS_VP_REGISTERS BIT(17) #define HV_ENABLE_EXTENDED_HYPERCALLS BIT(20) #define HV_ISOLATION BIT(22) @@ -147,9 +148,13 @@ union hv_reference_tsc_msr { /* Declare the various hypercall operations. */ #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE 0x0002 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST 0x0003 -#define HVCALL_ENABLE_VP_VTL 0x000f #define HVCALL_NOTIFY_LONG_SPIN_WAIT 0x0008 #define HVCALL_SEND_IPI0x000b +#define HVCALL_MODIFY_VTL_PROTECTION_MASK 0x000c +#define HVCALL_ENABLE_PARTITION_VTL0x000d +#define HVCALL_ENABLE_VP_VTL 0x000f +#define HVCALL_VTL_CALL0x0011 +#define HVCALL_VTL_RETURN 0x0012 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX 0x0013 #define HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX 0x0014 #define HVCALL_SEND_IPI_EX 0x0015 -- 2.40.1
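
To give an idea of the intended flow, here is an illustrative fragment of a VMM run loop dispatching these exits. It is only a sketch: 'run' is assumed to be the vCPU's struct kvm_run, the HV_*/HVCALL_* constants would be carried by the VMM, and the actual VTL switch logic is VMM specific:

    if (run->exit_reason == KVM_EXIT_HYPERV &&
        run->hyperv.type == KVM_EXIT_HYPERV_HCALL) {
            __u16 code = run->hyperv.u.hcall.input & 0xffff;

            switch (code) {
            case HVCALL_VTL_CALL:
                    /* suspend this VTL's vCPU, schedule in its higher-VTL sibling */
                    break;
            case HVCALL_VTL_RETURN:
                    /* resume the lower-VTL sibling vCPU */
                    break;
            default:
                    run->hyperv.u.hcall.result = HV_STATUS_INVALID_HYPERCALL_CODE;
                    break;
            }
    }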
[RFC 14/33] KVM: x86: Add VTL to the MMU role
With the upcoming introduction of per-VTL memory protections, make MMU roles VTL aware. This will avoid sharing PTEs between vCPUs that belong to different VTLs, and that have distinct memory access restrictions. Four bits are allocated to store the VTL number in the MMU role, since the TLFS states there is a maximum of 16 levels. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/kvm_host.h | 3 ++- arch/x86/kvm/hyperv.h | 6 ++ arch/x86/kvm/mmu.h | 1 + arch/x86/kvm/mmu/mmu.c | 3 +++ 4 files changed, 12 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 7712e31b7537..1f5a85d461ce 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -338,7 +338,8 @@ union kvm_mmu_page_role { unsigned ad_disabled:1; unsigned guest_mode:1; unsigned passthrough:1; - unsigned :5; + unsigned vtl:4; + unsigned :1; /* * This is left at the top of the word so that diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index b3d1113efe82..605e80b9e5eb 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -263,4 +263,10 @@ static inline bool kvm_hv_vsm_enabled(struct kvm *kvm) int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *state); +static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu, + union kvm_mmu_page_role *role) +{ + role->vtl = kvm_hv_get_active_vtl(vcpu); +} + #endif diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h index 253fb2093d5d..e170388c6da1 100644 --- a/arch/x86/kvm/mmu.h +++ b/arch/x86/kvm/mmu.h @@ -304,4 +304,5 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu, return gpa; return translate_nested_gpa(vcpu, gpa, access, exception); } + #endif diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index baeba8fc1c38..2afef86863fb 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -28,6 +28,7 @@ #include "page_track.h" #include "cpuid.h" #include "spte.h" +#include "hyperv.h" #include #include @@ -5197,6 +5198,7 @@ static union kvm_cpu_role kvm_calc_cpu_role(struct kvm_vcpu *vcpu, role.base.smm = is_smm(vcpu); role.base.guest_mode = is_guest_mode(vcpu); role.ext.valid = 1; + kvm_mmu_role_set_hv_bits(vcpu, &role.base); if (!is_cr0_pg(regs)) { role.base.direct = 1; @@ -5271,6 +5273,7 @@ kvm_calc_tdp_mmu_root_page_role(struct kvm_vcpu *vcpu, role.level = kvm_mmu_get_tdp_level(vcpu); role.direct = true; role.has_4_byte_gpte = false; + kvm_mmu_role_set_hv_bits(vcpu, &role); return role; } -- 2.40.1
[RFC 13/33] KVM: Allow polling vCPUs for events
A number of use cases have surfaced where it'd be beneficial to have a vCPU stop its execution in user-space, as opposed to having it sleep in-kernel, be it in order to make better use of the pCPU's time while the vCPU is halted, or to implement security features like Hyper-V's VSM. A problem with this approach is that user-space has no way of knowing whether the vCPU has pending events (interrupts, timers, etc.), so we need a new interface to query whether any are pending. poll() turned out to be a very good fit, so enable polling vCPUs. The poll() interface considers that a vCPU has a pending event if it hasn't entered the guest since being kicked by an event source (being kicked forces a guest exit). Kicking a vCPU that has pollers wakes up the polling threads. NOTES: - There is a race between the 'vcpu->kicked' check in the polling thread and the vCPU thread re-entering the guest. This hardly affects the use-cases stated above, but needs to be fixed. - This was tested alongside a WIP Hyper-V Virtual Trust Level implementation which makes ample use of the poll() interface. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/x86.c | 2 ++ include/linux/kvm_host.h | 2 ++ virt/kvm/kvm_main.c | 30 ++ 3 files changed, 34 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 57f9c58e1e32..bf4891bc044e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -10788,6 +10788,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) goto cancel_injection; } + WRITE_ONCE(vcpu->kicked, false); + if (req_immediate_exit) { kvm_make_request(KVM_REQ_EVENT, vcpu); static_call(kvm_x86_request_immediate_exit)(vcpu); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 687589ce9f63..71e1e8cf8936 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -336,6 +336,7 @@ struct kvm_vcpu { #endif int mode; u64 requests; + bool kicked; unsigned long guest_debug; struct mutex mutex; @@ -395,6 +396,7 @@ struct kvm_vcpu { */ struct kvm_memory_slot *last_used_slot; u64 last_used_slot_gen; + wait_queue_head_t wqh; }; /* diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index ad9aab898a0c..fde004a0ac46 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -497,12 +497,14 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id) kvm_vcpu_set_dy_eligible(vcpu, false); vcpu->preempted = false; vcpu->ready = false; + vcpu->kicked = false; preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops); vcpu->last_used_slot = NULL; /* Fill the stats id string for the vcpu */ snprintf(vcpu->stats_id, sizeof(vcpu->stats_id), "kvm-%d/vcpu-%d", task_pid_nr(current), id); + init_waitqueue_head(&vcpu->wqh); } static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu) @@ -3970,6 +3972,10 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu) if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) smp_send_reschedule(cpu); } + + if (!cmpxchg(&vcpu->kicked, false, true)) + wake_up_interruptible(&vcpu->wqh); + out: put_cpu(); } @@ -4174,6 +4180,29 @@ static int kvm_vcpu_mmap(struct file *file, struct vm_area_struct *vma) return 0; } +static __poll_t kvm_vcpu_poll(struct file *file, poll_table *wait) +{ + struct kvm_vcpu *vcpu = file->private_data; + + poll_wait(file, &vcpu->wqh, wait); + + /* +* Make sure we read vcpu->kicked after adding the vcpu into +* the waitqueue list.
Otherwise we might have the following race: +* +* READ_ONCE(vcpu->kicked) +* cmpxchg(&vcpu->kicked, false, true)) +* wake_up_interruptible(&vcpu->wqh) +* list_add_tail(wait, &vcpu->wqh) +*/ + smp_mb(); + if (READ_ONCE(vcpu->kicked)) { + return EPOLLIN; + } + + return 0; +} + static int kvm_vcpu_release(struct inode *inode, struct file *filp) { struct kvm_vcpu *vcpu = filp->private_data; @@ -4186,6 +4215,7 @@ static const struct file_operations kvm_vcpu_fops = { .release= kvm_vcpu_release, .unlocked_ioctl = kvm_vcpu_ioctl, .mmap = kvm_vcpu_mmap, + .poll = kvm_vcpu_poll, .llseek = noop_llseek, KVM_COMPAT(kvm_vcpu_compat_ioctl), }; -- 2.40.1
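
A minimal user-space sketch of the new interface (illustrative only; 'vcpu_fd' is the file descriptor returned by KVM_CREATE_VCPU):

    #include <poll.h>

    /* park the vCPU in the VMM and wait until it has a pending event */
    struct pollfd pfd = { .fd = vcpu_fd, .events = POLLIN };

    if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLIN)) {
            /* the vCPU was kicked; schedule it back in via KVM_RUN */
    }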
[RFC 15/33] KVM: x86/mmu: Introduce infrastructure to handle non-executable faults
The upcoming per-VTL memory protections support needs to fault in non-executable memory. Introduce a new attribute in struct kvm_page_fault, map_executable, to control whether the gfn range should be mapped as executable. No functional change intended. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 6 +- arch/x86/kvm/mmu/mmu_internal.h | 2 ++ arch/x86/kvm/mmu/tdp_mmu.c | 8 ++-- 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 2afef86863fb..4e02d506cc25 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -3245,6 +3245,7 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) struct kvm_mmu_page *sp; int ret; gfn_t base_gfn = fault->gfn; + unsigned access = ACC_ALL; kvm_mmu_hugepage_adjust(vcpu, fault); @@ -3274,7 +3275,10 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) if (WARN_ON_ONCE(it.level != fault->goal_level)) return -EFAULT; - ret = mmu_set_spte(vcpu, fault->slot, it.sptep, ACC_ALL, + if (!fault->map_executable) + access &= ~ACC_EXEC_MASK; + + ret = mmu_set_spte(vcpu, fault->slot, it.sptep, access, base_gfn, fault->pfn, fault); if (ret == RET_PF_SPURIOUS) return ret; diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h index b66a7d47e0e4..bd62c4d5d5f1 100644 --- a/arch/x86/kvm/mmu/mmu_internal.h +++ b/arch/x86/kvm/mmu/mmu_internal.h @@ -239,6 +239,7 @@ struct kvm_page_fault { kvm_pfn_t pfn; hva_t hva; bool map_writable; + bool map_executable; /* * Indicates the guest is trying to write a gfn that contains one or @@ -298,6 +299,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, .req_level = PG_LEVEL_4K, .goal_level = PG_LEVEL_4K, .is_private = kvm_mem_is_private(vcpu->kvm, cr2_or_gpa >> PAGE_SHIFT), + .map_executable = true, }; int r; diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c index 6cd4dd631a2f..46f3e72ab770 100644 --- a/arch/x86/kvm/mmu/tdp_mmu.c +++ b/arch/x86/kvm/mmu/tdp_mmu.c @@ -957,14 +957,18 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu, u64 new_spte; int ret = RET_PF_FIXED; bool wrprot = false; + unsigned access = ACC_ALL; if (WARN_ON_ONCE(sp->role.level != fault->goal_level)) return RET_PF_RETRY; + if (!fault->map_executable) + access &= ~ACC_EXEC_MASK; + if (unlikely(!fault->slot)) - new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL); + new_spte = make_mmio_spte(vcpu, iter->gfn, access); else - wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn, + wrprot = make_spte(vcpu, sp, fault->slot, access, iter->gfn, fault->pfn, iter->old_spte, fault->prefetch, true, fault->map_writable, &new_spte); -- 2.40.1
[RFC 16/33] KVM: x86/mmu: Expose R/W/X flags during memory fault exits
Include the fault's read, write and execute status when exiting to user-space. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 4 ++-- include/linux/kvm_host.h | 9 +++-- include/uapi/linux/kvm.h | 6 ++ 3 files changed, 15 insertions(+), 4 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4e02d506cc25..feca077c0210 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4300,8 +4300,8 @@ static inline u8 kvm_max_level_for_order(int order) static void kvm_mmu_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) { - kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, - PAGE_SIZE, fault->write, fault->exec, + kvm_prepare_memory_fault_exit(vcpu, fault->gfn << PAGE_SHIFT, PAGE_SIZE, + fault->write, fault->exec, fault->user, fault->is_private); } diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 71e1e8cf8936..631fd532c97a 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2367,14 +2367,19 @@ static inline void kvm_account_pgtable_pages(void *virt, int nr) static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, gpa_t gpa, gpa_t size, bool is_write, bool is_exec, -bool is_private) +bool is_read, bool is_private) { vcpu->run->exit_reason = KVM_EXIT_MEMORY_FAULT; vcpu->run->memory_fault.gpa = gpa; vcpu->run->memory_fault.size = size; - /* RWX flags are not (yet) defined or communicated to userspace. */ vcpu->run->memory_fault.flags = 0; + if (is_read) + vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_READ; + if (is_write) + vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_WRITE; + if (is_exec) + vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_EXECUTE; if (is_private) vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE; } diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 03f5c08fd7aa..0ddffb8b0c99 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -533,7 +533,13 @@ struct kvm_run { } notify; /* KVM_EXIT_MEMORY_FAULT */ struct { +#define KVM_MEMORY_EXIT_FLAG_READ (1ULL << 0) +#define KVM_MEMORY_EXIT_FLAG_WRITE (1ULL << 1) +#define KVM_MEMORY_EXIT_FLAG_EXECUTE (1ULL << 2) #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3) +#define KVM_MEMORY_EXIT_NO_ACCESS\ + (KVM_MEMORY_EXIT_FLAG_NR | KVM_MEMORY_EXIT_FLAG_NW | \ +KVM_MEMORY_EXIT_FLAG_NX) __u64 flags; __u64 gpa; __u64 size; -- 2.40.1
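
An illustrative user-space fragment decoding the new flags on a memory fault exit ('run' is assumed to be the vCPU's struct kvm_run):

    if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
            __u64 flags = run->memory_fault.flags;
            int is_read  = !!(flags & KVM_MEMORY_EXIT_FLAG_READ);
            int is_write = !!(flags & KVM_MEMORY_EXIT_FLAG_WRITE);
            int is_exec  = !!(flags & KVM_MEMORY_EXIT_FLAG_EXECUTE);

            /* e.g. compare against the faulting VTL's protections and raise an intercept */
    }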
[RFC 17/33] KVM: x86/mmu: Allow setting memory attributes if VSM enabled
VSM is also a user of memory attributes, so let it use kvm_set_mem_attributes(). Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index feca077c0210..a1fbb905258b 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7265,7 +7265,8 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, * Zapping SPTEs in this case ensures KVM will reassess whether or not * a hugepage can be used for affected ranges. */ - if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm) && +!kvm_hv_vsm_enabled(kvm))) return false; return kvm_unmap_gfn_range(kvm, range); @@ -7322,7 +7323,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, * a range that has PRIVATE GFNs, and conversely converting a range to * SHARED may now allow hugepages. */ - if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm))) + if (WARN_ON_ONCE(!kvm_arch_has_private_mem(kvm) && +!kvm_hv_vsm_enabled(kvm))) return false; /* -- 2.40.1
[RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array
Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array to allow other memory attribute sources to use the function. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 5 +++-- include/linux/kvm_host.h | 8 +--- 2 files changed, 8 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a1fbb905258b..96421234ca88 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7301,7 +7301,7 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot, for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) { if (hugepage_test_mixed(slot, gfn, level - 1) || - attrs != kvm_get_memory_attributes(kvm, gfn)) + attrs != kvm_get_memory_attributes(&kvm->mem_attr_array, gfn)) return false; } return true; @@ -7401,7 +7401,8 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm, * be manually checked as the attributes may already be mixed. */ for (gfn = start; gfn < end; gfn += nr_pages) { - unsigned long attrs = kvm_get_memory_attributes(kvm, gfn); + unsigned long attrs = + kvm_get_memory_attributes(&kvm->mem_attr_array, gfn); if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) hugepage_clear_mixed(slot, gfn, level); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 631fd532c97a..4242588e3dfb 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2385,9 +2385,10 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, } #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn) +static inline unsigned long +kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn) { - return xa_to_value(xa_load(&kvm->mem_attr_array, gfn)); + return xa_to_value(xa_load(mem_attr_array, gfn)); } bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end, @@ -2400,7 +2401,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) { return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && - kvm_get_memory_attributes(kvm, gfn) & KVM_MEMORY_ATTRIBUTE_PRIVATE; + kvm_get_memory_attributes(&kvm->mem_attr_array, gfn) & + KVM_MEMORY_ATTRIBUTE_PRIVATE; } #else static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) -- 2.40.1
[RFC 20/33] KVM: x86/mmu: Decouple hugepage_has_attrs() from struct kvm's mem_attr_array
Decouple hugepage_has_attrs() from struct kvm's mem_attr_array to allow other memory attribute sources to use the function. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 18 ++ 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 4ace2f8660b0..c0fd3afd6be5 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7290,19 +7290,19 @@ static void hugepage_set_mixed(struct kvm_memory_slot *slot, gfn_t gfn, lpage_info_slot(gfn, slot, level)->disallow_lpage |= KVM_LPAGE_MIXED_FLAG; } -static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot, - gfn_t gfn, int level, unsigned long attrs) +static bool hugepage_has_attrs(struct xarray *mem_attr_array, + struct kvm_memory_slot *slot, gfn_t gfn, + int level, unsigned long attrs) { const unsigned long start = gfn; const unsigned long end = start + KVM_PAGES_PER_HPAGE(level); if (level == PG_LEVEL_2M) - return kvm_range_has_memory_attributes(&kvm->mem_attr_array, - start, end, attrs); + return kvm_range_has_memory_attributes(mem_attr_array, start, end, attrs); for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) { if (hugepage_test_mixed(slot, gfn, level - 1) || - attrs != kvm_get_memory_attributes(&kvm->mem_attr_array, gfn)) + attrs != kvm_get_memory_attributes(mem_attr_array, gfn)) return false; } return true; @@ -7344,7 +7344,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, * misaligned address regardless of memory attributes. */ if (gfn >= slot->base_gfn) { - if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) + if (hugepage_has_attrs(&kvm->mem_attr_array, + slot, gfn, level, attrs)) hugepage_clear_mixed(slot, gfn, level); else hugepage_set_mixed(slot, gfn, level); @@ -7366,7 +7367,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, */ if (gfn < range->end && (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) { - if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) + if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn, + level, attrs)) hugepage_clear_mixed(slot, gfn, level); else hugepage_set_mixed(slot, gfn, level); @@ -7405,7 +7407,7 @@ void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm, unsigned long attrs = kvm_get_memory_attributes(&kvm->mem_attr_array, gfn); - if (hugepage_has_attrs(kvm, slot, gfn, level, attrs)) + if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn, level, attrs)) hugepage_clear_mixed(slot, gfn, level); else hugepage_set_mixed(slot, gfn, level); -- 2.40.1
[RFC 19/33] KVM: x86: Decouple kvm_range_has_memory_attributes() from struct kvm's mem_attr_array
Decouple kvm_range_has_memory_attributes() from struct kvm's mem_attr_array to allow other memory attribute sources to use the function. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 3 ++- include/linux/kvm_host.h | 4 ++-- virt/kvm/kvm_main.c | 9 + 3 files changed, 9 insertions(+), 7 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 96421234ca88..4ace2f8660b0 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7297,7 +7297,8 @@ static bool hugepage_has_attrs(struct kvm *kvm, struct kvm_memory_slot *slot, const unsigned long end = start + KVM_PAGES_PER_HPAGE(level); if (level == PG_LEVEL_2M) - return kvm_range_has_memory_attributes(kvm, start, end, attrs); + return kvm_range_has_memory_attributes(&kvm->mem_attr_array, + start, end, attrs); for (gfn = start; gfn < end; gfn += KVM_PAGES_PER_HPAGE(level - 1)) { if (hugepage_test_mixed(slot, gfn, level - 1) || diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 4242588e3dfb..32cf05637647 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2391,8 +2391,8 @@ kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn) return xa_to_value(xa_load(mem_attr_array, gfn)); } -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end, -unsigned long attrs); +bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start, +gfn_t end, unsigned long attrs); bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range); bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index fde004a0ac46..6bb23eaf7aa6 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2440,10 +2440,10 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, * Returns true if _all_ gfns in the range [@start, @end) have attributes * matching @attrs. */ -bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end, -unsigned long attrs) +bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start, +gfn_t end, unsigned long attrs) { - XA_STATE(xas, &kvm->mem_attr_array, start); + XA_STATE(xas, mem_attr_array, start); unsigned long index; bool has_attrs; void *entry; @@ -2582,7 +2582,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, mutex_lock(&kvm->slots_lock); /* Nothing to do if the entire range as the desired attributes. */ - if (kvm_range_has_memory_attributes(kvm, start, end, attributes)) + if (kvm_range_has_memory_attributes(&kvm->mem_attr_array, start, end, + attributes)) goto out_unlock; /* -- 2.40.1
[RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument
Pass the memory attribute array through struct kvm_mmu_notifier_arg and use it in kvm_arch_post_set_memory_attributes() instead of defaulting on kvm->mem_attr_array. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 8 include/linux/kvm_host.h | 5 - virt/kvm/kvm_main.c | 1 + 3 files changed, 9 insertions(+), 5 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c0fd3afd6be5..c2bec2be2ba9 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7311,6 +7311,7 @@ static bool hugepage_has_attrs(struct xarray *mem_attr_array, bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range) { + struct xarray *mem_attr_array = range->arg.mem_attr_array; unsigned long attrs = range->arg.attributes; struct kvm_memory_slot *slot = range->slot; int level; @@ -7344,8 +7345,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, * misaligned address regardless of memory attributes. */ if (gfn >= slot->base_gfn) { - if (hugepage_has_attrs(&kvm->mem_attr_array, - slot, gfn, level, attrs)) + if (hugepage_has_attrs(mem_attr_array, slot, + gfn, level, attrs)) hugepage_clear_mixed(slot, gfn, level); else hugepage_set_mixed(slot, gfn, level); @@ -7367,8 +7368,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, */ if (gfn < range->end && (gfn + nr_pages) <= (slot->base_gfn + slot->npages)) { - if (hugepage_has_attrs(&kvm->mem_attr_array, slot, gfn, - level, attrs)) + if (hugepage_has_attrs(mem_attr_array, slot, gfn, level, attrs)) hugepage_clear_mixed(slot, gfn, level); else hugepage_set_mixed(slot, gfn, level); diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 32cf05637647..652656444c45 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -256,7 +256,10 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu); #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER union kvm_mmu_notifier_arg { pte_t pte; - unsigned long attributes; + struct { + unsigned long attributes; + struct xarray *mem_attr_array; + }; }; struct kvm_gfn_range { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 6bb23eaf7aa6..f20dafaedc72 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2569,6 +2569,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, .start = start, .end = end, .arg.attributes = attributes, + .arg.mem_attr_array = &kvm->mem_attr_array, .handler = kvm_arch_post_set_memory_attributes, .on_lock = kvm_mmu_invalidate_end, .may_block = true, -- 2.40.1
[RFC 22/33] KVM: Decouple kvm_ioctl_set_mem_attributes() from kvm's mem_attr_array
VSM will keep track of each VTL's memory protections in a separate mem_attr_array. Access to these arrays will happen by issuing KVM_SET_MEMORY_ATTRIBUTES ioctls to their respective KVM VTL devices (which is also introduced in subsequent patches). Let the VTL devices reuse kvm_ioctl_set_mem_attributes() by decoupling it from struct kvm's mem_attr_array. The xarray is now input as a function argument as well as the list of supported memory attributes. Signed-off-by: Nicolas Saenz Julienne --- include/linux/kvm_host.h | 3 +++ virt/kvm/kvm_main.c | 32 ++-- 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 652656444c45..ad104794037f 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2394,6 +2394,9 @@ kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn) return xa_to_value(xa_load(mem_attr_array, gfn)); } +int kvm_ioctl_set_mem_attributes(struct kvm *kvm, struct xarray *mem_attr_array, +u64 supported_attrs, +struct kvm_memory_attributes *attrs); bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start, gfn_t end, unsigned long attrs); bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f20dafaedc72..74c4c42b2126 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2554,8 +2554,9 @@ static bool kvm_pre_set_memory_attributes(struct kvm *kvm, } /* Set @attributes for the gfn range [@start, @end). */ -static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, -unsigned long attributes) +static int kvm_set_mem_attributes(struct kvm *kvm, + struct xarray *mem_attr_array, gfn_t start, + gfn_t end, unsigned long attributes) { struct kvm_mmu_notifier_range pre_set_range = { .start = start, @@ -2569,7 +2570,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, .start = start, .end = end, .arg.attributes = attributes, - .arg.mem_attr_array = &kvm->mem_attr_array, + .arg.mem_attr_array = mem_attr_array, .handler = kvm_arch_post_set_memory_attributes, .on_lock = kvm_mmu_invalidate_end, .may_block = true, @@ -2583,7 +2584,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, mutex_lock(&kvm->slots_lock); /* Nothing to do if the entire range as the desired attributes. */ - if (kvm_range_has_memory_attributes(&kvm->mem_attr_array, start, end, + if (kvm_range_has_memory_attributes(mem_attr_array, start, end, attributes)) goto out_unlock; @@ -2592,7 +2593,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, * partway through setting the new attributes. 
*/ for (i = start; i < end; i++) { - r = xa_reserve(&kvm->mem_attr_array, i, GFP_KERNEL_ACCOUNT); + r = xa_reserve(mem_attr_array, i, GFP_KERNEL_ACCOUNT); if (r) goto out_unlock; } @@ -2600,7 +2601,7 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, kvm_handle_gfn_range(kvm, &pre_set_range); for (i = start; i < end; i++) { - r = xa_err(xa_store(&kvm->mem_attr_array, i, entry, + r = xa_err(xa_store(mem_attr_array, i, entry, GFP_KERNEL_ACCOUNT)); KVM_BUG_ON(r, kvm); } @@ -2612,15 +2613,17 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end, return r; } -static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm, - struct kvm_memory_attributes *attrs) + +int kvm_ioctl_set_mem_attributes(struct kvm *kvm, struct xarray *mem_attr_array, +u64 supported_attrs, +struct kvm_memory_attributes *attrs) { gfn_t start, end; /* flags is currently not used. */ if (attrs->flags) return -EINVAL; - if (attrs->attributes & ~kvm_supported_mem_attributes(kvm)) + if (attrs->attributes & ~supported_attrs) return -EINVAL; if (attrs->size == 0 || attrs->address + attrs->size < attrs->address) return -EINVAL; @@ -2637,7 +2640,16 @@ static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm, */ BUILD_BUG_ON(sizeof(attrs->attributes) != sizeof(unsigned long)); - return kvm_vm_set_mem_attributes(kvm, start, end
[RFC 23/33] KVM: Expose memory attribute helper functions unconditionally
Expose memory attribute helper functions even when CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES is disabled. Other KVM features, like Hyper-V VSM, make use of memory attributes but don't rely on the KVM ioctl. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/mmu/mmu.c | 2 +- include/linux/kvm_host.h | 2 +- virt/kvm/kvm_main.c | 18 +- 3 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index c2bec2be2ba9..a76028aa8fb3 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -7250,7 +7250,6 @@ void kvm_mmu_pre_destroy_vm(struct kvm *kvm) kthread_stop(kvm->arch.nx_huge_page_recovery_thread); } -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range) { @@ -7377,6 +7376,7 @@ bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, return false; } +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES void kvm_mmu_init_memslot_memory_attributes(struct kvm *kvm, struct kvm_memory_slot *slot) { diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ad104794037f..45e3e261755d 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -2387,7 +2387,6 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE; } -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES static inline unsigned long kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn) { @@ -2404,6 +2403,7 @@ bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm, bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, struct kvm_gfn_range *range); +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES static inline bool kvm_mem_is_private(struct kvm *kvm, gfn_t gfn) { return IS_ENABLED(CONFIG_KVM_PRIVATE_MEM) && diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 74c4c42b2126..b3f4b200f438 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2435,7 +2435,6 @@ static int kvm_vm_ioctl_clear_dirty_log(struct kvm *kvm, } #endif /* CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT */ -#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES /* * Returns true if _all_ gfns in the range [@start, @end) have attributes * matching @attrs. @@ -2472,14 +2471,6 @@ bool kvm_range_has_memory_attributes(struct xarray *mem_attr_array, gfn_t start, return has_attrs; } -static u64 kvm_supported_mem_attributes(struct kvm *kvm) -{ - if (!kvm || kvm_arch_has_private_mem(kvm)) - return KVM_MEMORY_ATTRIBUTE_PRIVATE; - - return 0; -} - static __always_inline void kvm_handle_gfn_range(struct kvm *kvm, struct kvm_mmu_notifier_range *range) { @@ -2644,6 +2635,15 @@ int kvm_ioctl_set_mem_attributes(struct kvm *kvm, struct xarray *mem_attr_array, attrs->attributes); } +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES +static u64 kvm_supported_mem_attributes(struct kvm *kvm) +{ + if (!kvm || kvm_arch_has_private_mem(kvm)) + return KVM_MEMORY_ATTRIBUTE_PRIVATE; + + return 0; +} + static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm, struct kvm_memory_attributes *attrs) { -- 2.40.1
[RFC 25/33] KVM: Introduce a set of new memory attributes
Introduce the following memory attributes: - KVM_MEMORY_ATTRIBUTE_READ - KVM_MEMORY_ATTRIBUTE_WRITE - KVM_MEMORY_ATTRIBUTE_EXECUTE - KVM_MEMORY_ATTRIBUTE_NO_ACCESS Note that NO_ACCESS is necessary in order to make a distinction between the lack of attributes for a gfn, which defaults to the memory protections of the backing memory, and explicitly prohibiting any access to that gfn. These new memory attributes will, for now, only be made available through the VSM KVM device (which we introduce in subsequent patches). Signed-off-by: Nicolas Saenz Julienne --- include/uapi/linux/kvm.h | 4 1 file changed, 4 insertions(+) diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index bd97c9852142..6b875c1040eb 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -2314,7 +2314,11 @@ struct kvm_memory_attributes { __u64 flags; }; +#define KVM_MEMORY_ATTRIBUTE_READ (1ULL << 0) +#define KVM_MEMORY_ATTRIBUTE_WRITE (1ULL << 1) +#define KVM_MEMORY_ATTRIBUTE_EXECUTE (1ULL << 2) #define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3) +#define KVM_MEMORY_ATTRIBUTE_NO_ACCESS (1ULL << 4) #define KVM_CREATE_GUEST_MEMFD _IOWR(KVMIO, 0xd4, struct kvm_create_guest_memfd) -- 2.40.1
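As a rough illustration of how these bits compose (the attribute names come from the hunk above; the variable names are made up for the example), a VMM tracking per-VTL protections might build values like:

   __u64 ro_exec = KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_EXECUTE;  /* read-only, executable */
   __u64 rw_noex = KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE;    /* read/write, no execute */
   __u64 revoked = KVM_MEMORY_ATTRIBUTE_NO_ACCESS;                            /* explicitly no access */
   __u64 cleared = 0;                 /* no attributes: fall back to the backing memory's protections */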
[RFC 26/33] KVM: x86: hyper-vsm: Allow setting per-VTL memory attributes
Introduce KVM_SET_MEMORY_ATTRIBUTES ioctl support for VTL KVM devices. The attributes are stored in an xarray private to the VTL device. The following memory attributes are supported: - KVM_MEMORY_ATTRIBUTE_READ - KVM_MEMORY_ATTRIBUTE_WRITE - KVM_MEMORY_ATTRIBUTE_EXECUTE - KVM_MEMORY_ATTRIBUTE_NO_ACCESS Although only some combinations are valid, see code comment below. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 61 +++ 1 file changed, 61 insertions(+) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 0d8402dba596..bcace0258af1 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -62,6 +62,10 @@ */ #define HV_EXT_CALL_MAX (HV_EXT_CALL_QUERY_CAPABILITIES + 64) +#define KVM_HV_VTL_ATTRS \ + (KVM_MEMORY_ATTRIBUTE_READ | KVM_MEMORY_ATTRIBUTE_WRITE | \ +KVM_MEMORY_ATTRIBUTE_EXECUTE | KVM_MEMORY_ATTRIBUTE_NO_ACCESS) + static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer, bool vcpu_kick); @@ -3025,6 +3029,7 @@ int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *stat struct kvm_hv_vtl_dev { int vtl; + struct xarray mem_attrs; }; static int kvm_hv_vtl_get_attr(struct kvm_device *dev, @@ -3047,16 +3052,71 @@ static void kvm_hv_vtl_release(struct kvm_device *dev) { struct kvm_hv_vtl_dev *vtl_dev = dev->private; + xa_destroy(&vtl_dev->mem_attrs); kfree(vtl_dev); kfree(dev); /* alloc by kvm_ioctl_create_device, free by .release */ } +/* + * The TLFS lists the valid memory protection combinations (15.9.3): + * - No access + * - Read-only, no execute + * - Read-only, execute + * - Read/write, no execute + * - Read/write, execute + */ +static bool kvm_hv_validate_vtl_mem_attributes(struct kvm_memory_attributes *attrs) +{ + u64 attr = attrs->attributes; + + if (attr & ~KVM_HV_VTL_ATTRS) + return false; + + if (attr == KVM_MEMORY_ATTRIBUTE_NO_ACCESS) + return true; + + if (!(attr & KVM_MEMORY_ATTRIBUTE_READ)) + return false; + + return true; +} + +static long kvm_hv_vtl_ioctl(struct kvm_device *dev, unsigned int ioctl, +unsigned long arg) +{ + switch (ioctl) { + case KVM_SET_MEMORY_ATTRIBUTES: { + struct kvm_hv_vtl_dev *vtl_dev = dev->private; + struct kvm_memory_attributes attrs; + int r; + + if (copy_from_user(&attrs, (void __user *)arg, sizeof(attrs))) + return -EFAULT; + + r = -EINVAL; + if (!kvm_hv_validate_vtl_mem_attributes(&attrs)) + return r; + + r = kvm_ioctl_set_mem_attributes(dev->kvm, &vtl_dev->mem_attrs, +KVM_HV_VTL_ATTRS, &attrs); + if (r) + return r; + break; + } + default: + return -ENOTTY; + } + + return 0; +} + static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type); static struct kvm_device_ops kvm_hv_vtl_ops = { .name = "kvm-hv-vtl", .create = kvm_hv_vtl_create, .release = kvm_hv_vtl_release, + .ioctl = kvm_hv_vtl_ioctl, .get_attr = kvm_hv_vtl_get_attr, }; @@ -3076,6 +3136,7 @@ static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type) vtl++; vtl_dev->vtl = vtl; + xa_init(&vtl_dev->mem_attrs); dev->private = vtl_dev; return 0; -- 2.40.1
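For reference, a minimal user-space sketch of issuing this ioctl against a VTL device; the vtl_fd descriptor and the helper name are assumptions (the fd would come from KVM_CREATE_DEVICE, see the VTL device patch in this series), while the structure and ioctl are the ones handled above:

   #include <sys/ioctl.h>
   #include <linux/kvm.h>

   /* Make one guest page read-only but executable for the VTL behind vtl_fd. */
   static int vtl_protect_ro_x(int vtl_fd, __u64 gpa)
   {
           struct kvm_memory_attributes attrs = {
                   .address    = gpa & ~0xfffULL,
                   .size       = 0x1000,
                   .attributes = KVM_MEMORY_ATTRIBUTE_READ |
                                 KVM_MEMORY_ATTRIBUTE_EXECUTE,
           };

           return ioctl(vtl_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
   }

A subsequent write to that page from the corresponding VTL would then fail the per-VTL validation introduced later in the series and be reported back to user-space.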
[RFC 24/33] KVM: x86: hyper-v: Introduce KVM VTL device
Introduce a new KVM device aimed at tracking partition wide VTL state, it'll be the one responsible from keeping track of VTL's memory protections. For now its functionality it's limited, it only exposes its VTL level through a device attribute. Additionally, the device type is only registered if the VSM cap is enabled. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c| 68 arch/x86/kvm/hyperv.h| 3 ++ arch/x86/kvm/x86.c | 3 ++ include/uapi/linux/kvm.h | 5 +++ 4 files changed, 79 insertions(+) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index a266c5d393f5..0d8402dba596 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -3022,3 +3022,71 @@ int kvm_vm_ioctl_get_hv_vsm_state(struct kvm *kvm, struct kvm_hv_vsm_state *stat state->vsm_code_page_offsets = hv->vsm_code_page_offsets.as_u64; return 0; } + +struct kvm_hv_vtl_dev { + int vtl; +}; + +static int kvm_hv_vtl_get_attr(struct kvm_device *dev, + struct kvm_device_attr *attr) +{ + struct kvm_hv_vtl_dev *vtl_dev = dev->private; + + switch (attr->group) { + case KVM_DEV_HV_VTL_GROUP: + switch (attr->attr){ + case KVM_DEV_HV_VTL_GROUP_VTLNUM: + return put_user(vtl_dev->vtl, (u32 __user *)attr->addr); + } + } + + return -EINVAL; +} + +static void kvm_hv_vtl_release(struct kvm_device *dev) +{ + struct kvm_hv_vtl_dev *vtl_dev = dev->private; + + kfree(vtl_dev); + kfree(dev); /* alloc by kvm_ioctl_create_device, free by .release */ +} + +static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type); + +static struct kvm_device_ops kvm_hv_vtl_ops = { + .name = "kvm-hv-vtl", + .create = kvm_hv_vtl_create, + .release = kvm_hv_vtl_release, + .get_attr = kvm_hv_vtl_get_attr, +}; + +static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type) +{ + struct kvm_hv_vtl_dev *vtl_dev; + struct kvm_device *tmp; + int vtl = 0; + + vtl_dev = kzalloc(sizeof(*vtl_dev), GFP_KERNEL_ACCOUNT); + if (!vtl_dev) + return -ENOMEM; + + /* Device creation is protected by kvm->lock */ + list_for_each_entry(tmp, &dev->kvm->devices, vm_node) + if (tmp->ops == &kvm_hv_vtl_ops) + vtl++; + + vtl_dev->vtl = vtl; + dev->private = vtl_dev; + + return 0; +} + +int kvm_hv_vtl_dev_register(void) +{ + return kvm_register_device_ops(&kvm_hv_vtl_ops, KVM_DEV_TYPE_HV_VSM_VTL); +} + +void kvm_hv_vtl_dev_unregister(void) +{ + kvm_unregister_device_ops(KVM_DEV_TYPE_HV_VSM_VTL); +} diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 605e80b9e5eb..3cc664e144d8 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -269,4 +269,7 @@ static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu, role->vtl = kvm_hv_get_active_vtl(vcpu); } +int kvm_hv_vtl_dev_register(void); +void kvm_hv_vtl_dev_unregister(void); + #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index bf4891bc044e..82d3b86d9c93 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6521,6 +6521,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, mutex_unlock(&kvm->lock); break; case KVM_CAP_HYPERV_VSM: + kvm_hv_vtl_dev_register(); kvm->arch.hyperv.hv_enable_vsm = true; r = 0; break; @@ -9675,6 +9676,8 @@ void kvm_x86_vendor_exit(void) mutex_lock(&vendor_module_lock); kvm_x86_ops.hardware_enable = NULL; mutex_unlock(&vendor_module_lock); + + kvm_hv_vtl_dev_unregister(); } EXPORT_SYMBOL_GPL(kvm_x86_vendor_exit); diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index 0ddffb8b0c99..bd97c9852142 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -1471,6 +1471,9 @@ struct kvm_device_attr { #define 
KVM_DEV_VFIO_GROUP_DEL KVM_DEV_VFIO_FILE_DEL #define KVM_DEV_VFIO_GROUP_SET_SPAPR_TCE 3 +#define KVM_DEV_HV_VTL_GROUP 1 +#define KVM_DEV_HV_VTL_GROUP_VTLNUM 1 + enum kvm_device_type { KVM_DEV_TYPE_FSL_MPIC_20= 1, #define KVM_DEV_TYPE_FSL_MPIC_20 KVM_DEV_TYPE_FSL_MPIC_20 @@ -1494,6 +1497,8 @@ enum kvm_device_type { #define KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_ARM_PV_TIME KVM_DEV_TYPE_RISCV_AIA, #define KVM_DEV_TYPE_RISCV_AIA KVM_DEV_TYPE_RISCV_AIA + KVM_DEV_TYPE_HV_VSM_VTL, +#define KVM_DEV_TYPE_HV_VSM_VTLKVM_DEV_TYPE_HV_VSM_VTL KVM_DEV_TYPE_MAX, }; -- 2.40.1
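For completeness, a rough user-space sketch of creating the device and reading back its VTL number; vm_fd and the error handling are assumptions, while the device type, group and attribute come from the hunks above (the device type is only registered once KVM_CAP_HYPERV_VSM has been enabled):

   #include <sys/ioctl.h>
   #include <linux/kvm.h>

   struct kvm_create_device cd = { .type = KVM_DEV_TYPE_HV_VSM_VTL };
   struct kvm_device_attr attr;
   __u32 vtl;

   if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd))
           return -1;

   attr = (struct kvm_device_attr) {
           .group = KVM_DEV_HV_VTL_GROUP,
           .attr  = KVM_DEV_HV_VTL_GROUP_VTLNUM,
           .addr  = (__u64)(unsigned long)&vtl,
   };

   if (ioctl(cd.fd, KVM_GET_DEVICE_ATTR, &attr))
           return -1;

   /* vtl is 0 for the first device created, 1 for the second, and so on. */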
[RFC 28/33] x86/hyper-v: Introduce memory intercept message structure
Introduce struct hv_memory_intercept_message, which is used when issuing memory intercepts to a Hyper-V VSM guest. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/hyperv-tlfs.h | 76 ++ 1 file changed, 76 insertions(+) diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index af594aa65307..d3d74fde6da1 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -799,6 +799,82 @@ struct hv_get_vp_from_apic_id_in { u32 apic_ids[]; } __packed; + +/* struct hv_intercept_header::access_type_mask */ +#define HV_INTERCEPT_ACCESS_MASK_NONE0 +#define HV_INTERCEPT_ACCESS_MASK_READ1 +#define HV_INTERCEPT_ACCESS_MASK_WRITE 2 +#define HV_INTERCEPT_ACCESS_MASK_EXECUTE 4 + +/* struct hv_intercept_exception::cache_type */ +#define HV_X64_CACHE_TYPE_UNCACHED 0 +#define HV_X64_CACHE_TYPE_WRITECOMBINING 1 +#define HV_X64_CACHE_TYPE_WRITETHROUGH 4 +#define HV_X64_CACHE_TYPE_WRITEPROTECTED 5 +#define HV_X64_CACHE_TYPE_WRITEBACK 6 + +/* Intecept message header */ +struct hv_intercept_header { + __u32 vp_index; + __u8 instruction_length; +#define HV_INTERCEPT_ACCESS_READ0 +#define HV_INTERCEPT_ACCESS_WRITE 1 +#define HV_INTERCEPT_ACCESS_EXECUTE 2 + __u8 access_type_mask; + union { + __u16 as_u16; + struct { + __u16 cpl:2; + __u16 cr0_pe:1; + __u16 cr0_am:1; + __u16 efer_lma:1; + __u16 debug_active:1; + __u16 interruption_pending:1; + __u16 reserved:9; + }; + } exec_state; + struct hv_x64_segment_register cs; + __u64 rip; + __u64 rflags; +} __packed; + +union hv_x64_memory_access_info { + __u8 as_u8; + struct { + __u8 gva_valid:1; + __u8 _reserved:7; + }; +}; + +struct hv_memory_intercept_message { + struct hv_intercept_header header; + __u32 cache_type; + __u8 instruction_byte_count; + union hv_x64_memory_access_info memory_access_info; + __u16 _reserved; + __u64 gva; + __u64 gpa; + __u8 instruction_bytes[16]; + struct hv_x64_segment_register ds; + struct hv_x64_segment_register ss; + __u64 rax; + __u64 rcx; + __u64 rdx; + __u64 rbx; + __u64 rsp; + __u64 rbp; + __u64 rsi; + __u64 rdi; + __u64 r8; + __u64 r9; + __u64 r10; + __u64 r11; + __u64 r12; + __u64 r13; + __u64 r14; + __u64 r15; +} __packed; + #include #endif -- 2.40.1
[RFC 27/33] KVM: x86/mmu/hyper-v: Validate memory faults against per-VTL memprots
Introduce a new step in __kvm_faultin_pfn() that'll validate the fault against the vCPU's VTL protections and generate a user space exit when invalid. Note that kvm_hv_faultin_pfn() has to be run after resolving the fault against the memslots, since that operation steps over 'fault->map_writable'. Non VSM users shouldn't see any behaviour change. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 66 ++ arch/x86/kvm/hyperv.h | 1 + arch/x86/kvm/mmu/mmu.c | 9 +- 3 files changed, 75 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index bcace0258af1..eb6a4848e306 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -42,6 +42,8 @@ #include "irq.h" #include "fpu.h" +#include "mmu/mmu_internal.h" + #define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, HV_VCPUS_PER_SPARSE_BANK) /* @@ -3032,6 +3034,55 @@ struct kvm_hv_vtl_dev { struct xarray mem_attrs; }; +static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu); + +bool kvm_hv_vsm_access_valid(struct kvm_page_fault *fault, unsigned long attrs) +{ + if (attrs == KVM_MEMORY_ATTRIBUTE_NO_ACCESS) + return false; + + /* We should never get here without read permissions, force a fault. */ + if (WARN_ON_ONCE(!(attrs & KVM_MEMORY_ATTRIBUTE_READ))) + return false; + + if (fault->write && !(attrs & KVM_MEMORY_ATTRIBUTE_WRITE)) + return false; + + if (fault->exec && !(attrs & KVM_MEMORY_ATTRIBUTE_EXECUTE)) + return false; + + return true; +} + +static unsigned long kvm_hv_vsm_get_memory_attributes(struct kvm_vcpu *vcpu, + gfn_t gfn) +{ + struct xarray *prots = kvm_hv_vsm_get_memprots(vcpu); + + if (!prots) + return 0; + + return xa_to_value(xa_load(prots, gfn)); +} + +int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) +{ + unsigned long attrs; + + attrs = kvm_hv_vsm_get_memory_attributes(vcpu, fault->gfn); + if (!attrs) + return RET_PF_CONTINUE; + + if (kvm_hv_vsm_access_valid(fault, attrs)) { + fault->map_executable = + !!(attrs & KVM_MEMORY_ATTRIBUTE_EXECUTE); + fault->map_writable = !!(attrs & KVM_MEMORY_ATTRIBUTE_WRITE); + return RET_PF_CONTINUE; + } + + return -EFAULT; +} + static int kvm_hv_vtl_get_attr(struct kvm_device *dev, struct kvm_device_attr *attr) { @@ -3120,6 +3171,21 @@ static struct kvm_device_ops kvm_hv_vtl_ops = { .get_attr = kvm_hv_vtl_get_attr, }; +static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu) +{ + struct kvm_hv_vtl_dev *vtl_dev; + struct kvm_device *tmp; + + list_for_each_entry(tmp, &vcpu->kvm->devices, vm_node) + if (tmp->ops == &kvm_hv_vtl_ops) { + vtl_dev = tmp->private; + if (vtl_dev->vtl == kvm_hv_get_active_vtl(vcpu)) + return &vtl_dev->mem_attrs; + } + + return NULL; +} + static int kvm_hv_vtl_create(struct kvm_device *dev, u32 type) { struct kvm_hv_vtl_dev *vtl_dev; diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 3cc664e144d8..ae781b4d4669 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -271,5 +271,6 @@ static inline void kvm_mmu_role_set_hv_bits(struct kvm_vcpu *vcpu, int kvm_hv_vtl_dev_register(void); void kvm_hv_vtl_dev_unregister(void); +int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault); #endif diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index a76028aa8fb3..ba454c7277dc 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -4374,7 +4374,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault fault->write, &fault->map_writable, &fault->hva); if (!async) - return 
RET_PF_CONTINUE; /* *pfn has correct page already */ + goto pf_continue; /* *pfn has correct page already */ if (!fault->prefetch && kvm_can_do_async_pf(vcpu)) { trace_kvm_try_async_get_page(fault->addr, fault->gfn); @@ -4395,6 +4395,13 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL, fault->write, &fault->map_writable, &fault->hva); +pf_continue: + if (kvm_hv_vsm_enabled(vcpu->kvm)) { + if (kvm_hv_faultin_pfn(vcpu, fault)) { + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); + re
[RFC 29/33] KVM: VMX: Save instruction length on EPT violation
Save the length of the instruction that triggered an EPT violation in struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory intercept messages. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/kvm_host.h | 2 ++ arch/x86/kvm/vmx/vmx.c | 1 + 2 files changed, 3 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1f5a85d461ce..1a854776d91e 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -967,6 +967,8 @@ struct kvm_vcpu_arch { /* set at EPT violation at this point */ unsigned long exit_qualification; + u32 exit_instruction_len; + /* pv related host specific info */ struct { bool pv_unhalted; diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c index 6e502ba93141..9c83ee3a293d 100644 --- a/arch/x86/kvm/vmx/vmx.c +++ b/arch/x86/kvm/vmx/vmx.c @@ -5773,6 +5773,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu) PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK; vcpu->arch.exit_qualification = exit_qualification; + vcpu->arch.exit_instruction_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN); /* * Check that the GPA doesn't exceed physical memory limits, as that is -- 2.40.1
[RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows injecting out-of-band Hyper-V secure intercepts. For now only memory access intercepts are supported. These are triggered when access a GPA protected by a higher VTL. The memory intercept metadata is filled based on the GPA provided through struct kvm_vcpu_hv_intercept_info, and injected into the guest through SynIC message. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/kvm_host.h | 10 +++ arch/x86/kvm/hyperv.c | 114 arch/x86/kvm/hyperv.h | 2 + arch/x86/kvm/x86.c | 3 + 4 files changed, 129 insertions(+) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1a854776d91e..39671e07 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -113,6 +113,7 @@ KVM_ARCH_REQ_FLAGS(31, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) #define KVM_REQ_HV_TLB_FLUSH \ KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) +#define KVM_REQ_HV_INJECT_INTERCEPTKVM_ARCH_REQ(33) #define CR0_RESERVED_BITS \ (~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \ @@ -639,6 +640,13 @@ struct kvm_vcpu_hv_tlb_flush_fifo { DECLARE_KFIFO(entries, u64, KVM_HV_TLB_FLUSH_FIFO_SIZE); }; +struct kvm_vcpu_hv_intercept_info { + struct kvm_vcpu *vcpu; + int type; + u64 gpa; + u8 access; +}; + /* Hyper-V per vcpu emulation context */ struct kvm_vcpu_hv { struct kvm_vcpu *vcpu; @@ -673,6 +681,8 @@ struct kvm_vcpu_hv { u64 vm_id; u32 vp_id; } nested; + + struct kvm_vcpu_hv_intercept_info intercept_info; }; struct kvm_hypervisor_cpuid { diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index eb6a4848e306..38ee3abdef9c 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2789,6 +2789,120 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) return 0; } +static void store_kvm_segment(const struct kvm_segment *kvmseg, + struct hv_x64_segment_register *reg) +{ + reg->base = kvmseg->base; + reg->limit = kvmseg->limit; + reg->selector = kvmseg->selector; + reg->segment_type = kvmseg->type; + reg->present = kvmseg->present; + reg->descriptor_privilege_level = kvmseg->dpl; + reg->_default = kvmseg->db; + reg->non_system_segment = kvmseg->s; + reg->_long = kvmseg->l; + reg->granularity = kvmseg->g; + reg->available = kvmseg->avl; +} + +static void deliver_gpa_intercept(struct kvm_vcpu *target_vcpu, + struct kvm_vcpu *intercepted_vcpu, u64 gpa, + u64 gva, u8 access_type_mask) +{ + ulong cr0; + struct hv_message msg = { 0 }; + struct hv_memory_intercept_message *intercept = (struct hv_memory_intercept_message *)msg.u.payload; + struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(target_vcpu); + struct x86_exception e; + struct kvm_segment kvmseg; + + msg.header.message_type = HVMSG_GPA_INTERCEPT; + msg.header.payload_size = sizeof(*intercept); + + intercept->header.vp_index = to_hv_vcpu(intercepted_vcpu)->vp_index; + intercept->header.instruction_length = intercepted_vcpu->arch.exit_instruction_len; + intercept->header.access_type_mask = access_type_mask; + kvm_x86_ops.get_segment(intercepted_vcpu, &kvmseg, VCPU_SREG_CS); + store_kvm_segment(&kvmseg, &intercept->header.cs); + + cr0 = kvm_read_cr0(intercepted_vcpu); + intercept->header.exec_state.cr0_pe = (cr0 & X86_CR0_PE); + intercept->header.exec_state.cr0_am = (cr0 & X86_CR0_AM); + intercept->header.exec_state.cpl = kvm_x86_ops.get_cpl(intercepted_vcpu); + intercept->header.exec_state.efer_lma = is_long_mode(intercepted_vcpu); + intercept->header.exec_state.debug_active = 0; + 
intercept->header.exec_state.interruption_pending = 0; + intercept->header.rip = kvm_rip_read(intercepted_vcpu); + intercept->header.rflags = kvm_get_rflags(intercepted_vcpu); + + /* +* For exec violations we don't have a way to decode an instruction that issued a fetch +* to a non-X page because CPU points RIP and GPA to the fetch destination in the faulted page. +* Instruction length though is the length of the fetch source. +* Seems like Hyper-V is aware of that and is not trying to access those fields. +*/ + if (access_type_mask == HV_INTERCEPT_ACCESS_EXECUTE) { + intercept->instruction_byte_count = 0; + } else { + intercept->instruction_byte_count = intercepted_vcpu->arch.exit_instruction_len; + if (intercept->instruction_byte_count > sizeof(intercept->instruction_bytes)) + intercept->
[RFC 31/33] KVM: x86: hyper-v: Inject intercept on VTL memory protection fault
Inject a Hyper-V secure intercept when a VTL tries to access memory that was protected by a more privileged VTL. The intercept is injected into the next enabled privileged VTL (for now, this patch takes a shortcut and assumes it's the one right after). After injecting the request, the KVM vCPU that took the fault will exit to user-space with a memory fault. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 38ee3abdef9c..983bf8af5f64 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -3150,6 +3150,32 @@ struct kvm_hv_vtl_dev { static struct xarray *kvm_hv_vsm_get_memprots(struct kvm_vcpu *vcpu); +static void kvm_hv_inject_gpa_intercept(struct kvm_vcpu *vcpu, + struct kvm_page_fault *fault) +{ + struct kvm_vcpu *target_vcpu = + kvm_hv_get_vtl_vcpu(vcpu, kvm_hv_get_active_vtl(vcpu) + 1); + struct kvm_vcpu_hv_intercept_info *intercept = + &target_vcpu->arch.hyperv->intercept_info; + + /* +* No target VTL available, log a warning and let user-space deal with +* the fault. +*/ + if (WARN_ON_ONCE(!target_vcpu)) + return; + + intercept->type = HVMSG_GPA_INTERCEPT; + intercept->gpa = fault->addr; + intercept->access = (fault->user ? HV_INTERCEPT_ACCESS_READ : 0) | + (fault->write ? HV_INTERCEPT_ACCESS_WRITE : 0) | + (fault->exec ? HV_INTERCEPT_ACCESS_EXECUTE : 0); + intercept->vcpu = vcpu; + + kvm_make_request(KVM_REQ_HV_INJECT_INTERCEPT, target_vcpu); + kvm_vcpu_kick(target_vcpu); +} + bool kvm_hv_vsm_access_valid(struct kvm_page_fault *fault, unsigned long attrs) { if (attrs == KVM_MEMORY_ATTRIBUTE_NO_ACCESS) @@ -3194,6 +3220,7 @@ int kvm_hv_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault) return RET_PF_CONTINUE; } + kvm_hv_inject_gpa_intercept(vcpu, fault); return -EFAULT; } -- 2.40.1
[RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS
Introduce HVCALL_TRANSLATE_VIRTUAL_ADDRESS, the hypercall receives a GVA, generally from a less privileged VTL, and returns the GPA backing it. The GVA -> GPA conversion is done by walking the target VTL's vCPU MMU. NOTE: The hypercall implementation is incomplete and only shared for completion. Additionally we'd like to move the VTL aware parts to user-space. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 98 +++ arch/x86/kvm/trace.h | 23 include/asm-generic/hyperv-tlfs.h | 28 + 3 files changed, 149 insertions(+) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 983bf8af5f64..1cb53cd0708f 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2540,6 +2540,7 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc) case HVCALL_GET_VP_REGISTERS: case HVCALL_SET_VP_REGISTERS: case HVCALL_MODIFY_VTL_PROTECTION_MASK: + case HVCALL_TRANSLATE_VIRTUAL_ADDRESS: return true; } @@ -2556,6 +2557,96 @@ static void kvm_hv_hypercall_read_xmm(struct kvm_hv_hcall *hc) kvm_fpu_put(); } +static bool kvm_hv_xlate_va_validate_input(struct kvm_vcpu* vcpu, + struct hv_xlate_va_input *in, + u8 *vtl, u8 *flags) +{ + union hv_input_vtl in_vtl; + + if (in->partition_id != HV_PARTITION_ID_SELF) + return false; + + if (in->vp_index != HV_VP_INDEX_SELF && + in->vp_index != kvm_hv_get_vpindex(vcpu)) + return false; + + in_vtl.as_uint8 = in->control_flags >> 56; + *flags = in->control_flags & HV_XLATE_GVA_FLAGS_MASK; + if (*flags > (HV_XLATE_GVA_VAL_READ | + HV_XLATE_GVA_VAL_WRITE | + HV_XLATE_GVA_VAL_EXECUTE)) + pr_info_ratelimited("Translate VA control flags unsupported and will be ignored: 0x%llx\n", + in->control_flags); + + *vtl = in_vtl.use_target_vtl ? in_vtl.target_vtl : +kvm_hv_get_active_vtl(vcpu); + if (*vtl > kvm_hv_get_active_vtl(vcpu)) + return false; + + return true; +} + +static u64 kvm_hv_xlate_va_walk(struct kvm_vcpu* vcpu, u64 gva, u8 flags) +{ + struct kvm_mmu *mmu = vcpu->arch.walk_mmu; + u32 access = 0; + + if (flags & HV_XLATE_GVA_VAL_WRITE) + access |= PFERR_WRITE_MASK; + if (flags & HV_XLATE_GVA_VAL_EXECUTE) + access |= PFERR_FETCH_MASK; + + return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, mmu, gva, access, NULL); +} + +static u64 kvm_hv_translate_virtual_address(struct kvm_vcpu* vcpu, + struct kvm_hv_hcall *hc) +{ + struct hv_xlate_va_output output = {}; + struct hv_xlate_va_input input; + struct kvm_vcpu *target_vcpu; + u8 flags, target_vtl; + + if (hc->fast) { + input.partition_id = hc->ingpa; + input.vp_index = hc->outgpa & 0x; + input.control_flags = sse128_lo(hc->xmm[0]); + input.gva = sse128_hi(hc->xmm[0]); + } else { + if (kvm_read_guest(vcpu->kvm, hc->ingpa, &input, sizeof(input))) + return HV_STATUS_INVALID_HYPERCALL_INPUT; + } + + trace_kvm_hv_translate_virtual_address(input.partition_id, + input.vp_index, + input.control_flags, input.gva); + + if (!kvm_hv_xlate_va_validate_input(vcpu, &input, &target_vtl, &flags)) + return HV_STATUS_INVALID_HYPERCALL_INPUT; + + target_vcpu = kvm_hv_get_vtl_vcpu(vcpu, target_vtl); + output.gpa = kvm_hv_xlate_va_walk(target_vcpu, input.gva << PAGE_SHIFT, + flags); + if (output.gpa == INVALID_GPA) { + output.result_code = HV_XLATE_GVA_UNMAPPED; + } else { + output.gpa >>= PAGE_SHIFT; + output.result_code = HV_XLATE_GVA_SUCCESS; + output.cache_type = HV_CACHE_TYPE_X64_WB; + } + + if (hc->fast) { + memcpy(&hc->xmm[1], &output, sizeof(output)); + hc->xmm_dirty = true; + } else { + if (kvm_write_guest(vcpu->kvm, hc->outgpa, &output, + sizeof(output))) + return HV_STATUS_INVALID_HYPERCALL_INPUT; + } + + return 
HV_STATUS_SUCCESS; +} + static bool hv_check_hypercall_access(struct kvm_vcpu_hv *hv_vcpu, u16 code) { if (!hv_vcpu->enforce_cpuid) @@ -2766,6 +2857,13 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) case HVCALL_VTL_CALL: case HVCALL_VTL_RETURN: goto hypercall_userspace_exit; + case HVCALL_TRANSLATE_VIRTUAL_ADDRESS: +
[RFC 33/33] Documentation: KVM: Introduce "Emulating Hyper-V VSM with KVM"
Introduce "Emulating Hyper-V VSM with KVM", which describes the KVM APIs made available to a VMM that wants to emulate Hyper-V's VSM. Signed-off-by: Nicolas Saenz Julienne --- .../virt/kvm/x86/emulating-hyperv-vsm.rst | 136 ++ 1 file changed, 136 insertions(+) create mode 100644 Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst diff --git a/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst b/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst new file mode 100644 index ..8f76bf09c530 --- /dev/null +++ b/Documentation/virt/kvm/x86/emulating-hyperv-vsm.rst @@ -0,0 +1,136 @@ +.. SPDX-License-Identifier: GPL-2.0 + +== +Emulating Hyper-V VSM with KVM +== + +Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature +that leverages the hypervisor to create secure execution environments +within a guest. VSM is documented as part of Microsoft's Hypervisor Top +Level Functional Specification[1]. + +Emulating Hyper-V's Virtual Secure Mode with KVM requires coordination +between KVM and the VMM. Most of the VSM state and configuration is left +to be handled by user-space, but some has made its way into KVM. This +document describes the mechanisms through which a VMM can implement VSM +support. + +Virtual Trust Levels + + +The main concept VSM introduces are Virtual Trust Levels or VTLs. Each +VTL is a CPU mode, with its own private CPU architectural state, +interrupt subsystem (limited to a local APIC), and memory access +permissions. VTLs are hierarchical, where VTL0 corresponds to normal +guest execution and VTL > 0 to privileged execution contexts. In +practice, when virtualising Windows on top of KVM, we only see VTL0 and +VTL1. Although the spec allows going all the way to VTL15. VTLs are +orthogonal to ring levels, so each VTL is capable of runnig its own +operating system and user-space[2]. + + + ??? Normal Mode (VTL0) ??? ??? Secure Mode (VTL1) ??? + ??? ??? ??? ??? + ??? ??? User-mode Processes??? ??? ??? ???Secure User-mode Processes??? ??? + ??? ??? ??? ??? + ??? ??? ??? ??? + ??? ??? Kernel ??? ??? ??? ??? Secure Kernel ??? ??? + ??? ??? ??? ??? + + ??? + ??? Hypervisor/KVM??? + ??? + ??? + ??? Hardware??? + ??? + +VTLs break the core assumption that a vCPU has a single architectural +state, lAPIC state, SynIC state, etc. As such, each VTL is modeled as a +distinct KVM vCPU, with the restriction that only one is allowed to run +at any moment in time. Having multiple KVM vCPUs tracking a single guest +CPU complicates vCPU numbering. VMs that enable VSM are expected to use +CAP_APIC_ID_GROUPS to segregate vCPUs (and their lAPICs) into different +groups. For example, a 4 CPU VSM VM will setup the APIC ID groups feature +so only the first two bits of the APIC ID are exposed to the guest. The +remaining bits represent the vCPU's VTL. The 'sibling' vCPU to VTL0's +vCPU2 at VTL3 will h
Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
Hey Nicolas, On 08.11.23 12:17, Nicolas Saenz Julienne wrote: Hyper-V's Virtual Secure Mode (VSM) is a virtualisation security feature that leverages the hypervisor to create secure execution environments within a guest. VSM is documented as part of Microsoft's Hypervisor Top Level Functional Specification [1]. Security features that build upon VSM, like Windows Credential Guard, are enabled by default on Windows 11, and are becoming a prerequisite in some industries. This RFC series introduces the necessary infrastructure to emulate VSM enabled guests. It is a snapshot of the progress we made so far, and its main goal is to gather design feedback. Specifically on the KVM APIs we introduce. For a high level design overview, see the documentation in patch 33. Additionally, this topic will be discussed as part of the KVM Micro-conference, in this year's Linux Plumbers Conference [2]. Awesome, looking forward to the session! :) The series is accompanied by two repositories: - A PoC QEMU implementation of VSM [3]. - VSM kvm-unit-tests [4]. Note that this isn't a full VSM implementation. For now it only supports 2 VTLs, and only runs on uniprocessor guests. It is capable of booting Windows Sever 2016/2019, but is unstable during runtime. How much of these limitations are inherent in the current set of patches? What is missing to go beyond 2 VTLs and into SMP land? Anything that will require API changes? Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
On 08.11.23 12:17, Nicolas Saenz Julienne wrote: Prepare infrastructure to be able to return data through the XMM registers when Hyper-V hypercalls are issues in fast mode. The XMM registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and restored on successful hypercall completion. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/hyperv-tlfs.h | 2 +- arch/x86/kvm/hyperv.c | 33 +- include/uapi/linux/kvm.h | 6 ++ 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 2ff26f53cd62..af594aa65307 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -49,7 +49,7 @@ /* Support for physical CPU dynamic partitioning events is available*/ #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE BIT(3) /* - * Support for passing hypercall input parameter block via XMM + * Support for passing hypercall input and output parameter block via XMM * registers is available */ #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE BIT(4) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 238afd7335e4..e1bc861ab3b0 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall { u16 rep_idx; bool fast; bool rep; + bool xmm_dirty; sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS]; /* @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result) return ret; } +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm) +{ + int reg; + + kvm_fpu_get(); + for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) { + const sse128_t data = sse128(xmm[reg].low, xmm[reg].high); + _kvm_write_sse_reg(reg, &data); + } + kvm_fpu_put(); +} + +static bool kvm_hv_is_xmm_output_hcall(u16 code) +{ + return false; +} + static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) { - return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result); + bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT); + u16 code = vcpu->run->hyperv.u.hcall.input & 0x; + u64 result = vcpu->run->hyperv.u.hcall.result; + + if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast) + kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm); + + return kvm_hv_hypercall_complete(vcpu, result); } static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) break; } + if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty) + kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm); + hypercall_complete: return kvm_hv_hypercall_complete(vcpu, ret); @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) vcpu->run->hyperv.u.hcall.input = hc.param; vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa; vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa; + if (hc.fast) + memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm)); vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace; return 0; } @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE; + ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE; Shouldn't this be guarded by an ENABLE_CAP to make sure old user space that doesn't know about xmm outputs is still able to run with newer kernels? 
ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE; ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE; diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h index d7a01766bf21..5ce06a1eee2b 100644 --- a/include/uapi/linux/kvm.h +++ b/include/uapi/linux/kvm.h @@ -192,6 +192,11 @@ struct kvm_s390_cmma_log { __u64 values; }; +struct kvm_hyperv_xmm_reg { + __u64 low; + __u64 high; +}; + struct kvm_hyperv_exit { #define KVM_EXIT_HYPERV_SYNIC 1 #define KVM_EXIT_HYPERV_HCALL 2 @@ -210,6 +215,7 @@ struct kvm_hyperv_exit { __u64 input; __u64 result; __u64 params[2]; + struct kvm_hyperv_xmm_reg xmm[6]; Would this change the size of struct kvm_hyperv_exit? And if so, wouldn't that potentially be a UABI breakage? Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsger
Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
On 08.11.23 12:17, Nicolas Saenz Julienne wrote: VTL call/return hypercalls have their own entry points in the hypercall page because they don't follow normal hyper-v hypercall conventions. Move the VTL call/return control input into ECX/RAX and set the hypercall code into EAX/RCX before calling the hypercall instruction in order to be able to use the Hyper-V hypercall entry function. Guests can read an emulated code page offsets register to know the offsets into the hypercall page for the VTL call/return entries. Signed-off-by: Nicolas Saenz Julienne --- My tree has the additional patch, we're still trying to understand under what conditions Windows expects the offset to be fixed. diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 54f7f36a89bf..9f2ea8c34447 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -294,6 +294,7 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) /* VTL call/return entries */ if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) { + i = 22; #ifdef CONFIG_X86_64 if (is_64_bit_mode(vcpu)) { /* --- arch/x86/include/asm/kvm_host.h | 2 + arch/x86/kvm/hyperv.c | 78 ++- include/asm-generic/hyperv-tlfs.h | 11 + 3 files changed, 90 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index a2f224f95404..00cd21b09f8c 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1105,6 +1105,8 @@ struct kvm_hv { u64 hv_tsc_emulation_status; u64 hv_invtsc_control; + union hv_register_vsm_code_page_offsets vsm_code_page_offsets; + /* How many vCPUs have VP index != vCPU index */ atomic_t num_mismatched_vp_indexes; diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 78d053042667..d4b1b53ea63d 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -259,7 +259,8 @@ static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr) static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) { struct kvm *kvm = vcpu->kvm; - u8 instructions[9]; + struct kvm_hv *hv = to_kvm_hv(kvm); + u8 instructions[0x30]; int i = 0; u64 addr; @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, u64 data) /* ret */ ((unsigned char *)instructions)[i++] = 0xc3; + /* VTL call/return entries */ + if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) { You don't introduce kvm_hv_vsm_enabled() before. Please do a quick test build of all individual commits of your patch set for v1 :). +#ifdef CONFIG_X86_64 Why do you need the ifdef here? is_long_mode() already has an ifdef that will always return false for is_64_bit_mode() on 32bit hosts. + if (is_64_bit_mode(vcpu)) { + /* +* VTL call 64-bit entry prologue: +* mov %rcx, %rax +* mov $0x11, %ecx +* jmp 0: +*/ + hv->vsm_code_page_offsets.vtl_call_offset = i; + instructions[i++] = 0x48; + instructions[i++] = 0x89; + instructions[i++] = 0xc8; + instructions[i++] = 0xb9; + instructions[i++] = 0x11; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0x00; + instructions[i++] = 0xeb; + instructions[i++] = 0xe0; I think it would be a lot easier to read (because it's denser) if you move the opcodes into a character array: char vtl_entry[] = { 0x48, 0x89, 0xc8, 0xb9, 0x11, 0x00, 0x00, 0x00. 0xeb, 0xe0 }; and then just memcpy(). Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
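To make the suggestion above concrete, a small sketch of the denser variant; the opcodes are the ones from the quoted patch and the surrounding names (instructions, i, hv) are the patch's own, only the array name is new. Note the jmp displacement (0xe0) is relative, so it still assumes the same distance back to the hypercall entry at offset 0:

   /* VTL call 64-bit entry prologue: mov %rcx,%rax; mov $0x11,%ecx; jmp 0: */
   static const u8 vtl_call_entry_64[] = {
           0x48, 0x89, 0xc8,               /* mov  %rcx, %rax  */
           0xb9, 0x11, 0x00, 0x00, 0x00,   /* mov  $0x11, %ecx */
           0xeb, 0xe0,                     /* jmp  back to the main entry */
   };

   hv->vsm_code_page_offsets.vtl_call_offset = i;
   memcpy(&instructions[i], vtl_call_entry_64, sizeof(vtl_call_entry_64));
   i += sizeof(vtl_call_entry_64);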
Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
Alexander Graf writes: > On 08.11.23 12:17, Nicolas Saenz Julienne wrote: >> Prepare infrastructure to be able to return data through the XMM >> registers when Hyper-V hypercalls are issues in fast mode. The XMM >> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and >> restored on successful hypercall completion. >> >> Signed-off-by: Nicolas Saenz Julienne >> --- >> arch/x86/include/asm/hyperv-tlfs.h | 2 +- >> arch/x86/kvm/hyperv.c | 33 +- >> include/uapi/linux/kvm.h | 6 ++ >> 3 files changed, 39 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/include/asm/hyperv-tlfs.h >> b/arch/x86/include/asm/hyperv-tlfs.h >> index 2ff26f53cd62..af594aa65307 100644 >> --- a/arch/x86/include/asm/hyperv-tlfs.h >> +++ b/arch/x86/include/asm/hyperv-tlfs.h >> @@ -49,7 +49,7 @@ >> /* Support for physical CPU dynamic partitioning events is available*/ >> #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE BIT(3) >> /* >> - * Support for passing hypercall input parameter block via XMM >> + * Support for passing hypercall input and output parameter block via XMM >>* registers is available >>*/ >> #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE BIT(4) >> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c >> index 238afd7335e4..e1bc861ab3b0 100644 >> --- a/arch/x86/kvm/hyperv.c >> +++ b/arch/x86/kvm/hyperv.c >> @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall { >> u16 rep_idx; >> bool fast; >> bool rep; >> +bool xmm_dirty; >> sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS]; >> >> /* >> @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu >> *vcpu, u64 result) >> return ret; >> } >> >> +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm) >> +{ >> +int reg; >> + >> +kvm_fpu_get(); >> +for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) { >> +const sse128_t data = sse128(xmm[reg].low, xmm[reg].high); >> +_kvm_write_sse_reg(reg, &data); >> +} >> +kvm_fpu_put(); >> +} >> + >> +static bool kvm_hv_is_xmm_output_hcall(u16 code) >> +{ >> +return false; >> +} >> + >> static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) >> { >> -return kvm_hv_hypercall_complete(vcpu, >> vcpu->run->hyperv.u.hcall.result); >> +bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT); >> +u16 code = vcpu->run->hyperv.u.hcall.input & 0x; >> +u64 result = vcpu->run->hyperv.u.hcall.result; >> + >> +if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && >> fast) >> +kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm); >> + >> +return kvm_hv_hypercall_complete(vcpu, result); >> } >> >> static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct >> kvm_hv_hcall *hc) >> @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) >> break; >> } >> >> +if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && >> hc.xmm_dirty) >> +kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm); >> + >> hypercall_complete: >> return kvm_hv_hypercall_complete(vcpu, ret); >> >> @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) >> vcpu->run->hyperv.u.hcall.input = hc.param; >> vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa; >> vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa; >> +if (hc.fast) >> +memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm)); >> vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace; >> return 0; >> } >> @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct >> kvm_cpuid2 *cpuid, >> ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; >> >> ent->edx |= 
HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE; >> +ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE; > > > Shouldn't this be guarded by an ENABLE_CAP to make sure old user space > that doesn't know about xmm outputs is still able to run with newer kernels? > No, we don't do CAPs for new Hyper-V features anymore since we have KVM_GET_SUPPORTED_HV_CPUID. Userspace is not supposed to simply copy its output into guest visible CPUIDs, it must only enable features it knows. Even 'hv_passthrough' option in QEMU doesn't pass unknown features through. > >> ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE; >> ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE; >> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h >> index d7a01766bf21..5ce06a1eee2b 100644 >> --- a/include/uapi/linux/kvm.h >> +++ b/include/uapi/linux/kvm.h >> @@ -192,6 +192,11 @@ struct kvm_s390_cmma_log { >> __u64 values; >> }; >> >> +struct kvm_hyperv_xmm_reg { >> +__u64 low; >> +__u64 high; >> +}; >> + >> struct kvm_hyperv_exit
Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
On 08.11.23 12:17, Nicolas Saenz Julienne wrote: From: Anel Orazgaliyeva Introduce KVM_CAP_APIC_ID_GROUPS, this capability segments the VM's APIC ids into two. The lower bits, the physical APIC id, represent the part that's exposed to the guest. The higher bits, which are private to KVM, groups APICs together. APICs in different groups are isolated from each other, and IPIs can only be directed at APICs that share the same group as its source. Furthermore, groups are only relevant to IPIs, anything incoming from outside the local APIC complex: from the IOAPIC, MSIs, or PV-IPIs is targeted at the default APIC group, group 0. When routing IPIs with physical destinations, KVM will OR the source's vCPU APIC group with the ICR's destination ID and use that to resolve the target lAPIC. The APIC physical map is also made group aware in order to speed up this process. For the sake of simplicity, the logical map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI routing to the slower per-vCPU scan method. This capability serves as a building block to implement virtualisation based security features like Hyper-V's Virtual Secure Mode (VSM). VSM introduces a para-virtualised switch that allows for guest CPUs to jump into a different execution context, this switches into a different CPU state, lAPIC state, and memory protections. We model this in KVM by using distinct kvm_vcpus for each context. Moreover, execution contexts are hierarchical and its APICs are meant to remain functional even when the context isn't 'scheduled in'. For example, we have to keep track of timers' expirations, and interrupt execution of lesser priority contexts when relevant. Hence the need to alias physical APIC ids, while keeping the ability to target specific execution contexts. 
Signed-off-by: Anel Orazgaliyeva Co-developed-by: Nicolas Saenz Julienne Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/kvm_host.h | 3 ++ arch/x86/include/uapi/asm/kvm.h | 5 +++ arch/x86/kvm/lapic.c| 59 - arch/x86/kvm/lapic.h| 33 ++ arch/x86/kvm/x86.c | 15 + include/uapi/linux/kvm.h| 2 ++ 6 files changed, 108 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index dff10051e9b6..a2f224f95404 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1298,6 +1298,9 @@ struct kvm_arch { struct rw_semaphore apicv_update_lock; unsigned long apicv_inhibit_reasons; + u32 apic_id_group_mask; + u8 apic_id_group_shift; + gpa_t wall_clock; bool mwait_in_guest; diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index a448d0964fc0..f73d137784d7 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -565,4 +565,9 @@ struct kvm_pmu_event_filter { #define KVM_X86_DEFAULT_VM0 #define KVM_X86_SW_PROTECTED_VM 1 +/* for KVM_SET_APIC_ID_GROUPS */ +struct kvm_apic_id_groups { + __u8 n_bits; /* nr of bits used to represent group in the APIC ID */ +}; + #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c index 3e977dbbf993..f55d216cb2a0 100644 --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -141,7 +141,7 @@ static inline int apic_enabled(struct kvm_lapic *apic) static inline u32 kvm_x2apic_id(struct kvm_lapic *apic) { - return apic->vcpu->vcpu_id; + return kvm_apic_id(apic->vcpu); } static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu) @@ -219,8 +219,8 @@ static int kvm_recalculate_phys_map(struct kvm_apic_map *new, bool *xapic_id_mismatch) { struct kvm_lapic *apic = vcpu->arch.apic; - u32 x2apic_id = kvm_x2apic_id(apic); - u32 xapic_id = kvm_xapic_id(apic); + u32 x2apic_id = kvm_apic_id_and_group(vcpu); + u32 xapic_id = kvm_apic_id_and_group(vcpu); u32 physical_id; /* @@ -299,6 +299,13 @@ static void kvm_recalculate_logical_map(struct kvm_apic_map *new, u16 mask; u32 ldr; + /* +* Using maps for logical destinations when KVM_CAP_APIC_ID_GRUPS is in +* use isn't supported. +*/ + if (kvm_apic_group(vcpu)) + new->logical_mode = KVM_APIC_MODE_MAP_DISABLED; + if (new->logical_mode == KVM_APIC_MODE_MAP_DISABLED) return; @@ -370,6 +377,25 @@ enum { DIRTY }; +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm, + struct kvm_apic_id_groups *groups) +{ + u8 n_bits = groups->n_bits; + + if (n_bits > 32) + return -EINVAL; + + kvm->arch.apic_id_group_mask = n_bits ? GENMASK(31, 32 - n_bits): 0; + /* +* Bitshifts >= than the width of the type are UD, so set the +* apic group shi
Re: [RFC 11/33] KVM: x86: hyper-v: Handle GET/SET_VP_REGISTER hcall in user-space
On 08.11.23 12:17, Nicolas Saenz Julienne wrote: Let user-space handle HVCALL_GET_VP_REGISTERS and HVCALL_SET_VP_REGISTERS through the KVM_EXIT_HYPERV_HVCALL exit reason. Additionally, expose the cpuid bit. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.c | 9 + include/asm-generic/hyperv-tlfs.h | 1 + 2 files changed, 10 insertions(+) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index caaa859932c5..a3970d52eef1 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -2456,6 +2456,9 @@ static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm) static bool kvm_hv_is_xmm_output_hcall(u16 code) { + if (code == HVCALL_GET_VP_REGISTERS) + return true; + return false; } @@ -2520,6 +2523,8 @@ static bool is_xmm_fast_hypercall(struct kvm_hv_hcall *hc) case HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX: case HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX: case HVCALL_SEND_IPI_EX: + case HVCALL_GET_VP_REGISTERS: + case HVCALL_SET_VP_REGISTERS: return true; } @@ -2738,6 +2743,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) break; } goto hypercall_userspace_exit; + case HVCALL_GET_VP_REGISTERS: + case HVCALL_SET_VP_REGISTERS: + goto hypercall_userspace_exit; default: ret = HV_STATUS_INVALID_HYPERCALL_CODE; break; @@ -2903,6 +2911,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, ent->ebx |= HV_POST_MESSAGES; ent->ebx |= HV_SIGNAL_EVENTS; ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; + ent->ebx |= HV_ACCESS_VP_REGISTERS; Do we need to guard this? Alex
Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support
On 08.11.23 13:11, Vitaly Kuznetsov wrote: Alexander Graf writes: On 08.11.23 12:17, Nicolas Saenz Julienne wrote: Prepare infrastructure to be able to return data through the XMM registers when Hyper-V hypercalls are issues in fast mode. The XMM registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and restored on successful hypercall completion. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/include/asm/hyperv-tlfs.h | 2 +- arch/x86/kvm/hyperv.c | 33 +- include/uapi/linux/kvm.h | 6 ++ 3 files changed, 39 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h index 2ff26f53cd62..af594aa65307 100644 --- a/arch/x86/include/asm/hyperv-tlfs.h +++ b/arch/x86/include/asm/hyperv-tlfs.h @@ -49,7 +49,7 @@ /* Support for physical CPU dynamic partitioning events is available*/ #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE BIT(3) /* - * Support for passing hypercall input parameter block via XMM + * Support for passing hypercall input and output parameter block via XMM * registers is available */ #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE BIT(4) diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c index 238afd7335e4..e1bc861ab3b0 100644 --- a/arch/x86/kvm/hyperv.c +++ b/arch/x86/kvm/hyperv.c @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall { u16 rep_idx; bool fast; bool rep; +bool xmm_dirty; sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS]; /* @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu *vcpu, u64 result) return ret; } +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm) +{ +int reg; + +kvm_fpu_get(); +for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) { +const sse128_t data = sse128(xmm[reg].low, xmm[reg].high); +_kvm_write_sse_reg(reg, &data); +} +kvm_fpu_put(); +} + +static bool kvm_hv_is_xmm_output_hcall(u16 code) +{ +return false; +} + static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu) { -return kvm_hv_hypercall_complete(vcpu, vcpu->run->hyperv.u.hcall.result); +bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT); +u16 code = vcpu->run->hyperv.u.hcall.input & 0x; +u64 result = vcpu->run->hyperv.u.hcall.result; + +if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && fast) +kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm); + +return kvm_hv_hypercall_complete(vcpu, result); } static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *hc) @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) break; } +if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && hc.xmm_dirty) +kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm); + hypercall_complete: return kvm_hv_hypercall_complete(vcpu, ret); @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu) vcpu->run->hyperv.u.hcall.input = hc.param; vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa; vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa; +if (hc.fast) +memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm)); vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace; return 0; } @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid, ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS; ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE; +ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE; Shouldn't this be guarded by an ENABLE_CAP to make sure old user space that doesn't know about xmm outputs is still able to run with newer kernels? 
No, we don't do CAPs for new Hyper-V features anymore since we have KVM_GET_SUPPORTED_HV_CPUID. Userspace is not supposed to simply copy its output into guest-visible CPUIDs; it must only enable features it knows. Even the 'hv_passthrough' option in QEMU doesn't pass unknown features through.

Ah, nice :). That simplifies things.

Alex
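As a concrete illustration of that contract, here is a userspace-side sketch: the VMM takes KVM_GET_SUPPORTED_HV_CPUID as an upper bound and only forwards bits it understands. The leaf and feature names follow the kernel's hyperv-tlfs.h definitions used in the patch (userspace would typically carry its own copies of those defines, as QEMU does); the function itself is purely illustrative.

#include <stdint.h>
#include <linux/kvm.h>	/* struct kvm_cpuid2, struct kvm_cpuid_entry2 */

/* Merge only the Hyper-V feature bits this VMM actually understands. */
static uint32_t hv_features_edx_for_guest(const struct kvm_cpuid2 *supported)
{
	uint32_t edx = 0;

	for (uint32_t i = 0; i < supported->nent; i++) {
		const struct kvm_cpuid_entry2 *e = &supported->entries[i];

		if (e->function != HYPERV_CPUID_FEATURES)
			continue;

		/* Expose XMM hypercall output only if KVM reports it. */
		if (e->edx & HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE)
			edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
	}

	return edx;
}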
Re: [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
On 08.11.23 12:17, Nicolas Saenz Julienne wrote: Introduce two helper functions. The first one queries a vCPU's VTL level; the second one, given a struct kvm_vcpu and VTL pair, returns the corresponding 'sibling' struct kvm_vcpu at the right VTL. We keep track of each VTL's state by having a distinct struct kvm_vcpu for each level. VTL-vCPUs that belong to the same guest CPU share the same physical APIC id, but belong to different APIC groups where the APIC group represents the vCPU's VTL. Signed-off-by: Nicolas Saenz Julienne --- arch/x86/kvm/hyperv.h | 18 ++ 1 file changed, 18 insertions(+) diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h index 2bfed69ba0db..5433107e7cc8 100644 --- a/arch/x86/kvm/hyperv.h +++ b/arch/x86/kvm/hyperv.h @@ -23,6 +23,7 @@ #include #include "x86.h" +#include "lapic.h" /* "Hv#1" signature */ #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648 @@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct kvm_vcpu *vcpu) return &vcpu->kvm->arch.hyperv.hv_syndbg; } +static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, int vtl) +{ + struct kvm *kvm = vcpu->kvm; + u32 target_id = kvm_apic_id(vcpu); + + kvm_apic_id_set_group(kvm, vtl, &target_id); + if (vcpu->vcpu_id == target_id) + return vcpu; + + return kvm_get_vcpu_by_id(kvm, target_id); +} + +static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu) +{ + return kvm_apic_group(vcpu);

Shouldn't this check whether VTL is active? If someone wants to use APIC groups for a different purpose in the future, they'd suddenly find themselves in VTL code paths in other code (such as memory protections), no?

Alex
Re: [RFC 25/33] KVM: Introduce a set of new memory attributes
On 08.11.23 12:17, Nicolas Saenz Julienne wrote: Introduce the following memory attributes: - KVM_MEMORY_ATTRIBUTE_READ - KVM_MEMORY_ATTRIBUTE_WRITE - KVM_MEMORY_ATTRIBUTE_EXECUTE - KVM_MEMORY_ATTRIBUTE_NO_ACCESS Note that NO_ACCESS is necessary in order to make a distinction between the lack of attributes for a gfn, which defaults to the memory protections of the backing memory, versus explicitly prohibiting any access to that gfn. If we negate the attributes (no read, no write, no execute), we can keep 0 == default and 0b111 becomes "no access". Alex
Re: [PATCH v12 01/37] x86/cpufeatures: Add the cpu feature bit for WRMSRNS
On Mon, Oct 02, 2023 at 11:24:22PM -0700, Xin Li wrote: > Subject: Re: [PATCH v12 01/37] x86/cpufeatures: Add the cpu feature bit for > WRMSRNS For all your text: s/cpu/CPU/g > WRMSRNS is an instruction that behaves exactly like WRMSR, with > the only difference being that it is not a serializing instruction > by default. Under certain conditions, WRMSRNS may replace WRMSR to > improve performance. > > Add the CPU feature bit for WRMSRNS. > > Tested-by: Shan Kang > Signed-off-by: Xin Li > --- > arch/x86/include/asm/cpufeatures.h | 1 + > tools/arch/x86/include/asm/cpufeatures.h | 1 + > 2 files changed, 2 insertions(+) It looks to me like you can merge the first three patches into one as all they do is add that insn support. Then, further down in the patchset, it says: + if (cpu_feature_enabled(X86_FEATURE_FRED)) { + /* WRMSRNS is a baseline feature for FRED. */ but WRMSRNS is not mentioned in the FRED spec "Document Number: 346446-005US, Revision: 5.0" which, according to https://www.intel.com/content/www/us/en/content-details/780121/flexible-return-and-event-delivery-fred-specification.html is the latest. Am I looking at the wrong one? > diff --git a/arch/x86/include/asm/cpufeatures.h > b/arch/x86/include/asm/cpufeatures.h > index 58cb9495e40f..330876d34b68 100644 > --- a/arch/x86/include/asm/cpufeatures.h > +++ b/arch/x86/include/asm/cpufeatures.h > @@ -322,6 +322,7 @@ > #define X86_FEATURE_FSRS (12*32+11) /* "" Fast short REP STOSB */ > #define X86_FEATURE_FSRC (12*32+12) /* "" Fast short REP > {CMPSB,SCASB} */ > #define X86_FEATURE_LKGS (12*32+18) /* "" Load "kernel" > (userspace) GS */ > +#define X86_FEATURE_WRMSRNS (12*32+19) /* "" Non-Serializing Write > to Model Specific Register instruction */ /* "" Non-serializing WRMSR */ is more than enough. And now I'm wondering: when you're adding a separate CPUID bit, then the above should be + if (cpu_feature_enabled(X86_FEATURE_WRMSRNS)) { + /* WRMSRNS is a baseline feature for FRED. */ I see that you're adding a dependency: + { X86_FEATURE_FRED, X86_FEATURE_WRMSRNS }, which then means you don't need the X86_FEATURE_WRMSRNS definition at all and can use X86_FEATURE_FRED only. So, what's up? Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
On 08.11.23 12:18, Nicolas Saenz Julienne wrote: Save the length of the instruction that triggered an EPT violation in struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory intercept messages. Signed-off-by: Nicolas Saenz Julienne In v1, please do this for SVM as well :) Alex
Re: [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
On 08.11.23 12:18, Nicolas Saenz Julienne wrote: Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT, which allows injecting out-of-band Hyper-V secure intercepts. For now only memory access intercepts are supported. These are triggered when accessing a GPA protected by a higher VTL. The memory intercept metadata is filled based on the GPA provided through struct kvm_vcpu_hv_intercept_info, and injected into the guest through a SynIC message. Signed-off-by: Nicolas Saenz Julienne

IMHO memory protection violations should result in a user space exit. User space can then validate what to do with the violation and, if necessary, inject an intercept. That means from an API point of view, you want a new exit reason (violation) and an ioctl that allows you to transmit the violating CPU state into the target vCPU. I don't think the injection should even know that the source of data for the violation was a vCPU.

Alex
Re: [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS
On 08.11.23 12:18, Nicolas Saenz Julienne wrote: Introduce HVCALL_TRANSLATE_VIRTUAL_ADDRESS, the hypercall receives a GVA, generally from a less privileged VTL, and returns the GPA backing it. The GVA -> GPA conversion is done by walking the target VTL's vCPU MMU. NOTE: The hypercall implementation is incomplete and only shared for completion. Additionally we'd like to move the VTL aware parts to user-space.

Yes, please :). We should handle the complete hypercall in user space if possible. If you're afraid that the gva -> gpa conversion may run out of sync between the user-space and KVM implementations, let's introduce an ioctl that allows you to perform that conversion.

Alex
Re: [RFC 30/33] KVM: x86: hyper-v: Introduce KVM_REQ_HV_INJECT_INTERCEPT request
On Wed Nov 8, 2023 at 12:45 PM UTC, Alexander Graf wrote: > > On 08.11.23 12:18, Nicolas Saenz Julienne wrote: > > Introduce a new request type, KVM_REQ_HV_INJECT_INTERCEPT which allows > > injecting out-of-band Hyper-V secure intercepts. For now only memory > > access intercepts are supported. These are triggered when access a GPA > > protected by a higher VTL. The memory intercept metadata is filled based > > on the GPA provided through struct kvm_vcpu_hv_intercept_info, and > > injected into the guest through SynIC message. > > > > Signed-off-by: Nicolas Saenz Julienne > > > IMHO memory protection violations should result in a user space exit. It already does, it's not very explicit from the patch itself, since the functionality was introduced in through the "KVM: guest_memfd() and per-page attributes" series [1]. See this snippet in patch #27: + if (kvm_hv_vsm_enabled(vcpu->kvm)) { + if (kvm_hv_faultin_pfn(vcpu, fault)) { + kvm_mmu_prepare_memory_fault_exit(vcpu, fault); + return -EFAULT; + } + } Otherwise the doc in patch #33 also mentions this. :) > User space can then validate what to do with the violation and if > necessary inject an intercept. I do agree that secure intercept injection should be moved into to user-space, and happen as a reaction to a user-space memory fault exit. I was unable to do so yet, since the intercepts require a level of introspection that is not yet available to QEMU. For example, providing the length of the instruction that caused the fault. I'll work on exposing the necessary information to user-space and move the whole intercept concept there. Nicolas [1] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonz...@redhat.com/.
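To make the intended flow concrete, here is a very rough userspace sketch of the exit handling being described. The memory_fault field layout follows the guest_memfd series referenced above; every helper name and the vtl1_vcpu handle are hypothetical, and the eventual injection mechanism (e.g. a SynIC message built by the VMM) is exactly what is still being worked out in this thread.

#include <stdint.h>
#include <linux/kvm.h>	/* struct kvm_run, KVM_EXIT_MEMORY_FAULT */

static void handle_memory_fault_exit(struct kvm_run *run)
{
	/* Filled in by KVM when kvm_mmu_prepare_memory_fault_exit() fires. */
	uint64_t gpa  = run->memory_fault.gpa;
	uint64_t size = run->memory_fault.size;

	if (gpa_protected_by_higher_vtl(gpa))	/* VMM-side bookkeeping */
		inject_secure_intercept(vtl1_vcpu, gpa, size);	/* e.g. via a SynIC message */
	else
		handle_plain_memory_fault(gpa, size);
}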
Re: [RFC 32/33] KVM: x86: hyper-v: Implement HVCALL_TRANSLATE_VIRTUAL_ADDRESS
On Wed Nov 8, 2023 at 12:49 PM UTC, Alexander Graf wrote: > > On 08.11.23 12:18, Nicolas Saenz Julienne wrote: > > Introduce HVCALL_TRANSLATE_VIRTUAL_ADDRESS, the hypercall receives a > > GVA, generally from a less privileged VTL, and returns the GPA backing > > it. The GVA -> GPA conversion is done by walking the target VTL's vCPU > > MMU. > > > > NOTE: The hypercall implementation is incomplete and only shared for > > completion. Additionally we'd like to move the VTL aware parts to > > user-space. > > > Yes, please :). We should handle the complete hypercall in user space if > possible. If you're afraid that gva -> gpa conversion may run out of > sync between a user space and the kvm implementations, let's introduce > an ioctl that allows you to perform that conversion. I'll look into introducing a generic API that performs MMU walks. The devil is in the details though, the hypercall introduces flags like: • HV_TRANSLATE_GVA_TLB_FLUSH_INHIBIT: Indicates that the TlbFlushInhibit flag in the virtual processor’s HvRegisterInterceptSuspend register should be set as a consequence of a successful return. This prevents other virtual processors associated with the target partition from flushing the stage 1 TLB of the specified virtual processor until after the TlbFlushInhibit flag is cleared. Which make things trickier. Nicolas
Re: [RFC 09/33] KVM: x86: hyper-v: Introduce per-VTL vcpu helpers
On Wed Nov 8, 2023 at 12:21 PM UTC, Alexander Graf wrote: > > On 08.11.23 12:17, Nicolas Saenz Julienne wrote: > > Introduce two helper functions. The first one queries a vCPU's VTL > > level, the second one, given a struct kvm_vcpu and VTL pair, returns the > > corresponding 'sibling' struct kvm_vcpu at the right VTL. > > > > We keep track of each VTL's state by having a distinct struct kvm_vpcu > > for each level. VTL-vCPUs that belong to the same guest CPU share the > > same physical APIC id, but belong to different APIC groups where the > > apic group represents the vCPU's VTL. > > > > Signed-off-by: Nicolas Saenz Julienne > > --- > > arch/x86/kvm/hyperv.h | 18 ++ > > 1 file changed, 18 insertions(+) > > > > diff --git a/arch/x86/kvm/hyperv.h b/arch/x86/kvm/hyperv.h > > index 2bfed69ba0db..5433107e7cc8 100644 > > --- a/arch/x86/kvm/hyperv.h > > +++ b/arch/x86/kvm/hyperv.h > > @@ -23,6 +23,7 @@ > > > > #include > > #include "x86.h" > > +#include "lapic.h" > > > > /* "Hv#1" signature */ > > #define HYPERV_CPUID_SIGNATURE_EAX 0x31237648 > > @@ -83,6 +84,23 @@ static inline struct kvm_hv_syndbg *to_hv_syndbg(struct > > kvm_vcpu *vcpu) > > return &vcpu->kvm->arch.hyperv.hv_syndbg; > > } > > > > +static inline struct kvm_vcpu *kvm_hv_get_vtl_vcpu(struct kvm_vcpu *vcpu, > > int vtl) > > +{ > > + struct kvm *kvm = vcpu->kvm; > > + u32 target_id = kvm_apic_id(vcpu); > > + > > + kvm_apic_id_set_group(kvm, vtl, &target_id); > > + if (vcpu->vcpu_id == target_id) > > + return vcpu; > > + > > + return kvm_get_vcpu_by_id(kvm, target_id); > > +} > > + > > +static inline u8 kvm_hv_get_active_vtl(struct kvm_vcpu *vcpu) > > +{ > > + return kvm_apic_group(vcpu); > > Shouldn't this check whether VTL is active? If someone wants to use APIC > groups for a different purpose in the future, they'd suddenly find > themselves in VTL code paths in other code (such as memory protections), no? Yes, indeed. This is solved by adding a couple of checks vs kvm_hv_vsm_enabled(). I don't have another use-case in mind for APIC ID groups so it's hard to picture if I'm just over engineering things, but I wonder it we need to introduce some sort of protection vs concurrent usages. For example we could introduce masks within the group bits and have consumers explicitly request what they want. Something like: vtl = kvm_apic_group(vcpu, HV_VTL); If user-space didn't reserve bits within the APIC ID group area and marked them with HV_VTL you'd get an error as opposed to 0 which is otherwise a valid group. Nicolas
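Purely as a sketch of that "typed group bits" idea (every name below is hypothetical and nothing like it exists in the series): userspace reserves part of the group bit-space for a given use, and consumers must name that use explicitly instead of getting a silently valid group 0 back.

#define APIC_GROUP_USE_HV_VTL	BIT(0)	/* tag reserved by userspace */

static inline int kvm_apic_group_typed(struct kvm_vcpu *vcpu, u32 use, u8 *group)
{
	struct kvm_arch *arch = &vcpu->kvm->arch;

	/* Error out if userspace never reserved the group bits for this use. */
	if (!(arch->apic_id_group_uses & use))
		return -EINVAL;

	*group = kvm_apic_group(vcpu);
	return 0;
}

/* e.g. in VSM code: */
u8 vtl;

if (kvm_apic_group_typed(vcpu, APIC_GROUP_USE_HV_VTL, &vtl))
	return -EINVAL;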
Re: [RFC 05/33] KVM: x86: hyper-v: Introduce VTL call/return prologues in hypercall page
On Wed Nov 8, 2023 at 11:53 AM UTC, Alexander Graf wrote: [...] > > @@ -285,6 +286,81 @@ static int patch_hypercall_page(struct kvm_vcpu *vcpu, > > u64 data) > > /* ret */ > > ((unsigned char *)instructions)[i++] = 0xc3; > > > > + /* VTL call/return entries */ > > + if (!kvm_xen_hypercall_enabled(kvm) && kvm_hv_vsm_enabled(kvm)) { > > > You don't introduce kvm_hv_vsm_enabled() before. Please do a quick test > build of all individual commits of your patch set for v1 :). Yes, sorry for that. This happens for a couple of helpers, I'll fix it. > Why do you need the ifdef here? is_long_mode() already has an ifdef that > will always return false for is_64_bit_mode() on 32bit hosts. Noted, will remove. > > + if (is_64_bit_mode(vcpu)) { > > + /* > > +* VTL call 64-bit entry prologue: > > +* mov %rcx, %rax > > +* mov $0x11, %ecx > > +* jmp 0: > > +*/ > > + hv->vsm_code_page_offsets.vtl_call_offset = i; > > + instructions[i++] = 0x48; > > + instructions[i++] = 0x89; > > + instructions[i++] = 0xc8; > > + instructions[i++] = 0xb9; > > + instructions[i++] = 0x11; > > + instructions[i++] = 0x00; > > + instructions[i++] = 0x00; > > + instructions[i++] = 0x00; > > + instructions[i++] = 0xeb; > > + instructions[i++] = 0xe0; > > > I think it would be a lot easier to read (because it's denser) if you > move the opcodes into a character array: > > char vtl_entry[] = { 0x48, 0x89, 0xc8, 0xb9, 0x11, 0x00, 0x00, 0x00. > 0xeb, 0xe0 }; > > and then just memcpy(). Works for me, I'll rework it. Nicolas
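For reference, a rough sketch of what that rework could look like, using only the bytes and field names quoted above (the '.' after the third 0x00 in the suggestion is presumably a typo for ','); the exact jmp displacement of course depends on the surrounding hypercall page layout:

static const u8 vtl_call_prologue[] = {
	0x48, 0x89, 0xc8,			/* mov %rcx, %rax  */
	0xb9, 0x11, 0x00, 0x00, 0x00,		/* mov $0x11, %ecx */
	0xeb, 0xe0,				/* jmp 0: (back to the hypercall sequence) */
};

hv->vsm_code_page_offsets.vtl_call_offset = i;
memcpy(&instructions[i], vtl_call_prologue, sizeof(vtl_call_prologue));
i += sizeof(vtl_call_prologue);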
Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
On Wed Nov 8, 2023 at 11:40 AM UTC, Alexander Graf wrote: > Hey Nicolas, [...] > > The series is accompanied by two repositories: > > - A PoC QEMU implementation of VSM [3]. > > - VSM kvm-unit-tests [4]. > > > > Note that this isn't a full VSM implementation. For now it only supports > > 2 VTLs, and only runs on uniprocessor guests. It is capable of booting > > Windows Sever 2016/2019, but is unstable during runtime. > > How much of these limitations are inherent in the current set of > patches? What is missing to go beyond 2 VTLs and into SMP land? Anything > that will require API changes?

The main KVM concepts introduced by this series are ready to deal with any number of VTLs (APIC ID groups, VTL KVM device). KVM_HV_GET_VSM_STATE should provide a copy of 'vsm_code_page_offsets' per-VTL, since the hypercall page is partition-wide but per-VTL. Attaching that information as a VTL KVM device attribute fits that requirement nicely; I'd prefer going that way, especially if the VTL KVM device has a decent reception. Also, the secure memory intercepts and HVCALL_TRANSLATE_VIRTUAL_ADDRESS take some VTL-related shortcuts, but those are going away. Otherwise, I don't see any necessary in-kernel changes. When virtualizing Windows with VSM I've never seen usages that go beyond VTL1, so enabling VTL > 1 will be mostly a kvm-unit-tests effort.

As for SMP, it's just a matter of work. Notably, HvStartVirtualProcessor and HvGetVpIndexFromApicId need to be implemented, and the QEMU VTL scheduling code needs to hold up.

Nicolas
Re: [RFC PATCH 0/2] Enhancing Boot Speed and Security with Delayed Module Signature Verification
On 9/14/23 07:27, Alessandro Carminati (Red Hat) wrote: This patch sets up a new feature to the Linux kernel to have the ability, while module signature checking is enabled, to delay the moment where these signatures are effectively checked. The feature is structure into two main key points, the feature can be enabled by a new command line kernel argument, while in delay mode, the kernel waits until the userspace communicates to start checking signature modules. This operation can be done by writing a value in a securityfs file, which works the same as /sys/kernel/security/lockdown. Patch 1/2: Modules: Introduce boot-time module signature flexibility The first patch in this set fundamentally alters the kernel's behavior at boot time by implementing a delayed module signature verification mechanism. It introduces a new boot-time kernel argument that allows users to request this delay. By doing so, we aim to capitalize on the cryptographic checks already performed on the kernel and initrd images during the secure boot process. As a result, we can significantly improve the boot speed without compromising system security. Patch 2/2: docs: Update kernel-parameters.txt for signature verification enhancement The second patch is just to update the kernel parameters list documentation. Background and Motivation In certain contexts, boot speed becomes crucial. This patch follows the recognition that security checks can at times be redundant. Therefore, it proves valuable to skip those checks that have already been validated. In a typical Secure Boot startup with an initrd, the bootloader is responsible for verifying artifacts before relinquishing control. In a verified initrd image, it is reasonable to assume that its content is also secure. Consequently, verifying module signatures may be deemed unnecessary. This patch introduces a feature to skip signature verification during the initrd boot phase. I think this is fine to do. There is some risk for users who may use this without realizing what they're actually doing and then would end up creating a security hole. But there are far worse ways you can do that with access to kernel paramaters. P. Alessandro Carminati (Red Hat) (2): Modules: Introduce boot-time module signature flexibility docs: Update kernel-parameters.txt for signature verification enhancement .../admin-guide/kernel-parameters.txt | 9 +++ include/linux/module.h| 4 ++ kernel/module/main.c | 14 +++-- kernel/module/signing.c | 56 +++ 4 files changed, 77 insertions(+), 6 deletions(-)
Re: [RFC PATCH 2/2] docs: Update kernel-parameters.txt for signature verification enhancement
On 9/14/23 07:27, Alessandro Carminati (Red Hat) wrote: Update kernel-parameters.txt to reflect new deferred signature verification. Enhances boot speed by allowing unsigned modules in initrd after bootloader check. Signed-off-by: Alessandro Carminati (Red Hat) --- Documentation/admin-guide/kernel-parameters.txt | 9 + 1 file changed, 9 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0c38a8af95ce..beec86f0dd05 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3410,6 +3410,15 @@ Note that if CONFIG_MODULE_SIG_FORCE is set, that is always true, so this option does nothing. + module_sig_check_wait= + This parameter enables delayed activation of module + signature checks, deferring the process until userspace + triggers it. Once activated, this setting becomes + permanent and cannot be reversed. This feature proves + valuable for incorporating unsigned modules within + initrd, especially after bootloader verification. + By employing this option, boot times can be quicker. + Please keep these in alphabetical order. Would making the kernel-parameters.txt warning a little bit more informative be a good thing? This should only be used in environments where some other signature verification method is employed. Also, for future reference, it would be good to have hard numbers to show the boot time improvement in the changelog. P. module_blacklist= [KNL] Do not load a comma-separated list of modules. Useful for debugging problem modules.
Re: [RFC 01/33] KVM: x86: Decouple lapic.h from hyperv.h
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > lapic.h has no dependencies with hyperv.h, so don't include it there. > > Additionally, cpuid.c implicitly relied on hyperv.h's inclusion through > lapic.h, so include it explicitly there. > > Signed-off-by: Nicolas Saenz Julienne > --- FWIW, feel free to post patches like this without the full context, I'm more than happy to take patches that resolve header inclusion issues even if the issue(s) only become visible with additional changes. I'll earmark this one for 6.8.
Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
On Wed, Nov 08, 2023, Alexander Graf wrote: > > On 08.11.23 12:18, Nicolas Saenz Julienne wrote: > > Save the length of the instruction that triggered an EPT violation in > > struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory > > intercept messages. > > > > Signed-off-by: Nicolas Saenz Julienne > > > In v1, please do this for SVM as well :) Why? KVM caches values on VMX because VMREAD is measurably slower than memory accesses, especially when running nested. SVM has no such problems. I wouldn't be surprised if adding a "cache" is actually less performant due to increased pressure and misses on the hardware cache.
Re: [RFC 25/33] KVM: Introduce a set of new memory attributes
On Wed, Nov 08, 2023, Alexander Graf wrote: > > On 08.11.23 12:17, Nicolas Saenz Julienne wrote: > > Introduce the following memory attributes: > > - KVM_MEMORY_ATTRIBUTE_READ > > - KVM_MEMORY_ATTRIBUTE_WRITE > > - KVM_MEMORY_ATTRIBUTE_EXECUTE > > - KVM_MEMORY_ATTRIBUTE_NO_ACCESS > > > > Note that NO_ACCESS is necessary in order to make a distinction between > > the lack of attributes for a gfn, which defaults to the memory > > protections of the backing memory, versus explicitly prohibiting any > > access to that gfn. > > > If we negate the attributes (no read, no write, no execute), we can keep 0 > == default and 0b111 becomes "no access". Yes, I suggested this in the initial discussion[*]. I think it makes sense to have the uAPI flags have positive polarity, i.e. as above, but internally we can invert things so that the default 000b gives full RWX protections. Or we could make the push for a range-based xarray implementation so that storing 111b for all gfns is super cheap. Regardless of how KVM stores the information internally, there's no need for a NO_ACCESS flag in the uAPI. [*] https://lore.kernel.org/all/zgfuqblao+ci9...@google.com
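A toy illustration of that polarity scheme, for readers skimming the thread: the uAPI keeps positive RWX flags (the attribute names are the ones proposed in this RFC; their actual bit values come from the patch), while the helper below is only a sketch of the "store deny bits internally" suggestion, not code from the series.

static unsigned long kvm_attrs_to_internal(unsigned long uapi_attrs)
{
	const unsigned long rwx = KVM_MEMORY_ATTRIBUTE_READ |
				  KVM_MEMORY_ATTRIBUTE_WRITE |
				  KVM_MEMORY_ATTRIBUTE_EXECUTE;

	/*
	 * Store the protections that were *not* granted: 0 keeps its meaning
	 * of "no entry / default", and 0b111 naturally means "no access"
	 * without a dedicated uAPI flag.
	 */
	return rwx & ~uapi_attrs;
}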
Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > This RFC series introduces the necessary infrastructure to emulate VSM > enabled guests. It is a snapshot of the progress we made so far, and its > main goal is to gather design feedback. Heh, then please provide an overview of the design, and ideally context and/or justification for various design decisions. It doesn't need to be a proper design doc, and you can certainly point at other documentation for explaining VSM/VTLs, but a few paragraphs and/or verbose bullet points would go a long way. The documentation in patch 33 provides an explanation of VSM itself, and a little insight into how userspace can utilize the KVM implementation. But the documentation provides no explanation of the mechanics that KVM *developers* care about, e.g. the use of memory attributes, how memory attributes are enforced, whether or not an in-kernel local APIC is required, etc. Nor does the documentation explain *why*, e.g. why store a separate set of memory attributes per VTL "device", which by the by is broken and unnecessary. > Specifically on the KVM APIs we introduce. For a high level design overview, > see the documentation in patch 33. > > Additionally, this topic will be discussed as part of the KVM > Micro-conference, in this year's Linux Plumbers Conference [2]. > > The series is accompanied by two repositories: > - A PoC QEMU implementation of VSM [3]. > - VSM kvm-unit-tests [4]. > > Note that this isn't a full VSM implementation. For now it only supports > 2 VTLs, and only runs on uniprocessor guests. It is capable of booting > Windows Sever 2016/2019, but is unstable during runtime. > > The series is based on the v6.6 kernel release, and depends on the > introduction of KVM memory attributes, which is being worked on > independently in "KVM: guest_memfd() and per-page attributes" [5]. This doesn't actually apply on 6.6 with v14 of guest_memfd, because v14 of guest_memfd is based on kvm-6.7-1. Ah, and looking at your github repo, this isn't based on v14 at all, it's based on v12. That's totally fine, but the cover letter needs to explicitly, clearly, and *accurately* state the dependencies. I can obviously grab the full branch from github, but that's not foolproof, e.g. if you accidentally delete or force push to that branch. And I also prefer to know that what I'm replying to on list is the exact same code that I am looking at. > A full Linux tree is also made available [6]. > > Series rundown: > - Patch 2 introduces the concept of APIC ID groups. > - Patches 3-12 introduce the VSM capability and basic VTL awareness into >Hyper-V emulation. > - Patch 13 introduces vCPU polling support. > - Patches 14-31 use KVM's memory attributes to implement VTL memory >protections. Introduces the VTL KMV device and secure memory >intercepts. > - Patch 32 is a temporary implementation of >HVCALL_TRANSLATE_VIRTUAL_ADDRESS necessary to boot Windows 2019. > - Patch 33 introduces documentation. > > Our intention is to integrate feedback gathered in the RFC and LPC while > we finish the VSM implementation. In the future, we will split the series > into distinct feature patch sets and upstream these independently. 
> > Thanks, > Nicolas > > [1] > https://raw.githubusercontent.com/Microsoft/Virtualization-Documentation/master/tlfs/Hypervisor%20Top%20Level%20Functional%20Specification%20v6.0b.pdf > [2] https://lpc.events/event/17/sessions/166/#20231114 > [3] https://github.com/vianpl/qemu/tree/vsm-rfc-v1 > [4] https://github.com/vianpl/kvm-unit-tests/tree/vsm-rfc-v1 > [5] https://lore.kernel.org/lkml/20231105163040.14904-1-pbonz...@redhat.com/. > [6] Full tree: https://github.com/vianpl/linux/tree/vsm-rfc-v1. When providing github links, my preference is to format the pointers as: <repo> <branch>, or <repo> tags/<tag>, e.g. https://github.com/vianpl/linux vsm-rfc-v1, so that readers can copy+paste the full thing directly into `git fetch`. It's a minor thing, but AFAIK no one actually does review by clicking through github's webview. > There are also two small dependencies with > https://marc.info/?l=kvm&m=167887543028109&w=2 and > https://lkml.org/lkml/2023/10/17/972 Please use lore links; there's zero reason to use anything else these days. For those of us that use b4, lore links make life much easier.
Re: [RFC 18/33] KVM: x86: Decouple kvm_get_memory_attributes() from struct kvm's mem_attr_array
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 631fd532c97a..4242588e3dfb 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -2385,9 +2385,10 @@ static inline void > kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu, > } > > #ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES > -static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t > gfn) > +static inline unsigned long > +kvm_get_memory_attributes(struct xarray *mem_attr_array, gfn_t gfn) Do not wrap before the function name. Linus has a nice explanation/rant on this[*]. [*] https://lore.kernel.org/all/CAHk-=wjoLAYG446ZNHfg=ghjsy6nfmub_wa8fyd5ilbnxjo...@mail.gmail.com
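For anyone skimming, the requested formatting for the hunk above would look roughly like this (declaration only, continuation aligned under the opening parenthesis):

static inline unsigned long kvm_get_memory_attributes(struct xarray *mem_attr_array,
						       gfn_t gfn)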
Re: [RFC 21/33] KVM: Pass memory attribute array as a MMU notifier argument
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > Pass the memory attribute array through struct kvm_mmu_notifier_arg and > use it in kvm_arch_post_set_memory_attributes() instead of defaulting on > kvm->mem_attr_array. > > Signed-off-by: Nicolas Saenz Julienne > --- > arch/x86/kvm/mmu/mmu.c | 8 > include/linux/kvm_host.h | 5 - > virt/kvm/kvm_main.c | 1 + > 3 files changed, 9 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index c0fd3afd6be5..c2bec2be2ba9 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -7311,6 +7311,7 @@ static bool hugepage_has_attrs(struct xarray > *mem_attr_array, > bool kvm_arch_post_set_memory_attributes(struct kvm *kvm, >struct kvm_gfn_range *range) > { > + struct xarray *mem_attr_array = range->arg.mem_attr_array; > unsigned long attrs = range->arg.attributes; > struct kvm_memory_slot *slot = range->slot; > int level; > @@ -7344,8 +7345,8 @@ bool kvm_arch_post_set_memory_attributes(struct kvm > *kvm, >* misaligned address regardless of memory attributes. >*/ > if (gfn >= slot->base_gfn) { > - if (hugepage_has_attrs(&kvm->mem_attr_array, > -slot, gfn, level, attrs)) > + if (hugepage_has_attrs(mem_attr_array, slot, > +gfn, level, attrs)) This is wildly broken. The hugepage tracking is per VM, whereas the attributes here are per-VTL. I.e. KVM will (dis)allow hugepages based on whatever VTL last changed its protections.
Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
On 08.11.23 17:15, Sean Christopherson wrote: On Wed, Nov 08, 2023, Alexander Graf wrote: On 08.11.23 12:18, Nicolas Saenz Julienne wrote: Save the length of the instruction that triggered an EPT violation in struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory intercept messages. Signed-off-by: Nicolas Saenz Julienne In v1, please do this for SVM as well :) Why? KVM caches values on VMX because VMREAD is measurably slower than memory accesses, especially when running nested. SVM has no such problems. I wouldn't be surprised if adding a "cache" is actually less performant due to increased pressure and misses on the hardware cache.

My understanding was that this patch wasn't about caching it; it was about storing it somewhere generically so we can use it for the fault injection code path in the following patch. And if we don't set this variable for SVM, it just means Credential Guard fault injection would be broken there.

Alex
Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > Save the length of the instruction that triggered an EPT violation in > struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory > intercept messages. This is silly and unnecessarily obfuscates *why* (as my response regarding SVM shows), i.e. that this is "needed" because the value is consumed by a *different* vCPU, not because of performance concerns. It's also broken: AFAICT nothing prevents the intercepted vCPU from hitting a different EPT violation before the target vCPU consumes exit_instruction_len. Holy cow. All of deliver_gpa_intercept() is wildly unsafe. Aside from race conditions, which in and of themselves are a non-starter, nothing guarantees that the intercepted vCPU actually cached all of the information that is held in its VMCS. The sane way to do this is to snapshot *all* information on the intercepted vCPU, and then hand that off as a payload to the target vCPU. That is, assuming the cross-vCPU stuff is actually necessary. At a glance, I don't see anything that explains *why*.
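For the sake of discussion, such a payload could look roughly like the structure below. It is entirely hypothetical (nothing like it exists in the series or in KVM today); the point is only that everything the intercept message needs is captured on the intercepted vCPU at exit time and handed over, so the target vCPU never has to read another vCPU's possibly stale or already-reused VMCS state.

struct hv_vsm_intercept_payload {
	__u64	gpa;
	__u64	gva;
	__u8	access_type;		/* read/write/execute */
	__u8	instruction_len;
	__u8	instruction_bytes[16];
	/* plus whatever register/segment state the SynIC message format needs */
};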
Re: [RFC 14/33] KVM: x86: Add VTL to the MMU role
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > With the upcoming introduction of per-VTL memory protections, make MMU > roles VTL aware. This will avoid sharing PTEs between vCPUs that belong > to different VTLs, and that have distinct memory access restrictions. > > Four bits are allocated to store the VTL number in the MMU role, since > the TLFS states there is a maximum of 16 levels. How many does KVM actually allow/support? Multiplying the number of possible roots by 16x is a *major* change.
Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
On 08.11.23 18:20, Sean Christopherson wrote: On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: Save the length of the instruction that triggered an EPT violation in struct kvm_vcpu_arch. This will be used to populate Hyper-V VSM memory intercept messages. This is silly and unnecessarily obfuscates *why* (as my response regarding SVM shows), i.e. that this is "needed" because the value is consumed by a *different* vCPU, not because of performance concerns. It's also broken: AFAICT nothing prevents the intercepted vCPU from hitting a different EPT violation before the target vCPU consumes exit_instruction_len. Holy cow. All of deliver_gpa_intercept() is wildly unsafe. Aside from race conditions, which in and of themselves are a non-starter, nothing guarantees that the intercepted vCPU actually cached all of the information that is held in its VMCS. The sane way to do this is to snapshot *all* information on the intercepted vCPU, and then hand that off as a payload to the target vCPU. That is, assuming the cross-vCPU stuff is actually necessary. At a glance, I don't see anything that explains *why*.

Yup, I believe you repeated the comment I had on the function - and Nicolas already agreed :). This should go through user space, which automatically means you need to bubble up all necessary trap data to user space on the faulting vCPU and then inject the full set of data into the receiving one. My point with the comment on this patch was "Don't break AMD (or ancient VMX without instruction length decoding [Does that exist? I know SVM has old CPUs that don't do it]) please".

Alex
Re: [RFC 02/33] KVM: x86: Introduce KVM_CAP_APIC_ID_GROUPS
On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > From: Anel Orazgaliyeva > > Introduce KVM_CAP_APIC_ID_GROUPS, this capability segments the VM's APIC > ids into two. The lower bits, the physical APIC id, represent the part > that's exposed to the guest. The higher bits, which are private to KVM, > groups APICs together. APICs in different groups are isolated from each > other, and IPIs can only be directed at APICs that share the same group > as its source. Furthermore, groups are only relevant to IPIs, anything > incoming from outside the local APIC complex: from the IOAPIC, MSIs, or > PV-IPIs is targeted at the default APIC group, group 0. > > When routing IPIs with physical destinations, KVM will OR the source's > vCPU APIC group with the ICR's destination ID and use that to resolve > the target lAPIC. Is all of the above arbitrary KVM behavior or defined by the TLFS? > The APIC physical map is also made group aware in > order to speed up this process. For the sake of simplicity, the logical > map is not built while KVM_CAP_APIC_ID_GROUPS is in use and we defer IPI > routing to the slower per-vCPU scan method. Why? I mean, I kinda sorta understand what it does for VSM, but it's not at all obvious why this information needs to be shoved into the APIC IDs. E.g. why not have an explicit group_id and then maintain separate optimization maps for each? > This capability serves as a building block to implement virtualisation > based security features like Hyper-V's Virtual Secure Mode (VSM). VSM > introduces a para-virtualised switch that allows for guest CPUs to jump > into a different execution context, this switches into a different CPU > state, lAPIC state, and memory protections. We model this in KVM by Who is "we"? As a general rule, avoid pronouns. "we" and "us" in particular should never show up in a changelog. I genuinely don't know if "we" means userspace or KVM, and the distinction matters because it clarifies whether or not KVM is actively involved in the modeling versus KVM being little more than a dumb pipe to provide the plumbing. > using distinct kvm_vcpus for each context. > > Moreover, execution contexts are hierarchical and its APICs are meant to > remain functional even when the context isn't 'scheduled in'. Please explain the relationship and rules of execution contexts. E.g. are execution contexts the same thing as VTLs? Do all "real" vCPUs belong to every execution context? If so, is that a requirement? > For example, we have to keep track of > timers' expirations, and interrupt execution of lesser priority contexts > when relevant. Hence the need to alias physical APIC ids, while keeping > the ability to target specific execution contexts. 
> > Signed-off-by: Anel Orazgaliyeva > Co-developed-by: Nicolas Saenz Julienne > Signed-off-by: Nicolas Saenz Julienne > --- > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h > index e1021517cf04..542bd208e52b 100644 > --- a/arch/x86/kvm/lapic.h > +++ b/arch/x86/kvm/lapic.h > @@ -97,6 +97,8 @@ void kvm_lapic_set_tpr(struct kvm_vcpu *vcpu, unsigned long > cr8); > void kvm_lapic_set_eoi(struct kvm_vcpu *vcpu); > void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value); > u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu); > +int kvm_vm_ioctl_set_apic_id_groups(struct kvm *kvm, > + struct kvm_apic_id_groups *groups); > void kvm_recalculate_apic_map(struct kvm *kvm); > void kvm_apic_set_version(struct kvm_vcpu *vcpu); > void kvm_apic_after_set_mcg_cap(struct kvm_vcpu *vcpu); > @@ -277,4 +279,35 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic) > return kvm_lapic_get_reg(apic, APIC_ID) >> 24; > } > > +static inline u32 kvm_apic_id(struct kvm_vcpu *vcpu) > +{ > + return vcpu->vcpu_id & ~vcpu->kvm->arch.apic_id_group_mask; This is *extremely* misleading. KVM forces the x2APIC ID to match vcpu_id, but in xAPIC mode the ID is fully writable.
Re: [RFC 29/33] KVM: VMX: Save instruction length on EPT violation
On Wed, Nov 8, 2023 at 9:27 AM Alexander Graf wrote: > My point with the comment on this patch was "Don't break AMD (or ancient > VMX without instruction length decoding [Does that exist? I know SVM has > old CPUs that don't do it]) please". VM-exit instruction length is not defined for all VM-exit reasons (EPT misconfiguration is one that is notably absent), but the field has been there since Prescott.
Re: [RFC 0/33] KVM: x86: hyperv: Introduce VSM support
On Wed, Nov 08, 2023, Sean Christopherson wrote: > On Wed, Nov 08, 2023, Nicolas Saenz Julienne wrote: > > This RFC series introduces the necessary infrastructure to emulate VSM > > enabled guests. It is a snapshot of the progress we made so far, and its > > main goal is to gather design feedback. > > Heh, then please provide an overview of the design, and ideally context and/or > justification for various design decisions. It doesn't need to be a proper > design > doc, and you can certainly point at other documentation for explaining > VSM/VTLs, > but a few paragraphs and/or verbose bullet points would go a long way. > > The documentation in patch 33 provides an explanation of VSM itself, and a > little > insight into how userspace can utilize the KVM implementation. But the > documentation > provides no explanation of the mechanics that KVM *developers* care about, > e.g. > the use of memory attributes, how memory attributes are enforced, whether or > not > an in-kernel local APIC is required, etc. > > Nor does the documentation explain *why*, e.g. why store a separate set of > memory > attributes per VTL "device", which by the by is broken and unnecessary.

After speed-reading the series... An overview of the design, why you made certain choices, and the tradeoffs between various options is definitely needed. A few questions off the top of my head:

- What is the split between userspace and KVM? How did you arrive at that split?

- How much *needs* to be in KVM? I.e. how much can be pushed to userspace while maintaining good performance?

- Why not make VTLs a first-party concept in KVM? E.g. rather than bury info in a VTL device and APIC ID groups, why not modify "struct kvm" to support replicating state that needs to be tracked per-VTL? Because of how memory attributes affect hugepages, duplicating *memslots* might actually be easier than teaching memslots to be VTL-aware.

- Is "struct kvm_vcpu" the best representation of an execution context (if I'm getting the terminology right)? E.g. if 90% of the state is guaranteed to be identical for a given vCPU across execution contexts, then modeling that with separate kvm_vcpu structures is very inefficient. I highly doubt it's 90%, but it might be quite high depending on how much the TLFS restricts the state of the vCPU, e.g. if it's 64-bit only.

The more info you can provide before LPC, the better, e.g. so that we can spend time discussing options instead of you getting peppered with questions about the requirements and whatnot.