(see below)

On Mon, Aug 15, 2022 at 03:06:05PM -0400, Dave Voutila wrote:
>
> Mike Larkin <mlar...@nested.page> writes:
>
> > On Fri, Jul 15, 2022 at 12:27:04PM -0400, Dave Voutila wrote:
> >> The following diff adds formalization around mmio assists for nested
> >> page/ept faults on Intel and AMD vmm(4) hosts. It provides what little
> >> information is available to userland in terms of either the instruction
> >> bytes (on AMD) or the instruction length (on Intel).
> >>
> >> vmd is updated to intercept these vm exit events, but will currently log
> >> the issue and cause the vm process to exit. This is the same behavior
> >> folks experience currently when a guest attempts to read/write guest
> >> physical addresses in the mmio-reserved ranges, but now with an
> >> explanation of the reason and current {e,r,}ip value.
> >>
> >> This is the foundation I'll build upon while implementing instruction
> >> decode and emulation support in userland. No noticeable change should
> >> occur for existing guests that don't trigger mmio assist events.
> >>
> >> ok?
> >>
> >
> > See below.
> >
> > -ml
> >
> >> -dv
> >>
> >>
> >> diff refs/heads/master refs/heads/mmio
> >> commit - 10e026163f31687dba11fb4655500afb4e616258
> >> commit + 7d92e26b51c3fd520807dbcd5233f14b76bc611e
> >> blob - 84da19438b74377276b16b4b4f7db45ae9ec6be2
> >> blob + a89a4dc0fbe7b31a47390de363109092b76ffa22
> >> --- sys/arch/amd64/amd64/vmm.c
> >> +++ sys/arch/amd64/amd64/vmm.c
> >> @@ -4891,11 +4891,20 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
> >>                      vcpu->vc_gueststate.vg_rax =
> >>                          vcpu->vc_exit.vei.vei_data;
> >>                  break;
> >> +            case VMX_EXIT_EPT_VIOLATION:
> >> +                ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_GPRS, 0,
> >> +                    &vcpu->vc_exit.vrs);
> >> +                if (ret) {
> >> +                    printf("%s: vm %d vcpu %d failed to update "
> >> +                        "registers\n", __func__,
> >> +                        vcpu->vc_parent->vm_id, vcpu->vc_id);
> >> +                    return (EINVAL);
> >> +                }
> >> +                break;
> >>              case VM_EXIT_NONE:
> >>              case VMX_EXIT_HLT:
> >>              case VMX_EXIT_INT_WINDOW:
> >>              case VMX_EXIT_EXTINT:
> >> -            case VMX_EXIT_EPT_VIOLATION:
> >>              case VMX_EXIT_CPUID:
> >>              case VMX_EXIT_XSETBV:
> >>                  break;
> >> @@ -4927,6 +4936,7 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
> >>                  break;
> >> #endif /* VMM_DEBUG */
> >>              }
> >> +            memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
> >>          }
> >>
> >>          setregion(&gdt, ci->ci_gdt, GDT_SIZE - 1);
> >> @@ -5658,7 +5668,7 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
> >>
> >>      if (gpa >= VMM_PCI_MMIO_BAR_BASE && gpa <= VMM_PCI_MMIO_BAR_END) {
> >>          DPRINTF("guest mmio access @ 0x%llx\n", (uint64_t)gpa);
> >> -        return (VMM_MEM_TYPE_REGULAR);
> >> +        return (VMM_MEM_TYPE_MMIO);
> >>      }
> >>
> >>      /* XXX Use binary search? */
> >> @@ -5782,18 +5792,28 @@ int
> >>  svm_handle_np_fault(struct vcpu *vcpu)
> >>  {
> >>      uint64_t gpa;
> >> -    int gpa_memtype, ret;
> >> +    int gpa_memtype, ret = 0;
> >>      struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
> >> +    struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
> >>
> >> -    ret = 0;
> >> +    memset(vee, 0, sizeof(*vee));
> >> +    vee->vee_fault_type = VEE_FAULT_INVALID;
> >>
> >>      gpa = vmcb->v_exitinfo2;
> >>
> >>      gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
> >>      switch (gpa_memtype) {
> >>      case VMM_MEM_TYPE_REGULAR:
> >> +        vee->vee_fault_type = VEE_FAULT_HANDLED;
> >>          ret = svm_fault_page(vcpu, gpa);
> >>          break;
> >> +    case VMM_MEM_TYPE_MMIO:
> >> +        vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
> >> +        vee->vee_insn_len = vmcb->v_n_bytes_fetched;
> >
> > If you are going to depend on decode assists, we should probably validate
> > that the CPU has them:
> >
> > CPUID Fn8000_000A_EDX[7]:
> > DecodeAssists Decode assists. Indicates support for the decode assists.
> > See “Decode Assists.”
> >
> > It does look like we sorta assume that we have some of these features (like
> > support for cleanbits, and even NP itself), so there are probably a few
> > checks we should do in attach. We have something similar in vmx but it's
> > done later on vm create IIRC because at the time we wanted to run on
> > a variety of host CPUs with different capabilities. I don't think that
> > is important anymore, so maybe a short-circuit check for some of these
> > required features in vmmattach for both svm and vmx is warranted now.
> >
> > FWIW I've never seen an svm CPU without these assists (but I bet they exist).
> >
>
> Added in checks for AMD. I can't find anything definitive for Intel and
> the SDM seems cryptic. FWIW the one bit I found in the
> IA32_VMX_EPT_VPID_CAP msr (bit 21) doesn't seem to be checked by
> Linux/KVM.
>
> >
> > I think you will need a "vee_insn_bytes_valid" flag unless you are planning
> > on filling out vee_insn_bytes in vmm(4) before returning (in the VMX case).
> >
>
> Added a flag for whether length and/or bytes are valid. Maybe in the
> future a platform or vmm provides both.
>
> The approach I took for the SVM code path is to still relay the need for
> an assist to userland, as worst case vmd can just do extra work and it'll
> be slower to service.
>
> > Otherwise, vmd does not know (or today, care) about what type of CPU it's
> > running on and there will be no way to know that "vee_insn_len" bytes, on
> > VMX, should not be interpreted as anything to do with vee_insn_bytes.
> >
> > Was your plan to do PT walks to fetch arguments (and/or insn bytes)
> > in vmd or in vmm?
> >
>
> Plan is PT walks entirely in userland and only use what the CPU gives us
> for free (if anything) in decode assist.
>
> How's this (first hunk was some whitespace, btw)? I've tested on my AMD
> and Intel machines. It's safe for others to try and make sure they don't
> have any change in current workflow.
>
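(For reference, the capability probe being discussed boils down to roughly
the following. This is a sketch only: it uses the CPUID() macro and ci
fields the way cpu_check_vmm_cap() already does, with the
AMD_SVM_DECODE_ASSIST_CAP define from the diff below.)

    uint32_t eax, ebx, ecx, edx;

    /* CPUID Fn8000_000A: SVM feature identification leaf. */
    CPUID(0x8000000a, eax, ebx, ecx, edx);

    /* EDX[7] "DecodeAssists" per the APM quote above. */
    if (edx & AMD_SVM_DECODE_ASSIST_CAP)
        ci->ci_vmm_cap.vcc_svm.svm_decode_assist = 1;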
This is ok mlarkin, but see below.

-ml

> diff refs/heads/master refs/heads/mmio
> commit - d1de5a4a40a1f0a466a361526282e3b2cd5207fa
> commit + 3634ab1f67c960eba84f75bf0410bf2a13c2dbe7
> blob - e1748de7bc15b62cf16931a067a31948a20fd187
> blob + 0441f4093b5711d4a815b89a4263fc348c4f185b
> --- sys/arch/amd64/amd64/identcpu.c
> +++ sys/arch/amd64/amd64/identcpu.c
> @@ -335,7 +335,7 @@ cpu_hz_update_sensor(void *args)
>
>      if (mdelta > 0) {
>          val = (adelta * 1000000) / mdelta * tsc_frequency;
> -        val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ;
> +        val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ;
>          ci->ci_hz_sensor.value = val;
>      }
>
> @@ -1055,6 +1055,9 @@ cpu_check_vmm_cap(struct cpu_info *ci)
>
>          if (edx & AMD_SVM_VMCB_CLEAN_CAP)
>              ci->ci_vmm_cap.vcc_svm.svm_vmcb_clean = 1;
> +
> +        if (edx & AMD_SVM_DECODE_ASSIST_CAP)
> +            ci->ci_vmm_cap.vcc_svm.svm_decode_assist = 1;
>      }
>
>      /*
> blob - 354cfac206f43be18f3438507856005ec62363ab
> blob + 34036a56ccde7fe2c231ee6c3a86ec1017d8176d
> --- sys/arch/amd64/amd64/vmm.c
> +++ sys/arch/amd64/amd64/vmm.c
> @@ -4891,11 +4891,20 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
>                      vcpu->vc_gueststate.vg_rax =
>                          vcpu->vc_exit.vei.vei_data;
>                  break;
> +            case VMX_EXIT_EPT_VIOLATION:
> +                ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_GPRS, 0,
> +                    &vcpu->vc_exit.vrs);
> +                if (ret) {
> +                    printf("%s: vm %d vcpu %d failed to update "
> +                        "registers\n", __func__,
> +                        vcpu->vc_parent->vm_id, vcpu->vc_id);
> +                    return (EINVAL);
> +                }
> +                break;
>              case VM_EXIT_NONE:
>              case VMX_EXIT_HLT:
>              case VMX_EXIT_INT_WINDOW:
>              case VMX_EXIT_EXTINT:
> -            case VMX_EXIT_EPT_VIOLATION:
>              case VMX_EXIT_CPUID:
>              case VMX_EXIT_XSETBV:
>                  break;
> @@ -4927,6 +4936,7 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
>                  break;
> #endif /* VMM_DEBUG */
>              }
> +            memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
>          }
>
>          setregion(&gdt, ci->ci_gdt, GDT_SIZE - 1);
> @@ -5658,7 +5668,7 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
>
>      if (gpa >= VMM_PCI_MMIO_BAR_BASE && gpa <= VMM_PCI_MMIO_BAR_END) {
>          DPRINTF("guest mmio access @ 0x%llx\n", (uint64_t)gpa);
> -        return (VMM_MEM_TYPE_REGULAR);
> +        return (VMM_MEM_TYPE_MMIO);
>      }
>
>      /* XXX Use binary search? */
> @@ -5782,18 +5792,31 @@ int
>  svm_handle_np_fault(struct vcpu *vcpu)
>  {
>      uint64_t gpa;
> -    int gpa_memtype, ret;
> +    int gpa_memtype, ret = 0;
>      struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
> +    struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
> +    struct cpu_info *ci = curcpu();
>
> -    ret = 0;
> +    memset(vee, 0, sizeof(*vee));
>
>      gpa = vmcb->v_exitinfo2;
>
>      gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
>      switch (gpa_memtype) {
>      case VMM_MEM_TYPE_REGULAR:
> +        vee->vee_fault_type = VEE_FAULT_HANDLED;
>          ret = svm_fault_page(vcpu, gpa);
>          break;
> +    case VMM_MEM_TYPE_MMIO:
> +        vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
> +        if (ci->ci_vmm_cap.vcc_svm.svm_decode_assist) {
> +            vee->vee_insn_len = vmcb->v_n_bytes_fetched;
> +            memcpy(&vee->vee_insn_bytes, vmcb->v_guest_ins_bytes,
> +                sizeof(vee->vee_insn_bytes));
> +            vee->vee_insn_info |= VEE_BYTES_VALID;
> +        }
> +        ret = EAGAIN;
> +        break;
>      default:
>          printf("unknown memory type %d for GPA 0x%llx\n",
>              gpa_memtype, gpa);
> @@ -5862,10 +5885,12 @@ vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
>  int
>  vmx_handle_np_fault(struct vcpu *vcpu)
>  {
> -    uint64_t gpa;
> -    int gpa_memtype, ret;
> +    uint64_t insn_len = 0, gpa;
> +    int gpa_memtype, ret = 0;
> +    struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
>
> -    ret = 0;
> +    memset(vee, 0, sizeof(*vee));
> +
>      if (vmread(VMCS_GUEST_PHYSICAL_ADDRESS, &gpa)) {
>          printf("%s: cannot extract faulting pa\n", __func__);
>          return (EINVAL);
> @@ -5874,8 +5899,22 @@ vmx_handle_np_fault(struct vcpu *vcpu)
>      gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
>      switch (gpa_memtype) {
>      case VMM_MEM_TYPE_REGULAR:
> +        vee->vee_fault_type = VEE_FAULT_HANDLED;
>          ret = vmx_fault_page(vcpu, gpa);
>          break;
> +    case VMM_MEM_TYPE_MMIO:
> +        vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
> +        if (vmread(VMCS_INSTRUCTION_LENGTH, &insn_len) ||
> +            insn_len == 0 || insn_len > 15) {
> +            printf("%s: failed to extract instruction length\n",
> +                __func__);
> +            ret = EINVAL;
> +        } else {
> +            vee->vee_insn_len = (uint32_t)insn_len;
> +            vee->vee_insn_info |= VEE_LEN_VALID;
> +            ret = EAGAIN;
> +        }
> +        break;
>      default:
>          printf("unknown memory type %d for GPA 0x%llx\n",
>              gpa_memtype, gpa);
> @@ -7321,7 +7360,19 @@ vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *
>                          vcpu->vc_exit.vei.vei_data;
>                      vmcb->v_rax = vcpu->vc_gueststate.vg_rax;
>                  }
> +                break;
> +            case SVM_VMEXIT_NPF:
> +                ret = vcpu_writeregs_svm(vcpu, VM_RWREGS_GPRS,
> +                    &vcpu->vc_exit.vrs);
> +                if (ret) {
> +                    printf("%s: vm %d vcpu %d failed to update "
> +                        "registers\n", __func__,
> +                        vcpu->vc_parent->vm_id, vcpu->vc_id);
> +                    return (EINVAL);
> +                }
> +                break;
>              }
> +            memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
>          }
>
>      while (ret == 0) {
> blob - b8db48f27142521bf69b65a2ffdade1ce6ec5cde
> blob + 7b964b9ff20ed228aa127b2cd672412a166196d7
> --- sys/arch/amd64/include/cpu.h
> +++ sys/arch/amd64/include/cpu.h
> @@ -82,6 +82,7 @@ struct svm {
>      uint32_t    svm_max_asid;
>      uint8_t     svm_flush_by_asid;
>      uint8_t     svm_vmcb_clean;
> +    uint8_t     svm_decode_assist;
>  };
>
>  union vmm_cpu_cap {
> blob - 34701ffb864757bba5441855225bedeaa4fd613c
> blob + ede5897ab172d5ea1a3d3a1f185f71640e7adadf
> --- sys/arch/amd64/include/specialreg.h
> +++ sys/arch/amd64/include/specialreg.h
> @@ -1273,6 +1273,7 @@
>  #define AMD_SVM_NESTED_PAGING_CAP   (1 << 0)
>  #define AMD_SVM_VMCB_CLEAN_CAP      (1 << 5)
>  #define AMD_SVM_FLUSH_BY_ASID_CAP   (1 << 6)
> +#define AMD_SVM_DECODE_ASSIST_CAP   (1 << 7)
>  #define AMD_SVMDIS                  0x10
>
>  #define SVM_TLB_CONTROL_FLUSH_NONE  0
> blob - 6fbb5e1bdc22666010e1c7484035d5d09cd7955b
> blob + 19a4bdf940c4737e727214f6133f621cb859f474
> --- sys/arch/amd64/include/vmmvar.h
> +++ sys/arch/amd64/include/vmmvar.h
> @@ -324,7 +324,9 @@ enum {
>  };
>
>  enum {
> +    VEE_FAULT_INVALID = 0,
>      VEE_FAULT_HANDLED,
> +    VEE_FAULT_MMIO_ASSIST,
>      VEE_FAULT_PROTECT,
>  };
>
> @@ -361,7 +363,12 @@ struct vm_exit_inout {
>   * vm_exit_eptviolation : describes an EPT VIOLATION exit
>   */
>  struct vm_exit_eptviolation {
> -    uint8_t     vee_fault_type;
> +    uint8_t     vee_fault_type;     /* type of vm exit */
> +    uint8_t     vee_insn_info;      /* bitfield */
> +#define VEE_LEN_VALID   0x1         /* vee_insn_len is valid */
> +#define VEE_BYTES_VALID 0x2         /* vee_insn_bytes is valid */
> +    uint8_t     vee_insn_len;       /* [VMX] instruction length */
> +    uint8_t     vee_insn_bytes[15]; /* [SVM] bytes at {R,E,}IP */
>  };
>
>  /*
> @@ -709,6 +716,7 @@ enum {
>
>  enum {
>      VMM_MEM_TYPE_REGULAR,
> +    VMM_MEM_TYPE_MMIO,
>      VMM_MEM_TYPE_UNKNOWN
>  };
>
> blob - eac6616ea918716c2986bab0a05724b31bcce2ec
> blob + 1291416d98656f33b1c92b2cd072d2a3b594dfa2
> --- usr.sbin/vmd/vm.c
> +++ usr.sbin/vmd/vm.c
> @@ -1652,26 +1652,36 @@ vcpu_exit_inout(struct vm_run_params *vrp)
>   *
>   * Return values:
>   *  0: no action required
> - *  EAGAIN: a protection fault occured, kill the vm.
> + *  EFAULT: a protection fault occured, kill the vm.
>   */
>  int
>  vcpu_exit_eptviolation(struct vm_run_params *vrp)
>  {
> +    int ret = 0;
> +    uint8_t fault_type;
>      struct vm_exit *ve = vrp->vrp_exit;
>
> -    /*
> -     * vmd may be exiting to vmd to handle a pending interrupt
> -     * but last exit type may have been VMX_EXIT_EPT_VIOLATION,
> -     * check the fault_type to ensure we really are processing
> -     * a VMX_EXIT_EPT_VIOLATION.
> -     */
> -    if (ve->vee.vee_fault_type == VEE_FAULT_PROTECT) {
> -        log_debug("%s: EPT Violation: rip=0x%llx",
> -            __progname, vrp->vrp_exit->vrs.vrs_gprs[VCPU_REGS_RIP]);
> -        return (EAGAIN);
> +    fault_type = ve->vee.vee_fault_type;
> +    switch (fault_type) {
> +    case VEE_FAULT_HANDLED:
> +        log_debug("%s: fault already handled", __func__);
> +        break;
> +    case VEE_FAULT_MMIO_ASSIST:
> +        log_warnx("%s: mmio assist required: rip=0x%llx", __progname,
> +            ve->vrs.vrs_gprs[VCPU_REGS_RIP]);

This may be spammy. Maybe log_debug? Although previously, the VM would
have terminated, so maybe it's not as bad as I think.

> +        ret = EFAULT;
> +        break;
> +    case VEE_FAULT_PROTECT:
> +        log_debug("%s: EPT Violation: rip=0x%llx", __progname,
> +            ve->vrs.vrs_gprs[VCPU_REGS_RIP]);
> +        ret = EFAULT;
> +        break;
> +    default:
> +        fatalx("%s: invalid fault_type %d", __progname, fault_type);
> +        /* UNREACHED */
>      }
>
> -    return (0);
> +    return (ret);
>  }
>
>  /*
> @@ -1704,7 +1714,6 @@ vcpu_exit(struct vm_run_params *vrp)
>      case VMX_EXIT_CPUID:
>      case VMX_EXIT_EXTINT:
>      case SVM_VMEXIT_INTR:
> -    case SVM_VMEXIT_NPF:
>      case SVM_VMEXIT_MSR:
>      case SVM_VMEXIT_CPUID:
>          /*
> @@ -1715,11 +1724,11 @@ vcpu_exit(struct vm_run_params *vrp)
>           * in more vmd log spam).
>           */
>          break;
> +    case SVM_VMEXIT_NPF:
>      case VMX_EXIT_EPT_VIOLATION:
>          ret = vcpu_exit_eptviolation(vrp);
>          if (ret)
>              return (ret);
> -
>          break;
>      case VMX_EXIT_IO:
>      case SVM_VMEXIT_IOIO:
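Looking ahead to the userland decode work, a minimal sketch of how vmd
might branch on the new validity bits. fetch_guest_insn() is hypothetical
(the PT-walk helper doesn't exist yet); the vee fields and flags are the
ones added in the vmmvar.h hunk above.

    struct vm_exit_eptviolation *vee = &ve->vee;
    uint8_t insn[15];    /* max x86 instruction length */
    size_t len = sizeof(insn);

    if (vee->vee_insn_info & VEE_BYTES_VALID) {
        /* SVM decode assist already fetched the bytes at {r,e}ip. */
        len = vee->vee_insn_len;
        memcpy(insn, vee->vee_insn_bytes, sizeof(insn));
    } else {
        /* VMX provides at most a length; the bytes must come from
         * a guest page table walk done entirely in userland. */
        if (vee->vee_insn_info & VEE_LEN_VALID)
            len = vee->vee_insn_len;
        if (fetch_guest_insn(ve, insn, len) == -1)    /* hypothetical */
            return (EFAULT);
    }
    /* ...then hand insn/len to the (future) decoder. */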