Mike Larkin <mlar...@nested.page> writes:

> On Fri, Jul 15, 2022 at 12:27:04PM -0400, Dave Voutila wrote:
>> The following diff adds in formalization around mmio assists for nested
>> page/ept faults on Intel and AMD vmm(4) hosts. It provides what little
>> information is available to userland in terms of either the instruction
>> bytes (on AMD) or the instruction length (on Intel).
>>
>> vmd is updated to intercept these vm exit events, but will currently log
>> the issue and cause the vm process to exit. This is the same behavior
>> folks experience currently when a guest attempts to read/write guest
>> physical addresses in the mmio-reserved ranges, but now with an
>> explanation of the reason and current {e,r,}ip value.
>>
>> This is the foundation I'll build upon while implementing instruction
>> decode and emulation support in userland. No noticeable change should
>> occur for existing guests that don't trigger mmio assist events.
>>
>> ok?
>>
>
> See below.
>
> -ml
>
>> -dv
>>
>>
>> diff refs/heads/master refs/heads/mmio
>> commit - 10e026163f31687dba11fb4655500afb4e616258
>> commit + 7d92e26b51c3fd520807dbcd5233f14b76bc611e
>> blob - 84da19438b74377276b16b4b4f7db45ae9ec6be2
>> blob + a89a4dc0fbe7b31a47390de363109092b76ffa22
>> --- sys/arch/amd64/amd64/vmm.c
>> +++ sys/arch/amd64/amd64/vmm.c
>> @@ -4891,11 +4891,20 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params 
>> *
>>                              vcpu->vc_gueststate.vg_rax =
>>                                  vcpu->vc_exit.vei.vei_data;
>>                      break;
>> +            case VMX_EXIT_EPT_VIOLATION:
>> +                    ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_GPRS, 0,
>> +                        &vcpu->vc_exit.vrs);
>> +                    if (ret) {
>> +                            printf("%s: vm %d vcpu %d failed to update "
>> +                                "registers\n", __func__,
>> +                                vcpu->vc_parent->vm_id, vcpu->vc_id);
>> +                            return (EINVAL);
>> +                    }
>> +                    break;
>>              case VM_EXIT_NONE:
>>              case VMX_EXIT_HLT:
>>              case VMX_EXIT_INT_WINDOW:
>>              case VMX_EXIT_EXTINT:
>> -            case VMX_EXIT_EPT_VIOLATION:
>>              case VMX_EXIT_CPUID:
>>              case VMX_EXIT_XSETBV:
>>                      break;
>> @@ -4927,6 +4936,7 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
>>                      break;
>>  #endif /* VMM_DEBUG */
>>              }
>> +            memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
>>      }
>>
>>      setregion(&gdt, ci->ci_gdt, GDT_SIZE - 1);
>> @@ -5658,7 +5668,7 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
>>
>>      if (gpa >= VMM_PCI_MMIO_BAR_BASE && gpa <= VMM_PCI_MMIO_BAR_END) {
>>              DPRINTF("guest mmio access @ 0x%llx\n", (uint64_t)gpa);
>> -            return (VMM_MEM_TYPE_REGULAR);
>> +            return (VMM_MEM_TYPE_MMIO);
>>      }
>>
>>      /* XXX Use binary search? */
>> @@ -5782,18 +5792,28 @@ int
>>  svm_handle_np_fault(struct vcpu *vcpu)
>>  {
>>      uint64_t gpa;
>> -    int gpa_memtype, ret;
>> +    int gpa_memtype, ret = 0;
>>      struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
>> +    struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
>>
>> -    ret = 0;
>> +    memset(vee, 0, sizeof(*vee));
>> +    vee->vee_fault_type = VEE_FAULT_INVALID;
>>
>>      gpa = vmcb->v_exitinfo2;
>>
>>      gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
>>      switch (gpa_memtype) {
>>      case VMM_MEM_TYPE_REGULAR:
>> +            vee->vee_fault_type = VEE_FAULT_HANDLED;
>>              ret = svm_fault_page(vcpu, gpa);
>>              break;
>> +    case VMM_MEM_TYPE_MMIO:
>> +            vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
>> +            vee->vee_insn_len = vmcb->v_n_bytes_fetched;
>
> If you are going to depend on decode assists, we should probably validate
> that the CPU has them:
>
> CPUID Fn8000_000A_EDX[7]:
>  DecodeAssists Decode assists. Indicates support for the decode assists. See 
> “Decode Assists.”
>
> It does look like we sorta assume that we have some of these features (like
> support for cleanbits, and even NP itself), so there are probably a few
> checks we should do in attach. We have something similar in vmx but it's
> done later on vm create IIRC because at the time we wanted to run on
> a variety of host CPUs with different capabilities. I don't think that
> is important anymore, so maybe a short-circuit check for some of these
> required features in vmmattach for both svm and vmx is warranted now.
>
> FWIW I've never seen a svm CPU without these assists (but I bet they exist).
>

Added in checks for AMD. I can't find anything definitive for Intel, and
the SDM seems cryptic. FWIW, the one bit I found in the
IA32_VMX_EPT_VPID_CAP MSR (bit 21) doesn't seem to be checked by
Linux/KVM.

>
> I think you will need a "vee_insn_bytes_valid" flag unless you are planning
> on filling out vee_insn_bytes in vmm(4) before returning (in the VMX case).
>

Added a flag indicating whether the length and/or the bytes are valid.
Maybe in the future a platform or vmm will provide both.

The approach I took for the SVM code path is to still relay the need for
an assist to userland; worst case, vmd just does some extra work and the
fault is slower to service.
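To make that fallback concrete, here's a minimal sketch (not part of the
diff; the struct mirrors the vmmvar.h hunk, and vee_copy_insn() is a
hypothetical helper name) of how vmd-side code could consume the new
flags, using the decode-assist bytes when present and otherwise signaling
that a fetch is needed:

```c
/*
 * Hedged sketch, not part of the diff: how a vmd-side consumer might
 * use the new vee_insn_info flags. The struct layout mirrors the
 * vmmvar.h hunk; vee_copy_insn() is a made-up helper.
 */
#include <stdint.h>
#include <string.h>

#define VEE_LEN_VALID	0x1	/* vee_insn_len is valid */
#define VEE_BYTES_VALID	0x2	/* vee_insn_bytes is valid */

struct vm_exit_eptviolation {
	uint8_t	vee_fault_type;
	uint8_t	vee_insn_info;
	uint8_t	vee_insn_len;
	uint8_t	vee_insn_bytes[15];
};

/*
 * Copy any instruction bytes the CPU already fetched into buf.
 * Returns the byte count, or 0 when vmd must walk the guest page
 * tables and fetch the instruction itself (the VMX case).
 */
static int
vee_copy_insn(const struct vm_exit_eptviolation *vee, uint8_t *buf)
{
	if (vee->vee_insn_info & VEE_BYTES_VALID) {
		memcpy(buf, vee->vee_insn_bytes,
		    sizeof(vee->vee_insn_bytes));
		return ((vee->vee_insn_info & VEE_LEN_VALID) ?
		    vee->vee_insn_len : (int)sizeof(vee->vee_insn_bytes));
	}
	return (0);
}
```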

> Otherwise, vmd does not know (or today, care) about what type of CPU it's
> running on and there will be no way to know that "vee_insn_len" bytes, on
> VMX, should not be interpreted as anything to do with vee_insn_bytes.
>
> Was your plan to do PT walks to fetch arguments (and/or insn bytes)
> in vmd or in vmm?
>

The plan is to do the PT walks entirely in userland and only use what the
CPU gives us for free (if anything) via decode assist.
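For the curious, that userland walk would look roughly like this. This is
a self-contained sketch under heavy assumptions: 4-level long-mode paging
only, no permission or reserved-bit checks, and read_gpa()/fakemem are
made-up stand-ins for however vmd actually reads guest physical memory:

```c
/*
 * Rough sketch only: a 4-level long-mode guest page table walk as vmd
 * might do it in userland. No permission/reserved-bit checks;
 * read_gpa() and fakemem are invented stand-ins for vmd's view of
 * guest RAM.
 */
#include <stdint.h>

#define PG_V		0x1ULL			/* present */
#define PG_PS		0x80ULL			/* large page (PDPTE/PDE) */
#define PG_FRAME	0x000ffffffffff000ULL

static uint64_t fakemem[4096];		/* flat 32KB "guest RAM" */

static uint64_t
read_gpa(uint64_t gpa)
{
	return (fakemem[gpa / 8]);
}

/* Translate gva to gpa. Returns 0 on success, -1 on a non-present level. */
static int
gva_to_gpa(uint64_t cr3, uint64_t gva, uint64_t *gpa)
{
	uint64_t pte, base = cr3 & PG_FRAME;
	int shift;

	for (shift = 39; shift >= 12; shift -= 9) {
		pte = read_gpa(base + ((gva >> shift) & 0x1ff) * 8);
		if ((pte & PG_V) == 0)
			return (-1);
		base = pte & PG_FRAME;
		if (shift > 12 && (pte & PG_PS)) {
			/* 1GB/2MB page: low gva bits pass through. */
			*gpa = (base & ~((1ULL << shift) - 1)) |
			    (gva & ((1ULL << shift) - 1));
			return (0);
		}
	}
	*gpa = base | (gva & 0xfff);
	return (0);
}
```

With only the instruction length from VMX (or nothing at all), vmd would
walk from the guest's CR3 like this to find the gpa of {r,e,}ip and then
fetch the instruction bytes itself before decoding.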

How's this (the first hunk is just a whitespace fix, btw)? I've tested on
my AMD and Intel machines. It's safe for others to try and confirm they
see no change in their current workflows.


diff refs/heads/master refs/heads/mmio
commit - d1de5a4a40a1f0a466a361526282e3b2cd5207fa
commit + 3634ab1f67c960eba84f75bf0410bf2a13c2dbe7
blob - e1748de7bc15b62cf16931a067a31948a20fd187
blob + 0441f4093b5711d4a815b89a4263fc348c4f185b
--- sys/arch/amd64/amd64/identcpu.c
+++ sys/arch/amd64/amd64/identcpu.c
@@ -335,7 +335,7 @@ cpu_hz_update_sensor(void *args)

        if (mdelta > 0) {
                val = (adelta * 1000000) / mdelta * tsc_frequency;
-               val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ;
+               val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ;
                ci->ci_hz_sensor.value = val;
        }

@@ -1055,6 +1055,9 @@ cpu_check_vmm_cap(struct cpu_info *ci)

                if (edx & AMD_SVM_VMCB_CLEAN_CAP)
                        ci->ci_vmm_cap.vcc_svm.svm_vmcb_clean = 1;
+
+               if (edx & AMD_SVM_DECODE_ASSIST_CAP)
+                       ci->ci_vmm_cap.vcc_svm.svm_decode_assist = 1;
        }

        /*
blob - 354cfac206f43be18f3438507856005ec62363ab
blob + 34036a56ccde7fe2c231ee6c3a86ec1017d8176d
--- sys/arch/amd64/amd64/vmm.c
+++ sys/arch/amd64/amd64/vmm.c
@@ -4891,11 +4891,20 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
                                vcpu->vc_gueststate.vg_rax =
                                    vcpu->vc_exit.vei.vei_data;
                        break;
+               case VMX_EXIT_EPT_VIOLATION:
+                       ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_GPRS, 0,
+                           &vcpu->vc_exit.vrs);
+                       if (ret) {
+                               printf("%s: vm %d vcpu %d failed to update "
+                                   "registers\n", __func__,
+                                   vcpu->vc_parent->vm_id, vcpu->vc_id);
+                               return (EINVAL);
+                       }
+                       break;
                case VM_EXIT_NONE:
                case VMX_EXIT_HLT:
                case VMX_EXIT_INT_WINDOW:
                case VMX_EXIT_EXTINT:
-               case VMX_EXIT_EPT_VIOLATION:
                case VMX_EXIT_CPUID:
                case VMX_EXIT_XSETBV:
                        break;
@@ -4927,6 +4936,7 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
                        break;
 #endif /* VMM_DEBUG */
                }
+               memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
        }

        setregion(&gdt, ci->ci_gdt, GDT_SIZE - 1);
@@ -5658,7 +5668,7 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)

        if (gpa >= VMM_PCI_MMIO_BAR_BASE && gpa <= VMM_PCI_MMIO_BAR_END) {
                DPRINTF("guest mmio access @ 0x%llx\n", (uint64_t)gpa);
-               return (VMM_MEM_TYPE_REGULAR);
+               return (VMM_MEM_TYPE_MMIO);
        }

        /* XXX Use binary search? */
@@ -5782,18 +5792,31 @@ int
 svm_handle_np_fault(struct vcpu *vcpu)
 {
        uint64_t gpa;
-       int gpa_memtype, ret;
+       int gpa_memtype, ret = 0;
        struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
+       struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
+       struct cpu_info *ci = curcpu();

-       ret = 0;
+       memset(vee, 0, sizeof(*vee));

        gpa = vmcb->v_exitinfo2;

        gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
        switch (gpa_memtype) {
        case VMM_MEM_TYPE_REGULAR:
+               vee->vee_fault_type = VEE_FAULT_HANDLED;
                ret = svm_fault_page(vcpu, gpa);
                break;
+       case VMM_MEM_TYPE_MMIO:
+               vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
+               if (ci->ci_vmm_cap.vcc_svm.svm_decode_assist) {
+                       vee->vee_insn_len = vmcb->v_n_bytes_fetched;
+                       memcpy(&vee->vee_insn_bytes, vmcb->v_guest_ins_bytes,
+                           sizeof(vee->vee_insn_bytes));
+                       vee->vee_insn_info |= VEE_BYTES_VALID;
+               }
+               ret = EAGAIN;
+               break;
        default:
                printf("unknown memory type %d for GPA 0x%llx\n",
                    gpa_memtype, gpa);
@@ -5862,10 +5885,12 @@ vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
 int
 vmx_handle_np_fault(struct vcpu *vcpu)
 {
-       uint64_t gpa;
-       int gpa_memtype, ret;
+       uint64_t insn_len = 0, gpa;
+       int gpa_memtype, ret = 0;
+       struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;

-       ret = 0;
+       memset(vee, 0, sizeof(*vee));
+
        if (vmread(VMCS_GUEST_PHYSICAL_ADDRESS, &gpa)) {
                printf("%s: cannot extract faulting pa\n", __func__);
                return (EINVAL);
@@ -5874,8 +5899,22 @@ vmx_handle_np_fault(struct vcpu *vcpu)
        gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
        switch (gpa_memtype) {
        case VMM_MEM_TYPE_REGULAR:
+               vee->vee_fault_type = VEE_FAULT_HANDLED;
                ret = vmx_fault_page(vcpu, gpa);
                break;
+       case VMM_MEM_TYPE_MMIO:
+               vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
+               if (vmread(VMCS_INSTRUCTION_LENGTH, &insn_len) ||
+                   insn_len == 0 || insn_len > 15) {
+                       printf("%s: failed to extract instruction length\n",
+                           __func__);
+                       ret = EINVAL;
+               } else {
+                       vee->vee_insn_len = (uint32_t)insn_len;
+                       vee->vee_insn_info |= VEE_LEN_VALID;
+                       ret = EAGAIN;
+               }
+               break;
        default:
                printf("unknown memory type %d for GPA 0x%llx\n",
                    gpa_memtype, gpa);
@@ -7321,7 +7360,19 @@ vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *
                                    vcpu->vc_exit.vei.vei_data;
                                vmcb->v_rax = vcpu->vc_gueststate.vg_rax;
                        }
+                       break;
+               case SVM_VMEXIT_NPF:
+                       ret = vcpu_writeregs_svm(vcpu, VM_RWREGS_GPRS,
+                           &vcpu->vc_exit.vrs);
+                       if (ret) {
+                               printf("%s: vm %d vcpu %d failed to update "
+                                   "registers\n", __func__,
+                                   vcpu->vc_parent->vm_id, vcpu->vc_id);
+                               return (EINVAL);
+                       }
+                       break;
                }
+               memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
        }

        while (ret == 0) {
blob - b8db48f27142521bf69b65a2ffdade1ce6ec5cde
blob + 7b964b9ff20ed228aa127b2cd672412a166196d7
--- sys/arch/amd64/include/cpu.h
+++ sys/arch/amd64/include/cpu.h
@@ -82,6 +82,7 @@ struct svm {
        uint32_t        svm_max_asid;
        uint8_t         svm_flush_by_asid;
        uint8_t         svm_vmcb_clean;
+       uint8_t         svm_decode_assist;
 };

 union vmm_cpu_cap {
blob - 34701ffb864757bba5441855225bedeaa4fd613c
blob + ede5897ab172d5ea1a3d3a1f185f71640e7adadf
--- sys/arch/amd64/include/specialreg.h
+++ sys/arch/amd64/include/specialreg.h
@@ -1273,6 +1273,7 @@
 #define AMD_SVM_NESTED_PAGING_CAP      (1 << 0)
 #define AMD_SVM_VMCB_CLEAN_CAP         (1 << 5)
 #define AMD_SVM_FLUSH_BY_ASID_CAP      (1 << 6)
+#define AMD_SVM_DECODE_ASSIST_CAP      (1 << 7)
 #define AMD_SVMDIS                     0x10

 #define SVM_TLB_CONTROL_FLUSH_NONE     0
blob - 6fbb5e1bdc22666010e1c7484035d5d09cd7955b
blob + 19a4bdf940c4737e727214f6133f621cb859f474
--- sys/arch/amd64/include/vmmvar.h
+++ sys/arch/amd64/include/vmmvar.h
@@ -324,7 +324,9 @@ enum {
 };

 enum {
+       VEE_FAULT_INVALID = 0,
        VEE_FAULT_HANDLED,
+       VEE_FAULT_MMIO_ASSIST,
        VEE_FAULT_PROTECT,
 };

@@ -361,7 +363,12 @@ struct vm_exit_inout {
  *  vm_exit_eptviolation       : describes an EPT VIOLATION exit
  */
 struct vm_exit_eptviolation {
-       uint8_t         vee_fault_type;
+       uint8_t         vee_fault_type;         /* type of vm exit */
+       uint8_t         vee_insn_info;          /* bitfield */
+#define VEE_LEN_VALID          0x1             /* vee_insn_len is valid */
#define VEE_BYTES_VALID                0x2             /* vee_insn_bytes is valid */
+       uint8_t         vee_insn_len;           /* [VMX] instruction length */
+       uint8_t         vee_insn_bytes[15];     /* [SVM] bytes at {R,E,}IP */
 };

 /*
@@ -709,6 +716,7 @@ enum {

 enum {
        VMM_MEM_TYPE_REGULAR,
+       VMM_MEM_TYPE_MMIO,
        VMM_MEM_TYPE_UNKNOWN
 };

blob - eac6616ea918716c2986bab0a05724b31bcce2ec
blob + 1291416d98656f33b1c92b2cd072d2a3b594dfa2
--- usr.sbin/vmd/vm.c
+++ usr.sbin/vmd/vm.c
@@ -1652,26 +1652,36 @@ vcpu_exit_inout(struct vm_run_params *vrp)
  *
  * Return values:
  *  0: no action required
- *  EAGAIN: a protection fault occured, kill the vm.
+ *  EFAULT: a protection fault occurred, kill the vm.
  */
 int
 vcpu_exit_eptviolation(struct vm_run_params *vrp)
 {
+       int ret = 0;
+       uint8_t fault_type;
        struct vm_exit *ve = vrp->vrp_exit;

-       /*
-        * vmd may be exiting to vmd to handle a pending interrupt
-        * but last exit type may have been VMX_EXIT_EPT_VIOLATION,
-        * check the fault_type to ensure we really are processing
-        * a VMX_EXIT_EPT_VIOLATION.
-        */
-       if (ve->vee.vee_fault_type == VEE_FAULT_PROTECT) {
-               log_debug("%s: EPT Violation: rip=0x%llx",
-                   __progname, vrp->vrp_exit->vrs.vrs_gprs[VCPU_REGS_RIP]);
-               return (EAGAIN);
+       fault_type = ve->vee.vee_fault_type;
+       switch (fault_type) {
+       case VEE_FAULT_HANDLED:
+               log_debug("%s: fault already handled", __func__);
+               break;
+       case VEE_FAULT_MMIO_ASSIST:
+               log_warnx("%s: mmio assist required: rip=0x%llx", __progname,
+                   ve->vrs.vrs_gprs[VCPU_REGS_RIP]);
+               ret = EFAULT;
+               break;
+       case VEE_FAULT_PROTECT:
+               log_debug("%s: EPT Violation: rip=0x%llx", __progname,
+                   ve->vrs.vrs_gprs[VCPU_REGS_RIP]);
+               ret = EFAULT;
+               break;
+       default:
+               fatalx("%s: invalid fault_type %d", __progname, fault_type);
+               /* NOTREACHED */
        }

-       return (0);
+       return (ret);
 }

 /*
@@ -1704,7 +1714,6 @@ vcpu_exit(struct vm_run_params *vrp)
        case VMX_EXIT_CPUID:
        case VMX_EXIT_EXTINT:
        case SVM_VMEXIT_INTR:
-       case SVM_VMEXIT_NPF:
        case SVM_VMEXIT_MSR:
        case SVM_VMEXIT_CPUID:
                /*
@@ -1715,11 +1724,11 @@ vcpu_exit(struct vm_run_params *vrp)
                 * in more vmd log spam).
                 */
                break;
+       case SVM_VMEXIT_NPF:
        case VMX_EXIT_EPT_VIOLATION:
                ret = vcpu_exit_eptviolation(vrp);
                if (ret)
                        return (ret);
-
                break;
        case VMX_EXIT_IO:
        case SVM_VMEXIT_IOIO:
