(see below)

On Mon, Aug 15, 2022 at 03:06:05PM -0400, Dave Voutila wrote:
>
> Mike Larkin <mlar...@nested.page> writes:
>
> > On Fri, Jul 15, 2022 at 12:27:04PM -0400, Dave Voutila wrote:
> >> The following diff adds in formalization around mmio assists for nested
> >> page/ept faults on Intel and AMD vmm(4) hosts. It provides what little
> >> information is available to userland in terms of either the instruction
> >> bytes (on AMD) or the instruction length (on Intel).
> >>
> >> vmd is updated to intercept these vm exit events, but for now it
> >> will log the issue and cause the vm process to exit. This is the
> >> same behavior folks experience today when a guest attempts to
> >> read/write guest physical addresses in the mmio-reserved ranges, but
> >> now with an explanation of the reason and the current {e,r,}ip value.
> >>
> >> This is the foundation I'll build upon while implementing instruction
> >> decode and emulation support in userland. No noticeable change should
> >> occur for existing guests that don't trigger mmio assist events.
> >>
> >> ok?
> >>
> >
> > See below.
> >
> > -ml
> >
> >> -dv
> >>
> >>
> >> diff refs/heads/master refs/heads/mmio
> >> commit - 10e026163f31687dba11fb4655500afb4e616258
> >> commit + 7d92e26b51c3fd520807dbcd5233f14b76bc611e
> >> blob - 84da19438b74377276b16b4b4f7db45ae9ec6be2
> >> blob + a89a4dc0fbe7b31a47390de363109092b76ffa22
> >> --- sys/arch/amd64/amd64/vmm.c
> >> +++ sys/arch/amd64/amd64/vmm.c
> >> @@ -4891,11 +4891,20 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
> >>                            vcpu->vc_gueststate.vg_rax =
> >>                                vcpu->vc_exit.vei.vei_data;
> >>                    break;
> >> +          case VMX_EXIT_EPT_VIOLATION:
> >> +                  ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_GPRS, 0,
> >> +                      &vcpu->vc_exit.vrs);
> >> +                  if (ret) {
> >> +                          printf("%s: vm %d vcpu %d failed to update "
> >> +                              "registers\n", __func__,
> >> +                              vcpu->vc_parent->vm_id, vcpu->vc_id);
> >> +                          return (EINVAL);
> >> +                  }
> >> +                  break;
> >>            case VM_EXIT_NONE:
> >>            case VMX_EXIT_HLT:
> >>            case VMX_EXIT_INT_WINDOW:
> >>            case VMX_EXIT_EXTINT:
> >> -          case VMX_EXIT_EPT_VIOLATION:
> >>            case VMX_EXIT_CPUID:
> >>            case VMX_EXIT_XSETBV:
> >>                    break;
> >> @@ -4927,6 +4936,7 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
> >>                    break;
> >>  #endif /* VMM_DEBUG */
> >>            }
> >> +          memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
> >>    }
> >>
> >>    setregion(&gdt, ci->ci_gdt, GDT_SIZE - 1);
> >> @@ -5658,7 +5668,7 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
> >>
> >>    if (gpa >= VMM_PCI_MMIO_BAR_BASE && gpa <= VMM_PCI_MMIO_BAR_END) {
> >>            DPRINTF("guest mmio access @ 0x%llx\n", (uint64_t)gpa);
> >> -          return (VMM_MEM_TYPE_REGULAR);
> >> +          return (VMM_MEM_TYPE_MMIO);
> >>    }
> >>
> >>    /* XXX Use binary search? */
> >> @@ -5782,18 +5792,28 @@ int
> >>  svm_handle_np_fault(struct vcpu *vcpu)
> >>  {
> >>    uint64_t gpa;
> >> -  int gpa_memtype, ret;
> >> +  int gpa_memtype, ret = 0;
> >>    struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
> >> +  struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
> >>
> >> -  ret = 0;
> >> +  memset(vee, 0, sizeof(*vee));
> >> +  vee->vee_fault_type = VEE_FAULT_INVALID;
> >>
> >>    gpa = vmcb->v_exitinfo2;
> >>
> >>    gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
> >>    switch (gpa_memtype) {
> >>    case VMM_MEM_TYPE_REGULAR:
> >> +          vee->vee_fault_type = VEE_FAULT_HANDLED;
> >>            ret = svm_fault_page(vcpu, gpa);
> >>            break;
> >> +  case VMM_MEM_TYPE_MMIO:
> >> +          vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
> >> +          vee->vee_insn_len = vmcb->v_n_bytes_fetched;
> >
> > If you are going to depend on decode assists, we should probably validate
> > that the CPU has them:
> >
> > CPUID Fn8000_000A_EDX[7]:
> >  DecodeAssists Decode assists. Indicates support for the decode assists. 
> > See “Decode Assists.”
> >
> > It does look like we sorta assume that we have some of these features (like
> > support for cleanbits, and even NP itself), so there are probably a few
> > checks we should do in attach. We have something similar in vmx but it's
> > done later on vm create IIRC because at the time we wanted to run on
> > a variety of host CPUs with different capabilities. I don't think that
> > is important anymore, so maybe a short-circuit check for some of these
> > required features in vmmattach for both svm and vmx is warranted now.
> >
> > FWIW I've never seen a svm CPU without these assists (but I bet they exist).
> >
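
For anyone wanting to check their own hardware, here's a minimal
userland sketch of the Fn8000_000A probe (bit positions per the APM
text quoted above: EDX[0] is nested paging, EDX[7] decode assists).
Illustrative only, not the vmm(4) attach code:

/* cc -o svmcaps svmcaps.c && ./svmcaps, on an AMD host */
#include <stdio.h>
#include <stdint.h>

static void
cpuid(uint32_t leaf, uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
	__asm__ volatile("cpuid"
	    : "=a" (*a), "=b" (*b), "=c" (*c), "=d" (*d)
	    : "a" (leaf), "c" (0));
}

int
main(void)
{
	uint32_t a, b, c, d;

	cpuid(0x80000000, &a, &b, &c, &d);
	if (a < 0x8000000a) {
		printf("no SVM feature leaf\n");
		return (1);
	}
	cpuid(0x8000000a, &a, &b, &c, &d);
	printf("nested paging:  %s\n", (d & (1 << 0)) ? "yes" : "no");
	printf("decode assists: %s\n", (d & (1 << 7)) ? "yes" : "no");
	return (0);
}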
>
> Added checks for AMD. I can't find anything definitive for Intel, and
> the SDM seems cryptic. FWIW the one bit I found in the
> IA32_VMX_EPT_VPID_CAP msr (bit 21) doesn't seem to be checked by
> Linux/KVM.
>
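
For reference, if we ever want an Intel-side probe, a hedged fragment
in the style of the existing cpu_check_vmm_cap() checks, reading
IA32_VMX_EPT_VPID_CAP (MSR 0x48C); bit 21 is the "advanced VM-exit
information for EPT violations" bit mentioned above. The
vmx_ept_advanced_info field is made up here, just to show the shape;
this is not runnable on its own, since rdmsr() and ci are kernel
context:

	uint64_t msr;

	/* IA32_VMX_EPT_VPID_CAP, bit 21: advanced EPT-violation info */
	msr = rdmsr(IA32_VMX_EPT_VPID_CAP);
	if (msr & (1ULL << 21))
		ci->ci_vmm_cap.vcc_vmx.vmx_ept_advanced_info = 1;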
> >
> > I think you will need a "vee_insn_bytes_valid" flag unless you are planning
> > on filling out vee_insn_bytes in vmm(4) before returning (in the VMX case).
> >
>
> Added a flag indicating whether the length and/or bytes are valid.
> Maybe in the future a platform or vmm will provide both.
>
> The approach I took for the SVM code path is to still relay the need
> for an assist to userland; worst case, vmd just does some extra work
> and the fault is slower to service.
>
> > Otherwise, vmd does not know (or today, care) about what type of CPU it's
> > running on and there will be no way to know that "vee_insn_len" bytes, on
> > VMX, should not be interpreted as anything to do with vee_insn_bytes.
> >
> > Was your plan to do PT walks to fetch arguments (and/or insn bytes)
> > in vmd or in vmm?
> >
>
> The plan is to do the PT walks entirely in userland and only use what
> the CPU gives us for free (if anything) from decode assist.
>
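
As a sketch of what that userland walk could look like: a simplified
4-level long-mode translation, where read_gpa() is a hypothetical
helper that copies from guest physical memory, and PAE/legacy modes
and permission checks are ignored.

#include <sys/types.h>
#include <stdint.h>
#include <errno.h>

int read_gpa(uint64_t gpa, void *buf, size_t len);	/* hypothetical */

#define PG_V		0x1ULL			/* present */
#define PG_PS		0x80ULL			/* large page */
#define PG_FRAME	0x000ffffffffff000ULL

int
gva_to_gpa(uint64_t cr3, uint64_t gva, uint64_t *gpa)
{
	uint64_t pte = 0, table = cr3 & PG_FRAME;
	int level, shift;

	for (level = 3; level >= 0; level--) {
		shift = 12 + 9 * level;
		if (read_gpa(table + ((gva >> shift) & 0x1ff) * 8,
		    &pte, sizeof(pte)))
			return (EFAULT);
		if ((pte & PG_V) == 0)
			return (EFAULT);
		/* 1G/2M pages: PS is only meaningful below the PML4. */
		if (level > 0 && level < 3 && (pte & PG_PS)) {
			*gpa = (pte & PG_FRAME & ~((1ULL << shift) - 1)) |
			    (gva & ((1ULL << shift) - 1));
			return (0);
		}
		table = pte & PG_FRAME;
	}
	*gpa = (pte & PG_FRAME) | (gva & 0xfff);
	return (0);
}

With the gpa of the faulting rip in hand, vmd can read up to 15 bytes
and feed its decoder even when the hardware gives it nothing.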
> How's this? (The first hunk was just a whitespace fix, btw.) I've
> tested on my AMD and Intel machines. It's safe for others to try and
> confirm there's no change in their current workflow.
>
>

This is ok mlarkin, but see below.

-ml

> diff refs/heads/master refs/heads/mmio
> commit - d1de5a4a40a1f0a466a361526282e3b2cd5207fa
> commit + 3634ab1f67c960eba84f75bf0410bf2a13c2dbe7
> blob - e1748de7bc15b62cf16931a067a31948a20fd187
> blob + 0441f4093b5711d4a815b89a4263fc348c4f185b
> --- sys/arch/amd64/amd64/identcpu.c
> +++ sys/arch/amd64/amd64/identcpu.c
> @@ -335,7 +335,7 @@ cpu_hz_update_sensor(void *args)
>
>       if (mdelta > 0) {
>               val = (adelta * 1000000) / mdelta * tsc_frequency;
> -             val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ;
> +             val = ((val + FREQ_50MHZ / 2) / FREQ_50MHZ) * FREQ_50MHZ;
>               ci->ci_hz_sensor.value = val;
>       }
>
> @@ -1055,6 +1055,9 @@ cpu_check_vmm_cap(struct cpu_info *ci)
>
>               if (edx & AMD_SVM_VMCB_CLEAN_CAP)
>                       ci->ci_vmm_cap.vcc_svm.svm_vmcb_clean = 1;
> +
> +             if (edx & AMD_SVM_DECODE_ASSIST_CAP)
> +                     ci->ci_vmm_cap.vcc_svm.svm_decode_assist = 1;
>       }
>
>       /*
> blob - 354cfac206f43be18f3438507856005ec62363ab
> blob + 34036a56ccde7fe2c231ee6c3a86ec1017d8176d
> --- sys/arch/amd64/amd64/vmm.c
> +++ sys/arch/amd64/amd64/vmm.c
> @@ -4891,11 +4891,20 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
>                               vcpu->vc_gueststate.vg_rax =
>                                   vcpu->vc_exit.vei.vei_data;
>                       break;
> +             case VMX_EXIT_EPT_VIOLATION:
> +                     ret = vcpu_writeregs_vmx(vcpu, VM_RWREGS_GPRS, 0,
> +                         &vcpu->vc_exit.vrs);
> +                     if (ret) {
> +                             printf("%s: vm %d vcpu %d failed to update "
> +                                 "registers\n", __func__,
> +                                 vcpu->vc_parent->vm_id, vcpu->vc_id);
> +                             return (EINVAL);
> +                     }
> +                     break;
>               case VM_EXIT_NONE:
>               case VMX_EXIT_HLT:
>               case VMX_EXIT_INT_WINDOW:
>               case VMX_EXIT_EXTINT:
> -             case VMX_EXIT_EPT_VIOLATION:
>               case VMX_EXIT_CPUID:
>               case VMX_EXIT_XSETBV:
>                       break;
> @@ -4927,6 +4936,7 @@ vcpu_run_vmx(struct vcpu *vcpu, struct vm_run_params *
>                       break;
>  #endif /* VMM_DEBUG */
>               }
> +             memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
>       }
>
>       setregion(&gdt, ci->ci_gdt, GDT_SIZE - 1);
> @@ -5658,7 +5668,7 @@ vmm_get_guest_memtype(struct vm *vm, paddr_t gpa)
>
>       if (gpa >= VMM_PCI_MMIO_BAR_BASE && gpa <= VMM_PCI_MMIO_BAR_END) {
>               DPRINTF("guest mmio access @ 0x%llx\n", (uint64_t)gpa);
> -             return (VMM_MEM_TYPE_REGULAR);
> +             return (VMM_MEM_TYPE_MMIO);
>       }
>
>       /* XXX Use binary search? */
> @@ -5782,18 +5792,31 @@ int
>  svm_handle_np_fault(struct vcpu *vcpu)
>  {
>       uint64_t gpa;
> -     int gpa_memtype, ret;
> +     int gpa_memtype, ret = 0;
>       struct vmcb *vmcb = (struct vmcb *)vcpu->vc_control_va;
> +     struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
> +     struct cpu_info *ci = curcpu();
>
> -     ret = 0;
> +     memset(vee, 0, sizeof(*vee));
>
>       gpa = vmcb->v_exitinfo2;
>
>       gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
>       switch (gpa_memtype) {
>       case VMM_MEM_TYPE_REGULAR:
> +             vee->vee_fault_type = VEE_FAULT_HANDLED;
>               ret = svm_fault_page(vcpu, gpa);
>               break;
> +     case VMM_MEM_TYPE_MMIO:
> +             vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
> +             if (ci->ci_vmm_cap.vcc_svm.svm_decode_assist) {
> +                     vee->vee_insn_len = vmcb->v_n_bytes_fetched;
> +                     memcpy(&vee->vee_insn_bytes, vmcb->v_guest_ins_bytes,
> +                         sizeof(vee->vee_insn_bytes));
> +                     vee->vee_insn_info |= VEE_BYTES_VALID;
> +             }
> +             ret = EAGAIN;
> +             break;
>       default:
>               printf("unknown memory type %d for GPA 0x%llx\n",
>                   gpa_memtype, gpa);
> @@ -5862,10 +5885,12 @@ vmx_fault_page(struct vcpu *vcpu, paddr_t gpa)
>  int
>  vmx_handle_np_fault(struct vcpu *vcpu)
>  {
> -     uint64_t gpa;
> -     int gpa_memtype, ret;
> +     uint64_t insn_len = 0, gpa;
> +     int gpa_memtype, ret = 0;
> +     struct vm_exit_eptviolation *vee = &vcpu->vc_exit.vee;
>
> -     ret = 0;
> +     memset(vee, 0, sizeof(*vee));
> +
>       if (vmread(VMCS_GUEST_PHYSICAL_ADDRESS, &gpa)) {
>               printf("%s: cannot extract faulting pa\n", __func__);
>               return (EINVAL);
> @@ -5874,8 +5899,22 @@ vmx_handle_np_fault(struct vcpu *vcpu)
>       gpa_memtype = vmm_get_guest_memtype(vcpu->vc_parent, gpa);
>       switch (gpa_memtype) {
>       case VMM_MEM_TYPE_REGULAR:
> +             vee->vee_fault_type = VEE_FAULT_HANDLED;
>               ret = vmx_fault_page(vcpu, gpa);
>               break;
> +     case VMM_MEM_TYPE_MMIO:
> +             vee->vee_fault_type = VEE_FAULT_MMIO_ASSIST;
> +             if (vmread(VMCS_INSTRUCTION_LENGTH, &insn_len) ||
> +                 insn_len == 0 || insn_len > 15) {
> +                     printf("%s: failed to extract instruction length\n",
> +                         __func__);
> +                     ret = EINVAL;
> +             } else {
> +                     vee->vee_insn_len = (uint32_t)insn_len;
> +                     vee->vee_insn_info |= VEE_LEN_VALID;
> +                     ret = EAGAIN;
> +             }
> +             break;
>       default:
>               printf("unknown memory type %d for GPA 0x%llx\n",
>                   gpa_memtype, gpa);
> @@ -7321,7 +7360,19 @@ vcpu_run_svm(struct vcpu *vcpu, struct vm_run_params *
>                                   vcpu->vc_exit.vei.vei_data;
>                               vmcb->v_rax = vcpu->vc_gueststate.vg_rax;
>                       }
> +                     break;
> +             case SVM_VMEXIT_NPF:
> +                     ret = vcpu_writeregs_svm(vcpu, VM_RWREGS_GPRS,
> +                         &vcpu->vc_exit.vrs);
> +                     if (ret) {
> +                             printf("%s: vm %d vcpu %d failed to update "
> +                                 "registers\n", __func__,
> +                                 vcpu->vc_parent->vm_id, vcpu->vc_id);
> +                             return (EINVAL);
> +                     }
> +                     break;
>               }
> +             memset(&vcpu->vc_exit, 0, sizeof(vcpu->vc_exit));
>       }
>
>       while (ret == 0) {
> blob - b8db48f27142521bf69b65a2ffdade1ce6ec5cde
> blob + 7b964b9ff20ed228aa127b2cd672412a166196d7
> --- sys/arch/amd64/include/cpu.h
> +++ sys/arch/amd64/include/cpu.h
> @@ -82,6 +82,7 @@ struct svm {
>       uint32_t        svm_max_asid;
>       uint8_t         svm_flush_by_asid;
>       uint8_t         svm_vmcb_clean;
> +     uint8_t         svm_decode_assist;
>  };
>
>  union vmm_cpu_cap {
> blob - 34701ffb864757bba5441855225bedeaa4fd613c
> blob + ede5897ab172d5ea1a3d3a1f185f71640e7adadf
> --- sys/arch/amd64/include/specialreg.h
> +++ sys/arch/amd64/include/specialreg.h
> @@ -1273,6 +1273,7 @@
>  #define AMD_SVM_NESTED_PAGING_CAP    (1 << 0)
>  #define AMD_SVM_VMCB_CLEAN_CAP               (1 << 5)
>  #define AMD_SVM_FLUSH_BY_ASID_CAP    (1 << 6)
> +#define AMD_SVM_DECODE_ASSIST_CAP    (1 << 7)
>  #define AMD_SVMDIS                   0x10
>
>  #define SVM_TLB_CONTROL_FLUSH_NONE   0
> blob - 6fbb5e1bdc22666010e1c7484035d5d09cd7955b
> blob + 19a4bdf940c4737e727214f6133f621cb859f474
> --- sys/arch/amd64/include/vmmvar.h
> +++ sys/arch/amd64/include/vmmvar.h
> @@ -324,7 +324,9 @@ enum {
>  };
>
>  enum {
> +     VEE_FAULT_INVALID = 0,
>       VEE_FAULT_HANDLED,
> +     VEE_FAULT_MMIO_ASSIST,
>       VEE_FAULT_PROTECT,
>  };
>
> @@ -361,7 +363,12 @@ struct vm_exit_inout {
>   *  vm_exit_eptviolation     : describes an EPT VIOLATION exit
>   */
>  struct vm_exit_eptviolation {
> -     uint8_t         vee_fault_type;
> +     uint8_t         vee_fault_type;         /* type of vm exit */
> +     uint8_t         vee_insn_info;          /* bitfield */
> +#define VEE_LEN_VALID                0x1             /* vee_insn_len is valid */
> +#define VEE_BYTES_VALID              0x2             /* vee_insn_bytes is valid */
> +     uint8_t         vee_insn_len;           /* [VMX] instruction length */
> +     uint8_t         vee_insn_bytes[15];     /* [SVM] bytes at {R,E,}IP */
>  };
>
>  /*
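
Since vmd must not conflate the two cases (my earlier point about VMX
lengths vs SVM bytes), the consumer side could key off these bits; a
hedged sketch, with fetch_guest_insn() standing in for the hypothetical
userland PT walk + read discussed earlier:

#include <sys/types.h>
#include <stdint.h>
#include <string.h>
#include <machine/vmmvar.h>	/* struct vm_exit, VEE_* from this diff */

int fetch_guest_insn(uint64_t rip, uint8_t *buf, size_t len);

static int
mmio_get_insn(struct vm_exit *ve, uint8_t *insn, size_t *len)
{
	struct vm_exit_eptviolation *vee = &ve->vee;
	uint64_t rip = ve->vrs.vrs_gprs[VCPU_REGS_RIP];

	if (vee->vee_insn_info & VEE_BYTES_VALID) {
		/* SVM decode assist handed us the bytes directly. */
		*len = vee->vee_insn_len;
		memcpy(insn, vee->vee_insn_bytes, *len);
		return (0);
	}
	if (vee->vee_insn_info & VEE_LEN_VALID) {
		/* VMX only tells us how many bytes to fetch at rip. */
		*len = vee->vee_insn_len;
		return (fetch_guest_insn(rip, insn, *len));
	}
	/* No assist at all: fetch the architectural max, 15 bytes. */
	*len = 15;
	return (fetch_guest_insn(rip, insn, *len));
}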
> @@ -709,6 +716,7 @@ enum {
>
>  enum {
>       VMM_MEM_TYPE_REGULAR,
> +     VMM_MEM_TYPE_MMIO,
>       VMM_MEM_TYPE_UNKNOWN
>  };
>
> blob - eac6616ea918716c2986bab0a05724b31bcce2ec
> blob + 1291416d98656f33b1c92b2cd072d2a3b594dfa2
> --- usr.sbin/vmd/vm.c
> +++ usr.sbin/vmd/vm.c
> @@ -1652,26 +1652,36 @@ vcpu_exit_inout(struct vm_run_params *vrp)
>   *
>   * Return values:
>   *  0: no action required
> - *  EAGAIN: a protection fault occured, kill the vm.
> + *  EFAULT: a protection fault occured, kill the vm.
>   */
>  int
>  vcpu_exit_eptviolation(struct vm_run_params *vrp)
>  {
> +     int ret = 0;
> +     uint8_t fault_type;
>       struct vm_exit *ve = vrp->vrp_exit;
>
> -     /*
> -      * vmd may be exiting to vmd to handle a pending interrupt
> -      * but last exit type may have been VMX_EXIT_EPT_VIOLATION,
> -      * check the fault_type to ensure we really are processing
> -      * a VMX_EXIT_EPT_VIOLATION.
> -      */
> -     if (ve->vee.vee_fault_type == VEE_FAULT_PROTECT) {
> -             log_debug("%s: EPT Violation: rip=0x%llx",
> -                 __progname, vrp->vrp_exit->vrs.vrs_gprs[VCPU_REGS_RIP]);
> -             return (EAGAIN);
> +     fault_type = ve->vee.vee_fault_type;
> +     switch (fault_type) {
> +     case VEE_FAULT_HANDLED:
> +             log_debug("%s: fault already handled", __func__);
> +             break;
> +     case VEE_FAULT_MMIO_ASSIST:
> +             log_warnx("%s: mmio assist required: rip=0x%llx", __progname,
> +                 ve->vrs.vrs_gprs[VCPU_REGS_RIP]);

This may be spammy. Maybe log_debug? Although previously, the VM would have
terminated, so maybe it's not as bad as I think.

> +             ret = EFAULT;
> +             break;
> +     case VEE_FAULT_PROTECT:
> +             log_debug("%s: EPT Violation: rip=0x%llx", __progname,
> +                 ve->vrs.vrs_gprs[VCPU_REGS_RIP]);
> +             ret = EFAULT;
> +             break;
> +     default:
> +             fatalx("%s: invalid fault_type %d", __progname, fault_type);
> +             /* UNREACHED */
>       }
>
> -     return (0);
> +     return (ret);
>  }
>
>  /*
> @@ -1704,7 +1714,6 @@ vcpu_exit(struct vm_run_params *vrp)
>       case VMX_EXIT_CPUID:
>       case VMX_EXIT_EXTINT:
>       case SVM_VMEXIT_INTR:
> -     case SVM_VMEXIT_NPF:
>       case SVM_VMEXIT_MSR:
>       case SVM_VMEXIT_CPUID:
>               /*
> @@ -1715,11 +1724,11 @@ vcpu_exit(struct vm_run_params *vrp)
>                * in more vmd log spam).
>                */
>               break;
> +     case SVM_VMEXIT_NPF:
>       case VMX_EXIT_EPT_VIOLATION:
>               ret = vcpu_exit_eptviolation(vrp);
>               if (ret)
>                       return (ret);
> -
>               break;
>       case VMX_EXIT_IO:
>       case SVM_VMEXIT_IOIO:
