On 6/2/25 4:12 PM, Jan Beulich wrote:

On 02.06.2025 15:39, Manuel Andreas wrote:
I've discovered an issue in the nested VMX implementation, where an
unprivileged domain is able to force Xen to dereference a NULL pointer,
resulting in a panic.
Sadly you provide no details on this NULL deref.
Here's the respective dump:

----[ Xen-4.20.0  x86_64  debug=y Tainted:     H  ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d0402ae2b8>] nvmx_handle_vmx_insn+0x7ab/0xccb
(XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
(XEN) rax: 0000000000000000   rbx: 0000000000000000   rcx: 8000000000000002
(XEN) rdx: 0000000000000000   rsi: 01ffffffffffffff   rdi: ffff82e0020155e0
(XEN) rbp: ffff830179407e68   rsp: ffff830179407e00   r8: ffff82c00023b000
(XEN) r9:  ffff830179413c40   r10: 0000000000000000   r11: 0000000000000200
(XEN) r12: ffff83010483d000  r13: ffff830179407ef8   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 0000000080050033   cr4: 00000000003526e0
(XEN) cr3: 000000010498f000   cr2: 0000000000000000
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d0402ae2b8> (nvmx_handle_vmx_insn+0x7ab/0xccb):
(XEN)  75 b0 0f 86 12 05 00 00 <81> 08 00 00 00 80 41 8b 84 24 f4 05 00 00 80 cc
(XEN) Xen stack trace from rsp=ffff830179407e00:
(XEN)    ffff83010483d000 000000000011e000 0000000000000000 0000000000000000
(XEN)    ffff82d0402bfc4a 0000000100000000 0000000000119fa8 ffff82d000000008
(XEN)    ffff830100000006 ffff830179407ef8 0000000000000015 ffff83010483d000
(XEN)    0000000000000000 ffff830179407ee8 ffff82d0402a9a19 ffff82d04020361b
(XEN)    0000000000000000 0000000000000000 0000000000000000 ffff830100997000
(XEN)    ffff82d040203615 ffff82d04020361b ffff82d040203615 ffff82d04020361b
(XEN)    ffff83010483d000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 00007cfe86bf80e7 ffff82d040203673 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000001
(XEN)    0000000000000000 0000000000000000 0000000000000043 000000000000007b
(XEN)    0000000000000043 0000000000000000 0000000000000000 0000000011e57e00
(XEN)    0000000000000000 0000000000000000 0000beef0000beef 0000000000103fa2
(XEN)    000000bf0000beef 0000000000000046 0000000000119fa0 000000000000beef
(XEN)    000000000000beef 000000000000beef 000000000000beef 000000000000beef
(XEN)    0000e01000000001 ffff83010483d000 0000003136627000 00000000003526e0
(XEN)    0000000000000000 0000000000000000 0000000300000001 0000004e00000003
(XEN) Xen call trace:
(XEN)    [<ffff82d0402ae2b8>] R nvmx_handle_vmx_insn+0x7ab/0xccb
(XEN)    [<ffff82d0402a9a19>] F vmx_vmexit_handler+0xd97/0x1e14
(XEN)    [<ffff82d040203673>] F vmx_asm_vmexit_handler+0x103/0x220
(XEN)
(XEN) Pagetable walk from 0000000000000000:

(XEN)  L4[0x000] = 0000000000000000 ffffffffffffffff

Where nvmx_handle_vmx_insn+0x7ab/0xccb resolves to xen/arch/x86/hvm/vmx/vvmx.c:1169
Specifically, in nvmx_handle_vmptrld we have:

1830    if ( cpu_has_vmx_vmcs_shadowing )
1831        nvmx_set_vmcs_pointer(v, nvcpu->nv_vvmcx);

Which is reachable with nvcpu->nv_vvmcx untouched in the case of the p2m mapping for gpa not being writable (see lines 1793 - 1827) . Therefore, nvcpu->nv_vvmcx will remain NULL (as initialized in nvmx_vcpu_initialize).
Further, inside nvmx_set_vmcs_pointer:

1164    static void nvmx_set_vmcs_pointer(struct vcpu *v, struct vmcs_struct *vvmcs)
1165    {
1166        paddr_t vvmcs_maddr = v->arch.hvm.vmx.vmcs_shadow_maddr;
1167
1168        __vmpclear(vvmcs_maddr);
1169        vvmcs->vmcs_revision_id |= VMCS_RID_TYPE_MASK;
1170        /* ... */

will then dereference vvmcs, which is NULL.

This is possible when:

  1. The malicious domain has nested HVM capabilities.
  2. The CPU is running on top of VMX and supports shadow VMCS.

To trigger the bug, the domain must first enable VMX operation for
itself, execute VMXON and then finally execute VMPTRLD on a guest
physical address that is backed by a non-writable p2m mapping.
In `nvmx_handle_vmptrld`, after attempting to map the nested VMCS, Xen
will check whether or not this mapping is suitable for writing and if
not immediately unmap the nested VMCS again and abort the setup of
`nvcpu->nv_vvmcx`. However, Xen at this point erroneously continues
emulation of the VMPTRLD. In particular, if VMCS shadowing is available,
Xen will nonetheless attempt to link up the nested VMCS to its own VMCS
in `nvmx_set_vmcs_pointer`. Importantly, Xen here attempts to
dereference the presumably mapped nested VMCS (which now is merely a
NULL pointer) in order to mark it as a shadow VMCS by applying the
`VMCS_RID_TYPE_MASK` to its revision identifier. Following, the page
fault handler will panic Xen.

I've attached an XTF reproducer that triggers the bug. To setup such a
non-writable p2m mapping for the malicious VMCS, I first setup an
appropriate grant table entry. I've tested it on Xen version 4.20.0.
I expect this to not work anymore on current staging or 4.20.1-pre.
See a8325f981ce4 ("x86/P2M: synchronize fast and slow paths of
p2m_get_page_from_gfn()").
On first glance I don't see how that would impact the type of the established p2m mapping. If you want I can compile staging and test if the PoC does not work there anymore.

To fix the issue I believe the following patch should be suitable:

--- a/xen/arch/x86/hvm/vmx/vvmx.c
+++ b/xen/arch/x86/hvm/vmx/vvmx.c
@@ -1817,7 +1817,9 @@ static int nvmx_handle_vmptrld(struct
cpu_user_regs *regs)
               else
               {
                   hvm_unmap_guest_frame(vvmcx, 1);
-                vvmcx = NULL;
+                vmfail(regs, VMX_INSN_VMPTRLD_INVALID_PHYADDR);
+
+                return X86EMUL_OKAY;
               }
           }
           else

The VMX error AFAICT does not strictly adhere to the Intel SDM, but
providing the guest some indication on what went wrong is likely more
sensible than silently failing.
Giving the guest some indication is certainly right. If we want to follow
the above route, I think the change would want doing a little differently,
to take the path that presently is the "else" at the bottom of the hunk
above. However, I can't presently see how invoking vmfail() would make a
difference as to the subsequent NULL deref: The guest could continue the
same irrespective of the failure. Hence why I'd like to understand what
NULL deref you did observe. (We may hence need two patches - one along
the above lines, and another one dealing with the NULL issue.)

The issue is really just in the latter part of nvmx_handle_vmptrld, which attempts to initialize its shadow VMCS even though establishing a mapping for the nested VMCS failed. An early exit from that function (as my patch suggests) should be sufficient for that case. Other parts of the nested VMX code already check for a valid nested VMCS via vvmcx_valid. There simply isn't such a check here presumably because the assumption was that any previous errors in mapping the nested VMCS were already handled by returning from the function or jumping to the out label.

Best,
Manuel

Reply via email to