[Bug 268744] Running the wifibox (bhyve VM) crashes system with vmm lock error

2023-01-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268744

   Bug ID: 268744
  Summary: Running the wifibox (bhyve VM) crashes system with vmm lock error
  Product: Base System
  Version: CURRENT
 Hardware: Any
       OS: Any
   Status: New
 Severity: Affects Only Me
 Priority: ---
Component: bhyve
 Assignee: virtualization@FreeBSD.org
 Reporter: mmata...@gmail.com

The error is:

panic: Lock (sx) vm mem_segs not locked @ /usr/src/sys/amd64/vmm/vmm.c:1188

I'm on commit a1f28ec729f7491da8607e8eeaee1b0f547c60d0.

I've included a picture of the error.
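The panic line appears to be a kernel lock assertion firing. Code at vmm.c:1188 expects its caller to hold the `vm mem_segs` sx lock, and the caller does not hold it.

The Python toy below shows the general idea of an ownership check: a forgotten lock acquisition fails immediately, and the failure points at the code that needed the lock. It is hypothetical and makes several simplifications:
- `AssertableLock`, `lookup_mem_seg()` and the two callers are invented for illustration.
- It models only an exclusive holder, unlike a real shared/exclusive sx(9) lock.
- It is not how FreeBSD implements the check.

```python
import threading


class AssertableLock:
    """Toy lock that remembers its owner so callees can assert it is held.

    Hypothetical illustration only; this is not sx(9) and tracks only an
    exclusive owner.
    """

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._owner = None  # thread ident of the current holder, if any

    def acquire(self):
        self._lock.acquire()
        self._owner = threading.get_ident()

    def release(self):
        self._owner = None
        self._lock.release()

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, *exc):
        self.release()

    def assert_locked(self, where):
        # Fail loudly if the calling thread does not hold the lock,
        # using a message shaped like the kernel's panic string.
        if self._owner != threading.get_ident():
            raise AssertionError(f"Lock {self.name} not locked @ {where}")


mem_segs = AssertableLock("vm mem_segs")


def lookup_mem_seg(seg_id):
    # Hypothetical callee that requires the caller to hold mem_segs.
    mem_segs.assert_locked("lookup_mem_seg")
    return {"id": seg_id}


def good_caller():
    with mem_segs:
        return lookup_mem_seg(0)


def bad_caller():
    # Forgets to take the lock: the kind of bug the panic reports.
    return lookup_mem_seg(0)
```

Calling `bad_caller()` raises `AssertionError: Lock vm mem_segs not locked @ lookup_mem_seg`, while `good_caller()` returns normally.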

-- 
You are receiving this mail because:
You are the assignee for the bug.


[Bug 268744] Running the wifibox (bhyve VM) crashes system with vmm lock error

2023-01-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=268744

--- Comment #1 from bj...@baerlin.eu ---
This seems to be an issue with pci-passthru: a VM with pci-passthru crashes the
host, while the same VM without pci-passthru works.



Re: Windows 11 22H2 with passed-through PCI devices hangs in vm_handle_rendezvous() at boot

2023-01-04 Thread Robert Crowston
So it looks like the problem is:

- in one thread we call vcpu_set_state_locked() [from a VM_MAP_PPTDEV_MMIO call 
from userspace]
-- both the new and old states are VCPU_FROZEN
-- the thread enters a loop while vcpu->state != VCPU_IDLE
-- it gets stuck here forever since nothing will ever change the state to 
VCPU_IDLE
-- apparently this is to stop two ioctl()s acting on the same vCPU 
simultaneously, but I don't see any other ioctl against the vCPU in kgdb.

- in all the other threads, we sit in vm_handle_rendezvous()
-- these threads are waiting for the rendezvous to complete
-- every vCPU has completed the rendezvous except for the one stuck in 
vcpu_set_state_locked()
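One way to read this analysis is as a two-sided wait:
- The thread in vcpu_set_state_locked() waits for another vCPU to reach VCPU_IDLE.
- That vCPU is still FROZEN because its thread is inside vm_run(), sitting in vm_handle_rendezvous().
- The rendezvous in turn waits for the first vCPU to arrive.

Below is a simplified Python model of that shape. It is not vmm code:
- `ToyVM`, `rendezvous()`, `wait_until_idle()` and the two-state model are hypothetical.
- Timeouts stand in for the kernel's indefinite sleeps so that the demo returns.

```python
import threading

IDLE, FROZEN = "IDLE", "FROZEN"


class ToyVM:
    """Simplified model of the reported hang; names are hypothetical."""

    def __init__(self, ncpus):
        self.cv = threading.Condition()
        self.state = [FROZEN] * ncpus  # every vCPU starts inside "vm_run"
        self.arrived = 0               # vCPUs that reached the rendezvous
        self.ncpus = ncpus

    def rendezvous(self, timeout):
        # Like vm_handle_rendezvous(): wait until every vCPU has arrived.
        with self.cv:
            self.arrived += 1
            self.cv.notify_all()
            return self.cv.wait_for(lambda: self.arrived == self.ncpus, timeout)

    def wait_until_idle(self, cpu, timeout):
        # Like the loop in vcpu_set_state_locked(): wait for state == IDLE.
        with self.cv:
            return self.cv.wait_for(lambda: self.state[cpu] == IDLE, timeout)


def run(ncpus, timeout=0.5):
    vm = ToyVM(ncpus)
    results = {}

    def vcpu0_ioctl():
        # vCPU 0's thread has left vm_run to issue VM_MAP_PPTDEV_MMIO from
        # userspace and, in this model, needs every other vCPU IDLE first.
        # It therefore never joins the rendezvous.
        ok = all(vm.wait_until_idle(c, timeout) for c in range(1, ncpus))
        results[0] = "ioctl done" if ok else "stuck waiting for IDLE"

    def other_vcpu(cpu):
        ok = vm.rendezvous(timeout)
        results[cpu] = "rendezvous done" if ok else "stuck in rendezvous"

    threads = [threading.Thread(target=vcpu0_ioctl)]
    threads += [threading.Thread(target=other_vcpu, args=(c,))
                for c in range(1, ncpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With two vCPUs, `run(2)` reports vCPU 0 stuck waiting for IDLE and vCPU 1 stuck in the rendezvous. With a single vCPU there is no other vCPU to wait for, so the ioctl completes. This is at least consistent with the report that a 1-vCPU guest boots.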

I see a lot of commits in -CURRENT since my cut of -STABLE, but nothing that 
looks too relevant. I'll try against CURRENT next.

    — RHC.


--- Original Message ---
On Tuesday, January 3rd, 2023 at 23:54, Robert Crowston wrote:


> Still investigating this. AMD 1700, FreeBSD 13.1 stable@3dd6497894. VM is 
> Windows 11 22H2.
> 
> It happens on the setup disk -- at the TianoCore logo, before the "ring" has 
> finished its first rotation -- so very early in the boot process. It's 
> eventually happened for every Win 11 install I have made. If I remove the
> passthrough devices, install Windows, and then re-add the devices, the fresh
> install boots with the passthrough devices a few times but then shows the
> same hang forever after. Windows Boot Repair also hangs.
> On the host, bhyvectl --destroy hangs. gdb cannot stop bhyve and just hangs 
> as well. None of these hangs show any CPU use. kldunload vmm crashes the host 
> with a page fault. Only a reboot of the host will kill the guest.
> 
> Setting the guest cpu count to 1, or removing all the passthrough devices 
> allows Windows 11 to boot. The same behaviour happens for two different USB 
> controllers I have and two different GPUs. The same bhyve configurations 
> reliably boot Windows Server 2022 and Windows 10 with passthrough working.
> 
> Debugging in userspace, I can see that Windows 11 does PCI enumeration in 
> parallel across multiple cores, and sometimes during boot one vCPU writes a 
> PCI config register at approximately the same time as another vCPU reads that 
> exact register. The hang seems to be aligned with this synchronized 
> write/read. Also, I can sometimes boot successfully under gdb when single 
> stepping PCI cfg register writes, but it's difficult to be sure because my 
> debugging is probably disturbing the timing. I looked at the bhyve code and I
> don't see anything there that could race in user space. In any event, it's a
> kernel-side bug.
> 
> Spinning up the kernel debugger, what I always see is:
> 1. One bhyve thread in vioapic_mmio_write() -> ... -> vm_handle_rendezvous()
>    -> _sleep()
>
> 2. One bhyve thread in vcpu_lock_one() -> ... -> vcpu_set_state_locked()
>    -> msleep_spin_sbt()
>
> 3. All remaining bhyve threads, if any, in vm_run() -> vm_handle_rendezvous()
>    -> _sleep().
> 
> 
> Example backtrace attached.
> 
> So it looks like we have some kind of deadlock between vcpu_lock_one() and
> vioapic_mmio_write()? Has anyone seen anything like it?
> 
>     — RHC.
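The kgdb snapshot can also be checked mechanically as a wait-for graph. Each thread points at the threads it is blocked on, and a deadlock shows up as a cycle. The Python sketch below runs a plain depth-first search for a cycle.

The edges in `snapshot` are an assumption drawn from the analysis in this thread:
- the vcpu_lock_one() thread waits for the vioapic_mmio_write() thread's vCPU to go idle;
- the rendezvous threads wait for the vCPU that has not arrived.

The thread labels are hypothetical.

```python
def find_cycle(waits_for):
    """Return one cycle in a wait-for graph as a list of nodes, or None.

    waits_for maps each thread to the threads it is blocked on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {n: WHITE for n in waits_for}
    stack = []

    def dfs(node):
        color[node] = GREY
        stack.append(node)
        for nxt in waits_for.get(node, ()):
            if color.get(nxt, WHITE) == GREY:
                # Back edge: the path from nxt to node closes a cycle.
                return stack[stack.index(nxt):] + [nxt]
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(waits_for):
        if color[node] == WHITE:
            found = dfs(node)
            if found:
                return found
    return None


# Hypothetical encoding of the three thread groups seen in kgdb.
snapshot = {
    "vcpu0 (vcpu_lock_one)": ["vcpu1 (vioapic_mmio_write)"],
    "vcpu1 (vioapic_mmio_write)": ["vcpu0 (vcpu_lock_one)"],
    "vcpu2 (vm_run)": ["vcpu0 (vcpu_lock_one)"],
}
```

Here `find_cycle(snapshot)` returns the two-thread loop between vcpu_lock_one() and vioapic_mmio_write(). The vm_run() thread is blocked but sits outside the cycle, waiting on it.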