Hello, Mitchell.

> Thanks for the suggestion. I'm not necessarily saying this patch itself has 
> an issue, just that it is the point in the git history at which this slow 
> boot time issue manifests for us. This may be because the patch does actually 
> fix the other issue I described above related to BAR assignment not working 
> correctly in versions before that patch, despite boot being faster back then. 
> (in those earlier versions, the PCI devices for the GPUs were passed through, 
> but the BAR assignment was erroneous, so we couldn't actually use them - the 
> Nvidia GPU driver would just throw errors.)

tl;dr For GPU instances, the VM needs a very large MMIO aperture to map the GPU 
BARs. If the MMIO window is too small, OVMF rejects some PCI devices during the 
initialisation phase and their BARs are never assigned. To work around this 
there is the opt/ovmf/X-PciMmio64Mb option, which increases the MMIO window 
size. This patch adds functionality that sizes the window automatically based 
on the guest's physical address bits. As a starting point, I would run an old 
build of OVMF and grep its debug log for 'rejected' to make sure that no GPUs 
were dropped while OVMF was running.
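For example, something like this (untested; the log path and the 64 GiB value 
are only examples, and the debug console only produces output with a DEBUG 
build of OVMF):

```
# Capture the OVMF debug log (a DEBUG build writes to I/O port 0x402) and
# look for devices whose BARs could not be programmed.
qemu-system-x86_64 ... \
    -debugcon file:/tmp/ovmf.log -global isa-debugcon.iobase=0x402
grep -i 'rejected' /tmp/ovmf.log

# If GPUs were rejected, force a larger 64-bit MMIO window (64 GiB here)
# via the opt/ovmf/X-PciMmio64Mb fw_cfg knob; the value is in MB.
qemu-system-x86_64 ... \
    -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
```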

> After I initially posted here, we also discovered another kernel issue that 
> was contributing to the boot times for this config exceeding 5 minutes - so 
> with that isolated, I can say that my config only takes about 5 minutes for 
> a full boot: 1-2 minutes for `virsh start` (which scales with guest memory 
> allocation), and about 2-3 minutes of time spent on PCIe initialization / BAR 
> assignment for 2 to 4 GPUs (attached). This was still the case when I tried 
> with my GPUs attached in the way you suggested. I'll attach the xml config 
> for that and for my original VM in case I may have configured something 
> incorrectly there.
> With that said, I have a more basic question - do you expect that it should 
> take upwards of 30 seconds after `virsh start` completes before I see any 
> output in `virsh console`, or that PCI devices' memory window assignments in 
> the VM should take 45-90 seconds per passed-through GPU? (given that when the 
> same kernel on the host initializes these devices, it doesn't take nearly 
> this long?)

I'm not sure I can help with that, as we don't use virsh. But the Linux kernel 
also takes a long time to initialise NVIDIA GPUs under SeaBIOS. Another way to 
check the boot time is to hot-plug the cards after booting; I don't know how 
this works in virsh. I made an expect script to emulate hot-plug:

```
#!/bin/bash
CWD="$(dirname "$(realpath "$0")")"
/usr/bin/expect <<EOF
spawn $CWD/qmp-shell $CWD/qmp.sock
# wait for the qmp-shell prompt before sending each command
expect "(QEMU)"
send -- "query-pci\r"
expect "(QEMU)"
send -- "device_add driver=pci-gpu-testdev bus=s30 regions=mpx2M vendorid=5555 deviceid=4126\r"
# keep the session attached so the command output stays visible
interact
EOF
```
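In case it helps, a rough and untested sketch of how the same hot-plug might be 
done with virsh attach-device; the domain name "gpu-vm" and the host PCI 
address 0000:41:00.0 are placeholders for your setup:

```
#!/bin/bash
# Untested sketch: hot-plug a passed-through host GPU into a running
# libvirt guest. Adjust the domain name and PCI address to your system.
cat > /tmp/gpu-hostdev.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x41' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF
virsh attach-device gpu-vm /tmp/gpu-hostdev.xml --live
```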

> I'm going to attempt to profile ovmf next to see what part of the code path 
> is taking up the most time, but if you already have an idea of what that 
> might be (and whether it is actually a bug or expected to take that long), 
> that insight would be appreciated.

We have just started migrating from SeaBIOS to UEFI/SecureBoot, so I only know 
the parts of the OVMF code that are used for enumeration/initialisation of PCI 
devices. I'm not a core developer of edk2, just someone solving the same 
problems with starting VMs with GPUs.

