> the outcome quoted below is awesome

Thanks @Andreas! :)

> Some parameter was considered a lower limit before, and now it's no
> longer a lower limit.

Correct. Without my patch, if a user sets opt/ovmf/X-PciMmio64Mb, that
value is only used if it is *higher* than the value that
PlatformDynamicMmioWindow computes - so by specifying it, you tell OVMF
that the MMIO window size should be "No lower than
opt/ovmf/X-PciMmio64Mb". With my patch, by specifying
opt/ovmf/X-PciMmio64Mb, you tell OVMF that the MMIO window size should
be "exactly opt/ovmf/X-PciMmio64Mb".
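In rough pseudocode terms (an illustrative sketch only, with made-up variable names -- this is not the actual edk2 implementation), the change is:

```shell
#!/bin/sh
# Illustrative sketch only -- not the real PlatformDynamicMmioWindow code.
user_mb=1024        # hypothetical opt/ovmf/X-PciMmio64Mb setting
computed_mb=131072  # hypothetical size computed by PlatformDynamicMmioWindow

# Before the patch: the user's value only takes effect if it is *higher*
# than the computed size, i.e. it acts as a lower limit.
if [ "$user_mb" -gt "$computed_mb" ]; then
    old_window=$user_mb
else
    old_window=$computed_mb
fi

# After the patch: a user-specified value is used unconditionally.
new_window=$user_mb

echo "old=${old_window}MB new=${new_window}MB"
```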

> Is it documented and known that prior to this change
> opt/ovmf/X-PciMmio64Mb is a lower limit?

Not really. This was something that I only learned as part of my
discussion with edk2 upstream, when the maintainer clarified it for me
[0]. AFAIK this isn't outlined anywhere in the edk2 docs explicitly.

> Would users be surprised (negatively) that now it's no longer a limit
> of any kind?

I doubt it. If any users are specifying a lower value for 
opt/ovmf/X-PciMmio64Mb than what PlatformDynamicMmioWindow calculates, their 
opt/ovmf/X-PciMmio64Mb value is simply being ignored entirely today. I suppose 
there could be some configuration out there where someone specified a lower 
value which was ignored, then never removed it when it had no effect - and that 
user could be impacted after this update if the value is lower than what their 
GPU requires, but aside from "I forgot to remove this dead setting", there should 
not be any actual users of such a construct on Noble.
Really, the common use case for opt/ovmf/X-PciMmio64Mb up until now has been to 
make the window size *larger* than what PDMW computes - and that behavior is 
not being impacted by this patch.

> I can now set "the aperture" to a value lower than opt/ovmf/X-PciMmio64Mb

Not quite - this patch enables you to set opt/ovmf/X-PciMmio64Mb to a
value lower than *the MMIO window size computed by
PlatformDynamicMmioWindow*, and actually have the opt/ovmf/X-PciMmio64Mb
value be honored. (In other words, opt/ovmf/X-PciMmio64Mb now *is* the
aperture size, rather than just the minimum allowed aperture size that
may be overridden by PDMW.)

Without my patch, an aperture size setting in opt/ovmf/X-PciMmio64Mb
which is lower than the aperture size calculation from PDMW is simply
ignored.

> From what I understood of the patch description, this makes the "long
> PCI initialization process be skipped". what are the consequences of
> that skip?

The long PCI init process is only skipped if the value you specify for
opt/ovmf/X-PciMmio64Mb is lower than the GPU's actual MMIO window size.
(so, for example, if you have a 128GB VRAM GPU, and you set
opt/ovmf/X-PciMmio64Mb to 1GB, that GPU's PCI init steps are skipped.)
However, if your GPU only has 32GB VRAM, and you set
opt/ovmf/X-PciMmio64Mb to 64GB, there should be no change in behavior
(thus no init skip).
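Plugging in the numbers from those two examples (sizes in MB; the helper name here is mine, purely for illustration, not anything in edk2):

```shell
#!/bin/sh
# Hypothetical helper: the init skip only triggers when the configured
# window is smaller than what the GPU's BARs actually need.
# Arguments: configured opt/ovmf/X-PciMmio64Mb (MB), GPU VRAM (MB).
init_skipped() {
    if [ "$1" -lt "$2" ]; then echo skip; else echo no-skip; fi
}

init_skipped 1024 131072   # 1GB setting, 128GB GPU:  skip
init_skipped 65536 32768   # 64GB setting, 32GB GPU:  no-skip
```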

On its own, the skip means that the MMIO regions for the impacted
devices are not usable (since they were not able to be mapped.) The
reason we need pci=nocrs is to tell the kernel to disregard the mappings
reported by OVMF, and pci=realloc is to have the kernel compute its own
mappings. (As I understand).

> isn't there a less intrusive way to have that skip happen, instead of
> setting a lower aperture size? Sounds like we are just taking advantage
> of a side effect of setting the aperture to a value lower than the
> previous limit. Why not, let's say, a new option called
> "pci_initialization_skip=1"? Because that's a kernel change?

Theoretically, a kernel change could enable this, but I would be hesitant to 
pursue it for a few reasons:
1) The method I'm proposing enables Noble users to opt into behavior that is 
functionally equivalent to how things work in Jammy OVMF today, and 
from the perspective of the Nvidia DGX team, we already have years of Nvidia 
testing coverage on Jammy that show that `pci=realloc pci=nocrs` with a lower 
aperture size than the DGX GPUs support is sufficient for their customers and 
doesn't cause any known issues. A novel kernel patch would not benefit from 
this history, and I doubt it would be less intrusive from a code PoV than my 
proposed solution.
2) IMO, even outside of its usefulness for our workaround, the behavior of 
opt/ovmf/X-PciMmio64Mb should have always honored any specified value; the fact 
that it doesn't is arguably a bug in itself which is not documented clearly 
anywhere.
3) Your proposed kernel approach would impact *any* PCI initialization, as I 
understand from what you described, which I believe could have a larger 'blast 
radius' when set than my OVMF approach, which would only impact devices with 
BARs larger than the memory size the user explicitly sets.

More generally, my concern is that this kind of kernel approach would
end up being significant additional development and testing effort to
get to a point that is ultimately no less of a 'workaround' than this
OVMF approach. (In an ideal world, we would be able to actually backport
the kernel series that fixes the underlying issue "for real" without any
workarounds [3], but I have discussed this with the kernel team already,
and they do not think it would be a reasonable SRU request for 6.8 since
it would also require backporting huge pfnmap support as a prerequisite,
which contains substantial ABI changes.)

> In other words, isn't this change just kicking the can down the road,
> and making something else slow now, something that used to be fast (like
> post-boot operations)

We have not observed any post-boot slowness or instability with this
workaround. The upstream PCI maintainer and I think [4] this is because,
in scenarios where the GPU is being configured post-boot, decode is
already disabled by the time the kernel reaches the decode-disable step
in drivers/pci/probe.c [1], which allows the slow parts (the
pci_write_config_word(dev, PCI_COMMAND, orig_cmd &
~PCI_COMMAND_DECODE_ENABLE) call and its corresponding re-enable later
in the function) to be skipped entirely.

> kernel command-line: "pci=realloc pci=nocrs".
> c1) Are the mentioned options default?

No, and they probably shouldn't be defaults, since they really would
only be needed within guest VM kernels in this very specific type of
configuration.

> c2) Are users expected to be using those options with this kind of
> hardware?

Setting `pci=nocrs pci=realloc` is the approach we have been
recommending to Nvidia and their customers on Jammy for a few years [2],
since Jammy requires non-default options to get GPU passthrough working
at all. (On Jammy, users can either set `pci=nocrs pci=realloc` in the
guest, *OR* set opt/ovmf/X-PciMmio64Mb to a value large enough for their
GPUs - but the latter is not the recommended approach since it produces
the same slow boot as we are trying to avoid with this workaround.)

> c3.2) If the host is updated with this new edk2, running guests would
> only be impacted if they reboot, I presume?

Correct.

> d) From what I understood, impacted users would still have to make
> changes to their systems in order to take advantage of this update,
> right? Is that change "obvious"? Which value should they set their
> aperture size to? How can they find that out?

Correct. Impacted users will need to do the following to see any change in 
behavior:
1. Install the updated OVMF on their host
2. Set `pci=realloc pci=nocrs` in their guest kernel, then `sudo update-grub`
3. Set opt/ovmf/X-PciMmio64Mb in their per-guest QEMU or libvirt configuration 
to an arbitrary value *lower* than the amount of VRAM their passed-through 
GPUs have, then restart the guest VM
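Concretely, steps 2 and 3 might look like this (a sketch: the "1024" value is an arbitrary number of megabytes below the GPUs' VRAM, the `...` elides the rest of the QEMU invocation, and the domain XML fragment assumes the libvirt QEMU namespace is declared):

```shell
# Step 2, inside the guest -- append the parameters in /etc/default/grub:
#   GRUB_CMDLINE_LINUX_DEFAULT="... pci=realloc pci=nocrs"
sudo update-grub

# Step 3, on the host -- with raw QEMU, pass the fw_cfg knob directly:
#   qemu-system-x86_64 ... -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=1024
#
# Or, with libvirt, in the domain XML:
#   <qemu:commandline>
#     <qemu:arg value='-fw_cfg'/>
#     <qemu:arg value='name=opt/ovmf/X-PciMmio64Mb,string=1024'/>
#   </qemu:commandline>
```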

Historically, it seems like Nvidia has not had any issues deploying
changes of this nature internally, and they have not expressed any
concerns about getting their customers on board with this approach since
I proposed it to them. (It'd be similar to what they had to do to get
Jammy passthrough working on these platforms initially, from their POV.)

Let me know if you have additional questions.

[0]: https://edk2.groups.io/g/devel/message/120831
[1]: https://web.git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/drivers/pci/probe.c?h=v6.12.1#n189
[2]: https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1849563/comments/10
[3]: https://lore.kernel.org/all/20250205231728.2527186-1-alex.william...@redhat.com/
[4]: https://lore.kernel.org/all/20241203163045.3e068562.alex.william...@redhat.com/

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2101903

Title:
  Backport "OvmfPkg: Use user-specified opt/ovmf/X-PciMmio64Mb value
  unconditionally" to Noble

To manage notifications about this bug go to:
https://bugs.launchpad.net/edk2/+bug/2101903/+subscriptions

