On Mon, Oct 14, 2024 at 9:08 PM Oliver Steffen <ostef...@redhat.com> wrote:
> Since the PixieFail CVE fixes, a strong random number generator is
> required to use network functionality, such as booting via PXE or
> On modern x86_64 CPUs this is not a problem because these support the
> RDRAND instruction.
> On older models one needs to add a virtio-rng device otherwise network
> initialization fails.
> We now observe a very strange problem [1]:
> Network initialization still fails when adding a virtio-rng to a VM
> with an old CPU, under certain hardware configurations.
> For example in combination with COM1 and COM2 isa-serial port, while
> it works if only one of them is there (it doesn't matter which one, as
> long as they are not both configured in QEMU).
> Steps to reproduce the issue:
> Use a recent edk2 master branch, for example 596773f5e33e. We used
> qemu-8.2.7-1.fc40.
> Build OVMF for X64 like this:
> build -t GCC5 -b DEBUG -a X64 \
>   -p OvmfPkg/OvmfPkgX64.dsc \
> Run QEMU with a CPU that does not feature RDRAND:
> qemu-system-x86_64 \
>   -machine q35,accel=kvm -m 1G -display none -nodefaults \
>   -drive file=OVMF_CODE.fd,if=pflash,format=raw,unit=0,readonly=on \
>   -drive file=OVMF_VARS.fd,if=pflash,format=raw,unit=1,readonly=on \
>   -chardev file,id=fw,path="firmware.log" \
>   -device isa-debugcon,iobase=0x402,chardev=fw \
>   -drive file=UefiShell.iso,format=raw,if=none,media=cdrom,id=drive-cd1,readonly=on \
>   -device ide-cd,drive=drive-cd1,id=cd1,bootindex=1 \
>   -netdev user,id=net0 -device virtio-net-pci,netdev=net0,bootindex=2 \
>   -device virtio-rng-pci \
>   -serial stdio \
>   -serial null \
>   -cpu core2duo
> The attached CD-ROM image [2] contains an EFI Shell executable that is booted.
> From the shell one can investigate the available boot options:
> # bcfg boot dump
> Expectation: PXE and HTTP options are listed.
> Observation: No network boot options present.
> Changing the CPU model on the QEMU command line to “max” makes PXE and
> HTTP options available. We suspected that a virtio-rng-pci is not
> working and network support is unavailable due to the lack of an RNG.
> But the same can be achieved by removing the second serial port
> (“-serial null”) while keeping the CPU model. We can’t explain this at
> all.
> While network boot can be achieved by changing other parts of the
> command line too (modifying bootindex, for example) it is very strange
> that simply the serial port configuration influences network boot.
> Bisection:
> Doing a bisection, the commit that introduces this problem is
> 4c4ceb2ceb ("NetworkPkg: SECURITY PATCH CVE-2023-45237").
> The problem seems to be pre-existing, but as of this commit, DxeNetLib
> has a new Depex with gEfiRngProtocolGuid
> (3152BCA5-EADE-433D-862E-C01CDC291F44) since it is now a consumer.
> Producers can be VirtioRng (when the device is present) and RngDxe
> (when the CPU supports instructions such as RDRAND). Removing
> the Depex, just for confirmation, solves the problem, but of course
> DxeNetLib fails on an assert where it expects to find random
> generators.
> Observing the logs [3,4] with DEBUG_DISPATCH enabled and adding some
> printing in VirtioRng, we noticed that in both cases (PXE working or
> not), VirtioRng is started at the same point in the log (see line
> 22240 in both attached logs), but with both COM1 and COM2 we no longer
> see any dispatcher messages after VirtioRng has started, while we do
> see them when only one of them is present. Only this last dispatcher
> pass loads the network modules, because that is when the dependency
> on gEfiRngProtocolGuid evaluates to true.

Going in this direction, I found a hack that solves the problem, but
it's obviously not the right solution (sorry, I have little experience
in edk2).

By analyzing the calls to the dispatcher (`gDS->Dispatch ()`) I found
that when we only have COM1, EfiBootManagerConnectDevicePath() at some
point invokes `gDS->Dispatch ()` after VirtioRng has started. This call
will then get DxeNetLib loaded.

With both COM1 and COM2, on the other hand, I don't see this call. Maybe
`RemainingDevicePath` is empty in this case because EDK2 was able to
initialize both ports, but this is just a guess.

So the hack is the following, where I force the call to the dispatcher
on every call of EfiBootManagerConnectDevicePath():

diff --git a/MdeModulePkg/Library/UefiBootManagerLib/BmConnect.c 
index d1fb0f72ba..621f90d297 100644
--- a/MdeModulePkg/Library/UefiBootManagerLib/BmConnect.c
+++ b/MdeModulePkg/Library/UefiBootManagerLib/BmConnect.c
@@ -121,6 +121,8 @@ EfiBootManagerConnectDevicePath (
   CurrentTpl = EfiGetCurrentTpl ();
+  Status = gDS->Dispatch ();
+  DEBUG ((DEBUG_INFO, "%a extra gDS->Dispatch () - Status: %r\n", __func__, 
   // Start the real work of connect with RemainingDevicePath

I am still trying to understand how the dispatcher works. I think the
problem is related to the dispatcher and an unresolved dependency, but
my knowledge is limited. Any suggestions are more than welcome.


> Any help is very much appreciated!
> Regards,
>    Stefano and Oliver
> [1] https://issues.redhat.com/browse/RHEL-58631
> [2] https://osteffen.fedorapeople.org/OvmfNetbootRngIssue/UefiShell.iso
> [3] https://osteffen.fedorapeople.org/OvmfNetbootRngIssue/edk2_PXE_issue_COM1_COM2.log
> [4] https://osteffen.fedorapeople.org/OvmfNetbootRngIssue/edk2_PXE_working_COM1.log

