17.03.2023 3:44, Attila Nagy wrote:

> Hi,
>
> As this is super annoying, I'm willing to pay a $500 bounty for solving this
> issue (whomever is first, however I don't anticipate a big competition :)
> Having an invoice would be best, but I'm willing to accept individuals as
> well).
> I can't give remote access, but can run debug builds with serial console.
> stable/13 branch.
>
> I have a bunch of netbooted machines, one set in a cluster is older (HP DL80
> G9, 2x8C, Intel I350 -igb- NICs), the other set is newer (HP XL225n G10, AMD
> EPYC2x16C, BCM57412 -bnxt- NICs).
> All of these boot from the network, which is basically:
> - get IP and options with DHCP with the help of the NIC's PXE stack
> - get the loader and kernel, start it
> - do another round of DHCP from the kernel (bootp_subr.c)
> - mount the root via NFS and let everything work as usual
>
> The problem is that the newer machines take an indefinite time to boot. The
> older ones (with igb NIC) work reliably, they always boot fast.
> The process of getting an IP address via DHCP (bootpc_call from bootp_subr.c)
> either succeeds normally (in a few seconds), or takes a lot of time.
> Common (measured) times to boot range from 10s of minutes to anywhere between
> a few hours (1-6).
> Sometimes it just gets stuck and couldn't get past bootpc_call (getting the
> DHCP lease).
>
> What I've already tried:
> - we have a redundant set of DHCP servers which offer static leases (so there
> are two DHCPOFFERs), so I tried to turn off one of them, nothing has changed
> - tried to disable SMP, the effect is the same
> - tried to see whether it's a network issue. The NIC's PXE stack always gets
> the lease quickly and booting FreeBSD from an ISO and issuing dhclient on the
> same interface is also fast. After the machines have booted, there are no
> network issues, they work reliably (since more than a year for 20+ machines,
> so not just a few hours)
>
> This issue wasn't so bad previously (only a few mins to tens of minutes
> delay), but recently it got pretty unbearable, even making some machines
> unbootable for days...
>
> First I thought it might be a packet loss (or more exactly packet delivery
> from the DHCP server to the receiving socket), either in the network or in
> the NIC/kernel itself, so I placed a few random printfs into bootp_subr.c and
> udp_usrreq.c.
>
> After spending some time trying to understand the problem it feels like a
> race condition in
> bootpc_call, but I don't know the code well enough to effectively verify that.
To me, it looks like a timekeeping problem. Please show the output of:

sysctl kern.timecounter kern.eventtimer

after the machine has booted to single- or multi-user mode. Also, show a verbose boot log (bootverbose).

Sometimes the UEFI/BIOS setup has settings for the ACPI/HPET timers (enable/disable); did you try "playing" with such options?

Note that there is a loader tunable, kern.timecounter.hardware="HPET", that can be used to force a particular timecounter source for the kernel. It can be set via loader.conf or device.hints, or any other way that puts it into kenv; the kenv/device.hints can even be compiled into a custom kernel binary.
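
As a rough illustration only, the tunable and the verbose boot request might look like this in the loader.conf that the netbooted machines actually load. "HPET" here is just an example value, not a recommendation; which names are valid on a given machine is an assumption to check against the output of sysctl kern.timecounter.choice.

```
# /boot/loader.conf (in the root the netbooted machines load from)

# Force a specific timecounter source for the kernel.
# Assumption: "HPET" is only an example; use a name that appears in
# kern.timecounter.choice on the affected hardware.
kern.timecounter.hardware="HPET"

# Ask for a verbose boot log (same effect as "boot -v" at the loader prompt).
boot_verbose="YES"
```

If the delay changes noticeably with a different timecounter, that would support the timekeeping theory; if it does not, the tunable can simply be removed again.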