17.03.2023 3:44, Attila Nagy wrote:
> Hi,
> 
> As this is super annoying, I'm willing to pay a $500 bounty for solving this 
> issue (whoever is first, however I don't anticipate a big competition :) 
> Having an invoice would be best, but I'm willing to accept individuals as 
> well).
> I can't give remote access, but can run debug builds with serial console. 
> stable/13 branch.
> 
> I have a bunch of netbooted machines, one set in a cluster is older (HP DL80 
> G9, 2x8C, Intel I350 -igb- NICs), the other set is newer (HP XL225n G10, AMD 
> EPYC2x16C, BCM57412 -bnxt- NICs).
> All of these boot from the network, which is basically:
> - get IP and options with DHCP with the help of the NIC's PXE stack
> - get the loader and kernel, start it
> - do another round of DHCP from the kernel (bootp_subr.c)
> - mount the root via NFS and let everything work as usual
> 
> The problem is that the newer machines take an indefinite time to boot. The 
> older ones (with igb NIC) work reliably, they always boot fast.
> The process of getting an IP address via DHCP (bootpc_call from bootp_subr.c) 
> either succeeds normally (in a few seconds), or takes a lot of time.
> Commonly measured boot times range from tens of minutes to several hours 
> (1-6).
> Sometimes it just gets stuck and never gets past bootpc_call (getting the 
> DHCP lease).
> 
> What I've already tried:
> - we have a redundant set of DHCP servers which offer static leases (so there 
> are two DHCPOFFERs), so I tried to turn off one of them, nothing has changed
> - tried to disable SMP, the effect is the same
> - tried to see whether it's a network issue. The NIC's PXE stack always gets 
> the lease quickly and booting FreeBSD from an ISO and issuing dhclient on the 
> same interface is also fast. After the machines have booted, there are no 
> network issues, they work reliably (for more than a year across 20+ 
> machines, so not just a few hours)
> 
> This issue wasn't so bad previously (only a few mins to tens of minutes 
> delay), but recently it got pretty unbearable, even making some machines 
> unbootable for days...
> 
> First I thought it might be packet loss (or, more exactly, a problem with 
> packet delivery from the DHCP server to the receiving socket), either in the 
> network or in the NIC/kernel itself, so I placed a few random printfs into 
> bootp_subr.c and udp_usrreq.c.
> 
> After spending some time trying to understand the problem it feels like a 
> race condition in bootpc_call, but I don't know the code well enough to 
> effectively verify that.

To me, this looks like a timekeeping problem. Please show the output of:
sysctl kern.timecounter kern.eventtimer

after the machine has booted to single- or multi-user mode.
Also, show a verbose boot log (bootverbose).

Sometimes the UEFI/BIOS setup has settings for the ACPI/HPET timers 
(enable/disable).
Have you tried changing those options?

Note that there is a loader tunable, kern.timecounter.hardware="HPET",
that can be used to force a particular timecounter source for the kernel. 
It can be set via loader.conf or device.hints, or any other way that puts 
it into kenv; kenv/device.hints may even be compiled into a custom kernel 
binary.
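
As a rough sketch, the two suggestions above could look like this in the 
loader.conf that your netboot loader actually reads (where that file lives 
depends on your setup, so treat the path as an assumption). "HPET" is only 
an example value; kern.timecounter.choice in the sysctl output lists the 
timecounters that the machine really offers.

```
# loader.conf fragment (sketch; adjust to your netboot layout)

# Verbose kernel boot messages, for the requested boot log
boot_verbose="YES"

# Force a specific timecounter; "HPET" is an example only,
# pick a name listed in kern.timecounter.choice
kern.timecounter.hardware="HPET"
```

If the forced timecounter changes how long bootpc_call takes, that would 
support the timekeeping theory; if nothing changes, it at least rules out 
one candidate.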


