Hi list,

We hit a semi-reproducible crash (it depends on the hardware, memory allocations, etc.) where the HAProxy master process is killed by its own watchdog timer while inside mworker_catch_sigchld(). The crash happens when many worker processes exit simultaneously. It seems to be more common on CPUs with a lower clock frequency, where the loop eats up more CPU time. In a worst-case scenario the CPU usage can be quite high.
Example call trace from the crash (HAProxy 3.0.16, Linux x86_64, but also
applies to 3.4):
2026-02-14 20:21:24.026Thread 1 is about to kill the process.
2026-02-14 20:21:24.026*>Thread 1 : id=0x0 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 rqsz=0
2026-02-14 20:21:24.026 1/1 stuck=1 prof=0 harmless=0 isolated=0
2026-02-14 20:21:24.026 cpu_ns: poll=3529859368618 now=3532886227785 diff=3026859167
2026-02-14 20:21:24.026 curr_task=0
2026-02-14 20:21:24.026 call trace(15):
2026-02-14 20:21:24.026 | 0x5dce5c [eb cc 66 90 64 48 8b 04]: ha_thread_dump_fill+0xcc/0xee
2026-02-14 20:21:24.026 | 0x5e1ea3 [48 89 c3 48 85 c0 74 cf]: ha_dump_backtrace+0x7eb3
2026-02-14 20:21:24.026 | 0x6c65d4 [0f 1f 40 00 4c 8b 45 a8]: wdt_handler+0x184/0x24a
2026-02-14 20:21:24.026 | 0x7f64e7421900 [48 c7 c0 0f 00 00 00 0f]: libc:+0x57900
2026-02-14 20:21:24.026 | 0x7f64e74e0c87 [48 3d 00 f0 ff ff 77 31]: libc:wait4+0x59/0xa5
2026-02-14 20:21:24.026 | 0x67c812 [41 89 c7 85 c0 0f 8e db]: mworker_catch_sigchld+0x42/0x4c0
2026-02-14 20:21:24.026 | 0x6bba13 [49 8b 07 4c 89 ff 4d 39]: __signal_process_queue+0xb3/0x1c3
The watchdog fires every ~1 CPU-second. On the first fire it sets
TH_FL_STUCK on the thread. On the next fire, if the flag is still set,
it has no choice but to call ha_panic().
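
To make the two-strike behaviour concrete, here is a minimal standalone model of it. This is only a sketch of the mechanism as described above, not HAProxy's wdt_handler(): thread_flags, wdt_tick() and report_progress() are hypothetical names, the flag value is illustrative, and C11 atomics stand in for the _HA_ATOMIC_* macros.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TH_FL_STUCK 0x1u            /* illustrative value, not HAProxy's */

atomic_uint thread_flags;           /* hypothetical stand-in for th_ctx->flags */

/* Model of one watchdog fire: set the flag and report whether it was
 * already set, i.e. whether a real watchdog would panic at this point. */
bool wdt_tick(void)
{
    unsigned int prev = atomic_fetch_or(&thread_flags, TH_FL_STUCK);
    return (prev & TH_FL_STUCK) != 0;
}

/* Model of what a work loop does on each iteration to prove progress. */
void report_progress(void)
{
    atomic_fetch_and(&thread_flags, ~TH_FL_STUCK);
}
```

In this model two consecutive wdt_tick() calls with no report_progress() in between make the second call return true, while a single report_progress() between them keeps it returning false.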
I assume that the flag is meant to be cleared by each work loop at every
iteration, proving forward progress to the watchdog. This is done
consistently everywhere else:
task.c:584 _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
task.c:710 _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
fd.c:630 _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
listener.c:1544 _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
The `restart_wait` loop in mworker_catch_sigchld() is structurally
identical: it iterates to completion, does real work each time, and can
run for an unbounded number of iterations. However, it was never given
the same treatment.
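
For illustration, the shape of the proposed change can be sketched as a child-reaping loop that clears the flag once per reaped child. This is not the attached patch and not the real mworker code: thread_flags, spawn_children() and reap_children() are hypothetical names, the flag value is illustrative, and the sketch uses a blocking waitpid() only to make the example deterministic.

```c
#include <errno.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TH_FL_STUCK 0x1u            /* illustrative value, not HAProxy's */

atomic_uint thread_flags;           /* hypothetical stand-in for th_ctx->flags */

/* Fork n children that exit immediately; return how many were started. */
int spawn_children(int n)
{
    int started = 0;

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(0);               /* child: exit right away */
        if (pid > 0)
            started++;
    }
    return started;
}

/* Reap every child, clearing the stuck flag after each one so that a
 * watchdog observing thread_flags sees forward progress. */
int reap_children(void)
{
    int reaped = 0;
    int status;

    for (;;) {
        pid_t pid = waitpid(-1, &status, 0);
        if (pid < 0) {
            if (errno == EINTR)
                continue;           /* interrupted, retry */
            break;                  /* ECHILD: no children left */
        }
        /* per-child bookkeeping would happen here */
        reaped++;
        atomic_fetch_and(&thread_flags, ~TH_FL_STUCK);
    }
    return reaped;
}
```

The point of the sketch is only the placement of the clear: it happens once per reaped child, so the cost is one atomic operation per iteration regardless of how many children exit at once.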
As far as I can see there is no harm in resetting the flag here and I think
there is a good chance that this would prevent such issues in the future.
Ideally, this fix should also be backported IMO.
Best regards,
Alexander
0001-BUG-MEDIUM-mworker-clear-TH_FL_STUCK-in-the-restart_.patch

