Hi list,

We hit a semi-reproducible crash (depending on hardware, memory allocations
etc.) where the HAProxy master process is killed by its own watchdog timer
while inside mworker_catch_sigchld().  The crash happens when many worker
processes exit simultaneously.  It seems to be more common on CPUs with a
lower clock frequency, where the loop eats up more CPU time; in the worst
case the CPU usage can be quite high.

Example call trace from the crash (HAProxy 3.0.16, Linux x86_64, but it
also applies to 3.4):

2026-02-14 20:21:24.026Thread 1 is about to kill the process.
2026-02-14 20:21:24.026*>Thread 1 : id=0x0 act=0 glob=0 wq=0 rq=0 tl=0 tlsz=0 
rqsz=0
2026-02-14 20:21:24.026      1/1    stuck=1 prof=0 harmless=0 isolated=0
2026-02-14 20:21:24.026             cpu_ns: poll=3529859368618 
now=3532886227785 diff=3026859167
2026-02-14 20:21:24.026             curr_task=0
2026-02-14 20:21:24.026             call trace(15):
2026-02-14 20:21:24.026             |       0x5dce5c [eb cc 66 90 64 48 8b 04]: 
ha_thread_dump_fill+0xcc/0xee
2026-02-14 20:21:24.026             |       0x5e1ea3 [48 89 c3 48 85 c0 74 cf]: 
ha_dump_backtrace+0x7eb3
2026-02-14 20:21:24.026             |       0x6c65d4 [0f 1f 40 00 4c 8b 45 a8]: 
wdt_handler+0x184/0x24a
2026-02-14 20:21:24.026             | 0x7f64e7421900 [48 c7 c0 0f 00 00 00 0f]: 
libc:+0x57900
2026-02-14 20:21:24.026             | 0x7f64e74e0c87 [48 3d 00 f0 ff ff 77 31]: 
libc:wait4+0x59/0xa5
2026-02-14 20:21:24.026             |       0x67c812 [41 89 c7 85 c0 0f 8e db]: 
mworker_catch_sigchld+0x42/0x4c0
2026-02-14 20:21:24.026             |       0x6bba13 [49 8b 07 4c 89 ff 4d 39]: 
__signal_process_queue+0xb3/0x1c3

The watchdog fires every ~1 CPU-second.  On the first fire it sets
TH_FL_STUCK on the thread.  On the next fire, if the flag is still set,
it has no choice but to call ha_panic().

I assume that the flag is meant to be cleared by each work loop at every 
iteration,
proving forward progress to the watchdog.  This is done consistently
everywhere else:

    task.c:584   _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
    task.c:710   _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
    fd.c:630     _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);
    listener.c:1544  _HA_ATOMIC_AND(&th_ctx->flags, ~TH_FL_STUCK);

The `restart_wait` loop in mworker_catch_sigchld() is structurally
identical: it iterates to completion, does real work on each iteration,
and can run for an unbounded number of iterations, but it was never given
the same treatment.

As far as I can see there is no harm in resetting the flag here, and I
think there is a good chance that this would prevent such issues in the
future.

IMO this fix should also be backported.

Best regards,

Alexander


Attachment: 0001-BUG-MEDIUM-mworker-clear-TH_FL_STUCK-in-the-restart_.patch
