Thanks a lot @Fabian and @Fiona for helping me debug this! The problem is that some libraries temporarily overwrite the SIGCHLD handler. If the library is called fast enough, this can lead to lost CHLD signals, which in turn prevents `worker_reaper` from being called in RESTEnvironment. So tasks won't get cleaned up until a different SIGCHLD arrives at the same `pvedaemon` process and triggers `worker_reaper`.
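
To illustrate the mechanism described above, here is a minimal sketch in Python (not the Perl code from the patches). It assumes a Linux system; `spawn_worker` and `demo` are hypothetical names, and `worker_reaper` here is only a stand-in for the real function in RESTEnvironment. A child exits while the SIGCHLD handler is temporarily replaced, the signal is discarded, and the zombie stays around until something calls the reaper again.

```python
import os
import signal

reaped = []  # PIDs collected by the stand-in reaper


def worker_reaper(signum=None, frame=None):
    """Collect every exited child without blocking (stand-in, not the PVE code)."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        reaped.append(pid)


def spawn_worker():
    """Fork a child that exits immediately (hypothetical helper)."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    return pid


def demo():
    reaped.clear()
    signal.signal(signal.SIGCHLD, worker_reaper)

    # Simulate a library call that temporarily replaces our handler.
    saved = signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    pid = spawn_worker()
    # Block until the child has exited, but leave it as a zombie (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    # The "library" restores our handler, but the SIGCHLD is already gone.
    signal.signal(signal.SIGCHLD, saved)

    lost = pid not in reaped  # the handler never ran for this child
    worker_reaper()           # an explicit call still finds the zombie
    recovered = pid in reaped
    return lost, recovered


if __name__ == "__main__":
    print(demo())
```

The explicit `worker_reaper()` call at the end plays the role of "a different SIGCHLD arriving": nothing reaps the child until the reaper runs again for some other reason.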

As @Fabian mentioned in [1], a general rework of the task handling, potentially with `pidfd`s, would make a lot of sense. These two patches address the problem in the task handling structure as it currently is. They

- run the PAM lib call in a fork, so signal handler changes made by the library are isolated from our process
- run `worker_reaper` periodically (every 5s) to catch any other potential instances of this, since the same could happen with other libs, not just PAM

[1] https://lore.proxmox.com/pve-devel/[email protected]/T/#m7b0f3873be5755f330e288cfa50905744f225b2b


pve-common:

Hannes Laimer (1):
  RESTEnvironment: periodically reap workers as SIGCHLD fallback

 src/PVE/RESTEnvironment.pm | 9 +++++++++
 1 file changed, 9 insertions(+)


pve-access-control:

Hannes Laimer (1):
  pam: fork for PAM authentication to isolate SIGCHLD handler

 src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
 1 file changed, 42 insertions(+), 32 deletions(-)


Summary over all repositories:
  2 files changed, 51 insertions(+), 32 deletions(-)

-- 
Generated by murpp 0.9.0
