I also encountered hanging tasks while running e2e tests, often leading
to tests running into timeouts even though the task was already "OK". I
applied these patches to the test VMs and no longer encountered hanging
tasks, which significantly sped up the test runs.

Consider this:
Tested-by: Michael Köppl <[email protected]>

On Wed Mar 4, 2026 at 2:46 PM CET, Hannes Laimer wrote:
> Thanks a lot @Fabian and @Fiona for helping me debug this!
>
> The problem is that some libraries temporarily overwrite the SIGCHLD
> handler. If the library is called fast enough, this can lead to lost
> CHLD signals, which in turn prevents `worker_reaper` from being called in
> RESTEnvironment. So tasks won't get cleaned up until a different SIGCHLD
> arrives at the same `pvedaemon` process, triggering `worker_reaper`.
>
> As @Fabian mentioned in [1], a general rework of the task handling,
> potentially with `pidfd`s, would make a lot of sense.
>
> These two patches address the problem in the task handling structure as
> it currently is. They
>  - run the PAM lib call in a fork, so any signal handler changes the
>    library makes are isolated from our process
>  - run `worker_reaper` periodically (5s) to catch any other potential
>    instances of this, since it would be possible that the same happens
>    with other libs, not just PAM
>
> [1] https://lore.proxmox.com/pve-devel/[email protected]/T/#m7b0f3873be5755f330e288cfa50905744f225b2b
>
>
> pve-common:
>
> Hannes Laimer (1):
>   RESTEnvironment: periodically reap workers as SIGCHLD fallback
>
>  src/PVE/RESTEnvironment.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
>
> pve-access-control:
>
> Hannes Laimer (1):
>   pam: fork for PAM authentication to isolate SIGCHLD handler
>
>  src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 32 deletions(-)
>
>
> Summary over all repositories:
>   2 files changed, 51 insertions(+), 32 deletions(-)



