I applied this today while developing my integration tests and haven't
encountered any hanging tasks on my test instances since.

Consider this:
Tested-by: Stefan Hanreich <[email protected]>

On 3/4/26 2:47 PM, Hannes Laimer wrote:
> Thanks a lot @Fabian and @Fiona for helping me debug this!
> 
> The problem is that some libraries temporarily overwrite the SIGCHLD
> handler. If the library is called fast enough, this can lead to lost
> CHLD signals, which in turn prevents `worker_reaper` from being called in
> RESTEnvironment. Tasks then won't get cleaned up until a different SIGCHLD
> arrives at the same `pvedaemon` process and triggers `worker_reaper`.
> 
> As @Fabian mentioned in [1], a general rework of the task handling,
> potentially with `pidfd`s, would make a lot of sense.
> 
> These two patches address the problem in the task handling structure as
> it currently is. They
>  - run the PAM lib call in a fork, so that signal handler changes made by
>    the library are isolated from our process
>  - run `worker_reaper` periodically (every 5s) to catch any other potential
>    instances of this, since the same could happen with other libs, not
>    just PAM
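
For readers following along, here is a rough sketch of the fork-isolation
idea from the first bullet. It is written in Python rather than Perl and is
not the code from the patch: `run_isolated` and `noisy_library_call` are
hypothetical names, and the pipe-based result passing is just one assumed
way to get the answer back to the parent.

```python
import os
import signal


def run_isolated(check):
    """Run check() in a forked child and return its boolean result.

    Any signal handlers that check() installs or resets only affect the
    child, which exits right afterwards.
    """
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report the result through the pipe, then exit immediately.
        status = 1
        try:
            os.close(read_fd)
            os.write(write_fd, b"1" if check() else b"0")
            status = 0
        finally:
            os._exit(status)

    # Parent: read the single result byte and reap the child ourselves.
    os.close(write_fd)
    try:
        data = os.read(read_fd, 1)
    finally:
        os.close(read_fd)
    os.waitpid(pid, 0)
    if data not in (b"0", b"1"):
        raise RuntimeError("child exited without reporting a result")
    return data == b"1"


def noisy_library_call():
    # Stand-in for a library that overwrites the SIGCHLD handler.
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return True
```

Since the handler change happens in the child, whatever handler the parent
had installed before the call is still in place afterwards.
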
> 
> [1] https://lore.proxmox.com/pve-devel/[email protected]/T/#m7b0f3873be5755f330e288cfa50905744f225b2b
> 
> 
> pve-common:
> 
> Hannes Laimer (1):
>   RESTEnvironment: periodically reap workers as SIGCHLD fallback
> 
>  src/PVE/RESTEnvironment.pm | 9 +++++++++
>  1 file changed, 9 insertions(+)
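
The periodic fallback can be pictured roughly like this. Again, this is a
Python sketch under my own assumptions and not the actual RESTEnvironment
change: `reap_finished` and `run_with_periodic_reap` are hypothetical names,
and the real code tracks its workers differently. The point is only that a
non-blocking `waitpid` on a timer collects children whose SIGCHLD was lost.

```python
import os
import time


def reap_finished(workers):
    """Reap every tracked child that has already exited, without blocking.

    `workers` is a set of child PIDs; returns a list of (pid, exit_code).
    """
    reaped = []
    for pid in sorted(workers):
        try:
            done, status = os.waitpid(pid, os.WNOHANG)
        except ChildProcessError:
            # Already reaped elsewhere, e.g. by the regular SIGCHLD handler.
            workers.discard(pid)
            continue
        if done == 0:
            continue  # still running
        workers.discard(pid)
        if os.WIFEXITED(status):
            code = os.WEXITSTATUS(status)
        else:
            code = -os.WTERMSIG(status)  # killed by a signal
        reaped.append((pid, code))
    return reaped


def run_with_periodic_reap(workers, interval=5.0, rounds=1):
    """Call the reaper on a timer instead of relying on SIGCHLD alone."""
    collected = []
    for _ in range(rounds):
        time.sleep(interval)
        collected.extend(reap_finished(workers))
    return collected
```

With a 5s interval, a worker whose signal got lost stays around for at most
about 5s instead of waiting for the next unrelated SIGCHLD.
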
> 
> 
> pve-access-control:
> 
> Hannes Laimer (1):
>   pam: fork for PAM authentication to isolate SIGCHLD handler
> 
>  src/PVE/Auth/PAM.pm | 74 +++++++++++++++++++++++++--------------------
>  1 file changed, 42 insertions(+), 32 deletions(-)
> 
> 
> Summary over all repositories:
>   2 files changed, 51 insertions(+), 32 deletions(-)
> 



