Dear all,

We just upgraded from 17.02.10 to 17.11.5 (using auks and cgroups) and we are hitting a nasty problem: finished jobs are hanging (indefinitely) in the completing state.

On the node I see only two processes remaining: 'slurmstepd' and its child 'auks'. Looking at the slurmstepd with strace, I couldn't identify any attempts to close/kill auks (but I could very well have missed them). Slurmstepd is regularly checking the cgroups. In the cgroups tasks list I see (only) the slurmstepd and auks threads.
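
For anyone who wants to check the same thing on their own node, here is a minimal sketch of how the tasks list can be inspected. It assumes a cgroup v1 hierarchy, where each cgroup directory has a 'tasks' file listing thread IDs. The function names and the example path in the comment are hypothetical; the real location depends on the cgroup mount point and the job and step IDs.

```python
from pathlib import Path


def read_cgroup_tasks(tasks_file):
    """Return the thread IDs listed in a cgroup v1 'tasks' file."""
    text = Path(tasks_file).read_text()
    return [int(token) for token in text.split()]


def task_name(tid, proc_root="/proc"):
    """Return the command name of a task, or None if it has already exited."""
    try:
        return Path(proc_root, str(tid), "comm").read_text().strip()
    except OSError:
        return None


def remaining_tasks(tasks_file, proc_root="/proc"):
    """Map every thread ID still in the cgroup to its command name."""
    return {tid: task_name(tid, proc_root) for tid in read_cgroup_tasks(tasks_file)}


# Example call (hypothetical path, adjust to the local cgroup layout):
# remaining_tasks("/sys/fs/cgroup/freezer/slurm/uid_1000/job_42/step_0/tasks")
```

On a node with a hanging job, the result should contain only entries named slurmstepd and auks if the situation matches what I describe above.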

Killing (-9) auks makes the slurmstepd complete successfully.

Does this sound familiar to anyone?

Or is there anyone out there who is successfully running 17.11.5 in combination with auks and cgroups?

I'm wondering whether there may be some kind of deadlock: slurmstepd does not kill auks, yet it waits for the cgroups to become empty?

Regards,

Robbert
