Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-26 Thread Christopher Samuel
On 26/03/18 20:50, Robbert Eggermont wrote: The suggested fix (use SIGKILL instead of SIGTERM in slurm_spank_auks to stop auks) seems to work (so far). Excellent, so glad to hear that! All the best, Chris -- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC

Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-26 Thread Robbert Eggermont
FYI: I think we've run into this issue: https://github.com/hautreux/auks/issues/24 It seems to be triggered by a change in signal blocking in slurmstepd: https://github.com/SchedMD/slurm/commit/d2c83807097605f10f0b19cf2c5cb5c2c6f35ad6 The suggested fix (use SIGKILL instead of SIGTERM in slurm_s
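
For anyone wondering why swapping SIGTERM for SIGKILL would make a difference: a blocked signal mask is inherited across fork() (and preserved across exec), so a child started from a process that blocks SIGTERM never sees that signal delivered, while SIGKILL cannot be blocked. The sketch below only demonstrates that general POSIX behaviour. It is not the auks or slurmstepd code, and the names wait_for_exit and demo are made up for illustration.

```python
import os
import signal
import time


def wait_for_exit(pid, timeout):
    """Poll for up to `timeout` seconds; return the wait status, or None if still running."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return status
        time.sleep(0.05)
    return None


def demo():
    """Fork a child with SIGTERM blocked, then try SIGTERM followed by SIGKILL."""
    # The parent blocks SIGTERM; the forked child inherits this mask.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGTERM})
    try:
        pid = os.fork()
        if pid == 0:
            # Child: stand-in for a long-running helper process.
            while True:
                time.sleep(1)

        # SIGTERM stays pending in the child because it is blocked there.
        os.kill(pid, signal.SIGTERM)
        term_status = wait_for_exit(pid, 1.0)

        # SIGKILL cannot be blocked, so the child is terminated.
        os.kill(pid, signal.SIGKILL)
        kill_status = wait_for_exit(pid, 5.0)
    finally:
        # Restore the parent's original signal mask.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
    return term_status, kill_status


if __name__ == "__main__":
    term_status, kill_status = demo()
    print("exited after SIGTERM:", term_status is not None)
    print("killed by SIGKILL:", os.WIFSIGNALED(kill_status)
          and os.WTERMSIG(kill_status) == signal.SIGKILL)
```

If a child is left in this state, the parent waits on a process that will never act on SIGTERM, which would match the symptom of a step that never finishes. Whether that is exactly what happens in slurmstepd is an assumption based on the linked issue and commit.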

Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-26 Thread Robbert Eggermont
Hi Chris, On 26-03-18 05:04, Christopher Samuel wrote: Does the slurmd log report it trying to kill the auks process? The first thing I need to do is turn up the logging verbosity. https://bugs.schedmd.com/show_bug.cgi?id=4733 The fact that auks is hanging around makes me wonder if this i

Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-25 Thread Christopher Samuel
On 26/03/18 12:43, Robbert Eggermont wrote: Does this sound familiar to anyone? Does the slurmd log report it trying to kill the auks process? Also you might want to have a look at: https://bugs.schedmd.com/show_bug.cgi?id=4733 to see if that bug fits what you're seeing. Basically I get a

[slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-25 Thread Robbert Eggermont
Dear all, We just upgraded from 17.02.10 to 17.11.5 (using auks and cgroups) and we are hitting a nasty problem: finished jobs are hanging (indefinitely) in the completing state. On the node I see only two processes remaining: 'slurmstepd' and its child 'auks'. Looking at the slurmstepd wit