Re: [slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-25 Thread Christopher Samuel
On 26/03/18 12:43, Robbert Eggermont wrote:
> Does this sound familiar to anyone?
Does the slurmd log report it trying to kill the auks process? Also you might want to have a look at: https://bugs.schedmd.com/show_bug.cgi?id=4733 to see if that bug fits what you're seeing. Basically I get a

[slurm-users] 17.11+auks+cgroups: finished jobs hang in completing state

2018-03-25 Thread Robbert Eggermont
Dear all, We just upgraded from 17.02.10 to 17.11.5 (using auks and cgroups) and we are hitting a nasty problem: finished jobs are hanging (indefinitely) in the completing state. On the node I see only two processes remaining: 'slurmstepd' and its child 'auks'. Looking at the slurmstepd wit

[slurm-users] Troubleshooting scheduling

2018-03-25 Thread E.S. Rosenberg
Hi everyone, Is there a guide anywhere on how to figure out why jobs aren't being started? We have a cluster with nodes of mixed sizes/powers; currently, roughly half the cluster is idle even though there are ~5k jobs queued. All jobs are queued due to priority while only 1 job is marked as waiting