Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2018-02-04 Thread Alan Orth
I came here looking for this! The last time I tried it in early 2017-12 it was still "broken" with SLURM 17.11.0. Glad to see that it was fixed with 17.11.1 (and to know why). I've now got PAM limits being applied correctly on my cluster. Thanks for the link, Andy. Cheers, On Fri, Dec 8, 2017 at

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
Answering my own question, I got private email which points to , describing both the problem and the solution. (Thanks Matthieu!) Andy On 12/08/2017 11:06 AM, Andy Riebs wrote: I've gathered more information, and I am probably having a fight wi

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
I've gathered more information, and I am probably having a fight with PAM. First, of note, this problem can be reproduced with a single node, single task job, such as $ sbatch -N1 --reservation awr #!/bin/bash hostname Submitted batch job 90436 $ sinfo -R batch job complete f slurm 2017-12

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-30 Thread Matthieu Hautreux
Hi, You should look at this bug: https://bugs.schedmd.com/show_bug.cgi?id=4412 I thought it would be resolved in 17.11.0. Regards Matthieu On 30 Nov 2017 00:56, "Andy Riebs" wrote: > We've just installed 17.11.0 on our 100+ node x86_64 cluster running > CentOS 7.4 this afternoon, and per

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-30 Thread Andy Riebs
We've just installed 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this afternoon, and periodically see a single node (perhaps the first node in an allocation?) get drained with the message "batch job complete failure". On one node in question, slurmd.log reports pam_unix(slur

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-29 Thread Andy Riebs
We've just installed 17.11.0 on our 100+ node x86_64 cluster running CentOS 7.4 this afternoon, and periodically see a single node (perhaps the first node in an allocation?) get drained with the message "batch job complete failure". On one node in question, slurmd.log reports pam_unix(slur