Hi,

You should take a look at this bug: https://bugs.schedmd.com/show_bug.cgi?id=4412
I thought it would be resolved in 17.11.0.

Regards,
Matthieu

On Nov 30, 2017 at 00:56, "Andy Riebs" <andy.ri...@hpe.com> wrote:
> We've just installed 17.11.0 on our 100+ node x86_64 cluster running
> CentOS 7.4 this afternoon, and periodically see a single node (perhaps
> the first node in an allocation?) get drained with the message "batch
> job complete failure".
>
> On one node in question, slurmd.log reports:
>
>   pam_unix(slurm:session): open_session - error recovering username
>   pam_loginuid(slurm:session): unexpected response from failed conversation function
>
> On another node drained for the same reason:
>
>   error: pam_open_session: Cannot make/remove an entry for the specified session
>   error: error in pam_setup
>   error: job_manager exiting abnormally, rc = 4020
>   sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>
> slurmctld has logged:
>
>   error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job
>   drain_nodes: node Summer0c048 state set to DRAIN
>
> It's been a long day (for other reasons), so I'll go dig into this
> tomorrow. But if anyone can shine some light on where I should start
> looking, I shall be most obliged!
>
> Andy
>
> --
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!
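
For anyone hitting this before a fix ships: the log lines above come from
slurmd's PAM session setup (the "pam_unix(slurm:session)" prefix indicates
the PAM service name is "slurm", i.e. /etc/pam.d/slurm). One possible
stopgap, untested and assuming the failure really is confined to the
pam_loginuid module, would be to mark that module optional in the slurmd
PAM stack so a loginuid failure no longer kills the session:

    # /etc/pam.d/slurm -- downgrade pam_loginuid from required to optional
    session    optional    pam_loginuid.so

Nodes already drained by the failure can then be returned to service with,
for example:

    scontrol update NodeName=node048 State=RESUME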