Hi,

You should take a look at this bug: https://bugs.schedmd.com/show_bug.cgi?id=4412
I thought it would be resolved in 17.11.0.

Regards,
Matthieu

On Nov 30, 2017 at 00:56, "Andy Riebs" <andy.ri...@hpe.com> wrote:
> We've just installed 17.11.0 on our 100+ node x86_64 cluster running
> CentOS 7.4 this afternoon, and periodically see a single node (perhaps
> the first node in an allocation?) get drained with the message "batch
> job complete failure".
>
> On one node in question, slurmd.log reports:
>
>   pam_unix(slurm:session): open_session - error recovering username
>   pam_loginuid(slurm:session): unexpected response from failed conversation function
>
> On another node drained for the same reason:
>
>   error: pam_open_session: Cannot make/remove an entry for the specified session
>   error: error in pam_setup
>   error: job_manager exiting abnormally, rc = 4020
>   sending REQUEST_COMPLETE_BATCH_SCRIPT, error:4020 status 0
>
> slurmctld has logged:
>
>   error: slurmd error running JobId=33 on node(s)=node048: Slurmd could not execve job
>   drain_nodes: node Summer0c048 state set to DRAIN
>
> It's been a long day (for other reasons), so I'll go dig into this
> tomorrow. But if anyone can shine some light on where I should start
> looking, I shall be most obliged!
>
> Andy
>
> --
> Andy Riebs
> andy.ri...@hpe.com
> Hewlett-Packard Enterprise
> High Performance Computing Software Engineering
> +1 404 648 9024
> My opinions are not necessarily those of HPE
> May the source be with you!
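
For anyone hitting this before a fix ships: the log lines above come from
slurmd's PAM session setup (the "pam_unix(slurm:session)" prefix indicates
the PAM service name is "slurm", i.e. /etc/pam.d/slurm). One possible
stopgap, untested and assuming the failure really is confined to the
pam_loginuid module, would be to mark that module optional in the slurmd
PAM stack so a loginuid failure no longer kills the session:

    # /etc/pam.d/slurm -- downgrade pam_loginuid from required to optional
    session    optional    pam_loginuid.so

Nodes already drained by the failure can then be returned to service with,
for example:

    scontrol update NodeName=node048 State=RESUME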