On 12/7/19 7:39 am, Pär Lundö wrote:

Presumably, the first 8 tasks originate from the first node (in this case lxclient11), and the other node (lxclient10) responds as predicted.

That looks right. It seems the other node has two processes fighting over the same socket, and that's breaking Slurm there.

Is it necessary to have passwordless ssh communication alongside the munge authentication?

No, srun doesn't need (or use) that at all.

In addition I checked the slurmctld-log from both the server and client and found something (noted in bold):

This is from the slurmd log on the client from the look of it.

*[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity for tasks lurm.pmix.83.0: Address already in use[98]*
[2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386 [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156 [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed

That indicates that something else has grabbed the socket it wants and that's why the setup of the MPI ranks on the second node fails.
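For reference, errno 98 on Linux is EADDRINUSE, which is what bind() reports when the path for a Unix socket is already taken. Here is a minimal Python sketch that reproduces the same kind of failure. The socket path is a made-up one in a temporary directory, not the path Slurm actually uses, and this only illustrates the error, not how the pmix plugin creates its socket:

```python
import errno
import os
import socket
import tempfile


def bind_twice(path):
    """Bind two Unix sockets to the same path.

    Returns the errno raised by the second bind, or None if it succeeded.
    """
    first = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        first.bind(path)  # creates the socket file
        try:
            second.bind(path)  # the path already exists, so this should fail
        except OSError as exc:
            return exc.errno
        return None
    finally:
        first.close()
        second.close()
        if os.path.exists(path):
            os.unlink(path)  # clean up the socket file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical path, for illustration only.
        err = bind_twice(os.path.join(tmp, "demo.sock"))
        print(err, errno.errorcode.get(err))
```

Note that bind() fails this way whenever the path already exists, so a socket file left behind by an earlier process could produce the same message as a live process holding it.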

You'll want to poke around there to see what's using it.
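One way to do that on Linux is to look in /proc/net/unix for socket paths containing "pmix" and then match the socket inode against the file descriptors under /proc/<pid>/fd. The following is a rough Python sketch of that idea. The function names are my own, the "pmix" search string is an assumption based on the log above, and you will likely need to run it as root on the affected node to see other users' processes:

```python
import os


def find_unix_sockets(proc_net_unix_text, needle):
    """Return (inode, path) pairs for Unix sockets whose path contains needle.

    proc_net_unix_text is the content of /proc/net/unix.
    """
    matches = []
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue  # unnamed socket, no path column
        inode, path = fields[6], fields[7]
        if needle in path:
            matches.append((int(inode), path))
    return matches


def pids_holding_inode(inode):
    """Return PIDs that have an open descriptor on the given socket inode."""
    target = f"socket:[{inode}]"
    pids = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = f"/proc/{pid}/fd"
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            try:
                if os.readlink(os.path.join(fd_dir, fd)) == target:
                    pids.append(int(pid))
                    break
            except OSError:
                continue  # descriptor closed while we were looking
    return pids


if __name__ == "__main__":
    with open("/proc/net/unix") as handle:
        text = handle.read()
    for inode, path in find_unix_sockets(text, "pmix"):
        print(path, inode, pids_holding_inode(inode))
```

If nothing shows up there but the socket file still exists on disk, it may just be a stale file rather than a running process.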

Best of luck!
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
