Hi,

Thank you so much for your quick responses!
It is much appreciated.
I don't have access to the cluster until next week, but I'll be sure to follow 
up on all of your suggestions and get back to you next week.

Have a nice weekend!
Best regards
Palle

________________________________
From: "slurm-users" <slurm-users-boun...@lists.schedmd.com>
Sent: 12 July 2019 17:37
To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Running pyMPI on several nodes

Pär, by 'poking around' Chris means using tools such as netstat and lsof.
I would also look at ps -eaf --forest to make sure there are no 'orphaned' 
jobs sitting on that compute node.
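For example (purely illustrative; "slurm" is just a grep pattern here, so adjust 
it to whatever port or socket the error actually mentions):
    netstat -anp | grep -i slurm
    lsof -c slurmd
    ps -eaf --forest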

Having said that, I have a dim memory of a classic PBSPro error message 
which says something about a network connection 
but really means that you cannot open a remote session on that compute server.

As an aside, have you checked that your username exists on that compute server?  
    getent passwd par
Also, is your home directory mounted - or something substituting for your 
home directory?
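A quick way to check both (the username 'par' and the /home mount point are just 
the example names used above):
    getent passwd par | cut -d: -f6
    ls -ld ~par
    mount | grep -i home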


On Fri, 12 Jul 2019 at 15:55, Chris Samuel <ch...@csamuel.org> wrote:
On 12/7/19 7:39 am, Pär Lundö wrote:

> Presumably, the first 8 tasks originate from the first node (in this
> case the lxclient11), and the other node (lxclient10) responds as
> predicted.

That looks right, it seems the other node has two processes fighting
over the same socket and that's breaking Slurm there.

> Is it necessary to have passwordless ssh communication alongside the
> munge authentication?

No, srun doesn't need (or use) that at all.

> In addition I checked the slurmctld-log from both the server and client
> and found something (noted in bold):

This is from the slurmd log on the client from the look of it.

> *[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity
> for tasks lurm.pmix.83.0: Address already in use[98]*
> [2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386
> [pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
> [2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156
> [p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed

That indicates that something else has grabbed the socket it wants and
that's why the setup of the MPI ranks on the second node fails.

You'll want to poke around there to see what's using it.
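For instance, something along these lines might show it (just a sketch; the grep 
patterns are guesses based on the log above, and the spool path depends on your 
SlurmdSpoolDir setting):
    ss -xlp | grep -i pmix
    lsof -U | grep -i slurm
    ls -l /var/spool/slurmd/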

Best of luck!
Chris
--
  Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
