Hi,
Thank you for your response.
When I run it ("srun -N2 -n8 hostname"), I get an error stating:

"srun: job step 83.0 aborted before step completely launched.
srun: error: task 0 launced failed: Unspecified error.
srun: error: task 1 launced failed: Unspecified error.
srun: error: task 2 Launced failed: Unspecified error.
srun: error: task 3 launced failed: Unspecified error.
srun: error: task 4 launced failed: Unspecified error.
srun: error: task 5 launced failed: Unspecified error.
srun: error: task 6 launced failed: Unspecified error.
 srun: error: task 7 launced failed: Unspecified error.
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
lxclient10
"
Presumably, the first 8 tasks originate from the first node (in this case lxclient11), and the other node (lxclient10) responds as expected.
Is it necessary to have passwordless SSH communication alongside the munge authentication?

In addition, I checked the Slurm logs on both the server and the client and found the following (note the error lines):
"[2019-07-12T14:57:53.543] launch task 83.0 from UID 1000 GID: 1000 
HOST:192.168.1.1 PORT:4810
[2019-07-12T14:57:53.544] lllp distribution jobid[83] implicit auto binding: 
cores.one_thread.dist 8192
[2019-07-12T14:57:53.544] _task_layout_lllp_cyclic
[2019-07-12T14:57:53.544] _lllp_generate_cpu bind jobid [83]: mask_cpu, 
one_thread, 0x10, 0x01, 0x20, 0x02, 0x40, 0x04, 0x80
[2019-07-12T14:57:53.545] _run_prolog: run job script took usec=11
[2019-07-12T14:57:53.543] _run_prolog: prolog with lock for job 83 ran for 0 
seconds
[2019-07-12T14:57:53.771] [83.0] task_p_pre_launch: Using sched_affinity for 
tasks
[2019-07-12T14:57:53.771][83.0] task_p_pre_launch: Using sched affinity for 
tasks lurm.pmix.83.0: Address already in use[98]
[2019-07-12T14:57:53.682][83.0] error: lxclient[0] /pmix.server.c:386 
[pmix_stepd_init] mpi/pmix: ERROR: pmixp_usock_create_srv
[2019-07-12T14:57:53.683][83.0] error: (null) [0] /mpi_pmix:156 
[p_mpi_hook_slurmstepd_prefork] mpi/pmix: ERROR: pmixp_stepd_init() failed
[2019-07-12T14:57:53.686][83.0] error: Failed mpi_hook_slurmstepd_prefork
[2019-07-12T14:57:53.691][83.0] error: job_manage existing abnormally, rc=1
ks
[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for 
tasks
[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for 
tasks
[2019-07-12T14:57:53.772][83.0] task_p_pre_launch: Using sched_affinity for 
tasks
[2019-07-12T14:57:53.775][83.0] task_p_pre_launch: Using sched_affinity for 
tasks
[2019-07-12T14:57:56.004][83.0] done with job
[2019-07-12T14:57:56.005][83.0] error: Unable to unlink domain socket 
´/var/spool/slurmd/lxclient10_83.0´: No such file or directory
[2019-07-12T14:57:56.019][83.0] done with job
"

Best regards
Palle


________________________________
From: "slurm-users" <slurm-users-boun...@lists.schedmd.com>
Sent: 12 juli 2019 08:46
To: "Slurm User Community List" <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Running pyMPI on several nodes

My apologies. You do say that the Python program simply prints the rank, so it is a hello-world program.

On Fri, 12 Jul 2019 at 07:45, John Hearns <hear...@googlemail.com> wrote:
Please try something very simple such as a hello world program or
srun -N2 -n8 hostname

What is the error message that you get?

On Fri, 12 Jul 2019 at 07:07, Pär Lundö <par.lu...@foi.se> wrote:

Hi there Slurm experts!
I am having trouble running a python-mpi program involving more than one node. The python-mpi program is very simple: it only lists the number of ranks available in its environment. I have a munge daemon running prior to starting the Slurm service, and the program works when using a single node (so I suppose munge is working).
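For reference, the program is essentially a rank lister along these lines (an mpi4py sketch; the real python-mpi.py differs only in details):

    # python-mpi.py: each MPI task prints its rank, the total number of ranks, and its host
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    print("rank %d of %d on %s"
          % (comm.Get_rank(), comm.Get_size(), MPI.Get_processor_name()))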
In addition, I have tested to run a simple sbatch-script where each available 
node (four nodes) states its hostname and returns.
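The test script is essentially the following (a sketch; the directive values are assumptions based on my four nodes):

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --ntasks=4     # one task per node
    srun hostname          # each task prints the hostname of its node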
Since authentication in Slurm is handled via munge, do I need passwordless SSH communication between the slurmctld host and the nodes? (I found a guide, probably outdated, stating that passwordless SSH communication is a necessity for Slurm: http://admin-magazine.com/HPC/Articles/Resource-Management-with-Slurm.)

I run the python-mpi program via an sbatch script that invokes an srun command. Each node has 8 CPUs.
The srun command, when tested on two nodes, is:
"srun -N2 -n8 python3 python-mpi.py"
It works fine on a single node (with "-N1" instead of "-N2"), but it aborts when running on two nodes.
Should I use "-n16" when running on two nodes, in order to allocate the complete number of CPUs available on the two nodes?
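For reference, the sbatch script is roughly as follows (a sketch; the directives mirror the srun flags and file name above):

    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks=8     # 8 tasks over 2 nodes; -n16 would fill all 8 CPUs on both
    srun -N2 -n8 python3 python-mpi.py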
Slurm is configured and built with PMIx.
I am running Slurm 19.05 on Ubuntu 18.04 as the server, and the nodes are running the same Slurm version on Ubuntu 18.10.

Best regards,

Palle
