I'm not sure whether this is a SLURM or OMPI issue, so please bear with the
cross-posting.

The OpenMPI FAQ mentions an issue with slurm 2.6.3/pmi2.
https://www.open-mpi.org/faq/?category=slurm#slurm-2.6.3-issue

I have built both Open MPI 1.7.5 and 1.8 against slurm 14.03/pmi2.

When I launch openmpi/examples/hello_c on a single-node allocation with
srun, the job step aborts:

srun -N 1 --mpi=pmi2 hello_c
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete


With --slurmd-debug=9 I get the output below. I'm not sure what
"ip 111.110.61.48 sd 14" means. Is that "ip" as in IP address? It is not
the IP address of any node in my partition.

slurmstepd: mpi/pmi2: client_resp_send: 26    cmd=kvs-put-response;rc=0;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: got client request: 14     cmd=kvs-fence;
slurmstepd: mpi/pmi2: _tree_listen_readable
slurmstepd: mpi/pmi2: _task_readable
slurmstepd: mpi/pmi2: _tree_listen_read
slurmstepd: _tree_listen_read: accepted tree connection: ip 111.110.61.48 sd 14
slurmstepd: _handle_accept_rank: going to read() client rank
slurmstepd: _handle_accept_rank: got client rank 1478164480 on fd 14
srun: error: _server_read: fd 18 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete
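
One observation on the "ip 111.110.61.48" value above: all four octets fall
in the printable ASCII range, so it may be worth checking whether they are
text rather than an address. A minimal shell sketch of that check follows;
decode_octets is a hypothetical helper name used only for illustration, not
part of SLURM or Open MPI.

```shell
# Hypothetical helper (not part of SLURM or Open MPI): print each
# dotted-decimal octet as the ASCII character with that code.
decode_octets() {
    out=""
    # Turn "111.110.61.48" into "111 110 61 48" and loop over the octets.
    for octet in $(printf '%s' "$1" | tr '.' ' '); do
        # Decimal -> three-digit octal, then let printf emit that byte.
        out="$out$(printf "\\$(printf '%03o' "$octet")")"
    done
    printf '%s\n' "$out"
}

decode_octets 111.110.61.48   # prints: on=0
```

The octets decode to "on=0". That could mean slurmstepd read part of a text
message where it expected a socket address, but this is only a guess on my
part.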

Launching with salloc/sbatch works.

- Anthony
