Hi,

Stack:
EL 6.7, Mellanox OFED 3.1 (IB FDR) and SLURM 15.08.9 (without *.la libs).

Problem:
OpenMPI 1.10.x built with PMI support does not work with the sbatch/salloc + 
mpirun combination. Direct launch with srun ompi_mpi_app works fine.

The older 1.8.x version works fine under the same salloc session.
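
For reference, the failing sequence looks roughly like this (two nodes, one 
task per node; exact options are just an example):

  salloc -N 2 --ntasks-per-node=1   # interactive allocation
  module load openmpi               # the 1.10.x --with-pmi build
  mpirun ./ompi_mpi_app             # orteds die, see output below
  srun ./ompi_mpi_app               # same allocation, runs fine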

./configure --with-slurm --with-verbs --with-hwloc=internal --with-pmi 
--with-cuda=/appl/opt/cuda/7.5/ --with-pic --enable-shared 
--enable-mpi-thread-multiple --enable-contrib-no-build=vt
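
In case the PMI library itself is suspect: which SLURM libpmi the build 
actually picked up can be checked with something like the following, assuming 
the standard lib/openmpi component layout:

  ldd $(dirname $(which mpirun))/../lib/openmpi/mca_ess_pmi.so | grep -i pmi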


I also tried 1.10.3a from git.


mpirun -debug-daemons ./1103aompitest 
Daemon [[44437,0],1] checking in as pid 40979 on host g59
Daemon [[44437,0],2] checking in as pid 23566 on host g60
[g59:40979] [[44437,0],1] orted: up and running - waiting for commands!
[g60:23566] [[44437,0],2] orted: up and running - waiting for commands!
[g59:40979] [[44437,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[g59:40979] [[44437,0],1]:errmgr_default_orted.c(260) updating exit status to 1
[g60:23566] [[44437,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[g60:23566] [[44437,0],2]:errmgr_default_orted.c(260) updating exit status to 1
srun: error: g59: task 0: Exited with exit code 1
srun: Terminating job step 8922923.1
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: g60: task 1: Exited with exit code 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[login2:48425] [[44437,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_HALT_VM_CMD
[login2:48425] [[44437,0],0] orted_cmd: received halt_vm cmd
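
So the daemons do start under srun; they die when connecting back to mpirun on 
the login node. That phase can be made more talkative with something like the 
following (verbosity levels are arbitrary):

  mpirun -debug-daemons --mca plm_base_verbose 10 --mca oob_base_verbose 10 ./1103aompitest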


[GPU-Env mpi]$ srun ./1103aompitest 
g59: Before MPI_INIT 
g59: After MPI_INIT 
Hello world! I'm 0 of 2 on g59
g60: Before MPI_INIT 
g60: After MPI_INIT 
Hello world! I'm 1 of 2 on g60
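
Direct launch picks up PMI from SLURM itself; the PMI flavour can also be 
pinned explicitly if that makes any difference (assuming SLURM was built with 
PMI2 support):

  srun --mpi=pmi2 ./1103aompitest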

ompi_info --parsable | grep pmi

mca:db:pmi:version:mca:2.0.0
mca:db:pmi:version:api:1.0.0
mca:db:pmi:version:component:1.10.3
mca:ess:pmi:version:mca:2.0.0
mca:ess:pmi:version:api:3.0.0
mca:ess:pmi:version:component:1.10.3
mca:grpcomm:pmi:version:mca:2.0.0
mca:grpcomm:pmi:version:api:2.0.0
mca:grpcomm:pmi:version:component:1.10.3
mca:pubsub:pmi:version:mca:2.0.0
mca:pubsub:pmi:version:api:2.0.0
mca:pubsub:pmi:version:component:1.10.3


module swap openmpi openmpi/1.8.6


[GPU-Env mpi]$ mpirun -debug-daemons ./ompigcc184 
Daemon [[810,0],2] checking in as pid 55443 on host g60
Daemon [[810,0],1] checking in as pid 73091 on host g59
[g60:55443] [[810,0],2] orted: up and running - waiting for commands!
[g59:73091] [[810,0],1] orted: up and running - waiting for commands!
[login2:05014] [[810,0],0] orted_cmd: received add_local_procs
[g59:73091] [[810,0],1] orted_cmd: received add_local_procs
[g60:55443] [[810,0],2] orted_cmd: received add_local_procs
g60: Before MPI_INIT 
g59: Before MPI_INIT 
[g60:55443] [[810,0],2] orted_recv: received sync+nidmap from local proc [[810,1],1]
[g59:73091] [[810,0],1] orted_recv: received sync+nidmap from local proc [[810,1],0]
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, g59, ompigcc184, 73096)
(i, host, exe, pid) = (1, g60, ompigcc184, 55448)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[login2:05014] [[810,0],0] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
[taito-login2.csc.fi:05014] [[810,0],0] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
g59: After MPI_INIT 
Hello world! I'm 0 of 2 on g59
g60: After MPI_INIT 
Hello world! I'm 1 of 2 on g60
[login2:5014] [[810,0],0] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_recv: received sync from local proc [[810,1],0]
[g60:55443] [[810,0],2] orted_recv: received sync from local proc [[810,1],1]
[login2:05014] [[810,0],0] orted_cmd: received exit cmd
[g60:55443] [[810,0],2] orted_cmd: received exit cmd
[g59:73091] [[810,0],1] orted_cmd: received exit cmd
[g60:55443] [[810,0],2] orted_cmd: all routes and children gone - exiting
[g59:73091] [[810,0],1] orted_cmd: all routes and children gone - exiting


[GPU-Env mpi]$ ompi_info -parsable |grep pmi
mca:db:pmi:version:mca:2.0
mca:db:pmi:version:api:1.0
mca:db:pmi:version:component:1.8.6
mca:ess:pmi:version:mca:2.0
mca:ess:pmi:version:api:3.0
mca:ess:pmi:version:component:1.8.6
mca:grpcomm:pmi:version:mca:2.0
mca:grpcomm:pmi:version:api:2.0
mca:grpcomm:pmi:version:component:1.8.6
mca:pubsub:pmi:version:mca:2.0
mca:pubsub:pmi:version:api:2.0
mca:pubsub:pmi:version:component:1.8.6
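
In case it helps to narrow things down, the PMI components listed above can be 
disabled selectively for a 1.10.x test run under the same salloc, using the 
usual MCA negation syntax (just a diagnostic sketch, not a fix):

  mpirun -debug-daemons --mca ess ^pmi ./1103aompitest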
