Hi, our stack: EL 6.7, Mellanox OFED 3.1 (IB FDR) and Slurm 15.08.9 (built without the *.la libs).
Problem: Open MPI 1.10.x built with PMI support does not work when launched with the sbatch/salloc + mpirun combination; srun ompi_mpi_app works fine, and the older 1.8.x version works fine under the same salloc session. Configure line:

./configure --with-slurm --with-verbs --with-hwloc=internal --with-pmi --with-cuda=/appl/opt/cuda/7.5/ --with-pic --enable-shared --enable-mpi-thread-multiple --enable-contrib-no-build=vt

I also tried 1.10.3a from git.

mpirun -debug-daemons ./1103aompitest
Daemon [[44437,0],1] checking in as pid 40979 on host g59
Daemon [[44437,0],2] checking in as pid 23566 on host g60
[g59:40979] [[44437,0],1] orted: up and running - waiting for commands!
[g60:23566] [[44437,0],2] orted: up and running - waiting for commands!
[g59:40979] [[44437,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[g59:40979] [[44437,0],1]:errmgr_default_orted.c(260) updating exit status to 1
[g60:23566] [[44437,0],2] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
[g60:23566] [[44437,0],2]:errmgr_default_orted.c(260) updating exit status to 1
srun: error: g59: task 0: Exited with exit code 1
srun: Terminating job step 8922923.1
srun: Job step aborted: Waiting up to 12 seconds for job step to finish.
srun: error: g60: task 1: Exited with exit code 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of factors,
including an inability to create a connection back to mpirun due to a lack
of common network interfaces and/or no route found between them. Please
check network connectivity (including firewalls and network routing
requirements).
--------------------------------------------------------------------------
[login2:48425] [[44437,0],0] orted:comm:process_commands() Processing Command: ORTE_DAEMON_HALT_VM_CMD
[login2:48425] [[44437,0],0] orted_cmd: received halt_vm cmd

srun of the same binary works:

[GPU-Env mpi]$ srun ./1103aompitest
g59: Before MPI_INIT
g59: After MPI_INIT
Hello world! I'm 0 of 2 on g59
g60: Before MPI_INIT
g60: After MPI_INIT
Hello world! I'm 1 of 2 on g60

ompi_info --parsable |grep pmi
mca:db:pmi:version:mca:2.0.0
mca:db:pmi:version:api:1.0.0
mca:db:pmi:version:component:1.10.3
mca:ess:pmi:version:mca:2.0.0
mca:ess:pmi:version:api:3.0.0
mca:ess:pmi:version:component:1.10.3
mca:grpcomm:pmi:version:mca:2.0.0
mca:grpcomm:pmi:version:api:2.0.0
mca:grpcomm:pmi:version:component:1.10.3
mca:pubsub:pmi:version:mca:2.0.0
mca:pubsub:pmi:version:api:2.0.0
mca:pubsub:pmi:version:component:1.10.3
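For reference, the output above comes from a minimal MPI hello world. The exact source of the test binaries (1103aompitest, ompigcc184) isn't included in this post, so the following is only a sketch that produces the same kind of output; variable names and buffer sizes are my own. It is built with the mpicc of whichever Open MPI module is loaded.

/* Minimal sketch of the test program, not the exact source of 1103aompitest. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char host[256];
    int rank, size;

    gethostname(host, sizeof(host));       /* node name, e.g. g59 or g60 */
    printf("%s: Before MPI_INIT\n", host);

    MPI_Init(&argc, &argv);
    printf("%s: After MPI_INIT\n", host);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world! I'm %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}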
For comparison, swapping to 1.8.6 in the same salloc session:

module swap openmpi openmpi/1.8.6
[GPU-Env mpi]$ mpirun -debug-daemons ./ompigcc184
Daemon [[810,0],2] checking in as pid 55443 on host g60
Daemon [[810,0],1] checking in as pid 73091 on host g59
[g60:55443] [[810,0],2] orted: up and running - waiting for commands!
[g59:73091] [[810,0],1] orted: up and running - waiting for commands!
[login2:05014] [[810,0],0] orted_cmd: received add_local_procs
[g59:73091] [[810,0],1] orted_cmd: received add_local_procs
[g60:55443] [[810,0],2] orted_cmd: received add_local_procs
g60: Before MPI_INIT
g59: Before MPI_INIT
[g60:55443] [[810,0],2] orted_recv: received sync+nidmap from local proc [[810,1],1]
[g59:73091] [[810,0],1] orted_recv: received sync+nidmap from local proc [[810,1],0]
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
  (i, host, exe, pid) = (0, g59, ompigcc184, 73096)
  (i, host, exe, pid) = (1, g60, ompigcc184, 55448)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[login2:05014] [[810,0],0] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
[taito-login2.csc.fi:05014] [[810,0],0] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
g59: After MPI_INIT
Hello world! I'm 0 of 2 on g59
g60: After MPI_INIT
Hello world! I'm 1 of 2 on g60
[login2:5014] [[810,0],0] orted_cmd: received message_local_procs
[g60:55443] [[810,0],2] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_cmd: received message_local_procs
[g59:73091] [[810,0],1] orted_recv: received sync from local proc [[810,1],0]
[g60:55443] [[810,0],2] orted_recv: received sync from local proc [[810,1],1]
[login2:05014] [[810,0],0] orted_cmd: received exit cmd
[g60:55443] [[810,0],2] orted_cmd: received exit cmd
[g59:73091] [[810,0],1] orted_cmd: received exit cmd
[g60:55443] [[810,0],2] orted_cmd: all routes and children gone - exiting
[g59:73091] [[810,0],1] orted_cmd: all routes and children gone - exiting

[GPU-Env mpi]$ ompi_info -parsable |grep pmi
mca:db:pmi:version:mca:2.0
mca:db:pmi:version:api:1.0
mca:db:pmi:version:component:1.8.6
mca:ess:pmi:version:mca:2.0
mca:ess:pmi:version:api:3.0
mca:ess:pmi:version:component:1.8.6
mca:grpcomm:pmi:version:mca:2.0
mca:grpcomm:pmi:version:api:2.0
mca:grpcomm:pmi:version:component:1.8.6
mca:pubsub:pmi:version:mca:2.0
mca:pubsub:pmi:version:api:2.0
mca:pubsub:pmi:version:component:1.8.6