Here it is:

$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
[access1:29064] mca: base: components_register: registering plm components
[access1:29064] mca: base: components_register: found loaded component isolated
[access1:29064] mca: base: components_register: component isolated has no register or open function
[access1:29064] mca: base: components_register: found loaded component rsh
[access1:29064] mca: base: components_register: component rsh register function successful
[access1:29064] mca: base: components_register: found loaded component slurm
[access1:29064] mca: base: components_register: component slurm register function successful
[access1:29064] mca: base: components_open: opening plm components
[access1:29064] mca: base: components_open: found loaded component isolated
[access1:29064] mca: base: components_open: component isolated open function successful
[access1:29064] mca: base: components_open: found loaded component rsh
[access1:29064] mca: base: components_open: component rsh open function successful
[access1:29064] mca: base: components_open: found loaded component slurm
[access1:29064] mca: base: components_open: component slurm open function successful
[access1:29064] mca:base:select: Auto-selecting plm components
[access1:29064] mca:base:select:( plm) Querying component [isolated]
[access1:29064] mca:base:select:( plm) Query of component [isolated] set priority to 0
[access1:29064] mca:base:select:( plm) Querying component [rsh]
[access1:29064] mca:base:select:( plm) Query of component [rsh] set priority to 10
[access1:29064] mca:base:select:( plm) Querying component [slurm]
[access1:29064] mca:base:select:( plm) Query of component [slurm] set priority to 75
[access1:29064] mca:base:select:( plm) Selected component [slurm]
[access1:29064] mca: base: close: component isolated closed
[access1:29064] mca: base: close: unloading component isolated
[access1:29064] mca: base: close: component rsh closed
[access1:29064] mca: base: close: unloading component rsh
Daemon was launched on node1-128-17 - beginning to initialize
Daemon was launched on node1-128-18 - beginning to initialize
Daemon [[63607,0],2] checking in as pid 24538 on host node1-128-18
[node1-128-18:24538] [[63607,0],2] orted: up and running - waiting for commands!
Daemon [[63607,0],1] checking in as pid 17192 on host node1-128-17
[node1-128-17:17192] [[63607,0],1] orted: up and running - waiting for commands!
srun: error: node1-128-18: task 1: Exited with exit code 1
srun: Terminating job step 645191.1
srun: error: node1-128-17: task 0: Exited with exit code 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number of
factors, including an inability to create a connection back to mpirun
due to a lack of common network interfaces and/or no route found
between them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
[access1:29064] [[63607,0],0] orted_cmd: received halt_vm cmd
[access1:29064] mca: base: close: component slurm closed
[access1:29064] mca: base: close: unloading component slurm

Wed, 16 Jul 2014 14:20:33 +0300 from Mike Dubman <mi...@dev.mellanox.co.il>:

>please add the following flags to mpirun: "--mca plm_base_verbose 10
>--debug-daemons" and attach the output.
>Thx
>
>
>On Wed, Jul 16, 2014 at 11:12 AM, Timur Ismagilov < tismagi...@mail.ru >
>wrote:
>>Hello!
>>I have Open MPI v1.9a1r32142 and slurm 2.5.6.
>>
>>I cannot use mpirun after salloc:
>>
>>$salloc -N2 --exclusive -p test -J ompi
>>$LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>> mpirun -np 1 hello_c
>>--------------------------------------------------------------------------
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>--------------------------------------------------------------------------
>>
>>But if I use mpirun in an sbatch script, it works correctly:
>>
>>$cat ompi_mxm3.0
>>#!/bin/sh
>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>> mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"
>>
>>$sbatch -N2 --exclusive -p test -J ompi ompi_mxm3.0 ./hello_c
>>Submitted batch job 645039
>>
>>$cat slurm-645039.out
>>[warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1
>>(add); write change was 0 (none): Operation not permitted
>>[warn] Epoll ADD(4) on fd 1 failed. Old events were 0; read change was 0
>>(none); write change was 1 (add): Operation not permitted
>>Hello, world, I am 0 of 2, (Open MPI v1.9a1, package: Open MPI
>>semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul
>>04, 2014 (nightly snapshot tarball), 146)
>>Hello, world, I am 1 of 2, (Open MPI v1.9a1, package: Open MPI
>>semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul
>>04, 2014 (nightly snapshot tarball), 146)
>>
>>Regards,
>>Timur
>>_______________________________________________
>>users mailing list
>>us...@open-mpi.org
>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>Link to this post:
>>http://www.open-mpi.org/community/lists/users/2014/07/24777.php
>
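P.S. In case it matters for reproducing: hello_c here is just the hello-world example that ships with Open MPI. I have not pasted its exact source; a minimal equivalent program (assumed, not the exact examples/hello_c.c) would be:

/* Minimal MPI hello world, roughly equivalent to the hello_c example
 * used above. Each rank prints its rank and the communicator size. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}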