Try adding -mca oob_base_verbose 10 -mca rml_base_verbose 10 to your cmd line. 
It looks to me like we are unable to connect back to the node where you are 
running mpirun for some reason.
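For example, taking the command from your message below and adding those two flags (everything else unchanged), it would look roughly like this:

  LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so \
    mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 \
    -mca oob_base_verbose 10 -mca rml_base_verbose 10 --debug-daemons -np 1 hello_c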


On Jul 20, 2014, at 9:16 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> I have the same problem with Open MPI 1.8.1 (Apr 23, 2014).
> Does the srun command have an equivalent of the --map-by <foo> mpirun parameter, or can I 
> change it from the bash environment?
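> (Side note, a hedged sketch: if I remember correctly, mpirun options generally correspond to 
> MCA parameters that can also be set as OMPI_MCA_* environment variables before launching, so 
> the --map-by policy could presumably be set from the shell, assuming the parameter name 
> rmaps_base_mapping_policy is the one behind --map-by:
> 
>   export OMPI_MCA_rmaps_base_mapping_policy=slot:pe=8
>   mpirun -np 1 hello_c
> )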
> 
> 
> 
> -------- Forwarded message --------
> From: Timur Ismagilov <tismagi...@mail.ru>
> To: Mike Dubman <mi...@dev.mellanox.co.il>
> Cc: Open MPI Users <us...@open-mpi.org>
> Date: Thu, 17 Jul 2014 16:42:24 +0400
> Subject: Re[4]: [OMPI users] Salloc and mpirun problem
> 
> 
> With Open MPI 1.9a1r32252 (Jul 16, 2014 nightly snapshot tarball) I got 
> this output (same as before?):
> 
> $ salloc -N2 --exclusive -p test -J ompi
> salloc: Granted job allocation 645686
> 
> $LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>   mpirun  -mca mca_base_env_list 'LD_PRELOAD'  --mca plm_base_verbose 10 
> --debug-daemons -np 1 hello_c
> 
> [access1:04312] mca: base: components_register: registering plm components
> [access1:04312] mca: base: components_register: found loaded component 
> isolated
> [access1:04312] mca: base: components_register: component isolated has no 
> register or open function
> [access1:04312] mca: base: components_register: found loaded component rsh
> [access1:04312] mca: base: components_register: component rsh register 
> function successful
> [access1:04312] mca: base: components_register: found loaded component slurm
> [access1:04312] mca: base: components_register: component slurm register 
> function successful
> [access1:04312] mca: base: components_open: opening plm components
> [access1:04312] mca: base: components_open: found loaded component isolated
> [access1:04312] mca: base: components_open: component isolated open function 
> successful
> [access1:04312] mca: base: components_open: found loaded component rsh
> [access1:04312] mca: base: components_open: component rsh open function 
> successful
> [access1:04312] mca: base: components_open: found loaded component slurm
> [access1:04312] mca: base: components_open: component slurm open function 
> successful
> [access1:04312] mca:base:select: Auto-selecting plm components
> [access1:04312] mca:base:select:( plm) Querying component [isolated]
> [access1:04312] mca:base:select:( plm) Query of component [isolated] set 
> priority to 0
> [access1:04312] mca:base:select:( plm) Querying component [rsh]
> [access1:04312] mca:base:select:( plm) Query of component [rsh] set priority 
> to 10
> [access1:04312] mca:base:select:( plm) Querying component [slurm]
> [access1:04312] mca:base:select:( plm) Query of component [slurm] set 
> priority to 75
> [access1:04312] mca:base:select:( plm) Selected component [slurm]
> [access1:04312] mca: base: close: component isolated closed
> [access1:04312] mca: base: close: unloading component isolated
> [access1:04312] mca: base: close: component rsh closed
> [access1:04312] mca: base: close: unloading component rsh
> Daemon was launched on node1-128-09 - beginning to initialize
> Daemon was launched on node1-128-15 - beginning to initialize
> Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
> [node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for 
> commands!
> Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
> [node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for 
> commands!
> srun: error: node1-128-09: task 0: Exited with exit code 1
> srun: Terminating job step 645686.3
> srun: error: node1-128-15: task 1: Exited with exit code 1
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
> [access1:04312] mca: base: close: component slurm closed
> [access1:04312] mca: base: close: unloading component slurm
> 
> 
> 
> Thu, 17 Jul 2014 11:40:24 +0300 from Mike Dubman <mi...@dev.mellanox.co.il>:
> 
> Can you use the latest ompi-1.8 from svn/git?
> Ralph, could you please advise?
> Thanks
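> (A minimal build sketch, assuming a 1.8 nightly tarball or checkout is already unpacked in the 
> current directory; the install prefix is only a placeholder:
> 
>   ./configure --prefix=$HOME/ompi-1.8-latest --with-slurm
>   make -j8 install
> )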
> 
> 
> On Wed, Jul 16, 2014 at 2:48 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
> Here it is:
> 
> 
> $ 
> LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>   mpirun  -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 
> hello_c
> 
> [access1:29064] mca: base: components_register: registering plm components
> [access1:29064] mca: base: components_register: found loaded component 
> isolated
> [access1:29064] mca: base: components_register: component isolated has no 
> register or open function
> [access1:29064] mca: base: components_register: found loaded component rsh
> [access1:29064] mca: base: components_register: component rsh register 
> function successful
> [access1:29064] mca: base: components_register: found loaded component slurm
> [access1:29064] mca: base: components_register: component slurm register 
> function successful
> [access1:29064] mca: base: components_open: opening plm components
> [access1:29064] mca: base: components_open: found loaded component isolated
> [access1:29064] mca: base: components_open: component isolated open function 
> successful
> [access1:29064] mca: base: components_open: found loaded component rsh
> [access1:29064] mca: base: components_open: component rsh open function 
> successful
> [access1:29064] mca: base: components_open: found loaded component slurm
> [access1:29064] mca: base: components_open: component slurm open function 
> successful
> [access1:29064] mca:base:select: Auto-selecting plm components
> [access1:29064] mca:base:select:(  plm) Querying component [isolated]
> [access1:29064] mca:base:select:(  plm) Query of component [isolated] set 
> priority to 0
> [access1:29064] mca:base:select:(  plm) Querying component [rsh]
> [access1:29064] mca:base:select:(  plm) Query of component [rsh] set priority 
> to 10
> [access1:29064] mca:base:select:(  plm) Querying component [slurm]
> [access1:29064] mca:base:select:(  plm) Query of component [slurm] set 
> priority to 75
> [access1:29064] mca:base:select:(  plm) Selected component [slurm]
> [access1:29064] mca: base: close: component isolated closed
> [access1:29064] mca: base: close: unloading component isolated
> [access1:29064] mca: base: close: component rsh closed
> [access1:29064] mca: base: close: unloading component rsh
> Daemon was launched on node1-128-17 - beginning to initialize
> Daemon was launched on node1-128-18 - beginning to initialize
> Daemon [[63607,0],2] checking in as pid 24538 on host node1-128-18
> [node1-128-18:24538] [[63607,0],2] orted: up and running - waiting for 
> commands!
> Daemon [[63607,0],1] checking in as pid 17192 on host node1-128-17
> [node1-128-17:17192] [[63607,0],1] orted: up and running - waiting for 
> commands!
> srun: error: node1-128-18: task 1: Exited with exit code 1
> srun: Terminating job step 645191.1
> srun: error: node1-128-17: task 0: Exited with exit code 1
> 
> 
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [access1:29064] [[63607,0],0] orted_cmd: received halt_vm cmd
> [access1:29064] mca: base: close: component slurm closed
> [access1:29064] mca: base: close: unloading component slurm
> 
> 
> Wed, 16 Jul 2014 14:20:33 +0300 from Mike Dubman <mi...@dev.mellanox.co.il>:
> 
> Please add the following flags to mpirun: "--mca plm_base_verbose 10 
> --debug-daemons" and attach the output.
> Thanks
> 
> 
> On Wed, Jul 16, 2014 at 11:12 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
> Hello!
> I have Open MPI v1.9a1r32142 and Slurm 2.5.6.
> 
> I cannot use mpirun after salloc:
> 
> $salloc -N2 --exclusive -p test -J ompi
> $LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>  mpirun -np 1 hello_c
> -----------------------------------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> ------------------------------------------------------------------------------------------------------
> But if I use mpirun in an sbatch script, it works correctly:
> $cat ompi_mxm3.0
> #!/bin/sh
> LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
>   mpirun  -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"
> 
> $sbatch -N2  --exclusive -p test -J ompi  ompi_mxm3.0 ./hello_c
> Submitted batch job 645039
> $cat slurm-645039.out 
> [warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1 
> (add); write change was 0 (none): Operation not permitted
> [warn] Epoll ADD(4) on fd 1 failed.  Old events were 0; read change was 0 
> (none); write change was 1 (add): Operation not permitted
> Hello, world, I am 0 of 2, (Open MPI v1.9a1, package: Open MPI 
> semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 
> 04, 2014 (nightly snapshot tarball), 146)
> Hello, world, I am 1 of 2, (Open MPI v1.9a1, package: Open MPI 
> semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 
> 04, 2014 (nightly snapshot tarball), 146)
> 
> Regards,
> Timur
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24777.php
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/07/24823.php
