I have the same problem with Open MPI 1.8.1 (Apr 23, 2014). Does srun have an equivalent of mpirun's --map-by <foo> parameter, or can I change it from the bash environment?
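(To clarify what I mean by "from the bash environment": something like the sketch below, assuming the mapping policy behind mpirun's --map-by can also be passed as an OMPI_MCA_* environment variable; I have not verified the exact parameter name on 1.8.1, so treat it as a guess.)

$ # hedged sketch, not verified on 1.8.1:
$ export OMPI_MCA_rmaps_base_mapping_policy=slot:pe=8   # intended to match "mpirun --map-by slot:pe=8"
$ salloc -N2 --exclusive -p test -J ompi
$ mpirun -np 1 hello_c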
-------- Forwarded message --------
From: Timur Ismagilov <tismagi...@mail.ru>
To: Mike Dubman <mi...@dev.mellanox.co.il>
Cc: Open MPI Users <us...@open-mpi.org>
Date: Thu, 17 Jul 2014 16:42:24 +0400
Subject: Re[4]: [OMPI users] Salloc and mpirun problem

With Open MPI 1.9a1r32252 (Jul 16, 2014, nightly snapshot tarball) I got this output (the same?):

$ salloc -N2 --exclusive -p test -J ompi
salloc: Granted job allocation 645686
$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
[access1:04312] mca: base: components_register: registering plm components
[access1:04312] mca: base: components_register: found loaded component isolated
[access1:04312] mca: base: components_register: component isolated has no register or open function
[access1:04312] mca: base: components_register: found loaded component rsh
[access1:04312] mca: base: components_register: component rsh register function successful
[access1:04312] mca: base: components_register: found loaded component slurm
[access1:04312] mca: base: components_register: component slurm register function successful
[access1:04312] mca: base: components_open: opening plm components
[access1:04312] mca: base: components_open: found loaded component isolated
[access1:04312] mca: base: components_open: component isolated open function successful
[access1:04312] mca: base: components_open: found loaded component rsh
[access1:04312] mca: base: components_open: component rsh open function successful
[access1:04312] mca: base: components_open: found loaded component slurm
[access1:04312] mca: base: components_open: component slurm open function successful
[access1:04312] mca:base:select: Auto-selecting plm components
[access1:04312] mca:base:select:( plm) Querying component [isolated]
[access1:04312] mca:base:select:( plm) Query of component [isolated] set priority to 0
[access1:04312] mca:base:select:( plm) Querying component [rsh]
[access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 10
[access1:04312] mca:base:select:( plm) Querying component [slurm]
[access1:04312] mca:base:select:( plm) Query of component [slurm] set priority to 75
[access1:04312] mca:base:select:( plm) Selected component [slurm]
[access1:04312] mca: base: close: component isolated closed
[access1:04312] mca: base: close: unloading component isolated
[access1:04312] mca: base: close: component rsh closed
[access1:04312] mca: base: close: unloading component rsh
Daemon was launched on node1-128-09 - beginning to initialize
Daemon was launched on node1-128-15 - beginning to initialize
Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
[node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
[node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
srun: error: node1-128-09: task 0: Exited with exit code 1
srun: Terminating job step 645686.3
srun: error: node1-128-15: task 1: Exited with exit code 1
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
[access1:04312] mca: base: close: component slurm closed
[access1:04312] mca: base: close: unloading component slurm

Thu, 17 Jul 2014 11:40:24 +0300, from Mike Dubman <mi...@dev.mellanox.co.il>:
>can you use latest ompi-1.8 from svn/git?
>Ralph - could you please suggest.
>Thx
>
>
>On Wed, Jul 16, 2014 at 2:48 PM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>Here it is:
>>
>>$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -x LD_PRELOAD --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
>>
>>[access1:29064] mca: base: components_register: registering plm components
>>[access1:29064] mca: base: components_register: found loaded component isolated
>>[access1:29064] mca: base: components_register: component isolated has no register or open function
>>[access1:29064] mca: base: components_register: found loaded component rsh
>>[access1:29064] mca: base: components_register: component rsh register function successful
>>[access1:29064] mca: base: components_register: found loaded component slurm
>>[access1:29064] mca: base: components_register: component slurm register function successful
>>[access1:29064] mca: base: components_open: opening plm components
>>[access1:29064] mca: base: components_open: found loaded component isolated
>>[access1:29064] mca: base: components_open: component isolated open function successful
>>[access1:29064] mca: base: components_open: found loaded component rsh
>>[access1:29064] mca: base: components_open: component rsh open function successful
>>[access1:29064] mca: base: components_open: found loaded component slurm
>>[access1:29064] mca: base: components_open: component slurm open function successful
>>[access1:29064] mca:base:select: Auto-selecting plm components
>>[access1:29064] mca:base:select:( plm) Querying component [isolated]
>>[access1:29064] mca:base:select:( plm) Query of component [isolated] set priority to 0
>>[access1:29064] mca:base:select:( plm) Querying component [rsh]
>>[access1:29064] mca:base:select:( plm) Query of component [rsh] set priority to 10
>>[access1:29064] mca:base:select:( plm) Querying component [slurm]
>>[access1:29064] mca:base:select:( plm) Query of component [slurm] set priority to 75
>>[access1:29064] mca:base:select:( plm) Selected component [slurm]
>>[access1:29064] mca: base: close: component isolated closed
>>[access1:29064] mca: base: close: unloading component isolated
>>[access1:29064] mca: base: close: component rsh closed
>>[access1:29064] mca: base: close: unloading component rsh
>>Daemon was launched on node1-128-17 - beginning to initialize
>>Daemon was launched on node1-128-18 - beginning to initialize
>>Daemon [[63607,0],2] checking in as pid 24538 on host node1-128-18
>>[node1-128-18:24538] [[63607,0],2] orted: up and running - waiting for commands!
>>Daemon [[63607,0],1] checking in as pid 17192 on host node1-128-17
>>[node1-128-17:17192] [[63607,0],1] orted: up and running - waiting for commands!
>>srun: error: node1-128-18: task 1: Exited with exit code 1
>>srun: Terminating job step 645191.1
>>srun: error: node1-128-17: task 0: Exited with exit code 1
>>
>>--------------------------------------------------------------------------
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>--------------------------------------------------------------------------
>>[access1:29064] [[63607,0],0] orted_cmd: received halt_vm cmd
>>[access1:29064] mca: base: close: component slurm closed
>>[access1:29064] mca: base: close: unloading component slurm
>>
>>
>>Wed, 16 Jul 2014 14:20:33 +0300, from Mike Dubman < mi...@dev.mellanox.co.il >:
>>>please add the following flags to mpirun, "--mca plm_base_verbose 10 --debug-daemons", and attach the output.
>>>Thx
>>>
>>>
>>>On Wed, Jul 16, 2014 at 11:12 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>>>Hello!
>>>>I have Open MPI v1.9a1r32142 and SLURM 2.5.6.
>>>>
>>>>I cannot use mpirun after salloc:
>>>>
>>>>$ salloc -N2 --exclusive -p test -J ompi
>>>>$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -np 1 hello_c
>>>>-----------------------------------------------------------------------------------------------------
>>>>An ORTE daemon has unexpectedly failed after launch and before
>>>>communicating back to mpirun. This could be caused by a number
>>>>of factors, including an inability to create a connection back
>>>>to mpirun due to a lack of common network interfaces and/or no
>>>>route found between them. Please check network connectivity
>>>>(including firewalls and network routing requirements).
>>>>------------------------------------------------------------------------------------------------------
>>>>But if I use mpirun in an sbatch script, it works correctly:
>>>>$ cat ompi_mxm3.0
>>>>#!/bin/sh
>>>>LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"
>>>>
>>>>$ sbatch -N2 --exclusive -p test -J ompi ompi_mxm3.0 ./hello_c
>>>>Submitted batch job 645039
>>>>$ cat slurm-645039.out
>>>>[warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1 (add); write change was 0 (none): Operation not permitted
>>>>[warn] Epoll ADD(4) on fd 1 failed. Old events were 0; read change was 0 (none); write change was 1 (add): Operation not permitted
>>>>Hello, world, I am 0 of 2, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 04, 2014 (nightly snapshot tarball), 146)
>>>>Hello, world, I am 1 of 2, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142, Jul 04, 2014 (nightly snapshot tarball), 146)
>>>>
>>>>Regards,
>>>>Timur
>>>>_______________________________________________
>>>>users mailing list
>>>>us...@open-mpi.org
>>>>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>Link to this post: http://www.open-mpi.org/community/lists/users/2014/07/24777.php
----------------------------------------------------------------------