Yeah, we aren't connecting back - is there a firewall running? You need to leave the "--debug-daemons --mca plm_base_verbose 5" on there as well to see the entire problem.
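Something like this - just combining those flags with the ones from your runs below - should capture the whole launch in one shot (untested sketch; same hello_c binary and allocation you already have):

$ salloc -N2 --exclusive -p test -J ompi
$ mpirun -mca mca_base_env_list 'LD_PRELOAD' \
    --debug-daemons --mca plm_base_verbose 5 \
    --mca oob_base_verbose 10 --mca rml_base_verbose 10 \
    -np 2 hello_c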
What you can see here is that mpirun is listening on several interfaces:

> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 connections
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 connections
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 connections
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4 connections
> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections

It looks like you have multiple interfaces connected to the same subnet - this is generally a bad idea. I also saw that the last one in the list shows up twice in the kernel array - not sure why, but is there something special about that NIC? What do the NICs look like on the remote hosts? One workaround worth trying is to pin the OOB layer to a single interface - there is a sketch of that at the very end of this post.

On Jul 20, 2014, at 10:59 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> -------- Forwarded message --------
> From: Timur Ismagilov <tismagi...@mail.ru>
> To: Ralph Castain <r...@open-mpi.org>
> Date: Sun, 20 Jul 2014 21:58:41 +0400
> Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem
>
> Here it is:
>
> $ salloc -N2 --exclusive -p test -J ompi
> salloc: Granted job allocation 647049
>
> $ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
>
> [access1:24264] mca: base: components_register: registering oob components
> [access1:24264] mca: base: components_register: found loaded component tcp
> [access1:24264] mca: base: components_register: component tcp register function successful
> [access1:24264] mca: base: components_open: opening oob components
> [access1:24264] mca: base: components_open: found loaded component tcp
> [access1:24264] mca: base: components_open: component tcp open function successful
> [access1:24264] mca:oob:select: checking available component tcp
> [access1:24264] mca:oob:select: Querying component [tcp]
> [access1:24264] oob:tcp: component_available called
> [access1:24264] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [access1:24264] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4 connections
> [access1:24264] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4 connections
> [access1:24264] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4 connections
> [access1:24264] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4 connections
> [access1:24264] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections
> [access1:24264] WORKING INTERFACE 7 KERNEL INDEX 7 FAMILY: V4
> [access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4 connections
> [access1:24264] [[55095,0],0] TCP STARTUP
> [access1:24264] [[55095,0],0] attempting to bind to IPv4 port 0
> [access1:24264] [[55095,0],0] assigned IPv4 port 47756
> [access1:24264] mca:oob:select: Adding component to end
> [access1:24264] mca:oob:select: Found 1 active transports
> [access1:24264] mca: base: components_register: registering rml components
> [access1:24264] mca: base: components_register: found loaded component oob
> [access1:24264] mca: base: components_register: component oob has no register or open function
> [access1:24264] mca: base: components_open: opening rml components
> [access1:24264] mca: base: components_open: found loaded component oob
> [access1:24264] mca: base: components_open: component oob open function successful
> [access1:24264] orte_rml_base_select: initializing rml component oob
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 30 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 15 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 32 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 33 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 5 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 10 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 12 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 9 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 34 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 2 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 21 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 22 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 45 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 46 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 1 for peer [[WILDCARD],WILDCARD]
> [access1:24264] [[55095,0],0] posting recv
> [access1:24264] [[55095,0],0] posting persistent recv on tag 27 for peer [[WILDCARD],WILDCARD]
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
> [access1:24264] mca: base: close: component oob closed
> [access1:24264] mca: base: close: unloading component oob
> [access1:24264] [[55095,0],0] TCP SHUTDOWN
> [access1:24264] mca: base: close: component tcp closed
> [access1:24264] mca: base: close: unloading component tcp
>
> When I use srun, I get:
>
> $ salloc -N2 --exclusive -p test -J ompi
> ....
> $ srun -N 2 ./hello_c
> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16, 2014 (nightly snapshot tarball), 146)
> Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16, 2014 (nightly snapshot tarball), 146)
>
> Sun, 20 Jul 2014 09:28:13 -0700 from Ralph Castain <r...@open-mpi.org>:
>
> Try adding -mca oob_base_verbose 10 -mca rml_base_verbose 10 to your cmd line. It looks to me like we are unable to connect back to the node where you are running mpirun for some reason.
>
> On Jul 20, 2014, at 9:16 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
>> I have the same problem with Open MPI 1.8.1 (Apr 23, 2014).
>> Does the srun command have a --map-by <foo> mpirun parameter, or can I change it from the bash environment?
>>
>> -------- Forwarded message --------
>> From: Timur Ismagilov <tismagi...@mail.ru>
>> To: Mike Dubman <mi...@dev.mellanox.co.il>
>> Cc: Open MPI Users <us...@open-mpi.org>
>> Date: Thu, 17 Jul 2014 16:42:24 +0400
>> Subject: Re[4]: [OMPI users] Salloc and mpirun problem
>>
>> With Open MPI 1.9a1r32252 (Jul 16, 2014 nightly snapshot tarball) I got this output (same?):
>>
>> $ salloc -N2 --exclusive -p test -J ompi
>> salloc: Granted job allocation 645686
>>
>> $ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
>>
>> [access1:04312] mca: base: components_register: registering plm components
>> [access1:04312] mca: base: components_register: found loaded component isolated
>> [access1:04312] mca: base: components_register: component isolated has no register or open function
>> [access1:04312] mca: base: components_register: found loaded component rsh
>> [access1:04312] mca: base: components_register: component rsh register function successful
>> [access1:04312] mca: base: components_register: found loaded component slurm
>> [access1:04312] mca: base: components_register: component slurm register function successful
>> [access1:04312] mca: base: components_open: opening plm components
>> [access1:04312] mca: base: components_open: found loaded component isolated
>> [access1:04312] mca: base: components_open: component isolated open function successful
>> [access1:04312] mca: base: components_open: found loaded component rsh
>> [access1:04312] mca: base: components_open: component rsh open function successful
>> [access1:04312] mca: base: components_open: found loaded component slurm
>> [access1:04312] mca: base: components_open: component slurm open function successful
>> [access1:04312] mca:base:select: Auto-selecting plm components
>> [access1:04312] mca:base:select:( plm) Querying component [isolated]
>> [access1:04312] mca:base:select:( plm) Query of component [isolated] set priority to 0
>> [access1:04312] mca:base:select:( plm) Querying component [rsh]
>> [access1:04312] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [access1:04312] mca:base:select:( plm) Querying component [slurm]
>> [access1:04312] mca:base:select:( plm) Query of component [slurm] set priority to 75
>> [access1:04312] mca:base:select:( plm) Selected component [slurm]
>> [access1:04312] mca: base: close: component isolated closed
>> [access1:04312] mca: base: close: unloading component isolated
>> [access1:04312] mca: base: close: component rsh closed
>> [access1:04312] mca: base: close: unloading component rsh
>> Daemon was launched on node1-128-09 - beginning to initialize
>> Daemon was launched on node1-128-15 - beginning to initialize
>> Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
>> [node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for commands!
>> Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
>> [node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for commands!
>> srun: error: node1-128-09: task 0: Exited with exit code 1
>> srun: Terminating job step 645686.3
>> srun: error: node1-128-15: task 1: Exited with exit code 1
>> --------------------------------------------------------------------------
>> An ORTE daemon has unexpectedly failed after launch and before
>> communicating back to mpirun. This could be caused by a number
>> of factors, including an inability to create a connection back
>> to mpirun due to a lack of common network interfaces and/or no
>> route found between them. Please check network connectivity
>> (including firewalls and network routing requirements).
>> --------------------------------------------------------------------------
>> [access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
>> [access1:04312] mca: base: close: component slurm closed
>> [access1:04312] mca: base: close: unloading component slurm
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/07/24828.php
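Referenced at the top of this post: one way to sidestep the multi-interface ambiguity is to restrict the OOB layer to a single network, so the daemons only try one path back to mpirun. This is an untested sketch; oob_tcp_if_include accepts NIC names or CIDR subnets, and 10.0.251.0/24 is just one of the subnets from the interface list above - substitute whichever network is actually routable between access1 and the compute nodes (btl_tcp_if_include can pin the MPI traffic in the same way):

$ mpirun --mca oob_tcp_if_include 10.0.251.0/24 \
    -mca mca_base_env_list 'LD_PRELOAD' \
    --debug-daemons --mca plm_base_verbose 5 \
    -np 2 hello_c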