-------- Forwarded message --------
From: Timur Ismagilov <tismagi...@mail.ru>
To: Ralph Castain <r...@open-mpi.org>
Date: Sun, 20 Jul 2014 21:58:41 +0400
Subject: Re[2]: [OMPI users] Fwd: Re[4]: Salloc and mpirun problem
Here it is:
$ salloc -N2 --exclusive -p test -J ompi
salloc: Granted job allocation 647049
$ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c
[access1:24264] mca: base: components_register: registering oob components
[access1:24264] mca: base: components_register: found loaded component tcp
[access1:24264] mca: base: components_register: component tcp register function
successful
[access1:24264] mca: base: components_open: opening oob components
[access1:24264] mca: base: components_open: found loaded component tcp
[access1:24264] mca: base: components_open: component tcp open function
successful
[access1:24264] mca:oob:select: checking available component tcp
[access1:24264] mca:oob:select: Querying component [tcp]
[access1:24264] oob:tcp: component_available called
[access1:24264] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
[access1:24264] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.251.51 to our list of V4
connections
[access1:24264] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.0.0.111 to our list of V4
connections
[access1:24264] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.2.251.11 to our list of V4
connections
[access1:24264] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 10.128.0.1 to our list of V4
connections
[access1:24264] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4
connections
[access1:24264] WORKING INTERFACE 7 KERNEL INDEX 7 FAMILY: V4
[access1:24264] [[55095,0],0] oob:tcp:init adding 93.180.7.36 to our list of V4
connections
[access1:24264] [[55095,0],0] TCP STARTUP
[access1:24264] [[55095,0],0] attempting to bind to IPv4 port 0
[access1:24264] [[55095,0],0] assigned IPv4 port 47756
[access1:24264] mca:oob:select: Adding component to end
[access1:24264] mca:oob:select: Found 1 active transports
[access1:24264] mca: base: components_register: registering rml components
[access1:24264] mca: base: components_register: found loaded component oob
[access1:24264] mca: base: components_register: component oob has no register
or open function
[access1:24264] mca: base: components_open: opening rml components
[access1:24264] mca: base: components_open: found loaded component oob
[access1:24264] mca: base: components_open: component oob open function
successful
[access1:24264] orte_rml_base_select: initializing rml component oob
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 30 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 15 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 32 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 33 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 5 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 10 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 12 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 9 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 34 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 2 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 21 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 22 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 45 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 46 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 1 for peer
[[WILDCARD],WILDCARD]
[access1:24264] [[55095,0],0] posting recv
[access1:24264] [[55095,0],0] posting persistent recv on tag 27 for peer
[[WILDCARD],WILDCARD]
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
[access1:24264] mca: base: close: component oob closed
[access1:24264] mca: base: close: unloading component oob
[access1:24264] [[55095,0],0] TCP SHUTDOWN
[access1:24264] mca: base: close: component tcp closed
[access1:24264] mca: base: close: unloading component tcp
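(A hedged sketch, not a verified fix: since the help message above points at interface
selection, one thing worth trying is pinning Open MPI's out-of-band TCP traffic to a
network that both access1 and the compute nodes actually share. oob_tcp_if_include
accepts interface names or CIDR subnets; the 10.0.251.0/24 value below is only a
placeholder taken from the interface list above and may not be the right network here.)
# sketch only: replace 10.0.251.0/24 with a subnet reachable from both mpirun and the nodes
$ mpirun -mca mca_base_env_list 'LD_PRELOAD' -mca oob_tcp_if_include 10.0.251.0/24 \
    -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 2 hello_c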
When I use srun, I get:
$ salloc -N2 --exclusive -p test -J ompi
....
$ srun -N 2 ./hello_c
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16,
2014 (nightly snapshot tarball), 146)
Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
semenov@compiler-2 Distribution, ident: 1.9a1r32252, repo rev: r32252, Jul 16,
2014 (nightly snapshot tarball), 146)
Sun, 20 Jul 2014 09:28:13 -0700 from Ralph Castain <r...@open-mpi.org>:
>Try adding -mca oob_base_verbose 10 -mca rml_base_verbose 10 to your cmd line.
>It looks to me like we are unable to connect back to the node where you are
>running mpirun for some reason.
>
>
>On Jul 20, 2014, at 9:16 AM, Timur Ismagilov < tismagi...@mail.ru > wrote:
>>I have the same problem in Open MPI 1.8.1 (Apr 23, 2014).
>>Does the srun command have an equivalent of mpirun's --map-by <foo> parameter, or
>>can I change it from the bash environment?
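>>(A hedged aside, not a confirmed answer: srun's own placement option is -m/--distribution
>>rather than --map-by, and mpirun's --map-by presumably corresponds to the
>>rmaps_base_mapping_policy MCA parameter, which, like any MCA parameter, can also be set
>>from the shell via an OMPI_MCA_ environment variable. A sketch, with "node" as an
>>illustrative policy value:)
>># assumed environment-variable form of "--map-by node"
>>$ export OMPI_MCA_rmaps_base_mapping_policy=node
>>$ mpirun -np 2 hello_c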
>>
>>
>>-------- Forwarded message --------
>>From: Timur Ismagilov < tismagi...@mail.ru >
>>To: Mike Dubman < mi...@dev.mellanox.co.il >
>>Cc: Open MPI Users < us...@open-mpi.org >
>>Date: Thu, 17 Jul 2014 16:42:24 +0400
>>Subject: Re[4]: [OMPI users] Salloc and mpirun problem
>>
>>With Open MPI 1.9a1r32252 (Jul 16, 2014, nightly snapshot tarball) I got this output (the same?):
>>$ salloc -N2 --exclusive -p test -J ompi
>>salloc: Granted job allocation 645686
>>
>>$ LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so mpirun -mca mca_base_env_list 'LD_PRELOAD' --mca plm_base_verbose 10 --debug-daemons -np 1 hello_c
>>[access1:04312] mca: base: components_register: registering plm components
>>[access1:04312] mca: base: components_register: found loaded component
>>isolated
>>[access1:04312] mca: base: components_register: component isolated has no
>>register or open function
>>[access1:04312] mca: base: components_register: found loaded component rsh
>>[access1:04312] mca: base: components_register: component rsh register
>>function successful
>>[access1:04312] mca: base: components_register: found loaded component slurm
>>[access1:04312] mca: base: components_register: component slurm register
>>function successful
>>[access1:04312] mca: base: components_open: opening plm components
>>[access1:04312] mca: base: components_open: found loaded component isolated
>>[access1:04312] mca: base: components_open: component isolated open function
>>successful
>>[access1:04312] mca: base: components_open: found loaded component rsh
>>[access1:04312] mca: base: components_open: component rsh open function
>>successful
>>[access1:04312] mca: base: components_open: found loaded component slurm
>>[access1:04312] mca: base: components_open: component slurm open function
>>successful
>>[access1:04312] mca:base:select: Auto-selecting plm components
>>[access1:04312] mca:base:select:( plm) Querying component [isolated]
>>[access1:04312] mca:base:select:( plm) Query of component [isolated] set
>>priority to 0
>>[access1:04312] mca:base:select:( plm) Querying component [rsh]
>>[access1:04312] mca:base:select:( plm) Query of component [rsh] set priority
>>to 10
>>[access1:04312] mca:base:select:( plm) Querying component [slurm]
>>[access1:04312] mca:base:select:( plm) Query of component [slurm] set
>>priority to 75
>>[access1:04312] mca:base:select:( plm) Selected component [slurm]
>>[access1:04312] mca: base: close: component isolated closed
>>[access1:04312] mca: base: close: unloading component isolated
>>[access1:04312] mca: base: close: component rsh closed
>>[access1:04312] mca: base: close: unloading component rsh
>>Daemon was launched on node1-128-09 - beginning to initialize
>>Daemon was launched on node1-128-15 - beginning to initialize
>>Daemon [[39207,0],1] checking in as pid 26240 on host node1-128-09
>>[node1-128-09:26240] [[39207,0],1] orted: up and running - waiting for
>>commands!
>>Daemon [[39207,0],2] checking in as pid 30129 on host node1-128-15
>>[node1-128-15:30129] [[39207,0],2] orted: up and running - waiting for
>>commands!
>>srun: error: node1-128-09: task 0: Exited with exit code 1
>>srun: Terminating job step 645686.3
>>srun: error: node1-128-15: task 1: Exited with exit code 1
>>--------------------------------------------------------------------------
>>An ORTE daemon has unexpectedly failed after launch and before
>>communicating back to mpirun. This could be caused by a number
>>of factors, including an inability to create a connection back
>>to mpirun due to a lack of common network interfaces and/or no
>>route found between them. Please check network connectivity
>>(including firewalls and network routing requirements).
>>--------------------------------------------------------------------------
>>[access1:04312] [[39207,0],0] orted_cmd: received halt_vm cmd
>>[access1:04312] mca: base: close: component slurm closed
>>[access1:04312] mca: base: close: unloading component slurm
>>
>>
----------------------------------------------------------------------