Steve Kargl wrote:
On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
The easiest way to fix this is likely to use the btl_tcp_if_include
or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly
which interfaces to use:
http://www.open-mpi.org/faq/?category=tcp#tcp-selection
Perhaps I'm misreading the output again, but it appears that
1.4.4rc2 does not even see the 2nd NIC.
hpc:kargl[317] ifconfig bge0
bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
ether 00:e0:81:40:48:92
inet 10.208.78.111 netmask 0xffffff00 broadcast 10.208.78.255
inet6 fe80::2e0:81ff:fe40:4892%bge0 prefixlen 64 scopeid 0x3
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
hpc:kargl[318] ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
ether 00:e0:81:40:48:93
inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
inet6 fe80::2e0:81ff:fe40:4893%bge1 prefixlen 64 scopeid 0x4
nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
media: Ethernet autoselect (1000baseT <full-duplex>)
status: active
kargl[319] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 30 \
--mca btl_tcp_if_include bge1 -machinefile mf1 ./z
hpc:kargl[320] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 10 \
--mca btl_tcp_if_include bge1 -machinefile mf1 ./z
[hpc.apl.washington.edu:12295] mca: base: components_open: Looking for btl components
[hpc.apl.washington.edu:12295] mca: base: components_open: opening btl components
[hpc.apl.washington.edu:12295] mca: base: components_open: found loaded component self
[hpc.apl.washington.edu:12295] mca: base: components_open: component self has no register function
[hpc.apl.washington.edu:12295] mca: base: components_open: component self open function successful
[hpc.apl.washington.edu:12295] mca: base: components_open: found loaded component sm
[hpc.apl.washington.edu:12295] mca: base: components_open: component sm has no register function
[hpc.apl.washington.edu:12295] mca: base: components_open: component sm open function successful
[hpc.apl.washington.edu:12295] mca: base: components_open: found loaded component tcp
[hpc.apl.washington.edu:12295] mca: base: components_open: component tcp has no register function
[hpc.apl.washington.edu:12295] mca: base: components_open: component tcp open function successful
[hpc.apl.washington.edu:12295] select: initializing btl component self
[hpc.apl.washington.edu:12295] select: init of component self returned success
[hpc.apl.washington.edu:12295] select: initializing btl component sm
[hpc.apl.washington.edu:12295] select: init of component sm returned success
[hpc.apl.washington.edu:12295] select: initializing btl component tcp
[hpc.apl.washington.edu:12295] select: init of component tcp returned success
[node11.cimu.org:21878] mca: base: components_open: Looking for btl components
[node11.cimu.org:21878] mca: base: components_open: opening btl components
[node11.cimu.org:21878] mca: base: components_open: found loaded component self
[node11.cimu.org:21878] mca: base: components_open: component self has no register function
[node11.cimu.org:21878] mca: base: components_open: component self open function successful
[node11.cimu.org:21878] mca: base: components_open: found loaded component sm
[node11.cimu.org:21878] mca: base: components_open: component sm has no register function
[node11.cimu.org:21878] mca: base: components_open: component sm open function successful
[node11.cimu.org:21878] mca: base: components_open: found loaded component tcp
[node11.cimu.org:21878] mca: base: components_open: component tcp has no register function
[node11.cimu.org:21878] mca: base: components_open: component tcp open function successful
[node11.cimu.org:21878] select: initializing btl component self
[node11.cimu.org:21878] select: init of component self returned success
[node11.cimu.org:21878] select: initializing btl component sm
[node11.cimu.org:21878] select: init of component sm returned success
[node11.cimu.org:21878] select: initializing btl component tcp
[node11.cimu.org][[13916,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "bge1"
[node11.cimu.org:21878] select: init of component tcp returned success
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications. This means that no Open MPI device has indicated
that it can be used to communicate between these processes. This is
an error; Open MPI requires that all MPI processes be able to reach
each other. This error can sometimes be the result of forgetting to
specify the "self" BTL.
Hi Steve
The error says that bge1 is not valid on node11; it is not complaining
about node10/hpc, where you ran ifconfig.
Could the interface names and their subnets/IPs
vary from node to node?
(E.g., bge0 could be associated with 192.168.0.11 on node11, instead of bge1.)
Is it possible that bge1 is on the 192.168.0.0 subnet only on node10,
while on the other nodes it is bge0 that connects
to the 192.168.0.0 subnet?
If you include only bge1 in your mca btl switch,
you are assuming that all nodes can reach
each other via an interface called bge1.
Is this really the case?
You may want to run ifconfig on all nodes to check.
Alternatively, you could exclude node10 from your host file
and try to run the job on the remaining nodes
(and maybe not restrict the interface names with any btl switch).
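
As a rough sketch of the ifconfig check suggested above, the shell function
below reads ifconfig output on stdin and prints the name of whichever
interface carries an IPv4 address with a given prefix. The function name
iface_for_prefix is made up for illustration, and the commented-out loop
assumes that mf1 lists one hostname per line and that passwordless ssh to
each node works; neither is confirmed in this thread.

```shell
#!/bin/sh
# iface_for_prefix PREFIX
# Read ifconfig output on stdin and print the name of every interface
# that has an IPv4 address starting with PREFIX (hypothetical helper).
iface_for_prefix() {
    awk -v prefix="$1" '
        # Interface header line, e.g. "bge1: flags=8843<...> metric 0 mtu 1500"
        /^[A-Za-z0-9_.]+: flags=/ { iface = $1; sub(/:$/, "", iface); next }
        # IPv4 address line, e.g. "inet 192.168.0.10 netmask 0xffffff00 ..."
        $1 == "inet" && index($2, prefix) == 1 { print iface }
    '
}

# Hypothetical usage (assumes one hostname per line in mf1 and working ssh):
# for host in $(cat mf1); do
#     printf '%s: ' "$host"
#     ssh "$host" ifconfig -a | iface_for_prefix 192.168.0.
# done
```

If the loop prints bge1 for some hosts and bge0 for others, that would be
consistent with the "invalid interface" message coming only from node11.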
I hope this helps,
Gus Correa
PS - Your next email, saying that it works with
"--mca btl_tcp_if_include bge1,bge0",
suggests that node11 and higher use bge0 for 192.168.0.0
instead of bge1.