On Fri, Jul 08, 2011 at 04:26:35PM -0400, Gus Correa wrote:
> Steve Kargl wrote:
> >On Fri, Jul 08, 2011 at 02:19:27PM -0400, Jeff Squyres wrote:
> >>The easiest way to fix this is likely to use the btl_tcp_if_include
> >>or btl_tcp_if_exclude MCA parameters -- i.e., tell OMPI exactly
> >>which interfaces to use:
> >>
> >>   http://www.open-mpi.org/faq/?category=tcp#tcp-selection
> >>
> >
> >Perhaps I'm again misreading the output, but it appears that
> >1.4.4rc2 does not even see the 2nd NIC.
> >
> >hpc:kargl[317] ifconfig bge0
> >bge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> >      options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
> >      ether 00:e0:81:40:48:92
> >      inet 10.208.78.111 netmask 0xffffff00 broadcast 10.208.78.255
> >hpc:kargl[318] ifconfig bge1
> >bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> >      options=8009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,LINKSTATE>
> >      ether 00:e0:81:40:48:93
> >      inet 192.168.0.10 netmask 0xffffff00 broadcast 192.168.0.255
> >
> >kargl[319] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 30 \
> >    --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
> >
> >hpc:kargl[320] /usr/local/openmpi-1.4.4/bin/mpiexec --mca btl_base_verbose 10 \
> >    --mca btl_tcp_if_include bge1 -machinefile mf1 ./z
> >[hpc.apl.washington.edu:12295] mca: base: components_open: Looking for btl
> >[node11.cimu.org:21878] select: init of component self returned success
> >[node11.cimu.org:21878] select: initializing btl component sm
> >[node11.cimu.org:21878] select: init of component sm returned success
> >[node11.cimu.org:21878] select: initializing btl component tcp
> >[node11.cimu.org][[13916,1],1][btl_tcp_component.c:468:mca_btl_tcp_component_create_instances] invalid interface "bge1"
> >[node11.cimu.org:21878] select: init of component tcp returned success
> >--------------------------------------------------------------------------
> >At least one pair of MPI processes are unable to reach each other for
> >MPI communications.  This means that no Open MPI device has indicated
> >that it can be used to communicate between these processes.  This is
> >an error; Open MPI requires that all MPI processes be able to reach
> >each other.  This error can sometimes be the result of forgetting to
> >specify the "self" BTL.
>
> Hi Steve
>
> It is complaining that bge1 is not valid on node11, not on node10/hpc,
> where you ran ifconfig.
>
> Would the names of the interfaces and the matching subnet/IP
> vary from node to node?
> (E.g., bge0 could be associated with 192.168.0.11 on node11,
> instead of bge1.)
>
> Would it be possible that only on node10 bge1 is on the 192.168.0.0
> subnet, but on the other nodes it is bge0 that connects
> to the 192.168.0.0 subnet?
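A quick way to see what every node actually has is a loop over the
machinefile (a Bourne-shell sketch; it assumes passwordless ssh to each
node and that mf1 lists bare hostnames, one per line):

  for h in `cat mf1`; do
      echo "== $h"
      ssh $h "ifconfig -a | grep 'inet '"
  done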
node10 has bge0 = 10.208.x.y and bge1 = 192.168.0.10.  node11 through
node21 use bge0 = 192.168.0.N, where N = 11, ..., 21.

> If you're including only bge1 on your mca btl switch,
> supposedly all nodes are able to reach
> each other via an interface called bge1.
> Is this really the case?
> You may want to run ifconfig on all nodes to check.
>
> Alternatively, you could exclude node10 from your host file
> and try to run the job on the remaining nodes
> (and maybe not restrict the interface names with any btl switch).

Completely excluding node10 does appear to work.  Of course, this then
loses the 4 cpus and 16 GB of memory in that node.  The question to me
is why 1.4.2 works without a problem, while 1.4.3 and 1.4.4 have
problems with a node that has 2 NICs.

I suppose a follow-on question is: is there some way to get 1.4.4 to
use bge1 exclusively on node10 while using bge0 on the other nodes?
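The only mechanism I can think of for that (an untested sketch; it
assumes Open MPI reads the system-wide parameter file
$prefix/etc/openmpi-mca-params.conf locally on each node, so per-node
copies can differ) is to pin the interface in each node's local file
and drop --mca btl_tcp_if_include from the command line:

  # on node10, in /usr/local/openmpi-1.4.4/etc/openmpi-mca-params.conf
  btl_tcp_if_include = bge1

  # on node11 through node21, in the same file on each node
  btl_tcp_if_include = bge0

-- 
Steve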