Tena Sakai wrote:
Hi Kevin,

Thanks for your reply.
Dasher is physically located under my desk and vixen is in a
secure data center.

 does dasher have any network interfaces that vixen does not?

No, I don't think so.
Here is more definitive info:
  [tsakai@dasher Rmpi]$ ifconfig
  eth0      Link encap:Ethernet  HWaddr 00:1A:A0:E1:84:A9
            inet addr:172.16.0.116  Bcast:172.16.3.255  Mask:255.255.252.0
            inet6 addr: fe80::21a:a0ff:fee1:84a9/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:2347 errors:0 dropped:0 overruns:0 frame:0
            TX packets:1005 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:100
            RX bytes:531809 (519.3 KiB)  TX bytes:269872 (263.5 KiB)
            Memory:c2200000-c2220000

  lo        Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING  MTU:16436  Metric:1
            RX packets:74 errors:0 dropped:0 overruns:0 frame:0
            TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:7824 (7.6 KiB)  TX bytes:7824 (7.6 KiB)

  [tsakai@dasher Rmpi]$

However, vixen has two ethernet interfaces:

  [tsakai@vixen Rmpi]$ cat moo
  [root@vixen ec2]# /sbin/ifconfig
  eth0      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
            inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
            inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:61913135 errors:0 dropped:0 overruns:0 frame:0
            TX packets:61923635 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:47832124690 (44.5 GiB)  TX bytes:54515478860 (50.7 GiB)
            Interrupt:185 Memory:ea000000-ea012100
  eth1      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
            inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
            inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
            UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
            RX packets:5204431112 errors:0 dropped:0 overruns:0 frame:0
            TX packets:8935796075 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:371123590892 (345.6 GiB)  TX bytes:13424246629869 (12.2 TiB)
            Interrupt:193 Memory:ec000000-ec012100
  lo        Link encap:Local Loopback
            inet addr:127.0.0.1  Mask:255.0.0.0
            inet6 addr: ::1/128 Scope:Host
            UP LOOPBACK RUNNING  MTU:16436  Metric:1
            RX packets:244169216 errors:0 dropped:0 overruns:0 frame:0
            TX packets:244169216 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:1190976360356 (1.0 TiB)  TX bytes:1190976360356 (1.0 TiB)
  [root@vixen ec2]#

Please see the mail posting that follows this, my reply to Ashley,
who nailed the problem precisely.

Regards,

Tena


On 2/14/11 1:35 PM, "kevin.buck...@ecs.vuw.ac.nz"
<kevin.buck...@ecs.vuw.ac.nz> wrote:

This probably shows my lack of understanding as to how OpenMPI
negotiates the connectivity between nodes when given a choice
of interfaces, but anyway:

 does dasher have any network interfaces that vixen does not?

The scenario I am imagining would be that you ssh into dasher
from vixen using a "network" that both share and, similarly, when
you mpirun from vixen, the network that OpenMPI uses is constrained
by the interfaces that can be seen from vixen, so you are fine.

However, when you are on dasher, mpirun sees another interface which
it takes a liking to and tries to use, but that interface is not
available to vixen, so the OpenMPI processes spawned there terminate
when they cannot find that interface to talk back to dasher's
controlling process.

I know that you are no longer working with VMs, but it's along those
lines that I was thinking: extra network interfaces that you assume
won't be used but which are, and which could then be excluded by use
of an explicit

 --mca btl_tcp_if_exclude virbr0

or some such construction (virbr0 used as an example here).
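
As a fuller sketch (the hostfile name, process count, and program
name below are placeholders, and virbr0 again just stands in for
whatever extra interface ifconfig reveals):

  # Keep the TCP BTL off the loopback and the suspect interface.
  # Note that lo must be listed explicitly, because setting
  # btl_tcp_if_exclude replaces the default exclude list.
  mpirun --hostfile myhosts -np 2 \
         --mca btl_tcp_if_exclude lo,virbr0 ./my_mpi_program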

Kevin


_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Hi Tena


They seem to be connected through the LAN 172.16.0.0/255.255.252.0,
with private IPs 172.16.0.116 (dasher, eth0) and
172.16.1.107 (vixen, eth1).
These addresses are probably what OpenMPI is using.
Not much like a cluster, but just machines in a LAN.
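
One quick way to confirm that this LAN carries the traffic (a
sketch; run it on dasher, with vixen's address taken from the
ifconfig output above):

  # Ask the kernel which interface and source address it would use
  # to reach vixen:
  /sbin/ip route get 172.16.1.107
  # ...and check basic reachability over that path:
  ping -c 3 172.16.1.107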

Hence, I don't understand why there is a lack of symmetry in the
firewall protection.
Either vixen's is too loose, or dasher's is too tight, I'd venture to say.
Maybe dasher was installed later and just got whatever boilerplate
firewall comes with RedHat, CentOS, or Fedora.
If there is a gateway for this LAN somewhere with another firewall,
which is probably the case,
I'd guess it is OK to turn off dasher's firewall.
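
For instance (a hedged sketch, assuming a RHEL/CentOS-style init
system as suggested above):

  # Inspect the current rules, stop the firewall for this session,
  # and keep it from coming back at boot:
  /sbin/service iptables status
  /sbin/service iptables stop
  /sbin/chkconfig iptables off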

Do you have Internet access from either machine?

Vixen has yet another private IP, 10.1.1.2 (eth0),
with a rather odd combination of broadcast address 192.168.255.255 (?)
and mask 255.0.0.0.
Maybe vixen is/was part of another group of machines, via this other IP,
a cluster perhaps?
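
For reference, a 10.1.1.2 address with mask 255.0.0.0 would normally
pair with broadcast 10.255.255.255. A purely illustrative sketch of
a consistent setting, should that interface ever be reconfigured:

  # Align the broadcast address with the 255.0.0.0 mask:
  /sbin/ifconfig eth0 10.1.1.2 netmask 255.0.0.0 broadcast 10.255.255.255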

What is in your ${TORQUE}/server_priv/nodes file?
IPs or names (vixen & dasher)?
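
A typical nodes file lists one host per line; the np counts here
are assumptions for illustration only:

  vixen  np=2
  dasher np=2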

Are they on a DNS server or do you resolve their names/IPs
via /etc/hosts?

Hopefully vixen's name resolves as 172.16.1.107.
(ping -R vixen may tell).
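
If there is no DNS, /etc/hosts on both machines might carry entries
like these (addresses taken from the ifconfig output above):

  172.16.0.116   dasher
  172.16.1.107   vixen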

Gus Correa

