Hi Gus,

Please read my comments inline.
On 2/14/11 7:05 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:

> Hi Tena
>
> Answers inline.
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>>> Hence, I don't understand why the lack of symmetry in the
>>> firewall protection.
>>> Either vixen's is too loose, or dashen's is too tight, I'd risk to say.
>>> Maybe dashen was installed later, just got whatever boilerplate firewall
>>> that comes with RedHat, CentOS, Fedora.
>>> If there is a gateway for this LAN somewhere with another firewall,
>>> which is probably the case,
>>
>> You are correct.  We had a system administrator, but we lost
>> that person, and I installed dasher from scratch myself.  I did use
>> the boilerplate firewall from the CentOS 5.5 distribution.
>>
>
> I read your answer to Ashley and Reuti telling that you
> turned the firewall off and OpenMPI now works with vixen and dashen.
> That's good news!
>
>>> Do you have Internet access from either machine?
>>
>> Yes, I do.
>
> The LAN gateway is probably doing NAT.

I think that's the case.

> I would guess it also has its own firewall.

Yes, I believe so.

> Is there anybody there that could tell you about this?

I am afraid not...  Every time I ask something, I get the run-around
or disinformation.

>
>>
>>> Vixen has yet another private IP 10.1.1.2 (eth0),
>>> with a bit weird combination of broadcast address 192.168.255.255 (?),
>>> mask 255.0.0.0.
>>> vixen is/was part of another group of machines, via this other IP,
>>> cluster perhaps?
>>
>> We have a Rocks HPC cluster.  The cluster head is called blitzen
>> and there are 8 nodes in the cluster.  We have completely outgrown
>> this setting.  For example, I have been running an application for the
>> last 2 weeks on 4 of the 8 nodes, the other 4 nodes have been used
>> by my colleagues, and I expect my jobs to run another 2-3 weeks.
>> Which is why I am interested in the cloud.
>>
>> Vixen is not part of the Rocks cluster, but it is an NFS server
>> as well as a database server.
>> Here's the ifconfig of blitzen:
>>
>> [tsakai@blitzen Rmpi]$ ifconfig
>> eth0      Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0B
>>           inet addr:10.1.1.1  Bcast:10.255.255.255  Mask:255.0.0.0
>>           inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:14637456238 (13.6 GiB)  TX bytes:25487423161 (23.7 GiB)
>>           Interrupt:193 Memory:ec000000-ec012100
>>
>> eth1      Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0D
>>           inet addr:172.16.1.106  Bcast:172.16.3.255  Mask:255.255.252.0
>>           inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:44685802310 (41.6 GiB)  TX bytes:28223858173 (26.2 GiB)
>>           Interrupt:193 Memory:ea000000-ea012100
>>
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           inet6 addr: ::1/128 Scope:Host
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:27450135463 (25.5 GiB)  TX bytes:27450135463 (25.5 GiB)
>>
>> And here's the same thing for vixen:
>>
>> [tsakai@vixen Rmpi]$ cat moo
>> eth0      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
>>           inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
>>           inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:47837093368 (44.5 GiB)  TX bytes:54525223424 (50.7 GiB)
>>           Interrupt:185 Memory:ea000000-ea012100
>>
>> eth1      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
>>           inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
>>           inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>           RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:1000
>>           RX bytes:371146631795 (345.6 GiB)  TX bytes:13424275898600 (12.2 TiB)
>>           Interrupt:193 Memory:ec000000-ec012100
>>
>> lo        Link encap:Local Loopback
>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>           inet6 addr: ::1/128 Scope:Host
>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>           RX packets:244240818 errors:0 dropped:0 overruns:0 frame:0
>>           TX packets:244240818 errors:0 dropped:0 overruns:0 carrier:0
>>           collisions:0 txqueuelen:0
>>           RX bytes:1190988294201 (1.0 TiB)  TX bytes:1190988294201 (1.0 TiB)
>>
>> I think you are also correct as to:
>>
>>> a bit weird combination of broadcast address 192.168.255.255 (?),
>>> and mask 255.0.0.0.
>>
>> I think they are both misconfigured.  I will fix them when I can.
>>
>
> Blitzen's configuration looks like standard Rocks to me:
> eth0 for private net, eth1 for LAN or WAN.
> I think it is not misconfigured.

Good, that's one fewer thing I have to do.

>
> Also, beware that Rocks has its own ways/commands to configure things
> (i.e., '$ rocks do this and that').
> Using the Linux tools directly sometimes breaks or leaves loose
> ends on Rocks.

Thank you for the tip.  I was not aware of that.
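If I understand correctly, the idea is to make interface changes through
the Rocks database rather than by editing the files by hand, and then push
them back out, roughly along these lines.  (This is only a sketch of my
reading of the Rocks documentation; I have not verified the exact syntax
on our Rocks version, and the host and IP values below are placeholders.)

  # see what Rocks thinks the interfaces are
  rocks list host interface blitzen

  # change a setting in the Rocks database (example values only)
  rocks set host interface ip compute-0-0 eth0 10.1.1.250

  # regenerate the config files from the database
  rocks sync config
  rocks sync host network compute-0-0

I will read up on this before I touch anything.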
>
> Vixen eth0 looks weird, but now that you mentioned your Rocks cluster,
> it may be that its eth0 is used to connect vixen to the
> cluster's private subnet, and serve NFS to it.
> Still the Bcast address doesn't look right.
> I would expect it to be 10.255.255.255 (as in blitzen's eth0), if vixen
> serves NFS to the cluster via eth0.

Thanks for this tip as well.  I will make a note of it.  As strange as it
looks, things are working, so I won't fix it at the moment.  There are
better things on my plate.

>
>>> What is in your ${TORQUE}/server_priv/nodes file?
>>> IPs or names (vixen & dashen).
>>
>> We don't use TORQUE.  We do use SGE from blitzen.
>>
>
> Oh, sorry, you said before you don't use Torque.
> I forgot that one.
>
> What I really meant to ask is about your OpenMPI hostfile,
> or how the --app file refers to the machines,
> but I guess you use host names there, not IPs.
>

Yes, I have been pretty consistent about using the DNS approach
(host names, not IPs).

>>> Are they on a DNS server or do you resolve their names/IPs
>>> via /etc/hosts?
>>> Hopefully vixen's name resolves as 172.16.1.107.
>>
>> They are on a DNS server:
>>
>> [tsakai@dasher Rmpi]$ nslookup vixen.egcrc.org
>> Server:         172.16.1.2
>> Address:        172.16.1.2#53
>>
>> Name:   vixen.egcrc.org
>> Address: 172.16.1.107
>>
>> [tsakai@dasher Rmpi]$ nslookup blitzen
>> Server:         172.16.1.2
>> Address:        172.16.1.2#53
>>
>> Name:   blitzen.egcrc.org
>> Address: 172.16.1.106
>>
>> [tsakai@dasher Rmpi]$
>> [tsakai@dasher Rmpi]$
>>
>
> DNS makes it easier for you, especially on a LAN, where machines
> change often in ways that you can't control.
> You don't need to worry about resolving names with /etc/hosts,
> which is an easy thing to do in a cluster.

All our machines have static IP addresses and they are served out of the
DNS server.  The only exception is dasher; I have not asked for a static
IP for it yet.

>
>> One more point that I overlooked in a previous post:
>>
>>> I have yet to understand whether you copy your compiled tools
>>> (OpenMPI, R, etc) from your local machines to EC2,
>>> or if you build/compile them directly on the EC2 environment.
>>
>> Tools like OpenMPI, R, and for that matter gcc, must be part
>> of the AMI.  The AMI is stored on an Amazon device; it can be on
>> S3 (Simple Storage Service) or on a volume (which is what Ashley
>> recommends).  So I put R and everything else I needed on the AMI
>> before I uploaded it to Amazon.  Only I didn't put OpenMPI
>> on it.  I used wget from my AMI instance to download the OpenMPI
>> source, compiled it on the instance, and saved that image
>> to S3.  So now when I launch the instance, OpenMPI is part of
>> the AMI.
>>
>
> It is more clear to me now.
> It sounds right, although other than storage,
> I can't fathom the difference between what you
> did and what Ashley suggested.
> Yet, somehow Ashley got it to work.
> There may be something to pursue there.

I agree with you there as well.

>
>>> Also, it's not clear to me if the OS in EC2 is an image
>>> from your local machines' OS/Linux distro, or independent of them,
>>> or if you can choose to have it either way.
>>
>> The OS in EC2 is either Linux or Windows.  (I have never
>> used Windows in my life.)
>
> I did.
> Don't worry.
> It is not a sin.  :)
>
> But seriously, from the problems I read on the MPICH2 mailing list,
> it seems to be hard to use for HPC and parallel programming, at least.
>
>
>> For Linux, it can be any distribution one chooses.  In my case, I built
>> an AMI from a CentOS distribution with everything I needed.  It is
>> essentially the same thing as dasher.
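(For the record, the OpenMPI build on the instance was just the stock
source build, roughly as sketched below.  The version number, URL, and
install prefix are illustrative only, not necessarily exactly what I
used on the AMI.)

  $ wget http://www.open-mpi.org/software/ompi/v1.4/downloads/openmpi-1.4.3.tar.gz
  $ tar xzf openmpi-1.4.3.tar.gz
  $ cd openmpi-1.4.3
  $ ./configure --prefix=/usr/local
  $ make all
  $ sudo make install
  $ # make sure mpirun and the libraries are found in non-interactive shells
  $ export PATH=/usr/local/bin:$PATH
  $ export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH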
>
> Except for the firewall, I suppose.
> Did you check if it is turned off on your EC2 replica of dasher?
> I don't know if this question makes any sense in the EC2 context,
> but maybe it does.

I did just now:

[tsakai@ip-10-212-231-223 ~]$ sudo su
bash-3.2#
bash-3.2# iptables --list
Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
bash-3.2#
bash-3.2# cat cat /etc/sysconfig/iptables
cat: cat: No such file or directory
cat: /etc/sysconfig/iptables: No such file or directory
bash-3.2#
bash-3.2# /etc/rc.d/init.d/iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination
bash-3.2#

I think I can conclude that there is no firewall running on the EC2
instance.

>
>>
>>> On another posting, Ashley Pittman reported to
>>> be using OpenMPI in Amazon EC2 without problems,
>>> suggests pathway and gives several tips for that.
>>> That is probably a more promising path,
>>> which you may want to try.
>>
>> I have a feeling that I will be in need of more help
>> from her.
>>
>
> Save a mistake, I have the feeling that the
> Ashley Pittman we've been talking to is a gentleman:
>
> http://uk.linkedin.com/in/ashleypittman
>
> not the jewelry designer:
>
> http://www.ashleypittman.com/company-ashley-pittman.php
>

Oops, I hope I didn't offend him.

>> Regards,
>>
>> Tena
>>
>
> Best,
> Gus

Lastly, I broke something, though I have no idea what it is.  I have been
racking my brain for the last 3-4 hours...  I did something to my AMI
between Friday and today, and as a result I can no longer replicate what
I showed you on Friday on EC2.  Namely, I cannot generate the
'pipe function call failed...' error anymore; instead, it hangs.  I am
sure I am doing something stupid...  More on this tomorrow.

Many thanks for your thoughts.  I appreciate it very much.

Regards,

Tena

>
>
>> On 2/14/11 3:46 PM, "Gus Correa" <g...@ldeo.columbia.edu> wrote:
>>
>>> Tena Sakai wrote:
>>>> Hi Kevin,
>>>>
>>>> Thanks for your reply.
>>>> Dasher is physically located under my desk and vixen is in a
>>>> secure data center.
>>>>
>>>>> does dasher have any network interfaces that vixen does not?
>>>>
>>>> No, I don't think so.
>>>> Here is more definitive info:
>>>>
>>>> [tsakai@dasher Rmpi]$ ifconfig
>>>> eth0      Link encap:Ethernet  HWaddr 00:1A:A0:E1:84:A9
>>>>           inet addr:172.16.0.116  Bcast:172.16.3.255  Mask:255.255.252.0
>>>>           inet6 addr: fe80::21a:a0ff:fee1:84a9/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:2347 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:1005 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:100
>>>>           RX bytes:531809 (519.3 KiB)  TX bytes:269872 (263.5 KiB)
>>>>           Memory:c2200000-c2220000
>>>>
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:74 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:7824 (7.6 KiB)  TX bytes:7824 (7.6 KiB)
>>>>
>>>> [tsakai@dasher Rmpi]$
>>>>
>>>> However, vixen has two ethernet interfaces:
>>>>
>>>> [root@vixen ec2]# /sbin/ifconfig
>>>> eth0      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
>>>>           inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
>>>>           inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:61913135 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:61923635 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:47832124690 (44.5 GiB)  TX bytes:54515478860 (50.7 GiB)
>>>>           Interrupt:185 Memory:ea000000-ea012100
>>>>
>>>> eth1      Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
>>>>           inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
>>>>           inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
>>>>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>>           RX packets:5204431112 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:8935796075 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:1000
>>>>           RX bytes:371123590892 (345.6 GiB)  TX bytes:13424246629869 (12.2 TiB)
>>>>           Interrupt:193 Memory:ec000000-ec012100
>>>>
>>>> lo        Link encap:Local Loopback
>>>>           inet addr:127.0.0.1  Mask:255.0.0.0
>>>>           inet6 addr: ::1/128 Scope:Host
>>>>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>>>           RX packets:244169216 errors:0 dropped:0 overruns:0 frame:0
>>>>           TX packets:244169216 errors:0 dropped:0 overruns:0 carrier:0
>>>>           collisions:0 txqueuelen:0
>>>>           RX bytes:1190976360356 (1.0 TiB)  TX bytes:1190976360356 (1.0 TiB)
>>>>
>>>> [root@vixen ec2]#
>>>>
>>>> Please see the mail posting that follows this, my reply to Ashley,
>>>> who nailed the problem precisely.
>>>>
>>>> Regards,
>>>>
>>>> Tena
>>>>
>>>>
>>>> On 2/14/11 1:35 PM, "kevin.buck...@ecs.vuw.ac.nz"
>>>> <kevin.buck...@ecs.vuw.ac.nz> wrote:
>>>>
>>>>> This probably shows my lack of understanding as to how OpenMPI
>>>>> negotiates the connectivity between nodes when given a choice
>>>>> of interfaces but anyway:
>>>>>
>>>>> does dasher have any network interfaces that vixen does not?
>>>>>
>>>>> The scenario I am imagining would be that you ssh into dasher
>>>>> from vixen using a "network" that both share and similarly, when
>>>>> you mpirun from vixen, the network that OpenMPI uses is constrained
>>>>> by the interfaces that can be seen from vixen, so you are fine.
>>>>>
>>>>> However when you are on dasher, mpirun sees another interface which
>>>>> it takes a liking to and so tries to use that, but that interface
>>>>> is not available to vixen so the OpenMPI processes spawned there
>>>>> terminate when they can't find that interface so as to talk back
>>>>> to dasher's controlling process.
>>>>>
>>>>> I know that you are no longer working with VMs but it's along those
>>>>> lines that I was thinking: extra network interfaces that you assume
>>>>> won't be used but which are and which could then be overcome by use
>>>>> of an explicit
>>>>>
>>>>> --mca btl_tcp_if_exclude virbr0
>>>>>
>>>>> or some such construction (virbr0 used as an example here).
>>>>>
>>>>> Kevin
>>>>
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>> Hi Tena
>>>
>>> They seem to be connected through the LAN 172.16.0.0/255.255.252.0,
>>> with private IPs 172.16.0.116 (dashen,eth0) and
>>> 172.16.1.107 (vixen,eth1).
>>> These addresses are probably what OpenMPI is using.
>>> Not much like a cluster, but just machines in a LAN.
>>>
>>> Hence, I don't understand why the lack of symmetry in the
>>> firewall protection.
>>> Either vixen's is too loose, or dashen's is too tight, I'd risk to say.
>>> Maybe dashen was installed later, just got whatever boilerplate firewall
>>> that comes with RedHat, CentOS, Fedora.
>>> If there is a gateway for this LAN somewhere with another firewall,
>>> which is probably the case,
>>> I'd guess it is OK to turn off dashen's firewall.
>>>
>>> Do you have Internet access from either machine?
>>>
>>> Vixen has yet another private IP 10.1.1.2 (eth0),
>>> with a bit weird combination of broadcast address 192.168.255.255 (?),
>>> and mask 255.0.0.0.
>>> Maybe vixen is/was part of another group of machines, via this other IP,
>>> a cluster perhaps?
>>>
>>> What is in your ${TORQUE}/server_priv/nodes file?
>>> IPs or names (vixen & dashen).
>>>
>>> Are they on a DNS server or do you resolve their names/IPs
>>> via /etc/hosts?
>>>
>>> Hopefully vixen's name resolves as 172.16.1.107.
>>> (ping -R vixen may tell).
>>>
>>> Gus Correa
>>>
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
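P.S.  Regarding Kevin's tip above about interface selection: if the hang
comes back once the firewall business is sorted out, my plan is to restrict
OpenMPI's TCP traffic to the one LAN that vixen and dasher share
(172.16.0.0/22).  A sketch of what I have in mind is below; I have not
actually tried it yet, "my_appfile" is just a stand-in for my real --app
file, and the subnet notation for btl_tcp_if_include may depend on the
OpenMPI version, so please take it as an idea rather than a recipe:

  $ mpirun --app my_appfile \
           --mca btl self,tcp \
           --mca btl_tcp_if_include 172.16.0.0/22

The alternative, per Kevin, is to exclude the unwanted interfaces by name
with --mca btl_tcp_if_exclude, but the names differ between the two
machines (the shared LAN is eth1 on vixen and eth0 on dasher), which is
why the subnet form looks more attractive to me.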