Tena,

If I understand you correctly, the configuration you're trying to use is

    submission host [ec2 instance 0] <-> slave [ec2 instance 1]

I haven't tried this yet (although I will in the next few days). I've tried

    (a) submission host [non-ec2 system with static IP, direct net connection] <-> slave [ec2 instance 1]

    (b) submission host [non-ec2 system with local static IP, connected to net via router] <-> slave [ec2 instance 1]

(a) works, (b) does not, presumably because Open MPI does not support NAT (see Jeff Squyres' comments, later in the thread).

I notice that you're using the 'internal' uri to specify hostnames. This makes sense in principle, but have you tried using the public/external uri? Presumably Open MPI has to look up these hostnames. I don't know how that's done, but trying to look up the internal uri might be a problem.

If you try this (or anything else), I'd appreciate it if you'd post your results.

bw

On 2/17/11 4:08 AM, Tena Sakai wrote:
> Hi Barnet,
>
> Allow me to interject.
> Are you saying that you run the master on your local machine and launch
> Open MPI processes on EC2? You are saying that 1) tcp port
> tcp://192.168.1.101:35272 is on your local system and 2) the ec2
> instance is trying to connect to your local machine’s port 35272, and
> hanging. Is that correct?
>
> I have just a bit different situation. I am running 2 ec2 instances
> and trying to run mpirun on both instances. My ssh debug output looks
> quite similar to yours and mpirun behavior is also very similar. Here’s
> what I captured:
>     Sending command: orted --daemonize -mca ess env -mca orte_ess_jobid
>     1025769472 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>     "1025769472.0;tcp://10.118.23.4:60941"
> And here’s what I did on the instance from which I issued mpirun:
>     [tsakai@ip-10-118-23-4 ~]$ nslookup `hostname`
>     Server:   172.16.0.23
>     Address:  172.16.0.23#53
>
>     Non-authoritative answer:
>     Name:     ip-10-118-23-4.ec2.internal
>     Address:  10.118.23.4
>
> So that tcp port does belong to this instance. Furthermore, it cannot
> come into it.
> No router (which may perform address translation?) is
> involved and it appears the same thing as what you describe is
> happening. Incidentally, here’s how I ran mpirun:
>     [tsakai@ip-10-118-23-4 ~]$ mpirun -app app.ac
> With app.ac file:
>     [tsakai@ip-10-118-23-4 ~]$ cat app.ac
>     -H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
>     -H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
>     -H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
>     -H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
>
> The first two lines spawn /bin/hostname on this instance
> (ip-10-118-23-4.ec2.internal) and the bottom 2 lines on the remote
> instance.
> Here’s the security group used for these instances:
>
>     connection   protocol   from   to   source
>     ----------   --------   ----   --   ---------
>     SSH          tcp        22     22   0.0.0.0/0
>
> Am I making sense?
>
> Regards,
>
> Tena
>
>
> On 2/16/11 8:56 PM, "Barnet Wagman" <b...@norbl.com> wrote:
>
>     I've run into a problem involving accessing a remote host via a
>     router and I think I need to understand how Open MPI determines ip
>     addresses. If there's anything posted on this subject, please
>     point me to it.
>
>     Here's the problem:
>
>     I've installed Open MPI (1.4.3) on a remote system (an Amazon ec2
>     instance). If the local system I'm working on has a static ip
>     address (and a direct connection to the internet), there's no
>     problem. But if the local system accesses the internet through a
>     router (which itself gets its ip via dhcp), a call to the mpirun
>     command hangs.
>
>     This is not a firewall problem - I've disabled the firewalls on all
>     the systems that are involved (and the router).
>
>     It is also not an ssh problem. The ssh connection is being made
>     and it appears that the application has been launched on the
>     remote system.
>     After the mpirun command has been launched
>     locally, a ps on the remote system shows a process
>
>         orted --daemonize -mca ess env -mca orte_ess_jobid 1187643392
>         -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>         1187643392.0;tcp://192.168.1.101:35272
>
>     While I don't really understand the orted process, I assume this
>     indicates that a command to execute an app has been received and
>     that Open MPI is trying to run it.
>
>     I suspect that the problem is related to the '--hnp-uri ...
>     tcp://192.168.1.101' argument. 192.168.1.101 is the address of my
>     local system on my local network (attached to the router), which
>     of course is not accessible over the net. It appears that Open MPI
>     is transmitting the local (static) ip address to the remote host.
>
>     It would help to know how Open MPI determines and distributes IP
>     addresses. And if there's any way to control this.
>
>     Any thoughts on dealing with this would be greatly appreciated.
>
>     Thanks,
>
>     bw
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
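
A quick way to check the addresses discussed in this thread: both --hnp-uri values quoted above (tcp://192.168.1.101:35272 and tcp://10.118.23.4:60941) carry private addresses, which are not routable over the public internet. The Python sketch below flags that case. It is not part of Open MPI; the function names are hypothetical, and it assumes the value has exactly the form shown in the thread (an identifier, a semicolon, then a single tcp:// URI).

```python
import ipaddress
from urllib.parse import urlparse


def parse_hnp_uri(hnp_uri):
    """Split a value like '1187643392.0;tcp://192.168.1.101:35272'.

    Assumption: one identifier, one ';', one tcp:// URI, as quoted in
    the thread. Other forms are not handled.
    """
    ident, sep, uri = hnp_uri.strip('"').partition(";")
    if not sep:
        raise ValueError("expected '<id>;tcp://<ip>:<port>'")
    parsed = urlparse(uri)
    if parsed.scheme != "tcp" or parsed.hostname is None:
        raise ValueError("expected a tcp:// URI after ';'")
    return ident, ipaddress.ip_address(parsed.hostname), parsed.port


def uses_private_address(hnp_uri):
    """True if the address in the URI is private (e.g. 10/8, 192.168/16)."""
    _, ip, _ = parse_hnp_uri(hnp_uri)
    return ip.is_private


if __name__ == "__main__":
    for value in (
        "1187643392.0;tcp://192.168.1.101:35272",
        "1025769472.0;tcp://10.118.23.4:60941",
    ):
        ident, ip, port = parse_hnp_uri(value)
        print(ident, ip, port, "private" if ip.is_private else "public")
```

A private address here is not automatically a problem: it only means the remote orted must be on a network from which that address and port can be reached.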