Tena,

If I understand you correctly, the configuration you're trying to use is

    submission host[ec2 instance 0] <-> slave [ec2 instance 1]

I haven't tried this yet (although I will in the next few days). 

I've tried

    (a)  submission host[non-ec2 system with static IP, direct net
    connection] <-> slave [ec2 instance 1]
    (b)  submission host[non-ec2 system with local static IP, connected
    to net via router] <-> slave [ec2 instance 1]

(a) works, (b) does not, presumably because Open MPI does not support NAT
(see Jeff Squyres' comments later in the thread).
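
As a rough illustration of why (b) would fail, here is a small Python sketch. Nothing in it comes from Open MPI itself: the helper names are made up, and the URI layout is only inferred from the two --hnp-uri values quoted further down in this thread. It pulls the address out of an --hnp-uri argument and reports whether it is a private address, which a host on the far side of a NAT router generally cannot connect back to.

```python
import ipaddress
from urllib.parse import urlparse


def parse_hnp_uri(hnp_uri):
    """Split '<name>;tcp://<ip>:<port>' into (name, ip, port).

    Assumption: this layout is inferred from the examples in this thread,
    not taken from the Open MPI documentation.
    """
    name, _, contact = hnp_uri.partition(";")
    parsed = urlparse(contact)
    return name, ipaddress.ip_address(parsed.hostname), parsed.port


def describe(hnp_uri):
    """Return a one-line summary of the address passed to the remote orted."""
    _, ip, port = parse_hnp_uri(hnp_uri)
    kind = "private" if ip.is_private else "public"
    return f"{ip}:{port} is a {kind} address"


if __name__ == "__main__":
    # The two --hnp-uri values that appear in this thread.
    for uri in ("1187643392.0;tcp://192.168.1.101:35272",
                "1025769472.0;tcp://10.118.23.4:60941"):
        print(describe(uri))
```

A private address is only a problem when the two hosts sit on different networks. 10.118.23.4 is private as well, but both of Tena's instances presumably share EC2's internal network, so that alone would not explain the hang.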


I notice that you're using the 'internal' URI to specify hostnames. This
makes sense in principle, but have you tried using the public/external
URI?  Presumably Open MPI has to look up these hostnames.  I don't know how
that's done, but trying to look up the internal URI might be a problem.
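
One way to see what a hostname resolves to from a given machine is a small Python check like the one below. It only asks the system resolver (via socket.getaddrinfo), which may or may not be how Open MPI does its lookups; resolve_ipv4 is a made-up helper name.

```python
import socket
import sys


def resolve_ipv4(hostname):
    """Return the sorted IPv4 addresses the system resolver gives for hostname.

    An empty list means the name could not be resolved from this machine.
    """
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                                   type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Hostnames to check are given on the command line.
    for name in sys.argv[1:]:
        addresses = resolve_ipv4(name)
        print(name, "->", ", ".join(addresses) if addresses else "no address found")
```

Saved under some name such as resolve.py (hypothetical) and run on each instance with both the internal names (e.g. ip-10-118-18-172.ec2.internal) and the public ones, it would show whether the two machines see the same addresses.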

If you try this (or anything else), I'd appreciate it if you'd post your
results.

bw


On 2/17/11 4:08 AM, Tena Sakai wrote:
> Hi Barnet,
>
> Allow me to interject.
> Are you saying that you run the master on your local machine and launch
> Open MPI processes on EC2?  You are saying that 1) tcp port
> tcp://192.168.1.101:35272 is on your local system and 2) the ec2
> instance is trying to connect to your local machine’s port 35272 and
> hanging.  Is that correct?
>
> My situation is a bit different.  I am running 2 ec2 instances
> and trying to run mpirun on both instances.  My ssh debug output looks
> quite similar to yours and mpirun behaves very similarly.  Here’s
> what I captured:
>   Sending command:  orted --daemonize -mca ess env -mca orte_ess_jobid
> 1025769472 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
> "1025769472.0;tcp://10.118.23.4:60941"
> And here’s what I did on the instance from which I issued mpirun:
>   [tsakai@ip-10-118-23-4 ~]$ nslookup `hostname`
>   Server:         172.16.0.23
>   Address:        172.16.0.23#53
>
>   Non-authoritative answer:
>   Name:   ip-10-118-23-4.ec2.internal
>   Address: 10.118.23.4
>
> So that tcp port does belong to this instance.  Furthermore, the remote
> instance apparently cannot connect to it.  No router (which might
> perform address translation) is involved, yet the same thing you
> describe appears to be happening.  Incidentally, here’s how I ran mpirun:
>   [tsakai@ip-10-118-23-4 ~]$ mpirun -app app.ac
> With app.ac file:
>   [tsakai@ip-10-118-23-4 ~]$ cat app.ac
>   -H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
>   -H ip-10-118-23-4.ec2.internal -np 1 /bin/hostname
>   -H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
>   -H ip-10-118-18-172.ec2.internal -np 1 /bin/hostname
>
> The first two lines spawn /bin/hostname on this instance
> (ip-10-118-23-4.ec2.internal) and the bottom two lines spawn it on the
> remote instance.
> Here’s the security group used for these instances:
>
>   connection   protocol   from   to   source
>   ----------   --------   ----   --   ---------
>   SSH          tcp        22     22   0.0.0.0/0
>
> Am I making sense?
>
> Regards,
>
> Tena
>
>
>
>
> On 2/16/11 8:56 PM, "Barnet Wagman" <b...@norbl.com> wrote:
>
>       I've run into a problem involving accessing a remote host via a
>     router, and I think I need to understand how Open MPI determines IP
>     addresses.  If there's anything posted on this subject, please
>     point me to it.
>      
>      Here's the problem:
>      
>      I've installed Open MPI (1.4.3) on a remote system (an Amazon ec2
>     instance).  If the local system I'm working on has a static IP
>     address (and a direct connection to the internet), there's no
>     problem.  But if the local system accesses the internet through a
>     router (which itself gets its IP via DHCP), a call to the mpirun
>     command hangs.
>      
>      This is not a firewall problem - I've disabled the firewalls on all
>     the systems that are involved (and the router).
>      
>      It is also not an ssh problem.  The ssh connection is being made
>     and it appears that the application has been launched on the
>     remote system.  After the mpirun command has been launched
>     locally, a ps on the remote system shows a process
>      
>
>         orted --daemonize -mca ess env -mca orte_ess_jobid 1187643392
>         -mca orte_ess_vpid 1 -mca orte_ess_num_procs 2 --hnp-uri
>         1187643392.0;tcp://192.168.1.101:35272
>          
>
>
>      While I don't really understand the orted process, I assume this
>     indicates that a command to execute an app has been received and
>     that Open MPI is trying to run it.
>      
>      I suspect that the problem is related to the '--hnp-uri ...
>     tcp://192.168.1.101' argument.  192.168.1.101 is the address of my
>     local system on my local network (attached to the router), which
>     of course is not accessible over the net.  It appears that Open MPI
>     is transmitting the local (static) IP address to the remote host.
>      
>      It would help to know how Open MPI determines and distributes IP
>     addresses, and whether there's any way to control this.
>      
>      Any thoughts on dealing with this would be greatly appreciated.
>      
>      Thanks,
>      
>      bw
>      
>      
>      
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
