Brian,

Thank you so much! It is working now.

David

***** Correspondence *****



> From: Brian Barrett <brbar...@open-mpi.org>
> Reply-To: Open MPI Users <us...@open-mpi.org>
> Date: Thu, 2 Mar 2006 20:32:25 -0500
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Problem running open mpi across nodes.
> 
> On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote:
> 
>> My G5s only have one ethernet card each and are connected to the
>> network
>> through those cards. I upgraded to Open MPI 1.0.2. The problem
>> remains the
>> same.
>> 
>> A somewhat detailed description of the problem is as follows. When I
>> run jobs
>> from the 4-cpu machine, specifying 6 processes, orted, orterun and 4
>> processes will start on this machine. orted and 2 processes will
>> start on
>> the 2-cpu machine. The processes hang for a while and then I get
>> the error
>> message. After that, the processes still hang. If I Ctrl+c, all
>> processes
>> on both machines including both orteds and the orterun will quit.
>> If I run
>> jobs from the 2-cpu machine, specifying 6 processes, orted, orterun
>> and 2
>> processes will start on this machine. Only orted will start on the
>> 4-cpu
>> machine and no processes will start. The job then hangs and I don't
>> get any
>> response. If I Ctrl+c, orted, orterun and the 2 processes on the 2-cpu
>> machine will quit. But orted on the 4-cpu machine will not quit.
>> 
>> Does this have anything to do with the IP addresses? The IP address
>> xxx.xxx.aaa.bbb for one machine is different from the IP address
>> xxx.xxx.cc.dd for the other machine in that not only does bbb differ
>> from dd, but aaa also differs from cc.
> 
> Well, you can't guess right all the time :).  But I think you gave
> enough information for the next thing to try.  It sounds like there
> might be a firewall running on the 2 process machine.  When you run
> orterun on the 4 cpu machine, the remote orted can clearly connect
> back to orterun because it is getting the process startup and
> shutdown messages.  Things only fail when the MPI processes on the 4
> cpu machine try to connect to the other processes.  On the other
> hand, when you start on the 2 cpu machine, the orted on the 4 cpu
> machine starts but can't even connect back to orterun to find out
> what processes to start, nor can it get the shutdown request.  So you
> get a hang.
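>
> A quick way to test that theory from the 4 cpu machine (just a rough
> check; "two-cpu-host" and port 12345 are placeholders for your 2 cpu
> machine's hostname and any unused TCP port):
>
>    telnet two-cpu-host 12345
>
> An immediate "Connection refused" means nothing is filtering the
> traffic; if it just hangs and eventually times out, something (quite
> possibly a firewall) is silently dropping packets, which matches the
> errno=60 (ETIMEDOUT) you saw.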
> 
> If you go into System Preferences -> Sharing, make sure that the
> firewall is turned off in the "firewall" tab.  Hopefully, that will
> do the trick.
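>
> For what it's worth, you can also check the firewall state from a
> Terminal (a rough check, assuming the 10.4-style ipfw firewall):
>
>    sudo ipfw list
>
> If the only rule it prints is "65535 allow ip from any to any", the
> firewall is effectively off; any deny rules could be blocking the
> dynamically chosen TCP ports the MPI processes use to reach each other.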
> 
> Brian
> 
> 
> 
>>> From: Brian Barrett <brbar...@open-mpi.org>
>>> Reply-To: Open MPI Users <us...@open-mpi.org>
>>> Date: Thu, 2 Mar 2006 18:50:35 -0500
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>>> 
>>> On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
>>> 
>>>> I installed Open MPI on two Mac G5s, one with 2 cpus and the other
>>>> with 4
>>>> cpus. I can run jobs on either of the machines fine. But when I ran
>>>> a job on
>>>> machine one across the two nodes, all the processes I requested
>>>> would start,
>>>> but they then seemed to hang and I got the error message:
>>>> 
>>>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60
>>>> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60
>>>> 
>>>> When I ran the job on machine two across the nodes, only processes
>>>> on this
>>>> machine would start and then hang. No processes would start on
>>>> machine one
>>>> and I didn't get any messages. In both cases, I had to Ctrl+C to
>>>> kill the
>>>> jobs. Any idea what was wrong? Thanks a lot!
>>> 
>>> errno 60 is ETIMEDOUT, which means that the connect() timed out
>>> before the remote side answered.  The failure in the other direction was
>>> probably a similar problem - there's something strange going on with the routing
>>> on the two nodes that's causing OMPI to get confused.  Do your G5
>>> machines have ethernet adapters other than the primary GigE cards
>>> (wireless, a second GigE card, a Firewire TCP stack) by any chance?
>>> There's a known issue where multiple ethernet cards cause the TCP
>>> btl to behave badly like this.  We think we
>>> have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so
>>> it might help to upgrade to that version:
>>> 
>>>    http://www.open-mpi.org/software/ompi/v1.0/
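>>>
>>> If the machines really do have extra interfaces, another thing worth
>>> trying (a sketch: it assumes the built-in GigE port shows up as en0
>>> on both G5s; check with ifconfig) is to restrict the TCP btl to that
>>> one interface:
>>>
>>>    mpirun --mca btl_tcp_if_include en0 -np 6 ./your_program
>>>
>>> where ./your_program stands in for your executable.  That keeps Open
>>> MPI from trying to make connections over FireWire, AirPort, or any
>>> other interface it happens to find.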
>>> 
>>> Brian
>>> 
>>> -- 
>>>    Brian Barrett
>>>    Open MPI developer
>>>    http://www.open-mpi.org/
>>> 
>>> 
>> 
>> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

