Brian,

Thank you so much! It is working now.
David

***** Correspondence *****

> From: Brian Barrett <brbar...@open-mpi.org>
> Reply-To: Open MPI Users <us...@open-mpi.org>
> Date: Thu, 2 Mar 2006 20:32:25 -0500
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>
> On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote:
>
>> My G5s only have one ethernet card each and are connected to the
>> network through those cards. I upgraded to Open MPI 1.0.2. The
>> problem remains the same.
>>
>> A somewhat detailed description of the problem is like this. When I
>> run jobs from the 4-cpu machine, specifying 6 processes, orted,
>> orterun and 4 processes will start on this machine. orted and 2
>> processes will start on the 2-cpu machine. The processes hang for a
>> while and then I get the error message. After that, the processes
>> still hang. If I Ctrl+c, all processes on both machines, including
>> both orteds and the orterun, will quit. If I run jobs from the
>> 2-cpu machine, specifying 6 processes, orted, orterun and 2
>> processes will start on this machine. Only orted will start on the
>> 4-cpu machine and no processes will start. The job then hangs and I
>> don't get any response. If I Ctrl+c, orted, orterun and the 2
>> processes on the 2-cpu machine will quit. But orted on the 4-cpu
>> machine will not quit.
>>
>> Does this have anything to do with the IP addresses? The IP address
>> xxx.xxx.aaa.bbb for one machine is different from the IP address
>> xxx.xxx.cc.dd for the other machine in that not only is bbb not dd,
>> but aaa is also not cc.
>
> Well, you can't guess right all the time :). But I think you gave
> enough information for the next thing to try. It sounds like there
> might be a firewall running on the 2-process machine. When you
> orterun on the 4-cpu machine, the remote orted can clearly connect
> back to orterun, because it is getting the process startup and
> shutdown messages.
> Things only fail when the MPI processes on the 4-cpu machine try to
> connect to the other processes. On the other hand, when you start on
> the 2-cpu machine, the orted on the 4-cpu machine starts but can't
> even connect back to orterun to find out what processes to start,
> nor can it get the shutdown request. So you get a hang.
>
> If you go into System Preferences -> Sharing, make sure that the
> firewall is turned off in the "Firewall" tab. Hopefully, that will
> do the trick.
>
> Brian
>
>
>>> From: Brian Barrett <brbar...@open-mpi.org>
>>> Reply-To: Open MPI Users <us...@open-mpi.org>
>>> Date: Thu, 2 Mar 2006 18:50:35 -0500
>>> To: Open MPI Users <us...@open-mpi.org>
>>> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>>>
>>> On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
>>>
>>>> I installed Open MPI on two Mac G5s, one with 2 cpus and the
>>>> other with 4 cpus. I can run jobs on either of the machines fine.
>>>> But when I ran a job on machine one across the two nodes, all the
>>>> processes I requested would start, but they then seemed to hang
>>>> and I got the error message:
>>>>
>>>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() failed with errno=60
>>>> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>>>> connect() failed with errno=60
>>>>
>>>> When I ran the job on machine two across the nodes, only the
>>>> processes on this machine would start, and they then hung. No
>>>> processes would start on machine one and I didn't get any
>>>> messages. In both cases, I had to Ctrl+C to kill the jobs. Any
>>>> idea what was wrong? Thanks a lot!
>>>
>>> errno 60 is ETIMEDOUT, which means that the connect() timed out
>>> before the remote side answered. The other way was probably a
>>> similar problem - there's something strange going on with the
>>> routing on the two nodes that's causing OMPI to get confused.
>>> Do your G5 machines have ethernet adapters other than the primary
>>> GigE cards (wireless, a second GigE card, a FireWire TCP stack) by
>>> any chance? There's an issue in situations where there are
>>> multiple ethernet cards that causes the TCP btl to behave badly
>>> like this. We think we have it fixed in the latest 1.0.2
>>> pre-release tarball of Open MPI, so it might help to upgrade to
>>> that version:
>>>
>>> http://www.open-mpi.org/software/ompi/v1.0/
>>>
>>> Brian
>>>
>>> --
>>> Brian Barrett
>>> Open MPI developer
>>> http://www.open-mpi.org/
>>>
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
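For readers hitting the same symptom: the errno=60 (ETIMEDOUT) diagnosis above means connect() got no answer at all, which is different from an active "connection refused". The two cases point at different problems, and you can tell them apart without Open MPI in the loop. The sketch below is a hypothetical probe written for this thread, not part of Open MPI; host and port values are placeholders you would replace with your own nodes and the ports the TCP btl is using. A silent timeout usually means a firewall (as it did here) or bad routing, while a refusal means the host answered but nothing was listening:

```python
import errno
import socket


def probe(host, port, timeout=3.0):
    """Classify a TCP connect attempt, mirroring the symptoms in this thread."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"         # something is listening: a firewall is not the problem
    except socket.timeout:
        return "filtered"     # no answer at all -> the errno 60 symptom: firewall or routing
    except OSError as e:
        if e.errno == errno.ECONNREFUSED:
            return "refused"  # host reachable, but no listener on that port
        raise
    finally:
        s.close()
```

Run between the two machines, a "filtered" result against a port you know a process is bound to would confirm the firewall theory. If multiple interfaces turn out to be the issue instead (the 1.0.2 fix discussed above), Open MPI's `btl_tcp_if_include` MCA parameter can also be used to restrict the TCP btl to a single interface, e.g. `mpirun --mca btl_tcp_if_include en0 ...`.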