Brian,

My G5s have only one ethernet card each and are connected to the network through those cards. I upgraded to Open MPI 1.0.2, but the problem remains the same.
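(For completeness, the way I know of to check for extra TCP-capable interfaces of the kind you mention, assuming the standard BSD tools on Mac OS X, is

    ifconfig -a

where anything beyond en0 and the loopback, such as en1 for AirPort or fw0 for FireWire, would be a candidate for confusing the TCP btl.)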
A somewhat more detailed description of the problem follows.

When I launch from the 4-cpu machine, requesting 6 processes, orted, orterun, and 4 processes start on that machine, and orted and 2 processes start on the 2-cpu machine. The processes hang for a while, then I get the error message quoted below (connect() failed with errno=60), after which they keep hanging. If I hit Ctrl+C, all processes on both machines, including both orteds and orterun, quit.

When I launch from the 2-cpu machine, requesting 6 processes, orted, orterun, and 2 processes start on that machine. On the 4-cpu machine only orted starts; no processes start there. The job then hangs and I get no messages at all. If I hit Ctrl+C, orted, orterun, and the 2 processes on the 2-cpu machine quit, but orted on the 4-cpu machine does not.

Does this have anything to do with the IP addresses? The address xxx.xxx.aaa.bbb on one machine differs from xxx.xxx.cc.dd on the other not only in the last octet (bbb vs. dd) but also in the third (aaa vs. cc), so the two machines appear to be on different subnets.

David

***** Correspondence *****

> From: Brian Barrett <brbar...@open-mpi.org>
> Reply-To: Open MPI Users <us...@open-mpi.org>
> Date: Thu, 2 Mar 2006 18:50:35 -0500
> To: Open MPI Users <us...@open-mpi.org>
> Subject: Re: [OMPI users] Problem running open mpi across nodes.
>
> On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
>
>> I installed Open MPI on two Mac G5s, one with 2 cpus and the other
>> with 4 cpus. I can run jobs on either of the machines fine. But when
>> I ran a job on machine one across the two nodes, all the processes I
>> requested would start, but they then seemed to hang and I got the
>> error message:
>>
>> [0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=60
>> [0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect]
>> connect() failed with errno=60
>>
>> When I ran the job on machine two across the nodes, only processes on
>> this machine would start and then hung. No processes would start on
>> machine one and I didn't get any messages. In both cases, I had to
>> Ctrl+C to kill the jobs. Any idea what was wrong? Thanks a lot!
>
> errno 60 is ETIMEDOUT, which means that the connect() timed out
> before the remote side answered. The other way was probably a
> similar problem - there's something strange going on with the routing
> on the two nodes that's causing OMPI to get confused. Do your G5
> machines have ethernet adapters other than the primary GigE cards
> (wireless, a second GigE card, a Firewire TCP stack) by any chance?
> There's an issue with situations where there are multiple ethernet
> cards that causes the TCP btl to behave badly like this. We think we
> have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so
> it might help to upgrade to that version:
>
> http://www.open-mpi.org/software/ompi/v1.0/
>
> Brian
>
> --
> Brian Barrett
> Open MPI developer
> http://www.open-mpi.org/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
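P.S. One thing I may try next, assuming the TCP btl's btl_tcp_if_include MCA parameter is available in 1.0.2 (I have not verified this against the 1.0.2 documentation), is pinning the TCP component to the built-in GigE interface so it cannot pick a confusing route:

    mpirun --mca btl_tcp_if_include en0 -np 6 ./my_app

Here my_app is just a placeholder for the executable. The idea is that if the hang comes from OMPI choosing the wrong interface or route between the two subnets, naming the interface explicitly should sidestep it.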