On Mar 2, 2006, at 8:19 PM, Xiaoning (David) Yang wrote:
My G5s only have one ethernet card each and are connected to the network through those cards. I upgraded to Open MPI 1.0.2. The problem remains the same.
A somewhat detailed description of the problem is like this. When I run jobs from the 4-cpu machine, specifying 6 processes, orted, orterun and 4 processes will start on this machine. orted and 2 processes will start on the 2-cpu machine. The processes hang for a while and then I get the error message. After that, the processes still hang. If I Ctrl+C, all processes on both machines, including both orteds and the orterun, will quit.

If I run jobs from the 2-cpu machine, specifying 6 processes, orted, orterun and 2 processes will start on this machine. Only orted will start on the 4-cpu machine and no processes will start. The job then hangs and I don't get any response. If I Ctrl+C, orted, orterun and the 2 processes on the 2-cpu machine will quit, but orted on the 4-cpu machine will not.

Does this have anything to do with the IP addresses? The IP address xxx.xxx.aaa.bbb for one machine differs from the IP address xxx.xxx.cc.dd for the other machine in that not only is bbb not dd, but aaa is also not cc.
Well, you can't guess right all the time :). But I think you gave enough information for the next thing to try. It sounds like there might be a firewall running on the 2-cpu machine. When you run orterun on the 4-cpu machine, the remote orted can clearly connect back to orterun, because it is getting the process startup and shutdown messages. Things only fail when the MPI processes on the 4-cpu machine try to connect to the other processes. On the other hand, when you start on the 2-cpu machine, the orted on the 4-cpu machine starts but can't even connect back to orterun to find out what processes to start, nor can it get the shutdown request. So you get a hang.
If you go into System Preferences -> Sharing, make sure that the firewall is turned off in the "Firewall" tab. Hopefully, that will do the trick.
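If it would help to double-check, below is a minimal MPI ring test in C (just a sketch; the file name, hostfile, and process count in the comments are placeholders). Passing a token around the ring forces every process to open a connection to a neighbor, so if it runs to completion across both machines once the firewall is off, the TCP paths between the MPI processes are fine.

/* ring_check.c - pass a token around all ranks so every process has to
 * open a TCP connection to a neighbor, including across the two G5s.
 * Build: mpicc ring_check.c -o ring_check
 * Run:   orterun -np 6 --hostfile <your hostfile> ./ring_check
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    if (size < 2) {
        printf("run with at least 2 processes\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        /* rank 0 starts the token, then waits for it to come back around */
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        /* everyone else receives from the left neighbor, sends to the right */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    printf("rank %d on %s: token = %d\n", rank, host, token);
    MPI_Finalize();
    return 0;
}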
Brian
From: Brian Barrett <brbar...@open-mpi.org>
Reply-To: Open MPI Users <us...@open-mpi.org>
Date: Thu, 2 Mar 2006 18:50:35 -0500
To: Open MPI Users <us...@open-mpi.org>
Subject: Re: [OMPI users] Problem running open mpi across nodes.
On Mar 2, 2006, at 3:56 PM, Xiaoning (David) Yang wrote:
I installed Open MPI on two Mac G5s, one with 2 cpus and the other with 4 cpus. I can run jobs on either of the machines fine. But when I ran a job on machine one across the two nodes, all the processes I requested would start, but they then seemed to hang and I got the error message:
[0,1,1][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60
[0,1,0][btl_tcp_endpoint.c:559:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=60
When I ran the job on machine two across the nodes, only processes on this machine would start and then hang. No processes would start on machine one and I didn't get any messages. In both cases, I had to Ctrl+C to kill the jobs. Any idea what was wrong? Thanks a lot!
errno 60 is ETIMEDOUT, which means that the connect() timed out before the remote side answered. The other way was probably a similar problem - there's something strange going on with the routing on the two nodes that's causing OMPI to get confused. Do your G5 machines have ethernet adapters other than the primary GigE cards (wireless, a second GigE card, a Firewire TCP stack) by any chance?

There's an issue with situations where there are multiple ethernet cards that causes the TCP btl to behave badly like this. We think we have it fixed in the latest 1.0.2 pre-release tarball of Open MPI, so it might help to upgrade to that version:

http://www.open-mpi.org/software/ompi/v1.0/
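For what it's worth, the "complete_connect" step named in that error is most likely the usual BSD-sockets idiom for finishing a non-blocking connect(): wait until the socket is writable, then read SO_ERROR to see whether the connection actually succeeded. A generic sketch of that pattern (not Open MPI's actual code; the function name here is made up) is below. errno 60 (ETIMEDOUT on Mac OS X) is what comes back when the remote side never answers.

/* Sketch of completing a non-blocking connect(); this is the standard
 * idiom, not code copied from btl_tcp_endpoint.c. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Call once select()/poll() reports the socket writable.  Returns 0 if the
 * connect() succeeded, otherwise the pending error (60 == ETIMEDOUT). */
int complete_connect(int fd)
{
    int so_error = 0;
    socklen_t len = sizeof(so_error);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &so_error, &len) < 0)
        return errno;

    if (so_error != 0)
        fprintf(stderr, "connect() failed with errno=%d (%s)\n",
                so_error, strerror(so_error));

    return so_error;
}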
Brian
--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users