Well, it is getting better! :-) On your command line, which BTLs are you specifying? You should try -mca btl sm,tcp,self for this to work. Reason: some systems block TCP loopback on the node. What I see below indicates that inter-node communication was fine, but the two procs that share a node couldn't talk to each other. Including shared memory (sm) should remove that problem.
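Something along these lines should do it (just a sketch: the hostfile name and the test-program name below are placeholders, so substitute whatever you are actually launching):

  # "hosts" and ./connectivity_c are placeholders for your own hostfile and test binary
  mpirun -np 3 -hostfile hosts \
      -mca btl sm,tcp,self \
      -mca btl_tcp_if_include en0 \
      ./connectivity_c

Keeping -mca btl_base_verbose 30 on there as well should still show you which pairs are connecting over TCP.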
The port numbers are fine and can be different or the same - it is totally
random. The procs exchange their respective port info during wireup.

On Wed, Aug 12, 2009 at 12:51 PM, Jody Klymak <jkly...@uvic.ca> wrote:

> Hi Ralph,
> That gives me something more to work with...
>
> On Aug 12, 2009, at 9:44 AM, Ralph Castain wrote:
>
> I believe TCP works fine, Jody, as it is used on Macs fairly widely. I
> suspect this is something funny about your installation.
>
> One thing I have found is that you can get this error message when you
> have multiple NICs installed, each with a different subnet, and the procs
> try to connect across different ones. Do you by chance have multiple NICs?
>
> The head node has two active NICs:
> en0: public
> en1: private
>
> The server nodes only have one connection:
> en0: private
>
> Have you tried telling OMPI which TCP interface to use? You can do so with
> -mca btl_tcp_if_include eth0 (or whatever you want to use).
>
> If I try this, I get the same results (though I need to use "en0" on my
> machine)...
>
> If I include -mca btl_base_verbose 30 I get for n=2:
>
> ++++++++++
> [xserve03.local:00841] select: init of component tcp returned success
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> Done MPI init
> [xserve02.local:01094] btl: tcp: attempting to connect() to address
> 192.168.2.103 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> Connectivity test on 2 processes PASSED.
> ++++++++++
>
> If I try n=3 the job hangs and I have to kill it:
>
> ++++++++++
> Done MPI init
> checking connection between rank 0 on xserve02.local and rank 1
> [xserve02.local:01110] btl: tcp: attempting to connect() to address
> 192.168.2.103 on port 4
> Done MPI init
> Done MPI init
> checking connection between rank 1 on xserve03.local and rank 2
> [xserve03.local:00860] btl: tcp: attempting to connect() to address
> 192.168.2.102 on port 4
> Done checking connection between rank 0 on xserve02.local and rank 1
> checking connection between rank 0 on xserve02.local and rank 2
> Done checking connection between rank 0 on xserve02.local and rank 2
> mpirun: killing job...
> ++++++++++
>
> Those IP addresses are correct; no idea if port 4 makes sense. Sometimes I
> get port 260. Should xserve03 and xserve02 be trying to use the same port
> for these comms?
>
> Thanks, Jody
>
> On Wed, Aug 12, 2009 at 10:01 AM, Jody Klymak <jkly...@uvic.ca> wrote:
>
>> On Aug 11, 2009, at 18:55, Gus Correa wrote:
>>
>>> Did you wipe off the old directories before reinstalling?
>>
>> Check.
>>
>>> I prefer to install on a NFS mounted directory,
>>
>> Check.
>>
>>> Have you tried to ssh from node to node on all possible pairs?
>>
>> Check - fixed this today; works fine with the spawning user...
>>
>>> How could you roll back to 1.1.5,
>>> now that you overwrote the directories?
>>
>> Oh, I still have it on another machine off the cluster in
>> /usr/local/openmpi. Will take just 5 minutes to reinstall.
>>
>>> Launching jobs with Torque is much better than
>>> using barebones mpirun.
>>>
>>> And you don't want to stay behind with the OpenMPI versions
>>> and improvements either.
>>
>> Sure, but I'd like the jobs to be able to run at all...
>>
>> Is there any sense in rolling back to 1.2.3, since that is known to work
>> with OS X (it's the one that comes with 10.5)?
>> My only guess at this point is other OS X users are using non-tcpip
>> communication, and the tcp stuff just doesn't work in 1.3.3.
>>
>> Thanks, Jody
>
> --
> Jody Klymak
> http://web.uvic.ca/~jklymak/