Hi Hugh

Again, just to make sure, are the hostnames in your host file
well-known? I.e. when you say you can do
  ssh nodename uptime
do you use exactly the same nodename in your host file?
(I'm trying to eliminate all non-Open-MPI error sources,
because with your setup it should basically work.)
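One way to run that check over the whole host file at once is sketched below. It assumes a POSIX shell with awk and an OpenSSH client on the node you start mpirun from. The helper names hostfile_names and check_hosts are made up for this example and are not part of Open MPI.

```shell
#!/bin/sh
# Sketch: confirm that every name in an Open MPI hostfile answers
# "ssh <name> uptime" under exactly that name, without prompting.
# hostfile_names and check_hosts are hypothetical helper names.

# Print the first field (the node name) of every hostfile line,
# skipping blank lines and lines whose first field starts with '#'.
hostfile_names() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Run "uptime" on each node over ssh. BatchMode=yes makes ssh fail
# instead of asking for a password, and ConnectTimeout bounds the wait,
# so an unreachable or misnamed node shows up as FAILED rather than a hang.
check_hosts() {
    status=0
    for host in $(hostfile_names "$1"); do
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" uptime >/dev/null 2>&1; then
            echo "ok     $host"
        else
            echo "FAILED $host"
            status=1
        fi
    done
    return "$status"
}

# Usage, after sourcing this file:  check_hosts hostfile
```

Any node reported as FAILED is worth testing by hand with the name exactly as it is spelled in the host file.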
One more point to consider is to update to Open-MPI 1.3.
I don't think your Open-MPI version is the cause of your trouble,
but there have been quite a few changes since v1.2.5.

Jody

On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
<h.j.dickin...@durham.ac.uk> wrote:
> Hi Jody,
>
> Indeed, all the nodes are running the same version of Open MPI. Perhaps I
> was incorrect to describe the cluster as heterogeneous. In fact, all the
> nodes run the same operating system (Scientific Linux 5.2), it's only the
> hardware that's different and even then they're all i386 or i686. I'm also
> attaching the output of ompi_info --all as I've seen it's suggested in the
> mailing list instructions.
>
> Cheers,
>
> Hugh
>
> Hi Hugh
>
> Just to make sure:
> You have installed Open-MPI on all your nodes?
> Same version everywhere?
>
> Jody
>
> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
> <h.j.dickinson_at_[hidden]> wrote:
>> Hi all,
>>
>> First of all let me make it perfectly clear that I'm a complete beginner
>> as far as MPI is concerned, so this may well be a trivial problem!
>>
>> I've tried to set up Open MPI to use SSH to communicate between nodes on a
>> heterogeneous cluster. I've set up passwordless SSH and it seems to be
>> working fine. For example by hand I can do:
>>
>> ssh nodename uptime
>>
>> and it returns the appropriate information for each node.
>> I then tried running a non-MPI program on all the nodes at the same time:
>>
>> mpirun -np 10 --hostfile hostfile uptime
>>
>> Where hostfile is a list of the 10 cluster node names with slots=1 after
>> each one i.e
>>
>> nodename1 slots=1
>> nodename2 slots=2
>> etc...
>>
>> Nothing happens! The process just seems to hang. If I interrupt the
>> process with Ctrl-C I get:
>>
>> "
>>
>> mpirun: killing job...
>>
>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> base/pls_base_orted_cmds.c at line 275
>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>> pls_rsh_module.c at line 1166
>> --------------------------------------------------------------------------
>> WARNING: mpirun has exited before it received notification that all
>> started processes had terminated. You should double check and ensure
>> that there are no runaway processes still executing.
>> --------------------------------------------------------------------------
>>
>> "
>>
>> If, instead of using the hostfile, I specify on the command line the host
>> from which I'm running mpirun, e.g.:
>>
>> mpirun -np 1 --host nodename uptime
>>
>> then it works (i.e. if it doesn't need to communicate with other nodes).
>> Do I need to tell Open MPI it should be using SSH to communicate? If so,
>> how do I do this? To be honest I think it's trying to do so, because
>> before I set up passwordless SSH it challenged me for lots of passwords.
>>
>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
>> reiterate, it's very likely that I've done something stupid, so all
>> suggestions are welcome.
>>
>> Cheers,
>>
>> Hugh
>>
>> _______________________________________________
>> users mailing list
>> users_at_[hidden]
>> http://www.open-mpi.org/mailman/listinfo.cgi/users