In this instance, OMPI is complaining that you are attempting to use InfiniBand, but no suitable devices were found.

I assume you have Ethernet between your nodes? Can you run this with the following added to your mpirun command line:

  -mca btl tcp,self

That will cause OMPI to ignore the InfiniBand subsystem and attempt to run via TCP over any available Ethernet.
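As a minimal sketch of the full command, reusing the hostfile and the uptime test from your first message (adjust -np and the executable to whatever you are actually running):

  mpirun -np 10 --hostfile hostfile -mca btl tcp,self uptime

The self component is listed alongside tcp because it is what a process uses to send messages to itself; leaving it out of a restricted BTL list is a common source of puzzling failures.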
On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson <h.j.dickin...@durham.ac.uk> wrote:

> Many thanks for your help nonetheless.
>
> Hugh
>
> On 28 Apr 2009, at 17:23, jody wrote:
>
>> Hi Hugh
>>
>> I'm sorry, but I must admit that I have never encountered these messages, and I don't know exactly what their cause is.
>>
>> Perhaps one of the developers can give an explanation?
>>
>> Jody
>>
>> On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson <h.j.dickin...@durham.ac.uk> wrote:
>>
>>> Hi again,
>>>
>>> I tried a simple MPI C++ program:
>>>
>>> --
>>> #include <iostream>
>>> #include <mpi.h>
>>>
>>> using namespace MPI;
>>> using namespace std;
>>>
>>> int main(int argc, char* argv[]) {
>>>   int rank, size;
>>>   Init(argc, argv);
>>>   rank = COMM_WORLD.Get_rank();
>>>   size = COMM_WORLD.Get_size();
>>>   cout << "P:" << rank << " out of " << size << endl;
>>>   Finalize();
>>> }
>>> --
>>>
>>> It didn't work over all the nodes; again the same problem - the system seems to hang. However, by forcing mpirun to use only the node on which I'm launching mpirun, I get some more error messages:
>>>
>>> --
>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>> libibverbs: Fatal: couldn't read uverbs ABI version.
>>> --------------------------------------------------------------------------
>>> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
>>> Another transport will be used instead, although this may result in
>>> lower performance.
>>> --------------------------------------------------------------------------
>>> --
>>>
>>> However, as before, the program does work in this special case, and I get:
>>>
>>> --
>>> P:0 out of 2
>>> P:1 out of 2
>>> --
>>>
>>> Do these errors indicate a problem with the Open MPI installation?
>>>
>>> Hugh
>>>
>>> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>>>
>>>> Hi Jody,
>>>>
>>>> I can passwordlessly ssh between all nodes (to and from). Almost none of these mpirun commands work. The only working case is if nodenameX is the node from which you are running the command. I don't know if this gives you extra diagnostic information, but if I explicitly set the wrong prefix (using --prefix), then I get errors from all the nodes telling me the daemon would not start. I don't get these errors normally.
>>>> It seems to me that the communication is working okay, at least in the outwards direction (and from all nodes). Could this be a problem with forwarding of standard output? If I were to try a simple hello world program, is this more likely to work, or am I just adding another layer of complexity?
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 15:55, jody wrote:
>>>>
>>>>> Hi Hugh
>>>>>
>>>>> You're right, there is no initialization command (like lamboot) that you have to call.
>>>>>
>>>>> I don't really know why your setup doesn't work, so I'm making some more "blind shots":
>>>>>
>>>>> Can you do passwordless ssh between any two of your nodes?
>>>>>
>>>>> Does
>>>>>   mpirun -np 1 --host nodenameX uptime
>>>>> work for every X when called from any of your nodes?
>>>>>
>>>>> Have you tried
>>>>>   mpirun -np 2 --host nodename1,nodename2 uptime
>>>>> (i.e. not using the host file)?
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson <h.j.dickin...@durham.ac.uk> wrote:
>>>>>
>>>>>> Hi Jody,
>>>>>>
>>>>>> The node names are exactly the same. I wanted to avoid updating the version because I'm not the system administrator, and it could take some time before it gets done. If it's likely to fix the problem, though, I'll try it. I'm assuming that I don't have to do something analogous to the old "lamboot" command to initialise Open MPI on all the nodes. I've seen no documentation anywhere that says I should.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hugh
>>>>>>
>>>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>>>
>>>>>>> Hi Hugh
>>>>>>>
>>>>>>> Again, just to make sure: are the hostnames in your host file well-known? I.e., when you say you can do
>>>>>>>   ssh nodename uptime
>>>>>>> do you use exactly the same nodename in your host file? (I'm trying to eliminate all non-Open-MPI error sources, because with your setup it should basically work.)
>>>>>>>
>>>>>>> One more point to consider is updating to Open MPI 1.3. I don't think your Open MPI version is the cause of your trouble, but there have been quite a few changes since v1.2.5.
>>>>>>>
>>>>>>> Jody
>>>>>>>
>>>>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson <h.j.dickin...@durham.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Jody,
>>>>>>>>
>>>>>>>> Indeed, all the nodes are running the same version of Open MPI. Perhaps I was incorrect to describe the cluster as heterogeneous. In fact, all the nodes run the same operating system (Scientific Linux 5.2); it's only the hardware that's different, and even then they're all i386 or i686. I'm also attaching the output of ompi_info --all, as I've seen it's suggested in the mailing list instructions.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Hugh
>>>>>>>>
>>>>>>>> Hi Hugh
>>>>>>>>
>>>>>>>> Just to make sure: you have installed Open MPI on all your nodes? Same version everywhere?
>>>>>>>>
>>>>>>>> Jody
>>>>>>>>
>>>>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> First of all, let me make it perfectly clear that I'm a complete beginner as far as MPI is concerned, so this may well be a trivial problem!
>>>>>>>>> I've tried to set up Open MPI to use SSH to communicate between nodes on a heterogeneous cluster. I've set up passwordless SSH and it seems to be working fine. For example, by hand I can do:
>>>>>>>>>
>>>>>>>>>   ssh nodename uptime
>>>>>>>>>
>>>>>>>>> and it returns the appropriate information for each node.
>>>>>>>>>
>>>>>>>>> I then tried running a non-MPI program on all the nodes at the same time:
>>>>>>>>>
>>>>>>>>>   mpirun -np 10 --hostfile hostfile uptime
>>>>>>>>>
>>>>>>>>> where hostfile is a list of the 10 cluster node names with slots=1 after each one, i.e.
>>>>>>>>>
>>>>>>>>>   nodename1 slots=1
>>>>>>>>>   nodename2 slots=1
>>>>>>>>>   etc...
>>>>>>>>>
>>>>>>>>> Nothing happens! The process just seems to hang. If I interrupt the process with Ctrl-C I get:
>>>>>>>>>
>>>>>>>>> "
>>>>>>>>> mpirun: killing job...
>>>>>>>>>
>>>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275
>>>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> WARNING: mpirun has exited before it received notification that all
>>>>>>>>> started processes had terminated. You should double check and ensure
>>>>>>>>> that there are no runaway processes still executing.
>>>>>>>>> --------------------------------------------------------------------------
>>>>>>>>> "
>>>>>>>>>
>>>>>>>>> If, instead of using the hostfile, I specify on the command line the host from which I'm running mpirun, e.g.:
>>>>>>>>>
>>>>>>>>>   mpirun -np 1 --host nodename uptime
>>>>>>>>>
>>>>>>>>> then it works (i.e. if it doesn't need to communicate with other nodes). Do I need to tell Open MPI that it should be using SSH to communicate? If so, how do I do this? To be honest, I think it's trying to do so, because before I set up passwordless SSH it challenged me for lots of passwords.
>>>>>>>>>
>>>>>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me reiterate: it's very likely that I've done something stupid, so all suggestions are welcome.
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Hugh
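A couple of additional checks, sketched only from the symptoms quoted above; the component names (openib, udapl) and the grep pattern are the usual ones for a 1.2.x build, so adjust them to whatever ompi_info reports on your system:

  # List the BTL (point-to-point transport) components this build actually contains
  ompi_info | grep "MCA btl"

  # Alternatively, exclude the InfiniBand and uDAPL components rather than
  # listing only tcp and self ("^" means "every BTL except these")
  mpirun -np 10 --hostfile hostfile -mca btl ^openib,udapl uptime

  # If the hang over ssh persists even with TCP, daemon debugging output
  # usually shows where the remote orted processes get stuck
  mpirun -np 10 --hostfile hostfile -mca btl tcp,self --debug-daemons uptime

If the openib and udapl BTLs are present but there is no InfiniBand hardware on these nodes, the warnings quoted above are expected, and restricting the BTL list as described simply silences them.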