Hi Hugh

I'm sorry, but I must admit that I have never encountered these messages myself, and I don't know exactly what causes them.
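As a guess, though (another blind shot): the libibverbs/OpenIB/uDAPL warnings usually just mean
that a node has no InfiniBand or uDAPL hardware, so Open MPI falls back to another transport
(TCP). I don't think they are what makes the job hang, but if you want to rule them out, you
could try restricting the byte transfer layers (BTLs) to tcp and self explicitly, something
like the following (with ./a.out standing in for your compiled test program):

--
mpirun --mca btl tcp,self -np 2 --hostfile hostfile ./a.out
--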
Perhaps one of the developers can give an explanation?

Jody

On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson <h.j.dickin...@durham.ac.uk> wrote:
> Hi again,
>
> I tried a simple MPI C++ program:
>
> --
> #include <iostream>
> #include <mpi.h>
>
> using namespace MPI;
> using namespace std;
>
> int main(int argc, char* argv[]) {
>     int rank, size;
>     Init(argc, argv);
>     rank = COMM_WORLD.Get_rank();
>     size = COMM_WORLD.Get_size();
>     cout << "P:" << rank << " out of " << size << endl;
>     Finalize();
> }
> --
>
> It didn't work over all the nodes, again the same problem - the system seems to
> hang. However, by forcing mpirun to use only the node on which I'm
> launching mpirun, I get some more error messages:
>
> --
> libibverbs: Fatal: couldn't read uverbs ABI version.
> libibverbs: Fatal: couldn't read uverbs ABI version.
> --------------------------------------------------------------------------
> [0,1,0]: OpenIB on host gamma2 was unable to find any HCAs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,1]: OpenIB on host gamma2 was unable to find any HCAs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,1]: uDAPL on host gamma2 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> [0,1,0]: uDAPL on host gamma2 was unable to find any NICs.
> Another transport will be used instead, although this may result in
> lower performance.
> --------------------------------------------------------------------------
> --
>
> However, as before, the program does work in this special case, and I get:
> --
> P:0 out of 2
> P:1 out of 2
> --
>
> Do these errors indicate a problem with the Open MPI installation?
>
> Hugh
>
> On 28 Apr 2009, at 16:36, Hugh Dickinson wrote:
>
>> Hi Jody,
>>
>> I can passwordlessly ssh between all nodes (to and from).
>> Almost none of these mpirun commands work. The only working case is if
>> nodenameX is the node from which you are running the command. I don't know
>> if this gives you extra diagnostic information, but if I explicitly set the
>> wrong prefix (using --prefix), then I get errors from all the nodes telling
>> me the daemon would not start. I don't get these errors normally. It seems
>> to me that the communication is working okay, at least in the outward
>> direction (and from all nodes). Could this be a problem with forwarding of
>> standard output? If I were to try a simple hello world program, is this more
>> likely to work, or am I just adding another layer of complexity?
>>
>> Cheers,
>>
>> Hugh
>>
>> On 28 Apr 2009, at 15:55, jody wrote:
>>
>>> Hi Hugh
>>> You're right, there is no initialization command (like lamboot) you
>>> have to call.
>>>
>>> I don't really know why your setup doesn't work, so I'm making some
>>> more "blind shots".
>>>
>>> Can you do passwordless ssh between any two of your nodes?
>>>
>>> Does
>>>     mpirun -np 1 --host nodenameX uptime
>>> work for every X when called from any of your nodes?
>>>
>>> Have you tried
>>>     mpirun -np 2 --host nodename1,nodename2 uptime
>>> (i.e. not using the host file)?
>>>
>>> Jody
>>>
>>> On Tue, Apr 28, 2009 at 4:37 PM, Hugh Dickinson
>>> <h.j.dickin...@durham.ac.uk> wrote:
>>>>
>>>> Hi Jody,
>>>>
>>>> The node names are exactly the same. I wanted to avoid updating the version
>>>> because I'm not the system administrator, and it could take some time before
>>>> it gets done. If it's likely to fix the problem, though, I'll try it. I'm
>>>> assuming that I don't have to do something analogous to the old "lamboot"
>>>> command to initialise Open MPI on all the nodes. I've seen no documentation
>>>> anywhere that says I should.
>>>>
>>>> Cheers,
>>>>
>>>> Hugh
>>>>
>>>> On 28 Apr 2009, at 15:28, jody wrote:
>>>>
>>>>> Hi Hugh
>>>>>
>>>>> Again, just to make sure: are the hostnames in your host file well-known?
>>>>> I.e. when you say you can do
>>>>>     ssh nodename uptime
>>>>> do you use exactly the same nodename in your host file?
>>>>> (I'm trying to eliminate all non-Open-MPI error sources,
>>>>> because with your setup it should basically work.)
>>>>>
>>>>> One more point to consider is updating to Open MPI 1.3.
>>>>> I don't think your Open MPI version is the cause of your trouble,
>>>>> but there have been quite a few changes since v1.2.5.
>>>>>
>>>>> Jody
>>>>>
>>>>> On Tue, Apr 28, 2009 at 3:22 PM, Hugh Dickinson
>>>>> <h.j.dickin...@durham.ac.uk> wrote:
>>>>>>
>>>>>> Hi Jody,
>>>>>>
>>>>>> Indeed, all the nodes are running the same version of Open MPI. Perhaps I
>>>>>> was incorrect to describe the cluster as heterogeneous. In fact, all the
>>>>>> nodes run the same operating system (Scientific Linux 5.2); it's only the
>>>>>> hardware that's different, and even then they're all i386 or i686. I'm also
>>>>>> attaching the output of ompi_info --all, as I've seen it's suggested in the
>>>>>> mailing list instructions.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Hugh
>>>>>>
>>>>>> Hi Hugh
>>>>>>
>>>>>> Just to make sure:
>>>>>> You have installed Open MPI on all your nodes?
>>>>>> Same version everywhere?
>>>>>>
>>>>>> Jody
>>>>>>
>>>>>> On Tue, Apr 28, 2009 at 12:57 PM, Hugh Dickinson
>>>>>> <h.j.dickinson_at_[hidden]> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> First of all, let me make it perfectly clear that I'm a complete beginner
>>>>>>> as far as MPI is concerned, so this may well be a trivial problem!
>>>>>>>
>>>>>>> I've tried to set up Open MPI to use SSH to communicate between nodes on
>>>>>>> a heterogeneous cluster. I've set up passwordless SSH and it seems to be
>>>>>>> working fine. For example, by hand I can do:
>>>>>>>
>>>>>>>     ssh nodename uptime
>>>>>>>
>>>>>>> and it returns the appropriate information for each node.
>>>>>>> I then tried running a non-MPI program on all the nodes at the same time:
>>>>>>>
>>>>>>>     mpirun -np 10 --hostfile hostfile uptime
>>>>>>>
>>>>>>> where hostfile is a list of the 10 cluster node names with slots=1 after
>>>>>>> each one, i.e.
>>>>>>>
>>>>>>>     nodename1 slots=1
>>>>>>>     nodename2 slots=2
>>>>>>>     etc...
>>>>>>>
>>>>>>> Nothing happens! The process just seems to hang. If I interrupt the
>>>>>>> process with Ctrl-C I get:
>>>>>>>
>>>>>>> "
>>>>>>> mpirun: killing job...
>>>>>>>
>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>> base/pls_base_orted_cmds.c at line 275
>>>>>>> [gamma2.phyastcl.dur.ac.uk:18124] [0,0,0] ORTE_ERROR_LOG: Timeout in file
>>>>>>> pls_rsh_module.c at line 1166
>>>>>>>
>>>>>>> --------------------------------------------------------------------------
>>>>>>> WARNING: mpirun has exited before it received notification that all
>>>>>>> started processes had terminated. You should double check and ensure
>>>>>>> that there are no runaway processes still executing.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> "
>>>>>>>
>>>>>>> If, instead of using the hostfile, I specify on the command line the host
>>>>>>> from which I'm running mpirun, e.g.:
>>>>>>>
>>>>>>>     mpirun -np 1 --host nodename uptime
>>>>>>>
>>>>>>> then it works (i.e. if it doesn't need to communicate with other nodes).
>>>>>>> Do I need to tell Open MPI it should be using SSH to communicate? If so,
>>>>>>> how do I do this? To be honest, I think it's trying to do so, because
>>>>>>> before I set up passwordless SSH it challenged me for lots of passwords.
>>>>>>>
>>>>>>> I'm running Open MPI 1.2.5 installed with Scientific Linux 5.2. Let me
>>>>>>> reiterate, it's very likely that I've done something stupid, so all
>>>>>>> suggestions are welcome.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Hugh