You made my day Gus! Thank you very much. If I had asked before, I would
have finished within two hours (but I guess that's part of the learning
process). Very straightforward! Although I tried doing exactly what you
said, the information I found on Google is not clear, and sometimes
misleading, about what to install and where.
Thanks a lot Gus.
~Belaid.

> Date: Tue, 1 Dec 2009 19:15:53 -0500
> From: g...@ldeo.columbia.edu
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] mpirun is using one PBS node only
>
> Hi Belaid Moa
>
> Belaid MOA wrote:
> > In that case, the way I installed it is not right. I thought that
> > only the HN should be configured with tm support, not the worker
> > nodes; the worker nodes only have the PBS daemon clients - no need
> > for tm support on the worker nodes.
> >
> > When I ran ompi_info | grep tm on the worker nodes, the output is
> > empty.
>
> Yes, it is clear that OpenMPI on your worker nodes
> doesn't have "tm" support.
> Again, I would guess this is the reason you can't get even hostname
> to run on more than one node.
>
> Just reinstall OpenMPI with TM support on the head node
> *on an NFS-mounted directory*, and life will be much easier!
> All nodes, head and worker, will see the same OpenMPI version.
> It works very well for me here.
> The only additional thing you may need to do is to add the OpenMPI
> bin directory to your PATH and the OpenMPI lib directory to
> LD_LIBRARY_PATH in your .bashrc/.cshrc file (or in appropriate .csh
> and .sh files in the /etc/profile.d directory).
> Upgrades will also be much simpler.
> The only disadvantage of this scheme may show up on large clusters,
> where scaling may bump into NFS limitations, but with only three
> nodes that is certainly not your case.
>
> > The information on the following link has misled me then:
> > http://www.physics.iitm.ac.in/~sanoop/linux_files/cluster.html
> > (check the OpenMPI Configuration section.)
>
> I suggest that you refer to the OpenMPI site instead.
> That is the authoritative source of information about OpenMPI.
> Their FAQs have a lot of information:
> http://www.open-mpi.org/faq/
> Likewise, the README file that comes with the OpenMPI tarball
> is very clarifying.
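The PATH/LD_LIBRARY_PATH additions Gus describes might look like this in .bashrc. This is a sketch; the /shared/openmpi prefix is an assumption, so substitute whatever NFS-mounted prefix you actually installed into:

```shell
# Hypothetical NFS-mounted Open MPI install prefix - adjust to your site.
OMPI_PREFIX=/shared/openmpi
# Put the Open MPI bin directory first so this build shadows any
# distribution-provided mpirun/mpiexec that may also be on the PATH.
export PATH="$OMPI_PREFIX/bin:$PATH"
# Prepend the lib directory, handling the case where LD_LIBRARY_PATH
# was previously unset (avoid a trailing ":").
export LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Because every node mounts the same directory, these two lines are the only per-user setup needed once the shared install exists.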
>
> I hope this helps,
> Gus Correa
> ---------------------------------------------------------------------
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> ---------------------------------------------------------------------
>
> > ~Belaid.
> >
> > > Date: Tue, 1 Dec 2009 18:36:15 -0500
> > > From: g...@ldeo.columbia.edu
> > > To: us...@open-mpi.org
> > > Subject: Re: [OMPI users] mpirun is using one PBS node only
> > >
> > > Hi Belaid Moa
> > >
> > > The OpenMPI I install and use is on an NFS-mounted directory.
> > > Hence, all the nodes see the same version, which has "tm"
> > > support.
> > >
> > > After reading your OpenMPI configuration parameters on the head
> > > node and worker nodes (and the difference between them),
> > > I would guess (just a guess) that the problem you see is because
> > > the OpenMPI version on your nodes (probably) does not have
> > > Torque support.
> > >
> > > However, you should first verify that this is really the case,
> > > because if the OpenMPI configure script finds the Torque
> > > libraries, it will (probably) configure and install OpenMPI with
> > > "tm" support, even if you don't ask for it explicitly on the
> > > worker nodes.
> > > Hence, ssh to WN1 or WN2 and run "ompi_info" to check this out
> > > first.
> > >
> > > If there is no Torque on WN1 and WN2, then OpenMPI won't find it
> > > and you won't have "tm" support on the nodes.
> > >
> > > In any case, if OpenMPI "tm" support is missing on WN[1,2],
> > > I would suggest that you reinstall OpenMPI on WN1 and WN2 *with
> > > tm support*.
> > > This will require that you have Torque on the worker nodes also,
> > > and that you use the same configure command line that you used
> > > on the head node.
> > >
> > > A low-tech alternative is to copy your OpenMPI directory tree
> > > over to the WN1 and WN2 nodes.
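Gus's "ssh to each node and run ompi_info" check can be scripted. The helper below is a sketch (the function name `has_tm` and the WN1/WN2 hostnames are illustrative); it matches the component field exactly, because a naive `grep tm` also hits the unrelated ptmalloc2 line that ompi_info prints:

```shell
# has_tm: read `ompi_info` output on stdin and report whether the
# build has Torque ("tm") support. Match the "ras: tm" / "plm: tm"
# component lines exactly; a plain `grep tm` would also match the
# "MCA memory: ptmalloc2" line and give a false positive.
has_tm() {
    if grep -Eq '(ras|plm): tm '; then
        echo yes
    else
        echo no
    fi
}

# Usage sketch: check the local build, then each worker node.
#   ompi_info | has_tm
#   for node in WN1 WN2; do
#       printf '%s: ' "$node"; ssh "$node" ompi_info | has_tm
#   done
```

If any node prints "no", that node's OpenMPI was built without tm support and mpirun cannot read the PBS node list there.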
> > >
> > > A yet simpler alternative is to reinstall OpenMPI on the head
> > > node on an NFS-mounted directory (as I do here), then
> > > add the corresponding "bin" path to your PATH,
> > > and the corresponding "lib" path to your LD_LIBRARY_PATH
> > > environment variable.
> > >
> > > Think about maintenance and upgrades:
> > > on an NFS-mounted directory you need to install only once,
> > > whereas the way you have it now you need to do it N+1 times
> > > (or have a mechanism to propagate a single installation from
> > > the head node to the compute nodes).
> > >
> > > NFS is your friend! :)
> > >
> > > I hope this helps,
> > > Gus Correa
> > > ---------------------------------------------------------------------
> > > Gustavo Correa
> > > Lamont-Doherty Earth Observatory - Columbia University
> > > Palisades, NY, 10964-8000 - USA
> > > ---------------------------------------------------------------------
> > >
> > > Belaid MOA wrote:
> > > > I tried the -bynode option but it did not change anything. I
> > > > also tried the "hostname" command, and I keep getting only
> > > > the name of one node, repeated according to the -n value.
> > > >
> > > > Just to make sure I did the right installation, here is what
> > > > I did:
> > > >
> > > > - On the head node (HN), I installed OpenMPI using the
> > > >   --with-tm option as follows:
> > > >
> > > >   ./configure --with-tm=/var/spool/torque --enable-static
> > > >   make install all
> > > >
> > > > - On the worker nodes (WN1 and WN2), I installed OpenMPI
> > > >   without the tm option, as follows (it is a local
> > > >   installation on each worker node):
> > > >
> > > >   ./configure --enable-static
> > > >   make install all
> > > >
> > > > Is this correct?
> > > >
> > > > Thanks a lot in advance.
> > > > ~Belaid.
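The single NFS-visible install Gus recommends might look like this, run once on the head node. This is a sketch, not a verified recipe: the /shared/openmpi prefix is an assumption, and the --with-tm path is simply the one used elsewhere in this thread:

```shell
# One build, installed into an NFS-exported prefix that every node
# mounts. /shared/openmpi is a hypothetical mount point - use your own.
./configure --prefix=/shared/openmpi \
            --with-tm=/var/spool/torque \
            --enable-static
make all install
```

After this, every node resolves the same tm-enabled mpirun and libraries, provided PATH and LD_LIBRARY_PATH point at the shared prefix on all of them.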
> > > > > Date: Tue, 1 Dec 2009 17:07:58 -0500
> > > > > From: g...@ldeo.columbia.edu
> > > > > To: us...@open-mpi.org
> > > > > Subject: Re: [OMPI users] mpirun is using one PBS node only
> > > > >
> > > > > Hi Belaid Moa
> > > > >
> > > > > Belaid MOA wrote:
> > > > > > Thanks a lot Gus for your help again. I only have one CPU
> > > > > > per node. The -n X option (no matter what the value of X
> > > > > > is) shows X processes running on one node only (the other
> > > > > > one is free).
> > > > >
> > > > > So, somehow it is oversubscribing your single processor
> > > > > on the first node.
> > > > >
> > > > > A simple diagnostic:
> > > > >
> > > > > Have you tried to run "hostname" on the two nodes through
> > > > > Torque/PBS and mpiexec?
> > > > >
> > > > > [PBS directives, cd $PBS_O_WORKDIR, etc]
> > > > > ...
> > > > > /full/path/to/openmpi/bin/mpiexec -n 2 hostname
> > > > >
> > > > > Try also with the -byslot and -bynode options.
> > > > >
> > > > > > If I add the machinefile option with WN1 and WN2 in it,
> > > > > > the right behavior is manifested. According to the
> > > > > > documentation, mpirun should get the PBS_NODEFILE
> > > > > > automatically from PBS.
> > > > >
> > > > > Yes, if you compiled the OpenMPI you are using with Torque
> > > > > ("tm") support. Did you?
> > > > > Make sure it has tm support.
> > > > > Run "ompi_info" (with the full path if needed) to check
> > > > > that.
> > > > > Are you sure the correct path to what you want is
> > > > > /usr/local/bin/mpirun ?
> > > > > Linux distributions, compilers, and other tools come with
> > > > > their own mpiexec and put it in places you may not suspect,
> > > > > so double-check that you get the one you want.
> > > > > That has been a source of repeated confusion on this and
> > > > > other mailing lists.
> > > > >
> > > > > Also, make sure that passwordless ssh across the nodes is
> > > > > working.
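The -byslot/-bynode distinction Gus keeps pointing at can be illustrated with a toy sketch (plain sh; this is not OpenMPI code, and the function names and WN1/WN2 hostnames are illustrative). Given the slot list from a nodefile, -byslot fills one node's slots before moving on, while -bynode round-robins ranks across distinct nodes:

```shell
# Toy models of Open MPI's rank placement. Arguments: the rank count,
# then the nodefile entries (one entry per slot, in file order).

# -byslot (the default): ranks consume slots in nodefile order, so a
# node's slots fill up before the next node gets any ranks.
byslot() {
    n=$1; shift
    printf '%s\n' "$@" | head -n "$n"
}

# -bynode: ranks round-robin across the distinct nodes in the list.
bynode() {
    n=$1; shift
    printf '%s\n' "$@" | awk -v n="$n" '
        !seen[$0]++ { nodes[++k] = $0 }   # record each node once
        END { for (i = 0; i < n; i++) print nodes[i % k + 1] }'
}
```

For example, with nodefile entries WN1 WN1 WN2 WN2 (two slots per node), `byslot 2` places both ranks on WN1, while `bynode 2` spreads them over WN1 and WN2.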
> > > > >
> > > > > Yet another thing to check: for easy name resolution,
> > > > > your /etc/hosts file on *all* nodes, including the head
> > > > > node, should have a list of all nodes and their IP
> > > > > addresses. Something like this:
> > > > >
> > > > > 127.0.0.1 localhost.localdomain localhost
> > > > > 192.168.0.1 WN1
> > > > > 192.168.0.2 WN2
> > > > >
> > > > > (The IPs above are guesswork of mine; you know better which
> > > > > ones to use.)
> > > > >
> > > > > > So, I do not need to use machinefile.
> > > > >
> > > > > True, assuming the first condition above (OpenMPI *with*
> > > > > "tm" support).
> > > > >
> > > > > > Any ideas?
> > > > >
> > > > > Yes, and I sent it to you in my last email!
> > > > > Try the "-bynode" option of mpiexec.
> > > > > ("man mpiexec" is your friend!)
> > > > >
> > > > > > Thanks a lot in advance.
> > > > > > ~Belaid.
> > > > >
> > > > > Best of luck!
> > > > > Gus Correa
> > > > > ---------------------------------------------------------------------
> > > > > Gustavo Correa
> > > > > Lamont-Doherty Earth Observatory - Columbia University
> > > > > Palisades, NY, 10964-8000 - USA
> > > > > ---------------------------------------------------------------------
> > > > >
> > > > > PS - Your web site link to Paul Krugman is out of date.
> > > > > Here is one to his (active) blog,
> > > > > and another to his (no longer updated) web page: :)
> > > > >
> > > > > http://krugman.blogs.nytimes.com/
> > > > > http://www.princeton.edu/~pkrugman/
> > > > >
> > > > > > > Date: Tue, 1 Dec 2009 15:42:30 -0500
> > > > > > > From: g...@ldeo.columbia.edu
> > > > > > > To: us...@open-mpi.org
> > > > > > > Subject: Re: [OMPI users] mpirun is using one PBS node
> > > > > > > only
> > > > > > >
> > > > > > > Hi Belaid Moa
> > > > > > >
> > > > > > > Belaid MOA wrote:
> > > > > > > > Hi everyone,
> > > > > > > > Here is another elementary question.
> > > > > > > > I tried the following steps, found in the FAQ section
> > > > > > > > of www.open-mpi.org, with a simple hello world
> > > > > > > > example (with PBS/Torque):
> > > > > > > >
> > > > > > > > $ qsub -l nodes=2 my_script.sh
> > > > > > > >
> > > > > > > > my_script.sh is pasted below:
> > > > > > > > ========================
> > > > > > > > #!/bin/sh -l
> > > > > > > > #PBS -N helloTest
> > > > > > > > #PBS -j eo
> > > > > > > > echo `cat $PBS_NODEFILE` # shows two nodes: WN1 WN2
> > > > > > > > cd $PBS_O_WORKDIR
> > > > > > > > /usr/local/bin/mpirun hello
> > > > > > > > ========================
> > > > > > > >
> > > > > > > > When the job is submitted, only one process is run.
> > > > > > > > When I add the -n 2 option to the mpirun command,
> > > > > > > > two processes are run, but on one node only.
> > > > > > >
> > > > > > > Do you have a single CPU/core per node?
> > > > > > > Or are they multi-socket/multi-core?
> > > > > > >
> > > > > > > Check "man mpiexec" for the options that control on
> > > > > > > which nodes, slots, etc. your program will run.
> > > > > > > ("Man mpiexec" will tell you more than I possibly can.)
> > > > > > >
> > > > > > > The default option is "-byslot",
> > > > > > > which will use all "slots" (actually cores or CPUs)
> > > > > > > available on a node before it moves to the next node.
> > > > > > > Reading your question and your surprise at the result,
> > > > > > > I would guess what you want is "-bynode" (not the
> > > > > > > default).
> > > > > > >
> > > > > > > Also, if you have more than one CPU/core per node,
> > > > > > > you need to put this information in your Torque/PBS
> > > > > > > "nodes" file (and restart your pbs_server daemon).
> > > > > > > Something like this (for 2 CPUs/cores per node):
> > > > > > >
> > > > > > > WN1 np=2
> > > > > > > WN2 np=2
> > > > > > >
> > > > > > > I hope this helps,
> > > > > > > Gus Correa
> > > > > > > ---------------------------------------------------------------------
> > > > > > > Gustavo Correa
> > > > > > > Lamont-Doherty Earth Observatory - Columbia University
> > > > > > > Palisades, NY, 10964-8000 - USA
> > > > > > > ---------------------------------------------------------------------
> > > > > > >
> > > > > > > > Note that echo `cat $PBS_NODEFILE` outputs
> > > > > > > > the two nodes I am using: WN1 and WN2.
> > > > > > > >
> > > > > > > > The output from ompi_info is shown below:
> > > > > > > >
> > > > > > > > $ ompi_info | grep tm
> > > > > > > > MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
> > > > > > > > MCA ras: tm (MCA v2.0, API v2.0, Component v1.3.3)
> > > > > > > > MCA plm: tm (MCA v2.0, API v2.0, Component v1.3.3)
> > > > > > > >
> > > > > > > > Any help on why OpenMPI/mpirun is using only one PBS
> > > > > > > > node is very much appreciated.
> > > > > > > >
> > > > > > > > Thanks a lot in advance, and sorry for bothering you
> > > > > > > > guys with my elementary questions!
> > > > > > > >
> > > > > > > > ~Belaid.
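As a side note on the nodes file: each np= slot becomes one line in $PBS_NODEFILE, so the file can be inspected to see how many slots and how many distinct nodes a job actually received. A small sketch (the helper names are illustrative):

```shell
# Count slots (lines) and distinct nodes in a PBS nodefile.
# With a nodes file of "WN1 np=2" and "WN2 np=2", the nodefile lists
# each hostname twice - one line per slot.
count_slots() { wc -l < "$1" | tr -d ' '; }
count_nodes() { sort -u "$1" | wc -l | tr -d ' '; }
```

Inside a job script, `count_nodes "$PBS_NODEFILE"` should print 2 for a `-l nodes=2` request; if it prints 1, the scheduler handed the job only one node and mpirun cannot be blamed.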
> > > > > > > >
> > > > > > > > _______________________________________________
> > > > > > > > users mailing list
> > > > > > > > us...@open-mpi.org
> > > > > > > > http://www.open-mpi.org/mailman/listinfo.cgi/users