Just out of curiosity: I noticed that one of the segv stack traces involved the CUDA stack. Can you try a build without CUDA and see if that resolves the problem?
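A minimal sketch of what I mean, assuming a standard autoconf-style rebuild (the install prefix and compiler names below are only placeholders for whatever you normally use -- the key point is simply to leave --with-cuda out of the configure line):

    # rebuild without CUDA support: just omit --with-cuda
    cd openmpi-1.8.2rc4
    ./configure --prefix=/path/to/openmpi-1.8.2rc4-nocuda \
        CC=gcc CXX=g++ FC=gfortran
    make -j8 all
    make install

Then point PATH and LD_LIBRARY_PATH at that new prefix and re-run your ring_c / mpirun tests.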
On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Hi Jeff,
>
> On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
>> On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>
>>> Correct.
>>>
>>> Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it?
>> Not for Open MPI's mpiexec, no.
>>
>> Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts).
>>
>> If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation.
> mpirun -np 4 hostname works fine:
>
> [mboisson@helios-login1 ~]$ which mpirun
> /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
> [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
> helios-login1
> helios-login1
> helios-login1
> helios-login1
> 0
>
>> Try running with:
>>
>> mpirun -np 4 --mca plm_base_verbose 10 hostname
>>
>> This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging.
>>
>> Also, just to make sure, you have ensured that you're compiling everything with a single compiler toolchain, and the support libraries from that specific compiler toolchain are available on any server on which you're running (to include the head node and compute nodes), right?
> Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6 with the same results). Almost all of the software (compilers, toolchains, etc.) is installed on Lustre, built from source, and is the same on both the login (head) node and the compute nodes.
>
> The few differences between the head node and the compute nodes:
> 1) Compute nodes run from a RAM filesystem - the login node is installed on disk.
> 2) Compute nodes and the login node have different hardware configurations (compute nodes have GPUs, the head node does not).
> 3) The login node has MORE CentOS 6 packages than the compute nodes (such as the -devel packages, some fonts/X11 libraries, etc.), but all the packages that are on the compute nodes are also on the login node.
>
>> And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the Open MPI installation that you expect it to point to. E.g., if you "ldd ring_c", it shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you expect. Etc.
> As per the content of "env.out" in the archive, yes. They point to the OMPI 1.8.2rc4 installation directories, on Lustre, and are shared between the head node and the compute nodes.
>
>
> Maxime
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25043.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/