Maxime, can you run with:
mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c

On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Hi,
> I just compiled without CUDA, and the result is the same. No output; it exits with code 65.
>
> [mboisson@helios-login1 examples]$ ldd ring_c
>     linux-vdso.so.1 => (0x00007fff3ab31000)
>     libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x00007fab9ec2a000)
>     libpthread.so.0 => /lib64/libpthread.so.0 (0x000000381c000000)
>     libc.so.6 => /lib64/libc.so.6 (0x000000381bc00000)
>     librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000381c800000)
>     libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000381c400000)
>     libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x00007fab9e932000)
>     libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x0000003918200000)
>     libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003917e00000)
>     libz.so.1 => /lib64/libz.so.1 (0x000000381cc00000)
>     libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x0000003821000000)
>     libssl.so.10 => /usr/lib64/libssl.so.10 (0x0000003823000000)
>     libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x00007fab9e64a000)
>     libdl.so.2 => /lib64/libdl.so.2 (0x000000381b800000)
>     librt.so.1 => /lib64/librt.so.1 (0x00000035b3600000)
>     libm.so.6 => /lib64/libm.so.6 (0x0000003c25a00000)
>     libutil.so.1 => /lib64/libutil.so.1 (0x0000003f71000000)
>     /lib64/ld-linux-x86-64.so.2 (0x000000381b400000)
>     libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003917a00000)
>     libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x00007fab9e433000)
>     libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x0000003822400000)
>     libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003821400000)
>     libcom_err.so.2 => /lib64/libcom_err.so.2 (0x000000381e400000)
>     libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003821800000)
>     libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003821c00000)
>     libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003822000000)
>     libresolv.so.2 => /lib64/libresolv.so.2 (0x000000381dc00000)
>     libselinux.so.1 => /lib64/libselinux.so.1 (0x000000381d000000)
>
> [mboisson@helios-login1 examples]$ mpiexec ring_c
> [mboisson@helios-login1 examples]$ echo $?
> 65
>
> Maxime
>
> On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:
>> Just out of curiosity, I saw that one of the segv stack traces involved the CUDA stack.
>>
>> Can you try a build without CUDA and see if that resolves the problem?
>>
>> On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault <maxime.boissonneault@calculquebec.ca> wrote:
>>
>>> Hi Jeff,
>>>
>>> On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
>>>> On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>
>>>>> Correct.
>>>>>
>>>>> Can it be because torque (pbs_mom) is not running on the head node and mpiexec attempts to contact it?
>>>>
>>>> Not for Open MPI's mpiexec, no.
>>>>
>>>> Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM stuff (i.e., Torque stuff) if it sees the environment variable markers indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or localhost launch in your case, since you didn't specify any hosts).
>>>>
>>>> If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI "hostname" command from Linux), then something is seriously borked with your Open MPI installation.
>>>
>>> mpirun -np 4 hostname works fine:
>>> [mboisson@helios-login1 ~]$ which mpirun
>>> /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
>>> [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
>>> helios-login1
>>> helios-login1
>>> helios-login1
>>> helios-login1
>>> 0
>>>
>>>> Try running with:
>>>>
>>>>     mpirun -np 4 --mca plm_base_verbose 10 hostname
>>>>
>>>> This should show the steps OMPI is trying to take to launch the 4 copies of "hostname" and potentially give some insight into where it's hanging.
>>>>
>>>> Also, just to make sure: you have ensured that you're compiling everything with a single compiler toolchain, and that the support libraries from that specific compiler toolchain are available on every server on which you're running (including the head node and the compute nodes), right?
>>>
>>> Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6, with the same results). Almost all of the software (compilers, toolchain, etc.) is installed on Lustre, built from source, and is the same on both the login (head) node and the compute nodes.
>>>
>>> The few differences between the head node and the compute nodes:
>>> 1) Compute nodes run from RAMFS - the login node is installed on disk.
>>> 2) Compute nodes and the login node have different hardware configurations (compute nodes have GPUs, the head node does not).
>>> 3) The login node has MORE CentOS 6 packages than the compute nodes (such as the -devel packages, some fonts/X11 libraries, etc.), but all the packages that are on the compute nodes are also on the login node.
>>>
>>>> And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the Open MPI installation that you expect them to point to. E.g., if you "ldd ring_c", it shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you expect. Etc.
>>>
>>> As per the content of "env.out" in the archive, yes. They point to the OMPI 1.8.2rc4 installation directories, on Lustre, and are shared between the head node and the compute nodes.
>>>
>>> Maxime
>>> _______________________________________________
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25043.php
>
> --
> ---------------------------------
> Maxime Boissonneault
> Computational Analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25050.php
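For readers following along: the ring_c test used throughout this thread is a small token-passing MPI program shipped in Open MPI's examples/ directory. Below is a minimal sketch of such a ring program (an illustrative reconstruction, not the exact Open MPI examples/ring_c source). Building it with the Open MPI wrapper compiler and launching it with the verbose command suggested above exercises the launcher (plm), MPI_Init, and basic point-to-point communication, so no output combined with a non-zero exit code suggests the failure happens during startup rather than in the application code itself.

/*
 * Minimal ring-style MPI program, similar in spirit to examples/ring_c
 * (illustrative sketch only, not the exact Open MPI example source).
 * Rank 0 injects a token into the ring; every rank receives from its
 * left neighbor and forwards to its right neighbor until the token
 * returns to rank 0.
 *
 * Build and run with the Open MPI wrappers under test:
 *   mpicc ring_sketch.c -o ring_sketch
 *   mpirun -np 4 --mca plm_base_verbose 10 ./ring_sketch
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, token;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;          /* right neighbor */
    int prev = (rank + size - 1) % size;   /* left neighbor  */

    if (rank == 0) {
        /* Rank 0 starts the token and later receives it back. */
        token = 10;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Rank 0 got the token back: %d\n", token);
    } else {
        /* All other ranks relay the token around the ring. */
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        printf("Rank %d passed the token\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Any MPI hello-world would serve equally well for isolating launcher problems; the ring variant just adds a simple Send/Recv exchange on top of MPI_Init/MPI_Finalize, so it also confirms that point-to-point messaging works once the processes do launch.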