Hi,
I just compiled without CUDA, and the result is the same: no output, and it
exits with code 65.
[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 => (0x00007fff3ab31000)
libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x00007fab9ec2a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000381c000000)
libc.so.6 => /lib64/libc.so.6 (0x000000381bc00000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000381c800000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000381c400000)
libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x00007fab9e932000)
libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x0000003918200000)
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003917e00000)
libz.so.1 => /lib64/libz.so.1 (0x000000381cc00000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x0000003821000000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x0000003823000000)
libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x00007fab9e64a000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000381b800000)
librt.so.1 => /lib64/librt.so.1 (0x00000035b3600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003c25a00000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003f71000000)
/lib64/ld-linux-x86-64.so.2 (0x000000381b400000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003917a00000)
libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x00007fab9e433000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x0000003822400000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003821400000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x000000381e400000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003821800000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003821c00000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003822000000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000000381dc00000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x000000381d000000)
[mboisson@helios-login1 examples]$ mpiexec ring_c
[mboisson@helios-login1 examples]$ echo $?
65
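
For reference, ring_c is the ring-passing example shipped with Open MPI; the sketch below is only an approximation of what it does, not the exact Open MPI source. Rank 0 normally prints something before the first message goes around, so a run that produces no output at all and exits non-zero is consistent with the failure happening in MPI_Init itself rather than in the ring logic.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbours in the ring. */
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* Rank 0 starts the countdown. */
    if (rank == 0) {
        message = 10;
        printf("Process 0 sending %d to %d across %d processes\n",
               message, next, size);
        MPI_Send(&message, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    /* Pass the value around; rank 0 decrements it each lap. */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0)
            message--;
        MPI_Send(&message, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        if (message == 0)
            break;
    }

    /* Rank 0 still has one incoming message to drain. */
    if (rank == 0)
        MPI_Recv(&message, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("Process %d exiting\n", rank);
    MPI_Finalize();
    return 0;
}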
Maxime
On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:
Just out of curiosity: I saw that one of the segv stack traces involved the
CUDA stack.
Can you try a build without CUDA and see if that resolves the problem?
On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Hi Jeff,
On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Correct.
Could it be because Torque (pbs_mom) is not running on the head node and mpiexec
attempts to contact it?
Not for Open MPI's mpiexec, no.
Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM
stuff (i.e., Torque stuff) if it sees the environment variable markers
indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or
localhost launch in your case, since you didn't specify any hosts).
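
As a rough illustration of that detection logic (the authoritative list of markers is in Open MPI's TM components; PBS_ENVIRONMENT and PBS_JOBID are simply the usual variables Torque sets inside a job and are assumptions here), a check of this kind might look like:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Outside a Torque job these markers are typically absent, so an
     * rsh/ssh or localhost launch is used instead of TM.  (Assumed
     * variable names, for illustration only.) */
    const char *pbs_env = getenv("PBS_ENVIRONMENT");
    const char *pbs_job = getenv("PBS_JOBID");

    if (pbs_env != NULL || pbs_job != NULL)
        printf("Torque job detected (PBS_ENVIRONMENT=%s, PBS_JOBID=%s)\n",
               pbs_env ? pbs_env : "(unset)", pbs_job ? pbs_job : "(unset)");
    else
        printf("No Torque job markers found; TM launch would not be attempted\n");

    return 0;
}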
If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI
"hostname" command from Linux), then something is seriously borked with your Open MPI
installation.
mpirun -np 4 hostname works fine:
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0
Try running with:
mpirun -np 4 --mca plm_base_verbose 10 hostname
This should show the steps OMPI is trying to take to launch the 4 copies of
"hostname" and potentially give some insight into where it's hanging.
Also, just to make sure, you have ensured that you're compiling everything with
a single compiler toolchain, and the support libraries from that specific
compiler toolchain are available on any server on which you're running (to
include the head node and compute nodes), right?
Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6, with the
same results). Almost all of the software (compilers, toolchains, etc.) is
installed on Lustre, built from source, and is the same on both the login (head)
node and the compute nodes.
The few differences between the head node and the compute nodes:
1) The compute nodes run from a RAM filesystem, while the login node is installed on disk.
2) The compute nodes and the login node have different hardware configurations (the compute
nodes have GPUs, the head node does not).
3) The login node has more CentOS 6 packages than the compute nodes (such as the -devel
packages, some fonts/X11 libraries, etc.), but every package that is on the compute nodes
is also on the login node.
And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the
Open MPI installation that you expect them to point to. E.g., if you "ldd ring_c", it
shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you
expect. Etc.
As per the content of "env.out" in the archive, yes. They point to the OMPI
1.8.2rc4 installation directories, on Lustre, and are shared between the head node and
the compute nodes.
Maxime
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics