Hi,
I just compiled without CUDA, and the result is the same: no output, and it
exits with code 65.
[mboisson@helios-login1 examples]$ ldd ring_c
linux-vdso.so.1 => (0x00007fff3ab31000)
libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x00007fab9ec2a000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000381c000000)
libc.so.6 => /lib64/libc.so.6 (0x000000381bc00000)
librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000381c800000)
libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000381c400000)
libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x00007fab9e932000)
libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x0000003918200000)
libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003917e00000)
libz.so.1 => /lib64/libz.so.1 (0x000000381cc00000)
libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x0000003821000000)
libssl.so.10 => /usr/lib64/libssl.so.10 (0x0000003823000000)
libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x00007fab9e64a000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000381b800000)
librt.so.1 => /lib64/librt.so.1 (0x00000035b3600000)
libm.so.6 => /lib64/libm.so.6 (0x0000003c25a00000)
libutil.so.1 => /lib64/libutil.so.1 (0x0000003f71000000)
/lib64/ld-linux-x86-64.so.2 (0x000000381b400000)
libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003917a00000)
libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x00007fab9e433000)
libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x0000003822400000)
libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003821400000)
libcom_err.so.2 => /lib64/libcom_err.so.2 (0x000000381e400000)
libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003821800000)
libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003821c00000)
libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003822000000)
libresolv.so.2 => /lib64/libresolv.so.2 (0x000000381dc00000)
libselinux.so.1 => /lib64/libselinux.so.1 (0x000000381d000000)
[mboisson@helios-login1 examples]$ mpiexec ring_c
[mboisson@helios-login1 examples]$ echo $?
65
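
For reference, ring_c is the ring-passing example shipped with Open MPI; the sketch below is only an approximation of what it does, not the exact Open MPI source. Rank 0 normally prints something before the first message goes around, so a run that produces no output at all and exits non-zero is consistent with the failure happening in MPI_Init itself rather than in the ring logic.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, next, prev, message;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Neighbours in the ring. */
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    /* Rank 0 starts the countdown. */
    if (rank == 0) {
        message = 10;
        printf("Process 0 sending %d to %d across %d processes\n",
               message, next, size);
        MPI_Send(&message, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    /* Pass the value around; rank 0 decrements it each lap. */
    while (1) {
        MPI_Recv(&message, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (rank == 0)
            message--;
        MPI_Send(&message, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        if (message == 0)
            break;
    }

    /* Rank 0 still has one incoming message to drain. */
    if (rank == 0)
        MPI_Recv(&message, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    printf("Process %d exiting\n", rank);
    MPI_Finalize();
    return 0;
}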
Maxime
On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:
Just out of curiosity: I saw that one of the segv stack traces involved the
CUDA stack.
Can you try a build without CUDA and see if that resolves the problem?
On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Hi Jeff,
On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault
<maxime.boissonnea...@calculquebec.ca> wrote:
Correct.
Could it be because Torque (pbs_mom) is not running on the head node and mpiexec
attempts to contact it?
Not for Open MPI's mpiexec, no.
Open MPI's mpiexec (mpirun -- they're the same to us) will only try to use TM
stuff (i.e., Torque stuff) if it sees the environment variable markers
indicating that it's inside a Torque job. If not, it just uses rsh/ssh (or
localhost launch in your case, since you didn't specify any hosts).
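
As a rough illustration of that detection logic (the authoritative list of markers is in Open MPI's TM components; PBS_ENVIRONMENT and PBS_JOBID are simply the usual variables Torque sets inside a job and are assumptions here), a check of this kind might look like:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Outside a Torque job these markers are typically absent, so an
     * rsh/ssh or localhost launch is used instead of TM.  (Assumed
     * variable names, for illustration only.) */
    const char *pbs_env = getenv("PBS_ENVIRONMENT");
    const char *pbs_job = getenv("PBS_JOBID");

    if (pbs_env != NULL || pbs_job != NULL)
        printf("Torque job detected (PBS_ENVIRONMENT=%s, PBS_JOBID=%s)\n",
               pbs_env ? pbs_env : "(unset)", pbs_job ? pbs_job : "(unset)");
    else
        printf("No Torque job markers found; TM launch would not be attempted\n");

    return 0;
}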
If you are unable to run even "mpirun -np 4 hostname" (i.e., the non-MPI
"hostname" command from Linux), then something is seriously borked with your Open MPI
installation.
mpirun -np 4 hostname works fine:
[mboisson@helios-login1 ~]$ which mpirun
/software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
[mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
helios-login1
helios-login1
helios-login1
helios-login1
0
Try running with:
mpirun -np 4 --mca plm_base_verbose 10 hostname
This should show the steps OMPI is trying to take to launch the 4 copies of
"hostname" and potentially give some insight into where it's hanging.
Also, just to make sure, you have ensured that you're compiling everything with
a single compiler toolchain, and the support libraries from that specific
compiler toolchain are available on any server on which you're running (to
include the head node and compute nodes), right?
Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6, with the
same results). Almost all of the software (compilers, toolchains, etc.) is
installed on Lustre, built from source, and is the same on both the login (head)
node and the compute nodes.
The few differences between the head node and the compute nodes:
1) The compute nodes run from a RAM filesystem, while the login node is installed on disk.
2) The compute nodes and the login node have different hardware configurations (the compute
nodes have GPUs, the head node does not).
3) The login node has more CentOS 6 packages than the compute nodes (such as the -devel
packages, some fonts/X11 libraries, etc.), but every package that is on the compute nodes
is also on the login node.
And you've verified that PATH and LD_LIBRARY_PATH are pointing to the right places -- i.e., to the
Open MPI installation that you expect them to point to. E.g., if you "ldd ring_c", it
shows the libmpi.so that you expect. And "which mpiexec" shows the mpirun that you
expect. Etc.
As per the content of "env.out" in the archive, yes. They point to the OMPI
1.8.2rc4 installation directories, on Lustre, and are shared between the head node and
the compute nodes.
Maxime
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics