Maxime,

Can you run with:

    mpirun -np 4 --mca plm_base_verbose 10 /path/to/examples/ring_c

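If that also comes back with no output, it may be worth ruling out the Torque (TM) launcher on the login node at the same time. A minimal check along these lines (PBS_JOBID / PBS_ENVIRONMENT are the usual Torque job markers, and "--mca plm rsh" forces the plain ssh/rsh launcher; adjust the path as needed):

    env | grep '^PBS_'                    # should print nothing outside a Torque job
    mpirun -np 4 --mca plm rsh hostname   # bypass any resource-manager launcher
    mpirun -np 4 --mca plm rsh /path/to/examples/ring_c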

On Mon, Aug 18, 2014 at 12:21 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:

> Hi,
> I just compiled without CUDA, and the result is the same: no output, and
> it exits with code 65.
>
> [mboisson@helios-login1 examples]$ ldd ring_c
>         linux-vdso.so.1 =>  (0x00007fff3ab31000)
>         libmpi.so.1 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libmpi.so.1 (0x00007fab9ec2a000)
>         libpthread.so.0 => /lib64/libpthread.so.0 (0x000000381c000000)
>         libc.so.6 => /lib64/libc.so.6 (0x000000381bc00000)
>         librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x000000381c800000)
>         libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x000000381c400000)
>         libopen-rte.so.7 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-rte.so.7 (0x00007fab9e932000)
>         libtorque.so.2 => /usr/lib64/libtorque.so.2 (0x0000003918200000)
>         libxml2.so.2 => /usr/lib64/libxml2.so.2 (0x0000003917e00000)
>         libz.so.1 => /lib64/libz.so.1 (0x000000381cc00000)
>         libcrypto.so.10 => /usr/lib64/libcrypto.so.10 (0x0000003821000000)
>         libssl.so.10 => /usr/lib64/libssl.so.10 (0x0000003823000000)
>         libopen-pal.so.6 => /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_nocuda/lib/libopen-pal.so.6 (0x00007fab9e64a000)
>         libdl.so.2 => /lib64/libdl.so.2 (0x000000381b800000)
>         librt.so.1 => /lib64/librt.so.1 (0x00000035b3600000)
>         libm.so.6 => /lib64/libm.so.6 (0x0000003c25a00000)
>         libutil.so.1 => /lib64/libutil.so.1 (0x0000003f71000000)
>         /lib64/ld-linux-x86-64.so.2 (0x000000381b400000)
>         libstdc++.so.6 => /usr/lib64/libstdc++.so.6 (0x0000003917a00000)
>         libgcc_s.so.1 => /software6/compilers/gcc/4.8/lib64/libgcc_s.so.1 (0x00007fab9e433000)
>         libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x0000003822400000)
>         libkrb5.so.3 => /lib64/libkrb5.so.3 (0x0000003821400000)
>         libcom_err.so.2 => /lib64/libcom_err.so.2 (0x000000381e400000)
>         libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x0000003821800000)
>         libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x0000003821c00000)
>         libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x0000003822000000)
>         libresolv.so.2 => /lib64/libresolv.so.2 (0x000000381dc00000)
>         libselinux.so.1 => /lib64/libselinux.so.1 (0x000000381d000000)
>
> [mboisson@helios-login1 examples]$ mpiexec ring_c
> [mboisson@helios-login1 examples]$ echo $?
> 65
>
>
> Maxime
>
>
> On 2014-08-16 06:22, Jeff Squyres (jsquyres) wrote:
>
>> Just out of curiosity, I saw that one of the segv stack traces involved
>> the cuda stack.
>>
>> Can you try a build without CUDA and see if that resolves the problem?
>>
>>
>>
>> On Aug 15, 2014, at 6:47 PM, Maxime Boissonneault <maxime.boissonneault@calculquebec.ca> wrote:
>>
>>> Hi Jeff,
>>>
>>> On 2014-08-15 17:50, Jeff Squyres (jsquyres) wrote:
>>>
>>>> On Aug 15, 2014, at 5:39 PM, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote:
>>>>
>>>>> Correct.
>>>>>
>>>>> Could it be because Torque (pbs_mom) is not running on the head node and
>>>>> mpiexec attempts to contact it?
>>>>>
>>>> Not for Open MPI's mpiexec, no.
>>>>
>>>> Open MPI's mpiexec (mpirun -- they're the same to us) will only try to
>>>> use TM stuff (i.e., Torque stuff) if it sees the environment variable
>>>> markers indicating that it's inside a Torque job.  If not, it just uses
>>>> rsh/ssh (or localhost launch in your case, since you didn't specify any
>>>> hosts).
>>>>
>>>> If you are unable to run even "mpirun -np 4 hostname" (i.e., the
>>>> non-MPI "hostname" command from Linux), then something is seriously borked
>>>> with your Open MPI installation.
>>>>
>>> mpirun -np 4 hostname works fine:
>>> [mboisson@helios-login1 ~]$ which mpirun
>>> /software-gpu/mpi/openmpi/1.8.2rc4_gcc4.8_cuda6.0.37/bin/mpirun
>>> [mboisson@helios-login1 examples]$ mpirun -np 4 hostname; echo $?
>>> helios-login1
>>> helios-login1
>>> helios-login1
>>> helios-login1
>>> 0
>>>
>>>> Try running with:
>>>>
>>>>      mpirun -np 4 --mca plm_base_verbose 10 hostname
>>>>
>>>> This should show the steps OMPI is trying to take to launch the 4
>>>> copies of "hostname" and potentially give some insight into where it's
>>>> hanging.
>>>>
>>>> Also, just to make sure, you have ensured that you're compiling
>>>> everything with a single compiler toolchain, and the support libraries from
>>>> that specific compiler toolchain are available on any server on which
>>>> you're running (to include the head node and compute nodes), right?
>>>>
>>> Yes. Everything has been compiled with GCC 4.8 (I also tried GCC 4.6,
>>> with the same results). Almost all software (compilers, toolchains, etc.)
>>> is installed on Lustre, built from source, and is the same on both the
>>> login (head) node and the compute nodes.
>>>
>>> The few differences between the head node and the compute nodes:
>>> 1) Compute nodes run from RAMFS; the login node is installed on disk.
>>> 2) Compute nodes and the login node have different hardware configurations
>>> (compute nodes have GPUs, the head node does not).
>>> 3) The login node has MORE CentOS 6 packages than the compute nodes (such
>>> as the -devel packages, some fonts/X11 libraries, etc.), but all the
>>> packages that are on the compute nodes are also on the login node.
>>>
>>>> And you've verified that PATH and LD_LIBRARY_PATH are pointing to the
>>>> right places -- i.e., to the Open MPI installation that you expect it to
>>>> point to.  E.g., if you "ldd ring_c", it shows the libmpi.so that you
>>>> expect.  And "which mpiexec" shows the mpirun that you expect.  Etc.
>>>>
>>> As per the content of "env.out" in the archive, yes. They point to the
>>> OMPI 1.8.2rc4 installation directories, on Lustre, and are shared between
>>> the head node and the compute nodes.
>>>
>>>
>>> Maxime
>>>
>>
>>
>
> --
> ---------------------------------
> Maxime Boissonneault
> Computational Analyst - Calcul Québec, Université Laval
> Ph.D. in Physics
>
>
