I am also filing a bug with Adaptive Computing since, while I do set CUDA_VISIBLE_DEVICES myself, the default value that Torque sets in that case is also wrong.

Maxime

On 2014-08-19 10:47, Rolf vandeVaart wrote:
Glad it was solved.  I will submit a bug report to NVIDIA, as a segfault does not seem like a very friendly way to handle that error.

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
Sent: Tuesday, August 19, 2014 10:39 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
I believe I found what the problem was. My script was setting CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in $PBS_GPUFILE because of the two nodes, I had
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7

instead of
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
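
For illustration only, here is a minimal sketch in C of the kind of per-node filtering that avoids the duplicated list. It is not the actual job-script fix, and it assumes the usual Torque format of one "<hostname>-gpu<N>" entry per line in $PBS_GPUFILE:

/* dedup_gpufile.c -- hypothetical helper, not the actual job-script fix.
 * Prints a CUDA_VISIBLE_DEVICES value containing only this node's GPU
 * indices, each listed exactly once, based on $PBS_GPUFILE.
 * Assumes the usual Torque format: one "<hostname>-gpu<N>" entry per line,
 * covering every node in the job.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path, *tag, *p;
    char host[256], line[512];
    int seen[64] = {0};             /* GPU indices that belong to this node */
    int i, idx, first;
    FILE *f;

    path = getenv("PBS_GPUFILE");
    if (path == NULL) { fprintf(stderr, "PBS_GPUFILE is not set\n"); return 1; }
    if (gethostname(host, sizeof(host)) != 0) { perror("gethostname"); return 1; }

    f = fopen(path, "r");
    if (f == NULL) { perror("fopen PBS_GPUFILE"); return 1; }

    while (fgets(line, sizeof(line), f) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';
        if (strncmp(line, host, strlen(host)) != 0)   /* entry for another node */
            continue;
        tag = NULL;                                   /* find the last "-gpu" */
        for (p = line; (p = strstr(p, "-gpu")) != NULL; p += 4)
            tag = p;
        if (tag == NULL)
            continue;
        idx = atoi(tag + 4);
        if (idx >= 0 && idx < 64)
            seen[idx] = 1;                            /* duplicates collapse here */
    }
    fclose(f);

    printf("CUDA_VISIBLE_DEVICES=");
    for (i = 0, first = 1; i < 64; i++) {
        if (seen[i]) {
            if (!first) putchar(',');
            printf("%d", i);
            first = 0;
        }
    }
    putchar('\n');
    return 0;
}

The job script can then export the printed value on each node instead of concatenating every entry in the file.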

Sorry for the false bug and thanks for directing me toward the solution.

Maxime


On 2014-08-19 09:15, Rolf vandeVaart wrote:
Hi:
This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver.  Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there?  Could you also run dmesg on gpu-k20-08 and see if there is anything in the log?  Finally, does your program run if you just run it on gpu-k20-07 alone?

Can you include the output from nvidia-smi on each node?
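
For the non-MPI test, something along these lines (just a sketch, not your exact code) would exercise the same initialization path that is failing in cuInit:

/* cuda_only_check.c -- CUDA-only sanity check, no MPI involved.
 * Build, for example, with: nvcc cuda_only_check.c -o cuda_only_check
 */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err__));                       \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main(void)
{
    const char *vis = getenv("CUDA_VISIBLE_DEVICES");
    int count = 0;
    void *buf = NULL;

    printf("CUDA_VISIBLE_DEVICES=%s\n", vis ? vis : "(unset)");

    CHECK(cudaGetDeviceCount(&count));   /* first call into the driver (cuInit) */
    printf("visible devices: %d\n", count);

    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc(&buf, 1 << 20));    /* 1 MB allocation on device 0 */
    CHECK(cudaFree(buf));
    printf("CUDA initialization and allocation OK\n");
    return 0;
}

If this also segfaults on gpu-k20-08, that would confirm the problem is below MPI.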

Thanks,
Rolf

-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
Sent: Tuesday, August 19, 2014 8:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
I recompiled OMPI 1.8.1 without CUDA and with debug enabled, but it did not give me much more information.
[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
                    Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
    Internal debug support: yes
Memory debugging support: no


Is there something I need to do at run time to get more information out of it?


[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
[gpu-k20-08:46045] *** Process received signal ***
[gpu-k20-08:46045] Signal: Segmentation fault (11)
[gpu-k20-08:46045] Signal code: Address not mapped (1)
[gpu-k20-08:46045] Failing at address: 0x8
[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
[gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
[gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
[gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
[gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
[gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
[gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
[gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
[gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
[gpu-k20-07:61816] Signal: Segmentation fault (11)
[gpu-k20-07:61816] Signal code: Address not mapped (1)
[gpu-k20-07:61816] Failing at address: 0x8
[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
[gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
[gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
[gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
[gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
[gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
[gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
[gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
[gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
[gpu-k20-07:61816] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
[gpu-k20-08:46045] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


Thanks,

Maxime


On 2014-08-18 16:45, Rolf vandeVaart wrote:
Just to help reduce the scope of the problem, can you retest with a non-CUDA-aware Open MPI 1.8.1?  And if possible, use --enable-debug in the configure line to help with the stack trace?
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes

Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simpler one.

I reduced the code to the minimum that reproduces the bug. I have pasted it here:
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI, cudaMallocs some memory, then frees the memory and finalizes MPI. Nothing else.
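
In case the pastebin link ever goes away, here is a sketch of what the program does (not necessarily byte-for-byte the pasted code):

/* cudampi_simple.c -- sketch of the reproducer described above; the
 * pastebin link has the exact code.  Build, for example, with:
 *   mpicc cudampi_simple.c -o cudampi_simple \
 *         -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcudart
 * (CUDA_HOME is just a placeholder for the CUDA installation path.)
 */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0;
    void *dev_buf = NULL;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* first CUDA runtime call: this is what ends up in cuInit() */
    err = cudaMalloc(&dev_buf, 1 << 20);
    printf("rank %d: cudaMalloc -> %s\n", rank, cudaGetErrorString(err));

    if (err == cudaSuccess)
        cudaFree(dev_buf);

    MPI_Finalize();
    return 0;
}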

When I compile and run this on a single node, everything works fine.

When I compile and run this on more than one node, I get the following stack trace:
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***


The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).

I know this is more likely a problem with CUDA than with OpenMPI (since it does the same for two different versions), but I figured I would ask here in case somebody has a clue about what might be going on. I have yet to be able to file a bug report on NVIDIA's website for CUDA.


Thanks,


--
---------------------------------
Maxime Boissonneault
Computational Analyst - Calcul Québec, Université Laval
Ph.D. in Physics



--
---------------------------------
Maxime Boissonneault
Computational Analyst - Calcul Québec, Université Laval
Ph.D. in Physics


--
---------------------------------
Maxime Boissonneault
Computational Analyst - Calcul Québec, Université Laval, Ph.D. in Physics



--
---------------------------------
Maxime Boissonneault
Computational Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
