Hi,
I believe I found what the problem was. My script set CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in $PBS_GPUFILE because of the two nodes, I ended up with
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
instead of
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
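For what it's worth, a small sanity check like the sketch below (illustrative only, not part of my actual job script) makes this kind of misconfiguration easy to spot: it just prints the CUDA_VISIBLE_DEVICES value the process sees together with the device count the runtime reports.

/* cuda_env_check.c -- illustrative sketch; compile with nvcc, or with a
 * plain C compiler plus the CUDA include/lib paths and -lcudart, and run
 * it on each node of the job. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);

    /* A duplicated per-node GPU list shows up directly in this value. */
    printf("CUDA_VISIBLE_DEVICES=%s\n", visible ? visible : "(unset)");
    if (err != cudaSuccess)
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    else
        printf("runtime reports %d device(s)\n", count);
    return 0;
}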
Sorry for the false alarm, and thanks for pointing me toward the solution.
Maxime
On 2014-08-19 09:15, Rolf vandeVaart wrote:
Hi:
This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Could you also run dmesg on gpu-k20-08 and see if there is anything in the log?
Also, does your program run if you just run it on gpu-k20-07?
Can you include the output of nvidia-smi from each node?
Thanks,
Rolf
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Tuesday, August 19, 2014 8:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
I recompiled OMPI 1.8.1 without CUDA and with debugging enabled, but it did not give me much more information.
[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
Internal debug support: yes
Memory debugging support: no
Is there something I need to do at run time to get more information out of it?
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
[gpu-k20-08:46045] *** Process received signal ***
[gpu-k20-08:46045] Signal: Segmentation fault (11)
[gpu-k20-08:46045] Signal code: Address not mapped (1)
[gpu-k20-08:46045] Failing at address: 0x8
[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
[gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
[gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
[gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
[gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
[gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
[gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
[gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
[gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
[gpu-k20-07:61816] Signal: Segmentation fault (11)
[gpu-k20-07:61816] Signal code: Address not mapped (1)
[gpu-k20-07:61816] Failing at address: 0x8
[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
[gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
[gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
[gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
[gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
[gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
[gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
[gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
[gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
[gpu-k20-07:61816] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
[gpu-k20-08:46045] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Thanks,
Maxime
On 2014-08-18 16:45, Rolf vandeVaart wrote:
Just to help reduce the scope of the problem, can you retest with a non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simpler one.
I reduced the code to the minimum that reproduces the bug and pasted it here:
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI, allocates memory with cudaMalloc, then frees the memory and finalizes MPI. Nothing else.
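For reference, the reproducer is essentially of this shape (a paraphrased sketch; the exact code is at the pastebin link above):

/* Sketch of the reproducer described above: initialize MPI, allocate
 * device memory with cudaMalloc, free it, finalize MPI. Error handling
 * is added here for illustration. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0;
    void *d_buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The first CUDA runtime call triggers cuInit(), which is where the
     * stack traces below show the crash. */
    cudaError_t err = cudaMalloc(&d_buf, 1 << 20);   /* 1 MiB */
    if (err != cudaSuccess)
        fprintf(stderr, "rank %d: cudaMalloc failed: %s\n",
                rank, cudaGetErrorString(err));
    else
        cudaFree(d_buf);

    MPI_Finalize();
    return 0;
}

It is compiled with mpicc plus the CUDA include/lib paths and -lcudart (the exact paths depend on the local install).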
When I compile and run this on a single node, everything works fine.
When I compile and run this on more than one node, I get the following stack trace:
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***
The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).
I know this is more likely a problem with CUDA than with OpenMPI (since the same thing happens with two different versions), but I figured I would ask here in case somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for CUDA.
Thanks,
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics