Hi,
I believe I found what the problem was. My script set CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in $PBS_GPUFILE because of the two nodes, I ended up with
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
instead of
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
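For what it's worth, a small sanity check like the sketch below (illustrative only, not part of my actual job script) makes this kind of misconfiguration easy to spot: it just prints the CUDA_VISIBLE_DEVICES value the process sees together with the device count the runtime reports.

/* cuda_env_check.c -- illustrative sketch; compile with nvcc, or with a
 * plain C compiler plus the CUDA include/lib paths and -lcudart, and run
 * it on each node of the job. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);

    /* A duplicated per-node GPU list shows up directly in this value. */
    printf("CUDA_VISIBLE_DEVICES=%s\n", visible ? visible : "(unset)");
    if (err != cudaSuccess)
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    else
        printf("runtime reports %d device(s)\n", count);
    return 0;
}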
Sorry for the false alarm, and thanks for pointing me toward the solution.
Maxime
On 2014-08-19 09:15, Rolf vandeVaart wrote:
Hi:
This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Could you also run dmesg on gpu-k20-08 and see if there is anything in the log?
Also, does your program run if you just run it on gpu-k20-07?
Can you include the output of nvidia-smi from each node?
Thanks,
Rolf
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Tuesday, August 19, 2014 8:55 AM
To: Open MPI Users
Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
I recompiled OMPI 1.8.1 without CUDA and with debugging enabled, but it did not give me much more information.
[mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
Internal debug support: yes
Memory debugging support: no
Is there something I need to do at run time to get more information out of it?
[mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
[gpu-k20-08:46045] *** Process received signal ***
[gpu-k20-08:46045] Signal: Segmentation fault (11)
[gpu-k20-08:46045] Signal code: Address not mapped (1)
[gpu-k20-08:46045] Failing at address: 0x8
[gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
[gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
[gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
[gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
[gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
[gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
[gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
[gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
[gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
[gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
[gpu-k20-07:61816] Signal: Segmentation fault (11)
[gpu-k20-07:61816] Signal code: Address not mapped (1)
[gpu-k20-07:61816] Failing at address: 0x8
[gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
[gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
[gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
[gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
[gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
[gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
[gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
[gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
[gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
[gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
[gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
[gpu-k20-07:61816] [11] cudampi_simple[0x400699]
[gpu-k20-07:61816] *** End of error message ***
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
[gpu-k20-08:46045] [11] cudampi_simple[0x400699]
[gpu-k20-08:46045] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
Thanks,
Maxime
On 2014-08-18 16:45, Rolf vandeVaart wrote:
Just to help reduce the scope of the problem, can you retest with a non-CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
-----Original Message-----
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime
Boissonneault
Sent: Monday, August 18, 2014 4:23 PM
To: Open MPI Users
Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
Hi,
Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simpler one.
I reduced the code to the minimum that reproduces the bug and pasted it here:
http://pastebin.com/1uAK4Z8R
Basically, it is a program that initializes MPI, allocates memory with cudaMalloc, then frees the memory and finalizes MPI. Nothing else.
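For reference, the reproducer is essentially of this shape (a paraphrased sketch; the exact code is at the pastebin link above):

/* Sketch of the reproducer described above: initialize MPI, allocate
 * device memory with cudaMalloc, free it, finalize MPI. Error handling
 * is added here for illustration. */
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank = 0;
    void *d_buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The first CUDA runtime call triggers cuInit(), which is where the
     * stack traces below show the crash. */
    cudaError_t err = cudaMalloc(&d_buf, 1 << 20);   /* 1 MiB */
    if (err != cudaSuccess)
        fprintf(stderr, "rank %d: cudaMalloc failed: %s\n",
                rank, cudaGetErrorString(err));
    else
        cudaFree(d_buf);

    MPI_Finalize();
    return 0;
}

It is compiled with mpicc plus the CUDA include/lib paths and -lcudart (the exact paths depend on the local install).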
When I compile and run this on a single node, everything works fine.
When I compile and run this on more than one node, I get the following stack trace:
[gpu-k20-07:40041] *** Process received signal ***
[gpu-k20-07:40041] Signal: Segmentation fault (11)
[gpu-k20-07:40041] Signal code: Address not mapped (1)
[gpu-k20-07:40041] Failing at address: 0x8
[gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
[gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
[gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
[gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
[gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
[gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
[gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
[gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
[gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
[gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
[gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
[gpu-k20-07:40041] [11] cudampi_simple[0x400699]
[gpu-k20-07:40041] *** End of error message ***
The stack trace is the same whether I use OpenMPI 1.6.5 (not CUDA-aware) or OpenMPI 1.8.1 (CUDA-aware).
I know this is more likely a problem with CUDA than with OpenMPI (since the same thing happens with two different versions), but I figured I would ask here in case somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for CUDA.
Thanks,
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics
--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics