Glad it was solved. I will submit a bug report to NVIDIA, as that does not seem like a very friendly way to handle that error.
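For anyone running into the same symptom, a quick way to catch this class of misconfiguration is to have every rank report the environment it actually sees before touching CUDA. The following is only a sketch (it is not part of the reproducer in this thread, and the details are illustrative): each rank prints its hostname and CUDA_VISIBLE_DEVICES, so a duplicated device list like the one described in the quoted message below stands out immediately in the job output.

/* Hypothetical diagnostic, not from the original thread: each rank reports its
 * hostname and the CUDA_VISIBLE_DEVICES it inherits, before any CUDA call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank = 0;
    int len = 0;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Print the device list this rank actually sees. */
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("rank %d on %s: CUDA_VISIBLE_DEVICES=%s\n",
           rank, host, visible ? visible : "(unset)");

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched the same way as the reproducer (mpiexec -np 2 --map-by ppr:1:node), each node would print the list it inherited from the job script.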
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
> Sent: Tuesday, August 19, 2014 10:39 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
> Hi,
> I believe I found what the problem was. My script set CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had
> CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
> instead of
> CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
>
> Sorry for the false bug, and thanks for directing me toward the solution.
>
> Maxime
>
> On 2014-08-19 09:15, Rolf vandeVaart wrote:
>> Hi:
>> This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the log?
>>
>> Also, does your program run if you just run it on gpu-k20-07?
>>
>> Can you include the output from nvidia-smi on each node?
>>
>> Thanks,
>> Rolf
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>>> Sent: Tuesday, August 19, 2014 8:55 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information.
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>>>   Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>>>   Internal debug support: yes
>>>   Memory debugging support: no
>>>
>>> Is there something I need to do at run time to get more information out of it?
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
>>> [gpu-k20-08:46045] *** Process received signal ***
>>> [gpu-k20-08:46045] Signal: Segmentation fault (11)
>>> [gpu-k20-08:46045] Signal code: Address not mapped (1)
>>> [gpu-k20-08:46045] Failing at address: 0x8
>>> [gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>>> [gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>>> [gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>>> [gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>>> [gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>>> [gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
>>> [gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
>>> [gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
>>> [gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
>>> [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>>> [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
>>> [gpu-k20-07:61816] Signal: Segmentation fault (11)
>>> [gpu-k20-07:61816] Signal code: Address not mapped (1)
>>> [gpu-k20-07:61816] Failing at address: 0x8
>>> [gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>>> [gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>>> [gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>>> [gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>>> [gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>>> [gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
>>> [gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
>>> [gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
>>> [gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
>>> [gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
>>> [gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
>>> [gpu-k20-07:61816] [11] cudampi_simple[0x400699]
>>> [gpu-k20-07:61816] *** End of error message ***
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
>>> [gpu-k20-08:46045] [11] cudampi_simple[0x400699]
>>> [gpu-k20-08:46045] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Thanks,
>>>
>>> Maxime
>>>
>>> On 2014-08-18 16:45, Rolf vandeVaart wrote:
>>>> Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>>>>> Sent: Monday, August 18, 2014 4:23 PM
>>>>> To: Open MPI Users
>>>>> Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>>>
>>>>> Hi,
>>>>> Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one.
>>>>>
>>>>> I reduced the code to the minimal that would reproduce the bug. I have pasted it here:
>>>>> http://pastebin.com/1uAK4Z8R
>>>>> Basically, it is a program that initializes MPI and cudaMallocs memory, then frees the memory and finalizes MPI. Nothing else.
>>>>>
>>>>> When I compile and run this on a single node, everything works fine.
>>>>>
>>>>> When I compile and run this on more than one node, I get the following stack trace:
>>>>> [gpu-k20-07:40041] *** Process received signal ***
>>>>> [gpu-k20-07:40041] Signal: Segmentation fault (11)
>>>>> [gpu-k20-07:40041] Signal code: Address not mapped (1)
>>>>> [gpu-k20-07:40041] Failing at address: 0x8
>>>>> [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>>>>> [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>>>>> [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>>>>> [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>>>>> [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>>>>> [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>>>>> [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>>>>> [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>>>>> [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>>>>> [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
>>>>> [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>>>>> [gpu-k20-07:40041] [11] cudampi_simple[0x400699]
>>>>> [gpu-k20-07:40041] *** End of error message ***
>>>>>
>>>>> The stack trace is the same whether I use OpenMPI 1.6.5 (not cuda aware) or OpenMPI 1.8.1 (cuda aware).
>>>>>
>>>>> I know this is more than likely a problem with Cuda rather than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for Cuda.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>> Ph. D. en physique
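The quoted message above points to the full reproducer on pastebin. What follows is only a sketch of what such a minimal MPI + CUDA program might look like, reconstructed from the description rather than copied from the pastebin (names and sizes are illustrative), with return codes checked so CUDA API errors are reported explicitly.

/* Sketch of a minimal MPI + CUDA reproducer, reconstructed from the thread's
 * description: initialize MPI, cudaMalloc a buffer, free it, finalize MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0;
    void *devbuf = NULL;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1 MiB device allocation; the size is arbitrary for the test. */
    err = cudaMalloc(&devbuf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc failed: %s\n",
                rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    err = cudaFree(devbuf);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaFree failed: %s\n",
                rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}

Note that, per the stack traces in this thread, the crash happened inside cuInit() during the lazy driver initialization triggered by the first CUDA runtime call, so return-code checks would not necessarily have prevented the segfault; they do, however, turn ordinary API failures into readable error messages.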
>
> --
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25076.php