Glad it was solved. I will submit a bug report to NVIDIA, as that does not seem like a very friendly way to handle that error.
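For anyone running into the same symptom, a quick way to catch this class of misconfiguration is to have every rank report the environment it actually sees before touching CUDA. The following is only a sketch (it is not part of the reproducer in this thread, and the details are illustrative): each rank prints its hostname and CUDA_VISIBLE_DEVICES, so a duplicated device list like the one described in the quoted message below stands out immediately in the job output.

/* Hypothetical diagnostic, not from the original thread: each rank reports its
 * hostname and the CUDA_VISIBLE_DEVICES it inherits, before any CUDA call. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank = 0;
    int len = 0;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    /* Print the device list this rank actually sees. */
    const char *visible = getenv("CUDA_VISIBLE_DEVICES");
    printf("rank %d on %s: CUDA_VISIBLE_DEVICES=%s\n",
           rank, host, visible ? visible : "(unset)");

    MPI_Finalize();
    return 0;
}

Built with mpicc and launched the same way as the reproducer (mpiexec -np 2 --map-by ppr:1:node), each node would print the list it inherited from the job script.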
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
> Sent: Tuesday, August 19, 2014 10:39 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>
> Hi,
> I believe I found what the problem was. My script set CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in the $PBS_GPUFILE because of the two nodes, I had
> CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
> instead of
> CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
>
> Sorry for the false bug, and thanks for directing me toward the solution.
>
> Maxime
>
> On 2014-08-19 09:15, Rolf vandeVaart wrote:
>> Hi:
>> This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the log?
>>
>> Also, does your program run if you just run it on gpu-k20-07?
>>
>> Can you include the output from nvidia-smi on each node?
>>
>> Thanks,
>> Rolf
>>
>>> -----Original Message-----
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>>> Sent: Tuesday, August 19, 2014 8:55 AM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>
>>> Hi,
>>> I recompiled OMPI 1.8.1 without Cuda and with debug, but it did not give me much more information.
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
>>>   Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
>>>   Internal debug support: yes
>>>   Memory debugging support: no
>>>
>>> Is there something I need to do at run time to get more information out of it?
>>> [mboisson@gpu-k20-07 simple_cuda_mpi]$ mpiexec -np 2 --map-by ppr:1:node cudampi_simple
>>> [gpu-k20-08:46045] *** Process received signal ***
>>> [gpu-k20-08:46045] Signal: Segmentation fault (11)
>>> [gpu-k20-08:46045] Signal code: Address not mapped (1)
>>> [gpu-k20-08:46045] Failing at address: 0x8
>>> [gpu-k20-08:46045] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2b9ca76d9710]
>>> [gpu-k20-08:46045] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2b9cade98acf]
>>> [gpu-k20-08:46045] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2b9cade5ea83]
>>> [gpu-k20-08:46045] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2b9cadd902da]
>>> [gpu-k20-08:46045] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2b9cadd7c933]
>>> [gpu-k20-08:46045] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2b9ca6df4965]
>>> [gpu-k20-08:46045] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2b9ca6df4a0a]
>>> [gpu-k20-08:46045] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2b9ca6df4a3b]
>>> [gpu-k20-08:46045] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2b9ca6e0f647]
>>> [gpu-k20-08:46045] [ 9] cudampi_simple[0x4007ae]
>>> [gpu-k20-08:46045] [10] [gpu-k20-07:61816] *** Process received signal ***
>>> [gpu-k20-07:61816] Signal: Segmentation fault (11)
>>> [gpu-k20-07:61816] Signal code: Address not mapped (1)
>>> [gpu-k20-07:61816] Failing at address: 0x8
>>> [gpu-k20-07:61816] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2aef36e9b710]
>>> [gpu-k20-07:61816] [ 1] /usr/lib64/nvidia/libcuda.so(+0x263acf)[0x2aef3d65aacf]
>>> [gpu-k20-07:61816] [ 2] /usr/lib64/nvidia/libcuda.so(+0x229a83)[0x2aef3d620a83]
>>> [gpu-k20-07:61816] [ 3] /usr/lib64/nvidia/libcuda.so(+0x15b2da)[0x2aef3d5522da]
>>> [gpu-k20-07:61816] [ 4] /usr/lib64/nvidia/libcuda.so(cuInit+0x43)[0x2aef3d53e933]
>>> [gpu-k20-07:61816] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2aef365b6965]
>>> [gpu-k20-07:61816] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2aef365b6a0a]
>>> [gpu-k20-07:61816] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2aef365b6a3b]
>>> [gpu-k20-07:61816] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaSetDevice+0x47)[0x2aef365d1647]
>>> [gpu-k20-07:61816] [ 9] cudampi_simple[0x4007ae]
>>> [gpu-k20-07:61816] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2aef370c7d1d]
>>> [gpu-k20-07:61816] [11] cudampi_simple[0x400699]
>>> [gpu-k20-07:61816] *** End of error message ***
>>> /lib64/libc.so.6(__libc_start_main+0xfd)[0x2b9ca7905d1d]
>>> [gpu-k20-08:46045] [11] cudampi_simple[0x400699]
>>> [gpu-k20-08:46045] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> mpiexec noticed that process rank 1 with PID 46045 on node gpu-k20-08 exited on signal 11 (Segmentation fault).
>>> --------------------------------------------------------------------------
>>>
>>> Thanks,
>>>
>>> Maxime
>>>
>>> On 2014-08-18 16:45, Rolf vandeVaart wrote:
>>>> Just to help reduce the scope of the problem, can you retest with a non CUDA-aware Open MPI 1.8.1? And if possible, use --enable-debug in the configure line to help with the stack trace?
>>>>> -----Original Message-----
>>>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault
>>>>> Sent: Monday, August 18, 2014 4:23 PM
>>>>> To: Open MPI Users
>>>>> Subject: [OMPI users] Segfault with MPI + Cuda on multiple nodes
>>>>>
>>>>> Hi,
>>>>> Since my previous thread (Segmentation fault in OpenMPI 1.8.1) kind of derailed into two problems, one of which has been addressed, I figured I would start a new, more precise and simple one.
>>>>>
>>>>> I reduced the code to the minimal that would reproduce the bug. I have pasted it here:
>>>>> http://pastebin.com/1uAK4Z8R
>>>>> Basically, it is a program that initializes MPI and cudaMallocs memory, then frees the memory and finalizes MPI. Nothing else.
>>>>>
>>>>> When I compile and run this on a single node, everything works fine.
>>>>>
>>>>> When I compile and run this on more than one node, I get the following stack trace:
>>>>> [gpu-k20-07:40041] *** Process received signal ***
>>>>> [gpu-k20-07:40041] Signal: Segmentation fault (11)
>>>>> [gpu-k20-07:40041] Signal code: Address not mapped (1)
>>>>> [gpu-k20-07:40041] Failing at address: 0x8
>>>>> [gpu-k20-07:40041] [ 0] /lib64/libpthread.so.0(+0xf710)[0x2ac85a5bb710]
>>>>> [gpu-k20-07:40041] [ 1] /usr/lib64/nvidia/libcuda.so.1(+0x263acf)[0x2ac85fc0aacf]
>>>>> [gpu-k20-07:40041] [ 2] /usr/lib64/nvidia/libcuda.so.1(+0x229a83)[0x2ac85fbd0a83]
>>>>> [gpu-k20-07:40041] [ 3] /usr/lib64/nvidia/libcuda.so.1(+0x15b2da)[0x2ac85fb022da]
>>>>> [gpu-k20-07:40041] [ 4] /usr/lib64/nvidia/libcuda.so.1(cuInit+0x43)[0x2ac85faee933]
>>>>> [gpu-k20-07:40041] [ 5] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15965)[0x2ac859e3a965]
>>>>> [gpu-k20-07:40041] [ 6] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a0a)[0x2ac859e3aa0a]
>>>>> [gpu-k20-07:40041] [ 7] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(+0x15a3b)[0x2ac859e3aa3b]
>>>>> [gpu-k20-07:40041] [ 8] /software-gpu/cuda/6.0.37/lib64/libcudart.so.6.0(cudaMalloc+0x52)[0x2ac859e5b532]
>>>>> [gpu-k20-07:40041] [ 9] cudampi_simple[0x4007b5]
>>>>> [gpu-k20-07:40041] [10] /lib64/libc.so.6(__libc_start_main+0xfd)[0x2ac85a7e7d1d]
>>>>> [gpu-k20-07:40041] [11] cudampi_simple[0x400699]
>>>>> [gpu-k20-07:40041] *** End of error message ***
>>>>>
>>>>> The stack trace is the same whether I use OpenMPI 1.6.5 (not cuda aware) or OpenMPI 1.8.1 (cuda aware).
>>>>>
>>>>> I know this is more than likely a problem with Cuda rather than with OpenMPI (since it does the same for two different versions), but I figured I would ask here if somebody has a clue of what might be going on. I have yet to be able to file a bug report on NVIDIA's website for Cuda.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> ---------------------------------
>>>>> Maxime Boissonneault
>>>>> Analyste de calcul - Calcul Québec, Université Laval
>>>>> Ph. D. en physique
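The quoted message above points to the full reproducer on pastebin. What follows is only a sketch of what such a minimal MPI + CUDA program might look like, reconstructed from the description rather than copied from the pastebin (names and sizes are illustrative), with return codes checked so CUDA API errors are reported explicitly.

/* Sketch of a minimal MPI + CUDA reproducer, reconstructed from the thread's
 * description: initialize MPI, cudaMalloc a buffer, free it, finalize MPI. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = 0;
    void *devbuf = NULL;
    cudaError_t err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1 MiB device allocation; the size is arbitrary for the test. */
    err = cudaMalloc(&devbuf, 1 << 20);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaMalloc failed: %s\n",
                rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    err = cudaFree(devbuf);
    if (err != cudaSuccess) {
        fprintf(stderr, "rank %d: cudaFree failed: %s\n",
                rank, cudaGetErrorString(err));
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();
    return 0;
}

Note that, per the stack traces in this thread, the crash happened inside cuInit() during the lazy driver initialization triggered by the first CUDA runtime call, so return-code checks would not necessarily have prevented the segfault; they do, however, turn ordinary API failures into readable error messages.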
>
> --
> ---------------------------------
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/08/25076.php