Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-19 Thread Reuti
Hi, On 19.08.2014 at 19:06, Oscar Mojica wrote: > I discovered what the error was. I forgot to include '-fopenmp' when I compiled the objects in the Makefile, so the program worked but it didn't divide the job into threads. Now the program is working and I can use up to 15 cores per mach

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-19 Thread Oscar Mojica
Reuti, I discovered what the error was. I forgot to include '-fopenmp' when I compiled the objects in the Makefile, so the program worked but it didn't divide the job into threads. Now the program is working and I can use up to 15 cores per machine in the queue one.q. Anyway I would like to try im
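
For reference, a minimal hybrid MPI+OpenMP sketch in C (the program from the thread is not shown here, and the file name hybrid.c is only illustrative). The point is that -fopenmp must be passed when the objects are compiled, not only at link time; otherwise the OpenMP pragmas are silently ignored and every rank runs a single thread, which matches the symptom described above. Build with something like: mpicc -fopenmp hybrid.c -o hybrid

    /* hybrid.c - each MPI rank reports how many OpenMP threads it actually got. */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        #pragma omp parallel
        {
            /* If -fopenmp was missing when this file was compiled (even if it
               was present at link time), this region is not parallelized and
               omp_get_num_threads() reports 1. */
            printf("rank %d: thread %d of %d\n",
                   rank, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }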

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Maxime Boissonneault
I am also filing a bug at Adaptive Computing since, while I do set CUDA_VISIBLE_DEVICES myself, the default value set by Torque in that case is also wrong. Maxime On 2014-08-19 10:47, Rolf vandeVaart wrote: Glad it was solved. I will submit a bug at NVIDIA as that does not seem like a ve

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Rolf vandeVaart
Glad it was solved. I will submit a bug at NVIDIA as that does not seem like a very friendly way to handle that error. -----Original Message----- From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Maxime Boissonneault Sent: Tuesday, August 19, 2014 10:39 AM To: Open MPI Users Sub

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Maxime Boissonneault
Hi, I believe I found what the problem was. My script set CUDA_VISIBLE_DEVICES based on the content of $PBS_GPUFILE. Since the GPUs were listed twice in $PBS_GPUFILE because of the two nodes, I had CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7 instead of CUDA_VISIBLE_DEVICES=0,1
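
For anyone debugging a similar setup, a small check like the one below (a sketch, not taken from Maxime's job script; the file name check_devices.cu is hypothetical) makes a malformed CUDA_VISIBLE_DEVICES obvious before any real work is done: it prints the variable as the process actually sees it and the device count reported by the CUDA runtime. Build with e.g. nvcc check_devices.cu -o check_devices.

    /* check_devices.cu - print CUDA_VISIBLE_DEVICES and the runtime's device count. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const char *vis = getenv("CUDA_VISIBLE_DEVICES");
        printf("CUDA_VISIBLE_DEVICES = %s\n", vis ? vis : "(unset)");

        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                    cudaGetErrorString(err));
            return 1;
        }
        printf("CUDA runtime sees %d device(s)\n", count);
        return 0;
    }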

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Rolf vandeVaart
Hi: This problem does not appear to have anything to do with MPI. We are getting a SEGV during the initial call into the CUDA driver. Can you log on to gpu-k20-08, compile your simple program without MPI, and run it there? Also, maybe run dmesg on gpu-k20-08 and see if there is anything in the

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Maxime Boissonneault
Hi, I recompiled OMPI 1.8.1 without CUDA and with debug, but it did not give me much more information.

    [mboisson@gpu-k20-07 simple_cuda_mpi]$ ompi_info | grep debug
    Prefix: /software-gpu/mpi/openmpi/1.8.1-debug_gcc4.8_nocuda
    Internal debug support: yes
    Memory debugging supp

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Maxime Boissonneault
Indeed, there were those two problems. I took the code from here and simplified it: http://cudamusing.blogspot.ca/2011/08/cuda-mpi-and-infiniband.html However, even with the modified code here, http://pastebin.com/ax6g10GZ, the symptoms are still the same. Maxime On 2014-08-19 07:59, Alex A. Gr

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Alex A. Granovsky
Also, you need to check the return code from cudaMalloc before calling cudaFree - the pointer may be invalid since you did not initialize CUDA properly. Alex -----Original Message----- From: Maxime Boissonneault Sent: Tuesday, August 19, 2014 2:19 AM To: Open MPI Users Subject: Re: [OMPI users] Segfa
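
The check Alex describes, as a generic sketch rather than the actual code from the thread (the file name check_malloc.cu is illustrative; build with e.g. nvcc check_malloc.cu -o check_malloc):

    /* check_malloc.cu - never free a device pointer whose allocation was not
       checked; report the CUDA error string instead. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        double *d_buf = NULL;
        cudaError_t err = cudaMalloc((void **)&d_buf, 1024 * sizeof(double));
        if (err != cudaSuccess) {
            /* The pointer is not valid here, so do not pass it to cudaFree. */
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        err = cudaFree(d_buf);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        return 0;
    }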

Re: [OMPI users] Segfault with MPI + Cuda on multiple nodes

2014-08-19 Thread Alex A. Granovsky
Hello, I think your CUDA program may be incorrect. Add a proper cudaSetDevice call at the beginning and check it again. Kind regards, Alex Granovsky -----Original Message----- From: Maxime Boissonneault Sent: Tuesday, August 19, 2014 2:19 AM To: Open MPI Users Subject: Re: [OMPI users] Segfau
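
One common way to place the cudaSetDevice call in an MPI program is to pick the device from the node-local rank. The sketch below is an illustration, not the code from the thread: it assumes one process per GPU, uses the MPI-3 shared-memory communicator to find the local rank, and would be built with the MPI compiler wrapper and linked against the CUDA runtime (-lcudart).

    /* set_device.cu - bind each MPI rank to a GPU based on its node-local rank. */
    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Node-local rank via an MPI-3 shared-memory sub-communicator. */
        MPI_Comm node_comm;
        int local_rank;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);

        int ndev = 0;
        if (cudaGetDeviceCount(&ndev) != cudaSuccess || ndev == 0) {
            fprintf(stderr, "no usable CUDA devices on this node\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        cudaError_t err = cudaSetDevice(local_rank % ndev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ... the rest of the CUDA + MPI program goes here ... */

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }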

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-19 Thread Mike Dubman
So, it seems you have an old OFED without this parameter. Can you install the latest Mellanox OFED, or check which community OFED has it? On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota wrote: > Here is what "modinfo mlx4_core" gives > filename: /lib/modules/3.13.0-34-generic/kernel/drivers/net/ether

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-19 Thread Rio Yokota
Here is what "modinfo mlx4_core" gives:

    filename:    /lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
    version:     2.2-1
    license:     Dual BSD/GPL
    description: Mellanox ConnectX HCA low-level driver
    author:      Roland Dreier
    srcversion:  3AE2