Re: [OMPI users] Communicator Split Type NUMA Behavior

2019-11-27 Thread Brice Goglin via users
The attached patch (against 4.0.2) should fix it; I'll prepare a PR to fix this upstream. Brice. On 27/11/2019 at 00:41, Brice Goglin via users wrote: > It looks like NUMA is broken, while others such as SOCKET and L3CACHE work fine. A quick look in opal_hwloc_base_get_relative_locality() and ...
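
For reference, a minimal sketch of the kind of call this thread is about (a hypothetical example, not code from the report; OMPI_COMM_TYPE_NUMA is the Open MPI-specific split type whose locality computation the patch touches):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        MPI_Comm numa_comm;
        /* Group ranks that share a NUMA node into one communicator.
         * OMPI_COMM_TYPE_NUMA is an Open MPI extension; the MPI standard
         * only mandates MPI_COMM_TYPE_SHARED. */
        MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_NUMA, 0,
                            MPI_INFO_NULL, &numa_comm);
        int numa_rank;
        MPI_Comm_rank(numa_comm, &numa_rank);
        printf("rank %d within its NUMA-local communicator\n", numa_rank);
        MPI_Comm_free(&numa_comm);
        MPI_Finalize();
        return 0;
    }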

[OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
Hi, suppose I have this piece of code and I use CUDA-aware MPI: cudaMalloc(&sbuf,sz); Kernel1<<<...,stream>>>(...,sbuf); MPI_Isend(sbuf,...); Kernel2<<<...,stream>>>(); Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make sure the data in sbuf is ready ...
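
Reconstructed as a minimal, self-contained sketch of the scenario (the kernels, message size, and ranks are placeholders; a CUDA-aware Open MPI build is assumed so MPI can read the device buffer directly):

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void Kernel1(double *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = (double)i;          /* produces the data in sbuf */
    }
    __global__ void Kernel2(void) { }           /* unrelated follow-up work */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        const int n = 256;
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        if (rank == 0) {
            double *sbuf;
            cudaMalloc(&sbuf, n * sizeof(double));
            Kernel1<<<1, n, 0, stream>>>(sbuf, n);
            /* The question: is cudaStreamSynchronize(stream) required here,
             * before the (CUDA-aware) MPI_Isend reads sbuf? */
            MPI_Request req;
            MPI_Isend(sbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            Kernel2<<<1, 1, 0, stream>>>();
            MPI_Wait(&req, MPI_STATUS_IGNORE);
            cudaFree(sbuf);
        } else if (rank == 1) {
            double *rbuf;
            cudaMalloc(&rbuf, n * sizeof(double));
            MPI_Recv(rbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            cudaFree(rbuf);
        }
        cudaStreamDestroy(stream);
        MPI_Finalize();
        return 0;
    }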

Re: [OMPI users] CUDA mpi question

2019-11-27 Thread George Bosilca via users
Short and portable answer: you need to sync before the Isend or you will send garbage data. If you are willing to go for a less portable solution, you can get the OMPI streams and add your kernels inside, so that the sequential order will guarantee correctness of your Isend. We have 2 hidden ...
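
In terms of the sketch above, the portable pattern described here is simply to synchronize the stream between the producing kernel and the Isend (grid, block, dest, tag, and req as in the surrounding code):

    Kernel1<<<grid, block, 0, stream>>>(sbuf, n);
    cudaStreamSynchronize(stream);   /* sbuf is now fully written on the device */
    MPI_Isend(sbuf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
    Kernel2<<<grid, block, 0, stream>>>();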

Re: [OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote: > Short and portable answer: you need to sync before the Isend or you will send garbage data. Ideally, I want to formulate my code into a series of asynchronous "kernel launch, kernel launch, ..." without synchron...

Re: [OMPI users] CUDA mpi question

2019-11-27 Thread George Bosilca via users
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao wrote: > On Wed, Nov 27, 2019 at 3:16 PM George Bosilca wrote: >> Short and portable answer: you need to sync before the Isend or you will send garbage data. > Ideally, I want to formulate my code into a series of asynchronous "kernel la...

Re: [OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
I was pointed to "2.7. Synchronization and Memory Ordering" of https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but unfortunately it is too short and I could not understand it. I also checked cudaStreamAddCallback/cudaLaunchHostFunc, whose documentation says the host function "must not make any ...
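
For what it is worth, a hedged sketch of the cudaLaunchHostFunc pattern under discussion: the enqueued host function only flips a flag (it makes no CUDA or MPI calls itself), and the host posts the MPI_Isend once the flag is set. The names and structure are illustrative, not from this thread:

    #include <cuda_runtime.h>
    #include <mpi.h>
    #include <atomic>

    static std::atomic<int> sbuf_ready{0};

    /* Runs on a CUDA-internal host thread once all prior work on the stream
     * has completed. Per the CUDA documentation it must not make CUDA API
     * calls, so it only sets a flag. */
    static void CUDART_CB mark_ready(void *unused) {
        (void)unused;
        sbuf_ready.store(1);
    }

    /* ... in the code path from the question (sbuf, stream, n, dest, tag, and
     * req come from the surrounding code):
     *
     *   Kernel1<<<grid, block, 0, stream>>>(sbuf, n);
     *   cudaLaunchHostFunc(stream, mark_ready, NULL);
     *   Kernel2<<<grid, block, 0, stream>>>();      // enqueued immediately
     *   while (!sbuf_ready.load()) { }              // or do other host work
     *   MPI_Isend(sbuf, n, MPI_DOUBLE, dest, tag, MPI_COMM_WORLD, &req);
     */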

Re: [OMPI users] CUDA mpi question

2019-11-27 Thread Zhang, Junchao via users
Interesting idea, but using MPI_THREAD_MULTIPLE has other side effects. If MPI nonblocking calls could take an extra stream argument and work like a kernel launch, it would be wonderful. --Junchao Zhang. On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote: Why not sp...
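
For context, the thread-based alternative being discussed needs MPI initialized with full thread support; a generic sketch (not code from the thread):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int provided;
        /* Request full thread support so a helper thread (e.g. one that waits
         * on the CUDA stream) may call MPI_Isend concurrently with the main
         * thread. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... spawn a thread that does cudaStreamSynchronize(stream) and then
         * MPI_Isend(...), while the main thread keeps launching kernels ... */
        MPI_Finalize();
        return 0;
    }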

[OMPI users] speed of model is slow with openmpi

2019-11-27 Thread Mahesh Shinde via users
Hi, I am running a physics-based boundary layer model with parallel code that uses Open MPI libraries. I installed Open MPI and am running the model on a general-purpose Azure machine with 8 cores and 32 GB RAM. I compiled the code with *gfortran -O3 -fopenmp -o abc.exe abc.f* and then ran *mpirun -np 8 ./abc.exe* ...

Re: [OMPI users] speed of model is slow with openmpi

2019-11-27 Thread Gilles Gouaillardet via users
Your gfortran command line strongly suggests your program is serial and does not use MPI at all. Consequently, mpirun will simply spawn 8 identical instances of the very same program, and no speedup should be expected (but you can expect some slowdown and/or file corruption). If you obser...
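
For illustration, assuming the source actually contains MPI calls and the Open MPI compiler wrappers are installed (abc.f is the file name from the original post), the two cases would be built and run roughly like this:

    # MPI program: build with the Open MPI Fortran wrapper, then launch ranks
    mpifort -O3 -o abc.exe abc.f
    mpirun -np 8 ./abc.exe

    # OpenMP-only program (as the -fopenmp flag suggests): no mpirun at all,
    # control parallelism with the thread count instead
    gfortran -O3 -fopenmp -o abc.exe abc.f
    OMP_NUM_THREADS=8 ./abc.exe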