The attached patch (against 4.0.2) should fix it; I'll prepare a PR to get
this fixed upstream.
Brice
On 27/11/2019 at 00:41, Brice Goglin via users wrote:
> It looks like NUMA is broken, while others such as SOCKET and L3CACHE
> work fine. A quick look in opal_hwloc_base_get_relative_locality() and
Hi,
Suppose I have this piece of code and I use CUDA-aware MPI:
cudaMalloc(&sbuf,sz);
Kernel1<<<...,stream>>>(...,sbuf);
MPI_Isend(sbuf,...);
Kernel2<<<...,stream>>>();
Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to make
sure the data in sbuf is ready?
Short and portable answer: you need to sync before the Isend or you will
send garbage data.
Assuming you are willing to go for a less portable solution, you can get the
OMPI streams and add your kernels to them, so that the sequential ordering
will guarantee the correctness of your Isend. We have 2 hidden CUDA streams
in OMPI.
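A minimal, self-contained sketch of that portable approach (the kernel name
produce, the count n, the destination rank dest, and the tag are placeholder
names, and a CUDA-aware Open MPI is assumed):

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void produce(double *buf, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = 2.0 * i;               // stands in for Kernel1
    }

    void send_after_sync(double *sbuf, int n, int dest, cudaStream_t stream,
                         MPI_Request *req) {
        produce<<<(n + 255) / 256, 256, 0, stream>>>(sbuf, n);

        // Block the host until everything previously enqueued on 'stream'
        // (i.e. the kernel writing sbuf) has finished.
        cudaStreamSynchronize(stream);

        // sbuf now holds valid data; with a CUDA-aware MPI the device
        // pointer can be passed directly to MPI.
        MPI_Isend(sbuf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD, req);

        // Further kernels (the Kernel2 of the question) can be enqueued
        // here, as long as they do not overwrite sbuf before the matching
        // MPI_Wait/MPI_Test reports completion.
    }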
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
> Short and portable answer: you need to sync before the Isend or you will
> send garbage data.
Ideally, I want to formulate my code into a series of asynchronous "kernel
launch, kernel launch, ..." without synchronization.
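If the goal is simply to block the host as little as possible, one option
(a sketch only, reusing the produce kernel from the example above; this is
not something the thread proposed) is to record a CUDA event right after the
kernel that fills sbuf and wait on that event instead of on the whole stream:

    void send_after_event(double *sbuf, int n, int dest, cudaStream_t stream,
                          MPI_Request *req) {
        cudaEvent_t sbuf_ready;
        cudaEventCreateWithFlags(&sbuf_ready, cudaEventDisableTiming);

        produce<<<(n + 255) / 256, 256, 0, stream>>>(sbuf, n);  // fills sbuf
        cudaEventRecord(sbuf_ready, stream);      // marks "sbuf is complete"

        // Kernel2-style work that does not touch sbuf can already be
        // enqueued on the stream here; it runs behind 'produce' in stream
        // order.

        cudaEventSynchronize(sbuf_ready);  // host waits only for 'produce'
        MPI_Isend(sbuf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, req);

        cudaEventDestroy(sbuf_ready);
    }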
I was pointed to "2.7. Synchronization and Memory Ordering" of
https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf. It is on topic, but
unfortunately it is too short and I could not understand it.
I also checked cudaStreamAddCallback/cudaLaunchHostFunc, whose documentation
says the host function "must not make any CUDA API calls".
Interesting idea. But doing MPI_THREAD_MULTIPLE has other side-effects. If MPI
nonblocking calls could take an extra stream argument and work like a kernel
launch, it would be wonderful.
--Junchao Zhang
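To make that trade-off concrete, the MPI_THREAD_MULTIPLE route would look
roughly like the sketch below (names assumed, building on the mark_ready
flag above): a dedicated host thread posts the send once the flag is set, so
the main thread never has to block on the stream.

    #include <mpi.h>
    #include <thread>
    #include <atomic>

    // Requires initializing MPI with full thread support, e.g.
    //   int provided;
    //   MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    // and checking that provided == MPI_THREAD_MULTIPLE.

    // Helper thread: waits until the stream callback has marked sbuf ready,
    // then posts the nonblocking send and waits for its completion.
    void sender_thread(std::atomic<int> *ready, double *sbuf, int n, int dest) {
        while (!ready->load()) { /* spin; a condition variable also works */ }
        MPI_Request req;
        MPI_Isend(sbuf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    // Main thread (sketch): enqueue the kernel and mark_ready as before,
    // hand the flag to the helper, and keep launching kernels without ever
    // calling cudaStreamSynchronize:
    //   std::thread t(sender_thread, &ready, sbuf, n, dest);
    //   ...
    //   t.join();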
On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
Why not sp
Hi,
I am running a physics-based boundary layer model with parallel code which
uses Open MPI libraries. I installed Open MPI. I am running it on a general
purpose Azure machine with 8 cores and 32 GB RAM. I compiled the code with
*gfortran -O3 -fopenmp -o abc.exe abc.f* and then ran *mpirun -np 8 ./abc.exe*
Your gfortran command line strongly suggests your program is serial and
does not use MPI at all.
Consequently, mpirun will simply spawn 8 identical instances of the very
same program, and no speed up should be expected
(but you can expect some slow down and/or file corruption).
If you obser
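For reference, an MPI program has to call MPI_Init and be built with the MPI
compiler wrapper (mpifort for Fortran, mpicc for C); only then does each of
the 8 processes started by mpirun get its own rank and share the work. A
minimal sketch in C (the same idea applies to a Fortran code built with
mpifort):

    /* Build with:  mpicc -O3 -o hello_mpi hello_mpi.c
       Run with:    mpirun -np 8 ./hello_mpi
       Each process prints a different rank. Eight copies of a program that
       never calls MPI_Init are just eight independent, identical runs. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }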