On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
> On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Short and portable answer: you need to sync before the Isend or you will
>> send garbage data.
>>
> Ideally, I want to formulate my code into a series of asynchronous "kernel
> launch, kernel launch, ..." without synchronization, so that I can hide
> kernel launch overhead. It now seems I have to sync before MPI calls (even
> nonblocking calls).

Then you need a means to ensure sequential execution, and this is what the
streams provide. Unfortunately, I looked into the code and I'm afraid there
is currently no realistic way to do what you need. My previous comment was
based on older code that seems to be 1) currently unmaintained, and 2) only
applicable to the OB1 PML + OpenIB BTL combo. As recent versions of OMPI
have moved away from the OpenIB BTL, relying more heavily on UCX for
InfiniBand support, the old code is now deprecated. Sorry for giving you
hope on this. Maybe you can delegate the MPI call into a CUDA event
callback? (A rough sketch of that idea is appended at the end of this
message.)

George.

>> Assuming you are willing to go for a less portable solution, you can get
>> the OMPI streams and add your kernels inside, so that the sequential order
>> will guarantee correctness of your isend. We have 2 hidden CUDA streams in
>> OMPI, one for device-to-host and one for host-to-device, that can be
>> queried with the non-MPI-standard-compliant functions
>> (mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream).
>>
> Which streams (dtoh or htod) should I use to insert kernels producing
> send data and kernels using received data? I imagine MPI uses GPUDirect
> RDMA to move data directly from GPU to NIC. Why do we need to bother with
> dtoh or htod streams?
>
>> George.
>>
>> On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <
>> users@lists.open-mpi.org> wrote:
>>
>>> Hi,
>>>   Suppose I have this piece of code and I use cuda-aware MPI,
>>>
>>>   cudaMalloc(&sbuf, sz);
>>>   Kernel1<<<..., stream>>>(..., sbuf);
>>>   MPI_Isend(sbuf, ...);
>>>   Kernel2<<<..., stream>>>();
>>>
>>>   Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
>>>   make sure data in sbuf is ready to send? If not, why?
>>>
>>>   Thank you.
>>>
>>> --Junchao Zhang
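
A minimal sketch of the "delegate the MPI call into a CUDA host/event
callback" idea mentioned above, under stated assumptions: CUDA >= 10 for
cudaLaunchHostFunc, an MPI library initialized with MPI_THREAD_MULTIPLE
(the callback runs on a CUDA-internal thread, not the main thread), and
illustrative names (send_args, isend_cb, the Isend arguments) that are not
part of any OMPI API. Note also that the CUDA documentation forbids CUDA
API calls inside host functions, so a CUDA-aware MPI_Isend on a device
buffer may still be unsafe here; treat this as a starting point rather than
a verified solution.

#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative helper type holding the arguments of the deferred Isend. */
struct send_args {
  void        *sbuf;     /* device buffer filled by Kernel1 */
  int          count;
  int          dest, tag;
  MPI_Request  req;
};

/* Host function enqueued in the stream: it runs only after all work
 * submitted to the stream before it (i.e. Kernel1) has completed, so
 * sbuf is ready to send, and the host thread itself never blocks. */
static void CUDART_CB isend_cb(void *userData)
{
  struct send_args *a = (struct send_args *)userData;
  MPI_Isend(a->sbuf, a->count, MPI_DOUBLE, a->dest, a->tag,
            MPI_COMM_WORLD, &a->req);
}

/* Usage, mirroring the original snippet (grid/block sizes elided):
 *   Kernel1<<<grid, block, 0, stream>>>(..., sbuf);
 *   cudaLaunchHostFunc(stream, isend_cb, &args);  // instead of sync + Isend
 *   Kernel2<<<grid, block, 0, stream>>>();        // also waits for isend_cb
 * Put Kernel2 on a different stream if it must not wait for the callback. */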