Short and portable answer: you need to synchronize the stream (cudaStreamSynchronize) before the Isend, or you will send garbage data.
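
As a rough illustration of the portable approach, here is a minimal sketch, not a definitive implementation. It assumes a CUDA-aware MPI build and at least two ranks; the kernel names (fill, unrelated_work), the buffer size, and the message tag are made up for the example and do not come from the original code.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical stand-in for Kernel1: produces the data to be sent.
__global__ void fill(double *buf, int n, double value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = value;
}

// Hypothetical stand-in for Kernel2: works on a different buffer.
__global__ void unrelated_work(double *scratch, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) scratch[i] *= 2.0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;            // assumed message length
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    double *sbuf, *scratch;
    cudaMalloc(&sbuf, n * sizeof(double));
    cudaMalloc(&scratch, n * sizeof(double));
    cudaMemset(scratch, 0, n * sizeof(double));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    if (size >= 2 && rank == 0) {
        fill<<<blocks, threads, 0, stream>>>(sbuf, n, 1.0);

        // MPI knows nothing about this stream, so wait until the kernel
        // has finished writing sbuf before handing the buffer to MPI.
        cudaStreamSynchronize(stream);

        MPI_Request req;
        MPI_Isend(sbuf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);

        // This kernel may overlap with the send because it does not touch
        // sbuf; the send buffer must stay unmodified until MPI_Wait returns.
        unrelated_work<<<blocks, threads, 0, stream>>>(scratch, n);

        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (size >= 2 && rank == 1) {
        MPI_Recv(sbuf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(scratch);
    cudaFree(sbuf);
    MPI_Finalize();
    return 0;
}
```

The only line that matters here is the cudaStreamSynchronize(stream) between the producing kernel and MPI_Isend; everything else is scaffolding to make the sketch complete.
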
If you are willing to go for a less portable solution, you can get the OMPI streams and launch your kernels in them, so that the sequential ordering within the stream guarantees the correctness of your Isend. We have two hidden CUDA streams in OMPI, one for device-to-host and one for host-to-device transfers, that can be queried with functions that are not part of the MPI standard (mca_common_cuda_get_dtoh_stream and mca_common_cuda_get_htod_stream).

George.

On Wed, Nov 27, 2019 at 4:02 PM Zhang, Junchao via users <users@lists.open-mpi.org> wrote:

> Hi,
> Suppose I have this piece of code and I use cuda-aware MPI,
> cudaMalloc(&sbuf,sz);
>
> Kernel1<<<...,stream>>>(...,sbuf);
> MPI_Isend(sbuf,...);
> Kernel2<<<...,stream>>>();
>
>
> Do I need to call cudaStreamSynchronize(stream) before MPI_Isend() to
> make sure data in sbuf is ready to send? If not, why?
>
> Thank you.
>
> --Junchao Zhang
>