I'm running OpenMPI 4.0.0 built with gdrcopy 1.3 and UCX 1.4 per the instructions at https://www.open-mpi.org/faq/?category=buildcuda, built against CUDA 10.0 on RHEL 7. I'm running on a p2.xlarge instance in AWS (single NVIDIA K80 GPU).

OpenMPI reports CUDA support:

$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true
I'm attempting to use MPI_Ialltoall() to overlap a block of GPU computations with network transfers, using MPI_Test() to nudge the asynchronous transfers along. Based on Table 5 at https://www.open-mpi.org/faq/?category=runcuda, MPI_Ialltoall() should be supported (MPI_Test() isn't called out as supported or unsupported, but my example crashes with or without it).

The behavior I'm seeing is that with a small number of elements everything runs without issue. With a larger number of elements (where "large" is just a few hundred), I start to get errors like this:

cma_ep.c:113 UCX ERROR process_vm_readv delivered 0 instead of 16000, error message Bad address

Switching to the blocking MPI_Alltoall() lets the program run successfully.

I tried to boil my issue down to the simplest program that reproduces the crash. Note that it needs to be compiled with "--std=c++11". Running "mpirun -np 2 mpi_test_ialltoall 200 256 10" succeeds; changing the 200 to 400 results in a crash after a few blocks.

Code sample: https://gist.github.com/asylvest/7c9d5c15a3a044a0a2338cf9c828d2c3

Thanks for any thoughts.

-Adam
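P.S. For anyone who doesn't want to click through to the gist, the core pattern boils down to roughly the sketch below. It is a simplified illustration, not my actual test program: the single "count" argument, the scratch buffer, and the byte-wise cudaMemset standing in for a real compute kernel are placeholders.

// Rough sketch of the overlap pattern: post MPI_Ialltoall on device buffers,
// keep a dummy GPU workload running, and call MPI_Test to nudge the transfer
// along.  Simplified stand-in for the gist linked above.
//
// Build (roughly): mpicxx -std=c++11 overlap_sketch.cpp -lcudart -o overlap_sketch
// Run:             mpirun -np 2 ./overlap_sketch 400

#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    // Elements exchanged with each peer; counts in the hundreds are roughly
    // where the process_vm_readv "Bad address" error starts showing up for me.
    const int count = (argc > 1) ? std::atoi(argv[1]) : 400;
    const size_t bytes = static_cast<size_t>(count) * nranks * sizeof(float);

    float *d_send = nullptr, *d_recv = nullptr, *d_work = nullptr;
    cudaMalloc(&d_send, bytes);
    cudaMalloc(&d_recv, bytes);
    cudaMalloc(&d_work, bytes);          // scratch buffer for the "GPU work"
    cudaMemset(d_send, 0, bytes);

    // Non-blocking all-to-all directly on device pointers (CUDA-aware Open MPI).
    MPI_Request req;
    MPI_Ialltoall(d_send, count, MPI_FLOAT,
                  d_recv, count, MPI_FLOAT,
                  MPI_COMM_WORLD, &req);

    // Overlap: do GPU work on a separate buffer (the send buffer must not be
    // modified until completion) while poking the request with MPI_Test.
    int done = 0;
    while (!done)
    {
        cudaMemset(d_work, 1, bytes);    // stand-in for a real compute kernel
        cudaDeviceSynchronize();
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    // Pull a few results back to the host as a sanity check.
    std::vector<float> h(count);
    cudaMemcpy(h.data(), d_recv, count * sizeof(float), cudaMemcpyDeviceToHost);
    if (rank == 0)
        std::printf("first received element: %f\n", h[0]);

    cudaFree(d_send);
    cudaFree(d_recv);
    cudaFree(d_work);
    MPI_Finalize();
    return 0;
}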