Hi Team,

My application fails with the following error (built with openmpi-5.0.7, ucx-1.18.0, cuda-12.8, gdrcopy-2.5):
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14bd8f464160)
==== backtrace (tid:1104544) ====
 0 0x000000000006141c ucs_callbackq_cleanup() ???:0
 1 0x00000000000615da ucs_callbackq_cleanup() ???:0
 2 0x000000000003e6f0 __GI___sigaction() :0
 3 0x0000000000159af7 __memcpy_avx_unaligned_erms() :0
 4 0x0000000000076b3d ucp_proto_rndv_handle_data() ???:0
 5 0x000000000005ef21 ucs_callbackq_add_safe() ???:0
 6 0x000000000004a42a ucp_worker_progress() ???:0
 7 0x0000000000027ce4 opal_progress() ???:0
 8 0x000000000009028f ompi_request_default_wait_any() ???:0
 9 0x00000000000d94a2 MPI_Waitany() ???:0
10 0x00000000010c71c7 gmx::PmeCoordinateReceiverGpu::Impl::waitForCoordinatesFromAnyPpRank() ???:0
11 0x00000000010d211c pme_gpu_spread() ???:0
12 0x0000000000f4502e pme_gpu_launch_spread() ???:0
13 0x0000000000f2cf0a gmx_pmeonly() ???:0
14 0x0000000000f9a15c gmx::Mdrunner::mdrunner() ???:0
15 0x000000000040960a gmx::gmx_mdrun() ???:0
16 0x000000000040975d gmx::gmx_mdrun() ???:0
17 0x000000000077d2a3 gmx::CommandLineModuleManager::run() ???:0
18 0x0000000000405f1d main() ???:0
19 0x0000000000029590 __libc_start_call_main() ???:0
20 0x0000000000029640 __libc_start_main_alias_2() :0
21 0x0000000000405fa5 _start() ???:0
=================================

This error appears to be due to CUDA gdr_copy.

For the GPU Direct RDMA feature, openmpi needs to be built with ucx, and ucx in turn needs to be built with cuda and gdr_copy. The latest versions of ucx and gdr_copy are 1.18.0 and 2.5 respectively. However, openmpi recommends ucx-1.4 (https://www.open-mpi.org/faq/?category=buildcuda), which was released in 2018, 6-7 years ago.

Is openmpi not tested with the latest versions of ucx, cuda and gdr_copy? Do we still have to use ucx-1.4?
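
In case it helps with diagnosis, here is a minimal sketch of how the build chain described above (openmpi on top of ucx, ucx on top of cuda and gdr_copy) could be checked on the failing node. It assumes `ucx_info` and `ompi_info` from the same installation are on PATH. The `check` helper is only an illustrative wrapper, and the exact output format may differ between versions.

```shell
#!/bin/sh
# Report which CUDA-related pieces this Open MPI / UCX install provides.
# A check is skipped with a note if its tool is not on PATH.

check() {
    # $1 = tool that must exist, remaining args = command to run
    tool=$1; shift
    if command -v "$tool" >/dev/null 2>&1; then
        "$@"
    else
        echo "$tool: not found on PATH"
    fi
}

# UCX version and the configure flags it was built with
check ucx_info ucx_info -v

# Transports UCX can see; cuda_copy / gdr_copy entries are expected
# only when UCX was built with CUDA and gdrcopy support
check ucx_info sh -c 'ucx_info -d | grep -i -e cuda -e gdr || echo "no cuda/gdr transports listed"'

# Whether Open MPI itself reports CUDA support
check ompi_info sh -c 'ompi_info --parsable --all | grep mpi_built_with_cuda_support:value || echo "no CUDA support value reported"'
```

If the second command lists no cuda or gdr entries, the segfault is probably not in the gdr_copy path at all, which would be useful to know before trying other ucx versions.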