Hi Team,

My application fails with the following error (built with openmpi-5.0.7,
ucx-1.18.0, cuda-12.8, gdrcopy-2.5):


Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x14bd8f464160)
==== backtrace (tid:1104544) ====
 0 0x000000000006141c ucs_callbackq_cleanup()  ???:0
 1 0x00000000000615da ucs_callbackq_cleanup()  ???:0
 2 0x000000000003e6f0 __GI___sigaction()  :0
 3 0x0000000000159af7 __memcpy_avx_unaligned_erms()  :0
 4 0x0000000000076b3d ucp_proto_rndv_handle_data()  ???:0
 5 0x000000000005ef21 ucs_callbackq_add_safe()  ???:0
 6 0x000000000004a42a ucp_worker_progress()  ???:0
 7 0x0000000000027ce4 opal_progress()  ???:0
 8 0x000000000009028f ompi_request_default_wait_any()  ???:0
 9 0x00000000000d94a2 MPI_Waitany()  ???:0
10 0x00000000010c71c7 gmx::PmeCoordinateReceiverGpu::Impl::waitForCoordinatesFromAnyPpRank()  ???:0
11 0x00000000010d211c pme_gpu_spread()  ???:0
12 0x0000000000f4502e pme_gpu_launch_spread()  ???:0
13 0x0000000000f2cf0a gmx_pmeonly()  ???:0
14 0x0000000000f9a15c gmx::Mdrunner::mdrunner()  ???:0
15 0x000000000040960a gmx::gmx_mdrun()  ???:0
16 0x000000000040975d gmx::gmx_mdrun()  ???:0
17 0x000000000077d2a3 gmx::CommandLineModuleManager::run()  ???:0
18 0x0000000000405f1d main()  ???:0
19 0x0000000000029590 __libc_start_call_main()  ???:0
20 0x0000000000029640 __libc_start_main_alias_2()  :0
21 0x0000000000405fa5 _start()  ???:0
=================================

This error appears to be related to CUDA gdr_copy: the crash occurs in a
host-side memcpy called from ucp_proto_rndv_handle_data().

For the GPU Direct RDMA feature, Open MPI needs to be built with UCX, and
UCX in turn needs to be built with CUDA and gdr_copy support. The latest
versions of UCX and gdr_copy are 1.18.0 and 2.5 respectively, but the
Open MPI FAQ recommends ucx-1.4:

https://www.open-mpi.org/faq/?category=buildcuda

which was released in 2018 [6-7 years old].
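
For reference, the build chain described above can be sketched roughly as
follows. This is only an illustration, not necessarily the exact commands
used for this build: the install prefixes are placeholder assumptions, and
the script prints the configure commands instead of running them.

```shell
#!/bin/sh
# Dry-run sketch: prints the configure commands for the UCX -> Open MPI
# build chain instead of executing them.
# All paths below are placeholder assumptions; adjust them to your system.

CUDA_HOME="${CUDA_HOME:-/usr/local/cuda-12.8}"     # assumed CUDA install path
GDRCOPY_HOME="${GDRCOPY_HOME:-/opt/gdrcopy-2.5}"   # assumed gdr_copy prefix
UCX_PREFIX="${UCX_PREFIX:-/opt/ucx-1.18.0}"        # assumed UCX prefix
OMPI_PREFIX="${OMPI_PREFIX:-/opt/openmpi-5.0.7}"   # assumed Open MPI prefix

print_build_plan() {
    # Step 1: UCX is configured with CUDA and gdr_copy support.
    echo "ucx: ./configure --prefix=$UCX_PREFIX --with-cuda=$CUDA_HOME --with-gdrcopy=$GDRCOPY_HOME"
    # Step 2: Open MPI is configured against that UCX and the same CUDA.
    echo "openmpi: ./configure --prefix=$OMPI_PREFIX --with-ucx=$UCX_PREFIX --with-cuda=$CUDA_HOME"
}

print_build_plan
```

After such a build, `ucx_info -d` should list the transports UCX was built
with, and `ompi_info` should show the Open MPI build configuration.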

Is Open MPI not tested with the latest versions of UCX, CUDA, and gdr_copy?
Do we still have to use ucx-1.4?
