Hello - we're encountering problems using contexts in OSHMEM 4.1.0. We installed HPC-X 2.3.0 and rebuilt Open MPI with UCX over InfiniBand. The codes that do not use contexts complete successfully. These codes run successfully with Cray OpenSHMEMX and SOS.

In one case, if we:

    shmem_ctx_create(...)
    <do work with context>
    shmem_barrier_all()

the barrier hangs. If instead we:

    shmem_ctx_create()
    <do work with context>
    shmem_ctx_destroy()
    shmem_barrier_all()

then the barrier completes. In a very simple test:

    shmem_ctx_create()
    shmem_barrier_all()

we get an error message from the barrier, but it completes. If you destroy the context before the barrier, there are no error messages.

We have another app which is crashing in shmem_ctx_long_atomic_fetch_add(). I wrote a simple reproducer, which is copied below. Depending on how we run it, we either crash in the fetch_add or get error messages from a subsequent barrier. PEs that don't crash in fetch_add show correct output values after the fetch_add.

Any ideas?

/*
 * Run with 2 PEs, 1 ppn, fetch_add succeeds and get error in last barrier
 * [c01:77010:0:77010] rc_verbs_iface.c:63 FATAL: send completion with error: transport retry counter exceeded
 *
 * Run with more than 1 ppn, get error in fetch_add for some PEs.
 * [1557515773.667071] [c02:58872:0] mm_posix.c:449 UCX ERROR Error returned from open in attach. Permission denied. File name is: /proc/58875/fd/75
 * [1557515773.667112] [c02:58872:0] mm_ep.c:76 UCX ERROR failed to connect to remote peer with mm. remote mm_id: 252866199552606
 *
 */

#include <stdlib.h>
#include <iostream>
#include <shmem.h>

long fa_val;

int main(int argc, char *argv[])
{
    shmem_init();
    int mype = shmem_my_pe();
    int npes = shmem_n_pes();

    fa_val = mype;

    shmem_ctx_t ctx_id;
    int rc = shmem_ctx_create(0, &ctx_id);
    if (rc) {
        std::cerr << "error creating context" << std::endl;
        exit(1);
    }

    shmem_barrier_all();

    long add_val = 100;
    int other_pe = (mype + 1) % npes;
    std::cerr << "GO FETCH_ADD " << mype << " -> " << other_pe << std::endl;
    long got_val = shmem_ctx_long_atomic_fetch_add(ctx_id, &fa_val, add_val, other_pe);
    std::cerr << "DONE FETCH_ADD " << mype << " -> " << other_pe << std::endl;

    shmem_ctx_destroy(ctx_id);

    std::cerr << "GO BARRIER " << mype << std::endl;
    shmem_barrier_all();
    std::cerr << "DONE BARRIER " << mype << std::endl;

    std::cout << mype << " DONE got from " << other_pe << ": " << got_val << std::endl;
    std::cout << mype << " DONE my fetched/added value now " << fa_val << std::endl;

    shmem_finalize();
}

-----
Lee Ann Riesen, Enterprise and Government Group, Intel Corporation, Hillsboro, OR
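
For reference, the values a correct run of the reproducer above should print can be worked out without OpenSHMEM. The sketch below is plain C++ that walks the same ring pattern (each PE adds 100 to the fa_val of PE (mype + 1) % npes) and reports, per PE, the value the fetch_add should return and the final fa_val. The names PeResult and expected_ring_values are invented for this illustration and are not part of any library. It assumes only what the reproducer does: fa_val starts as the PE number and each PE is the target of exactly one add.

```cpp
#include <vector>

// Hypothetical helper type for illustration; not part of OpenSHMEM.
struct PeResult {
    long fetched;    // value the fetch_add issued by this PE should return
    long final_val;  // this PE's fa_val after every PE has done its add
};

// Serially simulates the ring of fetch_adds from the reproducer.
// Because each PE is targeted exactly once, the order of the adds
// does not change the result.
std::vector<PeResult> expected_ring_values(int npes, long add_val = 100)
{
    std::vector<long> fa_val(npes);
    for (int pe = 0; pe < npes; ++pe) {
        fa_val[pe] = pe;                       // fa_val = mype
    }

    std::vector<PeResult> out(npes);
    for (int pe = 0; pe < npes; ++pe) {
        int other_pe = (pe + 1) % npes;
        out[pe].fetched = fa_val[other_pe];    // old value is returned
        fa_val[other_pe] += add_val;           // then the add is applied
    }
    for (int pe = 0; pe < npes; ++pe) {
        out[pe].final_val = fa_val[pe];
    }
    return out;
}
```

With 2 PEs this gives PE 0: fetched 1, final value 100, and PE 1: fetched 0, final value 101. In general, got_val should equal other_pe and fa_val should end as mype + 100.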