Hello - we're encountering problems using contexts in OSHMEM 4.1.0.  We 
installed HPC-X 2.3.0 and rebuilt Open MPI with UCX over InfiniBand.  The codes 
that do not use contexts complete successfully.  These codes run successfully 
with Cray OpenSHMEMX and SOS.

In one case, if we do:

shmem_ctx_create(...)
<do work with context>
shmem_barrier_all()

the barrier hangs.  If instead we do:

shmem_ctx_create()
<do work with context>
shmem_ctx_destroy()
shmem_barrier_all()

then the barrier completes.

In a very simple test:

shmem_ctx_create()
shmem_barrier_all()

we get an error message from the barrier, but it completes.  If we destroy the 
context before the barrier, there are no error messages.

We have another app which is crashing in shmem_ctx_long_atomic_fetch_add().  I 
wrote a simple reproducer which is copied below.  Depending on how we run it we 
either crash in the fetch_add or get error messages from a subsequent barrier.  
PEs that don't crash in fetch_add show correct output values after the 
fetch_add. Any ideas?

/*
* Run with 2 PEs, 1 ppn: fetch_add succeeds and we get an error in the last barrier:
*
[c01:77010:0:77010] rc_verbs_iface.c:63   FATAL: send completion with error: transport retry counter exceeded
*
* Run with more than 1 ppn: we get an error in fetch_add for some PEs:
*
[1557515773.667071] [c02:58872:0]       mm_posix.c:449  UCX  ERROR Error returned from open in attach. Permission denied. File name is: /proc/58875/fd/75
[1557515773.667112] [c02:58872:0]          mm_ep.c:76   UCX  ERROR failed to connect to remote peer with mm. remote mm_id: 252866199552606
*
*/
#include <stdlib.h>
#include <iostream>
#include <shmem.h>

long fa_val;

int main(int argc, char *argv[]) {
  shmem_init();
  int mype = shmem_my_pe();
  int npes = shmem_n_pes();
  fa_val = mype;

  shmem_ctx_t ctx_id;
  int rc = shmem_ctx_create(0, &ctx_id);
  if (rc) {
    std::cerr << "error creating context" << std::endl;
    exit(1);
  }
  shmem_barrier_all();

  long add_val = 100;
  int other_pe = (mype + 1) % npes;
  std::cerr << "GO FETCH_ADD " << mype << " -> " << other_pe << std::endl;
  long got_val = shmem_ctx_long_atomic_fetch_add(ctx_id, &fa_val, add_val, other_pe);
  std::cerr << "DONE FETCH_ADD " << mype << " -> " << other_pe << std::endl;

  shmem_ctx_destroy(ctx_id);
  std::cerr << "GO BARRIER " << mype << std::endl;
  shmem_barrier_all();
  std::cerr << "DONE BARRIER " << mype << std::endl;

  std::cout << mype << " DONE got from " << other_pe << ": " << got_val << std::endl;
  std::cout << mype << " DONE my fetched/added value now " << fa_val << std::endl;
  shmem_finalize();
}
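
For reference, here is a small sketch of what "correct output values" means for the reproducer above. It assumes every PE's fetch_add completes exactly once, so each PE's fa_val receives a single add of 100 from its ring predecessor. The names Expected, expected_for_pe and print_expected are hypothetical helpers for illustration only; they are not part of OpenSHMEM and need no SHMEM library to build.

```cpp
#include <cassert>
#include <iostream>

// Hypothetical helper types, not part of OpenSHMEM.
struct Expected {
  int other_pe;  // target PE of the fetch_add, (mype + 1) % npes
  long got_val;  // value returned by the fetch_add
  long fa_val;   // this PE's own fa_val after the final barrier
};

// Expected values for one PE, assuming each PE's fetch_add completes once.
Expected expected_for_pe(int mype, int npes) {
  assert(npes > 0 && mype >= 0 && mype < npes);
  Expected e;
  e.other_pe = (mype + 1) % npes;
  // fa_val on other_pe starts as other_pe, and only this PE adds to it,
  // so the fetched (pre-add) value is other_pe itself.
  e.got_val = e.other_pe;
  // Exactly one PE (the ring predecessor) adds 100 to this PE's fa_val.
  e.fa_val = mype + 100L;
  return e;
}

// Print the expected values in the same shape as the reproducer's output.
void print_expected(int npes, std::ostream &os) {
  for (int pe = 0; pe < npes; ++pe) {
    Expected e = expected_for_pe(pe, npes);
    os << pe << " DONE got from " << e.other_pe << ": " << e.got_val << "\n";
    os << pe << " DONE my fetched/added value now " << e.fa_val << "\n";
  }
}
```

Calling print_expected(2, std::cout) from a main() lists the values a clean 2-PE run should report, which can be compared against the output of the PEs that do not crash.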


-----
Lee Ann Riesen, Enterprise and Government Group, Intel Corporation, Hillsboro, OR