Nathan,

Thanks for looking into that. My test program is attached.

Best
Joseph

On 05/08/2018 02:56 PM, Nathan Hjelm wrote:
I will take a look today. Can you send me your test program?

-Nathan

On May 8, 2018, at 2:49 AM, Joseph Schuchart <schuch...@hlrs.de> wrote:

All,

I have been experimenting with using Open MPI 3.1.0 on our Cray XC40 
(Haswell-based nodes, Aries interconnect) for multi-threaded MPI RMA. 
Unfortunately, a simple (single-threaded) test case in which two processes each issue an MPI_Rget followed by an MPI_Wait hangs as soon as the two processes run on different nodes. It succeeds if both processes run on a single node.
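
The core of the failing pattern, condensed from the attached test program into a minimal stand-alone reproducer, is roughly:

```
#include <stdlib.h>
#include <mpi.h>

#define NUM_ELEMS 1000000L

int main(int argc, char **argv)
{
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_Win win;
  int *baseptr;
  MPI_Win_allocate(NUM_ELEMS * sizeof(int), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &baseptr, &win);
  MPI_Win_lock_all(0, win);

  int *val = malloc(NUM_ELEMS * sizeof(int));
  int target = (rank + 1) % size;   /* the other process when run with -n 2 */
  MPI_Request req;
  /* this pair hangs when the target rank lives on a different node */
  MPI_Rget(val, NUM_ELEMS, MPI_INT, target, 0, NUM_ELEMS, MPI_INT, win, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  free(val);

  MPI_Win_unlock_all(win);
  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}
```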

For completeness, I am attaching the config.log. The build environment was set 
up to build Open MPI for the login nodes (I wasn't sure how to properly 
cross-compile the libraries):

```
# this seems necessary to avoid a linker error during build
export CRAYPE_LINK_TYPE=dynamic
module swap PrgEnv-cray PrgEnv-intel
module sw craype-haswell craype-sandybridge
module unload craype-hugepages16M
module unload cray-mpich
```

I am launching the test code with mpirun. Below is the BTL debug log (the tcp BTL is disabled for clarity; enabling it makes no difference):

```
mpirun --mca btl_base_verbose 100 --mca btl ^tcp -n 2 -N 1 ./mpi_test_loop
[nid03060:36184] mca: base: components_register: registering framework btl components
[nid03060:36184] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: component self register function successful
[nid03060:36184] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: registering framework btl components
[nid03061:36208] mca: base: components_register: found loaded component self
[nid03060:36184] mca: base: components_register: found loaded component ugni
[nid03061:36208] mca: base: components_register: component self register function successful
[nid03061:36208] mca: base: components_register: found loaded component sm
[nid03061:36208] mca: base: components_register: found loaded component ugni
[nid03060:36184] mca: base: components_register: component ugni register function successful
[nid03060:36184] mca: base: components_register: found loaded component vader
[nid03061:36208] mca: base: components_register: component ugni register function successful
[nid03061:36208] mca: base: components_register: found loaded component vader
[nid03060:36184] mca: base: components_register: component vader register function successful
[nid03060:36184] mca: base: components_open: opening btl components
[nid03060:36184] mca: base: components_open: found loaded component self
[nid03060:36184] mca: base: components_open: component self open function successful
[nid03060:36184] mca: base: components_open: found loaded component ugni
[nid03060:36184] mca: base: components_open: component ugni open function successful
[nid03060:36184] mca: base: components_open: found loaded component vader
[nid03060:36184] mca: base: components_open: component vader open function successful
[nid03060:36184] select: initializing btl component self
[nid03060:36184] select: init of component self returned success
[nid03060:36184] select: initializing btl component ugni
[nid03061:36208] mca: base: components_register: component vader register function successful
[nid03061:36208] mca: base: components_open: opening btl components
[nid03061:36208] mca: base: components_open: found loaded component self
[nid03061:36208] mca: base: components_open: component self open function successful
[nid03061:36208] mca: base: components_open: found loaded component ugni
[nid03061:36208] mca: base: components_open: component ugni open function successful
[nid03061:36208] mca: base: components_open: found loaded component vader
[nid03061:36208] mca: base: components_open: component vader open function successful
[nid03061:36208] select: initializing btl component self
[nid03061:36208] select: init of component self returned success
[nid03061:36208] select: initializing btl component ugni
[nid03061:36208] select: init of component ugni returned success
[nid03061:36208] select: initializing btl component vader
[nid03061:36208] select: init of component vader returned failure
[nid03061:36208] mca: base: close: component vader closed
[nid03061:36208] mca: base: close: unloading component vader
[nid03060:36184] select: init of component ugni returned success
[nid03060:36184] select: initializing btl component vader
[nid03060:36184] select: init of component vader returned failure
[nid03060:36184] mca: base: close: component vader closed
[nid03060:36184] mca: base: close: unloading component vader
[nid03061:36208] mca: bml: Using self btl for send to [[54630,1],1] on node nid03061
[nid03060:36184] mca: bml: Using self btl for send to [[54630,1],0] on node nid03060
[nid03061:36208] mca: bml: Using ugni btl for send to [[54630,1],0] on node (null)
[nid03060:36184] mca: bml: Using ugni btl for send to [[54630,1],1] on node (null)
```

It looks like the ugni BTL is initialized correctly on both nodes but then fails to resolve the node to communicate with (note the `(null)` node names in the last two lines of the log). Is there a way to get more information? There doesn't seem to be an MCA parameter to increase the verbosity of the ugni BTL specifically.
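
A way to double-check that last point, in case I missed something, would be to list the component's registered MCA parameters (assuming the ompi_info that belongs to this installation is first in the PATH):

```
# enumerate all MCA parameters exposed by the ugni BTL, including the
# highest-level (developer) ones
ompi_info --param btl ugni --level 9
```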

Any help would be appreciated!

Cheers
Joseph
<config.log.tgz>

#include <stdio.h>
#include <mpi.h>
#include <time.h>
#include <assert.h>
#include <stdlib.h>

#define NUM_ELEMS 1000000L
#define NUM_ITER 100

static int rank, size;

static void test_rget_wait(MPI_Win win, int target)
{
  /* fetch NUM_ELEMS ints from the target's window and wait for the request */
  MPI_Request req;
  int *val = malloc(sizeof(int) * NUM_ELEMS);
  MPI_Rget(val, NUM_ELEMS, MPI_INT, target, 0, NUM_ELEMS, MPI_INT, win, &req);
  MPI_Wait(&req, MPI_STATUS_IGNORE);
  printf("[%d] target %d done\n", rank, target);
  free(val);
}

int main(int argc, char **argv)
{
  MPI_Win win;
  int elem_per_unit = NUM_ELEMS;
  int *baseptr;

  int thread_provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &thread_provided);
  if (thread_provided != MPI_THREAD_MULTIPLE) {
    printf("MPI_THREAD_MULTIPLE required, provided: %d\n", thread_provided);
    exit(1);
  }
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  int neighbor = (rank + 1) % size;

  /* allocate the RMA window (elem_per_unit ints per rank) and lock it at all targets */
  MPI_Win_allocate(
    elem_per_unit*sizeof(int), 1, MPI_INFO_NULL,
    MPI_COMM_WORLD, &baseptr, &win);

  MPI_Win_lock_all(0, win);

  /* each iteration issues an Rget+Wait to every rank; threaded only if compiled with OpenMP */
#pragma omp parallel for
  for (int i = 0; i < NUM_ITER; ++i) {
    printf("[%d] i=%d\n", rank, i);
    for (int target = 0; target < size; ++target) {
      test_rget_wait(win, target);
    }
  }

  MPI_Win_unlock_all(win);

  MPI_Win_free(&win);

  MPI_Finalize();

  return 0;
}