Dear list,

We've found a problem with openmpi when running over InfiniBand (IB): results go wrong when a calculation that reads some elements of an array overlaps with communication into other elements of the same array (elements that the calculation does not use). I have written a small test program (below) that shows this behaviour. The smaller the array (arrlen in the code), the more often the problem occurs. The problem only occurs when using IB (even on the same node!?); with mpirun -mca btl tcp,self it vanishes.

The behaviour differs slightly between versions: with openmpi 1.2.9 the problem already shows up with 3 processes, whereas 1.3.1 needs 4 processes before it appears. Proper output on 4 processes should just be:
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60

With IB:
mpirun  -np 4 ./test3|head
Sum should be 60
Sum should be 60
Sum should be 60
Sum should be 60
Result on rank 0 strangely is 1.06316e+248
Result on rank 2 strangely is 1.54396e+262
Result on rank 3 strangely is 3.87325e+233
Result on rank 1 strangely is 1.54396e+262
Result on rank 1 strangely is 1.54396e+262
Result on rank 2 strangely is 1.54396e+262


Info about the system:

openmpi: 1.2.9, 1.3.1

From ompi_info:
   MCA btl: openib (MCA v2.0, API v2.0, Component v1.3.1)

From lspci:
04:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)

configure picks up ibverbs:
--- MCA component btl:ofud (m4 configuration macro)
checking for MCA component btl:ofud compile mode... dso
checking --with-openib value... simple ok (unspecified)
checking --with-openib-libdir value... simple ok (unspecified)
checking for fcntl.h... (cached) yes
checking sys/poll.h usability... yes
checking sys/poll.h presence... yes
checking for sys/poll.h... yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
looking for library without search path
checking for ibv_open_device in -libverbs... yes
checking number of arguments to ibv_create_cq... 5
checking whether IBV_EVENT_CLIENT_REREGISTER is declared... yes
checking for ibv_get_device_list... yes
checking for ibv_resize_cq... yes
checking for struct ibv_device.transport_type... yes
checking for ibv_create_xrc_rcv_qp... no
checking rdma/rdma_cma.h usability... yes
checking rdma/rdma_cma.h presence... yes
checking for rdma/rdma_cma.h... yes
checking for rdma_create_id in -lrdmacm... yes
checking for rdma_get_peer_addr... yes
checking for infiniband/driver.h... yes
checking if ConnectX XRC support is enabled... no
checking if OpenFabrics RDMACM support is enabled... yes
checking if OpenFabrics IBCM support is enabled... no
checking if MCA component btl:ofud can compile... yes

--- MCA component btl:openib (m4 configuration macro)
checking for MCA component btl:openib compile mode... dso
checking --with-openib value... simple ok (unspecified)
checking --with-openib-libdir value... simple ok (unspecified)
checking for fcntl.h... (cached) yes
checking for sys/poll.h... (cached) yes
checking infiniband/verbs.h usability... yes
checking infiniband/verbs.h presence... yes
checking for infiniband/verbs.h... yes
looking for library without search path
checking for ibv_open_device in -libverbs... yes
checking number of arguments to ibv_create_cq... (cached) 5
checking whether IBV_EVENT_CLIENT_REREGISTER is declared... (cached) yes
checking for ibv_get_device_list... (cached) yes
checking for ibv_resize_cq... (cached) yes
checking for struct ibv_device.transport_type... (cached) yes
checking for ibv_create_xrc_rcv_qp... (cached) no
checking for rdma/rdma_cma.h... (cached) yes
checking for rdma_create_id in -lrdmacm... (cached) yes
checking for rdma_get_peer_addr... yes
checking for infiniband/driver.h... (cached) yes
checking if ConnectX XRC support is enabled... no
checking if OpenFabrics RDMACM support is enabled... yes
checking if OpenFabrics IBCM support is enabled... no
checking for ibv_fork_init... yes
checking for thread support (needed for ibcm/rdmacm)... posix
checking which openib btl cpcs will be built... oob rdmacm
checking if MCA component btl:openib can compile... yes


Compilers: gcc 4.1.2 and pgcc 8.0-4 (both 64-bit) show the same problem; the optimization level (-fast, -O3 or -O0) does not matter.

CPU: Opteron 250
OS: Scientific Linux 5.2

If you require any more information, I'll be more than happy to provide it!

Is this a proper way to overlap communication with calculation? Could this be some kind of cache-coherency problem, where values are already in the CPU cache but RDMA writes the new data directly to memory? In that case, though, I would not expect the sum to be that far off. What would happen if the compiler decided to use non-temporal prefetches (or non-temporal stores, in the general case)?



The code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>


int main(int argc, char **argv)
{
  int rank,size,i,j,k;
  const int arrlen=10;
  const int repeattest=100;
  double *array;
  MPI_Request *reqarr;
  MPI_Status *mpistat;
  MPI_Datatype STRIDED;
  int torank,fromrank,nreq;
  int sumshouldbe;
  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);

  /* Non-contiguous data */
  MPI_Type_vector(arrlen,1,size,MPI_DOUBLE,&STRIDED);
  MPI_Type_commit(&STRIDED);

  array=malloc(arrlen*size *sizeof *array);
  reqarr=malloc(2*size*sizeof *reqarr);
  mpistat=malloc(2*size*sizeof *mpistat);

  /* Setup communication */
  sumshouldbe=0;
  nreq=0;
  for (i=1; i<size; i++)
    {
      torank=rank+i;
      if (torank>=size)
        torank-=size;
      fromrank=rank-i;
      if (fromrank<0)
        fromrank+=size;
      MPI_Recv_init(array+i,1,STRIDED,fromrank,i,MPI_COMM_WORLD,reqarr+nreq);
      nreq++;
      MPI_Send_init(array,1,STRIDED,torank,i,MPI_COMM_WORLD,reqarr+nreq);
      nreq++;
      sumshouldbe+=i;
    }
  printf("Sum should be %g\n",(double)arrlen*sumshouldbe);
  /* Do the tests. */
  for (j=0; j<repeattest; j++)
    {
      double sum=0.;
      /* Init test arrays. Array on first process is initially all
         zero. On second process all one, etc. Same as rank number. */
      for (i=0; i<arrlen*size; i++)
        array[i]=(double)rank;

      /* Start communication */
      MPI_Startall(nreq,reqarr);

      /* Accumulate column 0 of the array. These elements are the send
         buffer, so they are only read; nothing is *received* into
         them!! */
      for (i=0; i<arrlen; i++)
        sum+=array[i*size];

      /* Wait for communication to finish */
      MPI_Waitall(nreq,reqarr,mpistat);

      /* Accumulate part of arrays that have been communicated. */
      for (i=0; i<arrlen*(size-1); i++)
        {
          for (k=0; k<size-1; k++)
            sum+=array[i*size+1+k];
        }

      if (sum!=arrlen*sumshouldbe)
        printf("Result on rank %d strangely is %g\n",rank,sum);
    }

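  /* Release the persistent requests, the datatype and the buffers */
  for (i=0; i<nreq; i++)
    MPI_Request_free(reqarr+i);
  MPI_Type_free(&STRIDED);
  free(array);
  free(reqarr);
  free(mpistat);

  MPI_Finalize();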
  return 0;
}





--
Daniel Spångberg
Materialkemi
Uppsala Universitet
