All,

We are observing some strange/interesting performance issues when accessing memory that has been allocated through MPI_Win_allocate. I am attaching our test case, which allocates memory for 100M integer values (~400 MB) on each process, both through malloc and through MPI_Win_allocate, and writes to the local range sequentially.

On different systems (incl. SuperMUC and a Bull cluster), we see that accessing the memory allocated through MPI is significantly slower than accessing the malloc'ed memory when multiple processes run on a single node, and the effect grows with the number of processes per node. As an example, when running the attached test with 24 processes per node, the writes to the malloc'ed memory take ~0.4s while the same writes to the MPI-allocated memory take up to 10s.

After some experiments, I think there are two factors involved:

1) Initialization: the first iteration is significantly slower than any subsequent access (1.1s vs 0.4s with 12 processes on a single socket). Excluding the first iteration from the timing, or memsetting the range beforehand, yields comparable performance. I assume this is due to the page faults incurred when first touching the mmap'ed memory that backs the shared memory used for the window. The effect of pre-touching the malloc'ed memory is smaller (0.4s vs 0.6s).

2) NUMA effects: even with proper initialization, running on two sockets still shows a fluctuating performance degradation for the MPI window memory, up to 20x in extreme cases. The performance of accessing the malloc'ed memory is rather stable. The difference seems to shrink (but does not disappear) with an increasing number of repetitions. I am not sure what causes this, as each process should first-touch its local memory and the pages should therefore end up on the local NUMA node (a small sketch for checking the actual page placement follows below).
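To rule out misplaced pages, the actual placement can be queried on Linux via move_pages() from libnuma (link with -lnuma). This is only an illustrative sketch (the helper name and the node-0 summary are made up for the example), not part of the attached test:

#include <numaif.h>   /* move_pages() */
#include <unistd.h>   /* sysconf()    */
#include <stdio.h>
#include <stdlib.h>

/* Report how many pages of a buffer currently reside on NUMA node 0.
 * Pages that have not been faulted in yet report a negative status value. */
static void report_numa_placement(const char *label, void *buf, size_t bytes)
{
  size_t pagesize = (size_t) sysconf(_SC_PAGESIZE);
  size_t npages   = (bytes + pagesize - 1) / pagesize;
  void **pages    = malloc(npages * sizeof(void *));
  int   *status   = malloc(npages * sizeof(int));
  for (size_t p = 0; p < npages; ++p)
    pages[p] = (char *)buf + p * pagesize;
  /* nodes == NULL: move_pages() only queries where each page currently resides */
  if (move_pages(0, npages, pages, NULL, status, 0) == 0) {
    size_t on_node0 = 0;
    for (size_t p = 0; p < npages; ++p)
      if (status[p] == 0) on_node0++;
    printf("%s: %zu of %zu pages on NUMA node 0\n", label, on_node0, npages);
  }
  free(pages);
  free(status);
}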

Are these known issues? Does anyone have any thoughts on my analysis?

It is problematic for us that replacing local memory allocation with MPI memory allocation leads to a performance degradation, as we rely on this mechanism in our distributed data structures. While we can ensure proper initialization of the memory to mitigate 1) for performance measurements, I don't see a way to control the NUMA effects. If there is one, I'd be happy about any hints :)
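The only MPI-level knob I am aware of is the alloc_shared_noncontig info key. The standard defines it for MPI_Win_allocate_shared, so whether MPI_Win_allocate honors it at all is implementation-specific; the following (reusing nelem, baseptr and win from the attached test) is just a sketch of what I mean, not something I have verified to help:

/* Sketch: request per-process (non-contiguous) backing for the window memory.
 * alloc_shared_noncontig is defined for MPI_Win_allocate_shared; support on
 * MPI_Win_allocate is implementation-specific. */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "alloc_shared_noncontig", "true");
MPI_Win_allocate(sizeof(int)*nelem, sizeof(int), info,
                 MPI_COMM_WORLD, &baseptr, &win);
MPI_Info_free(&info);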

I should note that we also tested MPICH-based implementations, which showed similar effects (as they also mmap their window memory). Not surprisingly, using MPI_Alloc_mem and attaching that memory to a dynamic window does not cause these effects, while using shared-memory windows does (a minimal sketch of the dynamic-window variant follows below the command lines). I ran my experiments using Open MPI 3.1.0 with the following command lines:

- 12 cores / 1 socket:
mpirun -n 12 --bind-to socket --map-by ppr:12:socket
- 24 cores / 2 sockets:
mpirun -n 24 --bind-to socket

and verified the binding using --report-bindings.
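For reference, the dynamic-window variant looks roughly like this (a minimal sketch reusing nelem from the attached test, not the exact code we use; remote accesses would additionally require exchanging the attached addresses via MPI_Get_address):

/* Sketch: allocate local memory through MPI and attach it to a dynamic window.
 * In our tests this variant does not show the slowdown. */
MPI_Win dyn_win;
int *dyn_mem;

MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &dyn_win);
MPI_Alloc_mem(sizeof(int)*nelem, MPI_INFO_NULL, &dyn_mem);
MPI_Win_attach(dyn_win, dyn_mem, sizeof(int)*nelem);

/* ... local writes to dyn_mem; RMA via MPI_Put/MPI_Get using the
 *     remote addresses obtained with MPI_Get_address ... */

MPI_Win_detach(dyn_win, dyn_mem);
MPI_Free_mem(dyn_mem);
MPI_Win_free(&dyn_win);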

Any help or comment would be much appreciated.

Cheers
Joseph

--
Dipl.-Inf. Joseph Schuchart
High Performance Computing Center Stuttgart (HLRS)
Nobelstr. 19
D-70569 Stuttgart

Tel.: +49(0)711-68565890
Fax: +49(0)711-6856832
E-Mail: schuch...@hlrs.de
#include <unistd.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>

#include <mpi.h>

#include <sys/time.h>
#include <time.h>

// convert a struct timeval to seconds
#define MYTIMEVAL( tv_ )                        \
  ((tv_.tv_sec)+(tv_.tv_usec)*1.0e-6)

// take a wall-clock timestamp in seconds
#define TIMESTAMP( time_ )                                              \
  {                                                                     \
    struct timeval tv;                                                  \
    gettimeofday( &tv, NULL );                                          \
    time_=MYTIMEVAL(tv);                                                \
  }

//
// do some work and measure how long it takes
//
double do_work(int *beg, size_t nelem, int repeat)
{
  // simple linear congruential generator producing the values to write
  const int LCG_A = 1664525, LCG_C = 1013904223;

  int seed = 31337;
  double start, end;
  MPI_Barrier(MPI_COMM_WORLD);
  TIMESTAMP(start);
  for( int j=0; j<repeat; j++ ) {
    for( size_t i=0; i<nelem; ++i ) {
      seed = LCG_A * seed + LCG_C;
      beg[i] = ((unsigned)seed) % 100;
    }
  }
  MPI_Barrier(MPI_COMM_WORLD);
  TIMESTAMP(end);

  return end-start;
}

int main(int argc, char* argv[])
{
  int rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  size_t nelem = 100000000ULL;   // 100M ints (~400 MB) per process
  int num_reps = 1;              // number of timed write sweeps over the buffer

  MPI_Win win;
  int *baseptr;
  // comparison buffer allocated locally through malloc
  int *mem = (int*) malloc(sizeof(int)*nelem);

  // allocate the same amount of memory through an MPI window
  MPI_Win_allocate(
      sizeof(int)*nelem,
      sizeof(int),
      MPI_INFO_NULL,
      MPI_COMM_WORLD,
      &baseptr,
      &win);

  // fault in the pages of the malloc'ed buffer before timing
  memset(mem, 0, nelem*sizeof(int));
  double dur2 = do_work(mem, nelem, num_reps);

  // fault in the pages of the window memory before timing
  memset(baseptr, 0, nelem*sizeof(int));
  double dur1 = do_work(baseptr, nelem, num_reps);

  if (rank == 0) {  
    printf("MPI win mem: %f secs, Local   mem: %f secs\n", dur1, dur2);
  }

  MPI_Win_free(&win);
  free(mem);

  MPI_Finalize();

  return EXIT_SUCCESS;
}
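For completeness, I build and run the test roughly like this (the source file name is just an example; the mpirun options are the single-socket variant from above):

mpicc -O2 -std=c99 win_alloc_test.c -o win_alloc_test
mpirun -n 12 --bind-to socket --map-by ppr:12:socket ./win_alloc_test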
