I've noticed a very significant (100%) slowdown in MPI_Alltoallv calls as of version 1.6.1.

* It is most noticeable for high-frequency exchanges over 1Gb Ethernet where the process-to-process message sizes are fairly small (e.g. 100 kbyte) and much of the exchange matrix is sparse.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default algorithm to a pairwise exchange", but I'm not clear what this means or how to switch back to the old "non-default algorithm" (the closest I could find is sketched below).
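
If I've read the tuned collective MCA parameters correctly (this is only a guess from skimming ompi_info output, so please correct me if the names or values are wrong), something along these lines ought to override the new default and select a different MPI_Alltoallv algorithm, although I don't know which value, if any, restores the pre-1.6.1 behaviour:

  ompi_info --param coll tuned | grep alltoallv
  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoallv_algorithm 1 \
         -np 32 ... ./alltoallv_test

(./alltoallv_test is a placeholder name for the attached test program.)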

I attach a test program which illustrates the sort of usage in our MPI application. I have run this as 32 processes on four nodes over 1Gb Ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8, ... on node 1, ranks 1, 5, 9, ... on node 2, etc.

It constructs an array of integers and an nProcess x nProcess exchange matrix typical of part of our application, then performs the exchange several thousand times. Output from runs compiled with "mpicc -O3" is shown below.
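
For reference, I built and ran the test roughly as follows (source and host file names are placeholders; --bynode gives the round-robin rank placement described above):

  mpicc -O3 -o alltoallv_test alltoallv_test.c
  mpirun -np 32 --bynode --hostfile hosts ./alltoallv_test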

My guess is that 1.6.1 is hitting additional latency not present in 1.6.0. I also attach a plot showing network throughput for our actual mesh-generation application. Nodes cfsc01-04 are running 1.6.0 and finish within 35 minutes; nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and take over an hour to run. There seems to be much greater network demand in the 1.6.1/1.6.2 versions, despite the user code and input data being identical.

Thanks for any help you can give,
Simon

For 1.6.0:

Open MPI 1.6.0
Proc  0: 50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 198 x 100 int
Proc  1: 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 148 x 100 int
Proc  2: 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 109 x 100 int
Proc  3: 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 80 x 100 int
Proc  4: 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 58 x 100 int
Proc  5: 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 41 x 100 int
Proc  6: 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 29 x 100 int
Proc  7: 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 20 x 100 int
Proc  8: 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 14 x 100 int
Proc  9: 3 2 1 0 0 0 0 0 0 0 0 0 Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0 Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0 Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1 Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2 Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3 Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4 Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 Total: 109 x 100 int
Proc 30: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 Total: 148 x 100 int
Proc 31: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 50 Total: 198 x 100 int
....................................................................................................
Total time = 15.443502 seconds


For 1.6.1:

Open MPI 1.6.1
Proc  0: 50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 198 x 100 int
Proc  1: 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 148 x 100 int
Proc  2: 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 109 x 100 int
Proc  3: 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 80 x 100 int
Proc  4: 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 58 x 100 int
Proc  5: 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 41 x 100 int
Proc  6: 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 29 x 100 int
Proc  7: 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 20 x 100 int
Proc  8: 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 14 x 100 int
Proc  9: 3 2 1 0 0 0 0 0 0 0 0 0 Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0 Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0 Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1 Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2 Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3 Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4 Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 Total: 109 x 100 int
Proc 30: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 Total: 148 x 100 int
Proc 31: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 50 Total: 198 x 100 int
....................................................................................................
Total time = 25.549821 seconds
/*
  Test program to illustrate the Open MPI MPI_Alltoallv slowdown seen in
  1.6.1/1.6.2 relative to 1.6.0.
*/
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int nProc, rank, n, m;
  int *data, *r_data;
  int *nData, *r_nData;
  int *displ, *r_displ;
  int *gather;
  double start = 0, finish = 0;
  MPI_Init(&argc,&argv); 
  MPI_Comm_size(MPI_COMM_WORLD,&nProc); 
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);

  if (rank == 0) {
    printf("Open MPI %d.%d.%d\n",
           OMPI_MAJOR_VERSION,OMPI_MINOR_VERSION,OMPI_RELEASE_VERSION);
  }

  nData = (int *)malloc(sizeof(int)*nProc);
  r_nData = (int *)malloc(sizeof(int)*nProc);
  gather = (int *)malloc(sizeof(int)*nProc*nProc);

  displ = (int *)malloc(sizeof(int)*(nProc + 1));
  r_displ = (int *)malloc(sizeof(int)*(nProc + 1));

  float maxSize = 5000.0f;

  displ[0] = 0;
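  /*
    Build a sparse, banded exchange pattern: the count this rank sends to
    rank n is |(nProc - 1 - rank - n) / (nProc - 1)|^8 * maxSize ints.
    With 32 ranks this gives rank 0 counts of 5000, ~3800, ~2900, ... ints
    for destinations 0, 1, 2, ... and exactly zero for ranks 21-31, i.e.
    small messages and a largely sparse exchange matrix.
  */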
  for (n = 0; n != nProc; ++n) {
    float x = (float)(nProc - 1 - rank - n)/(float)(nProc - 1);
    nData[n] = (int)fabs(x*x*x*x*x*x*x*x * maxSize);
    displ[n+1] = displ[n] + nData[n];
  }
  data = (int *)malloc(sizeof(int)*displ[nProc]);

  MPI_Gather(nData, nProc, MPI_INT, gather, nProc, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (n = 0; n != nProc; ++n) {
      int total = 0;
      printf("Proc %2d: ", n);
      for (m = 0; m != nProc; ++m) {
        int ki = gather[n*nProc + m];
        if (ki) {
          printf("%2d ", ki / 100);
        } else {
          printf("   ");
        }
        total += ki;
      }
      printf("Total: %d x 100 int\n", total / 100);
    }
  }

  for (n = 0; n != displ[nProc]; ++n) {
    data[n] = n;
  }

  if (rank == 0) start = MPI_Wtime();

  for (m = 0; m < 10000; ++m) {
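    /* Mimic the application's usage: exchange the per-destination counts
       first (as if the pattern could change each iteration), size the
       receive buffer from them, then perform the actual data exchange. */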

    MPI_Alltoall(nData, 1, MPI_INT, r_nData, 1, MPI_INT, MPI_COMM_WORLD);
    r_displ[0] = 0;
    for (n = 0; n != nProc; ++n) {
      r_displ[n+1] = r_displ[n] + r_nData[n];
    }
    r_data = (int *)malloc(sizeof(int)*r_displ[nProc]);

    MPI_Alltoallv(data, nData, displ, MPI_INT,
                  r_data, r_nData, r_displ, MPI_INT, MPI_COMM_WORLD);
    free(r_data);
    if (rank == 0 && !(m % 100)) {
      printf(".");
      fflush(stdout);
    }
  }
  if (rank == 0) {
    finish = MPI_Wtime();
    printf("\nTotal time = %f seconds\n",
           finish - start);
  }

  free(data);
  free(nData);
  free(r_nData);
  free(gather);
  free(displ);
  free(r_displ);

  MPI_Finalize();

  return 0;
}
