I've noticed a very significant slowdown (roughly 100%, i.e. about double the run time) in MPI_Alltoallv calls as of version 1.6.1.
* This is most noticeable for high-frequency exchanges over 1Gb ethernet,
where process-to-process message sizes are fairly small (e.g. ~100 kbytes)
and much of the exchange matrix is sparse.
* The 1.6.1 release notes mention "Switch the MPI_ALLTOALLV default
algorithm to a pairwise exchange", but I'm not clear what this means in
practice or how to switch back to the previous, now non-default,
algorithm (my guess at how to do that is below).
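For what it's worth, my guess from the coll_tuned MCA parameters listed by
"ompi_info --all" is that something like the following would force the
pre-1.6.1 behaviour back (untested, and the algorithm numbering is my
assumption, so please correct me if I've misread it):

  mpirun --mca coll_tuned_use_dynamic_rules 1 \
         --mca coll_tuned_alltoallv_algorithm 1 \
         -np 32 ./a.out

where, as far as I can tell, alltoallv algorithm 1 is the old basic linear
implementation and 2 is the new pairwise one. Is that the supported way to
select it?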
I attach a test program which illustrates the sort of usage in our MPI
application. I have run it as 32 processes on four nodes over 1Gb
ethernet, each node with 2x Opteron 4180 (dual hex-core); ranks 0, 4, 8, ...
on node 1, ranks 1, 5, 9, ... on node 2, and so on.
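For reference, I launched the runs along these lines (the program and
hostfile names here are just placeholders for wherever you save the
attachment and the node list):

  mpicc -O3 alltoallv_test.c -o alltoallv_test
  mpirun -np 32 --bynode --hostfile hosts ./alltoallv_test

--bynode is what gives the cyclic rank placement described above.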
The program constructs an array of integers and an nProcess x nProcess
exchange matrix typical of part of our application. The exchange is then
repeated ten thousand times. Output from runs built with "mpicc -O3" is
shown below.
My guess is that 1.6.1 is hitting additional latency not present in
1.6.0, perhaps because the new pairwise algorithm serializes the exchange
into nProc-1 rounds even when most of the matrix entries are empty. I also
attach a plot showing network throughput on our actual mesh-generation
application. Nodes cfsc01-04 are running 1.6.0 and finish within 35
minutes. Nodes cfsc05-08 are running 1.6.2 (started 10 minutes later) and
take over an hour to run. There seems to be much greater network demand
with the newer version, despite the user code and input data being
identical.
Thanks for any help you can give,
Simon
For 1.6.0:
Open MPI 1.6.0
Proc 0: 50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 198 x 100 int
Proc 1: 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 148 x 100 int
Proc 2: 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 109 x 100 int
Proc 3: 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 80 x 100 int
Proc 4: 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 58 x 100 int
Proc 5: 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 41 x 100 int
Proc 6: 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 29 x 100 int
Proc 7: 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 20 x 100 int
Proc 8: 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 14 x 100 int
Proc 9: 3 2 1 0 0 0 0 0 0 0 0 0 Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0 Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0 Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1 Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2 Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3 Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4 Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 Total: 109 x 100 int
Proc 30: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 Total: 148 x 100 int
Proc 31: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 50 Total: 198 x 100 int
....................................................................................................
Total time = 15.443502 seconds
For 1.6.1:
Open MPI 1.6.1
Proc 0: 50 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 198 x 100 int
Proc 1: 38 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 148 x 100 int
Proc 2: 29 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 109 x 100 int
Proc 3: 22 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 80 x 100 int
Proc 4: 16 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 58 x 100 int
Proc 5: 12 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 41 x 100 int
Proc 6: 8 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 29 x 100 int
Proc 7: 6 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 20 x 100 int
Proc 8: 4 3 2 1 0 0 0 0 0 0 0 0 0 Total: 14 x 100 int
Proc 9: 3 2 1 0 0 0 0 0 0 0 0 0 Total: 9 x 100 int
Proc 10: 2 1 0 0 0 0 0 0 0 0 0 Total: 6 x 100 int
Proc 11: 1 0 0 0 0 0 0 0 0 0 0 Total: 4 x 100 int
Proc 12: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 13: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 14: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 15: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 16: 0 0 0 0 0 0 0 0 0 0 0 Total: 0 x 100 int
Proc 17: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 18: 0 0 0 0 0 0 0 0 0 0 0 Total: 1 x 100 int
Proc 19: 0 0 0 0 0 0 0 0 0 0 0 Total: 2 x 100 int
Proc 20: 0 0 0 0 0 0 0 0 0 0 1 Total: 4 x 100 int
Proc 21: 0 0 0 0 0 0 0 0 0 1 2 Total: 6 x 100 int
Proc 22: 0 0 0 0 0 0 0 0 0 1 2 3 Total: 9 x 100 int
Proc 23: 0 0 0 0 0 0 0 0 0 1 2 3 4 Total: 14 x 100 int
Proc 24: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 Total: 20 x 100 int
Proc 25: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 Total: 29 x 100 int
Proc 26: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 Total: 41 x 100 int
Proc 27: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 Total: 58 x 100 int
Proc 28: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 Total: 80 x 100 int
Proc 29: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 Total: 109 x 100 int
Proc 30: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 Total: 148 x 100 int
Proc 31: 0 0 0 0 0 0 0 0 0 1 2 3 4 6 8 12 16 22 29 38 50 Total: 198 x 100 int
....................................................................................................
Total time = 25.549821 seconds
/*
 * Test program illustrating the Open MPI MPI_Alltoallv slowdown seen when
 * moving from 1.6.0 to 1.6.1/1.6.2.
 */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char *argv[])
{
    int nProc, rank, n, m;
    int *data, *r_data;
    int *nData, *r_nData;
    int *displ, *r_displ;
    int *gather;
    double start = 0, finish = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        printf("Open MPI %d.%d.%d\n",
               OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
    }

    /* Per-destination send counts, receive counts, displacements, and a
       root-side copy of every rank's send counts for printing. */
    nData   = (int *)malloc(sizeof(int)*nProc);
    r_nData = (int *)malloc(sizeof(int)*nProc);
    gather  = (int *)malloc(sizeof(int)*nProc*nProc);
    displ   = (int *)malloc(sizeof(int)*(nProc + 1));
    r_displ = (int *)malloc(sizeof(int)*(nProc + 1));

    /* Build a sparse, strongly skewed exchange pattern: the count sent to
       rank n falls off as the 8th power of the normalised "distance"
       (nProc - 1 - rank - n), up to maxSize ints. */
    float maxSize = 5000.0f;
    displ[0] = 0;
    for (n = 0; n != nProc; ++n) {
        float x = (float)(nProc - 1 - rank - n)/(float)(nProc - 1);
        nData[n] = (int)fabs(x*x*x*x*x*x*x*x * maxSize);
        displ[n+1] = displ[n] + nData[n];
    }
    data = (int *)malloc(sizeof(int)*displ[nProc]);

    /* Rank 0 prints the exchange matrix, in units of 100 ints. */
    MPI_Gather(nData, nProc, MPI_INT, gather, nProc, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (n = 0; n != nProc; ++n) {
            int total = 0;
            printf("Proc %2d: ", n);
            for (m = 0; m != nProc; ++m) {
                int ki = gather[n*nProc + m];
                if (ki) {
                    printf("%2d ", ki / 100);
                } else {
                    printf("   ");   /* blank column for an empty exchange */
                }
                total += ki;
            }
            printf("Total: %d x 100 int\n", total / 100);
        }
    }

    for (n = 0; n != displ[nProc]; ++n) {
        data[n] = n;
    }

    /* Repeat the exchange many times: counts via MPI_Alltoall, then the
       data via MPI_Alltoallv, as in our application. */
    if (rank == 0) start = MPI_Wtime();
    for (m = 0; m < 10000; ++m) {
        MPI_Alltoall(nData, 1, MPI_INT, r_nData, 1, MPI_INT, MPI_COMM_WORLD);
        r_displ[0] = 0;
        for (n = 0; n != nProc; ++n) {
            r_displ[n+1] = r_displ[n] + r_nData[n];
        }
        r_data = (int *)malloc(sizeof(int)*r_displ[nProc]);
        MPI_Alltoallv(data, nData, displ, MPI_INT,
                      r_data, r_nData, r_displ, MPI_INT, MPI_COMM_WORLD);
        free(r_data);
        if (rank == 0 && !(m % 100)) {
            printf(".");
            fflush(stdout);
        }
    }

    if (rank == 0) {
        finish = MPI_Wtime();
        printf("\nTotal time = %f seconds\n", finish - start);
    }

    free(data);
    free(nData);
    free(r_nData);
    free(displ);
    free(r_displ);
    free(gather);

    MPI_Finalize();
    return 0;
}