Dear OpenMPI list,
I noticed a performance problem when increasing the number of CPUs used
to solve my problem. I traced it to the MPI_Alltoallv calls: it turns out
the default basic linear algorithm is very sensitive to the number of
CPUs, while the pairwise routine behaves appropriately in my case. I have
performed tests on 16 and 24 processes, on three 8-core nodes (dual Intel
quad-core, 2.5 GHz) connected with gigabit ethernet. The test sends about
12 kB from each process to every other process. I know MPI_Alltoallv is
not the best choice when all the message sizes are the same (MPI_Alltoall
would do), but this way the test reproduces the situation in my original
code.
I have set "coll_tuned_use_dynamic_rules=1" in
$HOME/.openmpi/mca-params.conf
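For completeness, MCA parameters go into that file in plain name = value
form, so the algorithm selection below could also be made permanent there
instead of on the mpirun command line (just an illustration, not what I
actually did):
coll_tuned_use_dynamic_rules = 1
coll_tuned_alltoallv_algorithm = 2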
For default runs I used:
time mpirun -np 16 -machinefile hostfile ./testalltoallv
For the basic linear algorithm I used:
time mpirun -np 16 -machinefile hostfile -mca coll_tuned_alltoallv_algorithm 1 ./testalltoallv
For the pairwise algorithm I used:
time mpirun -np 16 -machinefile hostfile -mca coll_tuned_alltoallv_algorithm 2 ./testalltoallv
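In case it helps anyone reproduce this: the same selection can also be
made through the environment, since Open MPI picks up OMPI_MCA_-prefixed
variables, and ompi_info lists the available tuned-collective parameters.
Roughly (exact output depends on the build):
export OMPI_MCA_coll_tuned_alltoallv_algorithm=2
time mpirun -np 16 -machinefile hostfile ./testalltoallv
ompi_info --param coll tuned | grep alltoallv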
For 24 processes I replaced -np 16 with -np 24. The results (runtime in
seconds):

                -np 16   -np 24
default            2.1     15.6
basic linear       2.1     15.6
pairwise           2.1      2.8

*******************************************
A speed difference of almost a factor of 6!
*******************************************
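A rough back-of-the-envelope for the 24-process case (my own estimate,
counting only off-node traffic): each rank sends 3000*4 = 12000 bytes to
each of the 16 ranks on the other two nodes, so every node pushes
8*16*12000, about 1.5 MB, onto the wire per call, or roughly 150 MB
(about 1.2 Gbit) over the 100 repeats. On gigabit ethernet that is on the
order of 1.2 s of pure wire time per node, so the pairwise result is
within a small factor of the link limit, while the basic linear result is
far from it.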
The test code:
#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int data_size=3000;
    int repeat=100;
    int rank,size;
    int i,j;
    int *sendbuf, *sendcount, *senddispl;
    int *recvbuf, *recvcount, *recvdispl;

    MPI_Init(&argc,&argv);
    MPI_Comm_rank(MPI_COMM_WORLD,&rank);
    MPI_Comm_size(MPI_COMM_WORLD,&size);

    /* One data_size block of ints per peer; the contents are never
       initialised since only the transfer time matters here. */
    sendbuf=malloc(size * data_size * sizeof *sendbuf);
    recvbuf=malloc(size * data_size * sizeof *recvbuf);
    sendcount=malloc(size * sizeof *sendcount);
    senddispl=malloc(size * sizeof *senddispl);
    recvcount=malloc(size * sizeof *recvcount);
    recvdispl=malloc(size * sizeof *recvdispl);

    /* Set up maximum receive lengths
       (*sizeof(int) because MPI_BYTE is used later on) */
    for (i=0; i<size; i++)
    {
        recvcount[i]=data_size*sizeof(int);
        recvdispl[i]=i*data_size*sizeof(int);
    }

    /* Set up number of data items to send */
    for (i=0; i<size; i++)
        sendcount[i]=data_size*sizeof(int);
    for (i=0; i<size; i++)
        senddispl[i]=i*data_size*sizeof(int);

    /* Do a repetitive test. */
    for (j=0; j<repeat; j++)
        MPI_Alltoallv(sendbuf,sendcount,senddispl,MPI_BYTE,
                      recvbuf,recvcount,recvdispl,MPI_BYTE,
                      MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
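I built and ran it with the usual wrappers, roughly like this (the source
file name is of course arbitrary):
mpicc -O2 -o testalltoallv testalltoallv.c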
The hostfile:
arthur
arthur
arthur
arthur
arthur
arthur
arthur
arthur
trillian
trillian
trillian
trillian
trillian
trillian
trillian
trillian
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
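The same layout can also be written with the slots keyword, which should
be equivalent under the default by-slot mapping (I used the explicit form
above):
arthur slots=8
trillian slots=8
zaphod slots=8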
I am using Open MPI 1.3.2.
For me the problem is essentially solved, since I can now change the
algorithm and get reasonable speed for my problem, but I was somewhat
surprised by the very large difference in speed, so I wanted to report it
here in case other users find themselves in a similar situation.
--
Daniel Spångberg
Materialkemi
Uppsala Universitet