Dear OpenMPI list,

I noticed a performance problem when increasing the number of CPUs used to solve my problem. I traced the problem to the MPI_Alltoallv calls. It turns out the default basic linear algorithm is very sensitive to the number of CPUs, but the pairwise routine behaves appropriately in my case. I have performed tests on 16 and 24 processes, using three 8-core nodes (dual Intel quad-core, 2.5 GHz) connected with GbE. The test sends about 12 kB (3000 4-byte ints) from each process to every other process. I know MPI_Alltoallv is not the best choice when all data sizes are equal, but this way the test reproduces the situation in my original code.

I have set "coll_tuned_use_dynamic_rules=1" in $HOME/.openmpi/mca-params.conf
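(In other words, $HOME/.openmpi/mca-params.conf contains the line

coll_tuned_use_dynamic_rules = 1

which, as far as I understand, is needed for the coll_tuned_alltoallv_algorithm selection below to take effect at all.)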

For default runs I used:
time mpirun -np 16 -machinefile hostfile ./testalltoallv
For the basic linear algorithm I used:
time mpirun -np 16 -machinefile hostfile -mca coll_tuned_alltoallv_algorithm 1 ./testalltoallv
For the pairwise algorithm I used:
time mpirun -np 16 -machinefile hostfile -mca coll_tuned_alltoallv_algorithm 2 ./testalltoallv
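(The available algorithm numbers for alltoallv can, as far as I can tell, be listed with ompi_info, e.g.

ompi_info --param coll tuned | grep alltoallv

but I simply tried 1 and 2 here.)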

For 24 processes I replaced -np 16 with -np 24. The results (runtime in seconds):

                     -np 16           -np 24
default               2.1              15.6
basic linear          2.1              15.6
pairwise              2.1               2.8

*******************************************
A speed difference of almost a factor of 6!
*******************************************

The test code:

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  const int data_size=3000;
  int repeat=100;
  int rank,size;
  int i,j;
  int *sendbuf, *sendcount, *senddispl;
  int *recvbuf, *recvcount, *recvdispl;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);

  sendbuf=malloc(size * data_size * sizeof *sendbuf);
  recvbuf=malloc(size * data_size * sizeof *recvbuf);
  sendcount=malloc(size * sizeof *sendcount);
  senddispl=malloc(size * sizeof *senddispl);
  recvcount=malloc(size * sizeof *recvcount);
  recvdispl=malloc(size * sizeof *recvdispl);


  /* Set up maximum receive lengths
     (*sizeof(int) because MPI_BYTE is used later on) */
  for (i=0; i<size; i++)
    {
      recvcount[i]=data_size*sizeof(int);
      recvdispl[i]=i*data_size*sizeof(int);
    }

  /* Set up the number of bytes to send and the send displacements */

  for (i=0; i<size; i++)
      sendcount[i]=data_size*sizeof(int);
  for (i=0; i<size; i++)
      senddispl[i]=i*data_size*sizeof(int);

  /* Do a repetitive test. */
  for (j=0; j<repeat; j++)
    MPI_Alltoallv(sendbuf,sendcount,senddispl,MPI_BYTE,
                  recvbuf,recvcount,recvdispl,MPI_BYTE,
                  MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
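
For reference, below is a variant of the same test that times only the Alltoallv loop with MPI_Barrier/MPI_Wtime instead of timing the whole mpirun invocation. This is just a sketch of how I would instrument it; the numbers above were obtained with the plain program and "time mpirun".

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  const int data_size=3000;
  const int repeat=100;
  int rank,size,i,j;
  int *sendbuf, *sendcount, *senddispl;
  int *recvbuf, *recvcount, *recvdispl;
  double t0,t1;

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);

  sendbuf=malloc(size * data_size * sizeof *sendbuf);
  recvbuf=malloc(size * data_size * sizeof *recvbuf);
  sendcount=malloc(size * sizeof *sendcount);
  senddispl=malloc(size * sizeof *senddispl);
  recvcount=malloc(size * sizeof *recvcount);
  recvdispl=malloc(size * sizeof *recvdispl);

  /* Counts and displacements are in bytes, since MPI_BYTE is used. */
  for (i=0; i<size; i++)
    {
      sendcount[i]=recvcount[i]=data_size*sizeof(int);
      senddispl[i]=recvdispl[i]=i*data_size*sizeof(int);
    }

  MPI_Barrier(MPI_COMM_WORLD);   /* start all ranks at the same time */
  t0=MPI_Wtime();
  for (j=0; j<repeat; j++)
    MPI_Alltoallv(sendbuf,sendcount,senddispl,MPI_BYTE,
                  recvbuf,recvcount,recvdispl,MPI_BYTE,
                  MPI_COMM_WORLD);
  t1=MPI_Wtime();

  if (rank==0)
    printf("%d Alltoallv calls: %f s\n",repeat,t1-t0);

  MPI_Finalize();
  return 0;
}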

The hostfile:
arthur
arthur
arthur
arthur
arthur
arthur
arthur
arthur
trillian
trillian
trillian
trillian
trillian
trillian
trillian
trillian
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod
zaphod

I am using Open MPI 1.3.2.

For me the problem is essentially solved, since I can now switch algorithms and get reasonable speed for my problem. Still, I was somewhat surprised by the very large difference in speed, so I wanted to report it here in case other users find themselves in a similar situation.
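
For anyone who wants to do the same: besides the -mca flags on the mpirun command line (which is what I verified), the same parameters can, as far as I know, also be set through the environment or in $HOME/.openmpi/mca-params.conf, e.g.

export OMPI_MCA_coll_tuned_use_dynamic_rules=1
export OMPI_MCA_coll_tuned_alltoallv_algorithm=2

so the pairwise algorithm is picked up without changing every mpirun invocation.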

--
Daniel Spångberg
Materialkemi
Uppsala Universitet
