Hello,
While profiling an application, I noticed that the MPI_Init() function
takes a considerable amount of time.
There is a big difference between running 32 processes on 32 machines and
32 processes on 8 machines (each machine has 8 cores).
These are the profiling results (the usec column is in microseconds):
Results for 32 processes on 8 machines (4 processes per machine):
        Group.1  percent         usec
 38        SSOR  79.1125  2557445.625
  7  EXCHANGE_1  31.8125       33.250
 24  MPI_Recv()  26.0750       33.375
  2        BLTS  24.7500      103.125
  3        BUTS  22.2375       92.500
 12   INIT_COMM  19.8500  1311003.375
*22  MPI_Init()  19.8500  1310925.750*
 33         RHS  18.4000     4690.500
  8  EXCHANGE_3   9.2750     1179.000
 26  MPI_Wait()   7.2250      565.125
 13       JACLD   6.4875       27.000
 25  MPI_Send()   6.3500        8.000
 14        JACU   6.2500       26.000
 37       SETIV   0.6625    20908.500
  6       EXACT   0.2188        0.000
  4        ERHS   0.2000    11499.000
Results for 32 processes on 32 machines (1 process per machine):
        Group.1   percent          usec
 38        SSOR  97.28889  2573471.0000
  7  EXCHANGE_1  39.25556       33.3333
  2        BLTS  29.11111       98.7778
  3        BUTS  27.96667       95.0000
 24  MPI_Recv()  27.48889       28.7778
 33         RHS  23.98889     5018.6667
 25  MPI_Send()  13.51111       14.0000
  8  EXCHANGE_3  13.06667     1361.1111
 26  MPI_Wait()   9.37778      599.0000
 13       JACLD   7.72222       26.0000
 14        JACU   7.37778       25.0000
 12   INIT_COMM   1.46667    76713.6667
*22  MPI_Init()   1.45556    76253.4444*
 37       SETIV   0.80000    20914.0000
  6       EXACT   0.25000        0.0000
  4        ERHS   0.21111    10458.3333
In the first case (4 processes per machine), MPI_Init() was about 17 times
slower than in the second case (1 process per machine): roughly 1310926
usec versus 76253 usec. Is this behaviour normal?
The commands I used to run the application were:
First case:
mpirun --machinefile machine_file -npernode 4 --mca btl self,sm,tcp lu.A.32
Second case:
mpirun --machinefile machine_file --mca btl self,sm,tcp lu.A.32
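For what it is worth, the placement can be double-checked by launching a
non-MPI program with the same options and counting the ranks per host,
for example:

mpirun --machinefile machine_file -npernode 4 --mca btl self,sm,tcp hostname | sort | uniq -c

In the first case this should show a count of 4 for each of the 8 hosts,
and in the second a count of 1 for each of the 32 hosts.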
The version of MPI I used:
mpirun -V
mpirun (Open MPI) 1.4.5
The system I used is the following:
Linux kameleon-debian 3.2.0-4-amd64 #1 SMP Debian 3.2.65-1+deb7u2 x86_64
GNU/Linux
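In case it helps to reproduce this without the full benchmark, I think
MPI_Init() could be timed in isolation with a small standalone program
along these lines (an untested sketch; the file name init_time.c is just
my choice):

#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

/* Wall-clock time in seconds. MPI_Wtime() is not guaranteed to be
 * usable before MPI_Init(), so use gettimeofday() instead. */
static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
    double t0 = now();            /* before MPI_Init() */
    MPI_Init(&argc, &argv);
    double t1 = now();            /* after MPI_Init() */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Init() took %.3f s\n", rank, t1 - t0);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc and launched with the same two mpirun command lines
as above (with ./init_time in place of lu.A.32), it should show whether
the start-up gap depends only on the number of processes per node.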
I would appreciate any feedback. Thank you.