Your processes are probably running asynchronously. You could perhaps
try tracing program execution and looking at the timeline. E.g.,
http://www.open-mpi.org/faq/?category=perftools#free-tools . Or, where
you have MPI_Wtime calls, just capture those timestamps on each process
and dump the results at the end of your run. Or, report timings for all
ranks instead of just for rank 0.
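For instance, the per-rank report could be produced with an MPI_Gather of the phase timings. This is only a sketch: it assumes the bcasttime/comptime/reducetime variables from the program below (the local MPI_Wtime() differences), needs <stdlib.h> for malloc/free, and would go near the end of the timing loop:

```c
/* Sketch: gather each rank's phase timings to rank 0 and print them
 * per rank, instead of reporting rank 0's numbers only. */
double mytimes[3] = { bcasttime, comptime, reducetime };
double *alltimes = NULL;
if (myid == 0)
    alltimes = malloc(3 * numprocs * sizeof(double));
MPI_Gather(mytimes, 3, MPI_DOUBLE, alltimes, 3, MPI_DOUBLE,
           0, MPI_COMM_WORLD);
if (myid == 0) {
    int r;
    for (r = 0; r < numprocs; r++)
        printf("rank %d: Bcast %.3f ms  compute %.3f ms  Reduce %.3f ms\n",
               r, alltimes[3*r]*1000, alltimes[3*r+1]*1000,
               alltimes[3*r+2]*1000);
    free(alltimes);
}
```

Seeing all ranks side by side should make it obvious which rank is actually spending the time in compute versus in the collective.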
Put another way: rank 0 must broadcast n, so no one else starts
computing until it receives the Bcast result. Rank 0 probably begins
its computation before anyone else does, reaches the Reduce first, and
then cannot exit it until the other ranks have finished their
computations. So, the Reduce time measured on rank 0 includes some
amount of the other ranks' compute time; and since compute time grows
with n, so does the apparent Reduce time.
Yet another approach is to insert MPI_Barrier calls at each phase of the
program so that the various phases are synchronized. This adds some
overhead to the program, but helps simplify interpretation of the timing
results.
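Applied to the program below, that might look like the following around the Reduce (a sketch only; t0 is an extra local double, everything else is from the original code):

```c
/* Sketch: line all ranks up before timing the collective, so the
 * timed interval no longer absorbs slower ranks' remaining compute. */
MPI_Barrier(MPI_COMM_WORLD);
t0 = MPI_Wtime();
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
reducetime = MPI_Wtime() - t0;   /* now mostly communication cost */
```

The barrier itself adds a round of synchronization, so total runtime goes up slightly, but each phase's number becomes attributable to that phase alone.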
Qing Pang wrote:
I'm running the popular calculate-pi program on a two-node setup
running Ubuntu 8.10 and Open MPI 1.3.3 (with default settings).
Password-less ssh is set up, but no cluster-management software such
as a network file system, network time protocol, resource manager,
scheduler, etc. The two nodes are connected through TCP/IP only.
When I benchmarked the program, it showed that the time spent in
MPI_Reduce() is proportional to the number of intervals (n) used in
the calculation. For example, when n = 1,000,000, MPI_Reduce costs
15.65 milliseconds; when n = 1,000,000,000, MPI_Reduce costs 15526
milliseconds.
This confused me: in this calculate-pi program, MPI_Reduce is used
only once. No matter how many intervals are used, MPI_Reduce is
called just once, after both nodes have their partial results, to
merge them. So the time spent in MPI_Reduce (although it might be
slow over the TCP/IP connection) should be roughly constant, but
that's obviously not what I saw.
Has anyone seen a similar problem before? I'm not sure how
MPI_Reduce() works internally. Does it matter that I don't have a
network file system, network time protocol, resource manager,
scheduler, etc. installed?
Below is the program - I did feed "n" to it more than once to warm it up.
#include "mpi.h"
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[])
{
    int numprocs, myid, rc;
    double ACCUPI = 3.1415926535897932384626433832795;
    double mypi, pi, h, sum, x;
    int n, i;
    double starttime;
    double time, told, bcasttime, reducetime, comptime, totaltime;

    rc = MPI_Init(&argc, &argv);
    if (rc != MPI_SUCCESS) {
        printf("Error starting MPI program. Terminating.\n");
        MPI_Abort(MPI_COMM_WORLD, rc);
    }
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    while (1) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits)\n");
            scanf("%d", &n);
            starttime = MPI_Wtime();
        }

        /* Distribute n; time the broadcast. */
        time = MPI_Wtime();
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        told = time;
        time = MPI_Wtime();
        bcasttime = time - told;

        if (n == 0)
            break;

        /* Each rank sums its strided share of the midpoint rule. */
        h = 1.0 / (double)n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = sum * h;
        told = time;
        time = MPI_Wtime();
        comptime = time - told;

        /* Combine partial results on rank 0; time the reduction. */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        told = time;
        time = MPI_Wtime();
        reducetime = time - told;

        if (myid == 0) {
            totaltime = MPI_Wtime() - starttime;
            printf("\nElapsed time (total): %f milliseconds\n",
                   totaltime*1000);
            printf("Elapsed time (Bcast): %f milliseconds (%5.2f%%)\n",
                   bcasttime*1000, bcasttime*100/totaltime);
            printf("Elapsed time (Reduce): %f milliseconds (%5.2f%%)\n",
                   reducetime*1000, reducetime*100/totaltime);
            printf("Elapsed time (Comput): %f milliseconds (%5.2f%%)\n",
                   comptime*1000, comptime*100/totaltime);
            printf("\nApproximated pi is %.16f, Error is %.4e\n",
                   pi, fabs(pi - ACCUPI));
        }
    }
    MPI_Finalize();
    return 0;
}