Hi Jeff,
Please read below:
On Jan 12, 2009, at 2:50 PM, kmur...@lbl.gov wrote:
Is there any requirement on the size of the data buffers
I should use in these warmup broadcasts? If I use small
buffers, like 1000 real values, during warmup, the subsequent
actual (timed) MPI_BCAST over IB takes a long time
(longer than over GigE). If I use a bigger warmup buffer of
10000 real values, the subsequent timed MPI_BCAST is quick.
I can't quite grok that -- "actual and timed MPI_BCAST"; are you talking
about 2 different bcasts?
No, I meant the same bcast when I said "actual and timed."
It is the main bcast in the program, which I have timed.
Before this bcast, as you suggested, I did one warmup bcast,
and in each attempt I varied the size of the warmup bcast
from 1000 to 10000 real values.
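For concreteness, this is roughly the pattern (a sketch, not my exact program; the buffer sizes and program name are just illustrative):

    program warmup_bcast
      implicit none
      include 'mpif.h'
      integer, parameter :: nwarm = 10000   ! warmup size (I varied 1000..10000)
      integer, parameter :: nmain = 20000   ! illustrative size for the timed bcast
      real :: warm(nwarm), buf(nmain)
      double precision :: t0, t1
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      warm = 1.0
      buf  = 2.0

      ! one warmup bcast (not timed)
      call MPI_BCAST(warm, nwarm, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

      ! the main bcast, which is the one I time
      t0 = MPI_WTIME()
      call MPI_BCAST(buf, nmain, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      t1 = MPI_WTIME()
      if (rank == 0) print *, 'timed bcast (s):', t1 - t0

      call MPI_FINALIZE(ierr)
    end program warmup_bcast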
With IB, there's also the issue of registered memory. Open MPI v1.2.x
defaults to copy-in/copy-out semantics (with pre-registered memory) until the
message reaches a certain size, and then it uses a pipelined register/RDMA
protocol. However, even with copy-in/copy-out semantics for small messages,
the resulting broadcast should still be much faster than over GigE.
Are you using the same buffer for the warmup bcast as for the actual bcast?
You might also try "--mca mpi_leave_pinned 1" to see if that helps (it will
likely only help with large messages).
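For example, something like this (the process count and executable name are placeholders):

    mpirun --mca mpi_leave_pinned 1 -np 16 ./your_bcast_test

You can also list the openib BTL's tunable parameters (including the size thresholds at which the protocols switch) with:

    ompi_info --param btl openib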
I'm using different buffers for the warmup and the actual bcast. I tried
mpi_leave_pinned 1, but did not see any difference in behaviour.
Maybe whenever Open MPI falls back to copy-in/copy-out semantics on my
cluster it performs very slowly (slower than GigE), but not when it uses RDMA.
Any tips on how to debug this?
Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs
is very quick (forget about the warmup and actual broadcasts),
whereas a single 80K broadcast is slow. Not sure if I'm missing
anything!
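A sketch of what I mean (20000 real(4) values = 80K bytes; this approximates my test, it is not the exact code):

    program two_bcasts
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 20000   ! 20000 real(4) values = 80K bytes
      real :: buf(n)
      double precision :: t0, t1, t2
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      ! two back-to-back 80K-byte bcasts, timed individually
      t0 = MPI_WTIME()
      call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      t1 = MPI_WTIME()
      call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      t2 = MPI_WTIME()

      if (rank == 0) print *, 'first (s):', t1 - t0, '  second (s):', t2 - t1
      call MPI_FINALIZE(ierr)
    end program two_bcasts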
There are also startup-time and synchronization issues. Remember that
although MPI_BCAST does not provide any synchronization guarantees, it could
well be that the 1st bcast effectively synchronizes the processes and the 2nd
one therefore runs much faster (individual processes won't need to spend
much time blocking while waiting for messages, because they're effectively
operating in lock step after the first bcast).
Benchmarking is a very tricky business; it can be extremely difficult to
measure exactly what you want to measure.
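One common way to reduce these effects (a sketch, not a prescription) is to barrier before the timed operation and report the maximum elapsed time across all ranks rather than just the root's:

    program timed_bcast
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 20000
      real :: buf(n)
      double precision :: t0, tlocal, tmax
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      ! barrier so all ranks enter the timed bcast together
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t0 = MPI_WTIME()
      call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      tlocal = MPI_WTIME() - t0

      ! report the slowest rank's time, not just the root's
      call MPI_REDUCE(tlocal, tmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                      0, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'max bcast time (s):', tmax

      call MPI_FINALIZE(ierr)
    end program timed_bcast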
My main effort here is not to benchmark my cluster but to resolve a
user problem: he complained that his bcasts run very slowly. I
tried to recreate the situation with a simple Fortran program
which just performs a bcast of a size similar to the one in his code.
It also performed very slowly (slower than over GigE), so I started
increasing and decreasing the bcast size and observed that it is slow
only in the range of 8K to 100K bytes.
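Roughly how I swept the sizes (a sketch; the step, bounds, and the added barrier are illustrative, not the exact program I ran):

    program bcast_sweep
      implicit none
      include 'mpif.h'
      integer, parameter :: nmax = 30000   ! up to 120K bytes of real(4)
      real :: buf(nmax)
      double precision :: t0, t1
      integer :: rank, ierr, n

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      ! 2000 reals = 8K bytes, 25000 reals = 100K bytes
      do n = 1000, nmax, 1000
         call MPI_BARRIER(MPI_COMM_WORLD, ierr)
         t0 = MPI_WTIME()
         call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
         t1 = MPI_WTIME()
         if (rank == 0) print *, n*4, 'bytes:', t1 - t0, 's'
      end do

      call MPI_FINALIZE(ierr)
    end program bcast_sweep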
thanks,
Krishna.