On Jan 12, 2009, at 2:50 PM, kmur...@lbl.gov wrote:
Is there any requirement on the size of the data buffers I should use in these warmup broadcasts? If I use small buffers, like 1000 real values, during warmup, the following actual and timed MPI_BCAST over IB takes a lot of time (more than it does over GigE). If I use a bigger buffer of 10000 real values during warmup, the following timed MPI_BCAST is quick.
I can't quite grok that -- "actual and timed MPI_BCAST"; are you talking about 2 different bcasts?
With IB, there's also the issue of registered memory. Open MPI v1.2.x defaults to copy in/copy out semantics (with pre-registered memory) until the message reaches a certain size, and then it uses a pipelined register/RDMA protocol. However, even with copy in/out semantics for small messages, the resulting broadcast should still be much faster than over GigE.
Are you using the same buffer for the warmup bcast as the actual bcast? You might try using "--mca mpi_leave_pinned 1" to see if that helps as well (will likely only help with large messages).
Surprisingly, just doing two consecutive 80K byte MPI_BCASTs runs very quickly (forget about warmup and actual broadcast), whereas a single 80K broadcast is slow. Not sure if I'm missing anything!
There's also the startup time and synchronization issues. Remember that although MPI_BCAST does not provide any synchronization guarantees, it could well be that the 1st bcast effectively synchronizes the processes and the 2nd one therefore runs much faster (because individual processes won't need to spend much time blocking waiting for messages because they're effectively operating in lock step after the first bcast).
Benchmarking is a very tricky business; it can be extremely difficult to precisely measure exactly what you want to measure.
-- Jeff Squyres Cisco Systems