On Jan 13, 2009, at 3:32 PM, kmur...@lbl.gov wrote:

With IB, there's also the issue of registered memory. Open MPI v1.2.x defaults to copy in/copy out semantics (with pre-registered memory) until the message reaches a certain size, and then it uses a pipelined register/RDMA protocol. However, even with copy in/out semantics for small messages, the resulting broadcast should still be much faster than over GigE. Are you using the same buffer for the warmup bcast as for the actual bcast? You might try using "--mca mpi_leave_pinned 1" to see if that helps as well (it will likely only help with large messages).

I'm using different buffers for the warmup and the actual bcast. I tried "--mca mpi_leave_pinned 1", but did not see any difference in behaviour.

In this case, you likely won't see much of a difference -- mpi_leave_pinned will generally only be a boost for long messages that use the same buffers repeatedly.

Maybe whenever Open MPI defaults to copy in/copy out semantics on my cluster, it performs very slowly (slower than GigE), but not when it uses RDMA.

That would be quite surprising. I still think there's some kind of startup overhead going on here.

Surprisingly, just doing two consecutive 80K byte MPI_BCASTs performs very quickly (forget about warmup and actual broadcast), whereas a single 80K broadcast is slow. Not sure if I'm missing anything!

There's also the startup time and synchronization issues. Remember that although MPI_BCAST does not provide any synchronization guarantees, it could well be that the 1st bcast effectively synchronizes the processes and the 2nd one therefore runs much faster (because individual processes won't need to spend much time blocking waiting for messages because they're effectively operating in lock step after the first bcast). Benchmarking is a very tricky business; it can be extremely difficult to precisely measure exactly what you want to measure.
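
To make the lock-step effect concrete, here is a toy model in Python. It is only a sketch under loud assumptions: it is not Open MPI code and not a measurement, the function names are hypothetical, rank 0 is assumed to be the root, the broadcast is modeled as flat (every rank can leave one transfer time after both it and the root have entered), and the arrival and transfer times are made-up numbers in microseconds.

```python
def bcast_once(arrivals_us, transfer_us):
    """Toy model of one broadcast (hypothetical, not how Open MPI works).

    arrivals_us[i] is the time at which rank i enters the bcast; rank 0 is
    the root. A rank leaves transfer_us after both it and the root have
    entered. Returns (time each rank spent inside the bcast, exit times).
    """
    root_arrival = arrivals_us[0]
    exits = [max(a, root_arrival) + transfer_us for a in arrivals_us]
    measured = [e - a for e, a in zip(exits, arrivals_us)]
    return measured, exits


def two_bcasts(arrivals_us, transfer_us):
    """Run two back-to-back bcasts; each rank enters the second bcast
    at the moment it left the first one."""
    first, exits = bcast_once(arrivals_us, transfer_us)
    second, _ = bcast_once(exits, transfer_us)
    return first, second


if __name__ == "__main__":
    # Assumed numbers: the root shows up 5000 us after the earliest rank,
    # and the transfer itself takes 100 us.
    first, second = two_bcasts([5000, 0, 1000, 2000], 100)
    print("first bcast, per-rank time (us): ", first)
    print("second bcast, per-rank time (us):", second)
```

In this model the non-root ranks spend most of the first bcast waiting for the late root, so a timer wrapped around a single bcast mostly measures arrival skew. The second bcast starts with every rank already aligned, so each rank sees only the transfer time.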

My main effort here is not to benchmark my cluster but to resolve a user problem, wherein he complained that his bcasts are running very slowly. I tried to recreate the situation with a simple Fortran program which just performs a bcast of a size similar to the one in his code. It also performed very slowly (slower than GigE). Then I started increasing and decreasing the size of the bcast and observed that it performs slowly only in the range 8K bytes to 100K bytes.


Can you send your modified test program (with a warmup send)?

What happens if you run a benchmark like the broadcast section of IMB on TCP and IB?

--
Jeff Squyres
Cisco Systems
