On Jan 13, 2009, at 3:32 PM, kmur...@lbl.gov wrote:
With IB, there's also the issue of registered memory. Open MPI
v1.2.x defaults to copy in/copy out semantics (with pre-registered
memory) until the message reaches a certain size, and then it uses
a pipelined register/RDMA protocol. However, even with copy in/out
semantics of small messages, the resulting broadcast should still
be much faster than over gige.
Are you using the same buffer for the warmup bcast as the actual
bcast? You might try using "--mca mpi_leave_pinned 1" to see if
that helps as well (will likely only help with large messages).
I'm using different buffers for warmup and actual bcast. I tried the
mpi_leave_pinned 1, but did not see any difference in behaviour.
In this case, you likely won't see much of a difference --
mpi_leave_pinned will generally only be a boost for long messages that
use the same buffers repeatedly.
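
To make the point about buffer reuse concrete, here is a toy model in Python. It is not Open MPI code and makes no MPI calls. It only counts how often a buffer would need to be registered, assuming every message is large enough to take the RDMA path. The function name `registrations` and the buffer labels are hypothetical, chosen only for illustration.

```python
def registrations(buffer_ids, leave_pinned):
    """Count registrations needed to send from each buffer in order.

    Toy model: without leave_pinned, every send registers its buffer
    again; with leave_pinned, a buffer stays registered after first use.
    """
    pinned = set()
    count = 0
    for buf in buffer_ids:
        if buf in pinned:
            continue  # still registered from an earlier send
        count += 1
        if leave_pinned:
            pinned.add(buf)
    return count


# Same buffer reused three times: leaving it pinned saves work.
same_buffer = ["A", "A", "A"]
# Separate warmup and timed buffers, as in the test described above.
different_buffers = ["warmup", "actual"]

same_off = registrations(same_buffer, leave_pinned=False)
same_on = registrations(same_buffer, leave_pinned=True)
diff_off = registrations(different_buffers, leave_pinned=False)
diff_on = registrations(different_buffers, leave_pinned=True)

print("same buffer:      ", same_off, "->", same_on)
print("different buffers:", diff_off, "->", diff_on)
```

In this model the setting changes nothing when the warmup and timed broadcasts use different buffers, which matches seeing no difference in that test.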
Maybe whenever Open MPI defaults to copy in/copy out semantics on my
cluster, it performs very slowly (slower than GigE), but not when it
uses RDMA.
That would be quite surprising. I still think there's some kind of
startup overhead going on here.
Surprisingly, just doing two consecutive 80K byte MPI_BCASTs
performs very quickly (forget about warmup and actual broadcast),
whereas a single 80K broadcast is slow. Not sure if I'm missing
anything!
There's also the startup time and synchronization issues. Remember
that although MPI_BCAST does not provide any synchronization
guarantees, it could well be that the 1st bcast effectively
synchronizes the processes and the 2nd one therefore runs much
faster (because individual processes won't need to spend much time
blocking waiting for messages because they're effectively operating
in lock step after the first bcast).
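
The lock-step effect can be sketched with a toy timing model in Python. It is not an MPI program, and the arrival times and transfer cost are made-up numbers in microseconds. The assumption is that the data becomes available to every rank a fixed transfer time after the root (rank 0) enters the broadcast, and that a rank arriving early simply blocks until then. The function name `bcast_times` is hypothetical.

```python
def bcast_times(arrivals, transfer):
    """Return (elapsed, exits) per rank for one broadcast.

    arrivals[i] is when rank i enters the call; rank 0 is the root.
    Toy model: data is ready everywhere at arrivals[0] + transfer.
    """
    ready = arrivals[0] + transfer
    exits = [max(a, ready) for a in arrivals]
    elapsed = [e - a for e, a in zip(exits, arrivals)]
    return elapsed, exits


TRANSFER_US = 100  # assumed cost of moving the data once the root starts

# The root shows up late; the other ranks are already waiting.
skewed_arrivals = [5000, 0, 1000, 2000]

first, exits = bcast_times(skewed_arrivals, TRANSFER_US)
# Every rank enters the second broadcast right after leaving the first.
second, _ = bcast_times(exits, TRANSFER_US)

print("first bcast, per-rank time (us): ", first)
print("second bcast, per-rank time (us):", second)
```

In this model most of the time measured in the first broadcast is waiting for the late root, not data movement. The second broadcast starts with all ranks aligned, so it shows only the transfer cost.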
Benchmarking is a very tricky business; it can be extremely
difficult to precisely measure exactly what you want to measure.
My main effort here is not to benchmark my cluster but to resolve a
user problem, wherein he complained that his bcasts are running
very slowly. I tried to recreate the situation with a simple Fortran
program which just performs a bcast of a size similar to the one in
his code. It also performed very slowly (slower than GigE). Then I
started increasing and decreasing the size of the bcast and observed
that it is slow only in the range 8K bytes to 100K bytes.
Can you send your modified test program (with a warmup send)?
What happens if you run a benchmark like the broadcast section of IMB
on TCP and IB?
--
Jeff Squyres
Cisco Systems