Hi Jeff,
Please read below:
On Jan 12, 2009, at 2:50 PM, kmur...@lbl.gov wrote:
Is there any requirement on the size of the data buffers
I should use in these warmup broadcasts? If I use small
buffers, like 1000 real values, during warmup, the subsequent
actual (timed) MPI_BCAST over IB takes a long time
(longer than over GigE). If I use a bigger warmup buffer of
10000 real values, the subsequent timed MPI_BCAST is quick.
I can't quite grok that -- "actual and timed MPI_BCAST"; are you talking
about 2 different bcasts?
No, I meant the same bcast when I said "actual and timed."
It is the main bcast in the program, which I have timed.
Before this bcast, as you suggested, I did one warmup bcast,
and in each attempt I varied the size of the warmup bcast
from 1000 to 10000 real values.
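For concreteness, this is roughly the pattern (a sketch, not my exact program; the buffer sizes and program name are just illustrative):

    program warmup_bcast
      implicit none
      include 'mpif.h'
      integer, parameter :: nwarm = 10000   ! warmup size (I varied 1000..10000)
      integer, parameter :: nmain = 20000   ! illustrative size for the timed bcast
      real :: warm(nwarm), buf(nmain)
      double precision :: t0, t1
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      warm = 1.0
      buf  = 2.0

      ! one warmup bcast (not timed)
      call MPI_BCAST(warm, nwarm, MPI_REAL, 0, MPI_COMM_WORLD, ierr)

      ! the main bcast, which is the one I time
      t0 = MPI_WTIME()
      call MPI_BCAST(buf, nmain, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      t1 = MPI_WTIME()
      if (rank == 0) print *, 'timed bcast (s):', t1 - t0

      call MPI_FINALIZE(ierr)
    end program warmup_bcast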
With IB, there's also the issue of registered memory. Open MPI v1.2.x
defaults to copy-in/copy-out semantics (with pre-registered memory) until the
message reaches a certain size, and then it uses a pipelined register/RDMA
protocol. However, even with copy-in/copy-out semantics for small messages,
the resulting broadcast should still be much faster than over GigE.
Are you using the same buffer for the warmup bcast as for the actual bcast?
You might also try "--mca mpi_leave_pinned 1" to see if that helps (it will
likely only help with large messages).
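For example, something like this (the process count and executable name are placeholders):

    mpirun --mca mpi_leave_pinned 1 -np 16 ./your_bcast_test

You can also list the openib BTL's tunable parameters (including the size thresholds at which the protocols switch) with:

    ompi_info --param btl openib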
I'm using different buffers for the warmup and the actual bcast. I tried
mpi_leave_pinned 1, but did not see any difference in behaviour.
Maybe whenever Open MPI falls back to copy-in/copy-out semantics on my
cluster it performs very slowly (slower than GigE), but not when it uses RDMA.
Any tips on how to debug this?
Surprisingly, just doing two consecutive 80K-byte MPI_BCASTs
is very quick (forget about the warmup and actual broadcasts),
whereas a single 80K broadcast is slow. Not sure if I'm missing
anything!
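A sketch of what I mean (20000 real(4) values = 80K bytes; this approximates my test, it is not the exact code):

    program two_bcasts
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 20000   ! 20000 real(4) values = 80K bytes
      real :: buf(n)
      double precision :: t0, t1, t2
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      ! two back-to-back 80K-byte bcasts, timed individually
      t0 = MPI_WTIME()
      call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      t1 = MPI_WTIME()
      call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      t2 = MPI_WTIME()

      if (rank == 0) print *, 'first (s):', t1 - t0, '  second (s):', t2 - t1
      call MPI_FINALIZE(ierr)
    end program two_bcasts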
There are also startup-time and synchronization issues. Remember that
although MPI_BCAST does not provide any synchronization guarantees, it could
well be that the 1st bcast effectively synchronizes the processes and the 2nd
one therefore runs much faster (individual processes won't need to spend
much time blocking while waiting for messages, because they're effectively
operating in lock step after the first bcast).
Benchmarking is a very tricky business; it can be extremely difficult to
measure exactly what you want to measure.
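One common way to reduce these effects (a sketch, not a prescription) is to barrier before the timed operation and report the maximum elapsed time across all ranks rather than just the root's:

    program timed_bcast
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 20000
      real :: buf(n)
      double precision :: t0, tlocal, tmax
      integer :: rank, ierr

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      ! barrier so all ranks enter the timed bcast together
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      t0 = MPI_WTIME()
      call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
      tlocal = MPI_WTIME() - t0

      ! report the slowest rank's time, not just the root's
      call MPI_REDUCE(tlocal, tmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, &
                      0, MPI_COMM_WORLD, ierr)
      if (rank == 0) print *, 'max bcast time (s):', tmax

      call MPI_FINALIZE(ierr)
    end program timed_bcast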
My main effort here is not to benchmark my cluster but to resolve a
user problem: he complained that his bcasts run very slowly. I
tried to recreate the situation with a simple Fortran program
which just performs a bcast of a size similar to the one in his code.
It also performed very slowly (slower than over GigE), so I started
increasing and decreasing the bcast size and observed that it is slow
only in the range of 8K to 100K bytes.
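Roughly how I swept the sizes (a sketch; the step, bounds, and the added barrier are illustrative, not the exact program I ran):

    program bcast_sweep
      implicit none
      include 'mpif.h'
      integer, parameter :: nmax = 30000   ! up to 120K bytes of real(4)
      real :: buf(nmax)
      double precision :: t0, t1
      integer :: rank, ierr, n

      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.0

      ! 2000 reals = 8K bytes, 25000 reals = 100K bytes
      do n = 1000, nmax, 1000
         call MPI_BARRIER(MPI_COMM_WORLD, ierr)
         t0 = MPI_WTIME()
         call MPI_BCAST(buf, n, MPI_REAL, 0, MPI_COMM_WORLD, ierr)
         t1 = MPI_WTIME()
         if (rank == 0) print *, n*4, 'bytes:', t1 - t0, 's'
      end do

      call MPI_FINALIZE(ierr)
    end program bcast_sweep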
thanks,
Krishna.