Re: [OMPI users] slow MPI_BCast for messages size from 24K bytes to 800K bytes. (fwd)

Jeff Squyres Wed, 14 Jan 2009 08:47:39 -0500

On Jan 13, 2009, at 3:32 PM, kmur...@lbl.gov wrote:

With IB, there's also the issue of registered memory. Open MPI v1.2.x defaults to copy in/copy out semantics (with pre-registered memory) until the message reaches a certain size, and then it uses a pipelined register/RDMA protocol. However, even with copy in/out semantics of small messages, the resulting broadcast should still be much faster than over gige.

Are you using the same buffer for the warmup bcast as the actual bcast? You might try using "--mca mpi_leave_pinned 1" to see if that helps as well (will likely only help with large messages).

I'm using different buffers for warmup and actual bcast. I tried the mpi_leave_pinned 1, but did not see any difference in behaviour.

In this case, you likely won't see much of a difference -- mpi_leave_pinned will generally only be a boost for long messages that use the same buffers repeatedly.

Maybe whenever Open MPI defaults to copy in/copy out semantics on my cluster, it performs very slowly (slower than gige), but not when it uses RDMA.

That would be quite surprising. I still think there's some kind of startup overhead going on here.

Surprisingly, just doing two consecutive 80K byte MPI_BCASTs performs very quickly (forget about warmup and actual broadcast), whereas a single 80K broadcast is slow. Not sure if I'm missing anything!

There's also the startup time and synchronization issues. Remember that although MPI_BCAST does not provide any synchronization guarantees, it could well be that the 1st bcast effectively synchronizes the processes and the 2nd one therefore runs much faster (because individual processes won't need to spend much time blocking waiting for messages because they're effectively operating in lock step after the first bcast).

Benchmarking is a very tricky business; it can be extremely difficult to precisely measure exactly what you want to measure.

My main effort here is not to benchmark my cluster but to resolve a user problem, wherein he complained that his bcasts are running very slowly. I tried to recreate the situation with a simple Fortran program which just performs a bcast of a size similar to the one in his code. It also performed very slowly (slower than gige). Then I started increasing and decreasing the sizes of the bcast and observed that it performs slowly only in the range 8K bytes to 100K bytes.
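
The synchronization point above (the 1st bcast lining the processes up so the 2nd one looks fast) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Open MPI's broadcast algorithm and not real MPI code: the function name bcast_elapsed, the arrival times, and the fixed transfer cost are all made up for illustration, and the model simply assumes no rank can leave the call before the root has entered it.

```python
# Toy model, NOT real MPI: shows how startup skew can be charged to the
# first timed broadcast. All names and numbers here are hypothetical.

def bcast_elapsed(arrivals, transfer, root=0):
    """Return (time each rank spends inside the call, time each rank exits).

    arrivals: time (microseconds) at which each rank enters the bcast.
    transfer: assumed fixed cost (microseconds) of moving the data.
    Assumption: a rank can only finish once both it and the root have
    entered the call, plus the transfer cost.
    """
    root_arrival = arrivals[root]
    exits = [max(t, root_arrival) + transfer for t in arrivals]
    elapsed = [e - t for e, t in zip(exits, arrivals)]
    return elapsed, exits


# Rank 0 (the root) is late because of startup overhead; the others wait.
first, exits = bcast_elapsed([5000, 0, 200, 1000], transfer=300)

# Every rank enters the second bcast at the moment it left the first one.
second, _ = bcast_elapsed(exits, transfer=300)

print("1st bcast, slowest rank:", max(first), "microseconds")
print("2nd bcast, slowest rank:", max(second), "microseconds")
```

In this made-up setup the slowest rank spends 5300 microseconds in the first call and 300 in the second, even though the modeled transfer cost is identical. That is why a single timed bcast can mostly be measuring how long the processes waited for each other.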



Can you send your modified test program (with a warmup send)?

What happens if you run a benchmark like the broadcast section of IMB on TCP and IB?


--
Jeff Squyres
Cisco Systems
