If you have multiple receivers, use MPI_Bcast: it already does all the
necessary optimizations, so MPI users do not have to struggle to
adapt/optimize their application for a specific architecture/network.
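
For instance, a minimal sketch (the payload size and root rank are
placeholders):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Placeholder payload: the data that has to reach every rank once. */
    const int count = 1 << 26;                /* 64M doubles             */
    double *data = malloc(count * sizeof(double));

    /* Rank 0 holds the data; one call delivers it to all other ranks.
       Open MPI picks the broadcast algorithm based on message size and
       communicator layout.                                              */
    MPI_Bcast(data, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(data);
    MPI_Finalize();
    return 0;
}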

  George.



On Fri, May 26, 2017 at 6:43 AM, marcin.krotkiewski <
marcin.krotkiew...@gmail.com> wrote:

> Dear All,
>
> I would appreciate some general advice on how to efficiently implement the
> following scenario.
>
> I am looking into how to send a large amount of data over IB _once_, to
> multiple receivers. The trick is, of course, that while the ping-pong
> benchmark delivers great bandwidth, it does so by re-using the already
> registered memory buffers. Since I need to send the data once, the memory
> registration penalty is not easily avoided. I've been looking into the
> following approaches:
>
> 1. have multiple ranks send different parts of the data to different
> receivers, in the hope that the memory registration cost will be hidden
> (see the sketch right after this list)
> 2. pre-register two smaller buffers, into which the data is copied before
> it is sent
>
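> To make 1. concrete, this is roughly what I have in mind (the number of
> senders and the sender-to-receiver pairing are placeholders; run with at
> least 2*nsenders ranks):
>
> #include <mpi.h>
> #include <stdlib.h>
>
> /* Approach 1 (sketch): the first nsenders ranks each send a different
>    slice of the payload to a matching receiver rank, so the per-slice
>    registration costs overlap across the senders. */
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>
>     int rank;
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     const int nsenders = 4;                 /* placeholder             */
>     const int chunk = 64 * 1024 * 1024;     /* bytes per sender        */
>     char *slice = malloc(chunk);
>
>     if (rank < nsenders) {
>         MPI_Send(slice, chunk, MPI_BYTE, nsenders + rank, 0,
>                  MPI_COMM_WORLD);
>     } else if (rank < 2 * nsenders) {
>         MPI_Recv(slice, chunk, MPI_BYTE, rank - nsenders, 0,
>                  MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>     }
>
>     free(slice);
>     MPI_Finalize();
>     return 0;
> }
>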
> The first approach is the best I've managed so far, but the bandwidth
> reached is still lower than what I observe using the pingpong benchmark.
> Also, the performance depends on the number of sending ranks and drops if
> there are too many.
>
> In the second approach one pays for a data copy. My thinking was that
> since the effective memory bandwidth available on a single modern CPU is
> larger than the IB bandwidth, I could squeeze out some performance by
> combining double buffering and multithreading, e.g.,
>
> Step 1. thread A sends the data in the current buffer. Behind the scenes,
> thread B copies data from memory to the next buffer
> Step 2. the buffers are switched (a rough sketch follows below)
>
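> The same overlap can also be written with a single thread, using a
> nonblocking send so that the copy into the other buffer proceeds while
> the current chunk is in flight; a rough sketch (chunk size and the
> destination rank are placeholders):
>
> #include <mpi.h>
> #include <string.h>
>
> #define CHUNK (8 * 1024 * 1024)             /* placeholder chunk size  */
>
> /* Double-buffered send of `total` bytes from `src` to rank `dst`; the
>    two staging buffers are reused, so they stay registered after the
>    first transfers touch them. */
> static void send_double_buffered(const char *src, size_t total, int dst)
> {
>     static char stage[2][CHUNK];
>     MPI_Request req;
>     int cur = 0;
>     size_t off = 0;
>     size_t len = (total < CHUNK) ? total : CHUNK;
>
>     memcpy(stage[cur], src, len);
>     while (off < total) {
>         /* post the send of the current buffer ...                     */
>         MPI_Isend(stage[cur], (int)len, MPI_BYTE, dst, 0,
>                   MPI_COMM_WORLD, &req);
>         /* ... and stage the next chunk in the other buffer while the
>            transfer is (hopefully) progressing                         */
>         size_t noff = off + len, nlen = 0;
>         if (noff < total) {
>             nlen = (total - noff < CHUNK) ? total - noff : CHUNK;
>             memcpy(stage[1 - cur], src + noff, nlen);
>         }
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>         off = noff;
>         len = nlen;
>         cur = 1 - cur;
>     }
> }
>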
> A similar idea would be to use MPI_Get on the remote rank. The sender
> would copy the data from memory into the second buffer while the RMA
> window with the first buffer is exposed. In theory, I would expect those
> two operations to be executed simultaneously, with the memory copy
> hopefully hidden behind the IB transfer.
>
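> For the MPI_Get variant, the sending (target) side could look roughly
> like this; one window per staging buffer and fence synchronization are
> just one way to arrange it, and the receivers would call the matching
> MPI_Win_fence pair and MPI_Get from the exposed window:
>
> #include <mpi.h>
> #include <string.h>
>
> #define CHUNK (8 * 1024 * 1024)             /* placeholder chunk size  */
>
> /* Target side (sketch): expose one staging buffer per window while the
>    next chunk is copied into the buffer that is not exposed. */
> static void expose_double_buffered(const char *src, size_t total,
>                                    MPI_Comm comm)
> {
>     static char stage[2][CHUNK];
>     MPI_Win win[2];
>     MPI_Win_create(stage[0], CHUNK, 1, MPI_INFO_NULL, comm, &win[0]);
>     MPI_Win_create(stage[1], CHUNK, 1, MPI_INFO_NULL, comm, &win[1]);
>
>     int cur = 0;
>     size_t off = 0;
>     memcpy(stage[cur], src, (total < CHUNK) ? total : CHUNK);
>
>     while (off < total) {
>         size_t len = (total - off < CHUNK) ? total - off : CHUNK;
>         MPI_Win_fence(0, win[cur]);         /* open exposure epoch     */
>         /* receivers MPI_Get from win[cur] here; meanwhile copy the
>            next chunk into the other buffer                            */
>         if (off + len < total) {
>             size_t nlen = total - off - len;
>             if (nlen > CHUNK) nlen = CHUNK;
>             memcpy(stage[1 - cur], src + off + len, nlen);
>         }
>         MPI_Win_fence(0, win[cur]);         /* close exposure epoch    */
>         off += len;
>         cur = 1 - cur;
>     }
>     MPI_Win_free(&win[0]);
>     MPI_Win_free(&win[1]);
> }
>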
> Of course, the experiments didn't really work out. While the first
> (multi-rank) approach is OK and shows some improvement, the bandwidth
> still falls short of the ping-pong numbers. None of my double-buffering
> approaches worked at all, possibly because of memory bandwidth contention.
>
> So I was wondering, have any of you had any experience with similar
> approaches? In your experience, what would work best?
>
> Thanks a lot!
>
> Marcin
>