If you have multiple receivers, then use MPI_Bcast; it does all the necessary optimizations, so MPI users do not have to struggle to adapt or optimize their application for a specific architecture or network.
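Something along these lines would be a minimal sketch (the buffer size and datatype below are just placeholders for your actual data):

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1 << 26;                 /* placeholder: ~64M doubles */
    double *data = malloc(count * sizeof(double));

    if (rank == 0) {
        /* ... fill 'data' on the root ... */
    }

    /* One collective call on every rank; the library chooses the broadcast
     * algorithm (binomial tree, pipeline, ...) based on message and
     * communicator size. */
    MPI_Bcast(data, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(data);
    MPI_Finalize();
    return 0;
}

For large messages the implementation can pipeline the transfer internally, which is essentially the double buffering you would otherwise hand-roll yourself.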
George.

On Fri, May 26, 2017 at 6:43 AM, marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

> Dear All,
>
> I would appreciate some general advice on how to efficiently implement the
> following scenario.
>
> I am looking into how to send a large amount of data over IB _once_, to
> multiple receivers. The trick is, of course, that while the ping-pong
> benchmark delivers great bandwidth, it does so by re-using already
> registered memory buffers. Since I need to send the data only once, the
> memory registration penalty is not easily avoided. I've been looking into
> the following approaches:
>
> 1. have multiple ranks send different parts of the data to different
>    receivers, in the hope that the memory registration cost will be hidden
> 2. pre-register two smaller buffers, into which the data is copied before
>    sending
>
> The first approach is the best I've managed so far, but the bandwidth
> reached is still lower than what I observe with the ping-pong benchmark.
> Also, the performance depends on the number of sending ranks and drops if
> there are too many.
>
> In the second approach one pays for a data copy. My thinking was that,
> since the effective memory bandwidth of a single modern CPU is larger than
> the IB bandwidth, I could squeeze out some performance by combining double
> buffering and multithreading, e.g.,
>
> Step 1. thread A sends the data in the current buffer; behind the scenes,
>         thread B copies data from memory into the next buffer
> Step 2. the buffers are switched
>
> A similar idea would be to use MPI_Get on the remote rank. The sender
> would copy the data from memory into the second buffer while the RMA
> window exposing the first buffer is open. In theory, I would expect those
> two operations to execute simultaneously, with the memory copy hopefully
> hidden behind the IB transfer.
>
> Of course, the experiments didn't really work. While the first
> (multi-rank) approach is OK and shows some improvement, the bandwidth
> could still be better. None of my double-buffering approaches worked at
> all, possibly because of memory bandwidth contention.
>
> So I was wondering, have any of you had experience with similar
> approaches? In your experience, what would be the best approach?
>
> Thanks a lot!
>
> Marcin
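For completeness, a rough single-threaded sketch of the double-buffering / pipelining idea from the quoted message: post a non-blocking send from one pre-registered staging buffer while copying the next chunk into the other. The staging-buffer size, tag and destination are placeholders, the matching receives on the other side are omitted, and real overlap depends on the MPI implementation making asynchronous progress on the outstanding send.

#include <mpi.h>
#include <string.h>

#define CHUNK (4 * 1024 * 1024)   /* placeholder staging-buffer size, bytes */
#define TAG   42                  /* placeholder message tag */

void pipelined_send(const char *src, size_t total, int dest, MPI_Comm comm)
{
    /* Two staging buffers that are reused for every chunk, so they only
     * pay the memory-registration cost once. */
    static char staging[2][CHUNK];
    MPI_Request req = MPI_REQUEST_NULL;
    int cur = 0;

    for (size_t off = 0; off < total; off += CHUNK) {
        size_t n = total - off < CHUNK ? total - off : CHUNK;

        /* Copy the next chunk while the previous Isend is (hopefully)
         * still in flight on the other buffer. */
        memcpy(staging[cur], src + off, n);

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* complete the previous send */
        MPI_Isend(staging[cur], (int)n, MPI_BYTE, dest, TAG, comm, &req);
        cur ^= 1;                            /* switch buffers */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);       /* drain the last send */
}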