Can you pad the data so that you can use MPI_Gather instead? It's possible that Gatherv doesn't use recursive doubling.
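Roughly along these lines (a sketch only, not tested against your code; the names my_chunk, my_count, etc. are mine, and the root still needs the real counts to strip the padding afterwards):

/* Pad every rank's chunk to a common maximum size so a plain MPI_Gather
 * can replace MPI_Gatherv. All names are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void gather_padded(const double *my_chunk, int my_count, int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Agree on the largest chunk size across all ranks. */
    int max_count = 0;
    MPI_Allreduce(&my_count, &max_count, 1, MPI_INT, MPI_MAX, comm);

    /* Let the root know the real counts so it can strip the padding later. */
    int *counts = (rank == root) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, root, comm);

    /* Copy the real data into a zero-padded send buffer of max_count doubles. */
    double *sendbuf = calloc(max_count, sizeof(double));
    memcpy(sendbuf, my_chunk, my_count * sizeof(double));

    double *recvbuf = (rank == root)
        ? malloc((size_t)size * max_count * sizeof(double)) : NULL;

    /* Fixed-size collective: rank i's chunk lands at recvbuf + i * max_count. */
    MPI_Gather(sendbuf, max_count, MPI_DOUBLE,
               recvbuf, max_count, MPI_DOUBLE, root, comm);

    free(sendbuf);
    if (rank == root) {
        /* ... compact recvbuf using counts[i], then merge as before ... */
        free(recvbuf);
        free(counts);
    }
}

Whether this beats Gatherv depends on how much padding you end up sending; with ~30 nearly equal chunks the overhead should be small.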
Or you can implement your own aggregation tree to work around the incast problem; a sketch of that idea follows the quoted message below.

Jeff

On Thu, Sep 21, 2017 at 2:01 PM saiyedul islam <saiyedul.is...@gmail.com> wrote:
> Hi all,
>
> I am working on the parallelization of a data clustering algorithm in which I
> am following the MPMD pattern of MPI (i.e. 1 master process and p slave
> processes in the same communicator). It is an iterative algorithm where 2 loops
> inside each iteration are parallelized separately.
>
> The first loop is parallelized by partitioning the N-size input data into
> (almost) equal parts among the p slaves. Each slave produces a contiguous
> chunk of about (p * N/p) double values as the result of its local processing.
> This local chunk from each slave is collected back on the master process, where
> it is merged with the chunks from the other slaves.
> If a blocking call (MPI_Send / Recv) is put in a loop on the master such that
> it receives the data one by one in rank order from the slaves, then
> each slave takes about 75 seconds for its local computation (as measured
> by MPI_Wtime()) and about 1.5 seconds for transferring its chunk to the
> master. But, as the transfers happen in order, by the time the last slave
> process is done, the total time becomes 75 seconds for computation and 50
> seconds for communication.
> These timings are for a cluster of 31 machines, with a single process
> running on each machine. All the machines are connected directly via a
> private Gigabit network switch. For the parallelization to be effective,
> the overall execution time needs to come below 80 seconds.
>
> I have tried the following strategies to solve this problem:
> 0. Ordered transfer, as explained above.
> 1. Collecting data through MPI_Gatherv, assuming that internally it
> will transfer data in parallel.
> 2. Creating p threads at the master using OpenMP and calling MPI_Recv (or
> MPI_Irecv with MPI_Wait) from the threads. The data received from each
> process is put in a separate buffer. My installation supports
> MPI_THREAD_MULTIPLE.
>
> The problem is that strategies 1 & 2 take almost the same time as
> strategy 0.
> *Is there a way through which I can receive data in parallel and
> substantially decrease the overall execution time?*
>
> Hoping to get your help soon. Sorry for the long question.
>
> Regards,
> Saiyedul Islam
>
> PS: Specifications of the cluster: GCC 5.10, Open MPI 2.0.1, CentOS 6.5 (as
> part of Rocks Cluster).

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
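For the aggregation tree, here is a minimal two-level sketch (everything in it is an assumption on my part: equal chunk sizes per rank, a world size divisible by num_groups, and the function/variable names):

/* Two-level aggregation: ranks are split into groups, a leader per group
 * gathers its group's chunks first, and only the leaders then send to the
 * master. This caps the fan-in at the master to num_groups senders. */
#include <mpi.h>
#include <stdlib.h>

void tree_gather(const double *chunk, int chunk_count, int num_groups,
                 double *result /* significant at world rank 0 only */)
{
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Stage 1: gather inside each group onto that group's leader. */
    MPI_Comm group_comm;
    MPI_Comm_split(MPI_COMM_WORLD, rank % num_groups, rank, &group_comm);
    int group_rank, group_size;
    MPI_Comm_rank(group_comm, &group_rank);
    MPI_Comm_size(group_comm, &group_size);

    double *group_buf = NULL;
    if (group_rank == 0)
        group_buf = malloc((size_t)group_size * chunk_count * sizeof(double));
    MPI_Gather(chunk, chunk_count, MPI_DOUBLE,
               group_buf, chunk_count, MPI_DOUBLE, 0, group_comm);

    /* Stage 2: only the leaders gather onto the master (world rank 0). */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, group_rank == 0 ? 0 : MPI_UNDEFINED,
                   rank, &leader_comm);
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Gather(group_buf, group_size * chunk_count, MPI_DOUBLE,
                   result, group_size * chunk_count, MPI_DOUBLE, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    free(group_buf);
    MPI_Comm_free(&group_comm);
}

With 30 slaves and, say, 5 groups, the master only ever has 5 senders hitting it at once while the intra-group gathers run concurrently, which is the incast relief a deeper tree would give you as well. Note that the result buffer ends up ordered by group (rank g, g+num_groups, g+2*num_groups, ...), so the master has to re-index the chunks before merging them.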