Thank you very much for the reply, Sir. Yes, I can observe this pattern by eyeballing the network communication graph of my cluster (through the Ganglia Cluster Monitor: http://ganglia.sourceforge.net/). During this loop's execution, the master is receiving data at ~100 MB/s (of the theoretical 125 MB/s on Gigabit Ethernet), while each of the 30 slave processes is sending around 3-4 MB/s.

Is there a way to get exact numbers about network utilization from within the MPI code, instead of just visualizing the graph?
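So far, the only numbers I can compute from inside the code come from timing each transfer with MPI_Wtime() and dividing by the known message size, along these lines (a minimal self-contained sketch; CHUNK_DOUBLES is a placeholder for my real chunk size, and the payload is a dummy):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_DOUBLES (1 << 20)   /* placeholder: ~8 MB of doubles per slave */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *buf = calloc(CHUNK_DOUBLES, sizeof(double));  /* dummy payload */

        if (rank == 0) {                       /* master: time each transfer */
            for (int src = 1; src < size; src++) {
                double t0 = MPI_Wtime();
                MPI_Recv(buf, CHUNK_DOUBLES, MPI_DOUBLE, src, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                double t1 = MPI_Wtime();
                double mb = CHUNK_DOUBLES * sizeof(double) / 1.0e6;
                printf("from rank %d: %.1f MB in %.3f s = %.1f MB/s\n",
                       src, mb, t1 - t0, mb / (t1 - t0));
            }
        } else {                               /* slave: send one chunk */
            MPI_Send(buf, CHUNK_DOUBLES, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

That gives the effective throughput of each transfer as MPI sees it, but not the link-level utilization that Ganglia reports.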
Regards,
Saiyed

On Fri, Sep 22, 2017 at 3:18 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> All your processes send their data to a single destination at the same
> time. Clearly you are reaching the capacity of your network, and your data
> transfers will be bound by it. This is a physical constraint that you can
> only overcome by adding network capacity to your cluster.
>
> At the software level, the only possibility is to make each of the p slave
> processes send its data to your centralized resource at a different time,
> so that the data has time to be transferred through the network before the
> next slave is ready to submit its result.
>
> George
>
> On Thu, Sep 21, 2017 at 4:57 PM, saiyedul islam <saiyedul.is...@gmail.com> wrote:
>
>> Hi all,
>>
>> I am working on the parallelization of a data clustering algorithm, in
>> which I am following the MPMD pattern of MPI (i.e., 1 master process and
>> p slave processes in the same communicator). It is an iterative algorithm
>> where 2 loops inside each iteration are separately parallelized.
>>
>> The first loop is parallelized by partitioning the N-sized input data
>> into (almost) equal parts among the p slaves. Each slave produces a
>> contiguous chunk of about (p * N/p) double values as the result of its
>> local processing. This local chunk from each slave is collected back on
>> the master process, where it is merged with the chunks from the other
>> slaves.
>> If a blocking call (MPI_Send / MPI_Recv) is put in a loop on the master
>> such that it receives the data from the slaves one by one in order of
>> their rank, then each slave takes about 75 seconds for its local
>> computation (as measured by MPI_Wtime()) and about 1.5 seconds for
>> transferring its chunk to the master. But, as the transfers happen in
>> order, by the time the last slave process is done, the total time becomes
>> 75 seconds for computation plus 50 seconds for communication.
>> These timings are for a cluster of 31 machines where a single process
>> executes on each machine. All these machines are directly connected via a
>> private Gigabit network switch. In order to effectively parallelize the
>> algorithm, the overall execution time needs to come below 80 seconds.
>>
>> I have tried the following strategies to solve this problem:
>> 0. Ordered transfer, as explained above.
>> 1. Collecting the data through MPI_Gatherv, assuming that internally it
>> will transfer the data in parallel.
>> 2. Creating p threads at the master using OpenMP and calling MPI_Recv (or
>> MPI_Irecv with MPI_Wait) from those threads, with the data received from
>> each process put in a separate buffer. My installation supports
>> MPI_THREAD_MULTIPLE.
>>
>> The problem is that strategies 1 & 2 take almost the same time as
>> strategy 0.
>> *Is there a way through which I can receive data in parallel and
>> substantially decrease the overall execution time?*
>>
>> Hoping to get your help soon. Sorry for the long question.
>>
>> Regards,
>> Saiyedul Islam
>>
>> PS: Specifications of the cluster: GCC 5.10, Open MPI 2.0.1, CentOS 6.5
>> (as part of Rocks Cluster).
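P.S. If I understand George's suggestion correctly, making each slave send at a different time could be done with a small "go" token from the master, roughly like this (only a sketch, reusing the setup of the timing example above; TAG_GO / TAG_DATA are placeholder tags and the merge step is elided):

    /* Sketch: master grants permission to one slave at a time, so only one
       large transfer is on the wire while the remaining slaves keep working. */
    #define TAG_GO   1   /* placeholder tag values */
    #define TAG_DATA 2

    if (rank == 0) {                           /* master */
        for (int src = 1; src < size; src++) {
            int go = 1;
            MPI_Send(&go, 1, MPI_INT, src, TAG_GO, MPI_COMM_WORLD);
            MPI_Recv(buf, CHUNK_DOUBLES, MPI_DOUBLE, src, TAG_DATA,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* merge this slave's chunk into the global result here */
        }
    } else {                                   /* slave */
        int go;
        /* ... local computation of this slave's chunk ... */
        MPI_Recv(&go, 1, MPI_INT, 0, TAG_GO, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf, CHUNK_DOUBLES, MPI_DOUBLE, 0, TAG_DATA, MPI_COMM_WORLD);
    }

This serializes the transfers explicitly, so whether it helps beyond strategy 0 depends on how well the sends can be offset against the other slaves' remaining computation.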
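For concreteness, the threaded receive of my strategy 2 should be roughly equivalent to this thread-free version, where the master posts all receives up front and waits on them together (again a sketch with the same placeholders; result is assumed to be one contiguous buffer of (size - 1) * CHUNK_DOUBLES doubles):

    /* Sketch: post all nonblocking receives at once, then wait for all.
       MPI can progress whichever transfers are ready, without the strict
       rank ordering of strategy 0 and without extra threads. */
    MPI_Request *reqs = malloc((size - 1) * sizeof(MPI_Request));
    for (int src = 1; src < size; src++)
        MPI_Irecv(result + (size_t)(src - 1) * CHUNK_DOUBLES, CHUNK_DOUBLES,
                  MPI_DOUBLE, src, TAG_DATA, MPI_COMM_WORLD, &reqs[src - 1]);
    MPI_Waitall(size - 1, reqs, MPI_STATUSES_IGNORE);
    free(reqs);

MPI_Gatherv expresses the same pattern in a single call. Either way, as George points out, everything still funnels through the master's single Gigabit link, so these variants remove the strict rank ordering but cannot raise the ~125 MB/s ceiling.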
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users