Thank you very much for the reply, Sir.

Yes, I can observe this pattern by eyeballing the network communication graph
of my cluster (through the Ganglia Cluster Monitor:
http://ganglia.sourceforge.net/).
During this loop's execution, the master is receiving data at ~100 MB/s (out
of the theoretical 125 MB/s of Gigabit Ethernet), while each of the 30 slave
processes is sending around 3-4 MB/s. That adds up to roughly the same 100
MB/s, so the link into the master does appear to be saturated.

Is there a way to get exact numbers for network utilization from within the
MPI code itself, instead of just eyeballing the graph?
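
For example, I could time each receive on the master and divide the bytes
moved by the elapsed time, roughly as in the sketch below (recv_buf and
chunk_count are placeholders for my real buffer and chunk size):

    /* Rough sketch: measure effective per-transfer bandwidth on the master. */
    for (int src = 1; src < world_size; ++src) {
        double t0 = MPI_Wtime();
        MPI_Recv(recv_buf, chunk_count, MPI_DOUBLE, src, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t1 = MPI_Wtime();
        double mb = chunk_count * sizeof(double) / 1.0e6;
        printf("from rank %d: %.1f MB in %.3f s = %.1f MB/s\n",
               src, mb, t1 - t0, mb / (t1 - t0));
    }

This would only give per-transfer throughput, though, not the utilization of
the link as a whole.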

Regards,
Saiyed

On Fri, Sep 22, 2017 at 3:18 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> All your processes send their data to a single destination at the same time.
> Clearly you are reaching the capacity of your network, and your data
> transfers will be bound by it. This is a physical constraint that you can
> only overcome by adding network capacity to your cluster.
>
> At the software level, the only possibility is to make each of the p slave
> processes send its data to your centralized resource at a different time, so
> that the data has time to be transferred through the network before the next
> slave is ready to submit its result.
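>
> For instance, if a slave's result can be streamed out while it is still
> computing, each ~1.5 second transfer can hide behind the remaining
> computation. A rough sketch of that variant (NPIECES, piece_count and
> compute_piece are placeholders, not anything from your code):
>
>     /* Slave side: send the chunk in NPIECES parts as they are produced,
>        overlapping communication with the remaining computation. */
>     MPI_Request req[NPIECES];
>     for (int i = 0; i < NPIECES; ++i) {
>         compute_piece(chunk + i * piece_count, piece_count);  /* hypothetical */
>         MPI_Isend(chunk + i * piece_count, piece_count, MPI_DOUBLE,
>                   0, i, MPI_COMM_WORLD, &req[i]);
>     }
>     MPI_Waitall(NPIECES, req, MPI_STATUSES_IGNORE);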
>
>   George
>
>
>
>
> On Thu, Sep 21, 2017 at 4:57 PM, saiyedul islam <saiyedul.is...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I am working on the parallelization of a data clustering algorithm, in which
>> I am following the MPMD pattern of MPI (i.e. 1 master process and p slave
>> processes in the same communicator). It is an iterative algorithm in which 2
>> loops inside each iteration are parallelized separately.
>>
>> The first loop is parallelized by partitioning the input data of size N into
>> (almost) equal parts among the p slaves. Each slave produces a contiguous
>> chunk of about (p * N/p) double values as the result of its local processing.
>> The local chunk from each slave is collected back on the master process,
>> where it is merged with the chunks from the other slaves.
>> If a blocking call (MPI_Send / MPI_Recv) is put in a loop on the master such
>> that it receives the data from the slaves one by one, in order of their rank,
>> then each slave takes about 75 seconds for its local computation (as measured
>> by MPI_Wtime()) and about 1.5 seconds to transfer its chunk to the master.
>> But, as the transfers happen in order, by the time the last slave is done the
>> total time becomes 75 seconds for computation plus about 50 seconds for
>> communication.
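>>
>> A simplified version of that master-side receive loop (merged and
>> chunk_count are placeholders for my actual buffer and per-slave chunk size):
>>
>>     /* Master: collect each slave's contiguous chunk, in rank order. */
>>     for (int src = 1; src < world_size; ++src) {
>>         MPI_Recv(merged + (src - 1) * chunk_count, chunk_count, MPI_DOUBLE,
>>                  src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
>>     }
>>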
>> These timings are for a cluster of 31 machines, with a single process running
>> on each machine. All the machines are connected directly via a private
>> Gigabit Ethernet switch. For the parallelization to be effective, the overall
>> execution time needs to come below 80 seconds.
>>
>> I have tried the following strategies to solve this problem:
>> 0. Ordered transfer, as explained above.
>> 1. Collecting the data through MPI_Gatherv, assuming that internally it will
>> transfer the data in parallel (a sketch of this call follows the list).
>> 2. Creating p threads on the master using OpenMP and having each thread call
>> MPI_Recv (or MPI_Irecv with MPI_Wait); the data received from each slave is
>> put in a separate buffer. My installation supports MPI_THREAD_MULTIPLE.
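>>
>> For strategy 1, the gather looks roughly like this (counts, displacements,
>> and buffer names are placeholders; the master contributes nothing itself):
>>
>>     /* Master gathers one variable-sized chunk per slave into 'merged'. */
>>     int counts[world_size], displs[world_size];
>>     counts[0] = 0; displs[0] = 0;
>>     for (int r = 1; r < world_size; ++r) {
>>         counts[r] = chunk_count;             /* per-slave chunk size */
>>         displs[r] = (r - 1) * chunk_count;   /* offset in 'merged'   */
>>     }
>>     MPI_Gatherv(local_chunk, (rank == 0) ? 0 : chunk_count, MPI_DOUBLE,
>>                 merged, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);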
>>
>> The problem is that strategies 1 and 2 take almost the same time as
>> strategy 0.
>> *Is there a way in which I can receive the data in parallel and
>> substantially decrease the overall execution time?*
>>
>> Hoping to get your help soon. Sorry for the long question.
>>
>> Regards,
>> Saiyedul Islam
>>
>> PS: Specifications of the cluster: GCC 5.1.0, Open MPI 2.0.1, CentOS 6.5 (as
>> part of Rocks Cluster).
>>
>
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
