[OMPI users] Reducing across nodes communication in MPI

Mudassar Majeed Sat, 9 Jun 2012 19:52:37 -0400

Dear MPI people, 

                               I am working on a graph partitioning problem, 
where we have an undirected graph of p MPI processes. The edges have weights 
that show how much communication processes do among themselves. The cluster has 
multiple nodes (each node with 8 cores) and nodes are connected through 
Infiniband fast network. The task is the partition the graph in k partitions 
where k is the number of nodes such that the edge cut is minimized (across 
partitions/nodes edges or communication). I wrote a distributed algorithm for 
that. I will compare its quality with the algorithms from graph community, but 
right now I am writing the algorithm for MPI applications. I have seen that 
with 10% sparse undirected graph, I got upto 80% reduction in edge cut (for 
k=2).



I checked the algorithm very deeply and I have seen that the optimal partitions 
are residing on different nodes (I saw the IPs). Based on the NetPipe 
benchmark, I wrote a test application where I randomly generate undirected 
graphs. I run that application before doing the reduction of edge cut and after 
reduction of edge cut. The processes that communicate more come to the same 
nodes. In the algorithm, each process gets a new rank based on communication 
requirement. The problem I am getting is that the overall execution time of the 
test application should be less than the application that runs without 
performing edge cut reduction but it does not happen. Here is the code for test 
application (self explanatory) and the output of the program. Please help me 
diagnosing the logical bug. We can discuss more.

MPI_Barrier(comm);
    
    t0 = MPI_Wtime();
    
    for(int j=0;j<10;j++)
    {
      
    for(int i=0; i<comm_size; i++)
    {
    int target = comminfo[i].rank;
    int comm_amount = comminfo[i].comm;
    if(comm_amount > 0)
    {
        buff = new Node[MAX_COMM*NTRIALS];
        MPI_Irecv (buff, MAX_COMM * NTRIALS * sizeof(Node)/sizeof(char), 
MPI_CHAR, MPI_ANY_SOURCE, 4600, comm, &recv_requests[i]);
        recv_buffers.insert(recv_buffers.end(), buff); 
    }
    }

    for(int i=0; i<comm_size; i++)
    {
    int target = comminfo[i].rank;
    int comm_amount = comminfo[i].comm;
    if(comm_amount > 0)
    {
        buff = new Node[comm_amount/2*NTRIALS];
        MPI_Isend((void*)buff, comm_amount/2 * NTRIALS * 
sizeof(Node)/sizeof(char), MPI_CHAR, target, 4600, comm, &send_requests[i]);
        send_buffers.insert(send_buffers.end(), buff);
    }
    }
    
    MPI_Waitall(comm_size, send_requests, send_status);
    
    MPI_Waitall(comm_size, recv_requests, recv_status);
    
    }
    
    t1 = MPI_Wtime();
    
    MPI_Allreduce(&t0, &min_t0, 1, MPI_DOUBLE, MPI_MIN, comm);
    MPI_Allreduce(&t1, &max_t1, 1, MPI_DOUBLE, MPI_MAX, comm);
    
    diff = (max_t1 - min_t0) / NTRIALS;

  retirm diff; // this value will be printed in main function....


.................. the output of the program is 



mpprun INFO: Starting openmpi run on 2 nodes (16 ranks)...

P0 >> Without balancing: Execution took: 0.0081014822 secs

Edge_Cut_Before: 46864

Edge_Cut_After: 20097

Balancing took: 1.0993268490 secs

P0 >> After balancing: Execution took: 0.0095374639 secs



Please help me.

best regards,

Mudassar

[OMPI users] Reducing across nodes communication in MPI

Reply via email to