Hi Nathan,

Thanks for your answer.

The "credits" make sense for the purpose of flow control. However, the
sender in my case will be blocked even for the first message. This doesn't
seem to be the symptom of running out of credits. Is there any reason for
this? Also, is there a mac parameter for the number of credits?
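
(Would listing the openib parameters with something like

    ompi_info --param btl openib --level 9

be the right way to look for it? I could not tell which of them, if any,
corresponds to the credits.)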

Best,
Michael

On Mon, May 16, 2016 at 6:35 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

>
> When using eager_rdma the sender will block once it runs out of
> "credits". If the receiver enters MPI for any reason the incoming
> messages will be placed in the ob1 unexpected queue and the credits will
> be returned to the sender. If you turn off eager_rdma you will probably
> get different results. That said, the unexpected message path is
> non-optimal and it would be best to ensure a matching receive is posted
> before the send.
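>
> For example (a rough, untested sketch reusing the variables from your
> attached program), the receiver could post its receive before the compute
> loop and only wait on it afterwards:
>
>     MPI_Request req;
>     /* post the receive up front so the incoming eager message finds a
>        matching receive instead of landing in the unexpected queue */
>     MPI_Irecv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &req);
>
>     /* ... long computation can run here ... */
>
>     MPI_Wait(&req, MPI_STATUS_IGNORE);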
>
> Additionally, if you are using infiniband I recommend against adding a
> per-peer queue pair to btl_openib_receive_queues. We have not seen any
> performance benefit to using per-peer queue pairs and they do not
> scale.
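>
> For example, keeping only the shared receive queues and dropping the leading
> per-peer (P) entry would look something like this (untested, just to show
> the shape of the value):
>
>   btl_openib_receive_queues = S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512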
>
> -Nathan Hjelm
> HPC-ENV, LANL
>
> On Mon, May 16, 2016 at 12:21:41PM -0400, Xiaolong Cui wrote:
> >    Hi,
> >    I am using Open MPI 1.8.6. I guess my question is related to the flow
> >    control algorithm for small messages. The question is how to avoid the
> >    sender being blocked by the receiver when using the openib module for
> >    small messages and blocking sends. I have looked through this FAQ
> >    (https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot)
> >    but didn't find the answer. My understanding of the "eager sending
> >    protocol" is that if a message is "small", it will be transported to
> >    the receiver immediately, even if the receiver is not ready. As a
> >    result, the sender won't be blocked until the receiver posts the
> >    receive operation.
> >    I am trying to observe such behavior with a simple program of two MPI
> >    ranks (attached). My confusion is that while I can see the behavior
> >    with the "vader" module (shared memory) when running the two ranks on
> >    the same node,
> >    [output]
> >
> >    [0] size = 16, loop = 78, time = 0.00007
> >
> >    [1] size = 16, loop = 78, time = 3.42426
> >
> >    [/output]
> >    but I cannot see it when running them on two nodes using the "openib"
> >    module.
> >    [output]
> >
> >    [0] size = 16, loop = 78, time = 3.42627
> >
> >    [1] size = 16, loop = 78, time = 3.42426
> >
> >    [/output]
> >    So does anyone know the reason? My runtime configuration is also attached.
> >    Thanks!
> >    Sincerely,
> >    Michael
> >    --
> >    Xiaolong Cui (Michael)
> >    Department of Computer Science
> >    Dietrich School of Arts & Science
> >    University of Pittsburgh
> >    Pittsburgh, PA 15260
>
> > btl = openib,vader,self
> > #btl_base_verbose = 100
> > btl_openib_use_eager_rdma = 1
> > btl_openib_eager_limit = 160000
> > btl_openib_rndv_eager_limit = 160000
> > btl_openib_max_send_size = 160000
> > btl_openib_receive_queues = P,128,256,192,64:S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
>
> > #include "mpi.h"
> > #include <mpi-ext.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > int main(int argc, char *argv[])
> > {
> >    int size, rank, psize;
> >    int loops = 78;
> >    int length = 4;
> >    MPI_Init(&argc, &argv);
> >    MPI_Comm_size(MPI_COMM_WORLD, &size);
> >    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >    int *code = (int *)malloc(length * sizeof(int));
> >    MPI_Status status;
> >    long long i = 0;
> >    double time_s = MPI_Wtime();
> >
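> >    /* odd ranks (the receivers) first burn time in a compute loop,
> >       delaying the point at which they post their receives */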
> >    if(rank % 2 == 1)
> >    {
> >        int i ;
> >        int j ;
> >        double a = 0.3, b = 0.5;
> >        for(i = 0; i < 30000; i++)
> >            for(j = 0; j < 30000; j++){
> >                a = a * 2;
> >                b = b + a;
> >            }
> >    }
> >
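> >    /* even ranks send, odd ranks receive; with eager small messages the
> >       sender should not have to wait for the receive to be posted */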
> >    for(i = 0; i < loops; i++){
> >        if(rank % 2 == 0){
> >            MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
> >        }
> >        else if(rank % 2 == 1){
> >            MPI_Recv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >        }
> >    }
> >    double time_e = MPI_Wtime();
> >    printf("[%d] size = %d, loop = %d, time = %.5f\n",
> >           rank, (int)(length * sizeof(int)), loops, time_e - time_s);
> >
> >    MPI_Finalize();
> >    return 0;
> > }
> >
>
> > _______________________________________________
> > users mailing list
> > us...@open-mpi.org
> > Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29224.php
>
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/05/29227.php
>



-- 
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260
