I think it is the connection manager that blocks the first message: if I
add a send/recv pair at the very beginning, the problem goes away. However,
removing the per-peer queue pair does not help.
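
The warm-up I am referring to is roughly the following (a minimal sketch;
the tag and scratch variable are arbitrary):

    int warm = 0;
    if (rank % 2 == 0) {
        /* one small message up front so the connection is set up
           before the timed loop starts */
        MPI_Send(&warm, 1, MPI_INT, rank + 1, 99, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&warm, 1, MPI_INT, rank - 1, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    /* after this exchange the first send in the timed loop no longer blocks */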

Do you know of any document that discusses the Open MPI internals,
especially as they relate to this problem?

On Tue, May 17, 2016 at 11:00 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

>
> If it is blocking on the first message then it might be blocked by the
> connection manager. Removing the per-peer queue pair might help in that
> case.
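>
> For example, with the receive_queues setting quoted below, that would mean
> dropping the leading per-peer "P,..." entry and keeping only the shared
> receive queue entries, e.g. (a sketch based on your reported values):
>
>   btl_openib_receive_queues = S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512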
>
> -Nathan
>
> On Mon, May 16, 2016 at 10:11:29PM -0400, Xiaolong Cui wrote:
> >    Hi Nathan,
> >    Thanks for your answer.
> >    The "credits" make sense for the purpose of flow control. However, the
> >    sender in my case is blocked even for the first message. This does not
> >    seem to be a symptom of running out of credits. Is there any reason for
> >    this? Also, is there an MCA parameter for the number of credits?
> >    Best,
> >    Michael
> >    On Mon, May 16, 2016 at 6:35 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> >      When using eager_rdma the sender will block once it runs out of
> >      "credits". If the receiver enters MPI for any reason the incoming
> >      messages will be placed in the ob1 unexpected queue and the credits
> >      will be returned to the sender. If you turn off eager_rdma you will
> >      probably get different results. That said, the unexpected message
> >      path is non-optimal and it would be best to ensure a matching
> >      receive is posted before the send.
> >
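> >      A rough sketch of the idea, reusing the variables from the attached
> >      test program (not a drop-in replacement for the whole loop):
> >
> >          MPI_Request req;
> >          /* pre-post the receive before the long compute phase so the
> >             incoming eager message finds a matching receive */
> >          MPI_Irecv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &req);
> >          /* ... long computation ... */
> >          MPI_Wait(&req, MPI_STATUS_IGNORE);
> >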
> >      Additionally, if you are using infiniband I recommend against adding
> >      a per-peer queue pair to btl_openib_receive_queues. We have not seen
> >      any performance benefit to using per-peer queue pairs and they do not
> >      scale.
> >
> >      -Nathan Hjelm
> >      HPC-ENV, LANL
> >      On Mon, May 16, 2016 at 12:21:41PM -0400, Xiaolong Cui wrote:
> >      >    Hi,
> >      >    I am using Open MPI 1.8.6. I guess my question is related to the
> >      >    flow control algorithm for small messages. The question is how
> >      >    to avoid the sender being blocked by the receiver when using the
> >      >    openib module with small messages and blocking sends. I have
> >      >    looked through this FAQ
> >      >    (https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot)
> >      >    but didn't find the answer. My understanding of the "eager
> >      >    sending protocol" is that if a message is "small", it is
> >      >    transported to the receiver immediately, even if the receiver is
> >      >    not ready. As a result, the sender is not blocked waiting for
> >      >    the receiver to post the receive operation.
> >      >    I am trying to observe this behavior with a simple program of
> >      >    two MPI ranks (attached). My confusion is that while I can see
> >      >    the behavior with the "vader" module (shared memory) when
> >      >    running the two ranks on the same node,
> >      >    [output]
> >      >
> >      >    [0] size = 16, loop = 78, time = 0.00007
> >      >
> >      >    [1] size = 16, loop = 78, time = 3.42426
> >      >
> >      >    [/output]
> >      >    but I cannot see it when running them on two nodes using the
> >      >    "openib" module.
> >      >    [output]
> >      >
> >      >    [0] size = 16, loop = 78, time = 3.42627
> >      >
> >      >    [1] size = 16, loop = 78, time = 3.42426
> >      >
> >      >    [/output]
> >      >    Does anyone know the reason? My runtime configuration is also
> >      >    attached.
> >      >    Thanks!
> >      >    Sincerely,
> >      >    Michael
> >      >    --
> >      >    Xiaolong Cui (Michael)
> >      >    Department of Computer Science
> >      >    Dietrich School of Arts & Science
> >      >    University of Pittsburgh
> >      >    Pittsburgh, PA 15260
> >
> >      > btl = openib,vader,self
> >      > #btl_base_verbose = 100
> >      > btl_openib_use_eager_rdma = 1
> >      > btl_openib_eager_limit = 160000
> >      > btl_openib_rndv_eager_limit = 160000
> >      > btl_openib_max_send_size = 160000
> >      > btl_openib_receive_queues = P,128,256,192,64:S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
> >
> >      > #include "mpi.h"
> >      > #include <mpi-ext.h>
> >      > #include <stdio.h>
> >      > #include <stdlib.h>
> >      >
> >      > int main(int argc, char *argv[])
> >      > {
> >      >     int size, rank;
> >      >     int loops = 78;
> >      >     int length = 4;
> >      >     MPI_Init(&argc, &argv);
> >      >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >      >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >      >     int *code = (int *)malloc(length * sizeof(int));
> >      >     long long i = 0;
> >      >     double time_s = MPI_Wtime();
> >      >
> >      >     /* odd ranks: burn time before entering MPI so the messages
> >      >        arrive before any receive is posted */
> >      >     if(rank % 2 == 1)
> >      >     {
> >      >         int i;
> >      >         int j;
> >      >         double a = 0.3, b = 0.5;
> >      >         for(i = 0; i < 30000; i++)
> >      >             for(j = 0; j < 30000; j++){
> >      >                 a = a * 2;
> >      >                 b = b + a;
> >      >             }
> >      >     }
> >      >
> >      >     /* even ranks send, odd ranks receive */
> >      >     for(i = 0; i < loops; i++){
> >      >         if(rank % 2 == 0){
> >      >             MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
> >      >         }
> >      >         else if(rank % 2 == 1){
> >      >             MPI_Recv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >      >         }
> >      >     }
> >      >     double time_e = MPI_Wtime();
> >      >     printf("[%d] size = %d, loop = %d, time = %.5f\n", rank,
> >      >            (int)(length * sizeof(int)), loops, time_e - time_s);
> >      >
> >      >     MPI_Finalize();
> >      >     return 0;
> >      > }
> >      >
> >
> >    --
> >    Xiaolong Cui (Michael)
> >    Department of Computer Science
> >    Dietrich School of Arts & Science
> >    University of Pittsburgh
> >    Pittsburgh, PA 15260
>



-- 
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260
