Thanks a lot!

On Tue, May 17, 2016 at 11:49 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

>
> I don't know of any documentation on the connection manager other than
> what is in the code and in my head. I rewrote a lot of the code in 2.x
> so you might want to try out the latest 2.x tarball from
> https://www.open-mpi.org/software/ompi/v2.x/
>
> I know the per-peer queue pair will prevent totally asynchronous
> connections even in 2.x, but an SRQ/XRC-only configuration should work.
>
> -Nathan
>
> On Tue, May 17, 2016 at 11:31:01AM -0400, Xiaolong Cui wrote:
> >    I think it is the connection manager that blocks the first message. If I
> >    add a pair of send/recv at the very beginning, the problem is gone. But
> >    removing the per-peer queue pair does not help.
> >    Do you know of any document that discusses the Open MPI internals,
> >    especially the parts related to this problem?
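> >
> >    (What I mean by the warm-up pair is roughly the untested sketch below; the
> >    zero-byte payload and tag 999 are arbitrary choices, the point is only to
> >    force the connection to be set up before the timed loop.)
> >
> >    /* Warm-up exchange before the timed loop: forces connection setup
> >       between the two ranks.  Sketch only; assumes exactly two ranks. */
> >    if (rank == 0) {
> >        MPI_Send(NULL, 0, MPI_BYTE, 1, 999, MPI_COMM_WORLD);
> >        MPI_Recv(NULL, 0, MPI_BYTE, 1, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >    } else {
> >        MPI_Recv(NULL, 0, MPI_BYTE, 0, 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >        MPI_Send(NULL, 0, MPI_BYTE, 0, 999, MPI_COMM_WORLD);
> >    }
> >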
> >    On Tue, May 17, 2016 at 11:00 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >
> >      If it is blocking on the first message then it might be blocked by the
> >      connection manager. Removing the per-peer queue pair might help in that
> >      case.
> >
> >      -Nathan
> >      On Mon, May 16, 2016 at 10:11:29PM -0400, Xiaolong Cui wrote:
> >      >    Hi Nathan,
> >      >    Thanks for your answer.
> >      >    The "credits" make sense for the purpose of flow control.
> However,
> >      the
> >      >    sender in my case will be blocked even for the first message.
> This
> >      doesn't
> >      >    seem to be the symptom of running out of credits. Is there any
> >      reason for
> >      >    this? Also, is there a mac parameter for the number of credits?
> >      >    Best,
> >      >    Michael
> >      >    On Mon, May 16, 2016 at 6:35 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
> >      >
> >      >      When using eager_rdma the sender will block once it runs out of
> >      >      "credits". If the receiver enters MPI for any reason the incoming
> >      >      messages will be placed in the ob1 unexpected queue and the credits
> >      >      will be returned to the sender. If you turn off eager_rdma you will
> >      >      probably get different results. That said, the unexpected message
> >      >      path is non-optimal and it would be best to ensure a matching
> >      >      receive is posted before the send.
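> >      >
> >      >      Something along these lines (an untested sketch using the 'code' and
> >      >      'length' buffers from the attached test) pre-posts the receive so the
> >      >      match already exists when the eager data arrives:
> >      >
> >      >      /* Sketch only: receiver pre-posts the receive, does its work,
> >      >         then completes it; the sender is unchanged. */
> >      >      MPI_Request req;
> >      >      if (rank % 2 == 1) {
> >      >          MPI_Irecv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &req);
> >      >          /* ... long computation ... */
> >      >          MPI_Wait(&req, MPI_STATUS_IGNORE);
> >      >      } else {
> >      >          MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
> >      >      }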
> >      >
> >      >      Additionally, if you are using InfiniBand I recommend against adding a
> >      >      per-peer queue pair to btl_openib_receive_queues. We have not seen any
> >      >      performance benefit to using per-peer queue pairs and they do not
> >      >      scale.
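> >      >
> >      >      For example, a receive_queues setting with only shared receive queues,
> >      >      i.e. your current value with the per-peer (P) entry dropped, would be
> >      >      something like (untested):
> >      >
> >      >      btl_openib_receive_queues = S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512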
> >      >
> >      >      -Nathan Hjelm
> >      >      HPC-ENV, LANL
> >      >      On Mon, May 16, 2016 at 12:21:41PM -0400, Xiaolong Cui wrote:
> >      >      >    Hi,
> >      >      >    I am using Open MPI 1.8.6. I guess my question is related to the
> >      >      >    flow control algorithm for small messages. The question is how to
> >      >      >    avoid the sender being blocked by the receiver when using the openib
> >      >      >    module for small messages and using blocking send. I have looked
> >      >      >    through this FAQ
> >      >      >    (https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot)
> >      >      >    but didn't find the answer. My understanding of the "eager sending
> >      >      >    protocol" is that if a message is "small", it will be transported to
> >      >      >    the receiver immediately, even if the receiver is not ready. As a
> >      >      >    result, the sender won't be blocked until the receiver posts the
> >      >      >    receive operation.
> >      >      >    I am trying to observe such behavior with a simple program of two
> >      >      >    MPI ranks (attached). My confusion is that while I can see the
> >      >      >    behavior with the "vader" module (shared memory) when running the
> >      >      >    two ranks on the same node,
> >      >      >    [output]
> >      >      >    [0] size = 16, loop = 78, time = 0.00007
> >      >      >    [1] size = 16, loop = 78, time = 3.42426
> >      >      >    [/output]
> >      >      >    I cannot see it when running them on two nodes using the "openib"
> >      >      >    module.
> >      >      >    [output]
> >      >      >    [0] size = 16, loop = 78, time = 3.42627
> >      >      >    [1] size = 16, loop = 78, time = 3.42426
> >      >      >    [/output]
> >      >      >    Does anyone know the reason? My runtime configuration is also
> >      >      >    attached. Thanks!
> >      >      >    Sincerely,
> >      >      >    Michael
> >      >      >    --
> >      >      >    Xiaolong Cui (Michael)
> >      >      >    Department of Computer Science
> >      >      >    Dietrich School of Arts & Science
> >      >      >    University of Pittsburgh
> >      >      >    Pittsburgh, PA 15260
> >      >
> >      >      > btl = openib,vader,self
> >      >      > #btl_base_verbose = 100
> >      >      > btl_openib_use_eager_rdma = 1
> >      >      > btl_openib_eager_limit = 160000
> >      >      > btl_openib_rndv_eager_limit = 160000
> >      >      > btl_openib_max_send_size = 160000
> >      >      > btl_openib_receive_queues = P,128,256,192,64:S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
> >      >
> >      >      > #include "mpi.h"
> >      >      > #include <mpi-ext.h>
> >      >      > #include <stdio.h>
> >      >      > #include <stdlib.h>
> >      >      >
> >      >      > int main(int argc, char *argv[])
> >      >      > {
> >      >      >    int size, rank, psize;
> >      >      >    int loops = 78;
> >      >      >    int length = 4;
> >      >      >    MPI_Init(&argc, &argv);
> >      >      >    MPI_Comm_size(MPI_COMM_WORLD, &size);
> >      >      >    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >      >      >    int *code = (int *)malloc(length * sizeof(int));
> >      >      >    MPI_Status status;
> >      >      >    long long i = 0;
> >      >      >    double time_s = MPI_Wtime();
> >      >      >
> >      >      >    if(rank % 2 == 1)
> >      >      >    {
> >      >      >        int i ;
> >      >      >        int j ;
> >      >      >        double a = 0.3, b = 0.5;
> >      >      >        for(i = 0; i < 30000; i++)
> >      >      >            for(j = 0; j < 30000; j++){
> >      >      >                a = a * 2;
> >      >      >                b = b + a;
> >      >      >            }
> >      >      >    }
> >      >      >
> >      >      >    for(i = 0; i < loops; i++){
> >      >      >        if(rank % 2 == 0){
> >      >      >            MPI_Send(code, length, MPI_INT, rank + 1, 0,
> >      >      MPI_COMM_WORLD);
> >      >      >        }
> >      >      >        else if(rank % 2 == 1){
> >      >      >            MPI_Recv(code, length, MPI_INT, rank - 1, 0,
> >      >      MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> >      >      >        }
> >      >      >    }
> >      >      >    double time_e = MPI_Wtime();
> >      >      >    printf("[%d] size = %d, loop = %d, time = %.5f\n", rank,
> >      length *
> >      >      sizeof(int), loops, time_e - time_s);
> >      >      >
> >      >      >    MPI_Finalize();
> >      >      >    return 0;
> >      >      > }
> >      >      >
> >      >



-- 
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260
