I don't know of any documentation on the connection manager other than
what is in the code and in my head. I rewrote a lot of that code in 2.x,
so you might want to try out the latest 2.x tarball from
https://www.open-mpi.org/software/ompi/v2.x/

I know the per-peer queue pair will prevent totally asynchronous
connections even in 2.x, but an SRQ/XRC-only configuration should work.
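
For example (untested, and simply adapted from the btl_openib_receive_queues
value posted earlier in this thread), dropping the per-peer "P" entry leaves
an SRQ-only setting:

btl_openib_receive_queues = S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512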

-Nathan

On Tue, May 17, 2016 at 11:31:01AM -0400, Xiaolong Cui wrote:
>    I think it is the connection manager that blocks the first message. If I
>    add a pair of send/recv at the very beginning, the problem is gone, but
>    removing the per-peer queue pair does not help.
>    Do you know of any document that discusses the Open MPI internals,
>    especially the parts related to this problem?
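
A minimal version of that "pair of send/recv at the very beginning" warm-up
might look like the sketch below (untested; the helper name, message tag, and
dummy buffer are illustrative, not from the original post). It would be called
right after MPI_Init in the test program quoted further down, so the openib
connection for each even/odd rank pair is established before the timed loop.

    #include <mpi.h>

    /* Exchange one tiny message between each even/odd rank pair so the
     * connection is set up before any timed communication begins. */
    static void warm_up_pair(void)
    {
        int rank, dummy = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank % 2 == 0)
            MPI_Send(&dummy, 1, MPI_INT, rank + 1, 999, MPI_COMM_WORLD);
        else
            MPI_Recv(&dummy, 1, MPI_INT, rank - 1, 999, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
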
>    On Tue, May 17, 2016 at 11:00 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
> 
>      If it is blocking on the first message then it might be blocked by the
>      connection manager. Removing the per-peer queue pair might help in that
>      case.
> 
>      -Nathan
>      On Mon, May 16, 2016 at 10:11:29PM -0400, Xiaolong Cui wrote:
>      >    Hi Nathan,
>      >    Thanks for your answer.
>      >    The "credits" make sense for the purpose of flow control. However, the
>      >    sender in my case is blocked even for the first message, which doesn't
>      >    seem to be a symptom of running out of credits. Is there any reason for
>      >    this? Also, is there an MCA parameter for the number of credits?
>      >    Best,
>      >    Michael
>      >    On Mon, May 16, 2016 at 6:35 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>      >
>      >      When using eager_rdma the sender will block once it runs out of
>      >      "credits". If the receiver enters MPI for any reason the incoming
>      >      messages will be placed in the ob1 unexpected queue and the credits
>      >      will be returned to the sender. If you turn off eager_rdma you will
>      >      probably get different results. That said, the unexpected message
>      >      path is non-optimal and it would be best to ensure a matching
>      >      receive is posted before the send.
>      >
>      >      Additionally, if you are using infiniband I recommend against
>      >      adding a per-peer queue pair to btl_openib_receive_queues. We have
>      >      not seen any performance benefit to using per-peer queue pairs and
>      >      they do not scale.
>      >
>      >      -Nathan Hjelm
>      >      HPC-ENV, LANL
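
(A concrete way to try the "turn off eager_rdma" suggestion above, untested:
change the corresponding line in the MCA parameter file quoted at the end of
this thread to

btl_openib_use_eager_rdma = 0

or pass it at launch time with "mpirun --mca btl_openib_use_eager_rdma 0 ...".)
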
>      >      On Mon, May 16, 2016 at 12:21:41PM -0400, Xiaolong Cui wrote:
>      >      >    Hi,
>      >      >    I am using Open MPI 1.8.6. I guess my question is related to the
>      >      >    flow control algorithm for small messages: how do I avoid the
>      >      >    sender being blocked by the receiver when sending small messages
>      >      >    with a blocking send over the openib module? I have looked through
>      >      >    this FAQ
>      >      >    (https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot)
>      >      >    but didn't find the answer. My understanding of the "eager sending
>      >      >    protocol" is that if a message is "small", it is transported to the
>      >      >    receiver immediately, even if the receiver is not ready. As a
>      >      >    result, the sender should not be blocked waiting for the receiver
>      >      >    to post the receive operation.
>      >      >
>      >      >    I am trying to observe this behavior with a simple program of two
>      >      >    MPI ranks (attached). My confusion is that I can see the behavior
>      >      >    with the "vader" module (shared memory) when running the two ranks
>      >      >    on the same node,
>      >      >
>      >      >    [output]
>      >      >    [0] size = 16, loop = 78, time = 0.00007
>      >      >    [1] size = 16, loop = 78, time = 3.42426
>      >      >    [/output]
>      >      >
>      >      >    but I cannot see it when running them on two nodes using the
>      >      >    "openib" module.
>      >      >
>      >      >    [output]
>      >      >    [0] size = 16, loop = 78, time = 3.42627
>      >      >    [1] size = 16, loop = 78, time = 3.42426
>      >      >    [/output]
>      >      >
>      >      >    Does anyone know the reason? My runtime configuration is also
>      >      >    attached. Thanks!
>      >      >    Sincerely,
>      >      >    Michael
>      >      >    --
>      >      >    Xiaolong Cui (Michael)
>      >      >    Department of Computer Science
>      >      >    Dietrich School of Arts & Science
>      >      >    University of Pittsburgh
>      >      >    Pittsburgh, PA 15260
>      >
>      >      > btl = openib,vader,self
>      >      > #btl_base_verbose = 100
>      >      > btl_openib_use_eager_rdma = 1
>      >      > btl_openib_eager_limit = 160000
>      >      > btl_openib_rndv_eager_limit = 160000
>      >      > btl_openib_max_send_size = 160000
>      >      > btl_openib_receive_queues = P,128,256,192,64:S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
>      >
>      >      > #include "mpi.h"
>      >      > #include <mpi-ext.h>
>      >      > #include <stdio.h>
>      >      > #include <stdlib.h>
>      >      >
>      >      > int main(int argc, char *argv[])
>      >      > {
>      >      >     int size, rank;
>      >      >     int loops = 78;
>      >      >     int length = 4;
>      >      >     MPI_Init(&argc, &argv);
>      >      >     MPI_Comm_size(MPI_COMM_WORLD, &size);
>      >      >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>      >      >     int *code = (int *)malloc(length * sizeof(int));
>      >      >     long long i = 0;
>      >      >     double time_s = MPI_Wtime();
>      >      >
>      >      >     /* Odd ranks burn CPU time before entering the receive loop, so
>      >      >      * the sender's eager messages arrive before any matching receive
>      >      >      * has been posted. */
>      >      >     if (rank % 2 == 1) {
>      >      >         int j, k;
>      >      >         double a = 0.3, b = 0.5;
>      >      >         for (j = 0; j < 30000; j++)
>      >      >             for (k = 0; k < 30000; k++) {
>      >      >                 a = a * 2;
>      >      >                 b = b + a;
>      >      >             }
>      >      >     }
>      >      >
>      >      >     /* Even ranks send; odd ranks receive. */
>      >      >     for (i = 0; i < loops; i++) {
>      >      >         if (rank % 2 == 0) {
>      >      >             MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
>      >      >         } else {
>      >      >             MPI_Recv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
>      >      >                      MPI_STATUS_IGNORE);
>      >      >         }
>      >      >     }
>      >      >     double time_e = MPI_Wtime();
>      >      >     printf("[%d] size = %zu, loop = %d, time = %.5f\n", rank,
>      >      >            length * sizeof(int), loops, time_e - time_s);
>      >      >
>      >      >     free(code);
>      >      >     MPI_Finalize();
>      >      >     return 0;
>      >      > }
>      >      >
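
To illustrate the earlier advice about posting the matching receive before the
send, here is an untested sketch of a variant of the program above in which the
receiving rank pre-posts all of its receives with MPI_Irecv before the
artificial compute loop and waits for them afterwards. The separate receive
buffer and request array are my additions, not part of the original program.

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int rank, i;
        const int loops = 78, length = 4;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int *code = (int *)malloc(length * sizeof(int));
        double time_s = MPI_Wtime();

        if (rank % 2 == 1) {
            /* Pre-post one receive per expected message, each into its own
             * slice of a larger buffer, so incoming eager messages always
             * find a matching receive instead of landing in the unexpected
             * queue. */
            int *inbox = (int *)malloc((size_t)loops * length * sizeof(int));
            MPI_Request *reqs = (MPI_Request *)malloc(loops * sizeof(MPI_Request));
            for (i = 0; i < loops; i++)
                MPI_Irecv(inbox + (size_t)i * length, length, MPI_INT, rank - 1, 0,
                          MPI_COMM_WORLD, &reqs[i]);

            /* Same artificial delay as in the original program. */
            int j, k;
            double a = 0.3, b = 0.5;
            for (j = 0; j < 30000; j++)
                for (k = 0; k < 30000; k++) {
                    a = a * 2;
                    b = b + a;
                }

            MPI_Waitall(loops, reqs, MPI_STATUSES_IGNORE);
            free(reqs);
            free(inbox);
        } else {
            for (i = 0; i < loops; i++)
                MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
        }

        double time_e = MPI_Wtime();
        printf("[%d] size = %zu, loop = %d, time = %.5f\n", rank,
               length * sizeof(int), loops, time_e - time_s);

        free(code);
        MPI_Finalize();
        return 0;
    }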
>      >
>      >    --
>      >    Xiaolong Cui (Michael)
>      >    Department of Computer Science
>      >    Dietrich School of Arts & Science
>      >    University of Pittsburgh
>      >    Pittsburgh, PA 15260
> 
>    --
>    Xiaolong Cui (Michael)
>    Department of Computer Science
>    Dietrich School of Arts & Science
>    University of Pittsburgh
>    Pittsburgh, PA 15260
