Sorry, the figures do not display. They are attached to this message.

On Wed, May 18, 2016 at 3:24 PM, Xiaolong Cui <sunshine...@gmail.com> wrote:

> Hi Nathan,
>
> I have one more question. I am measuring the number of messages that can be
> eagerly sent with a given SRQ. Again, as illustrated below, my program has
> two ranks: rank 0 sends a variable number (*n*) of messages to rank 1, which
> is not ready to receive.
>
> [image: Inline image 1]
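>
> In case the figure does not come through, here is a rough sketch of the
> pattern (illustrative only; in the real code n and the receiver's delay are
> varied, and the delay is a compute loop rather than sleep()):
>
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(int argc, char *argv[])
>     {
>         int rank, i, n = 127, buf[4] = {0};
>         MPI_Init(&argc, &argv);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         if (rank == 0) {                 /* time n small eager sends */
>             double t0 = MPI_Wtime();
>             for (i = 0; i < n; i++)
>                 MPI_Send(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
>             printf("n = %d, send time = %f\n", n, MPI_Wtime() - t0);
>         } else {
>             sleep(30);                   /* rank 1 is not ready to receive */
>             for (i = 0; i < n; i++)
>                 MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD,
>                          MPI_STATUS_IGNORE);
>         }
>         MPI_Finalize();
>         return 0;
>     }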
>
> I measured the time it takes rank 0 to send out all the messages, and,
> surprisingly, the result looks like the plot below. Do you know why the time
> drops at n=127? The SRQ is simply btl_openib_receive_queues = S,2048,512,494,80
> [image: Inline image 2]
>
>
> On Tue, May 17, 2016 at 11:49 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
>>
>> I don't know of any documentation on the connection manager other than
>> what is in the code and in my head. I rewrote a lot of the code in 2.x
>> so you might want to try out the latest 2.x tarball from
>> https://www.open-mpi.org/software/ompi/v2.x/
>>
>> I know the per-peer queue pair will prevent totally asynchronous
>> connections even in 2.x, but an SRQ/XRC-only configuration should work.
>>
>> -Nathan
>>
>> On Tue, May 17, 2016 at 11:31:01AM -0400, Xiaolong Cui wrote:
>> >    I think it is the connection manager that blocks the first message. If I
>> >    add a pair of send/recv at the very beginning, the problem is gone, but
>> >    removing the per-peer queue pair does not help.
>> >    Do you know of any document that discusses the Open MPI internals,
>> >    especially the parts related to this problem?
>> >    On Tue, May 17, 2016 at 11:00 AM, Nathan Hjelm <hje...@lanl.gov> wrote:
>> >
>> >      If it is blocking on the first message then it might be blocked by the
>> >      connection manager. Removing the per-peer queue pair might help in that
>> >      case.
>> >
>> >      -Nathan
>> >      On Mon, May 16, 2016 at 10:11:29PM -0400, Xiaolong Cui wrote:
>> >      >    Hi Nathan,
>> >      >    Thanks for your answer.
>> >      >    The "credits" make sense for the purpose of flow control. However, the
>> >      >    sender in my case is blocked even for the first message. This doesn't
>> >      >    seem to be a symptom of running out of credits. Is there any reason for
>> >      >    this? Also, is there an MCA parameter for the number of credits?
>> >      >    Best,
>> >      >    Michael
>> >      >    On Mon, May 16, 2016 at 6:35 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>> >      >
>> >      >      When using eager_rdma the sender will block once it runs out of
>> >      >      "credits". If the receiver enters MPI for any reason, the incoming
>> >      >      messages will be placed in the ob1 unexpected queue and the credits
>> >      >      will be returned to the sender. If you turn off eager_rdma you will
>> >      >      probably get different results. That said, the unexpected-message path
>> >      >      is non-optimal and it would be best to ensure a matching receive is
>> >      >      posted before the send.
>> >      >
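>> >      >      For example, a minimal sketch of the "post the receive first" idea
>> >      >      (buffer names as in your test program; do_long_computation() is an
>> >      >      illustrative placeholder for the compute loop):
>> >      >
>> >      >          MPI_Request req;
>> >      >          /* receiver: post the matching receive before the long compute */
>> >      >          MPI_Irecv(code, length, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
>> >      >          do_long_computation();
>> >      >          MPI_Wait(&req, MPI_STATUS_IGNORE);
>> >      >
>> >      >      And to compare with eager RDMA disabled, you can run with something
>> >      >      like "--mca btl_openib_use_eager_rdma 0".
>> >      >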
>> >      >      Additionally, if you are using InfiniBand I recommend against adding
>> >      >      a per-peer queue pair to btl_openib_receive_queues. We have not seen
>> >      >      any performance benefit from per-peer queue pairs and they do not
>> >      >      scale.
>> >      >
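>> >      >      With the receive_queues value from your configuration, that would
>> >      >      mean dropping the leading per-peer (P) entry, e.g. something like:
>> >      >
>> >      >          btl_openib_receive_queues = S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
>> >      >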
>> >      >      -Nathan Hjelm
>> >      >      HPC-ENV, LANL
>> >      >      On Mon, May 16, 2016 at 12:21:41PM -0400, Xiaolong Cui wrote:
>> >      >      >    Hi,
>> >      >      >    I am using Open MPI 1.8.6. I guess my question is related to the flow
>> >      >      >    control algorithm for small messages: how can I avoid the sender being
>> >      >      >    blocked by the receiver when using the openib module with small
>> >      >      >    messages and blocking sends? I have looked through this FAQ
>> >      >      >    (https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot)
>> >      >      >    but didn't find the answer. My understanding of the "eager sending
>> >      >      >    protocol" is that if a message is "small", it is transported to the
>> >      >      >    receiver immediately, even if the receiver is not ready. As a result,
>> >      >      >    the sender does not have to wait for the receiver to post the receive
>> >      >      >    operation.
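>> >      >      >    (The eager-related limits in effect can be inspected with something
>> >      >      >    like "ompi_info --param btl openib --level 9 | grep eager".)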
>> >      >      >    I am trying to observe this behavior with a simple program of two MPI
>> >      >      >    ranks (attached). My confusion is that I can see the behavior with the
>> >      >      >    "vader" module (shared memory) when running the two ranks on the same
>> >      >      >    node:
>> >      >      >    [output]
>> >      >      >
>> >      >      >    [0] size = 16, loop = 78, time = 0.00007
>> >      >      >
>> >      >      >    [1] size = 16, loop = 78, time = 3.42426
>> >      >      >
>> >      >      >    [/output]
>> >      >      >    but I cannot see it when running them on two nodes using the "openib"
>> >      >      >    module:
>> >      >      >    [output]
>> >      >      >
>> >      >      >    [0] size = 16, loop = 78, time = 3.42627
>> >      >      >
>> >      >      >    [1] size = 16, loop = 78, time = 3.42426
>> >      >      >
>> >      >      >    [/output]
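>> >      >      >    (A two-node run can be forced with something like
>> >      >      >    "mpirun -np 2 --map-by node ./a.out"; --map-by node spreads the two
>> >      >      >    ranks across nodes.)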
>> >      >      >    Does anyone know the reason? My runtime configuration is also
>> >      >      >    attached. Thanks!
>> >      >      >    Sincerely,
>> >      >      >    Michael
>> >      >      >    --
>> >      >      >    Xiaolong Cui (Michael)
>> >      >      >    Department of Computer Science
>> >      >      >    Dietrich School of Arts & Science
>> >      >      >    University of Pittsburgh
>> >      >      >    Pittsburgh, PA 15260
>> >      >
>> >      >      > btl = openib,vader,self
>> >      >      > #btl_base_verbose = 100
>> >      >      > btl_openib_use_eager_rdma = 1
>> >      >      > btl_openib_eager_limit = 160000
>> >      >      > btl_openib_rndv_eager_limit = 160000
>> >      >      > btl_openib_max_send_size = 160000
>> >      >      > btl_openib_receive_queues = P,128,256,192,64:S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
>> >      >
>> >      >      > #include "mpi.h"
>> >      >      > #include <mpi-ext.h>
>> >      >      > #include <stdio.h>
>> >      >      > #include <stdlib.h>
>> >      >      >
>> >      >      > int main(int argc, char *argv[])
>> >      >      > {
>> >      >      >     int size, rank;
>> >      >      >     int loops = 78;
>> >      >      >     int length = 4;
>> >      >      >     MPI_Init(&argc, &argv);
>> >      >      >     MPI_Comm_size(MPI_COMM_WORLD, &size);
>> >      >      >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>> >      >      >     int *code = (int *)malloc(length * sizeof(int));
>> >      >      >     long long i = 0;
>> >      >      >     double time_s = MPI_Wtime();
>> >      >      >
>> >      >      >     /* delay odd (receiving) ranks so they are not ready when the
>> >      >      >        sends arrive */
>> >      >      >     if (rank % 2 == 1) {
>> >      >      >         int i, j;
>> >      >      >         double a = 0.3, b = 0.5;
>> >      >      >         for (i = 0; i < 30000; i++)
>> >      >      >             for (j = 0; j < 30000; j++) {
>> >      >      >                 a = a * 2;
>> >      >      >                 b = b + a;
>> >      >      >             }
>> >      >      >     }
>> >      >      >
>> >      >      >     /* even ranks send small messages; odd ranks receive them */
>> >      >      >     for (i = 0; i < loops; i++) {
>> >      >      >         if (rank % 2 == 0) {
>> >      >      >             MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
>> >      >      >         } else if (rank % 2 == 1) {
>> >      >      >             MPI_Recv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
>> >      >      >                      MPI_STATUS_IGNORE);
>> >      >      >         }
>> >      >      >     }
>> >      >      >     double time_e = MPI_Wtime();
>> >      >      >     printf("[%d] size = %d, loop = %d, time = %.5f\n", rank,
>> >      >      >            (int)(length * sizeof(int)), loops, time_e - time_s);
>> >      >      >
>> >      >      >     free(code);
>> >      >      >     MPI_Finalize();
>> >      >      >     return 0;
>> >      >      > }
>> >      >      >
>> >      >
>> >      >    --
>> >      >    Xiaolong Cui (Michael)
>> >      >    Department of Computer Science
>> >      >    Dietrich School of Arts & Science
>> >      >    University of Pittsburgh
>> >      >    Pittsburgh, PA 15260
>> >
>> >    --
>> >    Xiaolong Cui (Michael)
>> >    Department of Computer Science
>> >    Dietrich School of Arts & Science
>> >    University of Pittsburgh
>> >    Pittsburgh, PA 15260
>>
>
> --
> Xiaolong Cui (Michael)
> Department of Computer Science
> Dietrich School of Arts & Science
> University of Pittsburgh
> Pittsburgh, PA 15260
>



-- 
Xiaolong Cui (Michael)
Department of Computer Science
Dietrich School of Arts & Science
University of Pittsburgh
Pittsburgh, PA 15260
