If it is blocking on the first message then it might be blocked by the connection manager. Removing the per-peer queue pair might help in that case.
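For example, the following is just the receive_queues value from your attached mca-params file with the per-peer (P,...) entry dropped, leaving only the shared receive queues. I have not tested it against your setup, so treat it as a sketch:

    btl_openib_receive_queues = S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512

A couple more sketches for the suggestions from my earlier reply (turning off eager_rdma and pre-posting the receive) are appended below the quoted thread.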
-Nathan

On Mon, May 16, 2016 at 10:11:29PM -0400, Xiaolong Cui wrote:
> Hi Nathan,
>
> Thanks for your answer. The "credits" make sense for the purpose of flow
> control. However, the sender in my case will be blocked even for the first
> message. This doesn't seem to be the symptom of running out of credits. Is
> there any reason for this? Also, is there an MCA parameter for the number
> of credits?
>
> Best,
> Michael
>
> On Mon, May 16, 2016 at 6:35 PM, Nathan Hjelm <hje...@lanl.gov> wrote:
>
> > When using eager_rdma the sender will block once it runs out of
> > "credits". If the receiver enters MPI for any reason the incoming
> > messages will be placed in the ob1 unexpected queue and the credits will
> > be returned to the sender. If you turn off eager_rdma you will probably
> > get different results. That said, the unexpected message path is
> > non-optimal and it would be best to ensure a matching receive is posted
> > before the send.
> >
> > Additionally, if you are using InfiniBand I recommend against adding a
> > per-peer queue pair to btl_openib_receive_queues. We have not seen any
> > performance benefit to using per-peer queue pairs and they do not
> > scale.
> >
> > -Nathan Hjelm
> > HPC-ENV, LANL
> >
> > On Mon, May 16, 2016 at 12:21:41PM -0400, Xiaolong Cui wrote:
> > > Hi,
> > >
> > > I am using Open MPI 1.8.6. I guess my question is related to the flow
> > > control algorithm for small messages. The question is how to avoid the
> > > sender being blocked by the receiver when using the openib module for
> > > small messages with blocking sends. I have looked through this FAQ
> > > (https://www.open-mpi.org/faq/?category=openfabrics#ofa-troubleshoot)
> > > but didn't find the answer. My understanding of the "eager sending
> > > protocol" is that if a message is "small", it is transported to the
> > > receiver immediately, even if the receiver is not ready. As a result,
> > > the sender won't be blocked until the receiver posts the receive
> > > operation.
> > >
> > > I am trying to observe such behavior with a simple program of two MPI
> > > ranks (attached). My confusion is that while I can see the behavior
> > > with the "vader" module (shared memory) when running the two ranks on
> > > the same node:
> > >
> > > [output]
> > > [0] size = 16, loop = 78, time = 0.00007
> > > [1] size = 16, loop = 78, time = 3.42426
> > > [/output]
> > >
> > > I cannot see it when running them on two nodes using the "openib"
> > > module:
> > >
> > > [output]
> > > [0] size = 16, loop = 78, time = 3.42627
> > > [1] size = 16, loop = 78, time = 3.42426
> > > [/output]
> > >
> > > Does anyone know the reason? My runtime configuration is also
> > > attached. Thanks!
> > > Sincerely,
> > > Michael
> > >
> > > --
> > > Xiaolong Cui (Michael)
> > > Department of Computer Science
> > > Dietrich School of Arts & Science
> > > University of Pittsburgh
> > > Pittsburgh, PA 15260
> > >
> > > btl = openib,vader,self
> > > #btl_base_verbose = 100
> > > btl_openib_use_eager_rdma = 1
> > > btl_openib_eager_limit = 160000
> > > btl_openib_rndv_eager_limit = 160000
> > > btl_openib_max_send_size = 160000
> > > btl_openib_receive_queues = P,128,256,192,64:S,2048,1024,1008,80:S,12288,1024,1008,80:S,160000,1024,512,512
> > >
> > > #include "mpi.h"
> > > #include <mpi-ext.h>
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > >
> > > int main(int argc, char *argv[])
> > > {
> > >     int size, rank, psize;
> > >     int loops = 78;
> > >     int length = 4;
> > >     MPI_Init(&argc, &argv);
> > >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> > >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> > >     int *code = (int *)malloc(length * sizeof(int));
> > >     MPI_Status status;
> > >     long long i = 0;
> > >     double time_s = MPI_Wtime();
> > >
> > >     if(rank % 2 == 1)
> > >     {
> > >         int i;
> > >         int j;
> > >         double a = 0.3, b = 0.5;
> > >         for(i = 0; i < 30000; i++)
> > >             for(j = 0; j < 30000; j++){
> > >                 a = a * 2;
> > >                 b = b + a;
> > >             }
> > >     }
> > >
> > >     for(i = 0; i < loops; i++){
> > >         if(rank % 2 == 0){
> > >             MPI_Send(code, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
> > >         }
> > >         else if(rank % 2 == 1){
> > >             MPI_Recv(code, length, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
> > >         }
> > >     }
> > >     double time_e = MPI_Wtime();
> > >     printf("[%d] size = %d, loop = %d, time = %.5f\n", rank, length * sizeof(int), loops, time_e - time_s);
> > >
> > >     MPI_Finalize();
> > >     return 0;
> > > }
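A couple of sketches for the suggestions in my earlier reply, in case they help. First, since your mca-params file sets btl_openib_use_eager_rdma = 1, the quickest way to test without eager_rdma is to override it on the command line (adjust the process count and the binary name; ./a.out is just a placeholder):

    mpirun -np 2 --mca btl openib,vader,self --mca btl_openib_use_eager_rdma 0 ./a.out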
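Second, on posting the matching receive before the send: below is a minimal sketch of how the receiver in your attached test could pre-post its receives before entering the compute loop and complete them afterwards, so incoming messages have somewhere to land while the receiver is busy. It keeps your loop count and message length, and it uses one buffer per outstanding receive because a buffer must not be reused while a nonblocking receive on it is still pending. Treat it as illustrative, not a drop-in replacement.

/* Sketch: same send/recv pattern as the attached test, but the receiver
 * pre-posts all of its receives before it starts the compute loop. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank;
    const int loops = 78;
    const int length = 4;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* one buffer region per outstanding receive */
    int *buf = malloc((size_t)loops * length * sizeof(int));
    MPI_Request *reqs = malloc((size_t)loops * sizeof(MPI_Request));

    double t0 = MPI_Wtime();

    if (rank % 2 == 1) {
        /* receiver: post everything up front ... */
        for (int i = 0; i < loops; i++)
            MPI_Irecv(buf + (size_t)i * length, length, MPI_INT,
                      rank - 1, 0, MPI_COMM_WORLD, &reqs[i]);

        /* ... then do the artificial compute delay ... */
        double a = 0.3, b = 0.5;
        for (int i = 0; i < 30000; i++)
            for (int j = 0; j < 30000; j++) {
                a = a * 2;
                b = b + a;
            }

        /* ... and complete the receives afterwards */
        MPI_Waitall(loops, reqs, MPI_STATUSES_IGNORE);
    } else {
        /* sender: same blocking sends as the original test */
        for (int i = 0; i < loops; i++)
            MPI_Send(buf, length, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    }

    double t1 = MPI_Wtime();
    printf("[%d] size = %zu, loop = %d, time = %.5f\n",
           rank, length * sizeof(int), loops, t1 - t0);

    free(buf);
    free(reqs);
    MPI_Finalize();
    return 0;
}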