Today I tried a test where all sender and receiver processes were located on nodes on the same blade chassis. With this process deployment, I did not detect any delayed send/receive processing, all tests ran as expected!!!
In the failing test below, the sender (proc39) was on the second blade chassis and the receiver (proc28) was on the first blade chassis. So I must have some problem in the Infiniband fabric? Each chassis has a switch to which all blades are internally connected. The two switches are connected by 4 external cables. Are there any tests I can run to diagnose why there are intermittent connection problems between the switches? From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of George Bosilca Sent: Friday, November 09, 2012 6:02 PM To: Open MPI Users Subject: EXTERNAL: Re: [OMPI users] gathering problem Gary, Looking at the trace it become clear that the culprit is the receiving process and not the sending one. Let me walk you over. I split the trace in several groups. The first one will look at the sender (proc39) and the receiver (proc28) for the first message. PROC39: 59.2268409810; SEND_REQ_ACTIVATE, count=81536, peer=28 PROC39: 59.2268649811; SEND_REQ_XFER_BEGIN, count=81536, peer=28 PROC39: 59.2270019811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28 PROC39: 59.2270509811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28 PROC39: 59.2270809811; SEND_REQ_XFER_END, count=81536, peer=28 PROC39: 59.2270869811; SEND_REQ_COMPLETE, count=81536, peer=28 PROC39: 59.2270939811; SEND_REQ_NOTIFY, count=81536, peer=28 PROC28: 59.2271299611; RECV_MSG_ARRIVED, count=0, peer=39 PROC28: 59.2271389612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39 PROC28: 59.2271439611; RECV_REQ_REMOVE_FROM_POSTED_Q, count=172084, peer=39 PROC28: 59.2271479612; RECV_MSG_MATCH_POSTED_REQ, count=172084, peer=39 PROC28: 59.2271519612; RECV_SEARCH_POSTED_Q_END, count=0, peer=39 PROC28: 59.2271559612; RECV_REQ_XFER_BEGIN, count=172084, peer=39 PROC28: 59.2271609611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39 PROC28: 59.2271759611; RECV_REQ_XFER_CONTINUE, count=54358, peer=39 PROC28: 59.4331819612; RECV_REQ_XFER_END, count=172084, peer=39 PROC28: 59.4332009612; RECV_REQ_COMPLETE, count=172084, peer=39 You can see that the sender send the matching information (27178 bytes line 2). This message arrives on the receiver at line 1. There seems to be a small time drift between the two, so don't use the absolute time to compare. The first piece of data (aka 27178 bytes) is then copied on the receiver memory (line 7). As the pipeline protocol is on, the receiver ask for the second piece of data right after (line 8). The receiver will detect the completion of the last piece of data transfer at line 9, which is about 1.8 ms later. On the sender side, don't take in account the completion time, it is irrelevant, as it is only the local reception (and thus barely indicate the cost of the local memcpy). Now on to the second communication: PROC39: 59.2283769811; SEND_REQ_ACTIVATE, count=81536, peer=28 PROC39: 59.2283959811; SEND_REQ_XFER_BEGIN, count=81536, peer=28 PROC39: 59.4331379811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28 PROC39: 59.4332049811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28 PROC39: 59.4332569811; SEND_REQ_XFER_END, count=81536, peer=28 PROC39: 59.4332659811; SEND_REQ_COMPLETE, count=81536, peer=28 PROC39: 59.4332769811; SEND_REQ_NOTIFY, count=81536, peer=28 PROC28: 59.4332059611; RECV_MSG_ARRIVED, count=0, peer=39 PROC28: 59.4332109612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39 PROC28: 59.4332149612; RECV_MSG_INSERT_IN_UNEX_Q, count=0, peer=39 PROC28: 59.4332199611; RECV_SEARCH_POSTED_Q_END, count=0, peer=39 PROC28: 59.4332269612; RECV_REQ_NOTIFY, count=172084, peer=39 PROC28: 59.4332539611; RECV_REQ_ACTIVATE, count=172084, peer=39 PROC28: 59.4332589612; RECV_SEARCH_UNEX_Q_BEGIN, count=172084, peer=39 PROC28: 59.4332629612; RECV_REQ_MATCH_UNEX, count=172084, peer=39 PROC28: 59.4332669611; RECV_MSG_REMOVE_FROM_UNEX_Q, count=0, peer=39 PROC28: 59.4332719612; RECV_SEARCH_UNEX_Q_END, count=172084, peer=39 PROC28: 59.4332759612; RECV_REQ_XFER_BEGIN, count=172084, peer=39 PROC28: 59.4332809611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39 Here the things are getting interesting. You can see that the matching information of the second send (line 1 on the receiving side) is received before the first receive is reported back to the MPI layer (line 5). In other words, the second receive is __unexpected__ (line 3 clearly highlight this). Later on when the MPI irecv is posted, it is matched (line 8) and then the confirmation is sent to the sender at line 11. Take a look at the time, this explains why the second send look so slow. If there is something that should be understood better is the lag on the receiver for the detection of the completion of the first operation (as highlighted by the following two lines): PROC28: 59.2271759611; RECV_REQ_XFER_CONTINUE, count=54358, peer=39 PROC28: 59.4331819612; RECV_REQ_XFER_END, count=172084, peer=39 george. On Nov 9, 2012, at 12:36 , "Hodge, Gary C" <gary.c.ho...@lmco.com<mailto:gary.c.ho...@lmco.com>> wrote: Answering my own question, I have downloaded openMPI 1.6.2 and still get the delay in the MPI_Send. Previously, I was using openMPI 1.4.1 I configured 1.6.2 with -enable-peruse and have implemented PERUSE callbacks. Here is a trace of gb2 (PROC39) sending messages to ob (PROC28) PROC39: 59.2268409810; SEND_REQ_ACTIVATE, count=81536, peer=28 PROC39: 59.2268649811; SEND_REQ_XFER_BEGIN, count=81536, peer=28 PROC39: 59.2270019811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28 PROC39: 59.2270509811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28 PROC39: 59.2270809811; SEND_REQ_XFER_END, count=81536, peer=28 PROC39: 59.2270869811; SEND_REQ_COMPLETE, count=81536, peer=28 PROC39: 59.2270939811; SEND_REQ_NOTIFY, count=81536, peer=28 PROC28: 59.2271299611; RECV_MSG_ARRIVED, count=0, peer=39 PROC28: 59.2271389612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39 PROC28: 59.2271439611; RECV_REQ_REMOVE_FROM_POSTED_Q, count=172084, peer=39 PROC28: 59.2271479612; RECV_MSG_MATCH_POSTED_REQ, count=172084, peer=39 PROC28: 59.2271519612; RECV_SEARCH_POSTED_Q_END, count=0, peer=39 PROC28: 59.2271559612; RECV_REQ_XFER_BEGIN, count=172084, peer=39 PROC28: 59.2271609611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39 PROC28: 59.2271759611; RECV_REQ_XFER_CONTINUE, count=54358, peer=39 PROC39: 59.2283769811; SEND_REQ_ACTIVATE, count=81536, peer=28 PROC39: 59.2283959811; SEND_REQ_XFER_BEGIN, count=81536, peer=28 PROC39: 59.4331379811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28 PROC28: 59.4331819612; RECV_REQ_XFER_END, count=172084, peer=39 PROC28: 59.4332009612; RECV_REQ_COMPLETE, count=172084, peer=39 PROC39: 59.4332049811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28 PROC28: 59.4332059611; RECV_MSG_ARRIVED, count=0, peer=39 PROC28: 59.4332109612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39 PROC28: 59.4332149612; RECV_MSG_INSERT_IN_UNEX_Q, count=0, peer=39 PROC28: 59.4332199611; RECV_SEARCH_POSTED_Q_END, count=0, peer=39 PROC28: 59.4332269612; RECV_REQ_NOTIFY, count=172084, peer=39 PROC28: 59.4332539611; RECV_REQ_ACTIVATE, count=172084, peer=39 PROC39: 59.4332569811; SEND_REQ_XFER_END, count=81536, peer=28 PROC28: 59.4332589612; RECV_SEARCH_UNEX_Q_BEGIN, count=172084, peer=39 PROC28: 59.4332629612; RECV_REQ_MATCH_UNEX, count=172084, peer=39 PROC39: 59.4332659811; SEND_REQ_COMPLETE, count=81536, peer=28 PROC28: 59.4332669611; RECV_MSG_REMOVE_FROM_UNEX_Q, count=0, peer=39 PROC28: 59.4332719612; RECV_SEARCH_UNEX_Q_END, count=172084, peer=39 PROC28: 59.4332759612; RECV_REQ_XFER_BEGIN, count=172084, peer=39 PROC39: 59.4332769811; SEND_REQ_NOTIFY, count=81536, peer=28 PROC28: 59.4332809611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39 PROC39: 59.4332849811; ERROR: component gb2 exceeded send time limit by 0.104915 seconds. PROC39: 59.4332849811; WARNING: component gb2 increased page faults R=130898,0; P=130898,0; S=132207,0 Notice the 205 ms delay between the hi-lighted SEND_REQ_XFER_BEGIN and the subsequent SEND_REQ_XFER_CONTINUE Also notice that there was no such delay in the previous send request. The last two lines are my own debug that monitor excessive time spent in an MPI_Send and monitor increasing number of page faults. After the delaying MPI_Send, the page faults have increased by 1309 (132207 - 130898) !!! I looked at the PML code (pml_ob1_sendreq.c) and I suspect that something is going awry in the mca_bml_base_prepare_src function. I believe that this is the function in which the openib BTL makes calls to register the user memory. This delay in the MPI_Send kills any hope of meeting our real-time requirements, so any help is greatly appreciated. From: Hodge, Gary C Sent: Monday, November 05, 2012 12:27 PM To: us...@open-mpi.org<mailto:us...@open-mpi.org> Subject: gathering problem I continue to have a problem where 2 processes are sending to the same process and one of the sending processes hangs for 150 to 550 ms in the call to MPI_Send. Each process runs on a different node and the receiving process has posted an MPI_Irecv 17 ms before the hanging send. The posted receives are for 172K buffers and the sending processes are sending 81K size messages. I have set mpi_leave_pinned to 1 and have increased the btl_openib_receive_queues to ...:S,65536,512,256,64 How do I trace the various phases of message passing to diagnose where the send is hanging up? _______________________________________________ users mailing list us...@open-mpi.org<mailto:us...@open-mpi.org> http://www.open-mpi.org/mailman/listinfo.cgi/users