Re: [OMPI users] EXTERNAL: Re: gathering problem

Hodge, Gary C Mon, 12 Nov 2012 17:15:16 -0500

Today I tried a test where all sender and receiver processes were located on 
nodes on the same blade chassis.
With this process deployment, I did not detect any delayed send/receive 
processing, all tests ran as expected!!!


In the failing test below, the sender (proc39) was on the second blade chassis 
and the receiver (proc28) was on the first blade chassis.

So I must have some problem in the Infiniband fabric?

Each chassis has a switch to which all blades are internally connected.  The 
two switches are connected by 4 external cables.

Are there any tests I can run to diagnose why there are intermittent connection 
problems between the switches?


From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of George Bosilca
Sent: Friday, November 09, 2012 6:02 PM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] gathering problem

Gary,

Looking at the trace it become clear that the culprit is the receiving process 
and not the sending one.

Let me walk you over. I split the trace in several groups. The first one will 
look at the sender (proc39) and the receiver (proc28) for the first message.

PROC39: 59.2268409810; SEND_REQ_ACTIVATE, count=81536, peer=28
PROC39: 59.2268649811; SEND_REQ_XFER_BEGIN, count=81536, peer=28
PROC39: 59.2270019811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28
PROC39: 59.2270509811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28
PROC39: 59.2270809811; SEND_REQ_XFER_END, count=81536, peer=28
PROC39: 59.2270869811; SEND_REQ_COMPLETE, count=81536, peer=28
PROC39: 59.2270939811; SEND_REQ_NOTIFY, count=81536, peer=28

PROC28: 59.2271299611; RECV_MSG_ARRIVED, count=0, peer=39
PROC28: 59.2271389612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39
PROC28: 59.2271439611; RECV_REQ_REMOVE_FROM_POSTED_Q, count=172084, peer=39
PROC28: 59.2271479612; RECV_MSG_MATCH_POSTED_REQ, count=172084, peer=39
PROC28: 59.2271519612; RECV_SEARCH_POSTED_Q_END, count=0, peer=39
PROC28: 59.2271559612; RECV_REQ_XFER_BEGIN, count=172084, peer=39
PROC28: 59.2271609611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39
PROC28: 59.2271759611; RECV_REQ_XFER_CONTINUE, count=54358, peer=39
PROC28: 59.4331819612; RECV_REQ_XFER_END, count=172084, peer=39
PROC28: 59.4332009612; RECV_REQ_COMPLETE, count=172084, peer=39

You can see that the sender send the matching information (27178 bytes line 2). 
This message arrives on the receiver at line 1. There seems to be a small time 
drift between the two, so don't use the absolute time to compare. The first 
piece of data (aka 27178 bytes) is then copied on the receiver memory (line 7). 
As the pipeline protocol is on, the receiver ask for the second piece of data 
right after (line 8). The receiver will detect the completion of the last piece 
of data transfer at line 9, which is about 1.8 ms later.

On the sender side, don't take in account the completion time, it is 
irrelevant, as it is only the local reception (and thus barely indicate the 
cost of the local memcpy).

Now on to the second communication:

PROC39: 59.2283769811; SEND_REQ_ACTIVATE, count=81536, peer=28
PROC39: 59.2283959811; SEND_REQ_XFER_BEGIN, count=81536, peer=28
PROC39: 59.4331379811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28
PROC39: 59.4332049811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28
PROC39: 59.4332569811; SEND_REQ_XFER_END, count=81536, peer=28
PROC39: 59.4332659811; SEND_REQ_COMPLETE, count=81536, peer=28
PROC39: 59.4332769811; SEND_REQ_NOTIFY, count=81536, peer=28

PROC28: 59.4332059611; RECV_MSG_ARRIVED, count=0, peer=39
PROC28: 59.4332109612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39
PROC28: 59.4332149612; RECV_MSG_INSERT_IN_UNEX_Q, count=0, peer=39
PROC28: 59.4332199611; RECV_SEARCH_POSTED_Q_END, count=0, peer=39
PROC28: 59.4332269612; RECV_REQ_NOTIFY, count=172084, peer=39
PROC28: 59.4332539611; RECV_REQ_ACTIVATE, count=172084, peer=39
PROC28: 59.4332589612; RECV_SEARCH_UNEX_Q_BEGIN, count=172084, peer=39
PROC28: 59.4332629612; RECV_REQ_MATCH_UNEX, count=172084, peer=39
PROC28: 59.4332669611; RECV_MSG_REMOVE_FROM_UNEX_Q, count=0, peer=39
PROC28: 59.4332719612; RECV_SEARCH_UNEX_Q_END, count=172084, peer=39
PROC28: 59.4332759612; RECV_REQ_XFER_BEGIN, count=172084, peer=39
PROC28: 59.4332809611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39

Here the things are getting interesting. You can see that the matching 
information of the second send (line 1 on the receiving side) is received 
before the first receive is reported back to the MPI layer (line 5). In other 
words, the second receive is __unexpected__ (line 3 clearly highlight this). 
Later on when the MPI irecv is posted, it is matched (line 8) and then the 
confirmation is sent to the sender at line 11. Take a look at the time, this 
explains why the second send look so slow.

If there is something that should be understood better is the lag on the 
receiver for the detection of the completion of the first operation (as 
highlighted by the following two lines):

PROC28: 59.2271759611; RECV_REQ_XFER_CONTINUE, count=54358, peer=39
PROC28: 59.4331819612; RECV_REQ_XFER_END, count=172084, peer=39

  george.


On Nov 9, 2012, at 12:36 , "Hodge, Gary C" 
<gary.c.ho...@lmco.com<mailto:gary.c.ho...@lmco.com>> wrote:


Answering my own question, I have downloaded openMPI 1.6.2 and still get the 
delay in the MPI_Send.  Previously, I was using openMPI 1.4.1

I configured 1.6.2 with -enable-peruse and have implemented PERUSE callbacks.
Here is a trace of gb2 (PROC39) sending messages to ob (PROC28)

PROC39: 59.2268409810; SEND_REQ_ACTIVATE, count=81536, peer=28
PROC39: 59.2268649811; SEND_REQ_XFER_BEGIN, count=81536, peer=28
PROC39: 59.2270019811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28
PROC39: 59.2270509811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28
PROC39: 59.2270809811; SEND_REQ_XFER_END, count=81536, peer=28
PROC39: 59.2270869811; SEND_REQ_COMPLETE, count=81536, peer=28
PROC39: 59.2270939811; SEND_REQ_NOTIFY, count=81536, peer=28
PROC28: 59.2271299611; RECV_MSG_ARRIVED, count=0, peer=39
PROC28: 59.2271389612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39
PROC28: 59.2271439611; RECV_REQ_REMOVE_FROM_POSTED_Q, count=172084, peer=39
PROC28: 59.2271479612; RECV_MSG_MATCH_POSTED_REQ, count=172084, peer=39
PROC28: 59.2271519612; RECV_SEARCH_POSTED_Q_END, count=0, peer=39
PROC28: 59.2271559612; RECV_REQ_XFER_BEGIN, count=172084, peer=39
PROC28: 59.2271609611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39
PROC28: 59.2271759611; RECV_REQ_XFER_CONTINUE, count=54358, peer=39
PROC39: 59.2283769811; SEND_REQ_ACTIVATE, count=81536, peer=28
PROC39: 59.2283959811; SEND_REQ_XFER_BEGIN, count=81536, peer=28
PROC39: 59.4331379811; SEND_REQ_XFER_CONTINUE, count=27178, peer=28
PROC28: 59.4331819612; RECV_REQ_XFER_END, count=172084, peer=39
PROC28: 59.4332009612; RECV_REQ_COMPLETE, count=172084, peer=39
PROC39: 59.4332049811; SEND_REQ_XFER_CONTINUE, count=54358, peer=28
PROC28: 59.4332059611; RECV_MSG_ARRIVED, count=0, peer=39
PROC28: 59.4332109612; RECV_SEARCH_POSTED_Q_BEGIN, count=0, peer=39
PROC28: 59.4332149612; RECV_MSG_INSERT_IN_UNEX_Q, count=0, peer=39
PROC28: 59.4332199611; RECV_SEARCH_POSTED_Q_END, count=0, peer=39
PROC28: 59.4332269612; RECV_REQ_NOTIFY, count=172084, peer=39
PROC28: 59.4332539611; RECV_REQ_ACTIVATE, count=172084, peer=39
PROC39: 59.4332569811; SEND_REQ_XFER_END, count=81536, peer=28
PROC28: 59.4332589612; RECV_SEARCH_UNEX_Q_BEGIN, count=172084, peer=39
PROC28: 59.4332629612; RECV_REQ_MATCH_UNEX, count=172084, peer=39
PROC39: 59.4332659811; SEND_REQ_COMPLETE, count=81536, peer=28
PROC28: 59.4332669611; RECV_MSG_REMOVE_FROM_UNEX_Q, count=0, peer=39
PROC28: 59.4332719612; RECV_SEARCH_UNEX_Q_END, count=172084, peer=39
PROC28: 59.4332759612; RECV_REQ_XFER_BEGIN, count=172084, peer=39
PROC39: 59.4332769811; SEND_REQ_NOTIFY, count=81536, peer=28
PROC28: 59.4332809611; RECV_REQ_XFER_CONTINUE, count=27178, peer=39
PROC39: 59.4332849811; ERROR: component gb2 exceeded send time limit by 
0.104915 seconds.
PROC39: 59.4332849811; WARNING: component gb2 increased page faults R=130898,0; 
P=130898,0; S=132207,0

Notice the 205 ms delay between the hi-lighted SEND_REQ_XFER_BEGIN and the 
subsequent SEND_REQ_XFER_CONTINUE
Also notice that there was no such delay in the previous send request.

The last two lines are my own debug that monitor excessive time spent in an 
MPI_Send and monitor increasing number of page faults.
After the delaying MPI_Send, the page faults have increased by 1309 (132207 - 
130898) !!!

I looked at the PML code (pml_ob1_sendreq.c) and I suspect that something is 
going awry in the mca_bml_base_prepare_src function.
I believe that this is the function in which the openib BTL makes calls to 
register the user memory.

This delay in the MPI_Send kills any hope of meeting our real-time 
requirements, so any help is greatly appreciated.


From: Hodge, Gary C
Sent: Monday, November 05, 2012 12:27 PM
To: us...@open-mpi.org<mailto:us...@open-mpi.org>
Subject: gathering problem

I continue to have a problem where 2 processes are sending to the same process 
and one of the sending processes hangs for 150 to 550 ms in the call to 
MPI_Send.

Each process runs on a different node and the receiving process has posted an 
MPI_Irecv 17 ms before the hanging send.
The posted receives are for 172K buffers and the sending processes are sending 
81K size messages.
I have set mpi_leave_pinned to 1 and have increased the 
btl_openib_receive_queues to ...:S,65536,512,256,64

How do I trace the various phases of message passing to diagnose where the send 
is hanging up?


_______________________________________________
users mailing list
us...@open-mpi.org<mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] EXTERNAL: Re: gathering problem

Reply via email to