On Jan 21, 2016, at 7:40 AM, Eva <wuzh...@gmail.com> wrote:
>
> Thanks Jeff.
>
> >>1. Can you create a small example to reproduce the problem?
>
> >>2. The TCP and verbs-based transports use different thresholds and
> >>protocols, and can sometimes bring to light errors in the application
> >>(e.g., the application is making assumptions that just happen to be true
> >>for TCP, but not necessarily for other transports).
>
> >>3. Is your program multi-threaded? If so, MPI_THREAD_MULTIPLE support in
> >>the v1.8 and v1.10 series is not fully baked.
>
> >>4. Additionally, if you have buffering / matching / progression assumptions
> >>in your application, you might accidentally block. An experiment to try
> >>is to convert all MPI_SEND and MPI_ISEND to MPI_SSEND and MPI_ISSEND,
> >>respectively, and see if your program still functions properly on TCP.
>
> 1. I will try to create a small example to reproduce the problem.
>
> 2. I didn't get your point. I didn't make any assumptions for TCP. Is there
> any difference in MPI for TCP and RDMA?

The way (Open) MPI communicates under the covers with TCP and other
transports is different -- e.g., the amount of buffering is different, the
eager sizes are different, etc.  Hence, if your application does an unsafe
communication pattern (e.g., example 3.9 in MPI-3.1, page 43), it may
coincidentally work on one transport and deadlock on another.

> 3. My program doesn't enable MPI_THREAD_MULTIPLE
>
> 4. what do you mean by buffering / matching / progression assumptions in your
> application?

Essentially the same thing I said above -- see example 3.9 in MPI-3.1.

> My program communicates like this:
>
> 4 processes: process0, 1, 2, 3
>
> process1/process3:
>     foreach to_id in process0, process2:
>         MPI_Send(send_buf, sendlen, to_id, TAG);
>         MPI_Recv(recv_buf, recvlen, to_id, TAG);
>
> process0/process2:
>     while(true):
>         MPI_Recv(recv_buf, any_source, TAG);
>         MPI_Send(send_buf, source_id, TAG);

I'm afraid I can't grok what the problem would be from this; it's not enough
of a summary describing what your application is doing.

If you can replicate the problem in a small example that you can share with
the list, that would be most helpful.

Also, try the SEND -> SSEND and ISEND -> ISSEND experiment I mentioned in my
previous mail.

Thanks!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
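
Illustration of the buffering point made in the reply above: a minimal
sketch of a communication pattern that only works when sends happen to be
buffered. This is a toy model in plain Python, not MPI code and not how
Open MPI is implemented. The names Channel, exchange, eager_limit and
safe_order are hypothetical, and the single size threshold is an assumed
simplification of eager vs. rendezvous delivery. Two "ranks" (threads)
each send to the other and then receive; a timeout stands in for a hang.

```python
import queue
import threading


class Channel:
    """Inbox of one rank (toy model, not an MPI API)."""

    def __init__(self, eager_limit):
        self.eager_limit = eager_limit  # assumed single size threshold
        self.mailbox = queue.Queue()

    def send(self, payload, timeout):
        """Return True if the send completed, False if it timed out."""
        matched = threading.Event()
        self.mailbox.put((payload, matched))
        if len(payload) <= self.eager_limit:
            return True  # "eager": message is buffered, send returns at once
        # "rendezvous": send completes only once the receiver has matched it
        return matched.wait(timeout)

    def recv(self, timeout):
        payload, matched = self.mailbox.get(timeout=timeout)
        matched.set()  # tell a waiting sender that the receive was posted
        return payload


def exchange(msg_len, eager_limit, safe_order=False, timeout=0.5):
    """Both ranks exchange one message; return True if both finished."""
    inbox = [Channel(eager_limit), Channel(eager_limit)]
    completed = [False, False]

    def rank(me):
        peer = 1 - me
        msg = b"x" * msg_len
        try:
            if safe_order and me == 1:
                # Safe ordering: one side receives before it sends.
                inbox[me].recv(timeout)
                if not inbox[peer].send(msg, timeout):
                    return
            else:
                # Unsafe ordering: send first, then receive.
                if not inbox[peer].send(msg, timeout):
                    return  # send never matched: treated as a deadlock
                inbox[me].recv(timeout)
        except queue.Empty:
            return  # nothing arrived in time
        completed[me] = True

    threads = [threading.Thread(target=rank, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(completed)


if __name__ == "__main__":
    print("small message, large eager limit:", exchange(100, 1024))
    print("large message, same eager limit: ", exchange(4096, 1024))
    print("eager limit 0 (synchronous send):", exchange(100, 0))
    print("large message, safe ordering:    ", exchange(4096, 0, safe_order=True))
```

In this model, setting eager_limit to 0 plays the role of the SEND -> SSEND
experiment: no send completes until the matching receive has been posted,
so any reliance on buffering shows up as a hang regardless of message size.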