Noam,

Thanks for your output, it highlights an unusual outcome. It shows that a process (29662) has pending messages from other processes that are tagged with a past sequence number, something that should not have happened. For instance, the unexpected fragment from rank 2 carries seq 126 while the expected_seq for that peer is already 127. The only way to get that is if we somehow screwed up the sending part and pushed the same sequence number twice ...
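The check described above (a queued fragment whose seq is lower than the expected_seq of its source rank) can be sketched in a few lines of Python. This is only an illustration and not part of Open MPI: the line formats are inferred from the dump quoted in this thread, the helper name find_stale_frags is hypothetical, and the sketch assumes each dump line is unwrapped (no mail-client line breaks or "> " prefixes) and ignores any wrap-around of the sequence counter.

```python
import re

# Patterns inferred from the dump in this thread (an assumption, not a
# documented output format of mca_pml_ob1_dump).
PEER_RE = re.compile(r"\[Rank (\d+)\] expected_seq (\d+)")
FRAG_RE = re.compile(r"hdr \w+ .*?src (\d+) tag (\d+) seq (\d+)")


def find_stale_frags(dump_text):
    """Return (src, tag, seq, expected_seq) for each queued fragment whose
    sequence number is below the expected_seq reported for its source rank."""
    expected = {}  # source rank -> expected_seq from its "[Rank N]" line
    stale = []
    for line in dump_text.splitlines():
        peer = PEER_RE.search(line)
        if peer:
            expected[int(peer.group(1))] = int(peer.group(2))
            continue
        frag = FRAG_RE.search(line)
        if frag:
            src, tag, seq = (int(g) for g in frag.groups())
            if src in expected and seq < expected[src]:
                stale.append((src, tag, seq, expected[src]))
    return stale


# Dump of process 29662 on compute-1-9, copied from the quoted message.
SAMPLE = """\
[compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
[compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
[compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [ ] ctx 5 src 2 tag 200 seq 126 msg_length 86777600
[compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [ ] ctx 5 src 3 tag 200 seq 8557 msg_length 86777600
"""

if __name__ == "__main__":
    for src, tag, seq, exp in find_stale_frags(SAMPLE):
        print(f"src {src} tag {tag}: seq {seq} < expected_seq {exp}")
```

On the compute-1-9 dump this flags the two RNDV fragments from ranks 2 and 3, each one sequence number behind what the receiver expects next.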
More digging is required.

George.

On Fri, Apr 6, 2018 at 2:42 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil> wrote:

> On Apr 6, 2018, at 1:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace the correct way to call the mca_pml_ob1_dump
> is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
> I should have been more clear, the output is not on gdb but on the output
> stream of your application. If you run your application by hand with
> mpirun, the output should be on the terminal where you started mpirun. If
> you start your job with a batch scheduler, the output should be in the
> output file associated with your job.
>
>
> OK, that makes sense. Here’s what I get from the two relevant processes.
> compute-1-9 should be receiving, and 1-10 sending, I believe. Is it
> possible that the fact that all send/recv pairs (nodes 1-3 on each set
> of 4 sending to 0, which is receiving from each one in turn) are using the
> same tag (200) is confusing things?
>
> [compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
> [compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
> [compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [ ] ctx 5 src 2 tag 200 seq 126 msg_length 86777600
> [compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [ ] ctx 5 src 3 tag 200 seq 8557 msg_length 86777600
>
> [compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
> [compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174
> [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561
> [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628 F +1 202 404 7546
> https://www.nrl.navy.mil
>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users