Noam,

Thanks for your output; it highlights an unusual outcome. It shows that a
process (29662) has pending messages from other processes that are tagged
with a past sequence number, something that should not have happened. The
only way to get that is if we somehow screwed up the sending part and pushed
the same sequence number twice ...
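
A minimal Python sketch of that check, for anyone who wants to run it over
their own dump. It is not part of Open MPI: the helper name stale_frags and
the regular expressions are assumptions based only on how the dump lines look
in the output quoted below. It flags every unexpected fragment whose seq is
lower than the expected_seq recorded for its source rank, using a plain
integer comparison (any sequence-number wraparound is ignored).

```python
import re

# Dump excerpt for process 29662, copied from the quoted output in this
# thread (lines wrapped by the mail client have been rejoined).
DUMP = """\
[compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
[compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [   ] ctx     5 src 2 tag 200 seq 126 msg_length 86777600
[compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
[compute-1-9:29662] unexpected frag
[compute-1-9:29662] hdr RNDV [   ] ctx     5 src 3 tag 200 seq 8557 msg_length 86777600
"""

# Assumed line shapes, inferred from the excerpt above.
PEER = re.compile(r"\[Rank (\d+)\] expected_seq (\d+)")
FRAG = re.compile(r"hdr \w+ \[.*?\] ctx\s+\d+ src (\d+) tag (-?\d+) seq (\d+)")


def stale_frags(dump):
    """Return (src, tag, seq, expected_seq) for fragments with seq < expected_seq."""
    # Collapse all whitespace so that mail-wrapped lines still match.
    text = " ".join(dump.split())
    expected = {int(rank): int(seq) for rank, seq in PEER.findall(text)}
    stale = []
    for src, tag, seq in FRAG.findall(text):
        src, seq = int(src), int(seq)
        if src in expected and seq < expected[src]:
            stale.append((src, int(tag), seq, expected[src]))
    return stale


if __name__ == "__main__":
    for src, tag, seq, exp in stale_frags(DUMP):
        print(f"src {src} tag {tag}: seq {seq} is behind expected_seq {exp}")
```

On the excerpt above this reports the two pending fragments, from ranks 2
and 3, each one sequence number behind what the receiver expects next.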

More digging is required.

  George.



On Fri, Apr 6, 2018 at 2:42 PM, Noam Bernstein <noam.bernst...@nrl.navy.mil>
wrote:

>
> On Apr 6, 2018, at 1:41 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Noam,
>
> According to your stack trace, the correct way to call mca_pml_ob1_dump
> is with the communicator from the PMPI call. Thus, this call was successful:
>
> (gdb) call mca_pml_ob1_dump(0xed932d0, 1)
> $1 = 0
>
>
> I should have been clearer: the output is not in gdb but on the output
> stream of your application. If you run your application by hand with
> mpirun, the output should be on the terminal where you started mpirun. If
> you start your job with a batch scheduler, the output should be in the
> output file associated with your job.
>
>
> OK, that makes sense.  Here’s what I get from the two relevant processes.
>  compute-1-9 should be receiving, and 1-10 sending, I believe.  Is it
> possible that the fact that all the send/recv pairs (nodes 1-3 on each set
> of 4 sending to 0, which is receiving from each one in turn) are using the
> same tag (200) is confusing things?
>
> [compute-1-9:29662] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xeba14d0](5) rank 0 recv_seq 8855 num_procs 4 last_probed 0
> [compute-1-9:29662] [Rank 1] expected_seq 175 ompi_proc 0xeb0ec50 send_seq 8941
> [compute-1-9:29662] [Rank 2] expected_seq 127 ompi_proc 0xeb97200 send_seq 385
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [   ] ctx     5 src 2 tag 200 seq 126 msg_length 86777600
> [compute-1-9:29662] [Rank 3] expected_seq 8558 ompi_proc 0x2b8ee8000f90 send_seq 5
> [compute-1-9:29662] unexpected frag
> [compute-1-9:29662] hdr RNDV [   ] ctx     5 src 3 tag 200 seq 8557 msg_length 86777600
>
> [compute-1-10:15673] Communicator MPI COMMUNICATOR 5 SPLIT FROM 3 [0xe9cc6a0](5) rank 1 recv_seq 9119 num_procs 4 last_probed 0
> [compute-1-10:15673] [Rank 0] expected_seq 8942 ompi_proc 0xe8e1db0 send_seq 174
> [compute-1-10:15673] [Rank 2] expected_seq 54 ompi_proc 0xe9d7940 send_seq 8561
> [compute-1-10:15673] [Rank 3] expected_seq 126 ompi_proc 0xe9c20c0 send_seq 385
>
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>