Hi,

I'm seeing an issue with Open MPI version 2.0.1. The setup uses 2 nodes
with 1 process on each node, and the test case is P_Write_Indv. The problem
occurs when the test reaches the 4 MB message size in NON-AGGREGATE mode:
the test just hangs at that point. Here is the exact command and the options
being used, followed by the output log and the stack trace (captured with gdb):

# /usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile -mca btl
self,sm,openib -mca btl_openib_receive_queues P,65536,256,192,128 -mca
btl_openib_cpc_include rdmacm -mca orte_base_help_aggregate 0
--allow-run-as-root --bind-to none --map-by node
/usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0         1000         0.56         0.56         0.56         0.00
      2097152           20     31662.31     31662.31     31662.31        63.17
      4194304           10     64159.89     64159.89     64159.89        62.34

#-----------------------------------------------------------------------------
# Benchmarking P_Write_Indv
# #processes = 1
# ( 1 additional process waiting in MPI_Barrier)
#-----------------------------------------------------------------------------
#
#    MODE: NON-AGGREGATE
#
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
            0          100       570.14       570.14       570.14         0.00
      2097152           20     55007.33     55007.33     55007.33        36.36
      4194304           10     85838.17     85838.17     85838.17        46.60


#1  0x00007f08bf5af2a3 in poll_device () from
/usr/local/mpi/openmpi/lib/libopen-pal.so.20
#2  0x00007f08bf5afe15 in btl_openib_component_progress ()
   from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
#3  0x00007f08bf55b89c in opal_progress () from
/usr/local/mpi/openmpi/lib/libopen-pal.so.20
#4  0x00007f08c0294a55 in ompi_request_default_wait_all ()
   from /usr/local/mpi/openmpi/lib/libmpi.so.20
#5  0x00007f08c02da295 in ompi_coll_base_allreduce_intra_recursivedoubling ()
   from /usr/local/mpi/openmpi/lib/libmpi.so.20
#6  0x00007f08c034b399 in mca_io_ompio_set_view_internal ()
   from /usr/local/mpi/openmpi/lib/libmpi.so.20
#7  0x00007f08c034b76c in mca_io_ompio_file_set_view ()
   from /usr/local/mpi/openmpi/lib/libmpi.so.20
#8  0x00007f08c02cc00f in PMPI_File_set_view () from
/usr/local/mpi/openmpi/lib/libmpi.so.20
#9  0x000000000040e4fd in IMB_init_transfer ()
#10 0x0000000000405f5c in IMB_init_buffers_iter ()
#11 0x0000000000402d25 in main ()
(gdb)
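
For reference, the user-level path at the top of that trace is the
MPI_File_set_view call made from IMB_init_transfer, which per frames #4-#6
blocks in an internal allreduce over the openib BTL. A stripped-down program
along the following lines exercises the same path; this is only my sketch of
the call sequence (the file name and the etype/filetype/datarep arguments are
placeholders, not taken from the IMB source), and I have not verified that it
hangs on its own:

/* Sketch of the path that hangs (per the backtrace above): both ranks
 * open a shared file and call MPI_File_set_view, which internally does
 * an allreduce. File name and view arguments are placeholders. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "/tmp/setview_test",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* Frame #8 of the backtrace: both ranks block inside this call. */
    MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}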


I did some analysis; the details are below:

The MPI threads on the Root node and the Peer node are hung trying to
get completions on the RQ and SQ respectively. According to the state of the
RQ/SQ and the corresponding CQs in the device, no new completions will arrive.
Here is the sequence, as reconstructed from the queue status observed in the
device while the threads are stuck waiting for new completions:

       Root-node (RQ)                              Peer-node (SQ)

                                 0 <--------------------------- 0
                                         .................
                                         .................
                                      17 SEND-Inline WRs
                                      (17 RQ-CQEs seen)
                                         .................
                                         .................
                                16 <--------------------------- 16


                                17 <--------------------------- 17
                                 1 RDMA-WRITE-Inline+Signaled
                                      (1 SQ-CQE generated)


                                18 <--------------------------- 18
                                         .................
                                         .................
                                     19 RDMA-WRITE-Inlines
                                         .................
                                         .................
                                36 <--------------------------- 36


As shown in the diagram above, this is the sequence of events (Work
Requests and Completions) that occurred between the Root node and the
Peer node:

1) Peer node posts 17 SEND WRs with the Inline flag set
2) Root node receives all 17 of these packets in its RQ
3) 17 CQEs are generated on the Root node in its RCQ
4) Peer node posts an RDMA-WRITE WR with the Inline and Signaled flag bits set
5) The operation completes on the SQ and a CQE is generated in the SCQ
6) There is no CQE on the Root node, since it is an RDMA-WRITE operation
7) Peer node posts 19 RDMA-WRITE WRs with the Inline flag but no Signaled flag
8) No CQEs on the Peer node, since they are not Signaled
9) No CQEs on the Root node, since they are RDMA-WRITEs
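
To make the posting pattern in steps 4) and 7) concrete, here is roughly what
it looks like at the verbs level. This is only an illustrative sketch, not the
actual openib BTL code; qp, sge, remote_addr and rkey are assumed to be set up
elsewhere:

/* Only a WR posted with IBV_SEND_SIGNALED ever produces a send-side CQE;
 * unsignaled RDMA-WRITEs produce no CQE on either side. */
#include <infiniband/verbs.h>
#include <string.h>

static int post_rdma_write(struct ibv_qp *qp, struct ibv_sge *sge,
                           uint64_t remote_addr, uint32_t rkey, int signaled)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = sge;
    wr.num_sge             = 1;
    /* Inline payload, as in the observed WRs; signal only when asked to. */
    wr.send_flags          = IBV_SEND_INLINE |
                             (signaled ? IBV_SEND_SIGNALED : 0);
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}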

At this point, the Root node is polling on its RCQ for new completions,
and there aren't any, since all the SENDs have already been received and their
CQEs consumed.

Similarly, the Peer node is polling on its SCQ for new completions,
and there aren't any, since the 19 RDMA-WRITEs are not signaled.

The exact same condition exists in the reverse direction as well.
That is, the Root node issues a series of SENDs to the Peer node, followed
by some RDMA-WRITEs. The Peer node gets CQEs for the SENDs and keeps looking
for more, but none will ever arrive, since the subsequent operations are
all RDMA-WRITEs. The Root node, meanwhile, is polling on its own SCQ and will
not find any new completions either, since there are no more signaled WRs.

So the two nodes are now in a hung state, each polling on its SCQ and RCQ,
while no pending operation can generate a new CQE.
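
In other words, each rank ends up spinning in a poll loop equivalent to the
following. The real loop is poll_device()/btl_openib_component_progress()
inside libopen-pal; this is just a minimal sketch of the observed behaviour:

/* ibv_poll_cq() keeps returning 0 because nothing outstanding can ever
 * complete: the Root node's RCQ gets no CQEs for incoming RDMA-WRITEs,
 * and the Peer node's SCQ gets no CQEs for unsignaled WRs. */
#include <infiniband/verbs.h>

static void progress_loop(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    for (;;) {
        int n = ibv_poll_cq(cq, 1, &wc);
        if (n < 0)
            break;              /* device error */
        if (n > 0)
            continue;           /* hand the completion to the upper layer */
        /* n == 0 forever: no signaled or receive-generating work is
         * pending, so the benchmark never progresses past this point. */
    }
}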

Thanks,
-Harsha