The problem is also observed with the nightly snapshot version:

    Open MPI: 2.1.1a1
    Open MPI repo revision: v2.1.0-49-g9d7e7a8
    Open MPI release date: Unreleased developer copy
    Open RTE: 2.1.1a1
    Open RTE repo revision: v2.1.0-49-g9d7e7a8
Thanks,
-Harsha

On Thu, Apr 13, 2017 at 10:49 PM, Sriharsha Basavapatna
<sriharsha.basavapa...@broadcom.com> wrote:
> Hi Jeff,
>
> The same problem is seen with Open MPI v2.1.0 too.
>
> Thanks,
> -Harsha
>
>
> On Thu, Apr 13, 2017 at 4:41 PM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>> Can you try the latest version of Open MPI?  There have been bug fixes in
>> the MPI one-sided area.
>>
>> Try Open MPI v2.1.0, or v2.0.2 if you want to stick with the v2.0.x series.
>> I think there have been some post-release one-sided fixes, too -- you may
>> also want to try the nightly snapshots on both of those branches.
>>
>>> On Apr 13, 2017, at 6:48 AM, Sriharsha Basavapatna via users
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> I'm seeing an issue with Open MPI version 2.0.1. The setup uses 2 nodes
>>> with 1 process on each node, and the test case is P_Write_Indv. The
>>> problem occurs when the test runs the 4 MB message size in NON-AGGREGATE
>>> mode: the test just hangs at that point. Here's the exact command and
>>> options being used, followed by the output log and the stack trace
>>> (with gdb):
>>>
>>> # /usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile -mca btl
>>> self,sm,openib -mca btl_openib_receive_queues P,65536,256,192,128 -mca
>>> btl_openib_cpc_include rdmacm -mca orte_base_help_aggregate 0
>>> --allow-run-as-root --bind-to none --map-by node
>>> /usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300
>>>
>>> #-----------------------------------------------------------------------------
>>> # Benchmarking P_Write_Indv
>>> # #processes = 1
>>> # ( 1 additional process waiting in MPI_Barrier)
>>> #-----------------------------------------------------------------------------
>>> #
>>> #    MODE: AGGREGATE
>>> #
>>>    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
>>>         0         1000         0.56         0.56         0.56        0.00
>>>   2097152           20     31662.31     31662.31     31662.31       63.17
>>>   4194304           10     64159.89     64159.89     64159.89       62.34
>>>
>>> #-----------------------------------------------------------------------------
>>> # Benchmarking P_Write_Indv
>>> # #processes = 1
>>> # ( 1 additional process waiting in MPI_Barrier)
>>> #-----------------------------------------------------------------------------
>>> #
>>> #    MODE: NON-AGGREGATE
>>> #
>>>    #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]  Mbytes/sec
>>>         0          100       570.14       570.14       570.14        0.00
>>>   2097152           20     55007.33     55007.33     55007.33       36.36
>>>   4194304           10     85838.17     85838.17     85838.17       46.60
>>>
>>>
>>> #1  0x00007f08bf5af2a3 in poll_device ()
>>>     from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
>>> #2  0x00007f08bf5afe15 in btl_openib_component_progress ()
>>>     from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
>>> #3  0x00007f08bf55b89c in opal_progress ()
>>>     from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
>>> #4  0x00007f08c0294a55 in ompi_request_default_wait_all ()
>>>     from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #5  0x00007f08c02da295 in ompi_coll_base_allreduce_intra_recursivedoubling ()
>>>     from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #6  0x00007f08c034b399 in mca_io_ompio_set_view_internal ()
>>>     from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #7  0x00007f08c034b76c in mca_io_ompio_file_set_view ()
>>>     from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #8  0x00007f08c02cc00f in PMPI_File_set_view ()
>>>     from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #9  0x000000000040e4fd in IMB_init_transfer ()
>>> #10 0x0000000000405f5c in IMB_init_buffers_iter ()
>>> #11 0x0000000000402d25 in main ()
>>> (gdb)
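The hang sits underneath the collective that MPI_File_set_view() performs
internally (frames #4-#8 above). A minimal sketch of that call path follows,
assuming a scratch file path and a 4 MB contiguous filetype to mirror the
failing message size; neither is taken from IMB itself, it only illustrates
where the trace shows the ranks spinning.

/* Minimal sketch (not from the original report): exercises the
 * MPI_File_open() + MPI_File_set_view() path shown in frames #6-#8.
 * set_view runs a collective internally, which is where the trace
 * shows both ranks stuck in opal_progress()/poll_device(). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/tmp/p_write_indv_test",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

    /* 4 MB contiguous filetype, matching the 4194304-byte step at which
     * P_Write_Indv hangs (an assumption for illustration). */
    MPI_Datatype filetype;
    MPI_Type_contiguous(4194304, MPI_BYTE, &filetype);
    MPI_Type_commit(&filetype);

    /* Collective call: every rank must reach it. */
    MPI_File_set_view(fh, 0, MPI_BYTE, filetype, "native", MPI_INFO_NULL);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    MPI_Finalize();
    return 0;
}

Running it with the same mpirun options as above (two ranks, the openib BTL
over rdmacm) should drive the same ompio set_view path shown in the trace.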
>>>
>>>
>>> I did some analysis; please see the details below.
>>>
>>> The MPI threads on the Root node and the Peer node are hung trying to
>>> get completions on the RQ and SQ. There are no new completions as per
>>> the data in the RQ/SQ and the corresponding CQs in the device. Here's
>>> the sequence as per the queue status observed in the device (while the
>>> threads are stuck waiting for new completions):
>>>
>>>     Root-node (RQ)                     Peer-node (SQ)
>>>
>>>      0  <---------------------------   0
>>>     .................                 .................
>>>     (17 RQ-CQEs seen)                 17 SEND-Inline WRs
>>>     .................                 .................
>>>     16  <---------------------------  16
>>>
>>>     17  <---------------------------  17
>>>                                       1 RDMA-WRITE-Inline+Signaled
>>>                                       (1 SQ-CQE generated)
>>>
>>>     18  <---------------------------  18
>>>     .................                 .................
>>>                                       19 RDMA-WRITE-Inlines
>>>     .................                 .................
>>>     36  <---------------------------  36
>>>
>>>
>>> As shown in the diagram above, here's the sequence of events (Work
>>> Requests and Completions) that occurred between the Root node and the
>>> Peer node:
>>>
>>> 1) Peer node posts 17 Send WRs with the Inline flag set
>>> 2) Root node receives all 17 packets in its RQ
>>> 3) 17 CQEs are generated in the Root node's RCQ
>>> 4) Peer node posts an RDMA-WRITE WR with the Inline and Signaled flag bits set
>>> 5) The operation completes on the SQ and a CQE is generated in the SCQ
>>> 6) There's no CQE on the Root node, since it is an RDMA-WRITE operation
>>> 7) Peer node posts 19 RDMA-WRITE WRs with the Inline flag, but no Signaled flag
>>> 8) No CQEs on the Peer node, since they are not Signaled
>>> 9) No CQEs on the Root node, since they are RDMA-WRITEs
>>>
>>> At this point, the Root node is polling its RCQ for new completions
>>> and there aren't any, since all SENDs have already been received and
>>> their CQEs seen.
>>>
>>> Similarly, the Peer node is polling its SCQ for new completions and
>>> there aren't any, since the 19 RDMA-WRITEs are not signaled.
>>>
>>> The exact same condition exists in the reverse direction too. That is,
>>> the Root node issues a bunch of SENDs to the Peer node, followed by
>>> some RDMA-WRITEs. The Peer node gets CQEs for the SENDs and looks for
>>> more, but there won't be any, since the subsequent operations are all
>>> RDMA-WRITEs. The Root node itself is polling its SCQ and won't find any
>>> new completions either, since there are no more signaled WRs.
>>>
>>> So the two nodes are now in a hung state, polling both the SCQ and the
>>> RCQ, while there are no pending operations that can generate new CQEs.
>>>
>>> Thanks,
>>> -Harsha
>>> _______________________________________________
>>> users mailing list
>>> users@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>
_______________________________________________
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
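At the verbs level, the pattern described in the analysis boils down to the
rough sketch below. This is illustrative code, not the Open MPI openib BTL;
the queue pair, the send/receive CQs, buffers and memory keys are assumed to
be set up elsewhere. It shows why both sides have nothing left to poll for:
an RDMA WRITE posted without IBV_SEND_SIGNALED never produces a send-side
CQE, an RDMA WRITE never produces a CQE at the target, and a poll loop
waiting for a completion that can no longer arrive spins forever.

#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one inline RDMA WRITE, optionally signaled (steps 4 and 7 above).
 * qp, buf, lkey, remote_addr and rkey are assumed to exist already. */
static int post_rdma_write_inline(struct ibv_qp *qp, void *buf, uint32_t len,
                                  uint32_t lkey, uint64_t remote_addr,
                                  uint32_t rkey, int signaled)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) buf,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t) buf,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        /* Steps 7-9: inline but unsignaled means no send-side CQE is ever
         * generated for this WR, and the target sees no CQE either. */
        .send_flags = IBV_SEND_INLINE | (signaled ? IBV_SEND_SIGNALED : 0),
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Progress loop of the shape both ranks are stuck in: with only
 * unsignaled RDMA WRITEs outstanding, ibv_poll_cq() keeps returning 0
 * on both CQs and the loop never exits. */
static void wait_for_any_completion(struct ibv_cq *scq, struct ibv_cq *rcq)
{
    struct ibv_wc wc;
    for (;;) {
        if (ibv_poll_cq(scq, 1, &wc) > 0 || ibv_poll_cq(rcq, 1, &wc) > 0)
            return;  /* a CQE arrived */
        /* otherwise spin -- the poll_device() loop seen in the trace */
    }
}

Posting work requests unsignaled is a common verbs optimization to cut CQE
overhead, so the sender legitimately has nothing to poll for here; the hang
comes from each side waiting for a completion that the other side's
outstanding WRs can never generate.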