The problem is also observed with the nightly snapshot version:

                Open MPI: 2.1.1a1
  Open MPI repo revision: v2.1.0-49-g9d7e7a8
   Open MPI release date: Unreleased developer copy
                Open RTE: 2.1.1a1
  Open RTE repo revision: v2.1.0-49-g9d7e7a8

Thanks,
-Harsha

On Thu, Apr 13, 2017 at 10:49 PM, Sriharsha Basavapatna
<sriharsha.basavapa...@broadcom.com> wrote:
> Hi Jeff,
>
> The same problem is seen with Open MPI v2.1.0 too.
>
> Thanks,
> -Harsha
>
>
> On Thu, Apr 13, 2017 at 4:41 PM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>> Can you try the latest version of Open MPI?  There have been bug fixes in 
>> the MPI one-sided area.
>>
>> Try Open MPI v2.1.0, or v2.0.2 if you want to stick with the v2.0.x series.  
>> I think there have been some post-release one-sided fixes, too -- you may 
>> also want to try nightly snapshots on both of those branches.
>>
>>> On Apr 13, 2017, at 6:48 AM, Sriharsha Basavapatna via users 
>>> <users@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> I'm seeing an issue with Open MPI version 2.0.1. The setup uses 2 nodes
>>> with 1 process on each node, and the test case is P_Write_Indv. The problem
>>> occurs when the test reaches the 4MB message size in NON-AGGREGATE mode.
>>> The test just hangs at that point. Here's the exact command/options being
>>> used, followed by the output log and the stack trace (with gdb):
>>>
>>> # /usr/local/mpi/openmpi/bin/mpirun -np 2 -hostfile hostfile -mca btl
>>> self,sm,openib -mca btl_openib_receive_queues P,65536,256,192,128 -mca
>>> btl_openib_cpc_include rdmacm -mca orte_base_help_aggregate 0
>>> --allow-run-as-root --bind-to none --map-by node
>>> /usr/local/imb/openmpi/IMB-IO -msglog 21:22 -include P_Write_Indv -time 300
>>>
>>> #-----------------------------------------------------------------------------
>>> # Benchmarking P_Write_Indv
>>> # #processes = 1
>>> # ( 1 additional process waiting in MPI_Barrier)
>>> #-----------------------------------------------------------------------------
>>> #
>>> #    MODE: AGGREGATE
>>> #
>>>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>>>            0         1000         0.56         0.56         0.56         0.00
>>>      2097152           20     31662.31     31662.31     31662.31        63.17
>>>      4194304           10     64159.89     64159.89     64159.89        62.34
>>>
>>> #-----------------------------------------------------------------------------
>>> # Benchmarking P_Write_Indv
>>> # #processes = 1
>>> # ( 1 additional process waiting in MPI_Barrier)
>>> #-----------------------------------------------------------------------------
>>> #
>>> #    MODE: NON-AGGREGATE
>>> #
>>>       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]   Mbytes/sec
>>>            0          100       570.14       570.14       570.14         0.00
>>>      2097152           20     55007.33     55007.33     55007.33        36.36
>>>      4194304           10     85838.17     85838.17     85838.17        46.60
>>>
>>>
>>> #1  0x00007f08bf5af2a3 in poll_device () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
>>> #2  0x00007f08bf5afe15 in btl_openib_component_progress () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
>>> #3  0x00007f08bf55b89c in opal_progress () from /usr/local/mpi/openmpi/lib/libopen-pal.so.20
>>> #4  0x00007f08c0294a55 in ompi_request_default_wait_all () from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #5  0x00007f08c02da295 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #6  0x00007f08c034b399 in mca_io_ompio_set_view_internal () from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #7  0x00007f08c034b76c in mca_io_ompio_file_set_view () from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #8  0x00007f08c02cc00f in PMPI_File_set_view () from /usr/local/mpi/openmpi/lib/libmpi.so.20
>>> #9  0x000000000040e4fd in IMB_init_transfer ()
>>> #10 0x0000000000405f5c in IMB_init_buffers_iter ()
>>> #11 0x0000000000402d25 in main ()
>>> (gdb)
>>>
>>>
>>> I did some analysis; please see the details below.
>>>
>>> The MPI threads on the Root node and the Peer node are hung waiting for
>>> completions on the RQ and SQ. There are no new completions, as indicated
>>> by the state of the RQ/SQ and the corresponding CQs in the device. Here's
>>> the sequence as per the queue status observed in the device (while the
>>> threads are stuck waiting for new completions):
>>>
>>>       Root-node (RQ)                                        Peer-node (SQ)
>>>
>>>
>>>                                 0 <---------------------------0
>>>                                               .................
>>>                                               .................
>>>                                        17 SEND-Inline WRs
>>>                                        (17 RQ-CQEs seen)
>>>                                                ................
>>>                                                ................
>>>                               16 <----------------------------16
>>>
>>>
>>>
>>>                               17<----------------------------- 17
>>>                               1 RDMA-WRITE-Inline+Signaled
>>>                                      (1 SQ-CQE generated)
>>>
>>>
>>>                               18 <-----------------------------18
>>>                                                .................
>>>                                                .................
>>>                                    19 RDMA-WRITE-Inlines
>>>                                                 ................
>>>                                                 ................
>>>                                36 <----------------------------36
>>>
>>>
>>> As shown in the diagram above, here's the sequence of events (Work
>>> Requests and Completions) that occurred between the Root node and the
>>> Peer node (a rough verbs-level sketch of these postings follows the list):
>>>
>>> 1) Peer node posts 17 Send WRs with Inline flag set
>>> 2) Root node receives all these 17 pkts in its RQ
>>> 3) 17 CQEs are generated in the Root node in its RCQ
>>> 4) Peer node posts an RDMA-WRITE WR with Inline and Signaled flag bits set
>>> 5) Operation completes on the SQ and a CQE is generated in the SCQ
>>> 6) There's no CQE on the Root node since it is an RDMA-WRITE operation
>>> 7) Peer node posts 19 RDMA-WRITE WRs with Inline flag, but no Signaled flag
>>> 8) No CQEs on the Peer node, since they are not Signaled
>>> 9) No CQEs on the Root node, since they are RDMA-WRITEs
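>>>
>>> To make the flag semantics concrete, here's a minimal, hypothetical
>>> verbs-level sketch of the Peer node's postings in steps 1, 4 and 7. This
>>> is not the actual btl_openib code; qp, sge, raddr and rkey are just
>>> placeholders for whatever the BTL sets up internally.
>>>
>>>     #include <infiniband/verbs.h>
>>>     #include <stdint.h>
>>>     #include <string.h>
>>>
>>>     /* Post one send-queue WR; all arguments are placeholders. */
>>>     static int post_one(struct ibv_qp *qp, struct ibv_sge *sge,
>>>                         enum ibv_wr_opcode opcode, unsigned int flags,
>>>                         uint64_t raddr, uint32_t rkey)
>>>     {
>>>         struct ibv_send_wr wr, *bad = NULL;
>>>         memset(&wr, 0, sizeof wr);
>>>         wr.sg_list             = sge;
>>>         wr.num_sge             = 1;
>>>         wr.opcode              = opcode;  /* IBV_WR_SEND / IBV_WR_RDMA_WRITE */
>>>         wr.send_flags          = flags;   /* IBV_SEND_INLINE, IBV_SEND_SIGNALED */
>>>         wr.wr.rdma.remote_addr = raddr;   /* only used for RDMA_WRITE */
>>>         wr.wr.rdma.rkey        = rkey;
>>>         return ibv_post_send(qp, &wr, &bad);
>>>     }
>>>
>>>     /* Steps 1, 4 and 7 above, as seen from the Peer node. */
>>>     static void peer_node_postings(struct ibv_qp *qp, struct ibv_sge *sge,
>>>                                    uint64_t raddr, uint32_t rkey)
>>>     {
>>>         /* (1)-(3): 17 inline SENDs -> 17 CQEs on the Root node's RCQ. */
>>>         for (int i = 0; i < 17; i++)
>>>             post_one(qp, sge, IBV_WR_SEND, IBV_SEND_INLINE, 0, 0);
>>>
>>>         /* (4)-(6): 1 inline+signaled RDMA-WRITE -> 1 CQE on the local SCQ,
>>>          * none on the Root node (an RDMA write consumes no receive WR). */
>>>         post_one(qp, sge, IBV_WR_RDMA_WRITE,
>>>                  IBV_SEND_INLINE | IBV_SEND_SIGNALED, raddr, rkey);
>>>
>>>         /* (7)-(9): 19 inline, unsignaled RDMA-WRITEs -> no CQE on either side. */
>>>         for (int i = 0; i < 19; i++)
>>>             post_one(qp, sge, IBV_WR_RDMA_WRITE, IBV_SEND_INLINE, raddr, rkey);
>>>     }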
>>>
>>> At this point, the Root node is polling on its RCQ for new completions
>>> and there aren't any, since all SENDs are already received and CQEs seen.
>>>
>>> Similarly, the Peer node is polling on its SCQ for new completions
>>> and there aren't any, since the 19 RDMA-WRITEs are not signaled.
>>>
>>> The exact same condition exists in the reverse direction too. That is,
>>> the Root node issues a bunch of SENDs to the Peer node, followed by some
>>> RDMA-WRITEs. The Peer node gets CQEs for the SENDs and keeps looking for
>>> more, but there won't be any, since the subsequent operations are all
>>> RDMA-WRITEs. The Root node itself is polling on its SCQ and won't find
>>> any new completions either, since there are no more signaled WRs.
>>>
>>> So the two nodes are now in a hung state, polling on their SCQs and RCQs,
>>> while there are no pending operations that can generate new CQEs.
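>>>
>>> For reference, here's a simplified sketch of the polling loop both ranks
>>> are effectively stuck in (my wording, not the actual poll_device() /
>>> btl_openib_component_progress() code); `cq` stands for the Root node's
>>> RCQ or the Peer node's SCQ:
>>>
>>>     #include <infiniband/verbs.h>
>>>
>>>     /* Spin until a CQE shows up on the given CQ. */
>>>     static int wait_for_completion(struct ibv_cq *cq, struct ibv_wc *wc)
>>>     {
>>>         for (;;) {
>>>             int n = ibv_poll_cq(cq, 1, wc);
>>>             if (n < 0)
>>>                 return -1;   /* device error */
>>>             if (n > 0)
>>>                 return 0;    /* got a completion */
>>>             /* n == 0: nothing completed. In the hung state described
>>>              * above, no signaled or receive-visible WR is outstanding
>>>              * on this CQ, so this branch repeats forever. */
>>>         }
>>>     }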
>>>
>>> Thanks,
>>> -Harsha
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>>