François PELLEGRINI wrote:

I sometimes run into deadlocks in Open MPI (1.3.3a1r21206) when
running my MPI+threaded PT-Scotch software.

So, are there multiple threads per process that perform message-passing operations?

Other comments below.

Luckily, the case is very small, with only 4 procs, so I have been
able to investigate it a bit. It seems that matching between
communications is not done properly on cloned communicators. In the
end, I run into a case where an MPI_Waitall completes an MPI_Barrier
on another proc. The bug is erratic but, luckily, quite easy to
reproduce too.

To make sure, I ran my code under Valgrind using Helgrind, its
race-condition detection tool. It produced a lot of output, most of
which seems to be innocuous, yet I have some concerns about messages
such as the following. The ==12**== ones were generated when running
on 4 procs, while the ==83**== ones were generated when running on
2 procs:

==8329== Possible data race during write of size 4 at 0x8882200
==8329==    at 0x508B315: sm_fifo_write (btl_sm.h:254)
==8329==    by 0x508B401: mca_btl_sm_send (btl_sm.c:811)
==8329==    by 0x5070A0C: mca_bml_base_send_status (bml.h:288)
==8329==    by 0x50708E6: mca_pml_ob1_send_request_start_copy (pml_ob1_sendreq.c:567)
==8329==    by 0x5064C30: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:363)
==8329==    by 0x5064A19: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:429)
==8329==    by 0x5064856: mca_pml_ob1_isend (pml_ob1_isend.c:87)
==8329==    by 0x5142C46: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:51)
==8329==    by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
==8329==    by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
==8329==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==8329==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==8329==   Old state: shared-readonly by threads #1, #7
==8329==   New state: shared-modified by threads #1, #7
==8329==   Reason:    this thread, #1, holds no consistent locks
==8329==   Location 0x8882200 has never been protected by any lock
This seems to be where the "head" index is incremented in sm_fifo_write(). I believe that function is only ever called via the macro MCA_BTL_SM_FIFO_WRITE, which requires the writer to be holding the FIFO's head lock. So, this would seem to be sufficiently protected. In 1.3.1 and earlier, a lock was required only for multithreaded programs. Now, the writer *always* has to acquire the lock since the FIFOs are shared among senders.
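
For illustration, here is a minimal sketch of that discipline. The
names, the structure layout, and the pthread mutexes are stand-ins of
my own, not the actual btl_sm code (which uses opal spinlocks): the
point is simply that the writer always takes the head lock around both
the slot write and the head increment, so the store helgrind flags is
serialized after all.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 256

typedef struct {
    void           *queue[FIFO_SIZE]; /* circular buffer of fragment pointers */
    int             head;             /* next slot to write (the flagged store) */
    int             tail;             /* next slot to read */
    pthread_mutex_t head_lock;        /* stand-in for the real head spinlock */
    pthread_mutex_t tail_lock;        /* stand-in for the real tail spinlock */
} fifo_t;

/* Roughly what MCA_BTL_SM_FIFO_WRITE guarantees: since the FIFO is
 * shared among senders, every writer serializes on the head lock. */
static bool fifo_write(fifo_t *f, void *frag)
{
    bool ok = false;
    pthread_mutex_lock(&f->head_lock);
    if (f->queue[f->head] == NULL) {          /* slot free? */
        f->queue[f->head] = frag;
        f->head = (f->head + 1) % FIFO_SIZE;  /* the write helgrind reports */
        ok = true;
    }
    pthread_mutex_unlock(&f->head_lock);
    return ok;
}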

==1220== Possible data race during write of size 4 at 0x88CEF88
==1220==    at 0x508CD84: sm_fifo_read (btl_sm.h:272)
==1220==    by 0x508C864: mca_btl_sm_component_progress (btl_sm_component.c:391)
==1220==    by 0x41F72DF: opal_progress (opal_progress.c:207)
==1220==    by 0x40BD67D: opal_condition_wait (condition.h:85)
==1220==    by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
==1220==    by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==1220==    by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
==1220==    by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
==1220==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==1220==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==1220==    by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
==1220==    by 0x805EA43: kdgraphMapRbPart2 (kdgraph_map_rb_part.c:331)
==1220==   Old state: shared-readonly by threads #1, #7
==1220==   New state: shared-modified by threads #1, #7
==1220==   Reason:    this thread, #1, holds no consistent locks
==1220==   Location 0x88CEF88 has never been protected by any lock
Here, the FIFO tail index is being incremented in sm_fifo_read(). I believe this function is only called from mca_btl_sm_component_progress(). That function requires that the reader holds the tail lock to read the tail when the process is multithreaded. I believe this requirement suffices since only the reader/owner of the FIFO can read the tail. So, the only contention would be if that reader/owner is multithreaded.
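
Continuing the hypothetical fifo_t sketch above, the reader side would
look roughly like this: the tail is touched only by the owning process,
so the tail lock matters only when that owner is itself multithreaded,
hence the conditional locking.

/* Reader side of the same hypothetical fifo_t: only the FIFO's owner
 * reads the tail, so locking is needed only if the owner runs
 * multiple threads. */
static void *fifo_read(fifo_t *f, bool multithreaded)
{
    void *frag;
    if (multithreaded)
        pthread_mutex_lock(&f->tail_lock);
    frag = f->queue[f->tail];
    if (frag != NULL) {
        f->queue[f->tail] = NULL;
        f->tail = (f->tail + 1) % FIFO_SIZE; /* the increment helgrind reports */
    }
    if (multithreaded)
        pthread_mutex_unlock(&f->tail_lock);
    return frag;
}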

==1219== Possible data race during write of size 4 at 0x891BC8C
==1219==    at 0x508CD99: sm_fifo_read (btl_sm.h:273)
==1219==    by 0x508C864: mca_btl_sm_component_progress (btl_sm_component.c:391)
==1219==    by 0x41F72DF: opal_progress (opal_progress.c:207)
==1219==    by 0x40BD67D: opal_condition_wait (condition.h:85)
==1219==    by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
==1219==    by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==1219==    by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
==1219==    by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
==1219==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==1219==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==1219==    by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
==1219==    by 0x805EA43: kdgraphMapRbPart2 (kdgraph_map_rb_part.c:331)
==1219==   Old state: shared-readonly by threads #1, #7
==1219==   New state: shared-modified by threads #1, #7
==1219==   Reason:    this thread, #1, holds no consistent locks
==1219==   Location 0x891BC8C has never been protected by any lock
This write immediately follows the incrementing of the tail index and is governed by the same tail lock when the process is multithreaded.

==1220== Possible data race during write of size 4 at 0x4243A68
==1220==    at 0x41F72A7: opal_progress (opal_progress.c:186)
==1220==    by 0x40BD67D: opal_condition_wait (condition.h:85)
==1220==    by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
==1220==    by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==1220==    by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
==1220==    by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
==1220==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==1220==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==1220==    by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
==1220==    by 0x805EA43: kdgraphMapRbPart2 (kdgraph_map_rb_part.c:331)
==1220==    by 0x805EB86: _SCOTCHkdgraphMapRbPart (kdgraph_map_rb_part.c:421)
==1220==    by 0x8057713: _SCOTCHkdgraphMapSt (kdgraph_map_st.c:182)
==1220==   Old state: shared-readonly by threads #1, #7
==1220==   New state: shared-modified by threads #1, #7
==1220==   Reason:    this thread, #1, holds no consistent locks
==1220==   Location 0x4243A68 has never been protected by any lock
I guess I won't venture any comments on the opal progress engine.

==8328== Possible data race during write of size 4 at 0x4532318
==8328==    at 0x508A9B8: opal_atomic_lifo_pop (opal_atomic_lifo.h:111)
==8328==    by 0x508A69F: mca_btl_sm_alloc (btl_sm.c:612)
==8328==    by 0x5070571: mca_bml_base_alloc (bml.h:241)
==8328==    by 0x5070778: mca_pml_ob1_send_request_start_copy (pml_ob1_sendreq.c:506)
==8328==    by 0x5064C30: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:363)
==8328==    by 0x5064A19: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:429)
==8328==    by 0x5064856: mca_pml_ob1_isend (pml_ob1_isend.c:87)
==8328==    by 0x5142C46: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:51)
==8328==    by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
==8328==    by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
==8328==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==8328==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==8328==   Old state: shared-readonly by threads #1, #8
==8328==   New state: shared-modified by threads #1, #8
==8328==   Reason:    this thread, #1, holds no consistent locks
==8328==   Location 0x4532318 has never been protected by any lock
Here, opal_atomic_lifo_pop is used to get an item off the sm eager free list. The opal atomic LIFO operation seems to use atomic memory operations for thread safety, but I'll let someone else vouch for that code.
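
In case it helps readers unfamiliar with lock-free lists, a pop along
these lines would look roughly as follows. This uses C11 atomics
rather than the opal_atomic_* layer, the names are hypothetical, and
the real opal_atomic_lifo also keeps a "ghost" element and handles
concerns this sketch glosses over; it only illustrates why a race
detector flags the write even though the CAS makes it atomic.

#include <stdatomic.h>
#include <stddef.h>

typedef struct lifo_item {
    struct lifo_item *next;
} lifo_item_t;

typedef struct {
    _Atomic(lifo_item_t *) head;
} lifo_t;

/* Compare-and-swap retry loop: no mutex is held, so helgrind sees an
 * unprotected write, but the CAS itself provides the atomicity.  (A
 * full implementation must also deal with the ABA problem, which is
 * glossed over here.) */
static lifo_item_t *lifo_pop(lifo_t *l)
{
    lifo_item_t *old = atomic_load(&l->head);
    while (old != NULL &&
           !atomic_compare_exchange_weak(&l->head, &old, old->next)) {
        /* CAS failed; 'old' now holds the refreshed head, so retry. */
    }
    return old;
}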

==8329== Possible data race during write of size 4 at 0x452F238
==8329==    at 0x5067FD3: recv_req_matched (pml_ob1_recvreq.h:219)
==8329==    by 0x5067D95: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:191)
==8329==    by 0x508C9BB: mca_btl_sm_component_progress (btl_sm_component.c:426)
==8329==    by 0x41F72DF: opal_progress (opal_progress.c:207)
==8329==    by 0x40BD67D: opal_condition_wait (condition.h:85)
==8329==    by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
==8329==    by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==8329==    by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
==8329==    by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
==8329==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==8329==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==8329==    by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
==8329==   Old state: owned exclusively by thread #7
==8329==   New state: shared-modified by threads #1, #7
==8329==   Reason:    this thread, #1, holds no locks at all
Dunno. Here, the PML is copying source and tag information out of a match header into a status structure. I would think this code is okay since the thread presumably owns both the receive request and the match header. But I'll let someone who knows the PML speak up on this point.
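
For what it's worth, the copy amounts to roughly the following (the
field and type names here are hypothetical sketches, not the actual
ob1 structures). The safety argument is simply that both sides of the
assignment are owned by the matching thread at this point.

/* Hypothetical sketch of copying match information into a status. */
typedef struct { int hdr_src; int hdr_tag; } match_hdr_sketch_t;
typedef struct { int MPI_SOURCE; int MPI_TAG; } status_sketch_t;

static void recv_req_matched_sketch(status_sketch_t *status,
                                    const match_hdr_sketch_t *hdr)
{
    /* These plain stores are what helgrind flags; they are safe only
     * if no other thread touches this request's status concurrently. */
    status->MPI_SOURCE = hdr->hdr_src;
    status->MPI_TAG    = hdr->hdr_tag;
}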

==8329== Possible data race during write of size 4 at 0x452F2DC
==8329==    at 0x40D5946: ompi_convertor_unpack (convertor.c:280)
==8329==    by 0x5067E78: mca_pml_ob1_recv_frag_callback_match (pml_ob1_recvfrag.c:215)
==8329==    by 0x508C9BB: mca_btl_sm_component_progress (btl_sm_component.c:426)
==8329==    by 0x41F72DF: opal_progress (opal_progress.c:207)
==8329==    by 0x40BD67D: opal_condition_wait (condition.h:85)
==8329==    by 0x40BDA96: ompi_request_default_wait_all (req_wait.c:262)
==8329==    by 0x5142C78: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55)
==8329==    by 0x514F379: ompi_coll_tuned_barrier_intra_two_procs (coll_tuned_barrier.c:258)
==8329==    by 0x5143252: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:192)
==8329==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==8329==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==8329==    by 0x805E2B2: kdgraphMapRbPartFold2 (kdgraph_map_rb_part.c:199)
==8329==   Old state: owned exclusively by thread #7
==8329==   New state: shared-modified by threads #1, #7
==8329==   Reason:    this thread, #1, holds no locks at all
It's unpacking message data. I would think this is okay, but someone who understands the PML should say for sure.

I guess the following ones are okay, but I provide them for
reference:

==1220== Possible data race during write of size 4 at 0x8968780
==1220==    at 0x508A619: opal_atomic_unlock (atomic_impl.h:367)
==1220==    by 0x508B468: mca_btl_sm_send (btl_sm.c:811)
==1220==    by 0x5070A0C: mca_bml_base_send_status (bml.h:288)
==1220==    by 0x50708E6: mca_pml_ob1_send_request_start_copy (pml_ob1_sendreq.c:567)
==1220==    by 0x5064C30: mca_pml_ob1_send_request_start_btl (pml_ob1_sendreq.h:363)
==1220==    by 0x5064A19: mca_pml_ob1_send_request_start (pml_ob1_sendreq.h:429)
==1220==    by 0x5064856: mca_pml_ob1_isend (pml_ob1_isend.c:87)
==1220==    by 0x5142C46: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:51)
==1220==    by 0x514F07A: ompi_coll_tuned_barrier_intra_recursivedoubling (coll_tuned_barrier.c:174)
==1220==    by 0x51432A3: ompi_coll_tuned_barrier_intra_dec_fixed (coll_tuned_decision_fixed.c:208)
==1220==    by 0x40E410C: PMPI_Barrier (pbarrier.c:59)
==1220==    by 0x806C5FB: _SCOTCHdgraphInducePart (dgraph_induce.c:334)
==1220==   Old state: shared-modified by threads #1, #7
==1220==   New state: shared-modified by threads #1, #7
==1220==   Reason:    this thread, #1, holds no consistent locks
==1220==   Location 0x8968780 has never been protected by any lock
Unlock during sm FIFO write?  Yes, I would think this is okay.

My comments aren't intended to give the code base my unqualified okay. I'm only saying that I read through these stacks and the sm BTL code that's called out looks okay to me.

ompi_info says:
                Package: Open MPI pelegrin@brol Distribution
               Open MPI: 1.3.3a1r21206
  Open MPI SVN revision: r21206
  Open MPI release date: Unreleased developer copy
               Open RTE: 1.3.3a1r21206
  Open RTE SVN revision: r21206
  Open RTE release date: Unreleased developer copy
                   OPAL: 1.3.3a1r21206
      OPAL SVN revision: r21206
      OPAL release date: Unreleased developer copy
           Ident string: 1.3.3a1r21206
                 Prefix: /usr/local
Configured architecture: i686-pc-linux-gnu
         Configure host: brol
          Configured by: pelegrin
          Configured on: Tue May 12 15:50:08 CEST 2009
         Configure host: brol
               Built by: pelegrin
               Built on: Tue May 12 16:17:34 CEST 2009
             Built host: brol
             C bindings: yes
           C++ bindings: yes
     Fortran77 bindings: yes (all)
     Fortran90 bindings: yes
Fortran90 bindings size: small
             C compiler: gcc
    C compiler absolute: /usr/bin/gcc
           C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
     Fortran77 compiler: gfortran
 Fortran77 compiler abs: /usr/bin/gfortran
     Fortran90 compiler: gfortran
 Fortran90 compiler abs: /usr/bin/gfortran
            C profiling: yes
          C++ profiling: yes
    Fortran77 profiling: yes
    Fortran90 profiling: yes
         C++ exceptions: no
         Thread support: posix (mpi: yes, progress: no)
          Sparse Groups: no
 Internal debug support: yes
    MPI parameter check: always
Memory profiling support: no
Memory debugging support: yes
        libltdl support: yes
  Heterogeneous support: no
mpirun default --prefix: no
        MPI I/O support: yes
      MPI_WTIME support: gettimeofday
Symbol visibility support: yes
  FT Checkpoint support: no  (checkpoint thread: no)
          MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.3.3)
         MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.3.3)
             MCA memory: ptmalloc2 (MCA v2.0, API v2.0, Component v1.3.3)
          MCA paffinity: linux (MCA v2.0, API v2.0, Component v1.3.3)
              MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.3.3)
              MCA carto: file (MCA v2.0, API v2.0, Component v1.3.3)
          MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.3.3)
              MCA timer: linux (MCA v2.0, API v2.0, Component v1.3.3)
        MCA installdirs: env (MCA v2.0, API v2.0, Component v1.3.3)
        MCA installdirs: config (MCA v2.0, API v2.0, Component v1.3.3)
                MCA dpm: orte (MCA v2.0, API v2.0, Component v1.3.3)
             MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.3.3)
          MCA allocator: basic (MCA v2.0, API v2.0, Component v1.3.3)
          MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: basic (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: inter (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: self (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: sm (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: sync (MCA v2.0, API v2.0, Component v1.3.3)
               MCA coll: tuned (MCA v2.0, API v2.0, Component v1.3.3)
                 MCA io: romio (MCA v2.0, API v2.0, Component v1.3.3)
              MCA mpool: fake (MCA v2.0, API v2.0, Component v1.3.3)
              MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.3.3)
              MCA mpool: sm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA pml: cm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA pml: csum (MCA v2.0, API v2.0, Component v1.3.3)
                MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.3.3)
                MCA pml: v (MCA v2.0, API v2.0, Component v1.3.3)
                MCA bml: r2 (MCA v2.0, API v2.0, Component v1.3.3)
             MCA rcache: vma (MCA v2.0, API v2.0, Component v1.3.3)
                MCA btl: self (MCA v2.0, API v2.0, Component v1.3.3)
                MCA btl: sm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA btl: tcp (MCA v2.0, API v2.0, Component v1.3.3)
               MCA topo: unity (MCA v2.0, API v2.0, Component v1.3.3)
                MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.3.3)
                MCA osc: rdma (MCA v2.0, API v2.0, Component v1.3.3)
                MCA iof: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA iof: orted (MCA v2.0, API v2.0, Component v1.3.3)
                MCA iof: tool (MCA v2.0, API v2.0, Component v1.3.3)
                MCA oob: tcp (MCA v2.0, API v2.0, Component v1.3.3)
               MCA odls: default (MCA v2.0, API v2.0, Component v1.3.3)
                MCA ras: slurm (MCA v2.0, API v2.0, Component v1.3.3)
              MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.3.3)
              MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.3.3)
              MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.3.3)
                MCA rml: oob (MCA v2.0, API v2.0, Component v1.3.3)
             MCA routed: binomial (MCA v2.0, API v2.0, Component v1.3.3)
             MCA routed: direct (MCA v2.0, API v2.0, Component v1.3.3)
             MCA routed: linear (MCA v2.0, API v2.0, Component v1.3.3)
                MCA plm: rsh (MCA v2.0, API v2.0, Component v1.3.3)
                MCA plm: slurm (MCA v2.0, API v2.0, Component v1.3.3)
              MCA filem: rsh (MCA v2.0, API v2.0, Component v1.3.3)
             MCA errmgr: default (MCA v2.0, API v2.0, Component v1.3.3)
                MCA ess: env (MCA v2.0, API v2.0, Component v1.3.3)
                MCA ess: hnp (MCA v2.0, API v2.0, Component v1.3.3)
                MCA ess: singleton (MCA v2.0, API v2.0, Component v1.3.3)
                MCA ess: slurm (MCA v2.0, API v2.0, Component v1.3.3)
                MCA ess: tool (MCA v2.0, API v2.0, Component v1.3.3)
            MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.3.3)
            MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.3.3)

Thanks in advance for any help / explanation,

                                        f.p.



