Hi there,
I have attached a small piece of code that reproduces a "bug?" which is really annoying me. Issuing several calls to MPI_WIN_LOCK/MPI_WIN_UNLOCK seems to hang some processes until an MPI_BARRIER is encountered.

My experience with MPI is very modest, so I apologize in advance if I have misread the MPI-2 specs, but it looks to me as though what I want to do is correct.

If you look at the file hangs.F90, the code starts with several calls to LOCK/UNLOCK and then goes on with, let's say, a big piece of work between the "start action" and "action done" messages. For the purpose of this example, that is a do loop lasting 10 s.

I don't want to put a barrier after the calls to LOCK/UNLOCK because I want the code to run asynchronously. Also note that I don't need a mutex or anything similar; all these calls can be made simultaneously and in any order. My only problem is the following hang:

Here is the output when the code runs on an SMP machine (8 cores) with an increasing number of processes (the same occurs with distributed memory).

mpirun -np 1 ./hangs
start action for rank=            0
(10 seconds later)
action done for rank=            0

<----works as I expect.

mpirun -np 2 ./hangs
start action for rank=            1
start action for rank=            0
(10 secs later)
action done for rank=            1
action done for rank=            0

<----so far so good; but with more processes the "bug?" appears:

mpirun -np 3 ./hangs
start action for rank=            1
start action for rank=            0
(10 secs later)
action done for rank=            0
action done for rank=            1
start action for rank=            2
(10 secs later)
action done for rank=            2

Process 2 remained stuck in the MPI_WIN_UNLOCK call until processes 0 and 1 reached the MPI_BARRIER call, which effectively makes the execution serial :)

I tested with up to 8 processes and the problem gets even worse: a random number of processes get stuck in MPI_WIN_UNLOCK. However, this does not happen on every run. Sometimes, though rarely, all the processes are released from the UNLOCK as expected.
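
To check where the time is actually spent, each MPI_WIN_UNLOCK call can be wrapped with MPI_WTIME. The following is only a minimal sketch, assuming the same one-integer window as in hangs.F90; the program name time_unlock and the variables t0, t1, winbuf and getbuf are illustrative and are not part of the attached file. To keep it short it issues only the MPI_GET.

```fortran
program time_unlock
  implicit none
  include "mpif.h"

  integer :: myrank, nsize, code, targrank
  integer :: win, intsize
  integer(MPI_ADDRESS_KIND) :: winsize
  integer(MPI_ADDRESS_KIND), parameter :: zerodisp = 0
  integer, volatile :: winbuf   ! memory exposed through the window
  integer :: getbuf             ! local destination of the MPI_GET
  double precision :: t0, t1

  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nsize, code)

  ! one-integer window, as in hangs.F90
  winbuf = 10
  call MPI_TYPE_SIZE(MPI_INTEGER, intsize, code)
  winsize = intsize
  call MPI_WIN_CREATE(winbuf, winsize, intsize, MPI_INFO_NULL, &
       MPI_COMM_WORLD, win, code)

  do targrank = 0, nsize-1
     if (targrank == myrank) cycle

     call MPI_WIN_LOCK(MPI_LOCK_EXCLUSIVE, targrank, 0, win, code)
     call MPI_GET(getbuf, 1, MPI_INTEGER, targrank, zerodisp, &
          1, MPI_INTEGER, win, code)

     ! time only the unlock, where the process appears to be held
     t0 = MPI_WTIME()
     call MPI_WIN_UNLOCK(targrank, win, code)
     t1 = MPI_WTIME()

     write(*,'(a,i0,a,i0,a,f8.3,a)') 'rank ', myrank, &
          ' unlock of target ', targrank, ' took ', t1 - t0, ' s'
  end do

  ! stand-in for the "big piece of work": spin for 10 s without MPI progress
  t0 = MPI_WTIME()
  do while (MPI_WTIME() - t0 < 10.0d0)
  end do

  call MPI_WIN_FREE(win, code)
  call MPI_FINALIZE(code)
end program time_unlock
```

Run for example with mpirun -np 3 ./time_unlock. A process that is held in the unlock would be expected to report a duration close to the 10 s of busy work rather than a small fraction of a second.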

Additionally, if an MPI_BARRIER is issued just after the MPI_WIN_UNLOCK calls, the problem goes away; but I have never read in the MPI-2 specs that this should be required, and it would completely defeat the purpose of performing asynchronous operations.

gcc/gfortran is 4.6.3
(Open MPI) 1.4.5

Please let me know if this behaviour can be fixed and if you need additional information!

Thanks in advance,
Cheers,
Chris.


program hangs
  implicit none

  include "mpif.h"

  integer :: myrank, nsize, code
  integer :: targrank
  integer :: timein,timenow


  integer :: WinOnQ
  integer :: IntSize,DispUnit


  integer(MPI_ADDRESS_KIND) :: WinSize
  integer(MPI_ADDRESS_KIND), parameter :: ZeroDisplace = 0

  integer, parameter :: NullAssert = 0
  integer, parameter :: CountOne = 1
  integer, parameter :: QLockFlag = -1




  integer :: GetQRdma

  integer, volatile :: QrdmaAddress



  call MPI_INIT(code)
  call MPI_COMM_RANK(MPI_COMM_WORLD,myrank,code)
  call MPI_COMM_SIZE(MPI_COMM_WORLD,nsize,code)



  QrdmaAddress = 10


!open win

  call MPI_TYPE_SIZE(MPI_INTEGER,IntSize,code)
  WinSize = IntSize !one element
  DispUnit = IntSize
  call MPI_WIN_CREATE(QrdmaAddress,WinSize,DispUnit,MPI_INFO_NULL &
       ,MPI_COMM_WORLD,WinOnQ,code)


!rdma lock/get/put/unlock operations

  do targrank = 0,nsize-1

     if (myrank.eq.targrank) cycle


 !    write(*,*)'lock/get/put   || orig= targ= ',myrank,targrank


     call MPI_WIN_LOCK(MPI_LOCK_EXCLUSIVE,targrank,NullAssert,WinOnQ,code)

     call MPI_GET(GetQrdma,CountOne,MPI_INTEGER,targrank,ZeroDisplace &
          ,CountOne,MPI_INTEGER,WinOnQ,code)

     call MPI_PUT(QLockFlag,CountOne,MPI_INTEGER,targrank,ZeroDisplace &
          ,CountOne,MPI_INTEGER,WinOnQ,code)


     call MPI_WIN_UNLOCK(targrank,WinOnQ,code)


!     write(*,*)'unlocked       || orig= targ= ',myrank,targrank
!     write(*,*)'orig= targ= GetQrdma=          ',myrank,targrank,GetQrdma

  enddo

!this fixes the issue, but I don't want sync here obviously
!  call MPI_BARRIER(MPI_COMM_WORLD,code)


  write(*,*)
  write(*,*)
  write(*,*)'start action for rank= ',myrank

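  ! busy-wait for 10 s to emulate a big piece of work
  timein = time()
  timenow = timein   ! initialize before the first loop test
  do while ((timenow-timein).lt.10)
     timenow = time()
  end do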

  write(*,*) 'action done for rank= ',myrank


  call MPI_WIN_FREE(WinOnQ,code)

  call MPI_BARRIER(MPI_COMM_WORLD,code)
  call MPI_FINALIZE(code)

end program hangs
# >>> DESIGNED FOR GMAKE <<<


FC=mpif90

FFLAGS= -O -fopenmp


hangs : hangs.o
        $(FC) $(FFLAGS) hangs.o -o $@

%.o: %.F90
        $(FC) $(FFLAGS) $(INCLUDE) -c $<

clean:
        rm -f hangs *.o *.mod
