Is there any chance you can make a small-ish reproducer of the issue that we 
can run?

On Jan 27, 2012, at 10:45 AM, Evgeniy Shapiro wrote:

> Hi
> 
> I have a strange problem with MPI_Barrier occurring when writing to a
> file. The output subroutine (the code is in FORTRAN) is called from
> the main program and there is an MPI_Barrier just before the call.
> 
> In the subroutine
> 
> 1. Process 0 checks whether the first file exists and, if not,
> creates file 1, writes the file header and closes the file
> 
> 2. there is a loop over the data sets with an embedded barrier
>  do i=0, iDatasets
>   call MPI_Barrier
>   if I do not own data - cycle and go to the next dataset (and barrier)
>   check if the file exists, if not - sleep and check again until it
> does (needed to make sure the buffer has been flushed)
>   write my portion of the file
>  end do
> In theory, the above should result in a sequential write of the
> datasets to the file.
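> 
> A minimal Fortran sketch of that loop (the file name, unit number, and
> the ownership test i_own_dataset are illustrative; error handling is
> omitted, and sleep is a GNU Fortran extension):
> 
>  do i = 0, iDatasets
>    call MPI_Barrier(MPI_COMM_WORLD, ierr)
>    if (.not. i_own_dataset(i)) cycle   ! nothing to write; next barrier
>    ! poll until the header (or the previous writer's data) is flushed
>    inquire(file=trim(sFileName), exist=bExists)
>    do while (.not. bExists)
>      call sleep(1)
>      inquire(file=trim(sFileName), exist=bExists)
>    end do
>    open(unit=20, file=trim(sFileName), position='append')
>    ! ... write my portion of dataset i ...
>    close(20)
>  end do
> 
> Note that a rank that cycles still reaches the barrier on the next
> iteration, so every rank executes the same number of barrier calls.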
> 
> 3. Process 0 checks whether the second file exists and, if not,
> creates file 2, writes the file header and closes the file
> 
> 4. there is a loop over the data sets with an embedded barrier
>  do i=0, iDatasets
>   call MPI_Barrier
>   if I do not own data - cycle and go to the next dataset (and barrier)
>   check if the file exists, if not - sleep and check again until it
> does (needed to make sure the buffer has been flushed)
>   write my portion of the file including a link to the 1st file
>  end do
> 
> The sub is called several times (different files/datasets) with a
> barrier between calls; erratically, the program hangs in one of the
> calls. The likelihood of the program hanging increases with the
> number of processes.  DDT shows that when this happens
> some of the processes, including 0, are waiting at the barrier inside
> the first loop, some at the barrier in the second loop, and one
> process is in the sleep/check-file-status cycle in the second loop.
> So somehow some of the processes get through the first barrier before
> process 0.
> This is a debug version, so no loop unrolling etc.
> 
> Is there anything I can do to make sure that the first barrier is
> observed by all processes? Any advice greatly appreciated.
> 
> Evgeniy
> 
> 
> OpenMPI: 1.4.3
> (I cannot use parallel mpi io in this situation for various reasons)
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

