Is there any chance you can make a small-ish reproducer of the issue that we can run?
On Jan 27, 2012, at 10:45 AM, Evgeniy Shapiro wrote:

> Hi
>
> I have a strange problem with MPI_Barrier occurring when writing to a
> file. The output subroutine (the code is in FORTRAN) is called from
> the main program, and there is an MPI_Barrier just before the call.
>
> In the subroutine:
>
> 1. Process 0 checks whether the first file exists and, if not,
>    creates file 1, writes the file header and closes the file.
>
> 2. There is a loop over the data sets with an embedded barrier:
>
>    do i = 0, iDatasets
>      call MPI_Barrier
>      if I do not own the data - cycle and go to the next dataset (and barrier)
>      check if the file exists; if not - sleep and check again until it
>        does (needed to make sure the buffer has been flushed)
>      write my portion of the file
>    end do
>
>    In theory the above should result in a sequential write of datasets
>    to the file.
>
> 3. Process 0 checks whether the second file exists and, if not,
>    creates file 2, writes the file header and closes the file.
>
> 4. There is a second loop over the data sets with an embedded barrier:
>
>    do i = 0, iDatasets
>      call MPI_Barrier
>      if I do not own the data - cycle and go to the next dataset (and barrier)
>      check if the file exists; if not - sleep and check again until it
>        does (needed to make sure the buffer has been flushed)
>      write my portion of the file, including a link to the 1st file
>    end do
>
> The subroutine is called several times (different files/datasets) with a
> barrier between calls; erratically, the program hangs in one of the
> calls. The likelihood of the program hanging increases with the
> number of processes. DDT shows that when this happens, some of the
> processes, including 0, are waiting at the barrier inside the first
> loop, some at the second barrier, whereas one process is in the
> sleep/check-file-status cycle in the second loop. So somehow some of
> the processes get through the 1st barrier before process 0.
> This is a debug version, so no loop unrolling etc.
>
> Is there anything I can do to make sure that the first barrier is
> observed by all processes? Any advice greatly appreciated.
>
> Evgeniy
>
> OpenMPI: 1.4.3
> (I cannot use parallel MPI I/O in this situation for various reasons)
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/