Hi

I have a strange problem with MPI_Barrier occurring when writing to a
file. The output subroutine (the code is in Fortran) is called from
the main program, and there is an MPI_Barrier just before the call.

In the subroutine

1. Process 0 checks whether the first file exists and, if it does not,
creates file 1, writes the file header and closes the file

2. there is a loop over the data sets with an embedded barrier
  do i=0, iDatasets
   call MPI_Barrier
   if I do not own data - cycle and go to the next dataset (and barrier)
   check if the file exists, if not - sleep and check again until it
does (needed to make sure the buffer has been flushed)
   write my portion of the file
  end do
 In theory, the above should result in a sequential write of the
datasets to the file.

3. Process 0 checks whether the second file exists and, if it does not,
creates file 2, writes the file header and closes the file

4. there is a loop over the data sets with an embedded barrier
  do i=0, iDatasets
   call MPI_Barrier
   if I do not own data - cycle and go to the next dataset (and barrier)
   check if the file exists, if not - sleep and check again until it
does (needed to make sure the buffer has been flushed)
   write my portion of the file including a link to the 1st file
  end do
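
For concreteness, the create-then-append pattern in the steps above
would look roughly like the Fortran sketch below. All names here
(fname, iDatasets, i_own_data, the unit number) are placeholders
rather than the actual code; note that in Fortran MPI_Barrier takes
the communicator and an ierror argument, and that sleep is a common
compiler extension, not standard Fortran.

```fortran
! Sketch only - names and ownership checks are placeholders.
if (myrank == 0) then
   inquire(file=fname, exist=lexist)
   if (.not. lexist) then
      open(unit=10, file=fname, status='new', action='write')
      write(10,*) '... file header ...'
      close(10)                        ! close should flush the header
   end if
end if

do i = 0, iDatasets
   call MPI_Barrier(MPI_COMM_WORLD, ierr)
   if (.not. i_own_data(i)) cycle      ! next dataset (and next barrier)
   inquire(file=fname, exist=lexist)
   do while (.not. lexist)             ! poll until the file is visible,
      call sleep(1)                    ! to wait out unflushed buffers
      inquire(file=fname, exist=lexist)
   end do
   ! ... open with position='append' and write my portion ...
end do
```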

The sub is called several times (different files/datasets) with a
barrier between the calls; erratically, the program hangs in one of
the calls. The likelihood of the program hanging increases with the
number of processes. DDT shows that when this happens some of the
processes, including process 0, are waiting at the barrier inside the
first loop, some at the barrier in the second loop, whereas one
process is in the sleep/check-file-status cycle in the second loop.
So somehow some of the processes get through the first barrier before
process 0. This is a debug build, so no loop unrolling etc.

Is there anything I can do to make sure that the first barrier is
observed by all processes? Any advice greatly appreciated.

Evgeniy


OpenMPI: 1.4.3
(I cannot use parallel mpi io in this situation for various reasons)
