Michael,

Are you running 1.8.7 or master? If you are not using the default, which io module are you running? (The default is ROMIO with 1.8 but ompio with master.)

By any chance, could you post a simple program that reproduces this issue?

Cheers,

Gilles

On Thursday, July 23, 2015, Schlottke-Lakemper, Michael <m.schlottke-lakem...@aia.rwth-aachen.de> wrote:

> Hi folks,
>
> We are currently encountering a weird file coherence issue when running
> parallel jobs with OpenMPI (1.8.7) and writing files in parallel to an
> NFS-mounted file system using Parallel netCDF 1.6.1 (which internally uses
> MPI-I/O). Sometimes (~30-40% of our samples) we get a file whose contents
> are not consistent across different hosts.
>
> Specifically, one of the hosts where the file was created will
> (persistently) show a different file than any other host (confirmed using
> md5sum/sha256sum and manually). From our observations it seems like the bad
> host keeps an older state of the file, i.e. one where not all write
> processes had finished. The error seems to occur only if the ranks are
> distributed to at least two nodes, and it only occurs if there are multiple
> programs running within the same pbs/torque job at the same time (MPMD;
> each mpirun gets a different subset of the job nodes using the -machinefile
> flag).
>
> Has anyone encountered something similar or do you have an idea what I
> could do to track down the problem?
>
> Regards,
>
> Michael
>
>
> --
> Michael Schlottke-Lakemper
>
> SimLab Highly Scalable Fluids & Solids Engineering
> Jülich Aachen Research Alliance (JARA-HPC)
> RWTH Aachen University
> Wüllnerstraße 5a
> 52062 Aachen
> Germany
>
> Phone: +49 (241) 80 95188
> Fax: +49 (241) 80 92257
> Mail: m.schlottke-lakem...@aia.rwth-aachen.de
> Web: http://www.jara.org/jara-hpc
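
In case it helps with narrowing this down: the md5sum/sha256sum comparison described in the quoted message could be scripted so that every host reports what it currently sees. The following is only a minimal sketch, not part of the original report. The helper names `file_sha256` and `report` are made up for illustration, and it assumes the output file is visible under the same path on every node.

```python
import hashlib
import socket
import sys


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large output files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def report(path):
    """Format one line: local hostname, digest, path (hypothetical layout)."""
    return f"{socket.gethostname()}  {file_sha256(path)}  {path}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Print one report line per file given on the command line.
    for p in sys.argv[1:]:
        print(report(p))
```

Running this on each node of the job against the same file and comparing the printed lines would show which host holds the diverging copy. The digest only reflects what that host's client returns at read time, so a stale NFS cache would show up as a different digest rather than as an error.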