Hi folks,

We are currently encountering a weird file coherence issue when running parallel jobs with OpenMPI (1.8.7) and writing files in parallel to an NFS-mounted file system using Parallel netCDF 1.6.1 (which internally uses MPI-I/O). Sometimes (~30-40% of our samples) we get a file whose contents are not consistent across different hosts.
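For reference, a per-host consistency check of this kind can be scripted roughly as below. This is only a minimal sketch, assuming Python is available on the nodes; the helper names `file_sha256` and `hosts_disagreeing` are made up for illustration and are not part of OpenMPI or Parallel netCDF. The idea is to run `file_sha256` on each host against the same path, collect the results, and flag any host whose digest differs from the most common one.

```python
import hashlib
from collections import Counter


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large output files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hosts_disagreeing(digests_by_host):
    """Given {hostname: digest}, return the hosts whose digest differs
    from the most common digest (hypothetical helper for illustration)."""
    if not digests_by_host:
        return []
    majority, _ = Counter(digests_by_host.values()).most_common(1)[0]
    return sorted(h for h, d in digests_by_host.items() if d != majority)
```

Note that each host reads the file through its own NFS client, so the digest reflects what that host currently sees, which is exactly what needs to be compared here.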
Specifically, one of the hosts where the file was created will (persistently) show a different file than any other host (confirmed using md5sum/sha256sum and manually). From our observations it seems like the bad host keeps an older state of the file, i.e. one where not all write processes had finished.

The error seems to occur only if the ranks are distributed to at least two nodes, and it only occurs if there are multiple programs running within the same PBS/TORQUE job at the same time (MPMD; each mpirun gets a different subset of the job nodes using the -machinefile flag).

Has anyone encountered something similar or do you have an idea what I could do to track down the problem?

Regards,

Michael

--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: m.schlottke-lakem...@aia.rwth-aachen.de
Web: http://www.jara.org/jara-hpc