Hi folks,

We are currently encountering a strange file coherence issue when running
parallel jobs with Open MPI 1.8.7 and writing files in parallel to an
NFS-mounted file system using Parallel netCDF 1.6.1 (which internally uses
MPI-I/O). In roughly 30-40% of our samples, we get a file whose contents are
not consistent across different hosts.

Specifically, one of the hosts where the file was created will (persistently)
show different file contents than every other host (confirmed using
md5sum/sha256sum and by manual inspection). From our observations, it seems
that the bad host keeps an older state of the file, i.e. one where not all
writing processes had finished. The error seems to occur only if the ranks are
distributed across at least two nodes, and only if multiple programs are
running within the same PBS/Torque job at the same time (MPMD; each mpirun gets
a different subset of the job nodes using the -machinefile flag).
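
The per-host checksum comparison described above could be scripted roughly as follows. This is only a minimal Python sketch, not the commands that were actually run: the helper names `file_digest` and `find_outliers` are hypothetical, and it assumes that the digest has already been computed on each host for the same path and collected into a dictionary by some other means.

```python
import hashlib
from collections import Counter


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large output files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_outliers(digests_by_host):
    """Return the sorted host names whose digest differs from the most common one.

    `digests_by_host` maps a host name to the digest that host reported.
    If several digests are equally common, the one seen first counts as
    the reference, so a two-host disagreement always flags the second host.
    """
    if not digests_by_host:
        return []
    reference, _ = Counter(digests_by_host.values()).most_common(1)[0]
    return sorted(host for host, digest in digests_by_host.items()
                  if digest != reference)
```

Running `file_digest` on every node of the job directly after the write has finished, and again some time later, would show whether the deviating host ever catches up or keeps the older state permanently.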

Has anyone encountered something similar, or do you have an idea what we could
do to track down the problem?

Regards,

Michael


--
Michael Schlottke-Lakemper

SimLab Highly Scalable Fluids & Solids Engineering
Jülich Aachen Research Alliance (JARA-HPC)
RWTH Aachen University
Wüllnerstraße 5a
52062 Aachen
Germany

Phone: +49 (241) 80 95188
Fax: +49 (241) 80 92257
Mail: m.schlottke-lakem...@aia.rwth-aachen.de
Web: http://www.jara.org/jara-hpc
