Michael,

Are you running 1.8.7 or master?
If you are not using the default, which io module are you running?
(The default is ROMIO with 1.8, but ompio with master.)

By any chance, could you post a simple program that reproduces this issue?
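
While narrowing this down, it may also help to script the checksum comparison you describe below rather than doing it by hand. The following is only a rough sketch in Python, not anything Open MPI or Parallel netCDF provides: the helper names `file_sha256` and `divergent_hosts` are made up for illustration, and it assumes you collect one digest per host yourself (for example by running the first function on each node of the job).

```python
import hashlib
from collections import Counter


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large output files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def divergent_hosts(digests):
    """Given {hostname: digest}, return the sorted hosts that disagree
    with the most common digest.

    Hypothetical helper: on a tie, the digest seen first is treated as
    the majority, so with only two hosts the result is arbitrary.
    """
    if not digests:
        return []
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, d in digests.items() if d != majority)
```

Calling `file_sha256` on every node right after the job finishes, and passing the collected `{hostname: digest}` mapping to `divergent_hosts`, should show whether it is always one of the writing hosts that sees the stale file.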

Cheers,

Gilles

On Thursday, July 23, 2015, Schlottke-Lakemper, Michael <
m.schlottke-lakem...@aia.rwth-aachen.de> wrote:

>  Hi folks,
>
>  We are currently encountering a weird file coherence issue when running
> parallel jobs with OpenMPI (1.8.7) and writing files in parallel to an
> NFS-mounted file system using Parallel netCDF 1.6.1 (which internally uses
> MPI-I/O). Sometimes (~30-40% of our samples) we get a file whose contents
> are not consistent across different hosts.
>
>  Specifically, one of the hosts where the file was created will
> (persistently) show a different file than any other host (confirmed using
> md5sum/sha256sum and manually). From our observations it seems like the bad
> host keeps an older state of the file, i.e. one where not all write
> processes had finished. The error seems to occur only if the ranks are
> distributed to at least two nodes, and it only occurs if there are multiple
> programs running within the same pbs/torque job at the same time (MPMD;
> each mpirun gets a different subset of the job nodes using the -machinefile
> flag).
>
>  Has anyone encountered something similar or do you have an idea what I
> could do to track down the problem?
>
>  Regards,
>
>  Michael
>
>
> --
> Michael Schlottke-Lakemper
>
>  SimLab Highly Scalable Fluids & Solids Engineering
> Jülich Aachen Research Alliance (JARA-HPC)
> RWTH Aachen University
> Wüllnerstraße 5a
> 52062 Aachen
> Germany
>
>  Phone: +49 (241) 80 95188
> Fax: +49 (241) 80 92257
> Mail: m.schlottke-lakem...@aia.rwth-aachen.de
> Web: http://www.jara.org/jara-hpc
>
>
