This is not OpenMPI specific - but maybe somebody on the list can give a
hint.
I start a parallel job with:
mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096
Everything starts OK and the simulation runs for 2+ hours of
wall-clock time. Then suddenly, without a trace in the logfile:
19:48:46.172 n= 1800
2003-09-01 05:06:00: reading 2D boundary data ...
19:49:21.710 n= 1900
19:49:50.490 n= 2000
or in any system logfiles, the simulation stops and all related processes
on the nodes stop.
If I re-run it, the simulation does not stop at the same time.
Does anybody have a clue where I should look?
I use a cluster of 4 machines (dual processor, dual core) connected via Gbit/s Ethernet.
Karsten
PS: If I use MPICH I get the same problem.
--
Karsten Bolding, Bolding & Burchard Hydrodynamics
Strandgyden 25, DK-5466 Asperup, Denmark
Phone: +45 64422058, Fax: +45 64422068
http://www.findvej.dk/Strandgyden25,5466,11,3