You might want to run your app through a memory-checking debugger to
see if anything obvious shows up.
Also, check that your core file size limit is greater than zero (e.g.,
set it to "unlimited"). Then run again and see if you can get core files
to see if your app is silently dumping core, an
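
As a rough sketch of the core limit check, assuming a bash-like shell on the node where the job is launched (the message above does not say which shell or scheduler is in use), something along these lines shows the limit and tries to raise it. Note that processes started on remote nodes may pick up their limits from the remote login environment rather than from this shell, so the same check may be needed there too.

```shell
# Show the current soft limit on core file size; 0 means no core files are written.
ulimit -c

# Try to lift the limit for this shell and for anything launched from it.
# This fails if the hard limit is lower, in which case the hard limit is printed.
ulimit -c unlimited || echo "could not raise core limit; hard limit is $(ulimit -H -c)"

# Confirm the value that is now in effect before starting the job.
ulimit -c
```

If the last command still prints 0, the hard limit is most likely the obstacle and has to be raised by an administrator.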
This is not OpenMPI-specific, but maybe somebody on the list can give me a
hint.
I start a parallel job with:
mpirun -np 19 -nolocal -machinefile machinefile bin/getm_prod_IFORT.0096x0096
Everything starts OK and the simulation carries on for 2+ hours of
wall-clock time, then suddenly, without a trace