Hi - we've been using Open MPI for a while, but only for the last few months
with Torque/Maui. Intermittently (maybe 1 in 10 jobs), we get MPI jobs that fail
with the error:
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194
This is completely unrepeatable - resubmitting the same job almost
always works the second time around. The first failing line
(ras_tm_module.c, line 142) appears to be associated with looking for
the Torque/Maui-generated node file, yet when I do something like
echo $PBS_NODEFILE
cat $PBS_NODEFILE
it appears that the file is present and correct.
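To catch the failure in the act, a slightly more thorough version of that check could go in the job script just before mpirun. This is only a sketch: check_nodefile is a made-up helper name (not part of Torque or Open MPI), and it only tests what the shell sees at that moment, which may not be exactly what mpirun sees when it opens the file.

```shell
# Hypothetical helper: report on the node file with a timestamp and
# hostname so a failure can be matched against the ORTE_ERROR_LOG output.
# Returns 0 if the file is readable and non-empty, non-zero otherwise.
check_nodefile() {
    nodefile="$1"
    stamp="$(date '+%Y-%m-%d %H:%M:%S') $(hostname)"
    if [ -z "$nodefile" ]; then
        echo "$stamp: PBS_NODEFILE is unset or empty" >&2
        return 1
    fi
    if [ ! -r "$nodefile" ]; then
        echo "$stamp: cannot read $nodefile" >&2
        # Show directory and file permissions to help with diagnosis.
        ls -ld "$(dirname "$nodefile")" "$nodefile" >&2 || true
        return 2
    fi
    if [ ! -s "$nodefile" ]; then
        echo "$stamp: $nodefile exists but is empty" >&2
        return 3
    fi
    echo "$stamp: $nodefile OK, $(wc -l < "$nodefile") lines" >&2
    return 0
}
```

In the job script it would be called as check_nodefile "$PBS_NODEFILE" right before the mpirun line, so that a failed job leaves a record of whether the file looked fine from the shell at launch time.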
We're running OpenMPI 1.6.4, configured with
./configure \
--prefix=${DEST} \
--with-tm=/usr/local/torque \
--enable-mpirun-prefix-by-default \
--with-openib=/usr \
--with-openib-libdir=/usr/lib64
Has anyone seen anything like this before, or have any ideas about what might
be happening? As noted, the failure seems to occur where Open MPI looks for
the PBS node file, which is on a local filesystem (e.g.
PBS_NODEFILE=/var/spool/torque/aux//4600.tin).
thanks,
Noam
Noam Bernstein
Center for Computational Materials Science
NRL Code 6390
[email protected]