Hi Noam

Could it be that Torque, or probably more likely NFS,
is too slow to create/make available the PBS_NODEFILE?

What if you insert a "sleep 2",
or whatever number of seconds you want,
before the mpiexec command line?
Or maybe better, a "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE",
just to make sure the file is available and
filled with the node list, before mpiexec takes over?
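
Something along these lines in the job script might do it. This is only
a rough sketch, not anything Torque or Open MPI provides: the function name
wait_for_nodefile, the default of 10 tries, the 1-second pause, and the
./my_program placeholder are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque or Open MPI):
# wait until the file named by $1 exists and is non-empty,
# trying up to $2 times (default 10) with a 1-second pause after each miss.
wait_for_nodefile() {
    file=$1
    tries=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # -s is true only if the file exists and has a size greater than zero
        if [ -s "$file" ]; then
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    return 1
}

# Possible use in the job script, before the launch line
# (./my_program is a placeholder):
#
#   if wait_for_nodefile "$PBS_NODEFILE"; then
#       ls -l "$PBS_NODEFILE"
#       cat "$PBS_NODEFILE"
#       mpiexec ./my_program
#   else
#       echo "PBS_NODEFILE missing or empty: $PBS_NODEFILE" >&2
#       exit 1
#   fi
```

If the failing jobs print the "missing or empty" message, that would point
at the file not being ready in time; if they still fail after the file has
been listed and printed, the cause is probably elsewhere.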

My two cents,
Gus Correa

On 09/20/2013 09:55 AM, Noam Bernstein wrote:
Hi - we've been using openmpi for a while, but only for the last few months
with torque/maui.  Intermittently (maybe 1/10 jobs), we get mpi jobs that fail 
with the error:

[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 82
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/ras_base_allocate.c at line 149
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file base/plm_base_launch_support.c at line 99
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file plm_tm_module.c at line 194

This is completely unrepeatable - resubmitting the same job almost
always works the second time around.  The line appears to be
associated with looking for the torque/maui generated node file,
and when I do something like
   echo $PBS_NODEFILE
   cat $PBS_NODEFILE
it appears that the file is present and correct.

We're running OpenMPI 1.6.4, configured with
./configure \
         --prefix=${DEST} \
         --with-tm=/usr/local/torque \
         --enable-mpirun-prefix-by-default \
         --with-openib=/usr \
         --with-openib-libdir=/usr/lib64

Has anyone seen anything like this before, or has any ideas of what might
be happening?  It appears to be a line where openmpi looks for
the PBS node file, which is on a local filesystem (e.g. 
PBS_NODEFILE=/var/spool/torque/aux//4600.tin).

                                                                        thanks,
                                                                        Noam



Noam Bernstein
Center for Computational Materials Science
NRL Code 6390
noam.bernst...@nrl.navy.mil




_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users