Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Gus Correa
On 09/20/2013 12:48 PM, Noam Bernstein wrote:
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote:
Hi Noam
Could it be that Torque, or probably more likely NFS, is too slow to create/make available the PBS_NODEFILE?
What if you insert a "sleep 2", or whatever number of seconds you want, before t

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote:
> Hi Noam
>
> Could it be that Torque, or probably more likely NFS,
> is too slow to create/make available the PBS_NODEFILE?
>
> What if you insert a "sleep 2",
> or whatever number of seconds you want,
> before the mpiexec command line?
> Or may

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Gus Correa
Hi Noam

Could it be that Torque, or probably more likely NFS,
is too slow to create/make available the PBS_NODEFILE?

What if you insert a "sleep 2",
or whatever number of seconds you want,
before the mpiexec command line?
Or maybe better, a "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE", just to make
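
The diagnostic suggested above could be sketched roughly as follows. This is only an illustration, not something posted in the thread: the function name wait_for_nodefile, the default of 10 attempts, and the ./my_app placeholder are all assumptions. Instead of a fixed "sleep 2", it polls once per second until $PBS_NODEFILE exists and is non-empty, then runs the suggested "ls -l" and "cat" so the job log shows what was found.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): poll until the node file exists
# and is non-empty, or give up after a number of one-second attempts.
wait_for_nodefile() {
    nodefile=$1
    attempts=${2:-10}    # assumed default; tune to taste
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if [ -s "$nodefile" ]; then
            # The diagnostics suggested in the message: show the file and its contents.
            ls -l "$nodefile"
            cat "$nodefile"
            return 0
        fi
        sleep 1
        i=$((i + 1))
    done
    echo "node file '$nodefile' still missing after $attempts attempts" >&2
    return 1
}

# Intended use inside a Torque job script, before the mpiexec line:
#   wait_for_nodefile "$PBS_NODEFILE" 10 || exit 1
#   mpiexec ./my_app      # ./my_app is a placeholder
```

If the file never appears, the job exits with a clear message on stderr, which at least separates "node file created late" from "node file never created".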

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:36 AM, Noam Bernstein wrote:
>
> On Sep 20, 2013, at 10:22 AM, Reuti wrote:
>
>>
>> Is the location for the spool directory local or shared by NFS? Disk full?
>
> No - locally mounted, and far from full on all the nodes.

Another new observation, which may shift the

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:22 AM, Reuti wrote:
>
> Is the location for the spool directory local or shared by NFS? Disk full?

No - locally mounted, and far from full on all the nodes.

Noam

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Reuti
Hi,

On 20.09.2013 at 16:12, Noam Bernstein wrote:
> On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote:
>
>> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
>> was there, but now it seems like every time the job fails it's because this
>> file really is missing.

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote:
>
> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
> was there, but now it seems like every time the job fails it's because this
> file really is missing. Time to check why torque isn't always creating
> the nodef

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 9:55 AM, Noam Bernstein wrote:
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around. The line appears to be
> associated with looking for the torque/maui generated node file,
> and when I do something like
> echo $PBS_

[OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
Hi - we've been using openmpi for a while, but only for the last few months with torque/maui. Intermittently (maybe 1/10 jobs), we get mpi jobs that fail with the error:
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [