On 09/20/2013 12:48 PM, Noam Bernstein wrote:
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote:
Hi Noam
Could it be that Torque, or probably more likely NFS,
is too slow to create/make available the PBS_NODEFILE?
What if you insert a "sleep 2",
or whatever number of seconds you want,
before t
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote:
> Hi Noam
>
> Could it be that Torque, or probably more likely NFS,
> is too slow to create/make available the PBS_NODEFILE?
>
> What if you insert a "sleep 2",
> or whatever number of seconds you want,
> before the mpiexec command line?
> Or may
Hi Noam
Could it be that Torque, or probably more likely NFS,
is too slow to create/make available the PBS_NODEFILE?
What if you insert a "sleep 2",
or whatever number of seconds you want,
before the mpiexec command line?
Or maybe better, a "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE",
just to make
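The "sleep 2" and "ls -l $PBS_NODEFILE" suggestions above can be combined into a small guard at the top of the job script. This is only a sketch, assuming the node file is merely late rather than never created. The function name wait_for_nodefile, the 10-second timeout, and the ./my_program placeholder are illustrative and do not come from the thread.

```shell
#!/bin/bash
# Hypothetical helper (not from the thread): wait up to $2 seconds
# (default 10) for the file named in $1 to exist and be non-empty.
# Returns 0 once the file is there, 1 if the timeout expires first.
wait_for_nodefile() {
    local file="$1"
    local timeout="${2:-10}"
    local waited=0
    while [ ! -s "$file" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "node file '$file' still missing after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage inside a Torque job script (the program name is a placeholder):
#   wait_for_nodefile "$PBS_NODEFILE" 10 || exit 1
#   ls -l "$PBS_NODEFILE"; cat "$PBS_NODEFILE"
#   mpiexec ./my_program
```

If the guard times out, the message on stderr at least records that the file was missing before mpiexec ran, which helps separate a Torque-side problem from an Open MPI one.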
On Sep 20, 2013, at 10:36 AM, Noam Bernstein wrote:
>
> On Sep 20, 2013, at 10:22 AM, Reuti wrote:
>
>>
>> Is the location for the spool directory local or shared by NFS? Disk full?
>
> No - locally mounted, and far from full on all the nodes.
Another new observation, which may shift the
On Sep 20, 2013, at 10:22 AM, Reuti wrote:
>
> Is the location for the spool directory local or shared by NFS? Disk full?
No - locally mounted, and far from full on all the nodes.
Noam
Hi,
Am 20.09.2013 um 16:12 schrieb Noam Bernstein:
> On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote:
>
>> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
>> was there, but now it seems like every time the job fails it's because this
>> file really is missing.
On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote:
>
> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
> was there, but now it seems like every time the job fails it's because this
> file really is missing. Time to check why torque isn't always creating
> the nodef
On Sep 20, 2013, at 9:55 AM, Noam Bernstein wrote:
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around. The line appears to be
> associated with looking for the torque/maui generated node file,
> and when I do something like
> echo $PBS_
Hi - we've been using Open MPI for a while, but only for the last few months
with Torque/Maui. Intermittently (maybe 1 in 10 jobs), we get MPI jobs that fail
with the error:
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file
ras_tm_module.c at line 142
[compute-2-4:32448] [