On 09/20/2013 12:48 PM, Noam Bernstein wrote:

On Sep 20, 2013, at 11:52 AM, Gus Correa<g...@ldeo.columbia.edu>  wrote:

Hi Noam

Could it be that Torque, or probably more likely NFS,
is too slow to create/make available the PBS_NODEFILE?

What if you insert a "sleep 2",
or whatever number of seconds you want,
before the mpiexec command line?
Or maybe better, a "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE",
just to make sure the file is available and
filled with the node list, before mpiexec takes over?
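
A minimal sketch of the check suggested above, written as a small shell
function for the job script. The function name check_nodefile is
hypothetical, and the mpiexec line in the comments is only a placeholder;
nothing here is specific to Torque beyond the $PBS_NODEFILE variable
already mentioned.

```shell
# Hypothetical helper: report on the node file before mpiexec starts.
# Prints the node list on stdout; diagnostics go to stderr.
# Returns 0 if the file exists and is non-empty, 1 otherwise.
check_nodefile() {
    nodefile="$1"
    if [ -z "$nodefile" ]; then
        echo "node file variable is empty or unset" >&2
        return 1
    fi
    # ls fails (non-zero status) if the file does not exist
    ls -l "$nodefile" >&2 || return 1
    # -s is true only for a file with size greater than zero
    if [ ! -s "$nodefile" ]; then
        echo "node file $nodefile is empty" >&2
        return 1
    fi
    cat "$nodefile"
}

# Assumed usage inside a job script (program name is a placeholder):
#   check_nodefile "$PBS_NODEFILE" || exit 1
#   mpiexec ./my_program
```

Sending the ls output to stderr keeps stdout clean, so the node list can
still be captured or compared later if needed.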

I don't see how NFS could be involved, since it's on a local filesystem.
As for adding a sleep, I already tried that - if the file doesn't exist, I
sleep a few seconds and check again, and in every case if it's not there to
begin with it's not there the second time either.  And this all doesn't
explain the very
mysterious even more infrequent situation where I can cat the file, but
mpirun can't find it.
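
For reference, the sleep-and-recheck approach described above could look
roughly like the following. This is only a sketch: the function name
wait_for_file and the default attempt count and delay are assumptions,
not what was actually run. It logs the hostname on each failed attempt,
which may help show which node is missing the file.

```shell
# Hypothetical helper: poll until a file exists and is non-empty.
# usage: wait_for_file PATH [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the file is found, 1 if all attempts fail.
wait_for_file() {
    path="$1"
    attempts="${2:-5}"   # assumed default: 5 checks
    delay="${3:-2}"      # assumed default: 2 seconds between checks
    i=1
    while [ "$i" -le "$attempts" ]; do
        if [ -s "$path" ]; then
            return 0
        fi
        echo "attempt $i: $path missing or empty on $(hostname)" >&2
        i=$((i + 1))
        # sleep only if another attempt remains
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Assumed usage inside a job script:
#   wait_for_file "$PBS_NODEFILE" 5 2 || exit 1
```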


Hi Noam

I only read the full email exchange after I sent my message.
Now I read it is not over NFS but local.
Still, a communication delay (which can be non-deterministic)
between pbs_server and the local pbs_mom on the node could be
behind the problem (say, if the server first authorizes the
node to start the job, and only then copies the
node file over, which may take some time,
depending on the network traffic).

Gus Correa
