I'm running a small-ish Slurm grid, 87 nodes with various hardware. On a few 
occasions lately, users submitting jobs have gotten an orted error and the job 
has failed.  Try again a few hours later or the next day, and the same job runs 
just fine.

Google-fu indicated it might be a DNS issue if, for whatever reason, a node 
couldn't resolve the addresses of the other nodes in the job.  So I populated 
/etc/hosts on each node with a complete listing of all the nodes so there 
wouldn't be any reliance on DNS.  That very afternoon another job failed with 
an orted error.  So it seems that, at least in my case, DNS isn't the issue.
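
One way to double-check that name resolution really is ruled out is to run a 
small resolver check on each node. The following is only a sketch: the function 
name resolve_all and its resolver parameter are made up for illustration, and 
it only exercises the system resolver (which normally consults /etc/hosts 
first, depending on nsswitch.conf), not anything orted-specific.

```python
import socket


def resolve_all(hostnames, resolver=socket.gethostbyname):
    """Try to resolve each hostname with the system resolver.

    Returns (resolved, failed): resolved maps name -> IPv4 address,
    failed maps name -> error text. The resolver argument is a
    hypothetical hook so the check can be tested without a network.
    """
    resolved, failed = {}, {}
    for name in hostnames:
        try:
            resolved[name] = resolver(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            failed[name] = str(exc)
    return resolved, failed
```

Feeding it the node names of a job (for example the output of 
"scontrol show hostnames" for the job's node list) from every node in the 
allocation and comparing the results would show whether any node sees a 
missing or different address. An empty failed dict on all nodes would support 
the conclusion that DNS is not the cause.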

What's the best way to troubleshoot this when orted fails but doesn't give any 
error message indicating the root cause of the failure?  I also can't induce 
the failure predictably; I just have to wait until it randomly chokes.
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com