Hi,

We've been experiencing network saturation on our older nodes, caused by 
storage (GPFS) backups. This causes slurmctld to lose contact with slurmd on 
some compute nodes, and user jobs are killed as a result. While the longer-term 
solution is to replace these nodes and upgrade the network, I'm wondering 
whether increasing SlurmdTimeout has any ramifications beyond nodes with 
genuine issues taking longer to be marked down. We've already applied a modest 
increase, which has helped but not resolved the issue, so we are considering 
whether to push it further in the interim.


Kind Regards
Andy Baughan
HPC Systems Developer

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
