[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Fulcomer, Samuel via slurm-users
We'd bumped ours up for a while 20+ years ago when we had a flaky network connection between two buildings holding our compute nodes. If you need more than 600s, you have networking problems. On Mon, Feb 12, 2024 at 5:41 PM Timony, Mick via slurm-users <slurm-users@lists.schedmd.com> wrote: > We

[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Timony, Mick via slurm-users
We set SlurmdTimeout=600. The docs say not to go any higher than 65533 seconds: https://slurm.slurm.schedmd.com/slurm.conf.html#OPT_SlurmdTimeout The FAQ also has information about SlurmdTimeout. The worst thing that could happen is that it will take longer to mark nodes as down: >A node is set DOWN when the s
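To make the numbers in this thread concrete, here is a minimal sketch that reads slurm.conf-style text and reports the effective SlurmdTimeout. It assumes the documented default of 300 seconds and the documented ceiling of 65533 seconds. The 600-second warning threshold is only the rule of thumb suggested earlier in this thread, not a Slurm limit. The helper name `check_slurmd_timeout` is hypothetical, and the parser assumes one parameter per line and matches the key case-insensitively.

```python
import re
from typing import List, Tuple

DEFAULT_SLURMD_TIMEOUT = 300   # slurm.conf default, in seconds
MAX_SLURMD_TIMEOUT = 65533     # upper limit stated in the slurm.conf documentation
ADVISORY_THRESHOLD = 600       # rule of thumb from this thread, NOT a Slurm limit

# Hypothetical helper: matches "SlurmdTimeout=<int>" alone on a line, any case.
_PATTERN = re.compile(r"(?i)slurmdtimeout\s*=\s*(\d+)$")


def check_slurmd_timeout(conf_text: str) -> Tuple[int, List[str]]:
    """Return the effective SlurmdTimeout (seconds) and any warnings.

    Assumes one parameter per line; if the key appears more than once,
    this sketch simply keeps the last value it sees.
    """
    value = DEFAULT_SLURMD_TIMEOUT
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments and whitespace
        match = _PATTERN.match(line)
        if match:
            value = int(match.group(1))

    warnings: List[str] = []
    if value > MAX_SLURMD_TIMEOUT:
        warnings.append(
            f"{value}s exceeds the documented maximum of {MAX_SLURMD_TIMEOUT}s"
        )
    elif value > ADVISORY_THRESHOLD:
        warnings.append(
            f"{value}s is above {ADVISORY_THRESHOLD}s; check the network"
        )
    return value, warnings


if __name__ == "__main__":
    for sample in ["SlurmdTimeout=600", "SlurmdTimeout = 1200", ""]:
        print(repr(sample), check_slurmd_timeout(sample))
```

With these assumptions, the 600-second setting above passes silently, the 1200-second setting reported from Oslo produces one advisory warning, and an empty file falls back to the 300-second default.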

[slurm-users] Re: simple question, I guess… from a newbie sysadmin

2024-02-12 Thread Jess Arrington via slurm-users
Hi Richard, I hope your day is treating you well. Thank you for your posts on the Slurm user list. Would there be interest on your side to see a Slurm support contract for your systems at University of Nantes? Sites running Slurm with support give us feedback that support is invaluable and a

[slurm-users] Re: Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Bjørn-Helge Mevik via slurm-users
We've been running one cluster with SlurmdTimeout = 1200 sec for a couple of years now, and I haven't seen any problems due to that. -- Regards, Bjørn-Helge Mevik, dr. scient, Department for Research Computing, University of Oslo

[slurm-users] simple question, I guess… from a newbie sysadmin

2024-02-12 Thread Richard Randriatoamanana via slurm-users
Hi, I am trying to help a sysadmin colleague (and to understand for myself) who is configuring a new Slurm server. He is struggling to understand whether there is an alternative way to configure Slurm to manage per-user job submission policies without necessarily installing an accounting MariaDB service.

[slurm-users] Increasing SlurmdTimeout beyond 300 Seconds

2024-02-12 Thread Andrew Baughan (ITCS - Staff) via slurm-users
Hi, We've been experiencing issues with network saturation on our older nodes caused by storage (GPFS) backups. This causes slurmctld to lose contact with slurmd on some compute nodes and user jobs to be killed. While the longer term solution is to replace these and upgrade the network, I'