Re: [slurm-users] slurmctld hanging

2022-07-29 Thread Loris Bennett
byron writes: > Yep, the question of how he has the job set up is an ongoing conversation, > but for now it is staying like this and I have to make do. Wow, your user must have friends in high places, if he gets to do something as goofy as starting a one-second job every second. > Even with a

Re: [slurm-users] slurmctld hanging

2022-07-29 Thread Maciej Pawlik
Hi Byron, does the slurmctld recover by itself or does it require a manual restart of the service? We had some deadlock issues related to MCS handling just after doing the 19->20->21 upgrades. I don't recall what fixed the issue, but disabling MCS might be a good place to start if you are using it.

Re: [slurm-users] slurmctld hanging

2022-07-29 Thread byron
Yep, the question of how he has the job set up is an ongoing conversation, but for now it is staying like this and I have to make do. Even with all the traffic he is generating, though (at worst one job a second over the course of a day), I would still have thought that slurm was capable of managing that.

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Loris Bennett
Hi Byron, byron writes: > Hi Loris - about a second What is the use-case for that? Are these individual jobs or is it a job array? Either way it sounds to me like a very bad idea. On our system, jobs which can start immediately because resources are available still take a few seconds to start

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi Loris - about a second On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett wrote: > Hi Byron, > > byron writes: > > > Hi > > > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging so we get the following message when running

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Fulcomer, Samuel
Hi Byron, We ran into this with 20.02, and mitigated it with some kernel tuning. From our sysctl.conf:
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 3
net.ipv4.neigh.default.gc_thresh2 = 320
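The settings quoted above can be compared against a host's running values before anything is changed. The following is a minimal sketch, not part of the original message: the helper names (`parse_sysctl_conf`, `sysctl_path`, `current_value`) are hypothetical, it assumes a Linux host where sysctl keys are exposed under /proc/sys, and it only includes the two values that are fully visible in the truncated snippet.

```python
from pathlib import Path

# Values copied from the message above. The archived snippet is truncated,
# so only the settings that are unambiguously complete are listed here.
QUOTED_SETTINGS = """
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
"""


def parse_sysctl_conf(text):
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under /proc/sys (simple keys only)."""
    return Path(root) / key.replace(".", "/")


def current_value(key):
    """Return the running value as a string, or None if it cannot be read."""
    try:
        return sysctl_path(key).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    for key, wanted in parse_sysctl_conf(QUOTED_SETTINGS).items():
        print(f"{key}: current={current_value(key)} quoted={wanted}")
```

Run on the slurmctld host, this prints the current and quoted value for each key side by side; whether the quoted values suit a given cluster is a separate question that the thread does not settle.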

Re: [slurm-users] slurmctld hanging

2022-07-28 Thread Loris Bennett
Hi Byron, byron writes: > Hi > > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging so we get the following message when running sinfo > > “slurm_load_jobs error: Socket timed out on send/recv operation” > > It only seem

[slurm-users] slurmctld hanging

2022-07-28 Thread byron
Hi, We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3 times in 2 months) have slurmctld hanging, so we get the following message when running sinfo: “slurm_load_jobs error: Socket timed out on send/recv operation”. It only seems to happen when one of our users runs a job