byron writes:
> Yep, the question of how he has the job set up is an ongoing conversation,
> but for now it is staying like this and I have to make do.
Wow, your user must have friends in high places if he gets to do something
as goofy as starting a one-second job every second.
Hi Byron,
does the slurmctld recover by itself or does it require a manual restart of
the service? We had some deadlock issues related to MCS handling just after
doing the 19->20->21 upgrades. I don't recall what fixed the issue, but
disabling MCS might be a good place to start if you are using it.
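A quick way to check whether MCS is in play at all (assuming a standard
setup; mcs/none is the default plugin, which disables it):

scontrol show config | grep -i mcs

and in slurm.conf, disabling it is just reverting to the default:

MCSPlugin=mcs/none
# MCSParameters can be dropped once mcs/none is set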
Yep, the question of how he has the job set up is an ongoing conversation,
but for now it is staying like this and I have to make do.
Even with all the traffic he is generating, though (at worst one job a
second over the course of a day), I would still have thought that Slurm
was capable of managing that.
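For what it's worth, the slurm.conf knobs I have been looking at for
absorbing the submit rate are these (just a sketch, and the values are my
own untested guesses):

SchedulerParameters=defer,sched_min_interval=2000000,batch_sched_delay=10
# defer              - don't try to schedule each job at submit time
# sched_min_interval - minimum microseconds between main scheduler runs
# batch_sched_delay  - seconds a batch job's scheduling may be delayed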
Hi Byron,
byron writes:
> Hi Loris - about a second
What is the use-case for that? Are these individual jobs or is it a job
array? Either way it sounds to me like a very bad idea. On our system,
jobs which can start immediately because resources are available still
take a few seconds to start.
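If those are individual submissions, collapsing a day's worth into one
throttled array would at least cut the RPC traffic, e.g. (illustrative
numbers, and job.sh is just a placeholder script):

# one array task per second-slot, at most one running at a time
sbatch --array=0-86399%1 job.sh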
Hi Loris - about a second
On Thu, Jul 28, 2022 at 2:47 PM Loris Bennett
wrote:
> Hi Byron,
>
> byron writes:
>
> > Hi
> >
> > We recently upgraded slurm from 19.05.7 to 20.11.9 and now we
> > occasionally (3 times in 2 months) have slurmctld hanging so we get the
> > following message when running sinfo
Hi Byron,
We ran into this with 20.02, and mitigated it with some kernel tuning. From
our sysctl.conf:
net.core.somaxconn = 2048
net.ipv4.tcp_max_syn_backlog = 8192
# prevent neighbour (aka ARP) table overflow...
net.ipv4.neigh.default.gc_thresh1 = 3
net.ipv4.neigh.default.gc_thresh2 = 320
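To apply those without a reboot and confirm they took effect:

sysctl -p /etc/sysctl.conf   # reload the settings from the file
sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog   # verify
netstat -s | grep -i listen  # any 'listen queue overflowed' counters?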
Hi Byron,
byron writes:
> Hi
>
> We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally (3
> times in 2 months) have slurmctld hanging so we get the following message
> when running sinfo
>
> “slurm_load_jobs error: Socket timed out on send/recv operation”
>
> It only seems to happen when one of our users runs a job
Hi
We recently upgraded slurm from 19.05.7 to 20.11.9 and now we occasionally
(3 times in 2 months) have slurmctld hanging so we get the following
message when running sinfo
“slurm_load_jobs error: Socket timed out on send/recv operation”
It only seems to happen when one of our users runs a job
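Next time it hangs I plan to grab some state before restarting anything;
assuming the standard tools are the right ones for this, something like:

scontrol ping           # does slurmctld respond at all?
sdiag                   # RPC counts, server thread count, agent queue size
gstack $(pidof slurmctld)   # where the daemon is stuck (needs gdb/gstack)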