Call healthcheck with a shell script that starts with:
sleep $(( (RANDOM % 10) + 1 )), or similar.
M.K.
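A minimal sketch of such a wrapper, not M.K.'s actual script. Only /usr/sbin/nhc comes from the thread. The wrapper path, the MAX_DELAY and NHC variables, and the pick_delay helper are hypothetical names chosen for illustration. HealthCheckProgram would then point at the wrapper instead of at nhc directly.

```shell
#!/bin/bash
# Hypothetical wrapper (e.g. /usr/local/sbin/nhc-jitter): wait a random
# number of seconds, then run the real health check.

MAX_DELAY="${MAX_DELAY:-10}"   # assumed upper bound for the offset, in seconds
NHC="${NHC:-/usr/sbin/nhc}"    # health check program named in the thread

# Store a random whole number between 1 and $1 (inclusive) in $delay.
pick_delay() {
    delay=$(( (RANDOM % $1) + 1 ))
}

if [ -x "$NHC" ]; then
    pick_delay "$MAX_DELAY"
    sleep "$delay"
    # Replace this shell with the health check so its exit status is passed on.
    exec "$NHC" "$@"
else
    echo "health check program not found or not executable: $NHC" >&2
fi
```

With MAX_DELAY left at 10, the checks on different nodes start within a 10-second window. A larger value spreads them further, but it should stay well below the 600-second interval.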
From: slurm-users on behalf of SJTU
Sent: Thursday, November 26, 2020 8:24 PM
To: slurm-users@lists.schedmd.com
Subject: [slurm-users] Set a ramdom offset
Hi,
We use HealthCheckProgram = /usr/sbin/nhc in Slurm to check node health
every 600 seconds. However, some NHC checks point to the same central resource,
so starting these checks simultaneously may lead to false alarms of service
degradation.
Is it possible to set a random offset to when
--
One last shot on the firewall front Steve -- does the control node have
a firewall enabled? I've seen cases where that can cause the sporadic
messaging failures that you seem to be seeing.
That failing, I'll defer to anyone with different ideas!
Andy
On 11/26/2020 1:01 PM, Steve Bland wrote:
Thanks Andy
The firewall is off on all three systems. Also, if they could not communicate, I do
not think 'scontrol show node' would return the data that it does. And the
logs would not show responses as indicated below.
And the names are correct, used the recommended 'hostname -s' when configurin
1. Look for a firewall on all of your Slurm nodes -- they almost always break
Slurm communications.
2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly,
"srvgridslurm01"
Andy
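The hostname check in step 2 can be run across several nodes with a short loop. This is a sketch only: the names_match and check_node helpers are hypothetical, and it assumes key-based ssh access from the machine where it runs.

```shell
#!/bin/bash
# Hypothetical helper: confirm that each node reports exactly the name
# it is addressed by. Pass the node names as arguments.

# Succeed only on an exact, case-sensitive match.
names_match() {
    [ "$1" = "$2" ]
}

check_node() {
    local node="$1" reported
    # BatchMode avoids password prompts; ConnectTimeout avoids long hangs.
    if ! reported="$(ssh -o BatchMode=yes -o ConnectTimeout=5 "$node" hostname)"; then
        echo "$node: ssh failed" >&2
        return 1
    fi
    if names_match "$node" "$reported"; then
        echo "$node: ok"
    else
        echo "$node: reports '$reported'" >&2
        return 1
    fi
}

for node in "$@"; do
    check_node "$node"
done
```

For example, saved under an assumed name such as check-hostnames.sh, it could be called as "./check-hostnames.sh srvgridslurm01 srvgridslurm02". Because the comparison is exact, a difference in case or a trailing domain name is reported as a mismatch.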
On 11/26/2020 12:21 PM, Steve Bland wrote:
Sinfo always returns nodes not responding
[root@srvgridslurm03 ~]# sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Not responding       slurm     2020-11-26T09:12:58 SRVGRIDSLURM01
Not responding       slurm     2020-11-26T08:27:58 SRVGRIDSLURM02
Not responding       slurm