Re: [slurm-users] Set a random offset when starting node health check in SLURM

2020-11-26 Thread Micheal Krombopulous
Call the health check from a shell script that starts with: sleep $(( (RANDOM % 10) + 1 )), or similar. M.K. From: slurm-users on behalf of SJTU Sent: Thursday, November 26, 2020 8:24 PM To: slurm-users@lists.schedmd.com Subject: [slurm-users] Set a random offset
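A minimal sketch of such a wrapper, assuming bash (for $RANDOM) and the /usr/sbin/nhc path from the original question. The function names jitter_seconds and run_with_jitter are hypothetical, and the 10-second ceiling is just the value from the reply above:

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not part of Slurm or NHC.

# jitter_seconds [MAX]: print a pseudo-random integer between 1 and MAX.
jitter_seconds() {
    local max=${1:-10}
    echo $(( (RANDOM % max) + 1 ))
}

# run_with_jitter MAX CMD [ARGS...]: sleep for a random delay, then run CMD.
run_with_jitter() {
    local max=$1
    shift
    sleep "$(jitter_seconds "$max")"
    "$@"
}

# In a real wrapper the last line would be something like:
# run_with_jitter 10 /usr/sbin/nhc "$@"
```

HealthCheckProgram in slurm.conf would then point at the wrapper rather than at nhc directly. Keep the maximum delay small: as far as I know, the slurm.conf man page says the health check program is killed if it does not finish within 60 seconds, so the delay plus the NHC run time has to fit inside that window.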

[slurm-users] Set a random offset when starting node health check in SLURM

2020-11-26 Thread SJTU
Hi, We use HealthCheckProgram = /usr/sbin/nhc in Slurm to check node health every 600 seconds. However, some NHC checks point to the same central resource, so starting these checks simultaneously may lead to false alarms of service degradation. Is it possible to set a random offset to when

Re: [slurm-users] slurm-users Digest, Vol 37, Issue 46

2020-11-26 Thread vero chaul

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs
One last shot on the firewall front, Steve -- does the control node have a firewall enabled? I've seen cases where that can cause the sporadic messaging failures that you seem to be seeing. Failing that, I'll defer to anyone with different ideas! Andy On 11/26/2020 1:01 PM, Steve Bland wrote:

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Steve Bland
Thanks, Andy. The firewall is off on all three systems. Also, if they could not communicate, I do not think 'scontrol show node' would return the data that it does, and the logs would not show responses as indicated below. The names are also correct; I used the recommended 'hostname -s' when configurin

Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs
1. Look for a firewall on all of your Slurm nodes -- firewalls almost always break Slurm communications. 2. Confirm that "ssh srvgridslurm01 hostname" returns exactly "srvgridslurm01". Andy On 11/26/2020 12:21 PM, Steve Bland wrote: Sinfo always returns nodes not responding [root@srvgridslurm03 ~
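The two checks above could be scripted roughly as follows. This is a sketch under assumptions the thread does not confirm: passwordless ssh from the machine running it, and firewalld managed by systemd as the firewall in use. The helper names name_matches and check_node are hypothetical.

```shell
#!/bin/bash
# Sketch only: helper names are hypothetical, not part of Slurm.

# name_matches EXPECTED ACTUAL: succeed only on an exact, case-sensitive match.
name_matches() {
    [[ "$1" == "$2" ]]
}

# check_node NODE: compare the name a node reports with the expected name,
# then report the firewalld state (assumes firewalld under systemd).
check_node() {
    local node=$1 reported fw
    reported=$(ssh "$node" hostname) || { echo "$node: ssh failed"; return 1; }
    if name_matches "$node" "$reported"; then
        echo "$node: hostname OK"
    else
        echo "$node: hostname mismatch (got '$reported')"
    fi
    fw=$(ssh "$node" systemctl is-active firewalld)
    echo "$node: firewalld is ${fw:-unknown}"
}

# Example, using the node name from the message above
# (add the remaining nodes to the list):
# for n in srvgridslurm01; do check_node "$n"; done
```

The comparison is deliberately case-sensitive, since the check asks for the name to come back exactly as written.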

[slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Steve Bland
sinfo always returns nodes not responding:

[root@srvgridslurm03 ~]# sinfo -R
REASON          USER   TIMESTAMP            NODELIST
Not responding  slurm  2020-11-26T09:12:58  SRVGRIDSLURM01
Not responding  slurm  2020-11-26T08:27:58  SRVGRIDSLURM02
Not responding  slurm