Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Chris Samuel
On 26/11/20 9:21 am, Steve Bland wrote:
> Sinfo always returns nodes not responding

One thing - do the nodes return to this state when you resume them with "scontrol update node=srvgridslurm[01-03] state=resume"? If they do, then what do your slurmctld logs say is the reason for this? You
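The resume-and-check sequence suggested above could be scripted roughly as follows. This is a minimal sketch, assuming the Slurm client tools are on the PATH of the host where it runs; the node range is the one from the thread, while the `show_reasons` helper and the 30-second wait are hypothetical choices for illustration, not anything from the original posts.

```shell
# Node range from the thread; adjust for your own cluster.
NODES='srvgridslurm[01-03]'

# Hypothetical helper: print the name, state and recorded reason for each node.
show_reasons() {
    scontrol show node "$1" | grep -E 'NodeName=|State=|Reason='
}

if command -v scontrol >/dev/null 2>&1; then
    # Try to bring the nodes back, as suggested in the reply.
    scontrol update node="$NODES" state=resume
    # Arbitrary pause so slurmctld has time to ping the nodes again.
    sleep 30
    show_reasons "$NODES"
    # List the reasons slurmctld recorded for down or drained nodes.
    sinfo -R
else
    echo "scontrol not found; run this on a host with the Slurm client tools" >&2
fi
```

If the nodes fall back to a non-responding state, the recorded reason and the slurmctld log entries around that time are the next place to look.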

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Andy Riebs
Steve, you've exhausted my best ideas... hope someone else can jump in!

Andy

On Fri, Nov 27, 2020, 11:19 AM Steve Bland wrote:
> Andy
>
> I appreciate you making me check again, things do get missed
>
> SELinux is off, firewalld is disabled
>
> [root@SRVGRIDSLURM01 ~]# sestatus
>
> SELinux st

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Steve Bland
Andy

I appreciate you making me check again, things do get missed

SELinux is off, firewalld is disabled

[root@SRVGRIDSLURM01 ~]# sestatus
SELinux status: disabled
[root@SRVGRIDSLURM01 ~]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon

Re: [slurm-users] Set a random offset when starting node health check in SLURM

2020-11-27 Thread Bjørn-Helge Mevik
You can also check out HealthCheckNodeState=CYCLE

man slurm.conf:
"Rather than running the health check program on all nodes at the same time, cycle through running on all compute nodes through the course of the HealthCheckInterval. May be combined with the various node state options."

-- Chee
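As a rough illustration of the CYCLE option quoted above, the sketch below writes an example set of health check settings to a scratch file for review. The program path, the 300-second interval and the scratch file name are all assumptions chosen for illustration; only the `HealthCheckNodeState=CYCLE` setting comes from the message itself.

```shell
# Scratch file for review only; merge into the real slurm.conf by hand.
FRAGMENT=healthcheck.conf.example

cat > "$FRAGMENT" <<'EOF'
# Placeholder path for whatever health check script the site uses.
HealthCheckProgram=/usr/sbin/nhc
# Seconds between runs; with CYCLE the nodes are spread across this window.
HealthCheckInterval=300
# ANY = nodes in any state; CYCLE = stagger runs instead of starting all at once.
HealthCheckNodeState=ANY,CYCLE
EOF

# Once slurm.conf is updated and "scontrol reconfigure" has been run,
# the active values can be inspected (skipped if scontrol is not installed).
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | grep -i '^HealthCheck'
fi
```

Combining CYCLE with a state option such as ANY follows the man page's note that it may be combined with the node state options.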