This usually means you updated slurm.conf but have not run "scontrol reconfigure" yet.
Brian Andrus
On 2/10/2020 8:55 AM, Robert Kudyba wrote:
We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
We're getting the errors below when I restart the slurmctld service.
The slurm.conf file appears to be the same on the head node and the compute nodes:
[root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
-rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
[root@ourcluster ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf /etc/slurm/slurm.conf
-rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf
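Comparing checksums rather than file sizes should show whether the contents really match; something like the following (pdsh and the node range are my assumption):

md5sum /cm/shared/apps/slurm/var/etc/slurm.conf       # copy on the head node
pdsh -w node00[1-3] md5sum /etc/slurm/slurm.conf      # file each slurmd actually reads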
So what else could be causing this?
[2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:31:12.009] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.009] error: Node node001 has low real_memory size (191846 < 196489092)
[2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2020-02-10T10:31:12.011] error: Node node002 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.011] error: Node node002 has low real_memory size (191840 < 196489092)
[2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-02-10T10:31:12.047] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.047] error: Node node003 has low real_memory size (191840 < 196489092)
[2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
[2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
[2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2020-02-10T10:32:08.026] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2020-02-10T10:56:08.992] layouts: no layout to initialize
[2020-02-10T10:56:08.992] restoring original state of nodes
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not specified
[2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
[2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed usec=4369
[2020-02-10T10:56:11.253] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
[2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
[2020-02-10T10:56:18.645] got (nil)
[2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
[2020-02-10T10:56:18.679] got (nil)
[2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
[2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
[2020-02-10T10:56:18.693] got (nil)
[2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
[2020-02-10T10:56:18.711] got (nil)
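(A side note on the "low real_memory" errors above: slurmd reports memory in megabytes, and 196489092 looks like the kilobyte figure from /proc/meminfo, so I suspect the RealMemory value in our slurm.conf is in the wrong unit. If so, a NodeName line based on what "slurmd -C" prints on the nodes, roughly like the sketch below, should clear those errors; the exact values are a guess.)

NodeName=node[001-003] RealMemory=191840 State=UNKNOWN   # slurm.conf: RealMemory is in MB, at or below what slurmd -C reports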
And I'm not sure if this is related, but we're also seeing "Kill task failed" errors, after which a node gets drained.
[2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on node(s)=node001: Kill task failed
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:42:06.006] email msg to ouru...@ourdomain.edu: SLURM Job_id=1465 Name=run.sh Failed, Run time 00:02:23, NODE_FAIL, ExitCode 0
[2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 State=0x8000 NodeCnt=1 per user/system request
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 NodeCnt=1 done
[2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
[2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq on node003
[2020-02-09T14:43:17.054] prolog_running_decr: Configuration for JobID=1466 is complete
[2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu: SLURM Job_id=1461 Name=run.sh Began, Queued time 00:02:14
[2020-02-09T14:44:16.309] backfill: Started JobID=1461 in defq on node003
[2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu: SLURM Job_id=1465 Name=run.sh Began, Queued time 00:02:10
[2020-02-09T14:44:16.309] backfill: Started JobID=1465 in defq on node003
[2020-02-09T14:44:16.850] prolog_running_decr: Configuration for JobID=1461 is complete
[2020-02-09T14:44:17.040] prolog_running_decr: Configuration for JobID=1465 is complete
[2020-02-09T14:44:27.016] error: slurmd error running JobId=1466 on node(s)=node003: Kill task failed
[2020-02-09T14:44:27.016] drain_nodes: node node003 state set to DRAIN
[2020-02-09T14:44:27.016] _job_complete: JobID=1466 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:44:27.016] _job_complete: requeue JobID=1466 State=0x8000 NodeCnt=1 per user/system request
[2020-02-09T14:44:27.017] _job_complete: JobID=1466 State=0x8000 NodeCnt=1 done
[2020-02-09T14:44:27.057] Requeuing JobID=1466 State=0x0 NodeCnt=0
[2020-02-09T14:44:27.081] update_node: node node003 reason set to: Kill task failed
[2020-02-09T14:44:27.082] update_node: node node003 state set to DRAINING
[2020-02-09T14:44:27.082] got (nil)
[2020-02-09T14:45:33.098] _job_complete: JobID=1461 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:45:33.098] email msg to ouru...@ourdomain.edu: SLURM Job_id=1461 Name=run.sh Failed, Run time 00:01:17, FAILED, ExitCode 1
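In the meantime I assume the drain can be cleared manually with something like the first line below, and I'm wondering whether raising UnkillableStepTimeout in slurm.conf (the default is 60 seconds) would avoid the "Kill task failed" drains in the first place; the value below is just a guess:

scontrol update NodeName=node003 State=RESUME   # clear the DRAIN left by the kill-task failure
UnkillableStepTimeout=180                       # slurm.conf: give slurmd longer to kill stuck tasks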
Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?