This usually means you updated slurm.conf but have not run "scontrol reconfigure" yet.
Brian Andrus
On 2/10/2020 8:55 AM, Robert Kudyba wrote:
We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
We're getting the errors below when I restart the slurmctld service.
The slurm.conf file appears to be the same on the head node and the compute nodes:
[root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
-rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
[root@ourcluster ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf /etc/slurm/slurm.conf
-rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf
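Comparing checksums rather than file sizes should show whether the contents really match; something like the following (pdsh and the node range are my assumption):

md5sum /cm/shared/apps/slurm/var/etc/slurm.conf       # copy on the head node
pdsh -w node00[1-3] md5sum /etc/slurm/slurm.conf      # file each slurmd actually reads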
So what else could be causing this?
[2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:31:12.009] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.009] error: Node node001 has low real_memory size (191846 < 196489092)
[2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2020-02-10T10:31:12.011] error: Node node002 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.011] error: Node node002 has low real_memory size (191840 < 196489092)
[2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-02-10T10:31:12.047] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-02-10T10:31:12.047] error: Node node003 has low real_memory size (191840 < 196489092)
[2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
[2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
[2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2020-02-10T10:32:08.026] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2020-02-10T10:56:08.992] layouts: no layout to initialize
[2020-02-10T10:56:08.992] restoring original state of nodes
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not specified
[2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
[2020-02-10T10:56:08.992] cons_res: select_p_node_init
[2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
[2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
[2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
[2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed usec=4369
[2020-02-10T10:56:11.253] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
[2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
[2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
[2020-02-10T10:56:18.645] got (nil)
[2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
[2020-02-10T10:56:18.679] got (nil)
[2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
[2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
[2020-02-10T10:56:18.693] got (nil)
[2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
[2020-02-10T10:56:18.711] got (nil)
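(A side note on the "low real_memory" errors above: slurmd reports memory in megabytes, and 196489092 looks like the kilobyte figure from /proc/meminfo, so I suspect the RealMemory value in our slurm.conf is in the wrong unit. If so, a NodeName line based on what "slurmd -C" prints on the nodes, roughly like the sketch below, should clear those errors; the exact values are a guess.)

NodeName=node[001-003] RealMemory=191840 State=UNKNOWN   # slurm.conf: RealMemory is in MB, at or below what slurmd -C reports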
And I'm not sure if this is related, but we're also seeing "Kill task failed" errors, after which a node gets drained.
[2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on node(s)=node001: Kill task failed
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:42:06.006] email msg to ouru...@ourdomain.edu: SLURM Job_id=1465 Name=run.sh Failed, Run time 00:02:23, NODE_FAIL, ExitCode 0
[2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 State=0x8000 NodeCnt=1 per user/system request
[2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 NodeCnt=1 done
[2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
[2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq on node003
[2020-02-09T14:43:17.054] prolog_running_decr: Configuration for JobID=1466 is complete
[2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu: SLURM Job_id=1461 Name=run.sh Began, Queued time 00:02:14
[2020-02-09T14:44:16.309] backfill: Started JobID=1461 in defq on node003
[2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu: SLURM Job_id=1465 Name=run.sh Began, Queued time 00:02:10
[2020-02-09T14:44:16.309] backfill: Started JobID=1465 in defq on node003
[2020-02-09T14:44:16.850] prolog_running_decr: Configuration for JobID=1461 is complete
[2020-02-09T14:44:17.040] prolog_running_decr: Configuration for JobID=1465 is complete
[2020-02-09T14:44:27.016] error: slurmd error running JobId=1466 on node(s)=node003: Kill task failed
[2020-02-09T14:44:27.016] drain_nodes: node node003 state set to DRAIN
[2020-02-09T14:44:27.016] _job_complete: JobID=1466 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:44:27.016] _job_complete: requeue JobID=1466 State=0x8000 NodeCnt=1 per user/system request
[2020-02-09T14:44:27.017] _job_complete: JobID=1466 State=0x8000 NodeCnt=1 done
[2020-02-09T14:44:27.057] Requeuing JobID=1466 State=0x0 NodeCnt=0
[2020-02-09T14:44:27.081] update_node: node node003 reason set to: Kill task failed
[2020-02-09T14:44:27.082] update_node: node node003 state set to DRAINING
[2020-02-09T14:44:27.082] got (nil)
[2020-02-09T14:45:33.098] _job_complete: JobID=1461 State=0x1 NodeCnt=1 WEXITSTATUS 1
[2020-02-09T14:45:33.098] email msg to ouru...@ourdomain.edu: SLURM Job_id=1461 Name=run.sh Failed, Run time 00:01:17, FAILED, ExitCode 1
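In the meantime I assume the drain can be cleared manually with something like the first line below, and I'm wondering whether raising UnkillableStepTimeout in slurm.conf (the default is 60 seconds) would avoid the "Kill task failed" drains in the first place; the value below is just a guess:

scontrol update NodeName=node003 State=RESUME   # clear the DRAIN left by the kill-task failure
UnkillableStepTimeout=180                       # slurm.conf: give slurmd longer to kill stuck tasks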
Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?