Hey Robert,

Ask Bright support; they will help you figure out what is going on there.
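One thing worth double-checking while you wait, though this is a generic Slurm guess rather than anything Bright-specific: "Kill task failed" drains usually mean slurmd could not kill all of a job step's processes within UnkillableStepTimeout (60 seconds by default, if I remember right). Raising it in slurm.conf, e.g.

    UnkillableStepTimeout=120

followed by "scontrol reconfigure", would at least tell you whether this is just slow cleanup or a genuinely stuck process that needs investigating on the node.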
Best regards,
Taras

On Tue, Feb 11, 2020 at 8:26 PM Robert Kudyba <rkud...@fordham.edu> wrote:

> This is still happening. Nodes are being drained after a kill task failed.
> Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?
>
> [2020-02-11T12:21:26.005] update_node: node node001 reason set to: Kill task failed
> [2020-02-11T12:21:26.006] update_node: node node001 state set to DRAINING
> [2020-02-11T12:21:26.006] got (nil)
> [2020-02-11T12:21:26.015] error: slurmd error running JobId=1514 on node(s)=node001: Kill task failed
> [2020-02-11T12:21:26.015] _job_complete: JobID=1514 State=0x1 NodeCnt=1 WEXITSTATUS 1
> [2020-02-11T12:21:26.015] email msg to sli...@fordham.edu: SLURM Job_id=1514 Name=run.sh Failed, Run time 00:02:21, NODE_FAIL, ExitCode 0
> [2020-02-11T12:21:26.016] _job_complete: requeue JobID=1514 State=0x8000 NodeCnt=1 per user/system request
> [2020-02-11T12:21:26.016] _job_complete: JobID=1514 State=0x8000 NodeCnt=1 done
> [2020-02-11T12:21:26.057] Requeuing JobID=1514 State=0x0 NodeCnt=0
> [2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x1 NodeCnt=1 WEXITSTATUS 0
> [2020-02-11T12:21:46.985] _job_complete: JobID=1511 State=0x8003 NodeCnt=1 done
> [2020-02-11T12:21:52.111] _job_complete: JobID=1512 State=0x1 NodeCnt=1 WEXITSTATUS 0
> [2020-02-11T12:21:52.112] _job_complete: JobID=1512 State=0x8003 NodeCnt=1 done
> [2020-02-11T12:21:52.214] sched: Allocate JobID=1516 NodeList=node002 #CPUs=1 Partition=defq
> [2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x1 NodeCnt=1 WEXITSTATUS 0
> [2020-02-11T12:21:52.483] _job_complete: JobID=1513 State=0x8003 NodeCnt=1 done
>
> On Tue, Feb 11, 2020 at 11:54 AM Robert Kudyba <rkud...@fordham.edu> wrote:
>
>>> Usually means you updated the slurm.conf but have not done "scontrol
>>> reconfigure" yet.
>>
>> Well it turns out it was something else related to a Bright Computing
>> setting. In case anyone finds this thread in the future:
>>
>> [ourcluster->category[gpucategory]->roles]% use slurmclient
>> [ourcluster->category[gpucategory]->roles[slurmclient]]% show
>> ...
>> RealMemory 196489092
>> ...
>> [ciscluster->category[gpucategory]->roles[slurmclient]]%
>>
>> Values are specified in MB, so this line is saying that our node has
>> 196 TB of RAM.
>>
>> I set this using cmsh:
>>
>> # cmsh
>> % category
>> % use gpucategory
>> % roles
>> % use slurmclient
>> % set realmemory 191846
>> % commit
>>
>> The value in /etc/slurm/slurm.conf was conflicting with this, especially
>> when restarting slurmctld.
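>>
>> Side note in case anyone else needs to find the right value for their own
>> nodes: "slurmd -C" run on the compute node prints the node's configuration
>> line with the RealMemory it detects (in MB), which you can feed straight to
>> cmsh. Something like (output trimmed, numbers will obviously differ per node):
>>
>> [root@node001 ~]# slurmd -C
>> NodeName=node001 ... RealMemory=191846 ...
>>
>> "free -m" is a rough cross-check.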
>>
>> On 2/10/2020 8:55 AM, Robert Kudyba wrote:
>>>
>>> We are using Bright Cluster 8.1 and just upgraded to slurm-17.11.12.
>>>
>>> We're getting the below errors when I restart the slurmctld service. The
>>> file appears to be the same on the head node and compute nodes:
>>>
>>> [root@node001 ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf
>>> -rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> [root@ourcluster ~]# ls -l /cm/shared/apps/slurm/var/etc/slurm.conf /etc/slurm/slurm.conf
>>> -rw-r--r-- 1 root root 3477 Feb 10 11:05 /cm/shared/apps/slurm/var/etc/slurm.conf
>>> lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf
>>>
>>> So what else could be causing this?
>>>
>>> [2020-02-10T10:31:08.987] mcs: MCSParameters = (null). ondemand set.
>>> [2020-02-10T10:31:12.009] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.009] error: Node node001 has low real_memory size (191846 < 196489092)
>>> [2020-02-10T10:31:12.009] error: _slurm_rpc_node_registration node=node001: Invalid argument
>>> [2020-02-10T10:31:12.011] error: Node node002 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.011] error: Node node002 has low real_memory size (191840 < 196489092)
>>> [2020-02-10T10:31:12.011] error: _slurm_rpc_node_registration node=node002: Invalid argument
>>> [2020-02-10T10:31:12.047] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
>>> [2020-02-10T10:31:12.047] error: Node node003 has low real_memory size (191840 < 196489092)
>>> [2020-02-10T10:31:12.047] error: Setting node node003 state to DRAIN
>>> [2020-02-10T10:31:12.047] drain_nodes: node node003 state set to DRAIN
>>> [2020-02-10T10:31:12.047] error: _slurm_rpc_node_registration node=node003: Invalid argument
>>> [2020-02-10T10:32:08.026] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>> [2020-02-10T10:56:08.988] Processing RPC: REQUEST_RECONFIGURE from uid=0
>>> [2020-02-10T10:56:08.992] layouts: no layout to initialize
>>> [2020-02-10T10:56:08.992] restoring original state of nodes
>>> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
>>> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
>>> [2020-02-10T10:56:08.992] _preserve_plugins: backup_controller not specified
>>> [2020-02-10T10:56:08.992] cons_res: select_p_reconfigure
>>> [2020-02-10T10:56:08.992] cons_res: select_p_node_init
>>> [2020-02-10T10:56:08.992] cons_res: preparing for 2 partitions
>>> [2020-02-10T10:56:08.992] No parameter for mcs plugin, default values set
>>> [2020-02-10T10:56:08.992] mcs: MCSParameters = (null). ondemand set.
>>> [2020-02-10T10:56:08.992] _slurm_rpc_reconfigure_controller: completed usec=4369
>>> [2020-02-10T10:56:11.253] SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>> [2020-02-10T10:56:18.645] update_node: node node001 reason set to: hung
>>> [2020-02-10T10:56:18.645] update_node: node node001 state set to DOWN
>>> [2020-02-10T10:56:18.645] got (nil)
>>> [2020-02-10T10:56:18.679] update_node: node node001 state set to IDLE
>>> [2020-02-10T10:56:18.679] got (nil)
>>> [2020-02-10T10:56:18.693] update_node: node node002 reason set to: hung
>>> [2020-02-10T10:56:18.693] update_node: node node002 state set to DOWN
>>> [2020-02-10T10:56:18.693] got (nil)
>>> [2020-02-10T10:56:18.711] update_node: node node002 state set to IDLE
>>> [2020-02-10T10:56:18.711] got (nil)
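>>>
>>> Regarding the "appears to have a different slurm.conf" errors above: I assume
>>> comparing checksums on the head node and a compute node would settle whether
>>> the files really differ, e.g.
>>>
>>> md5sum /cm/shared/apps/slurm/var/etc/slurm.conf
>>> ssh node001 md5sum /etc/slurm/slurm.conf
>>>
>>> and since /etc/slurm/slurm.conf is just a symlink into /cm/shared, identical
>>> hashes would suggest slurmd on the nodes simply has not re-read the file since
>>> the last edit (restart slurmd or run "scontrol reconfigure").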
>>>
>>> And not sure if this is related, but we're getting this "Kill task failed"
>>> and a node gets drained.
>>>
>>> [2020-02-09T14:42:06.006] error: slurmd error running JobId=1465 on node(s)=node001: Kill task failed
>>> [2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x1 NodeCnt=1 WEXITSTATUS 1
>>> [2020-02-09T14:42:06.006] email msg to ouru...@ourdomain.edu: SLURM Job_id=1465 Name=run.sh Failed, Run time 00:02:23, NODE_FAIL, ExitCode 0
>>> [2020-02-09T14:42:06.006] _job_complete: requeue JobID=1465 State=0x8000 NodeCnt=1 per user/system request
>>> [2020-02-09T14:42:06.006] _job_complete: JobID=1465 State=0x8000 NodeCnt=1 done
>>> [2020-02-09T14:42:06.017] Requeuing JobID=1465 State=0x0 NodeCnt=0
>>> [2020-02-09T14:43:16.308] backfill: Started JobID=1466 in defq on node003
>>> [2020-02-09T14:43:17.054] prolog_running_decr: Configuration for JobID=1466 is complete
>>> [2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu: SLURM Job_id=1461 Name=run.sh Began, Queued time 00:02:14
>>> [2020-02-09T14:44:16.309] backfill: Started JobID=1461 in defq on node003
>>> [2020-02-09T14:44:16.309] email msg to ouru...@ourdomain.edu: SLURM Job_id=1465 Name=run.sh Began, Queued time 00:02:10
>>> [2020-02-09T14:44:16.309] backfill: Started JobID=1465 in defq on node003
>>> [2020-02-09T14:44:16.850] prolog_running_decr: Configuration for JobID=1461 is complete
>>> [2020-02-09T14:44:17.040] prolog_running_decr: Configuration for JobID=1465 is complete
>>> [2020-02-09T14:44:27.016] error: slurmd error running JobId=1466 on node(s)=node003: Kill task failed
>>> [2020-02-09T14:44:27.016] drain_nodes: node node003 state set to DRAIN
>>> [2020-02-09T14:44:27.016] _job_complete: JobID=1466 State=0x1 NodeCnt=1 WEXITSTATUS 1
>>> [2020-02-09T14:44:27.016] _job_complete: requeue JobID=1466 State=0x8000 NodeCnt=1 per user/system request
>>> [2020-02-09T14:44:27.017] _job_complete: JobID=1466 State=0x8000 NodeCnt=1 done
>>> [2020-02-09T14:44:27.057] Requeuing JobID=1466 State=0x0 NodeCnt=0
>>> [2020-02-09T14:44:27.081] update_node: node node003 reason set to: Kill task failed
>>> [2020-02-09T14:44:27.082] update_node: node node003 state set to DRAINING
>>> [2020-02-09T14:44:27.082] got (nil)
>>> [2020-02-09T14:45:33.098] _job_complete: JobID=1461 State=0x1 NodeCnt=1 WEXITSTATUS 1
>>> [2020-02-09T14:45:33.098] email msg to ouru...@ourdomain.edu: SLURM Job_id=1461 Name=run.sh Failed, Run time 00:01:17, FAILED, ExitCode 1
>>>
>>> Could this be related to https://bugs.schedmd.com/show_bug.cgi?id=6307?
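>>>
>>> In the meantime I assume the drained nodes can be put back in service with
>>> something like
>>>
>>> scontrol update NodeName=node003 State=RESUME
>>>
>>> once whatever is left of the job is gone, and that the slurmd log on the node
>>> itself (wherever SlurmdLogFile points) is the place to look for which process
>>> could not be killed.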