I'm new to Slurm. We have a cluster with three compute nodes plus a head node, running CentOS 7 and Bright Cluster Manager 8.1. Bright support sent me here; they say Slurm is configured optimally to allow multiple jobs to run, yet at times one job will hold up new jobs. Are there any other logs I can look at, or settings I can change, to prevent this or at least alert me when it is happening? Below are some tests and commands that I hope will show where I may be going wrong.

The slurm.conf file has these options set:

```
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
SchedulerTimeSlice=60
```
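In case it is useful, I understand the same settings (and the log file locations) can also be confirmed against the running controller rather than the file, e.g.:

```
# Confirm the selection/scheduling settings the controller is actually using
scontrol show config | grep -E 'SelectType|SchedulerTimeSlice'

# Show where slurmctld and slurmd write their logs
scontrol show config | grep -i 'LogFile'
```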
I also see that /var/log/slurmctld is loaded with errors like these:

```
[2019-07-03T02:21:30.913] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T02:54:50.655] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T02:54:50.655] error: Node node001 has low real_memory size (191883 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2019-07-03T02:54:50.655] error: Node node003 has low real_memory size (191879 < 196489092)
[2019-07-03T02:54:50.655] error: _slurm_rpc_node_registration node=node003: Invalid argument
[2019-07-03T03:28:10.293] error: Node node002 has low real_memory size (191879 < 196489092)
[2019-07-03T03:28:10.293] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2019-07-03T03:28:10.293] error: Node node003 has low real_memory size (191879 < 196489092)
```

Here is a job that is stuck pending:

```
squeue
JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
  352      defq TensorFl myuser PD  0:00     3 (Resources)
```

```
scontrol show jobid -dd 352
JobId=352 JobName=TensorFlowGPUTest
   UserId=myuser(1001) GroupId=myuser(1001) MCS_label=N/A
   Priority=4294901741 Nice=0 Account=(null) QOS=normal
   JobState=PENDING Reason=Resources Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2019-07-02T16:57:11 EligibleTime=2019-07-02T16:57:11
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2019-07-02T16:57:59
   Partition=defq AllocNode:Sid=ourcluster:386851
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=3-3 NumCPUs=3 NumTasks=3 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=3,node=3
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=gpu:1 Reservation=(null)
   OverSubscribe=YES Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/myuser/cnn_gpu.sh
   WorkDir=/home/myuser
   StdErr=/home/myuser/slurm-352.out
   StdIn=/dev/null
   StdOut=/home/myuser/slurm-352.out
   Power=
```

Another test showed the below:

```
sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain
```

```
sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2019-05-17T10:05:26 node[001-003]
```

From Bright's cmsh, the job queue settings are:

```
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
------------ ------------------------ ----------------------------------------------------
Slurm        defq                     node001..node003
Slurm        gpuq
[ourcluster->jobqueue(slurm)]% use defq
[ourcluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
```

The nodes themselves report 24 physical cores each:

```
pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node003: Thread(s) per core:    1
node003: Core(s) per socket:    12
node003: Socket(s):             2
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
```
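Putting the lscpu numbers together with the node record shown below, I assume the node definitions Bright generated in slurm.conf look roughly like this (I have not hand-edited them; the RealMemory figure is the one that appears in the registration errors above):

```
# My guess at the generated node lines, based on the values Slurm reports for the nodes
NodeName=node[001-003] Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=196489092 Gres=gpu:1
```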
```
scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=184912 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-06-28T15:33:47 SlurmdStartTime=2019-06-28T15:35:17
   CfgTRES=cpu=24,mem=196489092M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s
   ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
```

```
sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      3  drain node[001-003]
gpuq         up   infinite      0    n/a
```

```
scontrol show nodes | grep -i mem
   RealMemory=196489092 AllocMem=0 FreeMem=184907 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
   RealMemory=196489092 AllocMem=0 FreeMem=185084 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
   RealMemory=196489092 AllocMem=0 FreeMem=188720 Sockets=2 Boards=1
   CfgTRES=cpu=24,mem=196489092M,billing=24
   Reason=Low RealMemory [slurm@2019-05-17T10:05:26]
```
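Since every node is drained with "Low RealMemory", my plan (unless someone points me somewhere better) is to compare what slurmd itself detects on each node with what the controller has configured, and to clear the drain flag once the two agree. As far as I understand the tooling, that would be something like:

```
# On a compute node: print the configuration slurmd detects at startup,
# including the RealMemory value (in MB) it registers with the controller
slurmd -C

# On the head node: the RealMemory value the controller expects for that node
scontrol show node node001 | grep RealMemory

# Once the configured value matches what slurmd reports, undrain the nodes
scontrol update NodeName=node[001-003] State=RESUME
```

Does that sound like the right direction, and is there anything else I should be watching to catch this earlier?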