We are on a Bright Cluster and their support says the head node controls this. Here you can see the symlinks:
[root@node001 ~]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to `/cm/shared/apps/slurm/var/etc/slurm.conf'

[root@ourcluster myuser]# file /etc/slurm/slurm.conf
/etc/slurm/slurm.conf: symbolic link to `/cm/shared/apps/slurm/var/etc/slurm.conf'
[root@ourcluster myuser]# ls -l /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf
[root@ourcluster myuser]# ssh node001
Last login: Mon Jan 20 14:02:00 2020
[root@node001 ~]# ls -l /etc/slurm/slurm.conf
lrwxrwxrwx 1 root root 40 Nov 30 2018 /etc/slurm/slurm.conf -> /cm/shared/apps/slurm/var/etc/slurm.conf

On Mon, Jan 20, 2020 at 1:52 PM Brian Andrus <toomuc...@gmail.com> wrote:

> Try using "nodename=node003" in the slurm.conf on your nodes.
>
> Also, make sure the slurm.conf on the nodes is the same as on the head.
>
> Somewhere in there, you have "node=node003" (as well as the other node
> names).
>
> That may even do it, as they may be trying to register generically, so
> their configs are not getting matched to the specific info in your main
> config.
>
> Brian Andrus
>
> On 1/20/2020 10:37 AM, Robert Kudyba wrote:
>
> I've posted about this previously here
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23-21searchin_slurm-2Dusers_kudyba-257Csort-3Adate_slurm-2Dusers_mMECjerUmFE_V1wK19fFAQAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=V4tz7Qab3oK28vrC090A6R6aFEaDXz7Czqr5y2eDUk0&e=>
> and here
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_forum_-23-21searchin_slurm-2Dusers_kudyba-257Csort-3Adate_slurm-2Dusers_vVAyqm0wg3Y_2YoBq744AAAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=eEetgW964TvhYChxX27f_Bjz3tn5UlwUpVEVAZIdIKo&e=>,
> so I'm trying to get to the bottom of this once and for all, and even got
> this comment
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__groups.google.com_d_msg_slurm-2Dusers_vVAyqm0wg3Y_x9-2D-5FiQQaBwAJ&d=DwMDaQ&c=aqMfXOEvEJQh2iQMCb7Wy8l0sPnURkcqADc2guUW8IM&r=X0jL9y0sL4r4iU_qVtR3lLNo4tOL1ry_m7-psV3GejY&m=536v1kqVHYCPVjdMowh4_kfCXSihJp1LwoDKM8FWu08&s=5UB2Ohj42gVpQ0GXneP02dO3kpRATj5OvQ4nmNTWZd4&e=>
> previously:
>
>> Our problem here is that the configuration for the nodes in question has
>> an incorrect amount of memory set for them. Looks like you have it set
>> in bytes instead of megabytes.
>> In your slurm.conf you should look at the RealMemory setting:
>>   RealMemory
>>     Size of real memory on the node in megabytes (e.g. "2048"). The
>>     default value is 1.
>> I would suggest RealMemory=191879, where I suspect you have
>> RealMemory=196489092.
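As a cross-check on the numbers, slurmd itself can report what it detects on a node; with the -C flag it prints the node's hardware in slurm.conf syntax, with RealMemory already in megabytes. The two commands below are only a sketch along those lines (adjust the path to slurmd if it is not in root's PATH on the compute nodes):

pdsh -w node00[1-3] "slurmd -C"            # NodeName=... RealMemory=<MB> as slurmd sees it
pdsh -w node00[1-3] "free -m | grep Mem:"  # physical memory in MB, for comparison with slurm.conf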
> Now the slurmctld logs show this:
>
> [2020-01-20T13:22:48.256] error: Node node002 has low real_memory size (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node002: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node001 has low real_memory size (191846 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node001: Invalid argument
> [2020-01-20T13:22:48.256] error: Node node003 has low real_memory size (191840 < 196489092)
> [2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
> [2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
> [2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node003: Invalid argument
>
> Here's the setting in slurm.conf:
>
> /etc/slurm/slurm.conf
> # Nodes
> NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
> # Partitions
> PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Preempt$
> PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptM$
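If the quoted advice is right and that value is bytes rather than megabytes, the only thing on the NodeName line that should need to change is the RealMemory figure. A sketch of the corrected entry is below, using a value no larger than the smallest size the nodes actually report in the log above (191840); since /etc/slurm/slurm.conf is only a symlink here, the edit would have to land in the shared /cm/shared/apps/slurm/var/etc/slurm.conf (or wherever Bright expects the change to be made):

# RealMemory is in megabytes; 191840 MB is the lowest value the nodes reported at registration
NodeName=node[001-003] CoresPerSocket=12 RealMemory=191840 Sockets=2 Gres=gpu:1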
> sinfo -N
> NODELIST   NODES PARTITION STATE
> node001        1 defq*     drain
> node002        1 defq*     drain
> node003        1 defq*     drain
>
> [2020-01-20T12:50:51.034] error: Node node003 has low real_memory size (191840 < 196489092)
> [2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration node=node003: Invalid argument
>
> pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
> node001: Thread(s) per core:  1
> node001: Core(s) per socket:  12
> node001: Socket(s):           2
> node002: Thread(s) per core:  1
> node002: Core(s) per socket:  12
> node002: Socket(s):           2
> node003: Thread(s) per core:  2
> node003: Core(s) per socket:  12
> node003: Socket(s):           2
>
> module load cmsh
> [root@ciscluster kudyba]# cmsh
> [ciscluster]% jobqueue
> [ciscluster->jobqueue(slurm)]% ls
> Type         Name                     Nodes
> ------------ ------------------------ ------------------------------------
> Slurm        defq                     node001..node003
> Slurm        gpuq
>
> use defq
> [ciscluster->jobqueue(slurm)->defq]% get options
> QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP
>
> scontrol show nodes node001
> NodeName=node001 Arch=x86_64 CoresPerSocket=12
>    CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>    Gres=gpu:1
>    NodeAddr=node001 NodeHostName=node001 Version=17.11
>    OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
>    RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
>    State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>    Partitions=defq
>    BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
>    CfgTRES=cpu=24,mem=196489092M,billing=24
>    AllocTRES=
>    CapWatts=n/a
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>    ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>    Reason=Low RealMemory [slurm@2020-01-20T13:22:48]
>
> sinfo -R
> REASON              USER    TIMESTAMP            NODELIST
> Low RealMemory      slurm   2020-01-20T13:22:48  node[001-003]
>
> And the total memory in each node:
>
> ssh node001
> Last login: Mon Jan 20 13:34:00 2020
> [root@node001 ~]# free -h
>               total        used        free      shared  buff/cache   available
> Mem:           187G         69G         96G        4.0G         21G        112G
> Swap:           11G         11G         55M
>
> What setting is incorrect here?
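(Assuming the byte-vs-megabyte RealMemory value really is the culprit, I take it the daemons will need to re-read the config and the nodes will need an explicit resume once it is corrected, roughly along these lines:

scontrol reconfigure                                   # or restart slurmctld/slurmd if a reconfigure is not enough
scontrol update NodeName=node[001-003] State=RESUME    # clears the "Low RealMemory" drain

Corrections welcome if there is more to it than that.)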