Dear Robert,

On 1/20/20 7:37 PM, Robert Kudyba wrote:
I've posted about this previously here <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V1wK19fFAQAJ> and here <https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/vVAyqm0wg3Y/2YoBq744AAAJ>, so I'm trying to get to the bottom of this once and for all. I even got this comment <https://groups.google.com/d/msg/slurm-users/vVAyqm0wg3Y/x9-_iQQaBwAJ> previously:

    Your problem here is that the configuration for the nodes in
    question has an incorrect amount of memory set for them. It looks
    like you have it set in bytes instead of megabytes.
    In your slurm.conf you should look at the RealMemory setting:
    RealMemory
    Size of real memory on the node in megabytes (e.g. "2048"). The
    default value is 1.
    I would suggest RealMemory=191879, where I suspect you have
    RealMemory=196489092


Are you sure your 24-core nodes have 187 TERABYTES of memory?

As you yourself cited:
Size of real memory on the node in megabytes
The settings in your slurm.conf:
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
So, according to your configuration, your machines should have 196489092 megabytes of memory, which is ~191884 gigabytes or ~187 terabytes.
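
For reference, the arithmetic behind those figures (dividing by 1024 at each step):
196489092 MB / 1024 ≈ 191884 GB
191884 GB / 1024 ≈ 187.4 TB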

Slurm believes these machines do NOT have that much memory:
[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size (191840 < 196489092)
It sees only 191840 megabytes, which is still less than the 191884 figure above. Since the available memory changes slightly from OS version to OS version, I would suggest setting RealMemory to less than 191840, e.g. 191800.
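
If you want to double-check what slurmd itself detects on each node, you can ask it directly (a quick sanity check; this assumes slurmd is in root's PATH on the nodes and reuses the pdsh node list from your mail):

pdsh -w node00[1-3] "slurmd -C" | grep -i realmemory

slurmd -C prints the node's hardware configuration (CPUs, sockets, RealMemory, ...) exactly as the daemon reports it at registration, which is the number slurmctld compares against RealMemory in slurm.conf.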
But Brian already told you to reduce the RealMemory:
I would suggest RealMemory=191879, where I suspect you have RealMemory=196489092

If Slurm sees less memory on a node than the configured RealMemory, it drains the node, because a defective DIMM is assumed.
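
Concretely, the fix would look roughly like this (a sketch, assuming 191800 is low enough for all three nodes, and that slurm.conf is not regenerated by cmsh/Bright behind your back; if it is, make the change there instead):

# in /etc/slurm/slurm.conf
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=191800 Sockets=2 Gres=gpu:1

# then let the daemons pick up the change and clear the drain state
scontrol reconfigure   # or restart slurmctld and the slurmds if the node change is not picked up
scontrol update NodeName=node[001-003] State=RESUME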

Best
Marcus


Now the slurmctld logs show this:

[2020-01-20T13:22:48.256] error: Node node002 has low real_memory size (191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node002 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node002 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-01-20T13:22:48.256] error: Node node001 has low real_memory size (191846 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node001 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node001 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node001: Invalid argument
[2020-01-20T13:22:48.256] error: Node node003 has low real_memory size (191840 < 196489092)
[2020-01-20T13:22:48.256] error: Setting node node003 state to DRAIN
[2020-01-20T13:22:48.256] drain_nodes: node node003 state set to DRAIN
[2020-01-20T13:22:48.256] error: _slurm_rpc_node_registration node=node003: Invalid argument

Here's the setting in slurm.conf:
/etc/slurm/slurm.conf
# Nodes
NodeName=node[001-003]  CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
# Partitions
PartitionName=defq Default=YES MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 Preempt$
PartitionName=gpuq Default=NO MinNodes=1 AllowGroups=ALL PriorityJobFactor=1 PriorityTier=1 DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=NO GraceTime=0 PreemptM$

sinfo -N
NODELIST   NODES PARTITION STATE
node001        1     defq* drain
node002        1     defq* drain
node003        1     defq* drain

[2020-01-20T12:50:51.034] error: Node node003 has low real_memory size (191840 < 196489092)
[2020-01-20T12:50:51.034] error: _slurm_rpc_node_registration node=node003: Invalid argument

pdsh -w node00[1-3] "lscpu | grep -iE 'socket|core'"
node001: Thread(s) per core:    1
node001: Core(s) per socket:    12
node001: Socket(s):             2
node002: Thread(s) per core:    1
node002: Core(s) per socket:    12
node002: Socket(s):             2
node003: Thread(s) per core:    2
node003: Core(s) per socket:    12
node003: Socket(s):             2

module load cmsh
[root@ciscluster kudyba]# cmsh
[ciscluster]% jobqueue
[ciscluster->jobqueue(slurm)]% ls
Type         Name                     Nodes
------------ ------------------------ ----------------------------------------------------
Slurm        defq                     node001..node003
Slurm        gpuq

use defq
[ciscluster->jobqueue(slurm)->defq]% get options
QoS=N/A ExclusiveUser=NO OverSubscribe=FORCE:12 OverTimeLimit=0 State=UP

scontrol show nodes node001
NodeName=node001 Arch=x86_64 CoresPerSocket=12
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.07
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:1
   NodeAddr=node001 NodeHostName=node001 Version=17.11
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=196489092 AllocMem=0 FreeMem=98557 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=defq
   BootTime=2019-07-18T12:08:42 SlurmdStartTime=2020-01-17T21:34:15
   CfgTRES=cpu=24,mem=196489092M,billing=24
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [slurm@2020-01-20T13:22:48]

sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       slurm     2020-01-20T13:22:48 node[001-003]

And the total memory in each node:
ssh node001
Last login: Mon Jan 20 13:34:00 2020
[root@node001 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:           187G         69G         96G        4.0G   21G        112G
Swap:           11G         11G         55M
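
(For reference: the 191840 MB that slurmd reports is 191840 / 1024 ≈ 187.3 GB, which matches the 187G total that free -h shows above, so the hardware and slurmd agree; only the RealMemory value in slurm.conf is off.)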

What setting is incorrect here?

--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de
