Hello all,

I'm trying to turn off core specialization in my cluster by setting
CoreSpecCount=0, but scontrol does not show the change. If I set
CoreSpecCount=1 or CoreSpecCount=2, or anything except 0, the change is
applied correctly. But when I set it to 0, nothing changes -- the node keeps
whatever the previous value was.

With CoreSpecCount=1:

---------------------------------------
# scontrol show node node016
NodeName=node016 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=72 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node016 NodeHostName=node016
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=95306 AllocMem=0 FreeMem=N/A Sockets=2 Boards=1
   CoreSpecCount=1 CPUSpecList=70-71
   State=IDLE ThreadsPerCore=2 TmpDisk=2038 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=test
   BootTime=2019-06-19T08:41:49 SlurmdStartTime=2019-06-27T09:06:26
   CfgTRES=cpu=72,mem=95306M,billing=72
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
---------------------------------------

That is correct.

With CoreSpecCount=0:

---------------------------------------
# scontrol show node node016
NodeName=node016 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=72 CPULoad=0.01
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=node016 NodeHostName=node016
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   RealMemory=95306 AllocMem=0 FreeMem=92773 Sockets=2 Boards=1
   CoreSpecCount=1 CPUSpecList=70-71
   State=IDLE ThreadsPerCore=2 TmpDisk=2038 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=test
   BootTime=2019-06-19T08:41:49 SlurmdStartTime=2019-06-27T09:06:26
   CfgTRES=cpu=72,mem=95306M,billing=72
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
---------------------------------------

That is wrong. The core specialization fields are unchanged -- CoreSpecCount
still shows 1 and CPUSpecList still shows 70-71.
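
For anyone who wants to script the same check, here is a minimal Python
sketch that pulls CoreSpecCount out of "scontrol show node" output. The
function names are made up for illustration, the sample text is trimmed from
the output above, and the only assumption is that the fields are
whitespace-separated key=value pairs.

```python
# Sample text trimmed from the "scontrol show node node016" output above.
SAMPLE = """\
NodeName=node016 Arch=x86_64 CoresPerSocket=18
   CPUAlloc=0 CPUTot=72 CPULoad=0.01
   OS=Linux 3.10.0-862.2.3.el7.x86_64 #1 SMP Wed May 9 18:05:47 UTC 2018
   CoreSpecCount=1 CPUSpecList=70-71
   CfgTRES=cpu=72,mem=95306M,billing=72
   AllocTRES=
"""


def parse_scontrol_node(text):
    """Hypothetical helper: turn key=value tokens into a dict of strings."""
    fields = {}
    for token in text.split():
        # Split on the first "=" only, so CfgTRES keeps its full value.
        key, sep, value = token.partition("=")
        # Tokens without "=" (e.g. the tail of the OS string) are skipped;
        # the first occurrence of a key wins.
        if sep and key not in fields:
            fields[key] = value
    return fields


def core_spec_count(text):
    """Return the reported CoreSpecCount, or 0 if the field is absent."""
    return int(parse_scontrol_node(text).get("CoreSpecCount", 0))


if __name__ == "__main__":
    print("CoreSpecCount reported:", core_spec_count(SAMPLE))
```

On a live system the text could come from running "scontrol show node
node016" and capturing its standard output instead of using SAMPLE.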

The weird thing is that if I run slurmd in the foreground in verbose mode on 
the node with "slurmd -cDvvf /etc/slurm/slurm.conf", the change appears to be 
recognized.

Results with CoreSpecCount=1:

---------------------------------------
slurmd: got reconfigure request
slurmd: all threads complete
slurmd: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmd: debug:  Ignoring obsolete CacheGroups option.
slurmd: debug:  Log file re-opened
slurmd: debug:  CPUs:72 Boards:1 Sockets:2 CoresPerSocket:18 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm' already exists
slurmd: debug:  xcgroup_instantiate: cgroup '/sys/fs/cgroup/cpuset/slurm/system' already exists
slurmd: debug:  system cgroup: system cpuset cgroup initialized
slurmd: Resource spec: Reserved abstract CPU IDs: 70-71
slurmd: Resource spec: Reserved machine CPU IDs: 35,71
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
---------------------------------------
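
One way to read the two ID lists in that log is sketched below in Python. It
rests on an assumption that has not been verified on this node: abstract IDs
number the two threads of each core consecutively, while the machine numbers
the second hyperthread of core c as c + 36 (36 being Sockets x
CoresPerSocket). The function name is hypothetical.

```python
def abstract_to_machine(abstract_id, sockets=2, cores_per_socket=18,
                        threads_per_core=2):
    """Map an abstract CPU ID to a machine CPU ID under the assumed layout.

    Assumption: abstract IDs are core-major (core 0 thread 0, core 0
    thread 1, ...), machine IDs are thread-major (all first threads, then
    all second threads).
    """
    total_cores = sockets * cores_per_socket
    core, thread = divmod(abstract_id, threads_per_core)
    return thread * total_cores + core


if __name__ == "__main__":
    # Abstract IDs reserved with CoreSpecCount=1, per the log above.
    print([abstract_to_machine(i) for i in (70, 71)])
```

Under that assumption, abstract IDs 70-71 are the two threads of the last
core, which lines up with the "35,71" the log reports.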

Results with CoreSpecCount=0:

---------------------------------------
slurmd: got reconfigure request
slurmd: all threads complete
slurmd: debug:  Reading slurm.conf file: /etc/slurm/slurm.conf
slurmd: debug:  Ignoring obsolete CacheGroups option.
slurmd: debug:  Log file re-opened
slurmd: debug:  CPUs:72 Boards:1 Sockets:2 CoresPerSocket:18 ThreadsPerCore:2
slurmd: Message aggregation disabled
slurmd: debug:  Reading cgroup.conf file /etc/slurm/cgroup.conf
slurmd: debug:  Resource spec: No specialized cores configured by default on this node
slurmd: debug:  Resource spec: Reserved system memory limit not configured for this node
---------------------------------------

So slurmd no longer reserves the CPUs, as it should. Why does scontrol still
show the old value, and why do jobs still not run on those cores?

Dave


David Guertin

Information Technology Services
Middlebury College
700 Exchange St.
Middlebury, VT 05753
(802)443-3143
