I don't think you need CPUs in the slurm.conf node definition. Just
give, for example, Sockets=4 CoresPerSocket=4 ThreadsPerCore=1, and
slurmctld does the math for the number of CPUs. Also, running slurmd -C
on the nodes is very useful for seeing what's being autodetected.
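
As a rough sketch of what I mean, here is your compute-0-0 line with
CPUs dropped and everything else left as you posted it. This assumes
the node really is 2 sockets x 4 cores x 2 threads, as you describe;
I have not checked it against slurmd -C output from your hardware.

```
# slurm.conf node definition without CPUs. With CPUs omitted, it
# defaults to Sockets * CoresPerSocket * ThreadsPerCore = 2 * 4 * 2 = 16.
NodeName=compute-0-0 NodeAddr=10.1.255.253 Weight=20495900 Feature=rack-0,16CPUs Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
```

If slurmd -C on that node reports different values, I would go with
those instead.
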
On Wed, Oct 10, 2018 at 11:34 AM Noam Bernstein
<noam.bernst...@nrl.navy.mil> wrote:
>
> Hi all - I’m new to slurm, and in many ways it’s been very nice to work with,
> but I’m having an issue trying to properly set up thread/core/socket counts
> on nodes. Basically, if I don’t specify anything except CPUs, the node is
> available, but doesn’t appear to know about cores and hyperthreading. If I
> do try to specify that info it claims that the numbers aren’t consistent and
> sets the node to drain.
>
> This is all on CentOS 7, slurm 18.08, and FastSchedule is set to 0.
>
> First type of node, 2 x 8 core CPUs, hyperthreading on, nothing specified in
> slurm.conf except CPUs. /proc/cpuinfo confirms that there are 32 “cpus”,
> with the expected values for physical id and core id.
>
> From slurm.conf
>
> NodeName=compute-2-0 NodeAddr=10.1.255.250 CPUs=32 Weight=20511700
> Feature=rack-2,32CPUs
>
> from scontrol show node
>
> NodeName=compute-2-0 Arch=x86_64 CoresPerSocket=1
> CPUAlloc=0 CPUTot=32 CPULoad=0.04
> AvailableFeatures=rack-2,32CPUs
> ActiveFeatures=rack-2,32CPUs
> Gres=(null)
> NodeAddr=10.1.255.250 NodeHostName=compute-2-0 Version=18.08
> OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
> RealMemory=257742 AllocMem=0 FreeMem=255703 Sockets=32 Boards=1
> State=IDLE ThreadsPerCore=1 TmpDisk=913567 Weight=20511700 Owner=N/A
> MCS_label=N/A
> Partitions=CLUSTER,WHEEL,n2013f
> BootTime=2018-10-10T11:06:42 SlurmdStartTime=2018-10-10T11:07:16
> CfgTRES=cpu=32,mem=257742M,billing=94
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
>
>
> Second type of node, 2 x 4 core CPUs, hyperthreading on. /proc/cpuinfo
> confirms that there are 16 “cpus”, with the expected values for physical id
> and core id.
>
> If I set the numbers of sockets/cores/threads as I think is correct (note
> that this is a different type of machine than the previous),
>
> NodeName=compute-0-0 NodeAddr=10.1.255.253 CPUs=16 Weight=20495900
> Feature=rack-0,16CPUs Sockets=2 CoresPerSocket=4 ThreadsPerCore=2
>
> I get the following
>
> NodeName=compute-0-0 Arch=x86_64 CoresPerSocket=4
> CPUAlloc=0 CPUTot=16 CPULoad=0.42
> AvailableFeatures=rack-0,16CPUs
> ActiveFeatures=rack-0,16CPUs
> Gres=(null)
> NodeAddr=10.1.255.253 NodeHostName=compute-0-0 Version=18.08
> OS=Linux 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018
> RealMemory=11842 AllocMem=0 FreeMem=11335 Sockets=2 Boards=1
> State=IDLE+DRAIN ThreadsPerCore=2 TmpDisk=275125 Weight=20495900 Owner=N/A
> MCS_label=N/A
> Partitions=CLUSTER,WHEEL,ib_qdr
> BootTime=2018-10-10T11:06:55 SlurmdStartTime=2018-10-10T11:07:34
> CfgTRES=cpu=16,mem=11842M,billing=18
> AllocTRES=
> CapWatts=n/a
> CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
> ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> Reason=Low socket*core count [root@2018-10-10T10:26:14]
>
>
> I feel like there are a couple of things that are suspicious, but I’m not sure:
> 1. I get the impression that slurm is supposed to be able to automatically
> figure out the architecture of the node, but in the first example there’s no
> evidence of that in the scontrol output.
> 2. When I set the various architecture related parameters it claims that the
> numbers don’t match, even though sockets*cores*threads = 2*4*2 = 16 = CPUs
>
> Does anyone have any idea as to what’s going on, or what other information
> would be useful for debugging?
>
>
>
> thanks,
> Noam