Chris,

> > 1.) Slurm seems to be incapable of recognizing sockets/cores/threads on
> > these systems.
> [...]
> > Anyone know if there is a way to get Slurm to recognize the true topology
> > for POWER nodes?
>
> IIRC Slurm uses hwloc for discovering topology, so "lstopo-no-graphics"
> might give you some insights into whether it's showing you the right
> config.
>
> I'd be curious to see what "lscpu" and "slurmd -C" say as well.
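Thanks. For reference, this is how I have been comparing the three views of
the topology on one of these nodes; nothing exotic, just the stock tools,
assuming hwloc is installed:

# kernel's view, hwloc's view, and Slurm's view of the same node
lscpu | egrep '^(Socket|Core|Thread|NUMA node\(s\))'
lstopo-no-graphics
slurmd -C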
The biggest problem, as I see it, is that with 2 20-core sockets and SMT2
set, the node looks to Slurm like 80 single-core, single-thread sockets
(see the slurmd -C output below). With SMT4 set, it thinks there are 160
sockets.

NodeName=enki13 CPUs=80 Boards=1 SocketsPerBoard=80 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=583992 UpTime=0-23:20:16

How do you set your configuration for Slurm to get meaningful CPU affinity
for, say, placing tasks on 2 cores per socket (instead of scheduling 4
cores on one socket)? (A sketch of what I was planning to try follows the
lscpu listings below.)

For SMT2, lscpu output looks like this:

Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77,80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157
Off-line CPU(s) list:  2,3,6,7,10,11,14,15,18,19,22,23,26,27,30,31,34,35,38,39,42,43,46,47,50,51,54,55,58,59,62,63,66,67,70,71,74,75,78,79,82,83,86,87,90,91,94,95,98,99,102,103,106,107,110,111,114,115,118,119,122,123,126,127,130,131,134,135,138,139,142,143,146,147,150,151,154,155,158,159
Thread(s) per core:    2
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          6
Model:                 2.2 (pvr 004e 1202)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0,1,4,5,8,9,12,13,16,17,20,21,24,25,28,29,32,33,36,37,40,41,44,45,48,49,52,53,56,57,60,61,64,65,68,69,72,73,76,77
NUMA node8 CPU(s):     80,81,84,85,88,89,92,93,96,97,100,101,104,105,108,109,112,113,116,117,120,121,124,125,128,129,132,133,136,137,140,141,144,145,148,149,152,153,156,157
...

For SMT4, it looks like this:

Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                160
On-line CPU(s) list:   0-159
Thread(s) per core:    4
Core(s) per socket:    20
Socket(s):             2
NUMA node(s):          6
Model:                 2.2 (pvr 004e 1202)
Model name:            POWER9, altivec supported
CPU max MHz:           3800.0000
CPU min MHz:           2300.0000
L1d cache:             32K
L1i cache:             32K
L2 cache:              512K
L3 cache:              10240K
NUMA node0 CPU(s):     0-79
NUMA node8 CPU(s):     80-159
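For what it's worth, the direction I was planning to try: override the
autodetected layout in slurm.conf with the physical topology that lscpu
reports, and let the task plugins do the binding. A sketch only, untested
on these nodes, and it assumes the cons_res select plugin (CPUs=80 because
only 80 hardware threads are online under SMT2):

TaskPlugin=task/affinity,task/cgroup
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
NodeName=enki13 CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=583992

and then request the placement explicitly, e.g.:

srun --ntasks=4 --ntasks-per-socket=2 --cpu-bind=cores ./my_app

(./my_app is just a stand-in.) No idea yet whether slurmd will accept that
on POWER9 given what "slurmd -C" detects, so treat it as a guess rather
than a recipe.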
> > 2.) Another concern is the gres.conf. Slurm seems to have trouble taking
> > processor ID's that are > "#Sockets". The true processor ID as given by
> > nvidia-smi topo -m output will range up to 159, and slurm doesn't like
> > this. Are we to use "Cores=" entries in gres.conf, and use the number of
> > the physical cores, instead of what nvidia-smi outputs?
>
> Again I *think* Slurm is using hwloc's logical CPU numbering for this, so
> lstopo should help - using a quick snippet on my local PC (HT enabled)
> here:
>
> Package L#0 + L3 L#0 (8192KB)
>   L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
>     PU L#0 (P#0)
>     PU L#1 (P#4)
>   L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
>     PU L#2 (P#1)
>     PU L#3 (P#5)
>   L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
>     PU L#4 (P#2)
>     PU L#5 (P#6)
>   L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
>     PU L#6 (P#3)
>     PU L#7 (P#7)
>
> you can see that the logical numbering (L#0 and L#1) is done to be
> contiguous compared to how the firmware has enumerated the CPUs.

> > 3.) A related gres.conf question: there seems to be no documentation of
> > using "CPUs=" instead of "Cores=", yet I have seen several online
> > examples using "CPUs=" (and I myself have used it on an x86 system
> > without issue). Should one use "Cores" instead of "CPUs", when
> > specifying binding to specific GPUs?
>
> I think CPUs= was the older syntax which has been replaced with Cores=.
>
> The gres.conf we use on our HPC cluster uses Cores= quite happily.
>
> Name=gpu Type=p100 File=/dev/nvidia0 Cores=0-17
> Name=gpu Type=p100 File=/dev/nvidia1 Cores=18-35

We will try setting Cores= as numbered by "Core L#n" and see how that works
for us.

We are using cgroup enforcement, so a user's job will only see the GPUs it
allocates. I expect the output of "nvidia-smi topo -m" to be similarly
affected: the cores/threads it lists will just be sequential IDs for the
cores/threads the job requested, not the P# IDs reported when
"nvidia-smi topo -m" is run by root outside a Slurm-controlled job.

For SMT2, lstopo output looks like this:

Machine (570GB total)
  Group0 L#0
    NUMANode L#0 (P#0 252GB)
    Package L#0
      L3 L#0 (10MB) + L2 L#0 (512KB)
        L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
          PU L#0 (P#0)
          PU L#1 (P#1)
        L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
          PU L#2 (P#4)
          PU L#3 (P#5)
      L3 L#1 (10MB) + L2 L#1 (512KB)
        L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
          PU L#4 (P#8)
          PU L#5 (P#9)
        L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
          PU L#6 (P#12)
          PU L#7 (P#13)
      ...
      L3 L#9 (10MB) + L2 L#9 (512KB)
        L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18
          PU L#36 (P#72)
          PU L#37 (P#73)
        L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19
          PU L#38 (P#76)
          PU L#39 (P#77)
  ...
    NUMANode L#1 (P#8 256GB)
    Package L#1
      L3 L#10 (10MB) + L2 L#10 (512KB)
        L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20
          PU L#40 (P#80)
          PU L#41 (P#81)
        L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21
          PU L#42 (P#84)
          PU L#43 (P#85)
      ...
      L3 L#19 (10MB) + L2 L#19 (512KB)
        L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38
          PU L#76 (P#152)
          PU L#77 (P#153)
        L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39
          PU L#78 (P#156)
          PU L#79 (P#157)

So my guess here is that GPU0,GPU1 would get Cores=0-19, and GPU2,GPU3
would get Cores=20-39, as numbered by lstopo?

- Keith Ball
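P.S. If that guess is right, our gres.conf would look something like the
sketch below. To be clear, this is an assumption rather than a tested
config: the Type label is just a placeholder, and the GPU-to-socket mapping
(GPU0/GPU1 on the first socket, GPU2/GPU3 on the second) still needs to be
confirmed against "nvidia-smi topo -m" on our hardware.

Name=gpu Type=v100 File=/dev/nvidia0 Cores=0-19
Name=gpu Type=v100 File=/dev/nvidia1 Cores=0-19
Name=gpu Type=v100 File=/dev/nvidia2 Cores=20-39
Name=gpu Type=v100 File=/dev/nvidia3 Cores=20-39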