Hello,

We have four Xeon Phi (KNL) nodes, each with 64 cores and 4-way SMT (256 hyperthreads per node). They are configured in different KNL modes (SNC4/flat, SNC4/cache, All2all/flat and All2all/cache). The node that is in SNC4/flat won't let us allocate all 256 hyperthreads. Half the cores only get 2 hyperthreads instead of 4:
    $ srun -c256 -w kona02 --exclusive grep -i cpu /proc/self/status
    Cpus_allowed:       ffffffff,ffffffff,ffffffff,ffffffff,0000ffff,0000ffff,0000ffff,0000ffff
    Cpus_allowed_list:  0-15,32-47,64-79,96-111,128-255
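To double-check the "half the cores only get 2 hyperthreads" reading, here is a small Python sketch that expands the Cpus_allowed_list above and counts allowed hyperthreads per core. It assumes the usual Linux numbering on KNL, where logical CPUs i, i+64, i+128 and i+192 are the four hyperthreads of core i. That mapping is an assumption here, not something taken from the node (it can be checked with lstopo or /sys/devices/system/cpu/cpu*/topology/thread_siblings_list). The helper name parse_cpu_list is only illustrative.

```python
from collections import Counter

def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as '0-15,32-47' into a list of ints."""
    cpus = []
    for part in text.split(","):
        lo, _, hi = part.partition("-")
        # A single CPU ("7") has no upper bound, so reuse the lower one.
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus

# Cpus_allowed_list reported on the SNC4/flat node (kona02).
allowed = parse_cpu_list("0-15,32-47,64-79,96-111,128-255")

# ASSUMPTION: hyperthreads of core c are numbered c, c+64, c+128, c+192.
N_CORES = 64
threads_per_core = Counter(cpu % N_CORES for cpu in allowed)

# How many cores end up with 4 allowed hyperthreads, how many with 2, etc.
histogram = Counter(threads_per_core.values())

print("allowed hyperthreads:", len(allowed))
for nthreads, ncores in sorted(histogram.items()):
    print(f"{ncores} cores with {nthreads} allowed hyperthreads")
```

Under that assumption, the list contains 192 of the 256 hyperthreads: 32 cores keep all 4 hyperthreads and the other 32 cores keep only 2.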
Other nodes configured in other KNL modes are fine, we get all 256 hyperthreads:
    $ srun -c256 -w kona03 --exclusive grep -i cpu /proc/self/status
    Cpus_allowed:       ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
    Cpus_allowed_list:  0-255
If we reconfigure the buggy node to All2all/cache, it works fine. If we reconfigure another node to SNC4/flat, it starts having the same issue. So it looks like something fails only when KNL is configured in SNC4/flat?
All nodes are configured the same in slurm.conf:

    NodeName=kona[01-04] Procs=256 CoresPerSocket=64 RealMemory=94000 Sockets=1 ThreadsPerCore=4 Feature=kona,intel,knightslanding,knl Weight=70

FWIW, we're using SLURM 19.05.2. An upgrade is possible in the future but not immediately. The "KNL" plugin is installed, but we don't think we've done anything to configure it (at least we never used it to reconfigure/reboot KNL nodes).
Thanks,
Brice