Hi community, I am unable to tell whether SLURM is handling the following situation efficiently in terms of CPU affinities on each partition.
Here we have a very small cluster with just one GPU node (8x GPUs), which offers two partitions: "gpu" and "cpu".

Part of the config file:

## Nodes list
## use native GPUs
NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu

## Default CPU layout (same total cores as others)
#NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 State=UNKNOWN Gres=gpu:a100:4,gpu:a100_20g:2,gpu:a100_10g:2,gpu:a100_5g:16 Feature=ht,gpu

## Partitions list
PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES
PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefMemPerNode=16384 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01

The node has 128 CPU cores (2x 64-core AMD CPUs, SMT disabled), and the resources have been subdivided through the partition options: MaxCPUsPerNode=64 for each partition.

The gres file is auto-generated with nvml, and it obeys the following GPU topology (focus on the CPU Affinity column):

➜ ~ nvidia-smi topo -m
        GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   GPU6   GPU7   mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity  NUMA Affinity
GPU0    X      NV12   NV12   NV12   NV12   NV12   NV12   NV12   PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    48-63         3
GPU1    NV12   X      NV12   NV12   NV12   NV12   NV12   NV12   PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    SYS    SYS    48-63         3
GPU2    NV12   NV12   X      NV12   NV12   NV12   NV12   NV12   SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    16-31         1
GPU3    NV12   NV12   NV12   X      NV12   NV12   NV12   NV12   SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    SYS    SYS    16-31         1
GPU4    NV12   NV12   NV12   NV12   X      NV12   NV12   NV12   SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    112-127       7
GPU5    NV12   NV12   NV12   NV12   NV12   X      NV12   NV12   SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    SYS    SYS    112-127       7
GPU6    NV12   NV12   NV12   NV12   NV12   NV12   X      NV12   SYS    SYS    SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    80-95         5
GPU7    NV12   NV12   NV12   NV12   NV12   NV12   NV12   X      SYS    SYS    SYS    SYS    SYS    SYS    PXB    PXB    SYS    SYS    80-95         5

If we look closely, we can see specific CPU affinities for the GPUs, so I assume that multi-core CPU jobs should use the 64 CPU cores that are not listed here, e.g., cores 0-15, 32-47, .... Will SLURM realize that CPU jobs should get this core affinity? If not, is there a way I can make these the default CPU affinities for all jobs launched on the "cpu" partition?

Any help is welcome.

--
Cristóbal A. Navarro
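PS: In case it helps, this is roughly what I understand the nvml-generated gres file should encode. I have not pasted the actual file; the lines below are my own hand-written equivalent, with the Cores= ranges copied from the CPU Affinity column above, and with the /dev/nvidia* device names and the Slurm core numbering assumed to match nvidia-smi's:

# Hypothetical gres.conf equivalent of the nvml auto-detected topology on nodeGPU01.
# Device file names and core numbering are my assumptions, not a dump of the real file.
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia0 Cores=48-63
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia1 Cores=48-63
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-31
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia3 Cores=16-31
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia4 Cores=112-127
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia5 Cores=112-127
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia6 Cores=80-95
NodeName=nodeGPU01 Name=gpu Type=A100 File=/dev/nvidia7 Cores=80-95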
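And this srun line shows the kind of placement I mean for the "cpu" partition, spelled out by hand (it assumes TaskPlugin=task/affinity is active and that the job's allocation actually contains these cores; my_cpu_app is just a placeholder). What I am hoping for is that jobs get this binding by default, without every user having to type it:

# Two 16-core tasks pinned to core ranges that have no GPU affinity:
# mask 0xFFFF = cores 0-15, mask 0xFFFF00000000 = cores 32-47
srun -p cpu -n 2 -c 16 --cpu-bind=verbose,mask_cpu:0xFFFF,0xFFFF00000000 ./my_cpu_app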