Hi Brian,

Thanks. Yes, we have a single node entry; it's just that I accidentally included the commented-out node entry as well when pasting the config file. Sorry for that.
So from what you mention, I should add some QOS settings to the partitions in order to set proper CPU affinities, right? (I put a rough sketch of what I have in mind after the quoted message below.)

On Sat, May 8, 2021, 12:15 PM Brian Andrus <toomuc...@gmail.com> wrote:
> Cristóbal,
>
> Your approach is a little off.
>
> Slurm needs to know about the node properties. It can then allocate them
> based on job/partition.
>
> So, you should have a single "NodeName" entry for the node that accurately
> describes everything you want to allow access to.
>
> Then you limit what is allowed to be requested in the partition definition
> and/or a QOS (if you are using accounting).
>
> Brian Andrus
>
> On 5/7/2021 8:11 PM, Cristóbal Navarro wrote:
> Hi community,
> I am unable to tell if SLURM is handling the following situation
> efficiently in terms of CPU affinities at each partition.
>
> Here we have a very small cluster with just one GPU node with 8x GPUs,
> which offers two partitions --> "gpu" and "cpu".
>
> Part of the config file:
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> ## Default CPU layout (same total cores as others)
> #NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 State=UNKNOWN Gres=gpu:a100:4,gpu:a100_20g:2,gpu:a100_10g:2,gpu:a100_5g:16 Feature=ht,gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefMemPerNode=16384 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01
>
> The node has 128 CPU cores (2x 64-core AMD CPUs, SMT disabled) and the
> resources have been subdivided through the partition options, with a
> maximum of 64 cores for each partition.
> The gres file is auto-generated with NVML, and it obeys the following GPU
> topology (focus on the CPU affinity column):
>
> ➜ ~ nvidia-smi topo -m
>      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
> GPU0 X    NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63   3
> GPU1 NV12 X    NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63   3
> GPU2 NV12 NV12 X    NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31   1
> GPU3 NV12 NV12 NV12 X    NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31   1
> GPU4 NV12 NV12 NV12 NV12 X    NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
> GPU5 NV12 NV12 NV12 NV12 NV12 X    NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
> GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X    NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95   5
> GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95   5
>
> If we look closely, we can see specific CPU affinities for the GPUs, so I
> assume that multi-core CPU jobs should use the 64 CPU cores that are not
> listed here, e.g., cores 0-15, 32-47, ...
> Will SLURM realize that CPU jobs should have this core affinity? If not,
> is there a way I can make the default CPU affinities the correct ones for
> all jobs launched on the "cpu" partition?
> Any help is welcome
> --
> Cristóbal A. Navarro
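To make the question a bit more concrete, this is roughly what I had in mind. It is only a sketch on my side, not something already in our config: the QOS name "cpuonly" and the /dev/nvidiaN-to-GPU ordering are assumptions of mine, and the Cores= values are simply copied from the CPU Affinity column of the nvidia-smi topo -m output above.

# (1) A QOS attached to the "cpu" partition (hypothetical name "cpuonly"),
#     created with sacctmgr, capping the CPUs a job may take on the node:
sacctmgr add qos cpuonly
sacctmgr modify qos cpuonly set MaxTRESPerNode=cpu=64

# slurm.conf: same partition line as before, with the QOS attached:
PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefMemPerNode=16384 QOS=cpuonly MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01

# (2) A gres.conf written out explicitly instead of NVML autodetect, so the
#     GPU-to-core affinity is stated in the config (the device-file order
#     here is my guess; it would need to be checked against the node):
Name=gpu Type=A100 File=/dev/nvidia0 Cores=48-63
Name=gpu Type=A100 File=/dev/nvidia1 Cores=48-63
Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-31
Name=gpu Type=A100 File=/dev/nvidia3 Cores=16-31
Name=gpu Type=A100 File=/dev/nvidia4 Cores=112-127
Name=gpu Type=A100 File=/dev/nvidia5 Cores=112-127
Name=gpu Type=A100 File=/dev/nvidia6 Cores=80-95
Name=gpu Type=A100 File=/dev/nvidia7 Cores=80-95

If that is roughly what you were suggesting, I can try it on our side first and report back.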