Hi Brian,

Thanks. Yes, we have a single node entry; it's just that I accidentally included the commented-out node entry as well when pasting the config file. Sorry for that.
So from what you mention, I should add some QOS settings to the partitions in order to set proper CPU affinities, right? (I put a rough sketch of what I have in mind after the quoted message below.)

On Sat, May 8, 2021, 12:15 PM Brian Andrus <toomuc...@gmail.com> wrote:
> Cristóbal,
>
> Your approach is a little off.
>
> Slurm needs to know about the node properties. It can then allocate them
> based on job/partition.
>
> So, you should have a single "NodeName" entry for the node that accurately
> describes everything you want to allow access to.
>
> Then you limit what is allowed to be requested in the partition definition
> and/or a QOS (if you are using accounting).
>
> Brian Andrus
>
> On 5/7/2021 8:11 PM, Cristóbal Navarro wrote:
> Hi community,
> I am unable to tell if SLURM is handling the following situation
> efficiently in terms of CPU affinities at each partition.
>
> Here we have a very small cluster with just one GPU node with 8x GPUs,
> which offers two partitions --> "gpu" and "cpu".
>
> Part of the config file:
>
> ## Nodes list
> ## use native GPUs
> NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 State=UNKNOWN Gres=gpu:A100:8 Feature=gpu
>
> ## Default CPU layout (same total cores as others)
> #NodeName=nodeGPU01 SocketsPerBoard=8 CoresPerSocket=16 ThreadsPerCore=1 RealMemory=1024000 State=UNKNOWN Gres=gpu:a100:4,gpu:a100_20g:2,gpu:a100_10g:2,gpu:a100_5g:16 Feature=ht,gpu
>
> ## Partitions list
> PartitionName=gpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefCpuPerGPU=8 DefMemPerGPU=65556 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01 Default=YES
> PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefMemPerNode=16384 MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01
>
> The node has 128 CPU cores (2x 64-core AMD CPUs, SMT disabled) and the
> resources have been subdivided through the partition options, with a
> maximum of 64 cores for each partition.
> The gres file is auto-generated with NVML, and it obeys the following GPU
> topology (focus on the CPU affinity column):
>
> ➜ ~ nvidia-smi topo -m
>      GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 CPU Affinity NUMA Affinity
> GPU0 X    NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63   3
> GPU1 NV12 X    NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS 48-63   3
> GPU2 NV12 NV12 X    NV12 NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31   1
> GPU3 NV12 NV12 NV12 X    NV12 NV12 NV12 NV12 SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS 16-31   1
> GPU4 NV12 NV12 NV12 NV12 X    NV12 NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
> GPU5 NV12 NV12 NV12 NV12 NV12 X    NV12 NV12 SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS 112-127 7
> GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X    NV12 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95   5
> GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X    SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS 80-95   5
>
> If we look closely, we can see specific CPU affinities for the GPUs, so I
> assume that multi-core CPU jobs should use the 64 CPU cores that are not
> listed here, e.g., cores 0-15, 32-47, ...
> Will SLURM realize that CPU jobs should have this core affinity? If not,
> is there a way I can make the default CPU affinities the correct ones for
> all jobs launched on the "cpu" partition?
> Any help is welcome
> --
> Cristóbal A. Navarro
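To make the question a bit more concrete, this is roughly what I had in mind. It is only a sketch on my side, not something already in our config: the QOS name "cpuonly" and the /dev/nvidiaN-to-GPU ordering are assumptions of mine, and the Cores= values are simply copied from the CPU Affinity column of the nvidia-smi topo -m output above.

# (1) A QOS attached to the "cpu" partition (hypothetical name "cpuonly"),
#     created with sacctmgr, capping the CPUs a job may take on the node:
sacctmgr add qos cpuonly
sacctmgr modify qos cpuonly set MaxTRESPerNode=cpu=64

# slurm.conf: same partition line as before, with the QOS attached:
PartitionName=cpu OverSubscribe=FORCE MaxCPUsPerNode=64 DefMemPerNode=16384 QOS=cpuonly MaxTime=1-00:00:00 State=UP Nodes=nodeGPU01

# (2) A gres.conf written out explicitly instead of NVML autodetect, so the
#     GPU-to-core affinity is stated in the config (the device-file order
#     here is my guess; it would need to be checked against the node):
Name=gpu Type=A100 File=/dev/nvidia0 Cores=48-63
Name=gpu Type=A100 File=/dev/nvidia1 Cores=48-63
Name=gpu Type=A100 File=/dev/nvidia2 Cores=16-31
Name=gpu Type=A100 File=/dev/nvidia3 Cores=16-31
Name=gpu Type=A100 File=/dev/nvidia4 Cores=112-127
Name=gpu Type=A100 File=/dev/nvidia5 Cores=112-127
Name=gpu Type=A100 File=/dev/nvidia6 Cores=80-95
Name=gpu Type=A100 File=/dev/nvidia7 Cores=80-95

If that is roughly what you were suggesting, I can try it on our side first and report back.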