You may want to look here:

https://slurm.schedmd.com/heterogeneous_jobs.html
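For example, each node can be requested as its own heterogeneous-job component with a separate task count. A minimal sketch (untested, using the node names and core counts from the question below):

$ salloc --partition=all --nodelist=gpu01 --ntasks-per-node=32 : \
         --partition=all --nodelist=gpu02 --ntasks-per-node=64

Since every component carries its own resource specification, the 32-core and 64-core nodes no longer have to share a single --ntasks-per-node value.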

Brian Andrus

On 12/7/2022 12:42 AM, Le, Viet Duc wrote:

Dear Slurm community,


I am encountering an unusual situation where I need to allocate a job across nodes with different numbers of CPU cores. For instance:

gpu01: Xeon 6226, 32 cores

gpu02: EPYC 7543, 64 cores


$ salloc --partition=all --nodes=2 --nodelist=gpu01,gpu02 --ntasks-per-node=32 --comment=etc

If --ntasks-per-node is larger than 32, the job cannot be allocated, since gpu01 has only 32 cores.
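For reference, Slurm's view of each node's CPU count can be confirmed with a sinfo format string, for example:

$ sinfo -N -o "%N %c"   # one line per node: node name and configured CPU count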


In the context of NVIDIA's HPL container, we need to pin MPI processes according to NUMA affinity for best performance.

For HGX-1, the eight A100s have affinity with the 1st, 3rd, 5th, and 7th NUMA domains, two GPUs per domain.

With --ntasks-per-node=32, only the first half of the EPYC's NUMA domains is available, so we had to assign the 4th-7th A100s to the 0th and 2nd NUMA domains, leading to some performance degradation.
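As an aside, the GPU-to-NUMA mapping itself can be inspected on the node; on recent drivers the topology matrix includes per-GPU CPU and NUMA affinity columns:

$ nvidia-smi topo -m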


I am looking for a way to request more tasks than the number of physically available cores, i.e.

$ salloc --partition=all --nodes=2 --nodelist=gpu01,gpu02 --ntasks-per-node=64 --comment=etc


Your suggestions are much appreciated.


Regards,

Viet-Duc
