Navin,

You can isolate GPUs per job if you have cgroups set up properly. What OS are 
you using? Newer distributions support cgroup v2 out of the box, but you can 
continue using v1 if necessary; this workflow applies to both.

Add ConstrainDevices=yes to your cgroup.conf

This is what the file looks like at my site:
/etc/slurm/cgroup.conf
CgroupMountpoint="/sys/fs/cgroup"
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes

You can find the documentation here:
https://slurm.schedmd.com/cgroup.conf.html
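As a quick sanity check after restarting slurmd (a rough sketch; partition and 
node names will differ at your site), a job that requests one GPU should only 
be able to see that one device:

```shell
# With ConstrainDevices=yes, a 1-GPU job should list exactly one device;
# without it, nvidia-smi inside the job would show both GPUs on the node.
srun --gres=gpu:1 nvidia-smi -L

# A 2-GPU request should list both devices.
srun --gres=gpu:2 nvidia-smi -L
```

If the single-GPU job still lists both devices, the device cgroup is not being 
applied and cgroup.conf / the cgroup plugin configuration is worth rechecking.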

If you want to share GPUs you can use CUDA MPS or MIG if your GPU supports it.
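For MPS, a minimal sketch of starting the control daemon on a node (the pipe 
and log directories here are examples, not required paths; see NVIDIA's MPS 
documentation for your driver version):

```shell
# Restrict the MPS daemon to GPU 0 and start it in daemon mode.
# CUDA clients that use the same pipe directory will share this GPU.
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps   # example path
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-log    # example path
nvidia-cuda-mps-control -d
```

Slurm can also schedule MPS shares directly via GresTypes=gpu,mps if you want 
the scheduler to manage the sharing rather than doing it by hand.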

Regards,
Jesse Chintanadilok

From: navin srivastava via slurm-users <slurm-users@lists.schedmd.com>
Sent: Wednesday, February 12, 2025 10:30
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: [EXTERNAL] [slurm-users] avoid using same GPU by the interactive job

hi,

I'm facing an issue in my environment where a batch job and an interactive job 
use the same GPU.

Each server has 2 GPUs. When 2 batch jobs are running, it works fine and they 
use the 2 different GPUs. But if one batch job is running and another job is 
submitted interactively, it uses the same GPU. Is there a way to avoid this?

GresTypes=gpu
NodeName=node[01-02] NodeAddr=node[01-02] CPUs=48 Boards=1 SocketsPerBoard=2 
CoresPerSocket=24 ThreadsPerCore=1 TmpDisk=6000000 RealMemory=515634 
Feature=A100 Gres=gpu:2

PartitionName=onprem Nodes=node[01-10] Default=YES MaxTime=21-00:00:00 
DefaultTime=3-00:00:00 State=UP Shared=YES OverSubscribe=NO

gres.conf:
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1

Any suggestions on this?

Regards
Navin
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
