Well, that's kind of the core issue: without cgroups, _any_ process in the job will have access to all of the GPUs on the system, and there's not much more that Slurm can do about it at that point.
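As a concrete illustration (a sketch only, assuming the two-GPU A100 nodes shown in the quoted config further down; the --gres syntax and nvidia-smi are standard, nothing here is specific to Navin's site): without device confinement, a job that requested a single GPU can still enumerate and use both.

    # One-GPU job on an unconstrained node: Slurm narrows CUDA_VISIBLE_DEVICES,
    # but the device files themselves remain accessible to the job's processes.
    srun --gres=gpu:1 bash -c 'echo CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES; nvidia-smi -L'
    # With cgroup ConstrainDevices=yes, nvidia-smi -L inside the job would list
    # only the allocated device; without it, both GPUs show up.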
I would have a look at the environment variable CUDA_VISIBLE_DEVICES
<https://slurm.schedmd.com/gres.html#GPU_Management>. It is set by Slurm and
should hold an index (0, 1, 2, etc.) directing applications to an appropriate
GPU. I think it's more a case that the batch processes are honoring that
variable and the interactive job is not (a quick way to check this is
sketched after the quoted thread below).

- Michael

On Wed, Feb 12, 2025 at 9:00 PM navin srivastava via slurm-users
<slurm-users@lists.schedmd.com> wrote:

> Thank you Jesse.
>
> I am using Enterprise SLES15SP6 as the OS. I have not introduced the
> cgroup functionality in my environment. I can think about it and will see
> if this solution works out, but is there any other way to achieve the same
> without cgroups? Batch jobs are fine: two jobs, each requesting one GPU,
> work correctly. It is the mixed case (one batch job plus one interactive
> job) that creates the problem.
>
> Is there a way I can run a job and apply exclusivity only to the GPU
> resources?
>
> Regards
> Navin.
>
> On Wed, Feb 12, 2025 at 11:24 PM Chintanadilok, Jesse <jc...@ti.com> wrote:
>
>> Navin,
>>
>> You can isolate GPUs per job if you have cgroups set up properly. What OS
>> are you using? Newer OSes support cgroup v2 out of the box, but if
>> necessary you can continue using v1; this workflow should be applicable
>> to both.
>>
>> Add ConstrainDevices=yes to your cgroup.conf.
>>
>> This is what the file looks like at my site:
>>
>> /etc/slurm/cgroup.conf
>> CgroupMountpoint="/sys/fs/cgroup"
>> ConstrainCores=yes
>> ConstrainRAMSpace=yes
>> ConstrainSwapSpace=no
>> ConstrainDevices=yes
>>
>> You can find the documentation here:
>> https://slurm.schedmd.com/cgroup.conf.html
>>
>> If you want to share GPUs you can use CUDA MPS, or MIG if your GPU
>> supports it.
>>
>> Regards,
>> Jesse Chintanadilok
>>
>> From: navin srivastava via slurm-users <slurm-users@lists.schedmd.com>
>> Sent: Wednesday, February 12, 2025 10:30
>> To: Slurm User Community List <slurm-users@lists.schedmd.com>
>> Subject: [EXTERNAL] [slurm-users] avoid using same GPU by the interactive job
>>
>> hi,
>>
>> facing an issue in my environment where the batch job and the interactive
>> job use the same GPU.
>>
>> Each server has 2 GPUs. When 2 batch jobs are running, it works fine and
>> they use the 2 different GPUs, but if one batch job is running and another
>> job is submitted interactively, then it uses the same GPU. Is there a way
>> to avoid this?
>>
>> GresTypes=gpu
>>
>> NodeName=node[01-02] NodeAddr=node[01-02] CPUs=48 Boards=1
>> SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=1 TmpDisk=6000000
>> RealMemory=515634 Feature=A100 Gres=gpu:2
>>
>> PartitionName=onprem Nodes=node[01-10] Default=YES MaxTime=21-00:00:00
>> DefaultTime=3-00:00:00 State=UP Shared=YES OverSubscribe=NO
>>
>> gres.conf:
>> Name=gpu File=/dev/nvidia0
>> Name=gpu File=/dev/nvidia1
>>
>> Any suggestions on this?
>>
>> Regards
>> Navin
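Coming back to the CUDA_VISIBLE_DEVICES check mentioned above: one thing worth confirming is that the interactive session actually requests a GPU at all. This is only a sketch (the sbatch/srun invocations and my_gpu_app are illustrative, not taken from Navin's actual submissions):

    # Batch side: request one GPU; Slurm exports the index of the device it picked.
    sbatch --gres=gpu:1 --wrap 'echo "batch job sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; ./my_gpu_app'

    # Interactive side: the GRES request matters here too. If it is omitted,
    # CUDA_VISIBLE_DEVICES is never set and most CUDA applications simply open
    # device 0, which may be the GPU the batch job is already using.
    srun --gres=gpu:1 --pty bash
    echo $CUDA_VISIBLE_DEVICES   # inside the session; should differ from the batch job's index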
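Following up on Jesse's ConstrainDevices suggestion: if you do go the cgroup route, the rough rollout sequence looks like the following. The hostnames match the node[01-02] definition above, but the systemd restart and the TaskPlugin=task/cgroup line in slurm.conf are assumptions about a typical setup, so adjust for your site.

    # slurm.conf: the cgroup.conf constraints are enforced by the task/cgroup plugin
    TaskPlugin=task/cgroup

    # cgroup.conf (on every compute node), as in Jesse's example:
    ConstrainDevices=yes

    # Push the files out and restart slurmd so the device constraint takes effect:
    for n in node01 node02; do
        scp /etc/slurm/cgroup.conf /etc/slurm/slurm.conf "$n:/etc/slurm/"
        ssh "$n" systemctl restart slurmd
    done

    # Afterwards, a one-GPU job should only be able to see its own device:
    srun --gres=gpu:1 nvidia-smi -L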