Greetings,

There are not many questions regarding GPU sharding here, and I am unsure if I am using it correctly. I have configured it according to the instructions at <https://slurm.schedmd.com/gres.html>, and it seems to be configured properly:

$ scontrol show node compute01
NodeName=compute01 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=48 CPUEfctv=128 CPUTot=128 CPULoad=10.95
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   *Gres=gpu:8,shard:32*
   [truncated]

When running with gres:gpu everything works perfectly:

$ /usr/bin/srun --gres=gpu:2 ls
srun: job 192 queued and waiting for resources
srun: job 192 has been allocated resources
(...)

However, when using sharding, it just stays waiting indefinitely:

$ /usr/bin/srun --gres=shard:2 ls
srun: job 193 queued and waiting for resources

The reason it gives for pending is just "Resources":

$ scontrol show job 193
JobId=193 JobName=ls
   UserId=rpcruz(1000) GroupId=rpcruz(1000) MCS_label=N/A
   Priority=1 Nice=0 Account=account QOS=normal
   *JobState=PENDING Reason=Resources Dependency=(null)*
   Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
   RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2024-06-28T05:36:51 EligibleTime=2024-06-28T05:36:51
   AccrueTime=2024-06-28T05:36:51
   StartTime=2024-06-29T18:13:22 EndTime=2024-07-01T18:13:22 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-06-28T05:37:20 Scheduler=Backfill:*
   Partition=partition AllocNode:Sid=localhost:47757
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=1,mem=1031887M,node=1,billing=1
   AllocTRES=(null)
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=ls
   WorkDir=/home/rpcruz
   Power=
   *TresPerNode=gres/shard:2*

Again, I think I have configured it properly - it shows up correctly in scontrol (as shown above).
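
To make the arithmetic behind my request explicit, here is a minimal Python sketch that reads the Gres string reported by scontrol above and works out how many shards correspond to one GPU. It assumes the shards are spread evenly over the GPUs and that every entry has the plain name:count form; the helper name parse_gres is hypothetical and not part of Slurm.

```python
def parse_gres(gres: str) -> dict[str, int]:
    """Parse a Gres string such as 'gpu:8,shard:32' into {name: count}.

    Assumption: plain 'name:count' or 'name:type:count' entries only,
    with no '(S:0-1)'-style suffixes and no K/M count multipliers.
    """
    counts: dict[str, int] = {}
    for entry in gres.split(","):
        fields = entry.strip().split(":")
        name = fields[0]
        # The count is the last field; a bare name counts as 1.
        count = int(fields[-1]) if len(fields) > 1 else 1
        counts[name] = counts.get(name, 0) + count
    return counts


# String copied by hand from the scontrol output above.
node_gres = parse_gres("gpu:8,shard:32")

# Assumption: shards are distributed evenly across all GPUs on the node.
shards_per_gpu = node_gres["shard"] / node_gres["gpu"]

# Fraction of one GPU that --gres=shard:2 would correspond to.
gpu_fraction = 2 / shards_per_gpu

print(f"{shards_per_gpu:g} shards per GPU; shard:2 is {gpu_fraction:g} of a GPU")
```

This only checks my own reasoning about the counts; it says nothing about why the scheduler leaves the job pending.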

Our setup is pretty simple. I just added shard to /etc/slurm/slurm.conf:

GresTypes=gpu,shard
NodeName=compute01 Gres=gpu:8,shard:32 [truncated]

Our /etc/slurm/gres.conf is also straightforward (it works fine for --gres=gpu:1):

Name=gpu File=/dev/nvidia[0-7]
Name=shard Count=32

Maybe I am just running srun improperly? Shouldn't it just be srun --gres=shard:2 to allocate half of a GPU? (I am using 32 shards for the 8 GPUs, so that is 4 shards per GPU.)

Thank you very much for your attention,
--
Ricardo Cruz - https://rpmcruz.github.io