Dear Brian,

Currently, 5 of the 8 GPUs are allocated, so 3 are free. Even so:

rpcruz@atlas:~$ /usr/bin/srun --gres=shard:2 ls
srun: job 515 queued and waiting for resources

The job shows as PD in squeue.
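
For reference, this is roughly how I have been inspecting the pending job (hedging on the exact squeue format string, but %T should be the state, %R the reason, and %b the requested gres):

rpcruz@atlas:~$ squeue -j 515 -o "%.8i %.10T %.12R %b"
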
scontrol says that 5 of the 8 GPUs are allocated:

rpcruz@atlas:~$ scontrol show node compute01
NodeName=compute01 Arch=x86_64 CoresPerSocket=32
   CPUAlloc=80 CPUEfctv=128 CPUTot=128 CPULoad=65.38
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8,shard:32
   NodeAddr=compute01 NodeHostName=compute01 Version=23.11.4
   OS=Linux 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 10:49:14 UTC 2024
   RealMemory=1031887 AllocMem=644925 FreeMem=701146 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=partition
   BootTime=2024-07-02T14:08:37 SlurmdStartTime=2024-07-02T14:08:51
   LastBusyTime=2024-07-03T12:02:11 ResumeAfterTime=None
   CfgTRES=cpu=128,mem=1031887M,billing=128,gres/gpu=8
   AllocTRES=cpu=80,mem=644925M,gres/gpu=5
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a

rpcruz@atlas:~$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
partition*    up 5-00:00:00      1    mix compute01


The output above is the same whether or not an "srun --gres=shard:2" job is pending.
I wonder if the problem is that CfgTRES is not showing gres/shard... it sounds like it should, right?
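
One thing I am tempted to try, in case it is relevant (this is just my reading of the sharding section of the gres docs, so it may well be off), is listing the shard TRES explicitly in slurm.conf so it gets tracked and reported:

# addition I am considering, not applied yet
AccountingStorageTRES=gres/gpu,gres/shard

Though I am not sure whether that would only affect accounting or also the scheduling itself.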

The complete last part of my /etc/slurm/slurm.conf (which is, of course, the same on the login node and the compute node):

# COMPUTE NODES
GresTypes=gpu,shard
NodeName=compute01 Gres=gpu:8,shard:32 CPUs=128 RealMemory=1031887 Sockets=2 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=partition Nodes=ALL Default=YES MaxTime=5-00:00:00 State=UP DefCpuPerGPU=16 DefMemPerGPU=128985
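
In case it matters: my understanding is that GresTypes changes only take effect after restarting the daemons (not just an scontrol reconfigure), so after editing I restart both, along these lines:

sudo systemctl restart slurmctld   # on the host running slurmctld
sudo systemctl restart slurmd      # on compute01

(The unit names are just how the packages installed them here.)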

And on the compute node, /etc/slurm/gres.conf is:
Name=gpu File=/dev/nvidia[0-7]
Name=shard Count=32
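
And as a sanity check on the compute node side, slurmd itself can print the GRES configuration it detects and exit (the -G flag, if I am reading the man page right):

rpcruz@compute01:~$ sudo slurmd -G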


Thank you!
--
Ricardo Cruz - https://rpmcruz.github.io


Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote
(Thursday, 04/07/2024 at 17:16):

> To help dig into it, can you paste the full output of scontrol show node
> compute01 while the job is pending? Also 'sinfo' would be good.
>
> It is basically telling you there aren't enough resources in the partition
> to run the job. Often this is because all the nodes are in use at that
> moment.
>
> Brian Andrus
> On 7/4/2024 8:43 AM, Ricardo Cruz via slurm-users wrote:
>
> Greetings,
>
> There are not many questions regarding GPU sharding here, and I am unsure
> if I am using it correctly... I have configured it according to the
> instructions <https://slurm.schedmd.com/gres.html>, and it seems to be
> configured properly:
>
> $ scontrol show node compute01
> NodeName=compute01 Arch=x86_64 CoresPerSocket=32
>    CPUAlloc=48 CPUEfctv=128 CPUTot=128 CPULoad=10.95
>    AvailableFeatures=(null)
>    ActiveFeatures=(null)
>
>    Gres=gpu:8,shard:32
>    [truncated]
>
> When running with gres:gpu everything works perfectly:
>
> $ /usr/bin/srun --gres=gpu:2 ls
> srun: job 192 queued and waiting for resources
> srun: job 192 has been allocated resources
> (...)
>
> However, when using sharding, it just stays waiting indefinitely:
>
> $ /usr/bin/srun --gres=shard:2 ls
> srun: job 193 queued and waiting for resources
>
> The reason it gives for pending is just "Resources":
>
> $ scontrol show job 193
> JobId=193 JobName=ls
>    UserId=rpcruz(1000) GroupId=rpcruz(1000) MCS_label=N/A
>    Priority=1 Nice=0 Account=account QOS=normal
>
>    JobState=PENDING Reason=Resources Dependency=(null)
>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>    RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
>    SubmitTime=2024-06-28T05:36:51 EligibleTime=2024-06-28T05:36:51
>    AccrueTime=2024-06-28T05:36:51
>    StartTime=2024-06-29T18:13:22 EndTime=2024-07-01T18:13:22 Deadline=N/A
>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-06-28T05:37:20 Scheduler=Backfill:*
>    Partition=partition AllocNode:Sid=localhost:47757
>    ReqNodeList=(null) ExcNodeList=(null)
>    NodeList=
>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>    ReqTRES=cpu=1,mem=1031887M,node=1,billing=1
>    AllocTRES=(null)
>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>    Features=(null) DelayBoot=00:00:00
>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>    Command=ls
>    WorkDir=/home/rpcruz
>    Power=
>    TresPerNode=gres/shard:2
>
> Again, I think I have configured it properly - it shows up correctly in
> scontrol (as shown above).
> Our setup is pretty simple - I just added shard to /etc/slurm/slurm.conf:
> GresTypes=gpu,shard
> NodeName=compute01 Gres=gpu:8,shard:32 [truncated]
> Our /etc/slurm/gres.conf is also straight-forward: (it works fine for
> --gres=gpu:1)
> Name=gpu File=/dev/nvidia[0-7]
> Name=shard Count=32
>
>
> Maybe I am just running srun improperly? Shouldn't it just be
> srun --gres=shard:2 to allocate half of a GPU? (since I am using 32 shards
> for the 8 gpus, so it's 4 shards per gpu)
>
> Thank you very much for your attention,
> --
> Ricardo Cruz - https://rpmcruz.github.io
>
>
>
-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
