I would try specifying CPUs and memory explicitly, just to be sure the job isn't requesting 0 or all of them.
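For example, something like this (the CPU and memory values below are just placeholders, pick whatever fits your jobs):

  srun --gres=shard:2 --cpus-per-task=1 --mem=4G ls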

Also, I ran into a weird issue in my lab cluster when I had OverSubscribe=YES:2 
while playing with shards: jobs would sit pending on Resources even though no 
GPUs or shards were allocated.
Once I reverted to my usual OverSubscribe=FORCE:1, it behaved as expected.
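
For reference, that is just the OverSubscribe option on the partition line in 
slurm.conf, e.g. something like the following (adapted from the partition line 
you posted below, so treat it as a sketch rather than a drop-in value):

  PartitionName=partition Nodes=ALL Default=YES MaxTime=5-00:00:00 State=UP OverSubscribe=FORCE:1 DefCpuPerGPU=16 DefMemPerGPU=128985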

You may also want to make sure there isn't a job_submit plugin intercepting or 
rewriting the GRES requests.
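A quick way to check is something like:

  scontrol show config | grep -i jobsubmit
  grep -i JobSubmitPlugins /etc/slurm/slurm.conf

If that reports a lua plugin, the script is usually a job_submit.lua sitting 
next to your slurm.conf (the path may differ on your install).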

> On Jul 4, 2024, at 12:09 PM, Brian Andrus via slurm-users 
> <slurm-users@lists.schedmd.com> wrote:
> 
> Just a thought.
> 
> Try specifying some memory. It looks like the running jobs do that, and by 
> default, if memory is not specified, the request is "all the memory on the 
> node", so the job can't start because some of it is already taken.
> 
> Brian Andrus
> 
> On 7/4/2024 9:54 AM, Ricardo Cruz wrote:
>> Dear Brian,
>> 
>> Currently, we have 5 GPUs available (out of 8).
>> 
>> rpcruz@atlas:~$ /usr/bin/srun --gres=shard:2 ls
>> srun: job 515 queued and waiting for resources
>> 
>> The job shows as PD in squeue.
>> scontrol says that 5 GPUs are allocated out of 8...
>> 
>> rpcruz@atlas:~$ scontrol show node compute01
>> NodeName=compute01 Arch=x86_64 CoresPerSocket=32 
>>    CPUAlloc=80 CPUEfctv=128 CPUTot=128 CPULoad=65.38
>>    AvailableFeatures=(null)
>>    ActiveFeatures=(null)
>>    Gres=gpu:8,shard:32
>>    NodeAddr=compute01 NodeHostName=compute01 Version=23.11.4
>>    OS=Linux 6.8.0-36-generic #36-Ubuntu SMP PREEMPT_DYNAMIC Mon Jun 10 
>> 10:49:14 UTC 2024 
>>    RealMemory=1031887 AllocMem=644925 FreeMem=701146 Sockets=2 Boards=1
>>    State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
>>    Partitions=partition 
>>    BootTime=2024-07-02T14:08:37 SlurmdStartTime=2024-07-02T14:08:51
>>    LastBusyTime=2024-07-03T12:02:11 ResumeAfterTime=None
>>    CfgTRES=cpu=128,mem=1031887M,billing=128,gres/gpu=8
>>    AllocTRES=cpu=80,mem=644925M,gres/gpu=5
>>    CapWatts=n/a
>>    CurrentWatts=0 AveWatts=0
>>    ExtSensorsJoules=n/a ExtSensorsWatts=0 ExtSensorsTemp=n/a
>> 
>> rpcruz@atlas:~$ sinfo
>> PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> partition*    up 5-00:00:00      1    mix compute01
>> 
>> 
>> The output is the same regardless of whether "srun --gres=shard:2" is 
>> pending or not.
>> I wonder if the problem is that CfgTRES is not showing gres/shard ... it 
>> sounds like it should, right?
>> 
>> The complete last part of my /etc/slurm/slurm.conf (which is of course the 
>> same on the login and compute nodes):
>> 
>> # COMPUTE NODES
>> GresTypes=gpu,shard
>> NodeName=compute01 Gres=gpu:8,shard:32 CPUs=128 RealMemory=1031887 Sockets=2 
>> CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
>> PartitionName=partition Nodes=ALL Default=YES MaxTime=5-00:00:00 State=UP 
>> DefCpuPerGPU=16 DefMemPerGPU=128985
>> 
>> And on the compute node, /etc/slurm/gres.conf is:
>> Name=gpu File=/dev/nvidia[0-7]
>> Name=shard Count=32
>> 
>> 
>> Thank you!
>> --
>> Ricardo Cruz - https://rpmcruz.github.io
>> 
>> Brian Andrus via slurm-users <slurm-users@lists.schedmd.com> wrote 
>> (Thursday, 4 Jul 2024 at 17:16):
>>> To help dig into it, can you paste the full output of scontrol show node 
>>> compute01 while the job is pending? Also 'sinfo' would be good.
>>> 
>>> It is basically telling you there aren't enough resources in the partition 
>>> to run the job. Often this is because all the nodes are in use at that 
>>> moment.
>>> 
>>> Brian Andrus
>>> 
>>> On 7/4/2024 8:43 AM, Ricardo Cruz via slurm-users wrote:
>>>> Greetings,
>>>> 
>>>> There are not many questions regarding GPU sharding here, and I am unsure 
>>>> if I am using it correctly... I have configured it according to the 
>>>> instructions <https://slurm.schedmd.com/gres.html>, and it seems to be 
>>>> configured properly:
>>>> 
>>>> $ scontrol show node compute01
>>>> NodeName=compute01 Arch=x86_64 CoresPerSocket=32
>>>>    CPUAlloc=48 CPUEfctv=128 CPUTot=128 CPULoad=10.95
>>>>    AvailableFeatures=(null)
>>>>    ActiveFeatures=(null)
>>>>    Gres=gpu:8,shard:32
>>>>    [truncated]
>>>> 
>>>> When running with gres:gpu everything works perfectly:
>>>> 
>>>> $ /usr/bin/srun --gres=gpu:2 ls
>>>> srun: job 192 queued and waiting for resources
>>>> srun: job 192 has been allocated resources
>>>> (...)
>>>> 
>>>> However, when using sharding, it just stays waiting indefinitely:
>>>> 
>>>> $ /usr/bin/srun --gres=shard:2 ls
>>>> srun: job 193 queued and waiting for resources
>>>> 
>>>> The reason it gives for pending is just "Resources":
>>>> 
>>>> $ scontrol show job 193
>>>> JobId=193 JobName=ls
>>>>    UserId=rpcruz(1000) GroupId=rpcruz(1000) MCS_label=N/A
>>>>    Priority=1 Nice=0 Account=account QOS=normal
>>>>    JobState=PENDING Reason=Resources Dependency=(null)
>>>>    Requeue=1 Restarts=0 BatchFlag=0 Reboot=0 ExitCode=0:0
>>>>    RunTime=00:00:00 TimeLimit=2-00:00:00 TimeMin=N/A
>>>>    SubmitTime=2024-06-28T05:36:51 EligibleTime=2024-06-28T05:36:51
>>>>    AccrueTime=2024-06-28T05:36:51
>>>>    StartTime=2024-06-29T18:13:22 EndTime=2024-07-01T18:13:22 Deadline=N/A
>>>>    SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-06-28T05:37:20 
>>>> Scheduler=Backfill:*
>>>>    Partition=partition AllocNode:Sid=localhost:47757
>>>>    ReqNodeList=(null) ExcNodeList=(null)
>>>>    NodeList=
>>>>    NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
>>>>    ReqTRES=cpu=1,mem=1031887M,node=1,billing=1
>>>>    AllocTRES=(null)
>>>>    Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
>>>>    MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
>>>>    Features=(null) DelayBoot=00:00:00
>>>>    OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
>>>>    Command=ls
>>>>    WorkDir=/home/rpcruz
>>>>    Power=
>>>>    TresPerNode=gres/shard:2
>>>> 
>>>> Again, I think I have configured it properly - it shows up correctly in 
>>>> scontrol (as shown above).
>>>> Our setup is pretty simple - I just added shard to /etc/slurm/slurm.conf:
>>>> GresTypes=gpu,shard
>>>> NodeName=compute01 Gres=gpu:8,shard:32 [truncated]
>>>> Our /etc/slurm/gres.conf is also straightforward (it works fine for 
>>>> --gres=gpu:1):
>>>> Name=gpu File=/dev/nvidia[0-7]
>>>> Name=shard Count=32
>>>> 
>>>> 
>>>> Maybe I am just running srun improperly? Shouldn't it just be srun 
>>>> --gres=shard:2 to allocate half of a GPU? (Since I am using 32 shards for 
>>>> the 8 GPUs, that is 4 shards per GPU.)
>>>> 
>>>> Thank you very much for your attention,
>>>> --
>>>> Ricardo Cruz - https://rpmcruz.github.io
>>> 
> 

Reed

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
