I would try specifying cpus and mem just to be sure it's not requesting 0/all.
Also, I ran into a weird issue in my lab cluster when I had oversubscribe=yes:2
set while playing with shards: jobs would go pending on Resources even though
no gpu/shard gres were actually allocated.
Once I reverted that setting, the problem went away.
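For example, something along these lines (the numbers are just placeholders, adjust to your node):

srun --gres=shard:2 --cpus-per-task=1 --mem=4G ls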
Hi Arnuld,
On 5/07/2024 13:56, Arnuld via slurm-users wrote:
> It should show up like this:
>
> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
>
> What's the meaning of (S:0-1) here?
The sockets to which the GPUs are associated:
If GRES are associated with specific sockets, that information will be reported.
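(Purely as an illustration of where that association comes from, not how this node is actually set up: the socket/core binding comes from the Cores= mapping in gres.conf, which autodetection normally fills in for you. Written by hand it would look roughly like this, with made-up device paths and core ranges:

Name=gpu Type=gtx_1080_ti File=/dev/nvidia[0-1] Cores=0-7
Name=gpu Type=gtx_1080_ti File=/dev/nvidia[2-3] Cores=8-15

With cores 0-7 on socket 0 and 8-15 on socket 1, the four GPUs together report as S:0-1.)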
> On Fri, Jul 5, 2024 at 12:19 PM Ward Poelmans
> via slurm-users wrote:
> Hi Ricardo,
>
> It should show up like this:
>
> Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
>
What's the meaning of (S:0-1) here?
Hi Ricardo,
It should show up like this:
Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
CfgTRES=cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16
AllocTRES=cpu=8,mem=31200M,gres/shard=1
I can't directly spot any error, however. Our gres.conf is simply
`AutoDetect=nvml`.
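(For comparison, a minimal shard setup along the lines of the sharding docs would be roughly the following; the counts are only illustrative, not your actual config:

gres.conf:  AutoDetect=nvml
            Name=shard Count=16
slurm.conf: GresTypes=gpu,shard
            NodeName=... Gres=gpu:gtx_1080_ti:4,shard:16

i.e. 16 shards spread evenly over the 4 GPUs.)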
Just a thought.
Try specifying some memory. It looks like the running jobs do that, and the
default when none is specified is "all the memory on the node", so your job
can't start while some of that memory is already taken.
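One way is to just put --mem on the srun line; another is a default in slurm.conf so jobs that omit it don't claim the whole node, something like (the value is only an example):

DefMemPerCPU=4000

which can also be set per partition.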
Brian Andrus
On 7/4/2024 9:54 AM, Ricardo Cruz wrote:
Dear Brian,
Currently, we have 5 GPUs available (out of 8).
rpcruz@atlas:~$ /usr/bin/srun --gres=shard:2 ls
srun: job 515 queued and waiting for resources
The job shows as PD in squeue.
scontrol says that 5 GPUs are allocated out of 8...
rpcruz@atlas:~$ scontrol show node compute01
NodeName=compute01
To help dig into it, can you paste the full output of scontrol show node
compute01 while the job is pending? Also 'sinfo' would be good.
It is basically telling you there aren't enough resources in the
partition to run the job. Often this is because all the nodes are in use
at that moment.
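Something like this should capture most of it (the format strings are only suggestions):

scontrol show node compute01
sinfo -N -o "%N %T %C %G %E"
squeue -t PD -o "%i %T %R"

%C is allocated/idle/other/total CPUs, %G the configured gres, %E any node reason, and %R the reason the pending job reports.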
Brian Andrus