[slurm-users] Re: Using sharding

2024-07-04 Thread Ward Poelmans via slurm-users
Hi Ricardo,

It should show up like this:

    Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
    CfgTRES=cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16
    AllocTRES=cpu=8,mem=31200M,gres/shard=1

I can't directly spot any error, however. Our gres.conf is simply `AutoDetect=nvm
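As a rough aid for reading node output like Ward's sample, the sketch below parses the `CfgTRES` and `AllocTRES` strings and subtracts one from the other to show what is nominally still free. The helper names `parse_tres` and `free_tres` are illustrative and not part of Slurm, and the code assumes memory values carry the `M` suffix seen in the sample. It is plain arithmetic on the reported counters, not a model of how Slurm actually schedules shards.

```python
def parse_tres(tres: str) -> dict[str, int]:
    """Parse a TRES string such as 'cpu=32,mem=515000M,gres/shard=16'."""
    result = {}
    for item in tres.split(","):
        name, _, value = item.partition("=")
        # Assumption: memory is reported in megabytes with a trailing 'M'.
        result[name] = int(value.rstrip("M"))
    return result


def free_tres(cfg: str, alloc: str) -> dict[str, int]:
    """Return configured minus allocated for every configured resource."""
    configured = parse_tres(cfg)
    allocated = parse_tres(alloc)
    return {name: total - allocated.get(name, 0)
            for name, total in configured.items()}


# Values copied from the sample output in Ward's message.
cfg = "cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16"
alloc = "cpu=8,mem=31200M,gres/shard=1"
free = free_tres(cfg, alloc)
print(free["gres/shard"], free["mem"])
```

Resources that do not appear in `AllocTRES` (here `billing` and `gres/gpu`) are simply reported at their configured value, so the result should be read as a quick sanity check only.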

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
Just a thought: try specifying some memory. It looks like the running jobs do that, and by default, if memory is not specified, the request is "all the memory on the node", so the job can't start because some of that memory is already taken.

Brian Andrus

On 7/4/2024 9:54 AM, Ricardo Cruz wrote: Dear Brian, Currently, we have 5 GPUs
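Brian's reasoning can be sketched as a small check. This is a toy model, not Slurm's scheduler: it assumes, as Brian describes, that a job submitted without a memory request is treated as asking for the node's full configured memory. The function name `fits_in_memory` is hypothetical, and the numbers are borrowed from the sample output in Ward's message purely for illustration; they are not Ricardo's node.

```python
from typing import Optional


def fits_in_memory(total_mb: int, allocated_mb: int,
                   requested_mb: Optional[int]) -> bool:
    """Return True if the request fits in the node's remaining memory."""
    # Assumption: no explicit request means "all the memory on the node".
    effective_mb = total_mb if requested_mb is None else requested_mb
    return effective_mb <= total_mb - allocated_mb


# No memory specified: the job wants everything, but some is already taken.
no_request = fits_in_memory(515000, 31200, None)
# Explicit, modest request: fits in what is left.
small_request = fits_in_memory(515000, 31200, 4096)
print(no_request, small_request)
```

Under that assumption, adding an explicit memory request to the submission (for example via srun's `--mem` option) would let the job fit alongside the jobs already running, provided the shards themselves are available.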

[slurm-users] Re: Using sharding

2024-07-04 Thread Ricardo Cruz via slurm-users
Dear Brian,

Currently, we have 5 GPUs available (out of 8).

    rpcruz@atlas:~$ /usr/bin/srun --gres=shard:2 ls
    srun: job 515 queued and waiting for resources

The job shows as PD in squeue. scontrol says that 5 GPUs are allocated out of 8...

    rpcruz@atlas:~$ scontrol show node compute01
    NodeName=com

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
To help dig into it, can you paste the full output of scontrol show node compute01 while the job is pending? Also 'sinfo' would be good. It is basically telling you there aren't enough resources in the partition to run the job. Often this is because all the nodes are in use at that moment. B

[slurm-users] Using sharding

2024-07-04 Thread Ricardo Cruz via slurm-users
Greetings,

There are not many questions regarding GPU sharding here, and I am unsure if I am using it correctly... I have configured it according to the instructions, and it seems to be configured properly:

    $ scontrol show node compute01
    NodeName=compute01 Ar