[slurm-users] Re: Using sharding

2024-07-05 Thread Reed Dier via slurm-users
I would try specifying CPUs and memory explicitly, just to be sure the job isn't requesting zero or all of them. Also, I ran into a weird issue in my lab cluster when I had OverSubscribe=YES:2 while playing with shards: jobs would go pending on resources despite there being no allocation of gpu/shards. Once I reverted
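A minimal sketch of such an explicit request (the CPU and memory figures are arbitrary illustrations, not values from this thread):

    srun --gres=shard:2 --cpus-per-task=2 --mem=4G ls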

[slurm-users] Re: Using sharding

2024-07-05 Thread Ward Poelmans via slurm-users
Hi Arnuld, On 5/07/2024 13:56, Arnuld via slurm-users wrote: > It should show up like this: > Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1) > What's the meaning of (S:0-1) here? The sockets to which the GPUs are associated: if GRES are associated with specific sockets, that information will be reported.
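For comparison (an illustrative example along the lines of the Slurm documentation, not output from this cluster), a node whose four GPUs all hang off socket 0 would instead report:

    Gres=gpu:4(S:0)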

[slurm-users] Re: Using sharding

2024-07-05 Thread Arnuld via slurm-users
On Fri, Jul 5, 2024 at 12:19 PM Ward Poelmans via slurm-users wrote: > Hi Ricardo, > > It should show up like this: > > Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1) What's the meaning of (S:0-1) here?

[slurm-users] Re: Using sharding

2024-07-04 Thread Ward Poelmans via slurm-users
Hi Ricardo, It should show up like this:

    Gres=gpu:gtx_1080_ti:4(S:0-1),shard:gtx_1080_ti:16(S:0-1)
    CfgTRES=cpu=32,mem=515000M,billing=130,gres/gpu=4,gres/shard=16
    AllocTRES=cpu=8,mem=31200M,gres/shard=1

I can't directly spot any error, however. Our gres.conf is simply `AutoDetect=nvml`
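For context, a minimal sketch of how such a shard setup is typically declared (the node name, counts and memory below are assumptions for illustration, not the poster's actual configuration):

    # gres.conf
    AutoDetect=nvml

    # slurm.conf
    GresTypes=gpu,shard
    NodeName=compute01 Gres=gpu:gtx_1080_ti:4,shard:16 CPUs=32 RealMemory=515000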

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
Just a thought: try specifying some memory. It looks like the running jobs do that, and by default, if memory is not specified, the request is "all the memory on the node", so the job can't start because some of it is already taken. Brian Andrus On 7/4/2024 9:54 AM, Ricardo Cruz wrote: > Dear Brian, Currently, we have 5 GPUs
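A quick way to test this theory (the memory figure is just an illustrative value; whether an unspecified request really becomes the whole node depends on the cluster's DefMemPerNode/DefMemPerCPU settings):

    srun --gres=shard:2 --mem=4G ls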

[slurm-users] Re: Using sharding

2024-07-04 Thread Ricardo Cruz via slurm-users
Dear Brian, Currently, we have 5 GPUs available (out of 8).

    rpcruz@atlas:~$ /usr/bin/srun --gres=shard:2 ls
    srun: job 515 queued and waiting for resources

The job shows as PD in squeue. scontrol says that 5 GPUs are allocated out of 8...

    rpcruz@atlas:~$ scontrol show node compute01
    NodeName=compute01
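One way to narrow this down (job ID 515 and node compute01 are taken from the output above; the squeue format string is just a suggestion) is to compare configured versus allocated TRES and print the job's pending reason:

    scontrol show node compute01 | grep -E 'CfgTRES|AllocTRES'
    squeue -j 515 -o '%i %T %r'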

[slurm-users] Re: Using sharding

2024-07-04 Thread Brian Andrus via slurm-users
To help dig into it, can you paste the full output of scontrol show node compute01 while the job is pending? Also 'sinfo' would be good. It is basically telling you there aren't enough resources in the partition to run the job. Often this is because all the nodes are in use at that moment. Brian
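The commands being requested, plus an optional sinfo format that also surfaces per-node state, CPU usage, memory and GRES (the format string is a suggestion, not something from the thread):

    scontrol show node compute01
    sinfo
    sinfo -N -o '%N %T %C %m %G'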