We have a node with 8 H100 GPUs that are split into MIG instances, and we are 
using cgroups. This seems to work fine. Users can do something like

sbatch --gres="gpu:1g.10gb:1"...

and the job starts on the GPU node. Both CUDA_VISIBLE_DEVICES and the PyTorch 
debug output show that the cgroup only gives them the GPU they asked for.

In the accounting database, however, the "gres_used" column in the job table is 
always empty. I'd expect to see "gpu:1g.10gb:1" there for the job above.

I have this set in slurm.conf:

AccountingStorageTRES=gres/gpu

How can I see which GRES was requested with the job? At the moment I only see 
something like this in AllocTres:

billing=1,cpu=1,gres/gpu=1,mem=8G,node=1

and I can't find any way to see which specific MIG profile was requested. This 
is related to the email from Richard Lefebvre dated 7th June 2023 entitled 
"Billing/accounting for MIGs is not working". As far as I can see, it got no 
replies.
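
To make the gap concrete, here is a minimal Python sketch that parses the 
AllocTres string quoted above and looks for a typed GPU entry. The helper names 
parse_tres and typed_gpu_entries are hypothetical and not part of Slurm. The 
"gres/gpu:<type>" key format is an assumption about how a typed entry would 
appear if it were recorded.

```python
def parse_tres(tres: str) -> dict[str, str]:
    """Split a TRES string such as 'cpu=1,gres/gpu=1' into a name -> value dict."""
    result: dict[str, str] = {}
    for item in tres.split(","):
        if not item:
            continue  # tolerate empty strings and trailing commas
        name, _, value = item.partition("=")
        result[name] = value
    return result


def typed_gpu_entries(tres: dict[str, str]) -> dict[str, str]:
    """Keep only entries of the assumed form 'gres/gpu:<type>'."""
    return {k: v for k, v in tres.items() if k.startswith("gres/gpu:")}


# The AllocTres value from the job above: only the generic GPU count is present.
alloc = parse_tres("billing=1,cpu=1,gres/gpu=1,mem=8G,node=1")
print(alloc["gres/gpu"])
print(typed_gpu_entries(alloc))
```

With the string from our accounting records, the second call returns an empty 
dict, which is exactly the problem: the MIG profile is not recoverable from 
AllocTres.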

We are running Slurm version 23.11.6.

Regards,

Emyr James
Head of Scientific IT
CRG - Centre for Genomic Regulation

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
