Hi all,

I'm trying to get the gathering of gres/gpumem and gres/gpuutil working on Slurm 23.02.2, but with no success so far.
We have AccountingStorageTRES=cpu,mem,gres/gpu in slurm.conf, Slurm is built with NVML support, and gres.conf contains Autodetect=NVML.

gres/gpumem and gres/gpuutil now appear in the sacct TRESUsageInAve record, but with zero values:

sacct -j 6056927_51 -Pno TRESUsageInAve
cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K
cpu=00:00:00,energy=0,fs/disk=2332,gres/gpumem=0,gres/gpuutil=0,mem=44K,pages=0,vmem=44K
cpu=05:18:51,energy=0,fs/disk=708800,gres/gpumem=0,gres/gpuutil=0,mem=2565376K,pages=0,vmem=2961244K

We are using NVIDIA Tesla V100 and A100 GPUs with driver version 530.30.02. dcgm-exporter is working on the nodes.

Is there anything else needed to get it working?

Thanks in advance.

Daniel Vecerka

On 24. 05. 23 21:45, Christopher Samuel wrote:
> On 5/24/23 11:39 am, Fulton, Ben wrote:
>
>> Hi,
>
> Hi Ben,
>
>> The release notes for 23.02 say “Added usage gathering for gpu/nvml (Nvidia) and gpu/rsmi (AMD) plugins”.
>>
>> How would I go about enabling this?
>
> I can only comment on the nvidia side (as those are the GPUs we have) but for that you need Slurm built with NVML support and running with "Autodetect=NVML" in gres.conf and then that information is stored in slurmdbd as part of the TRES usage data.
>
> For example to grab a job step for a test code I ran the other day:
>
> csamuel@perlmutter:login01:~> sacct -j 9285567.0 -Pno TRESUsageInAve | tr , \\n | fgrep gpu
> gres/gpumem=493120K
> gres/gpuutil=76
>
> Hope that helps!
>
> All the best,
> Chris
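
For convenience, the filtering step from the sacct example in this thread can be wrapped in a small shell function. This is only a sketch: the function name gpu_tres is made up for illustration and is not part of Slurm, and the sample string is one of the TRESUsageInAve records quoted in this thread.

```shell
# Hypothetical helper (not a Slurm command): read a comma-separated
# TRESUsageInAve record on stdin and print only the GPU fields, one per line.
gpu_tres() {
    tr ',' '\n' | grep -F 'gres/gpu'
}

# Intended use on a cluster (needs sacct and a real job ID):
#   sacct -j <jobid> -Pno TRESUsageInAve | gpu_tres

# Self-contained demonstration with a record copied from this thread:
printf '%s\n' 'cpu=00:00:07,energy=0,fs/disk=14073059,gres/gpumem=0,gres/gpuutil=0,mem=6456K,pages=0,vmem=7052K' | gpu_tres
```

For the sample record this prints gres/gpumem=0 and gres/gpuutil=0 on separate lines, which makes it quick to check many job steps for the all-zero symptom.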