If you do scontrol -d show node, it will show in more detail what resources
are actually being used (compare the plain scontrol show node output with
the -d version below):
[root@holy8a24507 general]# scontrol show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101 Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442 Owner=N/A
MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56 SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
[root@holy8a24507 general]# scontrol -d show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
GresDrain=N/A
GresUsed=gpu:nvidia_h100_80gb_hbm3:4(IDX:0-3)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101 Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442 Owner=N/A
MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56 SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
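If you just want the allocated GPU indices for every node at once, a quick
and dirty sketch (not from the output above) is to filter the detailed dump
down to the NodeName and GresUsed lines:

# list allocated GPU indices per node
scontrol -d show node | grep -E '(NodeName|GresUsed)='

On reasonably recent Slurm versions sinfo can give a similar one-line-per-node
view, assuming your build supports the GresUsed format field:

sinfo --Format=NodeHost:20,Gres:60,GresUsed:60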
Now, it won't give you the individual performance of the GPUs; Slurm doesn't
currently track that in a convenient way like it does CPULoad. It will at
least give you what has been allocated on the node. We take the non-detailed
dump (which shows how many GPUs are allocated, but not which ones) and feed
it into Grafana via Prometheus to get general cluster stats:
https://github.com/fasrc/prometheus-slurm-exporter
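For what it's worth, once the exporter is up you can sanity-check its GPU
gauges straight from the metrics endpoint. Rough sketch only: the :8080
listen address and the slurm_gpus_* metric names are the defaults I'd expect
from that exporter family, so check your build for the exact port and names:

# spot-check allocated vs. total GPUs as seen by the exporter
curl -s http://localhost:8080/metrics | grep -i '^slurm_gpus'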
If you are looking for performance stats, NVIDIA has a DCGM exporter that we
use to pull them and dump them into Grafana:
https://github.com/NVIDIA/dcgm-exporter
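If you want to eyeball what DCGM is reporting before wiring it into Grafana,
the exporter serves Prometheus-format metrics over HTTP. A minimal sketch,
assuming the exporter's default port of 9400, with the container tag left as
a placeholder for whatever release you deploy:

# run the exporter on a GPU node, then spot-check per-GPU utilization
docker run -d --rm --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL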
On a per-job basis I know people use Weights & Biases, but that is code
specific: https://wandb.ai/site/ You can also use scontrol -d show job to
print out the layout of a job, including which specific GPUs were assigned.
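For example (a sketch with a placeholder job ID; the exact GRES string
depends on how the GPUs are named in gres.conf):

# the -d output includes per-node lines like
#   Nodes=holygpu8a11101 CPU_IDs=0-15 Mem=... GRES=gpu:nvidia_h100_80gb_hbm3:2(IDX:0-1)
scontrol -d show job <jobid> | grep -E 'Nodes=|GRES'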
-Paul Edmon-
On 4/2/25 9:17 AM, Jason Simms via slurm-users wrote:
Hello all,
Apologies for the basic question, but is there a straightforward,
best-accepted method for using Slurm to report on which GPUs are
currently in use? I've done some searching and people recommend all
sorts of methods, including parsing the output of nvidia-smi (seems
inefficient, especially across multiple GPU nodes), as well as using
other tools such as Grafana, XDMoD, etc.
We do track GPUs as a resource, so I'd expect I could get at the info
with sreport or something like that, but before trying to craft my own
from scratch, I'm hoping someone has something working already.
Ultimately I'd like to see either which cards are available by node,
or the reverse (which are in use by node). I know recent versions of
Slurm supposedly added tighter integration in some way with NVIDIA
cards, but I can't seem to find definitive docs on what, exactly,
changed or what is now possible as a result.
Warmest regards,
Jason
--
*Jason L. Simms, Ph.D., M.P.H.*
Research Computing Manager
Swarthmore College
Information Technology Services
(610) 328-8102