If you do scontrol -d show node, it will show in more detail what resources
are actually being used (compare the plain scontrol show node output with
the -d version below):
[root@holy8a24507 general]# scontrol show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101 Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442 Owner=N/A
MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56 SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
[root@holy8a24507 general]# scontrol -d show node holygpu8a11101
NodeName=holygpu8a11101 Arch=x86_64 CoresPerSocket=48
CPUAlloc=70 CPUEfctv=96 CPUTot=96 CPULoad=173.07
AvailableFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
ActiveFeatures=amd,holyndr,genoa,avx,avx2,avx512,gpu,h100,cc9.0
Gres=gpu:nvidia_h100_80gb_hbm3:4(S:0-15)
GresDrain=N/A
GresUsed=gpu:nvidia_h100_80gb_hbm3:4(IDX:0-3)
NodeAddr=holygpu8a11101 NodeHostName=holygpu8a11101 Version=24.11.2
OS=Linux 4.18.0-513.18.1.el8_9.x86_64 #1 SMP Wed Feb 21 21:34:36 UTC 2024
RealMemory=1547208 AllocMem=896000 FreeMem=330095 Sockets=2 Boards=1
MemSpecLimit=16384
State=MIXED ThreadsPerCore=1 TmpDisk=863490 Weight=1442 Owner=N/A
MCS_label=N/A
Partitions=kempner_requeue,kempner_dev,kempner_h100,kempner_h100_priority,gpu_requeue,serial_requeue
BootTime=2024-10-23T13:10:56 SlurmdStartTime=2025-03-24T14:51:01
LastBusyTime=2025-03-30T15:55:51 ResumeAfterTime=None
CfgTRES=cpu=96,mem=1547208M,billing=2302,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
AllocTRES=cpu=70,mem=875G,gres/gpu=4,gres/gpu:nvidia_h100_80gb_hbm3=4
CurrentWatts=0 AveWatts=0
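If you just want the allocated GPU indices for every node at once, a quick
and dirty sketch (not from the output above) is to filter the detailed dump
down to the NodeName and GresUsed lines:

# list allocated GPU indices per node
scontrol -d show node | grep -E '(NodeName|GresUsed)='

On reasonably recent Slurm versions sinfo can give a similar one-line-per-node
view, assuming your build supports the GresUsed format field:

sinfo --Format=NodeHost:20,Gres:60,GresUsed:60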
Now, it won't give you the individual performance of the GPUs; Slurm doesn't
currently track that in a convenient way like it does CPULoad. It will at
least give you what has been allocated on the node. We take the non-detailed
dump (which shows how many GPUs are allocated, but not which ones) and feed
it into Grafana via Prometheus to get general cluster stats:
https://github.com/fasrc/prometheus-slurm-exporter
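For what it's worth, once the exporter is up you can sanity-check its GPU
gauges straight from the metrics endpoint. Rough sketch only: the :8080
listen address and the slurm_gpus_* metric names are the defaults I'd expect
from that exporter family, so check your build for the exact port and names:

# spot-check allocated vs. total GPUs as seen by the exporter
curl -s http://localhost:8080/metrics | grep -i '^slurm_gpus'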
If you are looking for performance stats, NVIDIA has a DCGM exporter that we
use to pull them and dump them into Grafana:
https://github.com/NVIDIA/dcgm-exporter
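If you want to eyeball what DCGM is reporting before wiring it into Grafana,
the exporter serves Prometheus-format metrics over HTTP. A minimal sketch,
assuming the exporter's default port of 9400, with the container tag left as
a placeholder for whatever release you deploy:

# run the exporter on a GPU node, then spot-check per-GPU utilization
docker run -d --rm --gpus all -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL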
On a per-job basis I know people use Weights & Biases, but that is code
specific: https://wandb.ai/site/ You can also use scontrol -d show job to
print out the layout of a job, including which specific GPUs were assigned.
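For example (a sketch with a placeholder job ID; the exact GRES string
depends on how the GPUs are named in gres.conf):

# the -d output includes per-node lines like
#   Nodes=holygpu8a11101 CPU_IDs=0-15 Mem=... GRES=gpu:nvidia_h100_80gb_hbm3:2(IDX:0-1)
scontrol -d show job <jobid> | grep -E 'Nodes=|GRES'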
-Paul Edmon-
On 4/2/25 9:17 AM, Jason Simms via slurm-users wrote:
Hello all,
Apologies for the basic question, but is there a straightforward,
best-accepted method for using Slurm to report on which GPUs are
currently in use? I've done some searching and people recommend all
sorts of methods, including parsing the output of nvidia-smi (seems
inefficient, especially across multiple GPU nodes), as well as using
other tools such as Grafana, XDMoD, etc.
We do track GPUs as a resource, so I'd expect I could get at the info
with sreport or something like that, but before trying to craft my own
from scratch, I'm hoping someone has something working already.
Ultimately I'd like to see either which cards are available by node,
or the reverse (which are in use by node). I know recent versions of
Slurm supposedly added tighter integration in some way with NVIDIA
cards, but I can't seem to find definitive docs on what, exactly,
changed or what is now possible as a result.
Warmest regards,
Jason
--
*Jason L. Simms, Ph.D., M.P.H.*
Research Computing Manager
Swarthmore College
Information Technology Services
(610) 328-8102