Hello,

We recently started monitoring GPU usage on our cluster with NVIDIA's DCGM: https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/

We create a new dcgmi group for each job and start statistics recording for it in a prolog script.
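
In case it helps, a sketch of such a prolog. The group name and the handoff file are our own conventions; whether CUDA_VISIBLE_DEVICES is populated at prolog time depends on your Slurm configuration, and the group-ID parsing follows the blog post above (it is fragile and version-dependent):

  #!/bin/bash
  # Prolog sketch: create a per-job dcgmi group and start recording.
  # Assumes CUDA_VISIBLE_DEVICES (or SLURM_JOB_GPUS, depending on your
  # setup) holds the job's GPU IDs and dcgmi has sufficient privileges.
  created=$(dcgmi group -c "job_${SLURM_JOB_ID}" -a "${CUDA_VISIBLE_DEVICES}")
  # dcgmi prints "... with a group ID of N"; field 10 is the ID.
  groupid=$(echo "$created" | awk '{print $10}')
  mkdir -p /run/dcgm
  echo "$groupid" > "/run/dcgm/job_${SLURM_JOB_ID}.group"  # hand off to the epilog
  dcgmi stats -g "$groupid" -e                    # enable stats watches for the group
  dcgmi stats -g "$groupid" -s "$SLURM_JOB_ID"    # start recording under this job ID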

Then, in an epilog script, we stop the recording, save the verbose dcgmi stats output, and delete the dcgmi group.
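
The matching epilog, again only as a sketch (the output directory and the handoff file are our own choices, not anything dcgmi mandates):

  #!/bin/bash
  # Epilog sketch: stop recording, dump the verbose report, clean up.
  OUTDIR=/var/log/dcgm-job-stats          # example location; pick your own
  mkdir -p "$OUTDIR"
  groupid=$(cat "/run/dcgm/job_${SLURM_JOB_ID}.group")
  dcgmi stats -x "$SLURM_JOB_ID"          # stop recording job statistics
  dcgmi stats -v -j "$SLURM_JOB_ID" > "${OUTDIR}/job_${SLURM_JOB_ID}.txt"
  dcgmi group -d "$groupid"               # delete the per-job group
  rm -f "/run/dcgm/job_${SLURM_JOB_ID}.group"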

The output includes the job ID, GPU IDs, runtime, energy consumed, and SM utilization, among other fields.

We load the relevant data into a database and hope that analyzing it will let us advise our users on better practices.
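
The ingestion itself is just ordinary text parsing. A minimal sketch, assuming the reports were saved as job_<id>.txt by the epilog above, that the "Energy Consumed" label and the "| Field | Value |" table layout match your dcgmi version's verbose output, and with sqlite3 and gpu_jobs.db standing in for whatever database you actually use:

  #!/bin/bash
  # Ingest sketch: pull one field out of a saved report and insert it.
  report="$1"
  jobid="${report##*job_}"; jobid="${jobid%.txt}"   # recover job ID from filename
  # Take the value column of the matching table row (assumed layout).
  energy=$(awk -F'|' '/Energy Consumed/ {gsub(/ /, "", $3); print $3; exit}' "$report")
  sqlite3 gpu_jobs.db "CREATE TABLE IF NOT EXISTS job_stats (jobid TEXT, energy_j REAL);
                       INSERT INTO job_stats VALUES ('$jobid', ${energy:-NULL});"

In practice we extract more fields per GPU the same way; the awk pattern is the only part that has to change per field.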

Best wishes,
Pierre-Antoine Schnell

On 16.10.24 at 15:10, Sylvain MARET via slurm-users wrote:
Hey guys!

I'm looking to improve GPU monitoring on our cluster. I want to install this https://github.com/NVIDIA/dcgm-exporter and saw in the README that it can support tracking of job IDs: https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter

However, I haven't been able to find any examples of how to do it, nor does Slurm seem to expose this information by default. Does anyone here do this? And if so, do you have any examples I could follow? If you have advice on best practices for GPU monitoring, I'd be happy to hear it!

Regards,
Sylvain Maret



--
Pierre-Antoine Schnell

Medizinische Universität Wien
IT-Dienste & Strategisches Informationsmanagement
Enterprise Technology & Infrastructure
High Performance Computing

1090 Wien, Spitalgasse 23
Bauteil 88, Ebene 00, Büro 611

+43 1 40160-21304

pierre-antoine.schn...@meduniwien.ac.at

--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
