Looks like there is a step you would need to take to create the required job-mapping files:

/The DCGM-exporter can include High-Performance Computing (HPC) job information into its metric labels. To achieve this, HPC environment administrators must configure their HPC environment to generate files that map GPUs to HPC jobs./

It does go on to show the conventions/format of the files.

I imagine you could have some bits in a Prolog script that create those files as the job starts on the node, point dcgm-exporter there, and clean them up again in an Epilog.
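Something along these lines, as a rough sketch. Caveat: the mapping-file convention (one file per GPU, named after the GPU index, containing the job ID) is my recollection of the README, and the directory path is made up, so verify both against the dcgm-exporter docs before using:

```shell
#!/bin/sh
# Sketch of Slurm Prolog/Epilog helpers for dcgm-exporter job mapping.
# This is an illustration, not the documented interface -- check the
# dcgm-exporter README for the actual file naming and content rules.

write_job_map() {
    map_dir="$1"    # e.g. /var/lib/dcgm-exporter/job-maps (hypothetical path)
    mkdir -p "$map_dir"
    # SLURM_JOB_GPUS should be the comma-separated GPU indices allocated
    # to this job on this node (available in the Prolog environment for
    # GPU jobs, if I remember right).
    for gpu in $(printf '%s' "${SLURM_JOB_GPUS:-}" | tr ',' ' '); do
        printf '%s\n' "$SLURM_JOB_ID" > "$map_dir/$gpu"
    done
}

# Matching Epilog cleanup so stale mappings don't linger after the job ends:
clean_job_map() {
    map_dir="$1"
    for gpu in $(printf '%s' "${SLURM_JOB_GPUS:-}" | tr ',' ' '); do
        rm -f "$map_dir/$gpu"
    done
}
```

You'd call these from the scripts named by Prolog= and Epilog= in slurm.conf, and start dcgm-exporter pointed at the same directory.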

Brian Andrus

On 10/16/24 06:10, Sylvain MARET via slurm-users wrote:
Hey guys!

I'm looking to improve GPU monitoring on our cluster. I want to install this https://github.com/NVIDIA/dcgm-exporter and saw in the README that it can support tracking of job id : https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter

However, I haven't been able to find any examples of how to do it, nor does Slurm seem to expose this information by default. Does anyone here do this? And if so, do you have any examples I could try to follow? If you have advice on best practices for monitoring GPUs, I'd be happy to hear it!

Regards,
Sylvain Maret

-- 
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com
