We do the same thing. Our prolog has
==============
# setup DCGMI job stats
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
    if [ -d /var/slurm/gpu_stats.run ] ; then
        if pgrep -f nv-hostengine >/dev/null 2>&1 ; then
            groupstr=$(/usr/bin/dcgmi group -c J$SLURM_JOB_ID -a $CUDA_VISIBLE_DEVICES)
            groupid=$(echo $groupstr | awk '{print $10}')
            /usr/bin/dcgmi stats -e
            /usr/bin/dcgmi stats -g $groupid -s $SLURM_JOB_ID
            echo $groupid > /var/slurm/gpu_stats.run/J$SLURM_JOB_ID
        fi
    fi
fi
======================
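A note on the `awk '{print $10}'` above: it depends on the exact wording of the `dcgmi group -c` success message, in which the new group ID is the tenth whitespace-separated field. A quick sanity check of that parse (the sample output line below is an assumption modeled on typical dcgmi output, not captured from a live system -- verify it against your DCGM version):

```shell
# Extract the group ID from a dcgmi group-creation message.
# NOTE: the sample string is hypothetical -- confirm the real message
# format on your DCGM version before relying on field $10.
groupstr='Successfully created group "J12345" with a group ID of 7'
groupid=$(echo $groupstr | awk '{print $10}')
echo "$groupid"
```

If a future DCGM release rewords that message, the field position shifts and the prolog silently records garbage, so this is worth checking after upgrades.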
And our epilog has
======================
if [ -n "$CUDA_VISIBLE_DEVICES" ] ; then
    if [ -f /var/slurm/gpu_stats.run/J$SLURM_JOB_ID ] ; then
        if pgrep -f nv-hostengine >/dev/null 2>&1 ; then
            groupid=$(cat /var/slurm/gpu_stats.run/J$SLURM_JOB_ID)
            /usr/bin/dcgmi stats -v -j $SLURM_JOBID > /var/slurm/gpu_stats/$SLURM_JOBID
            if [ $? -eq 0 ] ; then
                /bin/rsync -a /var/slurm/gpu_stats/$SLURM_JOBID /cluster/batch/GPU/
                /bin/rm -rf /tmp/gpuprocess.out
                # put the data in the MySQL database with a Perl script
                /cluster/batch/ADMIN/SCRIPTS/gpuprocess.pl $SLURM_JOB_ID > /tmp/gpuprocess.out 2>&1
                if [ -s /tmp/gpuprocess.out ] ; then
                    cat /tmp/gpuprocess.out | mail -s GPU_stat_process_error al...@nmr.mgh.harvard.edu
                fi
            fi
            /usr/bin/dcgmi stats -x $SLURM_JOBID
            /usr/bin/dcgmi group -d $groupid
            /bin/rm /var/slurm/gpu_stats.run/J$SLURM_JOB_ID
        fi
    fi
fi
=======================
We also have a cron job on each node with GPUs that runs every 10 minutes,
querying dcgmi stats and writing snapshot data for each GPU to the MySQL
database.
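For what it's worth, a cron entry along those lines might look like the sketch below. The file path is hypothetical, and the DCGM field IDs used (150 = GPU temperature, 155 = power usage, 203 = GPU utilization) should be checked against `dcgmi dmon -l` on your install:

```shell
# /etc/cron.d/gpu-snapshot -- hypothetical example; adjust for your site.
# Every 10 minutes, take a one-shot dcgmi dmon sample of temperature,
# power, and GPU utilization, appending it to a log a loader can parse.
*/10 * * * * root /usr/bin/dcgmi dmon -e 150,155,203 -c 1 >> /var/log/gpu_snapshot.log 2>&1
```

A site-specific script would then parse that log (or call dcgmi directly) and do the MySQL insert.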
If you are on RHEL-based boxes, the RPM you need from the NVIDIA repos is
datacenter-gpu-manager.
On Thu, 17 Oct 2024 4:45am, Pierre-Antoine Schnell via slurm-users wrote:
Hello,
we recently started monitoring GPU usage on our cluster with NVIDIA's DCGM:
https://developer.nvidia.com/blog/job-statistics-nvidia-data-center-gpu-manager-slurm/
We create a new dcgmi group for each job and start the statistics retrieval
for it in a prolog script.
Then we stop the retrieval, save the dcgmi verbose stats output and delete
the dcgmi group in an epilog script.
The output presents JobID, GPU IDs, runtime, energy consumed, and SM
utilization, among other things.
We load the relevant data into a database and hope to analyze it so we can
advise our users on better practices.
Best wishes,
Pierre-Antoine Schnell
Am 16.10.24 um 15:10 schrieb Sylvain MARET via slurm-users:
Hey guys !
I'm looking to improve GPU monitoring on our cluster. I want to install
this
https://github.com/NVIDIA/dcgm-exporter
and saw in the README that it supports tracking of job IDs:
https://github.com/NVIDIA/dcgm-exporter?tab=readme-ov-file#enabling-hpc-job-mapping-on-dcgm-exporter
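For what it's worth, the job mapping described in that README section appears to be file-based: a Slurm prolog writes the job ID into one file per allocated GPU in a directory the exporter watches, and an epilog removes the files again. The sketch below is an assumption from that reading -- the mapping directory path and the exporter option that points at it must be verified against your dcgm-exporter version:

```shell
# Hypothetical Slurm prolog fragment for dcgm-exporter HPC job mapping.
# ASSUMPTION: one file per GPU, named after the GPU ID, containing the
# job ID; the directory path below is made up -- check the README.
MAPDIR=${MAPDIR:-/var/lib/dcgm-exporter/job-mapping}
mkdir -p "$MAPDIR"
for gpu in $(echo "$CUDA_VISIBLE_DEVICES" | tr ',' ' ') ; do
    echo "$SLURM_JOB_ID" > "$MAPDIR/$gpu"
done
```

The matching epilog would simply remove those per-GPU files so finished jobs stop being attributed.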
However, I haven't been able to find any examples of how to do it, nor does
Slurm seem to expose this information by default.
Does anyone here do this? If so, do you have any examples I could try to
follow? If you have advice on best practices for monitoring GPUs, I'd be
happy to hear it!
Regards,
Sylvain Maret
--
Pierre-Antoine Schnell
Medizinische Universität Wien
IT-Dienste & Strategisches Informationsmanagement
Enterprise Technology & Infrastructure
High Performance Computing
1090 Wien, Spitalgasse 23
Bauteil 88, Ebene 00, Büro 611
+43 1 40160-21304
pierre-antoine.schn...@meduniwien.ac.at
--
slurm-users mailing list -- slurm-users@lists.schedmd.com
To unsubscribe send an email to slurm-users-le...@lists.schedmd.com