[slurm-users] Problem with squeue reporting of GPUs in use

Venable, Richard (NIH/NHLBI) [E] Mon, 24 Feb 2020 14:09:51 -0800

I’m seeing a problem with GPU usage reporting via squeue in the 19.05.3 release.


I’ve been using a custom script to track GPUs in use, and had been relying on 
the ‘%b’ field of squeue -o formatting (which now seems to be undocumented) to 
capture usage requested via --gres option of sbatch.  Unfortunately, besides 
apparently being deprecated, ‘%b’ does not report usage requested via the new 
--gpus option.

I’ve tried several squeue -O option fields, but only ‘tres-alloc’ seems to 
consistently report GPU usage, independent of which sbatch option was used for 
the request.  The ‘tres-per-node’ field only reports usage requested by --gres, 
while ‘tres-per-job’ only reports usage requested by the --gpus option.  Also, 
the -O formatting doesn’t guarantee a space between fields, so a long job name 
or username can run into the next field and break the field parsing of the 
output.
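
One possible workaround for the run-together fields, sketched here in Python: give every -O field an explicit width (the "type:size" form) and slice each output line by column offset instead of splitting on whitespace. The field list, the widths, and the helper name split_fixed_width are illustrative choices, and the sketch assumes squeue pads or truncates each field to exactly the requested width.

```python
# Hypothetical field list; each width becomes the ":size" suffix given to squeue -O.
FIELDS = [("jobid", 10), ("username", 12), ("partition", 12), ("tres-alloc", 80)]

# Argument for "squeue -h -O <FORMAT_ARG>" (-h suppresses the header line).
FORMAT_ARG = ",".join(f"{name}:{width}" for name, width in FIELDS)


def split_fixed_width(line, widths):
    """Slice one output line into fields by column width, not by whitespace."""
    fields = []
    start = 0
    for width in widths:
        fields.append(line[start:start + width].strip())
        start += width
    return fields


# Made-up sample line: the 12-character username fills its column completely,
# so no space separates it from the partition name.
sample = (
    "12345".ljust(10)
    + "longusername".ljust(12)
    + "gpu".ljust(12)
    + "cpu=4,mem=16G,node=1,billing=4,gres/gpu=2".ljust(80)
)

widths = [width for _, width in FIELDS]
jobid, user, partition, tres = split_fixed_width(sample, widths)
```

Slicing by offset keeps the username and partition apart even though the sample line has no space between them; a whitespace split would merge them into one token.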

Our users like to know which partition has the most free GPUs, and right now my 
script is broken with respect to usage requested via the --gpus option.

If there is no other option, I can probably parse the ‘tres-alloc’ field (it 
has more info than I need), but I’m looking for alternatives, or any 
information that might indicate the ‘tres-*’ fields are more consistent in the 
newer (.4 or .5) SLURM releases.
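
A minimal sketch of the ‘tres-alloc’ parsing mentioned above, in Python. It assumes the field is a comma-separated list of name=value pairs such as "cpu=4,mem=16G,node=1,billing=4,gres/gpu=2", and that typed entries such as "gres/gpu:v100=2" may appear next to or instead of the generic "gres/gpu" entry. The function name gpus_from_tres and the sample strings are illustrative, not taken from real output.

```python
def gpus_from_tres(tres):
    """Return the GPU count found in a TRES string, or 0 if there is none."""
    generic = None  # count from a plain "gres/gpu=N" entry, if present
    typed = 0       # sum of typed entries such as "gres/gpu:v100=N"
    for item in tres.split(","):
        name, sep, value = item.strip().partition("=")
        if not sep or not value.isdigit():
            continue  # skip empty items and non-integer values like mem=16G
        if name == "gres/gpu":
            generic = int(value)
        elif name.startswith("gres/gpu:"):
            typed += int(value)
    # Prefer the generic entry so typed entries are not counted twice.
    return generic if generic is not None else typed


def gpus_in_use_by_partition(rows):
    """Sum allocated GPUs per partition from (partition, tres) pairs."""
    totals = {}
    for partition, tres in rows:
        totals[partition] = totals.get(partition, 0) + gpus_from_tres(tres)
    return totals


# Made-up rows standing in for parsed squeue output of running jobs.
rows = [
    ("gpu", "cpu=4,mem=16G,node=1,billing=4,gres/gpu=2"),
    ("gpu", "cpu=8,mem=32G,node=1,billing=8,gres/gpu=4,gres/gpu:v100=4"),
    ("norm", "cpu=2,mem=4G,node=1,billing=2"),
]
in_use = gpus_in_use_by_partition(rows)
```

Subtracting these per-partition totals from the configured GPU counts would give the free GPUs per partition; where the configured counts come from is left open here.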


BTW, sreport does a bad job of reporting GPU usage as well, in that the 
GRES/GPU total % for root in the account listing on a given cluster is always 
less than the % allocated in the utilization listing, sometimes by a substantial 
amount.  The CPU usage is almost always the same in both sreport listings.


--
Rick Venable
NIH/NHLBI/DIR/BBC
Lab. of Membrane Biophysics MSC 5690
Bldg. 12A Room 3053L
Bethesda, MD  20892-5690   U.S.A.
