Dear All,

I am trying to enforce a strict per-user limit with
GrpTRESMins. The effect I want is that once a user's limit
of GPU minutes is reached, no new jobs can be run.
No decay, no automatic resource replenishment. After the
limit on GPU minutes is reached, the user has to ask for
more minutes.
But despite exceeding their limits, users *can* still run new jobs.

* When I add a user to the cluster, I set:

  sacctmgr --immediate add user name=...
  ...
  QOS=2gpu2d
  GrpTRESMins=gres/gpu=20000

* In "slurm.conf" I have the following ("safe" means that
  limits and associations are automatically set). Accounting
  storage is MariaDB with SlurmDBD:

  GresTypes=gpu
  AccountingStorageTRES=gres/gpu
  AccountingStorageEnforce=qos,safe
  # This disables GPU minutes replenishing.
  PriorityDecayHalfLife=0
  PriorityUsageResetPeriod=NONE
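
To rule out a stale configuration, the values the running
controller actually uses can be compared with the file. A minimal
sketch, assuming "scontrol" is in PATH on the host; the helper
name "filter_settings" is made up for this example:

```shell
# Hypothetical helper: keep only the three settings listed above.
filter_settings() {
    grep -E '^(AccountingStorageEnforce|PriorityDecayHalfLife|PriorityUsageResetPeriod) *='
}

# "scontrol show config" prints the configuration slurmctld is running with.
# The guard skips the call on hosts without the Slurm client tools.
if command -v scontrol >/dev/null 2>&1; then
    scontrol show config | filter_settings
fi
```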

But a user's account info and usage report show that the
limits are not enforced:

   Account             User    Partition          QOS          GrpTRESMins
---------- ---------------- ------------ ------------ --------------------
  redacted         redacted        a6000       2gpu2d       gres/gpu=10000


--------------------------------------------------------------------------------
Top 1 Users 2024-01-05T00:00:00 - 2024-01-17T19:59:59 (1108800 secs)
Usage reported in TRES Minutes
--------------------------------------------------------------------------------
       Login     Used        TRES Name
------------ -------- ----------------
    redacted   184311         gres/gpu
    redacted  1558558              cpu
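
Put as plain arithmetic, the reported usage is far beyond the
limit shown for the association. A small sketch using only the
two numbers from the output above, assuming the report's figure
is comparable to what the limit counts (the variable names are
just for illustration):

```shell
limit=10000     # GrpTRESMins gres/gpu from the association listing
used=184311     # gres/gpu TRES minutes from the usage report

over=$((used - limit))      # minutes used beyond the limit
ratio=$((used / limit))     # integer division, fractional part dropped

echo "over the limit by ${over} GPU minutes (about ${ratio}x the limit)"
```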


Could someone explain where the problem might be? Am I missing
something? Apparently yes :)

Kind regards
--
Kamil Wilczek [https://keys.openpgp.org/]
[D415917E84B8DA5A60E853B6E676ED061316B69B]
