Hello all,

I am not sure if I've stumbled upon a bug (Slurm 14.03.6) or if this is
the intended behavior of an account tied to an association with a
DefQOS.

Basically, I've configured our cluster to use associations with a
DefQOS other than "normal" for specific accounts (priority,
preemption, etc.).  When I add a new user via sacctmgr add user... or
via sacctmgr load file=... and that new user submits a job, the
default QOS I've configured is not respected.  If I restart
slurmd/slurmctld (service slurm restart) across the cluster and submit
a job again, the default QOS is respected.
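
For context, the account-level defaults were configured roughly along
these lines (QOS options shown here are illustrative, not the exact
values we use):

# create the non-default QOS (priority/preemption values illustrative)
sacctmgr add qos faculty Priority=10 Preempt=normal

# allow it on the account and make it the account's default
sacctmgr modify account where name=faculty cluster=c_slurm set QOS+=faculty DefaultQOS=faculty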

Here is the configuration I've set (sacctmgr output condensed to the
populated columns):

sacctmgr show association where account=faculty
   Cluster    Account        User     Share  GrpCPUs   Def QOS                  QOS
---------- ---------- ----------- --------- -------- --------- --------------------
   c_slurm    faculty                     1      512   faculty elevated,faculty,no+
   c_slurm    faculty davidrogers    parent      512   faculty elevated,faculty,no+
   c_slurm    faculty         hlw    parent      512   faculty elevated,faculty,no+
   c_slurm    faculty      sathan    parent      512   faculty elevated,faculty,no+

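For readability, the relevant columns can also be pulled with an
explicit format list (the %40 width keeps the QOS list from being
truncated to "no+"):

# limit the output to the fields of interest and widen the QOS column
sacctmgr show assoc where account=faculty \
    format=cluster,account,user,defaultqos,qos%40
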
Here is the user I've added:

sacctmgr add user pshudson defaultaccount=faculty fairshare=parent
 Adding User(s)
  pshudson
 Settings =
  Default Account = faculty
 Associations =
  U = pshudson  A = faculty    C = c_slur
 Non Default Settings
  Fairshare     = 2147483647
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
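
(The "Fairshare = 2147483647" line appears to be just how sacctmgr
echoes fairshare=parent; the association itself shows "parent" below.)
For scripted additions, e.g. when feeding users in through sacctmgr
load file=..., the commit prompt can be skipped with the immediate
flag:

# -i / --immediate applies the change without the interactive commit prompt
sacctmgr -i add user pshudson defaultaccount=faculty fairshare=parent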

Double-check the new entry:

sacctmgr show association where account=faculty
   Cluster    Account        User     Share  GrpCPUs   Def QOS                  QOS
---------- ---------- ----------- --------- -------- --------- --------------------
   c_slurm    faculty                     1      512   faculty elevated,faculty,no+
   c_slurm    faculty davidrogers    parent      512   faculty elevated,faculty,no+
   c_slurm    faculty         hlw    parent      512   faculty elevated,faculty,no+
   c_slurm    faculty    pshudson    parent      512   faculty elevated,faculty,no+
   c_slurm    faculty      sathan    parent      512   faculty elevated,faculty,no+

Here is the user immediately submitting a job:

[pshudson@host ~]$ salloc -N 4 -n 4 -t 01:00:00
salloc: Granted job allocation 24150

Here is the squeue output:

JOBID    ST  PRIO QOS   PARTI NAME     USER            ACCOUNT SUBMIT_TIME          START_TIME           TIME       TIMELIMIT EXEC_HOST            CPUS NODES MIN_M NODELIST(REASON)
24150    R   0.00 norma satur bash     pshudson        faculty 2014-10-15T14:40:46  2014-10-15T14:40:46  0:03       1:00:00   rcslurm              4    4     0     rack-5-[16-19]

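For reference, squeue truncates the QOS column above; the full QOS
name can be printed with an explicit output format, e.g.:

# %q prints the job's full QOS name
squeue -j 24150 -o "%.8i %.10u %.10a %q"
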
As you can see, the "QOS" is "normal".  I go ahead and restart
slurmd/slurmctld across the cluster and resubmit:

[pshudson@host ~]$ salloc -N 4 -n 4 -t 01:00:00
salloc: Granted job allocation 24153

Here is the squeue output:

JOBID    ST  PRIO QOS   PARTI NAME     USER            ACCOUNT SUBMIT_TIME          START_TIME           TIME       TIMELIMIT EXEC_HOST            CPUS NODES MIN_M NODELIST(REASON)
24153    R   0.00 facul satur bash     pshudson        faculty 2014-10-15T14:42:23  2014-10-15T14:42:23  0:03       1:00:00   rcslurm              4    4     0     rack-5-[16-19]

The default QOS ("faculty") is now respected.
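
In the meantime, a new user can presumably work around this by
requesting the QOS explicitly at submission time, since it is already
in the association's allowed QOS list:

# request the account's QOS explicitly instead of relying on the default
salloc --qos=faculty -N 4 -n 4 -t 01:00:00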

Is a restart of the slurmd/slurmctld daemons necessary and simply
undocumented, or is this a potential bug?

Thank you,
John DeSantis
