[slurm-users] Re: Jobs pending with reason "priority" but nodes are idle

2024-09-25 Thread Renfro, Michael via slurm-users
Since nobody replied after this: if the nodes are incapable of running the jobs 
due to insufficient resources, the default “EnforcePartLimits=No” [1] may be the 
issue. That setting allows a job to stay queued even if it is impossible for it 
to ever run.

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_EnforcePartLimits
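
If that turns out to be the cause, a minimal sketch of the change (in slurm.conf, 
followed by a reconfigure; with ALL, a job is rejected at submit time if it 
exceeds the limits of any requested partition):

# slurm.conf
EnforcePartLimits=ALL

scontrol reconfigure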

From: Long, Daniel S. via slurm-users 
Date: Tuesday, September 24, 2024 at 1:39 PM
To: Paul Edmon , slurm-users@lists.schedmd.com 

Subject: [slurm-users] Re: Jobs pending with reason "priority" but nodes are 
idle


The low priority jobs definitely can’t “fit in” before the high priority jobs 
would start, but I don’t think that should matter. The idle nodes are incapable 
of running the high priority jobs, ever. I would expect slurm to assign those 
nodes the highest priority jobs that they are capable of running.


From: Paul Edmon via slurm-users 
Reply-To: Paul Edmon 
Date: Tuesday, September 24, 2024 at 2:26 PM
To: "slurm-users@lists.schedmd.com" 
Subject: [slurm-users] Re: Jobs pending with reason "priority" but nodes are 
idle


You might need to do some tuning on your backfill loop, as that loop should be 
the one that backfills in those lower priority jobs. I would also check whether 
those lower priority jobs will actually fit in before the higher priority jobs 
run; they may not.
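
If the backfill loop is the issue, tuning goes through SchedulerParameters in 
slurm.conf. A sketch with purely illustrative values (the right numbers depend 
on your queue depth and job mix):

SchedulerParameters=bf_continue,bf_window=2880,bf_resolution=600,bf_max_job_test=1000,bf_max_job_user=50

bf_window should cover the longest time limit in the queue (in minutes), while 
bf_max_job_test and bf_max_job_user bound how many pending jobs the backfill 
loop will even consider.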

-Paul Edmon-
On 9/24/24 2:19 PM, Long, Daniel S. via slurm-users wrote:
I experimented a bit and think I have figured out the problem but not the 
solution.

We use multifactor priority with the job’s account as the primary factor. Right 
now one project has a much higher priority due to a deadline. Those are the jobs 
that are pending with “Resources”. They cannot run on the idle nodes because 
those nodes do not satisfy the resource requirements (they don’t have GPUs). 
What I don’t understand is why Slurm doesn’t schedule the lower priority jobs 
onto those nodes, since those jobs don’t require GPUs. This is very unexpected 
behavior to me. Is there an option somewhere I need to set?
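
One quick way to see how strongly one factor dominates the computed priorities 
is sprio (a sketch; assumes the multifactor priority plugin, with <username> as 
a placeholder):

sprio -l | head -20     # per-job breakdown of the individual priority factors
sprio -l -u <username>  # the same, restricted to one user's pending jobs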


From: "Renfro, Michael" 
Date: Tuesday, September 24, 2024 at 1:54 PM
To: Daniel Long 
, 
"slurm-us...@schedmd.com" 

Subject: Re: Jobs pending with reason "priority" but nodes are idle

In theory, if jobs are pending with “Priority”, one or more other jobs will be 
pending with “Resources”.

So a few questions:


  1.  What are the “Resources” jobs waiting on, resource-wise?
  2.  When are they scheduled to start?
  3.  Can your array jobs backfill into the idle resources and finish before 
the “Resources” jobs are scheduled to start?
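
One way to answer 1 and 2 from the command line (a sketch; the format fields 
and column widths are just one reasonable choice):

squeue -t PD -o "%.10i %.12r %.10m %.15b %.20S"  # job id, reason, min memory, tres-per-node, expected start
squeue --start                                   # scheduler's current estimated start times for pending jobs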

From: Long, Daniel S. via slurm-users 

Date: Tuesday, September 24, 2024 at 11:47 AM
To: slurm-us...@schedmd.com 

Subject: [slurm-users] Jobs pending with reason "priority" but nodes are idle


Hi,

On our cluster we have some jobs that are queued even though there are 
available nodes to run on. The listed reason is “priority” but that doesn’t 
really make sense to me. Slurm isn’t picking another job to run on those nodes; 
it’s just not running anything at all. We do have a quite heterogeneous 
cluster, but as far as I can tell the queued jobs aren’t requesting anything 
that would preclude them from running on the idle nodes. They are array jobs, 
if that makes a difference.
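
One way to check whether a pending array task is really compatible with an idle 
node is to compare what the job requests with what the node offers (a sketch; 
<jobid> and <nodename> are placeholders):

scontrol show job <jobid> | grep -E 'Partition|Features|NumNodes|MinMemory|TRES|Reason'
scontrol show node <nodename> | grep -E 'Partitions|AvailableFeatures|CfgTRES|AllocTRES|State'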

Thanks for any help you all can provide.






[slurm-users] A note on updating Slurm from 23.02 to 24.05 & multi-cluster

2024-09-25 Thread Ward Poelmans via slurm-users

Hi all,

We hit a snag when updating our clusters from Slurm 23.02 to 24.05. After 
updating the slurmdbd, our multi-cluster setup was broken until everything was 
updated to 24.05. We had not anticipated this.

SchedMD says that fixing it would be a very complex operation.

Hence this warning to everybody planning to update: make sure to update 
everything quickly once you've updated the slurmdbd daemon.

Reference: https://support.schedmd.com/show_bug.cgi?id=20931
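
For anyone planning the same upgrade, one way to see which protocol (RPC) 
version each cluster has registered with slurmdbd, and therefore which clusters 
still need updating (a sketch, assuming sacctmgr access to the dbd):

sacctmgr show cluster format=Cluster,ControlHost,RPC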



Ward





[slurm-users] Re: Max TRES per user and node

2024-09-25 Thread Carsten Beyer via slurm-users

Hi Guillaume,


as Rob already mentioned, this could maybe be a way for you (the partition 
below was just created temporarily on the fly for testing). You could also add 
MaxTRES=node=1 for more restrictions. We do something similar with QOS to 
restrict the number of CPUs per user in certain partitions.



sacctmgr create qos name=maxtrespu200G maxtrespu=mem=200G flags=denyonlimit


scontrol create partition=testtres qos=maxtrespu200g maxtime=08:00:00 
nodes=lt[1-10003] DefMemPerCPU=940 MaxMemPerCPU=940 OverSubscribe=NO



That results in:


4 jobs with 100G each:

---
[root@levantetest ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME NODES NODELIST(REASON)
   862  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   861  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   860  testtres hostname  xxx  R   0:15     1 lt1
   859  testtres hostname  xxx  R   0:22     1 lt1


6 jobs with 50G each:

---
[k202068@levantetest ~]$ squeue
 JOBID PARTITION     NAME USER ST   TIME NODES NODELIST(REASON)
   876  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   875  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   874  testtres hostname  xxx  R   9:09     1 lt1
   873  testtres hostname  xxx  R   9:15     1 lt1
   872  testtres hostname  xxx  R   9:22     1 lt1
   871  testtres hostname  xxx  R   9:26     1 lt1

Best Regards,
Carsten


--
Carsten Beyer
Abteilung Systeme

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone:  +49 40 460094-221
Fax:    +49 40 460094-270
Email:  be...@dkrz.de
URL:    http://www.dkrz.de

Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784




On 24.09.24 at 16:58, Guillaume COCHARD via slurm-users wrote:
> "So if they submit a 2^nd job, that job can start but will have to 
go onto another node, and will again be restricted to 200G?  So they 
can start as many jobs as there are nodes, and each job will be 
restricted to using 1 node and 200G of memory?"


Yes that's it. We already have MaxNodes=1 so a job can't be spread on 
multiple nodes.


To be more precise, the limit should be by user and not by job. To 
illustrate, let's imagine we have 3 empty nodes and a 200G/user/node 
limit. If a user submits 10 jobs each requesting 100G of memory, there 
should be 2 jobs running on each worker and 4 jobs pending.


Guillaume


*De: *"Groner, Rob" 
*À: *"Guillaume COCHARD" 
*Cc: *slurm-users@lists.schedmd.com
*Envoyé: *Mardi 24 Septembre 2024 16:37:34
*Objet: *Re: Max TRES per user and node

Ah, sorry, I didn't catch that from your first post (though you did 
say it).


So, you are trying to limit the user to no more than 200G of memory on 
a single node?  So if they submit a 2nd job, that job can start but 
will have to go onto another node, and will again be restricted to 
200G?  So they can start as many jobs as there are nodes, and each job 
will be restricted to using 1 node and 200G of memory? Or can they 
submit a job asking for 4 nodes, where they are limited to 200G on 
each node?  Or are they limited to a single node, no matter how many jobs?


Rob


From: Guillaume COCHARD 
Sent: Tuesday, September 24, 2024 10:09 AM
To: Groner, Rob 
Cc: slurm-users@lists.schedmd.com 
Subject: Re: Max TRES per user and node
Thank you for your answer.

To test it I tried:
sacctmgr update qos normal set maxtresperuser=cpu=2
# Then in slurm.conf
PartitionName=test […] qos=normal

But then if I submit several 1-cpu jobs only two start and the others 
stay pending, even though I have several nodes available. So it seems 
that MaxTRESPerUser is a QoS-wide limit, and doesn't limit TRES per 
user and per node but rather per user and QoS (or rather partition 
since I applied the QoS on the partition). Did I miss something?
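
One way to confirm that it is the QOS-wide cap holding the extra jobs back is 
to look at the pending reason and the stored limit (a sketch; the exact reason 
string can vary between Slurm versions, and <username> is a placeholder):

squeue -u <username> -O jobid,statecompact,reason   # pending jobs typically show a QOSMax*PerUserLimit reason
sacctmgr show qos normal format=Name,MaxTRESPU%30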


Thanks again,
Guillaume


*De: *"Groner, Rob" 
*À: *slurm-users@lists.schedmd.com, "Guillaume COCHARD" 


*Envoyé: *Mardi 24 Septembre 2024 15:45:08
*Objet: *Re: Max TRES per user and node

You have the right idea.

On that same page, you'll find MaxTRESPerUser, as a QOS parameter.

You can create a QOS with the restrictions you'd like, and then in the 
partition definition, you give it that QOS.  The QOS will then apply 
its restrictions to any jobs that use that partition.
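
As a concrete sketch of that approach (the QOS and partition names here are 
illustrative only, and as discussed later in the thread, the cap applies per 
user across the whole partition, not per node):

sacctmgr add qos mem200 MaxTRESPerUser=mem=200G
# slurm.conf: attach the QOS to the partition, then reconfigure
PartitionName=prod Nodes=node[01-10] QOS=mem200 State=UP
scontrol reconfigure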


Rob

From: Guillaume COCHARD via slurm-users 
Sent: Tuesday, September 24, 2024 9:30 AM
To: slurm-users@lists.schedmd.com 
Subject: [slurm-users] Max TRES per user and node

[slurm-users] Re: Max TRES per user and node

2024-09-25 Thread Groner, Rob via slurm-users
The trick, I think (and Guillaume can certainly correct me) is that the aim is 
to allow the user to run as many (up to) 200G mem jobs as they want...so long 
as they do not consume more than 200G on any single node.  So, they could run 
10 200G jobs...on 10 different nodes.  So the mem limit isn't per user...it's 
per user per node.  I think the qos limit you created below works as an OVERALL 
limit for the user, but doesn't allow a per-node limiting.

Rob



From: Carsten Beyer via slurm-users 
Sent: Wednesday, September 25, 2024 7:27 AM
To: Guillaume COCHARD 
Cc: Slurm User Community List 
Subject: [slurm-users] Re: Max TRES per user and node


Hi Guillaume,


as Rob already mentioned, this could maybe be a way for you (the partition below 
was just created temporarily on the fly for testing). You could also add 
MaxTRES=node=1 for more restrictions. We do something similar with QOS to 
restrict the number of CPUs per user in certain partitions.


sacctmgr create qos name=maxtrespu200G maxtrespu=mem=200G flags=denyonlimit


scontrol create partition=testtres qos=maxtrespu200g maxtime=08:00:00 
nodes=lt[1-10003] DefMemPerCPU=940 MaxMemPerCPU=940 OverSubscribe=NO


That results in:


4 jobs with 100G each:

---
[root@levantetest ~]# squeue
 JOBID PARTITION     NAME USER ST   TIME NODES NODELIST(REASON)
   862  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   861  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   860  testtres hostname  xxx  R   0:15     1 lt1
   859  testtres hostname  xxx  R   0:22     1 lt1


6 jobs with 50G each:

---
[k202068@levantetest ~]$ squeue
 JOBID PARTITION     NAME USER ST   TIME NODES NODELIST(REASON)
   876  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   875  testtres hostname  xxx PD   0:00     1 (QOSMaxMemoryPerUser)
   874  testtres hostname  xxx  R   9:09     1 lt1
   873  testtres hostname  xxx  R   9:15     1 lt1
   872  testtres hostname  xxx  R   9:22     1 lt1
   871  testtres hostname  xxx  R   9:26     1 lt1


Best Regards,
Carsten


--
Carsten Beyer
Abteilung Systeme

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone:  +49 40 460094-221
Fax:+49 40 460094-270
Email:  be...@dkrz.de
URL:http://www.dkrz.de

Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784




Am 24.09.24 um 16:58 schrieb Guillaume COCHARD via slurm-users:
> "So if they submit a 2nd job, that job can start but will have to go onto 
> another node, and will again be restricted to 200G?  So they can start as 
> many jobs as there are nodes, and each job will be restricted to using 1 node 
> and 200G of memory?"

Yes that's it. We already have MaxNodes=1 so a job can't be spread on multiple 
nodes.

To be more precise, the limit should be by user and not by job. To illustrate, 
let's imagine we have 3 empty nodes and a 200G/user/node limit. If a user 
submits 10 jobs each requesting 100G of memory, there should be 2 jobs running 
on each worker and 4 jobs pending.

Guillaume


De: "Groner, Rob" 
À: "Guillaume COCHARD" 

Cc: slurm-users@lists.schedmd.com
Envoyé: Mardi 24 Septembre 2024 16:37:34
Objet: Re: Max TRES per user and node

Ah, sorry, I didn't catch that from your first post (though you did say it).

So, you are trying to limit the user to no more than 200G of memory on a single 
node?  So if they submit a 2nd job, that job can start but will have to go onto 
another node, and will again be restricted to 200G?  So they can start as many 
jobs as there are nodes, and each job will be restricted to using 1 node and 
200G of memory? Or can they submit a job asking for 4 nodes, where they are 
limited to 200G on each node?  Or are they limited to a single node, no matter 
how many jobs?

Rob


From: Guillaume COCHARD 

Sent: Tuesday, September 24, 2024 10:09 AM
To: Groner, Rob 
Cc: slurm-users@lists.schedmd.com 

Subject: Re: Max TRES per user and node

Thank you for your answer.

To test it I tried:
sacctmgr update qos normal set maxtresperuser=cpu=2
# Then in slurm.conf
PartitionName=test […] qos=normal

But then if I submit several 1-cpu jobs only two start and the others stay 
pending, even though I have several nodes available. So it seems that 
MaxTRESPerUser is a QoS-wide limit, and doesn't limit TRES per user and per 
node but rather per user and QoS (or rather partition since I applied the QoS 
on the partition). Did I miss something?

[slurm-users] Re: Max TRES per user and node

2024-09-25 Thread Paul Raines via slurm-users


I am pretty sure there is no way to do exactly a per-user per-node limit
in SLURM.  I cannot think of a good reason why one would do this.  Can
you explain?

I don't see why it matters, if you have two users each submitting two 200G
jobs, whether each user's jobs are spread out over two nodes rather than one
user's jobs both running on one node and the other user's jobs both running
on the other node.

If what you are really trying to limit is the amount of resources SLURM
as a whole uses on a node, so SLURM never uses more than 200G out
of the 400GB on a node (for example), there are definitely ways to do that
using MemSpecLimit on the node.  You can even set aside CPU cores using
CpuSpecList and various cgroups v2 settings at the OS level.

Otherwise there may be a way with some fancy scripting in a Lua job submit
plugin or by playing around with the node_features/helpers plugin.
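
A minimal sketch of that node-level reservation, for a hypothetical 400G node 
on which Slurm should only ever hand out about 200G (MemSpecLimit is given in 
MB and reserves memory for the OS and Slurm daemons rather than for jobs; all 
values are illustrative):

# slurm.conf (hypothetical node)
NodeName=node01 CPUs=64 RealMemory=409600 MemSpecLimit=204800 CoreSpecCount=2 State=UNKNOWN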



-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 25 Sep 2024 9:06am, Groner, Rob via slurm-users wrote:



The trick, I think (and Guillaume can certainly correct me) is that the aim is 
to allow the user to run as many (up to) 200G mem jobs as they want...so long 
as they do not consume more than 200G on any single node.  So, they could run 
10 200G jobs...on 10 different nodes.  So the mem limit isn't per user...it's 
per user per node.  I think the qos limit you created below works as an OVERALL 
limit for the user, but doesn't allow a per-node limiting.

Rob



From: Carsten Beyer via slurm-users 
Sent: Wednesday, September 25, 2024 7:27 AM
To: Guillaume COCHARD 
Cc: Slurm User Community List 
Subject: [slurm-users] Re: Max TRES per user and node


Hi Guillaume,


as Rob already mentioned, this could maybe be a way for you (the partition just 
created temporarily on the fly for testing). You could also add MaxTRES=node=1 
for more restrictions. We do something similar with QOS to restrict the number 
of CPUs per user in certain partitions.


sacctmgr create qos name=maxtrespu200G maxtrespu=mem=200G flags=denyonlimit


scontrol create partition=testtres qos=maxtrespu200g maxtime=08:00:00 
nodes=lt[1-10003] DefMemPerCPU=940 MaxMemPerCPU=940 OverSubscribe=NO


That results in:


4 jobs with 100G each:

---
[root@levantetest ~]# squeue
JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
  862  testtres hostname  xxx PD   0:00  1 
(QOSMaxMemoryPerUser)
  861  testtres hostname  xxx PD   0:00  1 
(QOSMaxMemoryPerUser)
  860  testtres hostname  xxx  R   0:15  1 lt1
  859  testtres hostname  xxx  R   0:22  1 lt1


6 jobs with 50G each:

---
[k202068@levantetest ~]$ squeue
JOBID PARTITION NAME USER ST   TIME  NODES 
NODELIST(REASON)
  876  testtres hostname  xxx PD   0:00  1 
(QOSMaxMemoryPerUser)
  875  testtres hostname  xxx PD   0:00  1 
(QOSMaxMemoryPerUser)
  874  testtres hostname  xxx  R   9:09  1 lt1
  873  testtres hostname  xxx  R   9:15  1 lt1
  872  testtres hostname  xxx  R   9:22  1 lt1
  871  testtres hostname  xxx  R   9:26  1 lt1


Best Regards,
Carsten


--
Carsten Beyer
Abteilung Systeme

Deutsches Klimarechenzentrum GmbH (DKRZ)
Bundesstraße 45a * D-20146 Hamburg * Germany

Phone:  +49 40 460094-221
Fax:+49 40 460094-270
Email:  be...@dkrz.de
URL:    http://www.dkrz.de

Geschäftsführer: Prof. Dr. Thomas Ludwig
Sitz der Gesellschaft: Hamburg
Amtsgericht Hamburg HRB 39784




Am 24.09.24 um 16:58 schrieb Guillaume COCHARD via slurm-users:

"So if they submit a 2nd job, that job can start but will have to go onto another 
node, and will again be restricted to 200G?  So they can start as many jobs as there are 
nodes, and each job will be restricted to using 1 node and 200G of memory?"


Yes that's it. We already have MaxNodes=1 so a job can't be spread on multiple 
nodes.

To be more precise, the limit should be by user and not by job. To illustrate, 
let's imagine we have 3 empty nodes and a 200G/user/node limit. If a user 
submits 10 jobs each requesting 100G of memory, there should be 2 jobs running 
on each worker and 4 jobs pending.

[slurm-users] Re: Max TRES per user and node

2024-09-25 Thread Guillaume COCHARD via slurm-users
Hello,

Thank you all for your answers. 

Carsten, as said by Rob, we need a limit per node, not only per user.

Paul, we know we are asking for something quite unorthodox. The thing is, we 
overbook the memory on our cluster (i.e., if a worker has 200G of memory, Slurm 
can allocate up to 280G on it). In our use case (HTC, with lots of small, 
inefficient jobs), this approach has worked well to improve our cluster usage 
(up to 40% more jobs without adding any hardware!). However, if the cluster is 
somewhat empty and a user submits lots of big, efficient jobs, we can of course 
experience some OOM kills.

So far, the tradeoff has been largely in our favor, so we are okay with this, 
but it would be nice to avoid this situation altogether. Having a maximum TRES 
per user and per node would ensure a good mix of jobs from different users, so 
if some jobs were highly efficient, others would be inefficient enough to 
counterbalance that.
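
For readers wondering how this kind of overbooking is typically implemented: one 
common approach (an assumption about this setup, not something stated in the 
thread) is simply to advertise more memory to Slurm than the node physically 
has, together with config_overrides so slurmd does not drain the node for 
under-reporting memory. A hypothetical sketch:

# slurm.conf (hypothetical worker with ~200G of physical RAM advertised as 280G)
NodeName=worker01 CPUs=64 RealMemory=286720 State=UNKNOWN
SlurmdParameters=config_overrides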

Once again, I know this is sub-optimal and that we should probably educate our 
users so they stop wasting resources, but in the meantime, this approach works 
quite well, so we are looking to improve it until we no longer need it.

Guillaume

- Original Message -
From: "Paul Raines" 
To: "Guillaume COCHARD" 
Cc: "Slurm User Community List" 
Sent: Wednesday, September 25, 2024 15:29:28
Subject: Re: [slurm-users] Re: Max TRES per user and node

I am pretty sure there is no way to do exactly a per user per node limit 
in SLURM.  I cannot think of a good reason why one would do this.  Can
you explain?

I don't see why it matters, if you have two users each submitting two 200G
jobs, whether each user's jobs are spread out over two nodes rather than one
user's jobs both running on one node and the other user's jobs both running
on the other node.

If what you are really trying to limit is the amount of resources SLURM
as a whole uses on a node, so SLURM never uses more than 200G out
of the 400GB on a node (for example), there are definitely ways to do that
using MemSpecLimit on the node.  You can even set aside CPU cores using
CpuSpecList and various cgroups v2 settings at the OS level.

Otherwise there may be a way with some fancy scripting in a Lua job submit
plugin or by playing around with the node_features/helpers plugin.



-- Paul Raines (http://help.nmr.mgh.harvard.edu)



On Wed, 25 Sep 2024 9:06am, Groner, Rob via slurm-users wrote:

>
> The trick, I think (and Guillaume can certainly correct me) is that the aim 
> is to allow the user to run as many (up to) 200G mem jobs as they want...so 
> long as they do not consume more than 200G on any single node.  So, they 
> could run 10 200G jobs...on 10 different nodes.  So the mem limit isn't per 
> user...it's per user per node.  I think the qos limit you created below works 
> as an OVERALL limit for the user, but doesn't allow a per-node limiting.
>
> Rob
>
>
> 
> From: Carsten Beyer via slurm-users 
> Sent: Wednesday, September 25, 2024 7:27 AM
> To: Guillaume COCHARD 
> Cc: Slurm User Community List 
> Subject: [slurm-users] Re: Max TRES per user and node
>
>
> Hi Guillaume,
>
>
> as Rob already mentioned, this could maybe be a way for you (the partition 
> just created temporarily on the fly for testing). You could also add 
> MaxTRES=node=1 for more restrictions. We do something similar with QOS to 
> restrict the number of CPUs per user in certain partitions.
>
>
> sacctmgr create qos name=maxtrespu200G maxtrespu=mem=200G flags=denyonlimit
>
>
> scontrol create partition=testtres qos=maxtrespu200g maxtime=08:00:00 
> nodes=lt[1-10003] DefMemPerCPU=940 MaxMemPerCPU=940 OverSubscribe=NO
>
>
> That results in:
>
>
> 4 jobs with 100G each:
>
> ---
> [root@levantetest ~]# squeue
> JOBID PARTITION NAME USER ST   TIME  NODES 
> NODELIST(REASON)
>   862  testtres hostname  xxx PD   0:00  1 
> (QOSMaxMemoryPerUser)
>   861  testtres hostname  xxx PD   0:00  1 
> (QOSMaxMemoryPerUser)
>   860  testtres hostname  xxx  R   0:15  1 lt1
>   859  testtres hostname  xxx  R   0:22  1 lt1
>
>
> 6 jobs with 50G each:
>
> ---
> [k202068@levantetest ~]$ squeue
> JOBID PARTITION NAME USER ST   TIME  NODES 
> NODELIST(REASON)
>   876  testtres hostname  xxx PD   0:00  1 
> (QOSMaxMemoryPerUser)
>   875  testtres hostname  xxx PD   0:00  1 
> (QOSMaxMemoryPerUser)
>   874  testtres hostname  xxx  R   9:09  1 lt1
>   873  testtres hostname  xxx  R   9:15  1 lt1
>   872  testtres hostname  xxx  R   9:22  1 lt1
>   871  testtres hostname  xxx  R   9:26  1 lt1
>
>
> Best Regards,
> Carsten
>
>
> --
> Carsten Beyer
> Abteilung Systeme
>
> Deutsches Klimarechenzentrum GmbH (DKRZ)
> Bundesstraße 45