Forgot to mention: this is with Slurm 23.02.6 (apologies for the double
message).
On Mon, Dec 11, 2023 at 9:49 AM Davide DelVento wrote:
> Following the example from https://slurm.schedmd.com/power_save.html
> regarding SuspendExcNodes
>
> I configured my slurm.conf with
>
> SuspendExcNodes=node[
In case it's useful to others: I've been able to get this working by having
the "no action" script stop the slurmd daemon and start it *with the -b
option*.
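For concreteness, the script is roughly along these lines (a sketch only:
the systemd unit name and the slurmd path are assumptions, adjust for your
own setup):

#!/bin/bash
# "no action" power-save script: restart slurmd so it re-registers with
# slurmctld as if the node had just rebooted (that is what -b reports)
systemctl stop slurmd        # assumes slurmd runs under systemd
/usr/sbin/slurmd -b          # path is an assumption; adjust as needed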
On Fri, Oct 6, 2023 at 4:28 AM Ole Holm Nielsen wrote:
> Hi Davide,
>
> On 10/5/23 15:28, Davide DelVento wrote:
> > IMHO, "pretending"
Following the example from https://slurm.schedmd.com/power_save.html
regarding SuspendExcNodes, I configured my slurm.conf with:
SuspendExcNodes=node[01-12]:2,node[13-32]:2,node[33-34]:1,nodegpu[01-02]:1
SuspendExcStates=down,drain,fail,maint,not_responding,reserved
#SuspendExcParts=
(the nodes in
We've been running for years without swap on and have had no issues. You may
want to set MemSpecLimit in your config to reserve memory for the OS, so
that user jobs don't OOM the system:
https://slurm.schedmd.com/slurm.conf.html#OPT_MemSpecLimit
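For example, something along these lines on the node definition (the numbers
are just placeholders; MemSpecLimit is in MB, and as I understand it the
reservation is only enforced when cgroup memory constraints are enabled):

NodeName=node[01-32] RealMemory=192000 MemSpecLimit=8192
# and in cgroup.conf, so the reservation is actually enforced:
ConstrainRAMSpace=yes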
-Paul Edmon-
On 12/11/2023 11:19 AM, Davide DelVento wrote:
By getting "stuck" do you mean the job stays PENDING forever or does
eventually run? I've seen the latter (and I agree with you that I wish
Slurm will log things like "I looked at this job and I am not starting it
yet because") but not the former
On Fri, Dec 8, 2023 at 9:00 AM Pacey, Mike wrote:
A little late here, but yes, everything Hans said is correct, and if you are
worried about Slurm (or other critical system software) getting killed by
the OOM killer, you can work around it by properly configuring cgroups.
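Roughly the pieces involved, as a sketch (parameter names are from the stock
cgroup.conf and slurm.conf man pages; whether this alone is enough to protect
slurmd depends on the rest of your setup):

# cgroup.conf
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

# slurm.conf -- make sure jobs are actually tracked/confined via cgroups
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

Together with the MemSpecLimit suggestion earlier in the thread, this keeps
user job memory inside the job's cgroup rather than letting it push system
daemons into the OOM killer.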
On Wed, Dec 6, 2023 at 2:06 AM Hans van Schoot wrote:
> Hi Joseph,
>
> This might depend