Hi Davide,
On 10/5/23 15:28, Davide DelVento wrote:
IMHO, "pretending" to power down nodes defies the logic of the Slurm
power_save plugin.
And it is sure useless ;)
But I was using the suggestion from
https://slurm.schedmd.com/power_save.html which says
You can also configure Slurm with programs that perform no action as
*SuspendProgram* and *ResumeProgram* to assess the potential impact of
power saving mode before enabling it.
I had not noticed the above sentence in the power_save manual before! So
I decided to test a "no action" power saving script, similar to what you
have done, applying it to a test partition. I conclude that "no action"
power saving DOES NOT WORK, at least in Slurm 23.02.5. So I opened a bug
report https://bugs.schedmd.com/show_bug.cgi?id=17848 to find out if the
documentation is obsolete, or if there may be a bug. Please follow that
bug to find out the answer from SchedMD.
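
For anyone who wants to reproduce the test, a "no action" program of the
kind the documentation describes could look roughly like the sketch below.
It is only an illustration, not the exact script either of us used: the
function name, the NOOP_POWER_LOG variable and the log path are made up
for this example. The one thing taken from Slurm itself is that slurmctld
passes the affected nodes to SuspendProgram/ResumeProgram as a hostlist
expression in the first argument.

```shell
#!/bin/bash
# Hypothetical "no action" program, usable as both SuspendProgram and
# ResumeProgram. It only records the request and never touches the node.

noop_power_action() {
    # Assumed log location; override with NOOP_POWER_LOG when testing.
    local logfile="${NOOP_POWER_LOG:-/var/log/slurm/power_save_noop.log}"
    # $1 is the hostlist expression handed over by slurmctld.
    echo "$(date '+%Y-%m-%dT%H:%M:%S') no-op power request for nodes: $1" >> "$logfile"
}

# Only act when slurmctld (or a tester) supplies a node list.
if [ "$#" -gt 0 ]; then
    noop_power_action "$1"
fi
```

Since such a script leaves slurmd running on the "suspended" nodes, this
is exactly the setup that did not behave as documented in my test.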
What I *believe* (though I'm not 100% certain) really happens with power
saving in current Slurm versions is what I wrote yesterday:
Slurmctld expects suspended nodes to *really* power down, so that slurmd
is stopped. When slurmctld resumes a suspended node, it expects slurmd to
start up once the node is powered on. The ResumeTimeout parameter limits
how long slurmctld waits for that to happen. I've set it to about 15-30
minutes to allow for delays due to BIOS updates and the like. The default
of 60 seconds is WAY too small!
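
As a rough illustration, the relevant part of slurm.conf could look like
the excerpt below. The parameter names are the standard power saving
options, but the script paths and all values except the ResumeTimeout
range mentioned above are assumptions to adapt to your own site.

```
# slurm.conf excerpt (sketch only)
SuspendProgram=/usr/local/sbin/node_suspend  # hypothetical script that really powers nodes off
ResumeProgram=/usr/local/sbin/node_resume    # hypothetical script that powers nodes on again
SuspendTime=1800      # idle seconds before a node is suspended (assumed value)
SuspendTimeout=120    # seconds allowed for a node to power down (assumed value)
ResumeTimeout=1800    # 30 minutes instead of the 60-second default
```

After changing these values, slurmctld has to re-read its configuration
(for example with "scontrol reconfigure") before they take effect.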
I hope this helps,
Ole