I'd be interested in your kludge; we face a similar situation where the slurmctld node has no access to the IPMI network and cannot ssh to the machines that do. We are thinking of creating a REST interface to a control server which would run the IPMI commands.
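Roughly along these lines — a minimal sketch only, assuming the control server can reach the BMCs and has ipmitool installed; the port, the URL scheme, the node-to-BMC mapping, and the credential handling are all placeholders, not a finished design:

#!/usr/bin/env python3
# Sketch of a power-control REST service (hypothetical).
# Runs on a host that *does* reach the IPMI network; the slurmctld
# side would POST to it instead of speaking IPMI directly.
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder mapping from Slurm node names to BMC addresses.
BMC = {"node001": "10.0.0.101", "node002": "10.0.0.102"}

class PowerHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expected path: /power/<node>/<on|off>
        parts = self.path.strip("/").split("/")
        if len(parts) != 3 or parts[0] != "power" or parts[1] not in BMC:
            self.send_error(404)
            return
        node, action = parts[1], parts[2]
        if action not in ("on", "off"):
            self.send_error(400)
            return
        # ipmitool chassis power on/off against the node's BMC;
        # password read from a root-only file via -f.
        result = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", BMC[node], "-U", "admin",
             "-f", "/etc/ipmi-password", "chassis", "power", action],
            capture_output=True, text=True)
        body = json.dumps({"node": node, "action": action,
                           "rc": result.returncode}).encode()
        self.send_response(200 if result.returncode == 0 else 500)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8421), PowerHandler).serve_forever()

A real deployment would of course want TLS and some form of authentication in front of this, since it hands out power control over the whole cluster.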
Ben

On 29-03-2023 14:16, Dr. Thomas Orgis wrote:
On Mon, 27 Mar 2023 13:17:01 +0200, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:

> FYI: Slurm power_save works very well for us without the issues that you describe below. We run Slurm 22.05.8, what's your version?

I'm sure that there are setups where it works nicely ;-) For us, it didn't, and I was faced with either hunting the bug in Slurm or working around it with more control, fixing the underlying issue of the node resume script being called _after_ the job has been allocated to the node. That is too late in case of node bootup failure and causes annoying delays for users, only for them to see their jobs fail.

We run 21.08.8-2, which means any debugging of this on the Slurm side would mean upgrading first (we don't upgrade just for upgrade's sake). And, as I said: the issue of the wrong timing remains unless I attempt deeper changes in Slurm's logic.

The other issue is that we had a kludge in place anyway to enable slurmctld to power on nodes via IPMI. The machine slurmctld runs on has no access to the IPMI network itself, so we had to build a polling communication channel to the node which has this access (and which is on another security layer, hence no ssh into it). For all I know, this communication kludge is not to blame: in the spurious failures, the nodes did boot up just fine and were ready. Only slurmctld decided to let the timeout pass first, and then recognize that the slurmd on the node was there, right that instant.

Did your power up/down script workflow work with earlier Slurm versions, too? Did you use it on bare metal servers or mostly on cloud instances?

Do you see a chance for a) fixing up the internal power-saving logic to allocate nodes to a job only when these nodes are actually present (ideally, with a health check passing), or b) designing an interface between Slurm as manager of available resources and another site-specific service responsible for off-/onlining resources that are known to Slurm, but down/drained?

My view is that Slurm's task is to distribute resources among users. The cluster manager (a person or $MIGHTY_SITE_SPECIFIC_SOFTWARE) decides whether a node is currently available to Slurm or down for maintenance, for example. Power saving would be another reason for a node being taken out of service. Maybe that is an old-fashioned minority view …

Alrighty then,

Thomas

PS: I guess solution a) above goes against Slurm's focus on throughput and on avoiding delays caused by synchronization points, while our idea here is that batch jobs where that matters should be written differently, packing more than a few seconds' worth of work into each step.
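For readers wondering what such a polling channel can look like in practice: below is a minimal sketch under assumptions — a spool directory on a shared filesystem visible from both security layers, one request file per node, and a BMC naming convention derived from the node name. None of this is the actual setup described above; it only illustrates the shape of the kludge.

#!/usr/bin/env python3
# Sketch of a polling bridge: runs on the host that can reach the
# IPMI network, watching a shared spool directory into which the
# slurmctld side drops one file per node it wants powered on.
# All paths and the BMC naming convention are assumptions.
import os
import subprocess
import time

SPOOL = "/shared/power-requests"   # visible to both hosts

while True:
    for name in os.listdir(SPOOL):
        # File name is the Slurm node name; BMC hostname derived from it.
        bmc = name + "-bmc"
        rc = subprocess.run(
            ["ipmitool", "-I", "lanplus", "-H", bmc, "-U", "admin",
             "-f", "/etc/ipmi-password", "chassis", "power", "on"]).returncode
        print(name, "power on, rc =", rc)
        # Remove the request once handled; a real setup would report
        # the result back, e.g. via a result file the other side polls.
        os.remove(os.path.join(SPOOL, name))
    time.sleep(5)   # poll interval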
-- --------------------------------------------------------------------- Dr. B.J.W. Polman, C&CZ, Radboud University Nijmegen. Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands, Phone: +31-24-3653360 e-mail: ben.pol...@science.ru.nl