Dear Slurm-user list,
Sadly, my question got no answers. If the question is unclear and you
have ideas on how I can improve it, please let me know. We will soon try
to update Slurm to see if the unwanted behavior disappears with the update.
Best regards,
Xaver Stiensmeier
On 11/18/24 12:03, Xaver Stiensmeier wrote:
Dear Slurm-user list,
when a job fails because the node startup fails (cloud scheduling), the
job should be re-queued:
ResumeTimeout
    Maximum time permitted (in seconds) between when a node resume
    request is issued and when the node is actually available for use.
    Nodes which fail to
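For reference, the slurm.conf pieces involved in this kind of power-save/cloud
setup look roughly like the sketch below; the values and script paths are
illustrative placeholders only, not actual settings:

    ResumeProgram=/opt/slurm/resume.sh    # placeholder: script that creates the cloud instance(s)
    SuspendProgram=/opt/slurm/suspend.sh  # placeholder: script that terminates idle instances
    ResumeTimeout=600                     # seconds a node may take to come up after a resume request
    SuspendTime=300                       # idle seconds before a node is powered down
    ReturnToService=2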
*From:* Xaver Stiensmeier via slurm-users
*Sent:* Friday, 15 November 2024 14:03
*To:* slurm-users@lists.schedmd.com
*Subject:* [slurm-users] Re: How to power up all ~idle nodes and
verify that they have started up without issue programmatically
Hi Xaver,
On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:
I would like to start up all ~idle (idle and powered down) nodes and
check programmatically if all came up as expected. For context: this
is for a program that sets up Slurm clusters with on-demand cloud
scheduling.
program. Of course this includes checking whether the nodes power up.
Best regards,
Xaver
On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote:
Hi Xaver,
On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote:
I would like to start up all ~idle (idle and powered down) nodes and
Dear Slurm user list,
I would like to start up all ~idle (idle and powered down) nodes and
check programmatically if all came up as expected. For context: this is
for a program that sets up Slurm clusters with on-demand cloud scheduling.
In the simplest fashion this could be executing a command
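For example (a sketch only; it assumes plain sinfo/scontrol output, where
"idle~" marks an idle, powered-down node and "#" marks a node still powering up):

    # collect all idle, powered-down nodes and request power-up
    nodes=$(sinfo -h -N -o "%N %t" | awk '$2 == "idle~" {print $1}' | sort -u | paste -sd, -)
    [ -n "$nodes" ] && scontrol update NodeName="$nodes" State=POWER_UP
    # then poll the state column until no node carries the "#" suffix anymore;
    # anything that ends up down/drained did not come up cleanly (%E shows the reason)
    sinfo -N -o "%N %t %E"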
Hey Nate,
we actually fixed our underlying issue that caused the NOT_RESPONDING
flag: on failures we automatically terminated the node manually instead
of letting Slurm call the terminate script. That led to Slurm believing
the node should still be there when it had already been terminated.
Therefore,
Thanks Steffen,
that makes a lot of sense. I will just not start slurmd in the master
Ansible role when the master is not to be used for computing.
Best regards,
Xaver
On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote:
On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote:
Dear Slurm users,
Dear Slurm users,
in our project we exclude the master from computing before starting
slurmctld. We used to do this by simply not mentioning the master in the
configuration, i.e. just not having:
PartitionName=SomePartition Nodes=master
or something similar. Apparently, thi
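Spelled out, "not mentioning it" amounts to a configuration roughly like the
following; node names and sizes are placeholders for illustration only:

    NodeName=worker[001-010] CPUs=8 RealMemory=16000 State=CLOUD
    PartitionName=SomePartition Nodes=worker[001-010] Default=YES MaxTime=INFINITE
    # the master appears in no NodeName/PartitionName line and runs no slurmd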
Dear slurm-user list,
as far as I understood it, the slurm.conf needs to be present on the
master and on the workers at its default location (if no other path is
set via SLURM_CONF). However, I noticed that when adding a partition only
in the master's slurm.conf, all workers were able to "correctly" show t
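As a side note, a quick way to check whether master and workers really share
the same file is to compare hashes; pdsh and the /etc/slurm path below are just
one option, any remote-execution tool and whatever path the installation uses
will do:

    md5sum /etc/slurm/slurm.conf                          # on the master
    pdsh -w worker[001-010] md5sum /etc/slurm/slurm.conf  # on the workers
    # slurmctld also tends to log a warning when a node registers with a
    # slurm.conf that differs from its own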
any hot-fixes/updates
from the base image or changes. By running it from the node, it would
alleviate any CPU spikes on the Slurm head node.
Just a possible path to look at.
Brian Andrus
On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote:
Dear slurm user list,
we make use of elast
Dear slurm user list,
we make use of elastic cloud computing, i.e. node instances are created
on demand and are destroyed when they are not used for a certain amount
of time. Created instances are set up via Ansible. If more than one
instance is requested at the exact same time, Slurm will pass the
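(For context: the nodes requested at the same time arrive at the ResumeProgram
as a single hostlist expression. A minimal skeleton that expands it could look
like this, with the log path and the actual provisioning step as placeholders:)

    #!/bin/bash
    # ResumeProgram skeleton (sketch only); Slurm passes e.g. "worker[003-005]" as $1
    for host in $(scontrol show hostnames "$1"); do
        # placeholder: create the cloud instance for "$host" and run Ansible against it
        echo "$(date) resume requested for $host" >> /var/log/slurm/resume.log
    done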
I am wondering why my question (below) didn't catch anyone's attention.
Just as feedback for me: is it unclear where my problem lies, or is it
clear but no solution is known? I looked through the documentation and
now searched the Slurm repository, but am still unable to clearly
identify how to
Dear slurm-user list,
I have a cloud node that is powered up and down on demand. Rarely, it can
happen that Slurm's ResumeTimeout is reached and the node is therefore
powered down. We have set ReturnToService=2 in order to avoid the node
being marked down, because the instance behind that node is
Dear slurm-user list,
I had cases where our ResumeProgram failed due to temporary cloud
timeouts. In that case the ResumeProgram returns a non-zero value. Why does
Slurm still wait until ResumeTimeout instead of just accepting the
startup as failed, which should then lead to a rescheduling of the job?
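One possible workaround, sketched here only and assuming it fits the setup:
have the resume script mark the node DOWN itself when instance creation fails,
rather than relying on the exit code, so slurmctld does not have to sit out the
full ResumeTimeout (what happens to the job afterwards still depends on the
JobRequeue/--requeue settings):

    #!/bin/bash
    # sketch: resume wrapper that reports failure via node state, not exit code
    # (assumes a single node for brevity; a real script would expand the hostlist)
    node="$1"
    if ! /opt/cloud/create_instance.sh "$node"; then    # placeholder provisioning call
        scontrol update NodeName="$node" State=DOWN Reason="cloud resume failed"
        exit 1
    fi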
Thank you for your response.
I have found out why there was no error in the log: I've been
looking at the wrong log. The error didn't occur on the master, but on
our vpn-gateway (it is a hybrid cloud setup), but you can think of it as
just another worker in the same network. The error I get
Dear slurm-user list,
I got this error:
Unable to start service slurmctld: Job for slurmctld.service failed
because the control process exited with error code. See "systemctl
status slurmctld.service" and "journalctl -xeu slurmctld.service" for
details.
but in slurmctld.service I see nothing
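When the unit log shows nothing useful, running the daemon in the foreground
with extra verbosity usually surfaces the actual reason; the SlurmUser name
"slurm" below is an assumption:

    journalctl -xeu slurmctld.service    # as the systemd message already suggests
    sudo -u slurm slurmctld -D -vvv      # foreground run, errors go to stdout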