[slurm-users] Re: Why is my job killed when ResumeTimeout is reached instead of it being requeued?

2024-12-09 Thread Xaver Stiensmeier via slurm-users
Dear Slurm-user list, Sadly, my question got no answers. If the question is unclear and you have ideas on how I can improve it, please let me know. We will soon try to update Slurm to see whether the unwanted behavior disappears with the update. Best regards, Xaver Stiensmeier On 11/18/24 12:03, Xaver Stiens

[slurm-users] Why is my job killed when ResumeTimeout is reached instead of it being requeued?

2024-11-18 Thread Xaver Stiensmeier via slurm-users
Dear Slurm-user list, when a job fails because the node startup fails (cloud scheduling), the job should be re-queued, according to the ResumeTimeout documentation: Maximum time permitted (in seconds) between when a node resume request is issued and when the node is actually available for use. Nodes which fail to
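
For reference, a minimal slurm.conf sketch of the power-saving settings involved here (the parameter names are real, but the values and script paths are only illustrative):

    ResumeProgram=/opt/slurm/resume.sh      # creates the cloud instance(s)
    SuspendProgram=/opt/slurm/suspend.sh    # destroys idle cloud instance(s)
    ResumeTimeout=600        # seconds a node may take to become available
    SuspendTime=300          # idle seconds before a node is powered down
    ReturnToService=2        # let a DOWN node return once it registers again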

[slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-15 Thread Xaver Stiensmeier via slurm-users
.de> | www.igd.fraunhofer.de From: Xaver Stiensmeier via slurm-users Sent: Friday, 15 November 2024 14:03 To: slurm-users@lists.schedmd.com Subject: [slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatical

[slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-15 Thread Xaver Stiensmeier via slurm-users
sers: Hi Xaver, On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote: I would like to start up all ~idle (idle and powered down) nodes and check programmatically if all came up as expected. For context: this is for a program that sets up Slurm clusters with on-demand cloud scheduling.

[slurm-users] Re: How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-14 Thread Xaver Stiensmeier via slurm-users
program. Of course this includes checking whether the nodes power up. Best regards, Xaver On 14/11/2024 at 14:57, Ole Holm Nielsen via slurm-users wrote: Hi Xaver, On 11/14/24 12:59, Xaver Stiensmeier via slurm-users wrote: I would like to start up all ~idle (idle and powered down) nodes and

[slurm-users] How to power up all ~idle nodes and verify that they have started up without issue programmatically

2024-11-14 Thread Xaver Stiensmeier via slurm-users
Dear Slurm User list, I would like to start up all ~idle (idle and powered down) nodes and check programmatically if all came up as expected. For context: this is for a program that sets up Slurm clusters with on-demand cloud scheduling. In the simplest fashion this could be executing a comma
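
A rough sketch of how this could look with plain Slurm commands (the node range is a placeholder for the cluster's own names):

    # Ask Slurm to power up the powered-down nodes; ~idle nodes are resumed
    # through the configured ResumeProgram.
    scontrol update NodeName=worker[001-064] State=POWER_UP

    # Then poll the node states until nothing is left powering up, and check
    # that every node ends in a usable state (idle/mix/alloc) rather than down.
    sinfo -N -h -o '%N %t'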

[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-09-20 Thread Xaver Stiensmeier via slurm-users
Hey Nate, we actually fixed our underlying issue that caused the NOT_RESPONDING flag - on failures we terminated the node manually ourselves instead of letting Slurm call the terminate script. That led to Slurm believing the node should still be there when it had already been terminated. Therefore,

[slurm-users] Re: How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Xaver Stiensmeier via slurm-users
Thanks Steffen, that makes a lot of sense. I will just not start slurmd in the master ansible role when the master is not to be used for computing. Best regards, Xaver On 24.06.24 14:23, Steffen Grunewald via slurm-users wrote: On Mon, 2024-06-24 at 13:54:43 +0200, Slurm users wrote: Dear Slu

[slurm-users] How to exclude master from computing? Set to DRAINED?

2024-06-24 Thread Xaver Stiensmeier via slurm-users
Dear Slurm users, in our project we exclude the master from computing before starting slurmctld. We used to exclude the master from computing by simply not mentioning it in the configuration, i.e. just not having: PartitionName=SomePartition Nodes=master or something similar. Apparently, thi
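
As an illustration, one common way this is handled in slurm.conf (node names and sizes are placeholders, not the project's real configuration):

    # Keep the head node out of every partition so nothing is scheduled on it:
    NodeName=worker[001-064] CPUs=8 State=CLOUD
    PartitionName=SomePartition Nodes=worker[001-064] Default=YES State=UP
    # Alternatively, define the master but keep it drained:
    # NodeName=master CPUs=4 State=DRAIN Reason=no-compute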

[slurm-users] Slurm.conf and workers

2024-04-15 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, as far as I understood it, the slurm.conf needs to be present on the master and on the workers at the default slurm.conf location (if no other path is set via SLURM_CONF). However, I noticed that when adding a partition only in the master's slurm.conf, all workers were able to "correctly" show t
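
A sketch of the two usual approaches (assuming a reasonably recent Slurm release): either keep an identical slurm.conf on every host, or serve it centrally with configless mode so the workers fetch it from the controller:

    # slurm.conf on the controller:
    SlurmctldParameters=enable_configless
    # slurmd on a worker then only needs the controller address:
    #   slurmd --conf-server master:6817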

[slurm-users] Re: Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-09 Thread Xaver Stiensmeier via slurm-users
any hot-fix/updates from the base image or changes. By running it from the node, it would alleviate any CPU spikes on the Slurm head node. Just a possible path to look at. Brian Andrus On 4/8/2024 6:10 AM, Xaver Stiensmeier via slurm-users wrote: Dear slurm user list, we make use of elast

[slurm-users] Elastic Computing: Is it possible to incentivize grouping power_up calls?

2024-04-08 Thread Xaver Stiensmeier via slurm-users
Dear slurm user list, we make use of elastic cloud computing, i.e. node instances are created on demand and are destroyed when they are not used for a certain amount of time. Created instances are set up via Ansible. If more than one instance is requested at the exact same time, Slurm will pass th
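
A minimal ResumeProgram sketch that handles a whole batch in one call; the playbook name and the overall approach are assumptions for illustration, not taken from the thread:

    #!/bin/bash
    # Slurm passes the nodes to resume as one hostlist expression in $1,
    # e.g. "worker[001-010]". Expanding it once allows a single Ansible run
    # per batch instead of one run per node.
    NODES=$(scontrol show hostnames "$1" | paste -sd, -)
    ansible-playbook -i "${NODES}," site.yml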

[slurm-users] Re: Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-29 Thread Xaver Stiensmeier via slurm-users
I am wondering why my question (below) didn't catch anyone's attention - just as feedback for me: is it unclear where my problem lies, or is it clear but no solution is known? I looked through the documentation and have now searched the Slurm repository, but am still unable to clearly identify how to

[slurm-users] Can't schedule on cloud node: State=IDLE+CLOUD+POWERED_DOWN+NOT_RESPONDING

2024-02-23 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I have a cloud node that is powered up and down on demand. In rare cases it can happen that Slurm's ResumeTimeout is reached and the node is therefore powered down. We have set ReturnToService=2 in order to avoid the node being marked down, because the instance behind that node is
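
For illustration, the manual way to push such a node back into the normal pool so Slurm will attempt a fresh power-up (the node name is a placeholder):

    scontrol update NodeName=worker001 State=POWER_DOWN_FORCE
    # or, to clear a DOWN/DRAIN flag without touching the power state:
    scontrol update NodeName=worker001 State=RESUME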

[slurm-users] Slurm Power Saving Guide: Why doesn't Slurm mark the node as failed when ResumeProgram returns != 0

2024-02-19 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I had cases where our ResumeProgram failed due to temporary cloud timeouts. In those cases the ResumeProgram returns a non-zero value. Why does Slurm still wait until ResumeTimeout instead of just treating the startup as failed, which should then lead to a rescheduling of the job
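
One possible workaround, sketched here as an assumption rather than documented ResumeProgram behavior: have the resume script mark the node DOWN itself when the cloud call fails, so slurmctld reacts immediately instead of waiting for ResumeTimeout (the start_instances helper is hypothetical):

    #!/bin/bash
    # $1 is the hostlist of nodes Slurm wants powered up.
    if ! /usr/local/bin/start_instances "$1"; then
        # Hypothetical failure path: tell slurmctld right away.
        scontrol update NodeName="$1" State=DOWN Reason="cloud resume failed"
        exit 1
    fi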

[slurm-users] Re: Errors upgrading to 23.11.0 -- jwt-secret.key

2024-02-08 Thread Xaver Stiensmeier via slurm-users
Thank you for your response. I have found out why there was no error in the log: I had been looking at the wrong log. The error didn't occur on the master, but on our vpn-gateway (it is a hybrid cloud setup) - but you can think of it as just another worker in the same network. The error I get

[slurm-users] Errors upgrading to 23.11.0

2024-02-07 Thread Xaver Stiensmeier via slurm-users
Dear slurm-user list, I got this error: Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code. See "systemctl status slurmctld.service" and "journalctl -xeu slurmctld.service" for details. but in slurmctld.service I see nothi
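
For context, the generic ways to dig further when the unit fails but the service log looks empty (standard systemd/Slurm tooling, not specific to this setup):

    systemctl status slurmctld.service
    journalctl -xeu slurmctld.service
    # run the daemon in the foreground with high verbosity to surface the real error:
    slurmctld -D -vvvv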